simmediumhumanoidmetric · varies

VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting

Description

Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm, which lacks the ability to see, hear, speak, and act concurrently as well as handle real-time user interruptions dynamically. This hinders seamless embodied collaboration, resulting in an inflexible and unresponsive user experience. To address these limitations, we introduce VITA-E, a novel embodied interaction framework designed for both behavioral concurrency and nearly real-time interrup

Source

http://arxiv.org/abs/2510.21817v1