sim · medium · humanoid · metric: varies
VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting
Description
Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm: they cannot see, hear, speak, and act concurrently, nor can they handle real-time user interruptions dynamically. This hinders seamless embodied collaboration and makes the user experience inflexible and unresponsive. To address these limitations, we introduce VITA-E, a novel embodied interaction framework designed for both behavioral concurrency and nearly real-time interrup