sim · medium · robotics · metric · varies
Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment
Description
Achieving robot transparency is a critical step toward effective human-robot collaboration. To be transparent, a robot's natural language communication must be consistent with its actions and explicitly grounded in the task and environment. Existing hierarchical Vision-Language-Action (VLA) models can generate language (e.g., through chain-of-thought) and low-level actions. However, current work does not consider explicit alignment between these modalities during training. To address this crucia