sim · medium · robotics · metric: varies

Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment

Description

Achieving robot transparency is a critical step toward effective human-robot collaboration. To be transparent, a robot's natural language communication must be consistent with its actions and explicitly grounded in the task and environment. Existing hierarchical Vision-Language-Action (VLA) models can generate language (e.g., through chain-of-thought) and low-level actions. However, current work does not consider explicit alignment between these modalities during training. To address this crucia

Source

http://arxiv.org/abs/2604.05614v1