sim · medium · vision-robot · metric · varies

When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs

Description

Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice they often fail to follow language faithfully. When presented with instructions that lack strong scene-specific supervision, VLAs suffer from counterfactual failures: they act on vision shortcuts induced by dataset biases, repeatedly executing well-learned behaviors and selecting objects frequently seen during training, regardless of language intent. To systematically study these failures, we intro

Source

http://arxiv.org/abs/2602.17659v1