sim · medium · vision-robot · metric: varies
When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs
Description
Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice they often fail to follow those instructions faithfully. When presented with instructions that lack strong scene-specific supervision, VLAs suffer from counterfactual failures: they act on vision shortcuts induced by dataset biases, repeatedly executing well-learned behaviors and selecting objects frequently seen during training regardless of the language intent. To study these failures systematically, we intro