sim · medium · manipulation-data · metric · varies

LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

Description

Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information C
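The vanishing conditional mutual information described above can be illustrated with a toy calculation. The sketch below is not the paper's method: it is a minimal example, assuming small discrete variables `v` (visual observation), `l` (language instruction), and `a` (action), with hypothetical names and hand-built distributions. It computes I(L; A | V) for a dataset where each scene always comes with the same instruction, and for one where a single scene is paired with two different instructions.

```python
import math
from collections import defaultdict


def conditional_mutual_information(joint):
    """Return I(L; A | V) in bits for a joint pmf {(v, l, a): prob}."""
    p_v = defaultdict(float)
    p_vl = defaultdict(float)
    p_va = defaultdict(float)
    # Marginalise the joint pmf down to p(v), p(v, l) and p(v, a).
    for (v, l, a), p in joint.items():
        p_v[v] += p
        p_vl[(v, l)] += p
        p_va[(v, a)] += p
    cmi = 0.0
    for (v, l, a), p in joint.items():
        if p > 0:
            # p(l, a | v) / (p(l | v) p(a | v)) rewritten with joint terms.
            ratio = p * p_v[v] / (p_vl[(v, l)] * p_va[(v, a)])
            cmi += p * math.log2(ratio)
    return cmi


# Hypothetical biased dataset: each scene fixes both instruction and action,
# so the instruction is fully predictable from the observation alone.
biased = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

# Hypothetical diverse dataset: one scene, two instructions, and the action
# follows the instruction, so language carries information vision does not.
diverse = {(0, 0, 0): 0.5, (0, 1, 1): 0.5}

biased_cmi = conditional_mutual_information(biased)
diverse_cmi = conditional_mutual_information(diverse)
print(f"biased:  I(L; A | V) = {biased_cmi:.3f} bits")
print(f"diverse: I(L; A | V) = {diverse_cmi:.3f} bits")
```

In the biased case the quantity is zero because I(L; A | V) is bounded above by H(L | V), which is zero when the observation determines the instruction. In the diverse case the instruction is the only thing that distinguishes the two actions, so it contributes a full bit.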

Source

http://arxiv.org/abs/2601.15197v5