sim · medium · manipulation-data · metric · varies
LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries
Description
Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information C