sim · medium · robotics · metric: varies
The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling
Description
Scaling Vision-Language-Action (VLA) models by upgrading the vision encoder is expected to improve downstream manipulation performance, as it does in vision-language modeling. We show that this expectation fails when actions are represented as discrete tokens, and we explain why through an information-theoretic principle we call the Compression Gap: in any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions are continuous (e.g., Dif