
The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling

Description

Scaling Vision-Language-Action (VLA) models by upgrading the vision encoder is expected to improve downstream manipulation performance, as it does in vision-language modeling. We show that this expectation fails when actions are represented as discrete tokens, and we explain why through an information-theoretic principle we call the Compression Gap: in any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions are continuous (e.g., Dif
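The bottleneck the abstract describes can be made concrete with a toy action tokenizer. The sketch below (assumptions, not the paper's setup: uniform binning, a 256-token vocabulary, and actions clipped to [-1, 1]) shows that once actions are discretized, the pipeline carries at most log2(n_bins) = 8 bits per action dimension, no matter how capable the upstream vision encoder is:

```python
import numpy as np

def tokenize(actions, low=-1.0, high=1.0, n_bins=256):
    """Discretize continuous actions into integer tokens via uniform binning."""
    clipped = np.clip(actions, low, high)
    bins = np.floor((clipped - low) / (high - low) * n_bins).astype(int)
    return np.minimum(bins, n_bins - 1)  # keep the upper edge in range

def detokenize(tokens, low=-1.0, high=1.0, n_bins=256):
    """Map tokens back to the centers of their bins."""
    return low + (tokens + 0.5) * (high - low) / n_bins

rng = np.random.default_rng(0)
actions = rng.uniform(-1.0, 1.0, size=100_000)
recon = detokenize(tokenize(actions))

# Quantization error of a uniform tokenizer: MSE ~= step^2 / 12,
# a floor that scaling the vision encoder cannot remove.
step = 2.0 / 256
mse = np.mean((actions - recon) ** 2)
print(f"quantization MSE: {mse:.2e} (theory: {step**2 / 12:.2e})")
```

The irreducible reconstruction error here is the bottleneck: it sits at the action head, so improvements upstream of it cannot pass through.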

Source

http://arxiv.org/abs/2604.03191v1