sim · medium · robotics · metric · varies

Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models

Description

Vision-Language-Action (VLA) models combine perception, language, and motor control in a single architecture, yet how they translate multimodal inputs into actions remains poorly understood. We apply activation injection, sparse autoencoders (SAEs), and linear probes to six models spanning 80M–7B parameters across 394,000+ rollout episodes on four benchmarks. The visual pathway dominates action generation across all architectures: injecting baseline activations into null-prompt episodes recover
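
To make the idea of activation injection concrete, here is a minimal toy sketch in plain Python. It is not the paper's code or any of its models: the two-layer "policy", its weights, and the names `forward`, `inject_units`, and `recovery` are all hypothetical and chosen only for illustration. The sketch caches hidden activations from a baseline run (with a prompt), overwrites selected hidden units in a null-prompt run, and reports how much of the baseline action is recovered.

```python
# Toy activation-injection sketch (hypothetical model, not the paper's setup).

# Hypothetical weights: 4 inputs (2 vision + 2 language) -> 3 hidden -> 1 action.
W_IN = [
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.5],
    [0.5, 0.5, 1.0, 0.0],
]
W_OUT = [[1.0, 1.0, 1.0]]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def forward(vision, language, cached_hidden=None, inject_units=()):
    """Run the toy policy; optionally overwrite chosen hidden units
    with values cached from another run. Returns (hidden, action)."""
    hidden = [max(0.0, x) for x in matvec(W_IN, vision + language)]  # ReLU
    for i in inject_units:
        hidden[i] = cached_hidden[i]
    return hidden, matvec(W_OUT, hidden)


def recovery(patched, baseline, null):
    """Fraction of the baseline-vs-null action gap closed by injection."""
    gap = sum(abs(n - b) for n, b in zip(null, baseline))
    left = sum(abs(p - b) for p, b in zip(patched, baseline))
    return 1.0 - left / gap


vision = [1.0, 0.5]
base_hidden, base_action = forward(vision, [1.0, 0.0])   # with prompt
_, null_action = forward(vision, [0.0, 0.0])             # null prompt

# Inject one hidden unit, then all of them, into the null-prompt run.
_, one_unit_action = forward(vision, [0.0, 0.0], base_hidden, inject_units=(2,))
_, all_units_action = forward(vision, [0.0, 0.0], base_hidden, inject_units=(0, 1, 2))

partial = recovery(one_unit_action, base_action, null_action)
full = recovery(all_units_action, base_action, null_action)
```

In this toy, injecting every hidden unit reproduces the baseline action exactly, while injecting a single unit closes only part of the gap. In a real model the same pattern would typically be implemented with framework-level hooks on a chosen layer rather than an explicit argument.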

Source

http://arxiv.org/abs/2603.19233v1