sim · medium · robotics · metric: varies

Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models

Description

Vision-Language-Action (VLA) models have emerged as a promising approach for general-purpose robot manipulation. However, their generalization is inconsistent: while these models can perform impressively in some settings, fine-tuned variants often fail on novel objects, scenes, and instructions. We apply mechanistic interpretability techniques to better understand the inner workings of VLA models. To probe internal representations, we train Sparse Autoencoders (SAEs) on hidden-layer activations.
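To make the SAE probing step concrete, here is a minimal sketch of a sparse autoencoder forward pass over a batch of hidden activations. The dimensions, initialization, and L1 sparsity coefficient are illustrative assumptions, not the paper's actual configuration; NumPy stands in for whatever framework the authors use.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # hypothetical model and dictionary widths

# Randomly initialized SAE parameters (encoder and decoder).
W_enc = rng.normal(0.0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0.0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into sparse features, then reconstruct."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU keeps features non-negative and sparse
    x_hat = f @ W_dec + b_dec               # linear decoder reconstructs the input
    return f, x_hat

# Stand-in for a batch of hidden-layer activations from a VLA model.
x = rng.normal(size=(8, d_model))
f, x_hat = sae_forward(x)

# Training would minimize reconstruction error plus an L1 sparsity penalty.
recon_loss = np.mean((x - x_hat) ** 2)
l1_penalty = np.abs(f).mean()
loss = recon_loss + 1e-3 * l1_penalty
```

Interpretable features then correspond to individual dictionary columns of `W_dec` whose activations `f[:, i]` fire on semantically coherent inputs; steering amounts to adding such a column back into the residual stream, scaled by a chosen coefficient.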

Source

http://arxiv.org/abs/2603.19183v1