sim · medium · manipulation · metric: varies
IVRA: Improving Visual-Token Relations for Robot Action Policy with Training-Free Hint-Based Guidance
Description
Many Vision-Language-Action (VLA) models flatten image patches into a 1D token sequence, weakening the 2D spatial cues needed for precise manipulation. We introduce IVRA, a lightweight, training-free method that improves spatial understanding by exploiting affinity hints already available in the model's built-in vision encoder, without requiring any external encoder or retraining. IVRA selectively injects these affinity signals into a language-model layer in which instance-level features reside.
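As a rough illustration of the idea, the sketch below computes a patch-to-patch affinity matrix from vision-encoder features and uses it to blend the visual-token hidden states at one language-model layer. The description does not specify how IVRA derives or injects its affinity hints, so everything here is an assumption made for illustration: the cosine-similarity affinity, the softmax-weighted pooling, and the hypothetical names `patch_affinity`, `inject_affinity`, `alpha`, and `temperature`. Random arrays stand in for real encoder and language-model activations.

```python
import numpy as np


def patch_affinity(feats: np.ndarray) -> np.ndarray:
    """Cosine-similarity affinity between patch features; returns shape (N, N).

    Assumption: cosine similarity is used here only as a simple stand-in
    for whatever affinity signal the vision encoder actually provides.
    """
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.clip(norms, 1e-8, None)
    return unit @ unit.T


def inject_affinity(hidden: np.ndarray, affinity: np.ndarray,
                    alpha: float = 0.5, temperature: float = 0.1) -> np.ndarray:
    """Blend each visual token with an affinity-weighted average of all visual tokens.

    `alpha` and `temperature` are hypothetical knobs, not parameters from the paper.
    """
    logits = affinity / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    pooled = weights @ hidden  # (N, N) @ (N, D) -> (N, D)
    return (1.0 - alpha) * hidden + alpha * pooled


# Placeholder activations for a 4x4 patch grid flattened to 16 tokens.
rng = np.random.default_rng(0)
grid_h, grid_w, enc_dim, lm_dim = 4, 4, 8, 16
encoder_feats = rng.normal(size=(grid_h * grid_w, enc_dim))  # vision-encoder patch features
lm_hidden = rng.normal(size=(grid_h * grid_w, lm_dim))       # visual-token states at one LM layer

affinity = patch_affinity(encoder_feats)
mixed = inject_affinity(lm_hidden, affinity)
print(affinity.shape, mixed.shape)
```

No weights are updated in this sketch, which mirrors the training-free framing: the affinity comes from features the encoder already produces, and the blend is applied only at inference time. Setting `alpha` to 0 leaves the hidden states unchanged.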