sim · medium · manipulation · metric: varies

Learning to Accelerate Vision-Language-Action Models through Adaptive Visual Token Caching

Description

Vision-Language-Action (VLA) models have demonstrated remarkable generalization capabilities in robotic manipulation tasks, yet their substantial computational overhead remains a critical obstacle to real-world deployment. Improving inference efficiency is therefore essential for practical robotic applications. Existing acceleration methods often rely on heuristic or static strategies, such as rule-based token caching or pruning, that are decoupled from task objectives and fail to adapt to dynamically changing scenes.
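
To make the contrast concrete, the sketch below shows the kind of static, rule-based token caching the description criticizes: a visual token from the previous timestep is reused whenever its cosine similarity to the current token exceeds a fixed threshold, regardless of the task objective. The function name, tensor shapes, and threshold value are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

def static_token_cache(prev_tokens: torch.Tensor,
                       new_tokens: torch.Tensor,
                       sim_threshold: float = 0.95):
    """Rule-based visual token caching (illustrative sketch only).

    prev_tokens, new_tokens: (num_tokens, dim) per-patch visual features
    from consecutive frames. Returns the merged tokens and a boolean mask
    marking which positions were served from the cache.
    """
    # Cosine similarity between corresponding patch tokens across frames.
    sims = F.cosine_similarity(prev_tokens, new_tokens, dim=-1)

    # Static rule: a fixed, task-agnostic threshold decides reuse.
    reuse = sims >= sim_threshold

    # Keep the cached token where the patch barely changed; otherwise
    # take the freshly computed one.
    merged = torch.where(reuse.unsqueeze(-1), prev_tokens, new_tokens)
    return merged, reuse
```

An adaptive scheme in the spirit of the paper's title would presumably replace the fixed threshold with a learned, task-aware decision about which tokens to cache, though the truncated description does not spell out the mechanism.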

Source

http://arxiv.org/abs/2602.00686v1