sim · medium · manipulation · metric · varies

DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models

Description

Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by default, VLA models may attend excessively to image tokens in task-irrelevant regions, which we describe as 'distracting tokens'. This behavior can divert the model from generating the desired action tokens at each step, affecting the success rate of tas
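
To make the notion of 'distracting tokens' concrete, here is a minimal Python sketch. It is an illustration only and not the paper's procedure, which this description does not detail. The attention vector, the relevance mask, the threshold, and both function names are hypothetical. The sketch measures how much attention mass falls on image tokens outside a task-relevant region, then zeroes out the irrelevant tokens whose weight exceeds the threshold and renormalizes the rest.

```python
from typing import List


def distracting_mass(attn: List[float], relevant: List[bool]) -> float:
    """Fraction of total attention that lands on task-irrelevant tokens."""
    total = sum(attn)
    if total <= 0:
        return 0.0
    off_task = sum(a for a, r in zip(attn, relevant) if not r)
    return off_task / total


def prune_distracting(
    attn: List[float], relevant: List[bool], threshold: float
) -> List[float]:
    """Zero irrelevant tokens whose attention exceeds `threshold`, then renormalize.

    Hypothetical rule chosen for illustration; the real criterion may differ.
    """
    kept = [
        0.0 if (not r and a > threshold) else a
        for a, r in zip(attn, relevant)
    ]
    total = sum(kept)
    if total == 0:
        return kept
    return [k / total for k in kept]


# Toy example: five image tokens, two of them inside the task-relevant region.
attn = [0.05, 0.40, 0.10, 0.30, 0.15]          # assumed attention weights
relevant = [False, False, True, True, False]    # assumed relevance mask

before = distracting_mass(attn, relevant)
pruned = prune_distracting(attn, relevant, threshold=0.2)
after = distracting_mass(pruned, relevant)

print(f"off-task attention before pruning: {before:.2f}")
print(f"off-task attention after pruning:  {after:.2f}")
```

In this toy case only the second token is both irrelevant and above the threshold, so it is the only one removed; low-weight irrelevant tokens are kept, which is one possible design choice among many.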

Source

http://arxiv.org/abs/2601.16065v1