sim · medium · manipulation · metric: varies

BFA++: Hierarchical Best-Feature-Aware Token Prune for Multi-View Vision Language Action Model

Description

Vision-Language-Action (VLA) models have achieved significant breakthroughs by leveraging Large Vision-Language Models (VLMs) to jointly interpret instructions and visual inputs. However, the substantial increase in visual tokens, particularly from multi-view inputs, poses serious challenges for real-time robotic manipulation. Existing acceleration techniques for VLMs, such as token pruning, often degrade performance when applied directly to VLA models, as they overlook the relationship …
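Token pruning in this setting typically ranks visual tokens by an importance score (for example, the attention they receive from instruction tokens) and discards the lowest-ranked ones before the expensive transformer layers. The following is a minimal, hypothetical sketch of that generic idea, not the BFA++ method itself; the function name, the scoring input, and the keep ratio are all assumptions for illustration:

```python
import numpy as np

def prune_visual_tokens(tokens, attn_scores, keep_ratio=0.5):
    """Keep the top-`keep_ratio` fraction of visual tokens ranked by an
    importance score, preserving the original ordering of the survivors."""
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    top = np.argsort(attn_scores)[-k:]  # indices of the k highest scores
    keep = np.sort(top)                 # restore positional order
    return tokens[keep], keep

# Toy example: 6 visual tokens with 4-dim features, keep half.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])
pruned, kept = prune_visual_tokens(tokens, scores, keep_ratio=0.5)
print(kept.tolist())  # positions of the retained tokens -> [1, 3, 5]
```

With multi-view inputs the same ranking could be applied per view or across the concatenated token set; the paper's contribution is precisely that a naive, view-agnostic ranking like this one loses task-relevant information.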

Source

http://arxiv.org/abs/2602.20566v1