sim · medium · manipulation · metric: varies

DepthCache: Depth-Guided Training-Free Visual Token Merging for Vision-Language-Action Model Inference

Description

Vision-Language-Action (VLA) models enable generalist robotic manipulation but suffer from high inference latency. This bottleneck stems from the massive number of visual tokens processed by large language backbones. Existing methods either prune or merge tokens uniformly, degrading the spatial reasoning essential for robotic control. We present DepthCache, a training-free framework that leverages depth as a structural prior for visual token compression. It partitions observations into depth-based regions and merges visual tokens within each region.
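A minimal sketch of what depth-partitioned, training-free token merging could look like, assuming per-token depth values aligned with a ViT-style patch grid. The function name `depth_binned_merge`, the quantile-based binning, and the nearest-kept-token merging rule are illustrative assumptions, not the paper's actual method:

```python
# Sketch: merge visual tokens within depth bins (assumed design, not DepthCache's exact rule).
import torch

def depth_binned_merge(tokens: torch.Tensor,
                       depth: torch.Tensor,
                       num_bins: int = 4,
                       keep_per_bin: int = 32) -> torch.Tensor:
    """Merge visual tokens within depth bins.

    tokens: (N, D) visual token embeddings
    depth:  (N,)   per-token depth (e.g., mean depth of each patch)
    Returns a reduced token set of at most num_bins * keep_per_bin tokens.
    """
    # Partition tokens into depth bins using quantile edges, so each bin
    # covers roughly the same number of tokens.
    edges = torch.quantile(depth, torch.linspace(0, 1, num_bins + 1)[1:-1])
    bin_ids = torch.bucketize(depth, edges)

    merged = []
    for b in range(num_bins):
        group = tokens[bin_ids == b]
        if group.shape[0] <= keep_per_bin:
            merged.append(group)  # small bins pass through unchanged
            continue
        # Greedy merge: pool each surplus token into its most similar
        # kept token (a simple stand-in for the paper's merging rule).
        keep = group[:keep_per_bin]          # (K, D) tokens retained
        rest = group[keep_per_bin:]          # (M, D) tokens to be merged
        assign = (rest @ keep.T).argmax(dim=1)   # nearest kept token per row
        counts = torch.ones(keep_per_bin)
        counts.index_add_(0, assign, torch.ones(rest.shape[0]))
        pooled = keep.clone()
        pooled.index_add_(0, assign, rest)
        merged.append(pooled / counts.unsqueeze(1))  # mean of each merged set
    return torch.cat(merged, dim=0)
```

Quantile edges keep the bins roughly balanced, so near-field and far-field structure each retain tokens after merging, in line with the abstract's goal of preserving spatial reasoning while cutting token count.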

Source

http://arxiv.org/abs/2603.10469v1