sim · medium · manipulation · metric: varies
DepthCache: Depth-Guided Training-Free Visual Token Merging for Vision-Language-Action Model Inference
Description
Vision-Language-Action (VLA) models enable generalist robotic manipulation but suffer from high inference latency. This bottleneck stems from the massive number of visual tokens processed by the large language backbone. Existing methods either prune or merge tokens uniformly, degrading the spatial reasoning essential for robotic control. We present DepthCache, a training-free framework that leverages depth as a structural prior for visual token compression. It partitions observations into depth-based …
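The core idea described above, partitioning visual tokens by depth and merging within each partition rather than uniformly, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the equal-width depth binning, the mean-pooling merge, and the function name `depth_guided_merge` are all assumptions introduced here for clarity.

```python
import numpy as np

def depth_guided_merge(tokens: np.ndarray, depths: np.ndarray,
                       num_bins: int = 4) -> np.ndarray:
    """Merge visual tokens within depth-based partitions.

    tokens: (N, D) array of visual token embeddings
    depths: (N,) array of per-token depth values
    Returns at most `num_bins` merged tokens, one per non-empty bin.
    """
    # Partition the depth range into equal-width bins
    # (a simplifying assumption; any depth partitioning scheme works here).
    edges = np.linspace(depths.min(), depths.max(), num_bins + 1)
    bin_ids = np.clip(np.digitize(depths, edges[1:-1]), 0, num_bins - 1)
    # Merge by averaging all tokens that fall into the same depth bin,
    # so tokens at similar depths collapse into one representative token.
    merged = [tokens[bin_ids == b].mean(axis=0)
              for b in range(num_bins) if np.any(bin_ids == b)]
    return np.stack(merged)

# Toy usage: 16 tokens of dimension 8 compress to at most 4 merged tokens.
rng = np.random.default_rng(0)
toks = rng.normal(size=(16, 8))
dpt = rng.uniform(0.5, 3.0, size=16)
out = depth_guided_merge(toks, dpt, num_bins=4)
print(out.shape)
```

Because merging happens only among tokens at similar depths, foreground objects and background clutter are never averaged together, which is what preserves the spatial structure that uniform merging destroys.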