sim · medium · navigation · metric · varies

History-Conditioned Spatio-Temporal Visual Token Pruning for Efficient Vision-Language Navigation

Description

Vision-Language Navigation (VLN) enables robots to follow natural-language instructions in visually grounded environments, serving as a key capability for embodied robotic systems. Recent Vision-Language-Action (VLA) models have demonstrated strong navigation performance, but their high computational cost introduces latency that limits real-time deployment. We propose a training-free spatio-temporal vision token pruning framework tailored to VLA-based VLN. We apply spatial token selection to the
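As a rough illustration of what spatial token selection can look like in general, the sketch below keeps the highest-scoring visual tokens from a single frame and drops the rest. The description above is cut off before it explains how the framework scores or selects tokens, so everything here is an assumption: the function name `select_top_k_tokens`, the `keep_ratio` parameter, and the idea of a per-token importance score are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def select_top_k_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return the indices of the visual tokens to keep, in original order.

    `scores` holds one importance value per token (hypothetical; for
    example, an attention-derived weight). `keep_ratio` is the fraction
    of tokens to retain, in (0, 1].
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n = len(scores)
    if n == 0:
        return []
    # Keep at least one token so the downstream model always gets input.
    k = max(1, int(n * keep_ratio))
    # Rank token indices by score, highest first.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    # Restore the original order so positional information is preserved.
    return sorted(ranked[:k])


if __name__ == "__main__":
    frame_scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7]
    print(select_top_k_tokens(frame_scores, keep_ratio=0.5))
```

Because selection of this kind only reorders and filters existing tokens, it needs no retraining, which is consistent with the training-free setting described above. How the actual framework computes scores or uses navigation history is not shown here.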

Source

http://arxiv.org/abs/2603.06480v1