sim · medium · navigation · metric: varies

VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness

Description

Vision-and-Language Navigation (VLN) increasingly relies on large vision-language models, but their inference cost conflicts with real-time deployment. Token caching is a promising training-free strategy that avoids redundant computation by reusing stable visual tokens across frames. However, existing methods assume a static camera and a fixed semantic focus, assumptions that VLN fundamentally violates. We identify two failure modes: (1) visual dynamics, where viewpoint shifts displace token positions …
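To make the caching idea concrete, here is a minimal sketch of similarity-gated token reuse in NumPy. This is a generic baseline, not the paper's method: tokens whose embeddings are nearly unchanged from the previous frame reuse their cached features, and only drifted tokens are recomputed. The function names (`expensive_transform`, `cached_forward`), the random-projection stand-in for the model layer, and the 0.95 threshold are all illustrative assumptions; note that a per-index comparison like this is exactly what breaks under viewpoint shift, since a moving camera displaces tokens across positions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))  # stand-in for an expensive model layer

def expensive_transform(tok):
    # Placeholder for the full per-token forward pass we want to avoid.
    return np.tanh(W @ tok)

def cached_forward(curr_tokens, prev_tokens, cache, sim_threshold=0.95):
    """Recompute only tokens whose embedding drifted; reuse the rest.

    A naive per-position cache: token i in the current frame is compared
    against token i in the previous frame. Camera motion invalidates this
    alignment, which is the 'visual dynamics' failure mode described above.
    """
    out, reused = [], 0
    for i, tok in enumerate(curr_tokens):
        if prev_tokens is not None and i in cache:
            prev = prev_tokens[i]
            sim = tok @ prev / (np.linalg.norm(tok) * np.linalg.norm(prev) + 1e-8)
            if sim >= sim_threshold:
                out.append(cache[i])  # cache hit: skip recomputation
                reused += 1
                continue
        cache[i] = expensive_transform(tok)  # cache miss: recompute and store
        out.append(cache[i])
    return np.stack(out), reused

# A static scene: the second frame reuses every cached token.
cache = {}
frame1 = rng.standard_normal((4, 8))
_, r1 = cached_forward(frame1, None, cache)           # cold cache: 0 reused
_, r2 = cached_forward(frame1.copy(), frame1, cache)  # identical frame: 4 reused
```

In a real VLN setting the similarity gate would have to account for token displacement (e.g. matching tokens across positions rather than by index), which is the gap this benchmark's models aim to close.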

Source

http://arxiv.org/abs/2603.07080v2