sim · medium · navigation · metric: varies
VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness
Description
Vision-and-Language Navigation (VLN) increasingly relies on large vision-language models, but their inference cost conflicts with real-time deployment. Token caching is a promising training-free strategy that avoids redundant computation by reusing stable visual tokens across frames. However, existing methods assume a static camera and a fixed semantic focus, assumptions that VLN fundamentally violates. We identify two failure modes: (1) visual dynamics, where viewpoint shifts displace token positions across frames, and (2) semantic dynamics, where the instruction-relevant focus changes as the agent moves through the scene.
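To make the caching idea concrete, below is a minimal sketch of naive per-position token caching: features from the previous frame are reused for any token whose embedding is nearly unchanged, and only the changed tokens are re-encoded. All names (`cached_token_update`, `encode`, the threshold `tau`) are illustrative assumptions, not the paper's method; note that this per-position comparison is exactly what breaks under the visual dynamics described above, since a viewpoint shift moves content to different token positions.

```python
import numpy as np

def cached_token_update(prev_tokens, curr_tokens, prev_features, encode, tau=0.95):
    """Reuse cached per-token features when the incoming token is stable.

    prev_tokens, curr_tokens: (N, D) token embeddings from consecutive frames.
    prev_features: (N, F) expensive features computed for the previous frame.
    encode: callable mapping (M, D) tokens -> (M, F) features (the costly step).
    tau: cosine-similarity threshold above which a token counts as stable.
    """
    # Cosine similarity between tokens at the same spatial position.
    num = (prev_tokens * curr_tokens).sum(axis=1)
    denom = (np.linalg.norm(prev_tokens, axis=1)
             * np.linalg.norm(curr_tokens, axis=1) + 1e-8)
    stable = (num / denom) >= tau

    # Reuse cached features for stable tokens; recompute only the rest.
    features = prev_features.copy()
    if (~stable).any():
        features[~stable] = encode(curr_tokens[~stable])
    return features, stable
```

Under a static camera most tokens pass the similarity test, so `encode` runs on only a small subset; under camera motion the position-wise comparison fails even when the same content is still in view, which motivates the dynamics-aware design this benchmark evaluates.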