sim · medium · navigation · metric: varies
One Agent to Guide Them All: Empowering MLLMs for Vision-and-Language Navigation via Explicit World Representation
Description
A navigation agent must understand both high-level semantic instructions and precise spatial perception. Building navigation agents around Multimodal Large Language Models (MLLMs) is a promising approach because of their strong generalization ability. However, the current tightly coupled design dramatically limits system performance. In this work, we propose a decoupled design that separates low-level spatial state estimation from high-level semantic planning. Unlike previous me