sim · medium · navigation · metric · varies

Implicit Geometry Representations for Vision-and-Language Navigation from Web Videos

Description

Vision-and-Language Navigation (VLN) has long been constrained by the limited diversity and scalability of simulator-curated datasets, which fail to capture the complexity of real-world environments. To overcome this limitation, we introduce a large-scale video-instruction framework derived from web-based room tour videos, enabling agents to learn from natural human walking demonstrations in diverse, realistic indoor settings. Unlike existing datasets, our framework integrates both open-ended de

Source

http://arxiv.org/abs/2603.09259v1