sim · medium · navigation · metric · varies

VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation agents

Description

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a wide range of vision-language tasks. However, their performance as embodied agents, a role that requires multi-round dialogue, spatial reasoning, and sequential action prediction, remains underexplored. Our work investigates this potential in the context of Vision-and-Language Navigation (VLN) by introducing a unified and extensible evaluation framework to probe MLLMs as zero-shot agents by bridging tradition

Source

http://arxiv.org/abs/2512.24851v2