sim · medium · robotics · metric · varies

MANGO: A Benchmark for Evaluating Mapping and Navigation Abilities of Large Language Models

Description

Large language models such as ChatGPT and GPT-4 have recently achieved astonishing performance on a variety of natural language processing tasks. In this paper, we propose MANGO, a benchmark to evaluate their ability to perform text-based mapping and navigation. Our benchmark includes 53 mazes taken from a suite of text games: each maze is paired with a walkthrough that visits every location but does not cover all possible paths. The task is question-answering: for each maze, a large languag
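As a rough illustration of what mapping and navigation over a walkthrough involve, the sketch below builds a map from walkthrough steps and answers two kinds of question: where a sequence of moves ends up, and how to get from one location to another. This is not the benchmark's data format or evaluation code. The location names, the `(location, action, next location)` triple layout, and the helpers `build_map`, `follow`, and `find_route` are all hypothetical, and only moves that literally appear in the walkthrough are treated as known.

```python
from collections import deque

# Hypothetical walkthrough: (location, action, next location) triples.
# These names are illustrative and are not taken from the benchmark.
WALKTHROUGH = [
    ("Hall", "north", "Kitchen"),
    ("Kitchen", "east", "Pantry"),
    ("Pantry", "down", "Cellar"),
]


def build_map(steps):
    """Build a directed graph {location: {action: next_location}}."""
    graph = {}
    for src, action, dst in steps:
        graph.setdefault(src, {})[action] = dst
        graph.setdefault(dst, {})  # make sure every visited location is a node
    return graph


def follow(graph, start, actions):
    """Return where a sequence of actions leads, or None if a move is unknown."""
    here = start
    for action in actions:
        here = graph.get(here, {}).get(action)
        if here is None:
            return None
    return here


def find_route(graph, start, goal):
    """Breadth-first search for a shortest action sequence from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        here, path = queue.popleft()
        if here == goal:
            return path
        for action, nxt in graph.get(here, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None  # no route using only the observed moves


graph = build_map(WALKTHROUGH)
print(follow(graph, "Hall", ["north", "east"]))
print(find_route(graph, "Hall", "Cellar"))
```

Because the sketch records only observed moves, a question such as how to return from the cellar to the hall has no answer here. Answering it would require assuming something the walkthrough never shows, for example that "down" can be undone by "up".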

Source

http://arxiv.org/abs/2403.19913v2