sim · medium · robotics · metric: varies
P$^{3}$Nav: End-to-End Perception, Prediction and Planning for Vision-and-Language Navigation
Description
In Vision-and-Language Navigation (VLN), an agent must plan a path to the target specified by a language instruction, using its visual observations. Consequently, prevailing VLN methods focus primarily on building powerful planners through visual-textual alignment. However, these approaches often skip comprehensive scene understanding before planning, leaving the agent with insufficient perception and prediction capabilities. Thus, we propose P$^{3}$Nav, a novel