sim · medium · vision-robot · metric · varies

pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning

Description

Multi-modal Large Language Models (MLLMs) have demonstrated strong capabilities in general-purpose perception and reasoning, but they still struggle with tasks that require spatial understanding of the 3D world. To address this, we introduce pySpatial, a visual programming framework that equips MLLMs with the ability to interface with spatial tools via Python code generation. Given an image sequence and a natural-language query, the model composes function calls to spatial tools including 3D rec
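To make the idea of a generated visual program concrete, here is a minimal sketch of how such a program might compose calls to spatial tools and be executed. Everything in it is an assumption for illustration: `Scene`, `estimate_scene`, `camera_displacement`, `TOOLS`, `GENERATED_PROGRAM`, and `run_visual_program` are hypothetical names, the tools are toy stubs, and none of this is pySpatial's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Scene:
    """Hypothetical stand-in for what a spatial tool might return."""
    camera_positions: List[Vec3]


def estimate_scene(frames: List[str]) -> Scene:
    """Hypothetical tool stub: places one camera per frame, 1 unit apart on x."""
    return Scene([(float(i), 0.0, 0.0) for i in range(len(frames))])


def camera_displacement(scene: Scene, i: int, j: int) -> Vec3:
    """Hypothetical tool stub: vector from camera i to camera j."""
    a, b = scene.camera_positions[i], scene.camera_positions[j]
    return (b[0] - a[0], b[1] - a[1], b[2] - a[2])


# Registry of callables exposed to the generated program (assumed design).
TOOLS: Dict[str, Callable] = {
    "estimate_scene": estimate_scene,
    "camera_displacement": camera_displacement,
}

# Stand-in for code an MLLM might emit for the query
# "Did the camera move left or right across the sequence?"
GENERATED_PROGRAM = """
scene = estimate_scene(frames)
dx, dy, dz = camera_displacement(scene, 0, len(frames) - 1)
answer = "right" if dx > 0 else "left"
"""


def run_visual_program(program: str, frames: List[str]) -> str:
    """Execute the program with only the tools, the frames, and len() in scope."""
    namespace = dict(TOOLS, frames=frames, len=len)
    # Emptying __builtins__ narrows the namespace; it is not a security sandbox.
    exec(program, {"__builtins__": {}}, namespace)
    return namespace["answer"]


if __name__ == "__main__":
    print(run_visual_program(GENERATED_PROGRAM, ["f0.png", "f1.png", "f2.png"]))
```

In this sketch the language model's only job is to write the program text; the numeric work happens inside the tools, which is one way to read the division of labor the description suggests.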

Source

http://arxiv.org/abs/2603.00905v1