sim · medium · vision-robot · metric · varies
pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning
Description
Multi-modal Large Language Models (MLLMs) have demonstrated strong capabilities in general-purpose perception and reasoning, but they still struggle with tasks that require spatial understanding of the 3D world. To address this, we introduce pySpatial, a visual programming framework that equips MLLMs with the ability to interface with spatial tools via Python code generation. Given an image sequence and a natural-language query, the model composes function calls to spatial tools including 3D rec