sim · medium · vision-robot · metric · varies

Learning Multi-View Spatial Reasoning from Cross-View Relations

Description

Vision-language models (VLMs) have achieved impressive results on single-view vision tasks, but lack the multi-view spatial reasoning capabilities essential for embodied AI systems to understand 3D environments and manipulate objects across different viewpoints. In this work, we introduce Cross-View Relations (XVR), a large-scale dataset designed to teach VLMs spatial reasoning across multiple views. XVR comprises 100K vision-question-answer samples derived from 18K diverse 3D scenes and 70K rob

Source

http://arxiv.org/abs/2603.27967v1