sim · medium · grasping · metric · varies

BOP-ASK: Object-Interaction Reasoning for Vision-Language Models

Description

Vision-Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high-level relationships ('left of', 'behind', etc.) but ignore the fine-grained spatial understanding needed for real-world applications: precise 3D localization, physical compatibility between objects, object affordances, and multi-step spatial planning. In this work, we present BOP-ASK, a nov

Source

http://arxiv.org/abs/2511.16857v2