sim · medium · vision-robot · metric: varies

PhysQuantAgent: An Inference Pipeline of Mass Estimation for Vision-Language Models

Description

Vision-Language Models (VLMs) are increasingly applied to robotic perception and manipulation, yet their ability to infer physical properties required for manipulation remains limited. In particular, estimating the mass of real-world objects is essential for determining appropriate grasp force and ensuring safe interaction. However, current VLMs lack reliable mass reasoning capabilities, and most existing benchmarks do not explicitly evaluate physical quantity estimation under realistic sensing

Source

http://arxiv.org/abs/2603.16958v1