sim · medium · robotics · metric · varies

PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models

Description

A critical gap exists between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. We present PRISM, a 270K-sample multi-view video supervised fine-tuning (SFT) corpus for embodied vision-language models (VLMs) in real-world retail environments. PRISM is motivated by a simple observation: physical AI systems fail not because of poor visual recognition, but because they do not unde

Source

http://arxiv.org/abs/2603.29281v1