sim · medium · navigation · metric · varies
SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning
Description
Autonomous robotic systems require spatio-temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision-Language Models (VLMs) provide open-world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open-World Knowledge), a training-free and backbone-agnostic framework for unified 4D scene unde