sim · medium · navigation · metric · varies
R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space
Description
Humans perceive and reason about their surroundings in four dimensions by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to recall past events, infer unobserved states, and integrate new information into context-dependent reasoning. Inspired by this capability, we introduce R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that equips visi