sim · medium · manipulation-data · metric · varies

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

Description

Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning ove

Source

http://arxiv.org/abs/2603.22281v1