sim · medium · manipulation-data · metric · varies
ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model
Description
Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning ove