sim · medium · vision-robot · metric · varies

V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

Description

We present V-JEPA 2.1, a family of self-supervised models that learn dense, high-quality visual representations for both images and videos while retaining strong global scene understanding. The approach combines four key components. First, a dense predictive loss uses a masking-based objective in which both visible and masked tokens contribute to the training signal, encouraging explicit spatial and temporal grounding. Second, deep self-supervision applies the self-supervised objective hierarchi
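The dense predictive loss described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of an L1 distance, and the `visible_weight` parameter are all assumptions made for illustration. The sketch shows only the core idea that both masked and visible token positions contribute to the training signal.

```python
from typing import List, Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def dense_predictive_loss(
    predicted: List[Sequence[float]],
    target: List[Sequence[float]],
    is_masked: Sequence[bool],
    visible_weight: float = 0.5,  # hypothetical weighting, not taken from the paper
) -> float:
    """Combine per-token losses over masked AND visible positions.

    predicted / target: one feature vector per token.
    is_masked: True where the token was hidden from the context encoder.
    """
    masked_terms: List[float] = []
    visible_terms: List[float] = []
    for pred, tgt, masked in zip(predicted, target, is_masked):
        term = l1_distance(pred, tgt)
        (masked_terms if masked else visible_terms).append(term)

    # Average each group separately so neither dominates by token count alone.
    masked_loss = sum(masked_terms) / len(masked_terms) if masked_terms else 0.0
    visible_loss = sum(visible_terms) / len(visible_terms) if visible_terms else 0.0
    return masked_loss + visible_weight * visible_loss


if __name__ == "__main__":
    predicted = [[1.0, 1.0], [0.0, 0.0], [2.0, 0.0]]
    target = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
    is_masked = [True, False, False]
    print(dense_predictive_loss(predicted, target, is_masked))
```

Setting `visible_weight=0.0` in this sketch reduces it to a loss over masked tokens only, which makes the contrast with the dense variant easy to see.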

Source

http://arxiv.org/abs/2603.14482v2