sim · medium · dexterous · metric · varies

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Description

Visual representations play a crucial role in developing generalist robotic policies. Previous vision encoders, typically pre-trained with single-image reconstruction or two-image contrastive learning, tend to capture static information and often neglect the dynamic aspects vital for embodied tasks. Recently, video diffusion models (VDMs) have demonstrated the ability to predict future frames and shown a strong understanding of the physical world. We hypothesize that VDMs inherently produce visual repr

Source

http://arxiv.org/abs/2412.14803v2