sim · medium · manipulation · metric: varies

JEPA-VLA: Video Predictive Embedding is Needed for VLA Models

Description

Recent vision-language-action (VLA) models built upon pretrained vision-language models (VLMs) have achieved significant improvements in robotic manipulation. However, current VLAs still suffer from low sample efficiency and limited generalization. This paper argues that these limitations are closely tied to an overlooked component, the pretrained visual representation, which offers insufficient knowledge for both environment understanding and the policy prior. Through an in-depth analysis, we …

Source

http://arxiv.org/abs/2602.11832v1