sim · medium · manipulation · metric: varies
ConLA: Contrastive Latent Action Learning from Human Videos for Robotic Manipulation
Description
Vision-Language-Action (VLA) models achieve preliminary generalization through pretraining on large-scale robot teleoperation datasets. However, acquiring datasets that comprehensively cover diverse tasks and environments is extremely costly and difficult to scale. In contrast, human demonstration videos offer a rich and scalable source of diverse scenes and manipulation behaviors, yet their lack of explicit action supervision hinders their direct use. Prior work leverages VQ-VAE-based frameworks to learn latent actions from such videos without action labels.
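The description above is truncated, but the core idea in the title, learning latent actions from human video with a contrastive objective, can be illustrated. Below is a minimal sketch assuming an InfoNCE-style loss over encodings of frame-pair transitions; every name here (`LatentActionEncoder`, `info_nce`) and all architectural details are assumptions for illustration, not ConLA's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionEncoder(nn.Module):
    """Maps a (frame_t, frame_{t+k}) pair to a latent action vector.
    The tiny conv backbone is a stand-in for a real visual encoder."""
    def __init__(self, feat_dim=512, latent_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Sequential(
            nn.Linear(64, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, latent_dim),
        )

    def forward(self, frame_t, frame_tk):
        # Channel-stack the two RGB frames so the encoder sees the transition.
        x = torch.cat([frame_t, frame_tk], dim=1)
        return self.proj(self.backbone(x))

def info_nce(z_a, z_b, temperature=0.1):
    """Standard InfoNCE: matching transitions in the batch are positives,
    all other batch entries serve as negatives."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature  # (B, B) cosine-similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)

# Toy usage: two views of the same batch of frame transitions.
enc = LatentActionEncoder()
f_t, f_tk = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
view_a = enc(f_t, f_tk)
view_b = enc(f_t + 0.01 * torch.randn_like(f_t), f_tk)  # crude "augmentation"
loss = info_nce(view_a, view_b)
loss.backward()
```

The design intuition, under these assumptions, is that transitions showing the same underlying motion should map to nearby latent actions, which gives pseudo-action supervision for videos that carry none.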