sim · medium · imitation · metric · varies

EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos

Description

Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. Human videos offer not only scale but, more importantly, a rich variety of scenes and tasks. With a VLA trained on human video that predicts human wrist

Source

http://arxiv.org/abs/2507.12440v3