Sim · Medium · Grasping · Metric: varies
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data
Description
Embodied foundation models are gaining increasing attention for their zero-shot generalization, scalability, and adaptability to new tasks through few-shot post-training. However, existing models rely heavily on real-world data, which is costly and labor-intensive to collect. Synthetic data offers a cost-effective alternative, yet its potential remains largely underexplored. To bridge this gap, we explore the feasibility of training Vision-Language-Action (VLA) models entirely with large-scale synthetic action data.