Sim · Medium · Grasping · Metric: varies
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data
Description
Embodied foundation models are gaining increasing attention for their zero-shot generalization, scalability, and adaptability to new tasks through few-shot post-training. However, existing models rely heavily on real-world data, which is costly and labor-intensive to collect. Synthetic data offers a cost-effective alternative, yet its potential remains largely underexplored. To bridge this gap, we explore the feasibility of training Vision-Language-Action (VLA) models entirely with large-scale synthetic action data.