sim · medium · manipulation · metric · varies
AtomVLA: Scalable Post-Training for Robotic Manipulation via Predictive Latent World Models
Description
Vision-Language-Action (VLA) models demonstrate remarkable potential for generalizable robotic manipulation. Robust instruction grounding is a critical component of effective control and can improve how VLA models execute complex multi-step behaviors. However, current paradigms predominantly rely on coarse, high-level task instructions during supervised fine-tuning. This instruction grounding gap leaves models without explicit intermediate guidance, leading to severe compounding errors