sim · medium · manipulation · metric: varies
Self-Correcting VLA: Online Action Refinement via Sparse World Imagination
Description
Standard vision-language-action (VLA) models rely on fitting statistical data priors, which limits their understanding of the underlying physical dynamics. Reinforcement learning strengthens physical grounding through exploration, yet it typically relies on external reward signals that remain isolated from the agent's internal states. World action models have emerged as a promising paradigm that integrates imagination and control to enable predictive planning. However, they rely on implicit context mode