sim · medium · manipulation · metric · varies
From Knowing to Doing Precisely: A General Self-Correction and Termination Framework for VLA models
Description
While vision-language-action (VLA) models for embodied agents integrate perception, reasoning, and control, they remain constrained by two critical weaknesses: first, during grasping tasks, the action tokens generated by the language model often exhibit subtle spatial deviations from the target object, resulting in grasp failures; second, they lack the ability to reliably recognize task completion, which leads to redundant actions and frequent timeout errors. To address these challenges and enha