sim · medium · vision-robot · metric: varies

Gaze-Regularized Vision-Language-Action Models for Robotic Manipulation

Description

Despite advances in Vision-Language-Action (VLA) models, robotic manipulation still struggles with fine-grained tasks because current models lack mechanisms for active visual attention allocation. Human gaze naturally encodes intent, planning, and execution patterns, offering a powerful supervisory signal for guiding robot perception. We introduce a gaze-regularized training framework that aligns VLA models' internal attention with human visual patterns without architectural modifications or inference-time overhead.
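
The abstract does not spell out the regularizer, so the following is only a minimal sketch of one plausible form: a KL-divergence penalty that pulls a model's normalized spatial attention map toward a human gaze heatmap on the same grid, added to the usual action-prediction loss. All names here (gaze_regularization_loss, lambda_gaze) are hypothetical illustrations, not the paper's API.

```python
import torch


def gaze_regularization_loss(attn_map: torch.Tensor,
                             gaze_heatmap: torch.Tensor,
                             eps: float = 1e-8) -> torch.Tensor:
    """KL(gaze || attention) between two spatial probability maps.

    attn_map:     (B, H, W) attention weights from the vision backbone
    gaze_heatmap: (B, H, W) human gaze fixation density on the same grid
    """
    b = attn_map.size(0)
    # Flatten and normalize both maps into per-example distributions.
    p = attn_map.reshape(b, -1)
    q = gaze_heatmap.reshape(b, -1)
    p = p / (p.sum(dim=-1, keepdim=True) + eps)
    q = q / (q.sum(dim=-1, keepdim=True) + eps)
    # KL(q || p): penalize attention mass placed where humans do not look.
    kl = (q * (torch.log(q + eps) - torch.log(p + eps))).sum(dim=-1)
    return kl.mean()


# Hypothetical training objective: standard action loss plus the gaze term,
# weighted by a coefficient lambda_gaze (a tunable hyperparameter).
# loss = action_loss + lambda_gaze * gaze_regularization_loss(attn, gaze)
```

Because the penalty acts only on the attention maps the model already produces, a regularizer of this shape would leave the architecture and inference path unchanged, consistent with the claim in the description.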

Source

http://arxiv.org/abs/2603.23202v2