sim · medium · robotics · metric: varies

VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

Description

Vision-Language-Action (VLA) models typically map visual observations and linguistic instructions directly to robotic control signals. This "black-box" mapping forces a single forward pass to simultaneously handle instruction interpretation, spatial grounding, and low-level control, often leading to poor spatial precision and limited robustness in out-of-distribution scenarios. To address these limitations, we propose VP-VLA, a dual-system framework that decouples high-level reasoning and low-le

Source

http://arxiv.org/abs/2603.22003v1