sim · medium · robotics · metric: varies

VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning

Description

Vision-Language-Action (VLA) models have shown promising capabilities for embodied intelligence, but most existing approaches rely on text-based chain-of-thought reasoning in which visual inputs are treated as static context. This limits the model's ability to actively revisit the environment and resolve ambiguities during long-horizon tasks. We propose VLA-Thinker, a thinking-with-image reasoning framework that models perception as a dynamically invocable reasoning action. To train such a sys…

Source

http://arxiv.org/abs/2603.14523v1