← Back to Benchmarks
simmediummanipulationmetric · varies

HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning

Description

Vision-Language-Action (VLA) models have shown strong performance in robotic manipulation, but often struggle in long-horizon or out-of-distribution scenarios due to the lack of explicit mechanisms for multimodal reasoning and anticipating how the world will evolve under action. Recent works introduce textual chain-of-thought or visual subgoal prediction within VLA models to reason, but still fail to offer a unified human-like reasoning framework for joint textual reasoning, visual foresight, an

Source

http://arxiv.org/abs/2602.21157v2