sim · medium · grasping · metric: varies

Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces

Description

The remarkable progress of Multimodal Large Language Models (MLLMs) has attracted increasing attention to extending them to physical entities such as legged robots. This typically requires MLLMs not only to grasp multimodal understanding abilities, but also to integrate visual-spatial reasoning and physical interaction capabilities. Nevertheless, existing methods struggle to unify these capabilities due to their fundamental differences. In this paper, we present the Visual Embodied Brain (VeBrain), a unified framework for perception, reasoning, and control.

Source

http://arxiv.org/abs/2506.00123v1