sim · medium · imitation · metric · varies

AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning

Description

Visual Robot Manipulation (VRM) aims to enable a robot to follow natural language instructions based on robot states and visual observations, and therefore requires costly multi-modal data. To compensate for the deficiency of robot data, existing approaches have employed vision-language pretraining with large-scale data. However, they either utilize web data that differs from robotic tasks, or train the model in an implicit way (e.g., predicting future frames at the pixel level), thus showing li

Source

http://arxiv.org/abs/2508.07626v1