sim · medium · imitation · metric · varies
AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning
Description
Visual Robot Manipulation (VRM) aims to enable a robot to follow natural language instructions based on robot states and visual observations, and therefore requires costly multi-modal data. To compensate for the deficiency of robot data, existing approaches have employed vision-language pretraining with large-scale data. However, they either utilize web data that differs from robotic tasks, or train the model in an implicit way (e.g., predicting future frames at the pixel level), thus showing li