sim · medium · humanoid · metric · varies

RoboMirror: Understand Before You Imitate for Video to Humanoid Locomotion

Description

Humans learn locomotion through visual observation, interpreting visual content first before imitating actions. However, state-of-the-art humanoid locomotion systems rely on either curated motion capture trajectories or sparse text commands, leaving a critical gap between visual understanding and control. Text-to-motion methods suffer from semantic sparsity and staged pipeline errors, while video-based approaches only perform mechanical pose mimicry without genuine visual understanding. We propo

Source

http://arxiv.org/abs/2512.23649v3