sim · medium · robotics · metric · varies

Action Images: End-to-End Policy Learning via Multiview Video Generation

Description

World action models (WAMs) have emerged as a promising direction for robot policy learning because they can leverage powerful video backbones to model future states. However, existing approaches often rely on separate action modules or use action representations that are not pixel-grounded, which makes it difficult to fully exploit the pretrained knowledge of video models and limits transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model th

Source

http://arxiv.org/abs/2604.06168v1