sim · medium · manipulation · metric: varies

DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control

Description

Vision-Language-Action (VLA) models have emerged as a promising paradigm for robot learning, but their representations are still largely inherited from static image-text pretraining, leaving physical dynamics to be learned from comparatively limited action data. Generative video models, by contrast, encode rich spatiotemporal structure and implicit physics, making them a compelling foundation for robotic manipulation. Yet their potential remains largely unexplored in the literature. To bridge the g

Source

http://arxiv.org/abs/2603.10448v2