sim · medium · manipulation · metric · varies

DM0: An Embodied-Native Vision-Language-Action Model towards Physical AI

Description

Moving beyond the traditional paradigm of adapting internet-pretrained models to physical tasks, we present DM0, an Embodied-Native Vision-Language-Action (VLA) framework designed for Physical AI. Unlike approaches that treat physical grounding as a fine-tuning afterthought, DM0 unifies embodied manipulation and navigation by learning from heterogeneous data sources from the outset. Our methodology follows a three-stage pipeline: Pretraining, Mid-Training, and Post-Training. First,

Source

http://arxiv.org/abs/2602.14974v1