sim · medium · imitation · metric · varies

Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals

Description

This work introduces the Multimodal Diffusion Transformer (MDT), a novel diffusion policy framework that excels at learning versatile behavior from multimodal goal specifications with few language annotations. MDT leverages a diffusion-based multimodal transformer backbone and two self-supervised auxiliary objectives to master long-horizon manipulation tasks based on multimodal goals. The vast majority of imitation learning methods only learn from individual goal modalities, e.g. either languag

Source

http://arxiv.org/abs/2407.05996v1