sim · medium · robotics · metric: varies

LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior

Description

We introduce LaMP, a dual-expert Vision-Language-Action framework that embeds dense 3D scene flow as a latent motion prior for robotic manipulation. Existing VLA models regress actions directly from 2D semantic visual features, forcing them to learn complex 3D physical interactions implicitly. This implicit learning strategy degrades under unfamiliar spatial dynamics. LaMP addresses this limitation by aligning a flow-matching Motion Expert with a policy-predicting Action Ex

Source

http://arxiv.org/abs/2603.25399v1