sim · medium · offline-rl · metric · varies

One-Step Generative Policies with Q-Learning: A Reformulation of MeanFlow

Description

We introduce a one-step generative policy for offline reinforcement learning that maps noise directly to actions via a residual reformulation of MeanFlow, making it compatible with Q-learning. While one-step Gaussian policies enable fast inference, they struggle to capture complex, multimodal action distributions. Existing flow-based methods improve expressivity but typically rely on distillation and two-stage training when trained with Q-learning. To overcome these limitations, we propose to re
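To make the idea of a one-step noise-to-action policy concrete, here is a minimal sketch in Python with NumPy. It is not the paper's method: the excerpt above does not give the exact form of the residual reformulation, so the additive form `action = noise + residual` is an assumption, and `residual_net`, `one_step_policy`, and the weights `W_s` and `W_z` are hypothetical names with random, untrained values. The sketch only shows the shape of the computation: a single forward pass from sampled noise to an action, with no iterative flow integration.

```python
import numpy as np


def residual_net(state, noise, W_s, W_z):
    """Hypothetical stand-in for a learned network.

    Returns a state- and noise-dependent correction with the same
    shape as the noise (one row per batch element).
    """
    return np.tanh(state @ W_s + noise @ W_z)


def one_step_policy(state, noise, W_s, W_z):
    """Map noise to an action in one forward pass.

    Assumed residual form: action = noise + correction, then clipped
    to the action bounds [-1, 1].
    """
    action = noise + residual_net(state, noise, W_s, W_z)
    return np.clip(action, -1.0, 1.0)


rng = np.random.default_rng(0)
state_dim, action_dim, batch = 4, 2, 8

# Random placeholder weights; a real policy would learn these.
W_s = rng.normal(scale=0.1, size=(state_dim, action_dim))
W_z = rng.normal(scale=0.1, size=(action_dim, action_dim))

states = rng.normal(size=(batch, state_dim))
noise = rng.normal(size=(batch, action_dim))

# One network evaluation per action: no ODE solver, no multi-step sampling.
actions = one_step_policy(states, noise, W_s, W_z)
```

Because the action is a differentiable function of the network output, a Q-learning objective could in principle backpropagate through `one_step_policy` directly, which is the compatibility the description refers to. How the paper actually trains the policy is not covered by this sketch.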

Source

http://arxiv.org/abs/2511.13035v1