← Back to Benchmarks
simmediumoffline-rlmetric · varies
One-Step Generative Policies with Q-Learning: A Reformulation of MeanFlow
Description
We introduce a one-step generative policy for offline reinforcement learning that maps noise directly to actions via a residual reformulation of MeanFlow, making it compatible with Q-learning. While one-step Gaussian policies enable fast inference, they struggle to capture complex, multimodal action distributions. Existing flow-based methods improve expressivity but typically rely on distillation and two-stage training when trained with Q-learning. To overcome these limitations, we propose to re