simmediumrlmetric · varies

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

Description

Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoi

Source

http://arxiv.org/abs/2603.12554v1