sim medium rl metric · varies

LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models

Description

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved remarkable success in improving autoregressive models, especially in domains requiring correctness like mathematical reasoning and code generation. However, directly applying such paradigms to Diffusion Large Language Models (dLLMs) is fundamentally hindered by the intractability of exact likelihood computation, which forces existing methods to rely on high-variance approximations. To bridge this gap, we propose Likelihood-Free P

Source

http://arxiv.org/abs/2603.01563v1