sim · medium · rl · metric · varies

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Description

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement,
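The coarse credit assignment attributed to GRPO above can be illustrated with a minimal sketch. It assumes the commonly used group-relative advantage (each reward minus the group mean, divided by the group standard deviation) and uses the population standard deviation plus a small `eps`; those choices, the function names, and the toy rewards are illustrative assumptions, not details taken from the paper. SDPO's logit-level supervision is not sketched here because the description does not specify it.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's scalar reward against its own sampling group.

    Assumption: population standard deviation and an additive eps for
    numerical stability; implementations differ on both details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, token_counts):
    """Copy each rollout's single advantage onto every one of its tokens."""
    return [[a] * n for a, n in zip(advantages, token_counts)]


# Toy group of four rollouts for one prompt: 1.0 = verified correct, 0.0 = failed.
rewards = [1.0, 0.0, 0.0, 1.0]
token_counts = [3, 5, 2, 4]  # hypothetical response lengths in tokens

advantages = group_relative_advantages(rewards)
per_token = broadcast_to_tokens(advantages, token_counts)

for reward, token_advs in zip(rewards, per_token):
    print(f"reward={reward:.1f} tokens={len(token_advs)} advantage={token_advs[0]:+.3f}")
```

In this sketch every token of a failed rollout carries the same negative advantage, whether or not that token contributed to the failure. That uniform penalty is the limitation the description contrasts with denser, token-targeted supervision.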

Source

http://arxiv.org/abs/2604.02288v1