sim · medium · rl · metric · varies

ERPO: Token-Level Entropy-Regulated Policy Optimization for Large Reasoning Models

Description

Reinforcement learning from verifiable rewards has significantly advanced the reasoning capabilities of large language models. However, Group Relative Policy Optimization (GRPO) typically assigns a uniform, sequence-level advantage to all tokens, thereby overlooking the intrinsic information heterogeneity along reasoning chains. We show that this coarse-grained credit assignment leads to premature entropy collapse and encourages the model to generate redundant, low-quality reasoning paths. Throu
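
The uniform, sequence-level credit assignment described above can be illustrated with a small sketch. It shows only the GRPO baseline behavior the description criticizes (one group-normalized advantage copied to every token of a response) together with the per-token entropy that such a scheme ignores. It does not implement ERPO, whose token-level rule is not given in this excerpt. The function names, the toy rewards, and the use of the population standard deviation for normalization are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def grpo_token_advantages(rewards, lengths, eps=1e-8):
    """Group-relative advantage, copied unchanged to every token.

    rewards: one scalar reward per sampled response in the group.
    lengths: number of tokens in each response.
    Assumption: normalize with the population std of the group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    seq_adv = [(r - mu) / (sigma + eps) for r in rewards]
    # Every token in a response receives the same sequence-level value.
    return [[a] * n for a, n in zip(seq_adv, lengths)]


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical group of four responses with binary verifiable rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [3, 2, 4, 1]
advantages = grpo_token_advantages(rewards, lengths)

# Two toy next-token distributions: an uncertain step and a near-certain one.
uncertain = [0.25, 0.25, 0.25, 0.25]
confident = [0.97, 0.01, 0.01, 0.01]
print(advantages[0])
print(token_entropy(uncertain), token_entropy(confident))
```

In this toy setting the uncertain and the confident token would receive the same advantage if they belonged to the same response, even though their entropies differ widely. That is the heterogeneity a token-level scheme would aim to account for.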

Source

http://arxiv.org/abs/2603.28204v2