
DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning

Description

Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path-level diversity, leading to weak and unstable learning signals in group-based policy optimization. We propose DSDR, Dual-Scale Diversity Regularization, to promote exploration in LLM reasoning.
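To make the critique concrete, the sketch below shows the conventional entropy bonus the abstract refers to: a per-token Shannon entropy term added to the policy loss. This is a minimal illustrative implementation of the generic technique, not the paper's DSDR method; the function names, the `beta` coefficient, and the NumPy formulation are assumptions for illustration. Note that the bonus only injects local, token-level stochasticity and says nothing about diversity across whole reasoning trajectories, which is precisely the limitation the abstract highlights.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_entropy(logits):
    """Per-token Shannon entropy of the policy's next-token distribution.

    logits: array of shape (num_tokens, vocab_size).
    Returns an array of shape (num_tokens,).
    """
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def entropy_regularized_loss(policy_loss, logits, beta=0.01):
    """Conventional entropy regularization: subtract the mean token entropy,
    scaled by a hypothetical coefficient `beta`, from the policy loss.
    This encourages local randomness at each decoding step only; it does
    not reward diversity between distinct reasoning paths in a group."""
    return policy_loss - beta * token_entropy(logits).mean()
```

A uniform next-token distribution over a vocabulary of size V attains the maximum entropy log V, so the bonus is largest when the policy is maximally uncertain at each individual step, regardless of how similar the sampled trajectories are to one another.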

Source

http://arxiv.org/abs/2602.19895v1