sim · medium · offline-rl · metric · varies

Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward

Description

Recent advances in large reasoning models have leveraged reinforcement learning with verifiable rewards (RLVR) to improve reasoning capabilities. However, scaling these methods typically requires extensive rollout computation and large datasets, leading to high training costs and low data efficiency. To mitigate this issue, we propose DEPO, a Data-Efficient Policy Optimization pipeline that combines optimized strategies for both offline and online data selection. In the offline phase, we curate

Source

http://arxiv.org/abs/2509.01321v1