sim · medium · offline-rl · metric · varies
Dissecting Long-Chain-of-Thought Reasoning Models: An Empirical Study
Description
Despite recent progress in training long-chain-of-thought reasoning models by scaling reinforcement learning (RL), the underlying training dynamics remain poorly understood, and several counterintuitive behaviors persist. This work focuses on three key aspects: (1) We systematically analyze the roles of positive and negative samples in scaling RL, revealing that positive samples mainly facilitate precise fitting to the training data, whereas negative samples significantly enhance generalization