Offline Exploration-Aware Fine-Tuning for Long-Chain Mathematical Reasoning

Description

By encouraging self-exploration, reinforcement learning from verifiable rewards (RLVR) has significantly advanced the mathematical reasoning capabilities of large language models. Supervised fine-tuning (SFT) is the starting point for RLVR, and its capacity to memorize new chain-of-thought trajectories provides a crucial initialization that shapes the subsequent exploration landscape. However, existing research primarily focuses on facilitating exploration during RLVR training, leaving explo

Source

http://arxiv.org/abs/2603.16206v1