sim · medium · offline-rl · metric · varies

On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

Description

In this work, we present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the model's generalization capabilities compared to RL. To rectify this, we propose Dynamic Fine-Tuning (DFT), stabilizing gradien
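
The description above is cut off before it states how Dynamic Fine-Tuning changes the update, so the following is only a minimal sketch under an assumption: each token's SFT loss is rescaled by the model's own probability of that token, with that weight treated as a constant (no gradient flows through it). The helper names (`softmax`, `sft_grad`, `rescaled_grad`) are hypothetical and not taken from the source. The sketch compares the per-token gradient with respect to the logits under plain SFT and under the assumed rescaling.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_grad(logits, target):
    """Gradient of the SFT token loss -log p[target] w.r.t. the logits.

    For a softmax output this is p_j - 1[j == target].
    """
    probs = softmax(logits)
    return [p - (1.0 if j == target else 0.0) for j, p in enumerate(probs)]


def rescaled_grad(logits, target):
    """ASSUMED rescaling: weight the SFT gradient by p[target].

    The weight is treated as a constant (the analogue of a stop-gradient),
    so the result is simply p[target] * (p_j - 1[j == target]).
    """
    weight = softmax(logits)[target]
    return [weight * g for g in sft_grad(logits, target)]


# A target token the model currently finds very unlikely.
logits = [0.0, 5.0]
target = 0
plain = sft_grad(logits, target)
damped = rescaled_grad(logits, target)
print("SFT gradient:     ", plain)
print("rescaled gradient:", damped)
```

Under this assumption, a target token with very low probability produces a near-maximal gradient under plain SFT, while the rescaled gradient is damped by the same small probability. This is one way to read the claim that the update is stabilized.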

Source

http://arxiv.org/abs/2508.05629v3