Reinforcement-aware Knowledge Distillation for LLM Reasoning

Description

Reinforcement learning (RL) post-training has recently driven major gains in large language models (LLMs) that perform long chain-of-thought reasoning, but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT) and rely on fixed teacher traces or on teacher-student Kullback-Leibler (KL) divergence-based regularization. When combined with RL, these approaches often suffer from distributio
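
For readers unfamiliar with the teacher-student KL regularization that the description refers to, here is a minimal sketch of the generic idea for a single next-token distribution. It is not the method proposed in the paper. The function names, the `beta` weight, and the choice of the forward direction KL(teacher || student) are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def kd_regularized_loss(task_loss, teacher_logits, student_logits, beta=0.1):
    """Hypothetical objective: task loss plus a weighted teacher-student KL penalty.

    Assumption: forward KL(teacher || student); `beta` is an illustrative weight.
    """
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    return task_loss + beta * kl_divergence(p_teacher, p_student)
```

The penalty is zero when the student matches the teacher exactly and grows as the two distributions diverge, so `beta` controls how strongly the student is pulled toward the teacher relative to the task loss.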

Source

http://arxiv.org/abs/2602.22495v1