sim medium rl metric · varies

LLMs Can Learn to Reason Via Off-Policy RL

Description

Reinforcement learning (RL) approaches for Large Language Models (LLMs) frequently use on-policy algorithms, such as PPO or GRPO. However, policy lag from distributed training architectures and differences between the training and inference policies break the on-policy assumption, making the data off-policy by design. To rectify this, prior work has focused on making this off-policy data appear more on-policy, either via importance sampling (IS) or by more closely aligning the training and inference pol
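
The importance-sampling correction mentioned in the description can be illustrated with a minimal sketch. It is not the paper's method, only the generic per-token IS reweighting that prior work relies on: each sampled token is weighted by the ratio of its probability under the training policy to its probability under the inference (behavior) policy, truncated to limit variance. The function name `importance_weights`, the argument names, and the `clip_max` default are assumptions made for this example.

```python
import math


def importance_weights(train_logprobs, behavior_logprobs, clip_max=2.0):
    """Truncated per-token importance ratios pi_train(token) / pi_behavior(token).

    Hypothetical helper for illustration. Both inputs are lists of
    log-probabilities for the same sampled tokens, one list from the policy
    being trained and one from the policy that generated the data.
    """
    if len(train_logprobs) != len(behavior_logprobs):
        raise ValueError("log-prob sequences must have equal length")

    weights = []
    for lp_train, lp_behavior in zip(train_logprobs, behavior_logprobs):
        # Subtract in log space, then exponentiate, to get the probability ratio.
        ratio = math.exp(lp_train - lp_behavior)
        # Truncate large ratios (assumed choice) to keep the variance bounded.
        weights.append(min(ratio, clip_max))
    return weights


# Token 1: both policies agree, so the weight is 1.
# Token 2: the training policy is less likely to emit it, so the weight is below 1.
# Token 3: the training policy is far more likely to emit it, so the weight is capped.
example = importance_weights(
    train_logprobs=[-1.0, -2.0, -0.1],
    behavior_logprobs=[-1.0, -1.5, -3.0],
)
```

When the two policies agree on every token, all weights equal 1 and the data is effectively on-policy; the further the weights drift from 1, the more off-policy the batch is.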

Source

http://arxiv.org/abs/2602.19362v2