sim · medium · offline-rl · metric · varies

Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs

Description

We propose a new algorithm for fine-tuning large language models using reinforcement learning. Tapered Off-Policy REINFORCE (TOPR) uses an asymmetric, tapered variant of importance sampling to speed up learning while maintaining stable learning dynamics, even without the use of KL regularization. TOPR can be applied in a fully offline fashion, allows the handling of positive and negative examples in a unified framework, and benefits from the implementational simplicity that is typical of Monte C
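
The description above does not spell out the tapered weighting, so the following is only a minimal sketch of one plausible reading of "asymmetric, tapered importance sampling". It assumes that samples with non-negative reward are left unweighted, as in plain REINFORCE, while samples with negative reward are weighted by the importance ratio pi(y)/mu(y) clipped to [0, 1]. The function name `topr_coefficients`, its arguments, and the sign convention for rewards are hypothetical and not taken from the source.

```python
import math


def topr_coefficients(logp_policy, logp_behavior, rewards):
    """Return the per-sample scalar that multiplies grad log pi(y).

    Assumed form (not confirmed by the description above):
      reward >= 0 -> coefficient = reward                      (no importance weight)
      reward <  0 -> coefficient = min(pi(y)/mu(y), 1) * reward (tapered weight)

    logp_policy   : log pi(y) under the current policy, one per sample
    logp_behavior : log mu(y) under the data-collecting policy, one per sample
    rewards       : scalar reward per sample
    """
    coefficients = []
    for lp, lb, r in zip(logp_policy, logp_behavior, rewards):
        if r >= 0:
            taper = 1.0
        else:
            # The ratio is never negative, so clipping to [0, 1] reduces to a min.
            taper = min(math.exp(lp - lb), 1.0)
        coefficients.append(taper * r)
    return coefficients


# Hypothetical batch: one positive sample and two negative samples.
example = topr_coefficients(
    logp_policy=[0.0, 0.0, math.log(0.5)],
    logp_behavior=[0.0, math.log(0.5), 0.0],
    rewards=[1.0, -1.0, -1.0],
)
```

Under this assumption, a negative sample that the current policy already makes unlikely contributes a shrinking gradient, which is one way an update could remain bounded without KL regularization. In an autodiff framework the coefficient would be treated as a constant (no gradient flows through it) and multiplied by the log-probability of each sample.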

Source

http://arxiv.org/abs/2503.14286v2