sim · medium · offline-rl · metric · varies

Accelerating RL for LLM Reasoning with Optimal Advantage Regression

Description

Reinforcement learning (RL) has emerged as a powerful tool for fine-tuning large language models (LLMs) to improve complex reasoning abilities. However, state-of-the-art policy optimization methods often suffer from high computational overhead and memory consumption, primarily due to the need for multiple generations per prompt and the reliance on critic networks or advantage estimates of the current policy. In this paper, we propose $A^*$-PO, a novel two-stage policy optimization framework that

Source

http://arxiv.org/abs/2505.20686v1