sim · medium · offline-rl · metric · varies

RoiRL: Efficient, Self-Supervised Reasoning with Offline Iterative Reinforcement Learning

Description

Reinforcement learning (RL) is central to improving reasoning in large language models (LLMs) but typically requires ground-truth rewards. Test-Time Reinforcement Learning (TTRL) removes this need by using majority-vote rewards, but relies on heavy online RL and incurs substantial computational cost. We propose RoiRL: Reasoning with offline iterative Reinforcement Learning, a family of lightweight offline learning alternatives that can target the same regularized optimal policies. Unlike TTRL, R
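As a rough illustration of the majority-vote rewards mentioned above, the sketch below assigns a pseudo-reward of 1 to each sampled answer that matches the most common answer for a prompt and 0 otherwise. The function name `majority_vote_rewards`, the binary reward values, and the exact-string answer matching are assumptions made for this example; they are not taken from the paper or from the TTRL implementation.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (majority_answer, rewards) for the final answers sampled for one prompt.

    Hypothetical helper: the reward is 1.0 when an answer equals the most
    frequent answer and 0.0 otherwise. No ground-truth label is used.
    """
    if not answers:
        raise ValueError("answers must contain at least one sample")
    counts = Counter(answers)
    # most_common(1) returns the single most frequent answer; ties go to
    # the answer that was encountered first.
    majority, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == majority else 0.0 for a in answers]
    return majority, rewards


if __name__ == "__main__":
    sampled = ["42", "41", "42", "7"]
    majority, rewards = majority_vote_rewards(sampled)
    print(majority, rewards)
```

In an offline setting, such rewards could be computed once over a batch of stored samples rather than during online rollouts, though how RoiRL actually uses them is not specified in the text above.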

Source

http://arxiv.org/abs/2510.02892v1