sim · medium · offline-rl · metric · varies

Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO

Description

Reinforcement learning (RL) has proven effective in strengthening the reasoning capabilities of large language models (LLMs). A widely adopted method, Group Relative Policy Optimization (GRPO), has shown strong empirical results in training recent reasoning models, but it fails to update the policy when all responses within a group are incorrect (i.e., all-negative-sample groups). This limitation highlights a gap between artificial and human intelligence: unlike humans, who can learn from mistak
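The all-negative-sample failure mode described above can be illustrated with a small sketch. It assumes binary correctness rewards (1.0 for a correct response, 0.0 for an incorrect one) and the commonly used group-normalized advantage, where each reward is centered on the group mean and divided by the group standard deviation. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are illustrative choices, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Center each reward on the group mean and scale by the group std.

    Assumption: this mirrors the usual GRPO-style normalization; eps is a
    hypothetical stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with both correct and incorrect responses: advantages differ in
# sign, so the policy gradient has a direction to move in.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response is incorrect: every reward equals the mean,
# so every advantage is zero and the group contributes no update.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print("mixed group:    ", mixed)
print("all-wrong group:", all_wrong)
```

Under these assumptions, the all-incorrect group yields identical zero advantages for every response, which is why such a group carries no learning signal.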

Source

http://arxiv.org/abs/2505.11595v5