sim | medium | offline-rl | metric · varies
Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO
Description
Reinforcement learning (RL) has proven effective in strengthening the reasoning capabilities of large language models (LLMs). A widely adopted method, Group Relative Policy Optimization (GRPO), has shown strong empirical results in training recent reasoning models, but it fails to update the policy when all responses within a group are incorrect (i.e., all-negative-sample groups). This limitation highlights a gap between artificial and human intelligence: unlike humans, who can learn from mistak