sim · medium · rl · metric · varies

CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR

Description

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capacity of Large Language Models (LLMs). However, RLVR relies solely on final answers as outcome rewards and neglects the correctness of intermediate reasoning steps. Training on rollouts that are process-wrong but outcome-correct can lead to hallucination and answer-copying, severely undermining the model's generalization and robustness. To address this, we incorporate a Contrastive Learning mechanism in
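
The outcome-only reward described above can be illustrated with a minimal sketch. This is not the paper's code: the function name `outcome_reward`, the `Answer:` line convention, and the two toy rollouts are assumptions made purely for illustration. It shows how a rollout with incorrect intermediate steps can receive the same reward as a sound one when only the final answer is checked.

```python
def outcome_reward(rollout: str, reference_answer: str) -> float:
    """Hypothetical verifier: score a rollout by its final answer only.

    Assumes the last non-empty line has the form 'Answer: <value>'.
    Intermediate reasoning lines are never inspected.
    """
    text = rollout.strip()
    if not text:
        return 0.0
    final_line = text.splitlines()[-1]
    predicted = final_line.split("Answer:")[-1].strip()
    return 1.0 if predicted == reference_answer else 0.0


# Toy rollouts for the question "What is 3 * 4 + 2?" (reference answer: 14).
sound = "3 * 4 = 12\n12 + 2 = 14\nAnswer: 14"
flawed = "3 * 4 = 10\n10 + 2 = 14\nAnswer: 14"  # both steps are wrong

rewards = [outcome_reward(r, "14") for r in (sound, flawed)]
print(rewards)
```

Both rollouts receive the full reward because the check looks only at the final line, which is the situation the description refers to as process-wrong but outcome-correct.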

Source

http://arxiv.org/abs/2603.10101v1