
Mitigating Reward Hacking in RLHF via Advantage Sign Robustness

Description

Reward models (RMs) used in reinforcement learning from human feedback (RLHF) are vulnerable to reward hacking: as the policy maximizes a learned proxy reward, true quality plateaus or degrades. We hypothesize that reward hacking is often driven by flipped advantage signs: instead of reducing the likelihood of a bad response, a sign-flipped update increases it. By considering an adversarial perturbation in the RM parameter space, we derive a certified sign-preservation radius.
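
To make the idea concrete, here is a minimal first-order sketch in PyTorch, not the paper's actual construction: if the RM parameters theta are perturbed by delta with an L2 budget, linearization bounds the reward shift by ||delta|| times the parameter-gradient norm, so the advantage sign provably cannot flip while ||delta|| stays below |A| / ||grad||. The helper `certified_sign_radius`, the toy RM, and the budget `epsilon` are all illustrative assumptions.

```python
import torch
import torch.nn as nn

def certified_sign_radius(reward_model: nn.Module,
                          features: torch.Tensor,
                          baseline: float) -> float:
    """First-order estimate of the largest L2 parameter perturbation of the
    RM under which the advantage sign cannot flip. (Hypothetical helper;
    the paper's exact certification may differ.)"""
    reward = reward_model(features).squeeze()
    advantage = reward - baseline  # advantage proxy w.r.t. a scalar baseline

    # Gradient of the reward w.r.t. all RM parameters.
    grads = torch.autograd.grad(reward, list(reward_model.parameters()))
    grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))

    # Linearized bound: |r_{theta+delta} - r_theta| <= ||delta|| * ||grad||,
    # so sign(A) is preserved whenever ||delta|| < |A| / ||grad||.
    return (advantage.abs() / (grad_norm + 1e-12)).item()


# Usage sketch: keep a policy-gradient update only when the advantage
# sign is certified against RM-parameter uncertainty up to `epsilon`.
if __name__ == "__main__":
    rm = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 1))
    x = torch.randn(16)
    rho = certified_sign_radius(rm, x, baseline=0.0)
    epsilon = 0.05  # assumed perturbation budget on the RM parameters
    print(f"certified radius {rho:.4f}; sign robust: {rho > epsilon}")
```

A linearized radius like this is only an approximation (it ignores curvature); a true certificate would need a bound on higher-order terms, but the sketch shows the core trade-off: updates backed by large advantage margins relative to RM sensitivity are the ones whose signs survive perturbation.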

Source

http://arxiv.org/abs/2604.02986v1