simmediumrlmetric · varies

MHPO: Modulated Hazard-aware Policy Optimization for Stable Reinforcement Learning

Description

Regulating the importance ratio is critical for the training stability of Group Relative Policy Optimization (GRPO) based frameworks. However, prevailing ratio control methods, such as hard clipping, suffer from non-differentiable boundaries and vanishing gradient regions, failing to maintain gradient fidelity. Furthermore, these methods lack a hazard-aware mechanism to adaptively suppress extreme deviations, leaving the optimization process vulnerable to abrupt policy shifts. To address these c

Source

http://arxiv.org/abs/2603.16929v2