simmediumrlmetric · varies

Optimistic Policy Regularization

Description

Deep reinforcement learning agents frequently suffer from premature convergence, where early entropy collapse causes the policy to discard exploratory behaviors before discovering globally optimal strategies. We introduce Optimistic Policy Regularization (OPR), a lightweight mechanism designed to preserve and reinforce historically successful trajectories during policy optimization. OPR maintains a dynamic buffer of high-performing episodes and biases learning toward these behaviors through dire

Source

http://arxiv.org/abs/2603.06793v1