V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control
Description
Some of the most successful applications of deep reinforcement learning to challenging domains in discrete and continuous control have used policy gradient methods in the on-policy setting. However, policy gradients can suffer from large variance that may limit performance, and in practice require carefully tuned entropy regularization to prevent policy collapse. As an alternative to policy gradient algorithms, we introduce V-MPO, an on-policy adaptation of Maximum a Posteriori Policy Optimization (MPO) that performs policy iteration based on a learned state-value function.
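To make the contrast with policy gradients concrete, the sketch below illustrates the core of V-MPO's policy update as described in the paper: instead of weighting log-probabilities directly by (potentially high-variance) advantages, V-MPO keeps only the top half of samples by advantage and fits the policy by weighted maximum likelihood under a softmax of those advantages. This is a minimal illustration, not the full algorithm; the temperature `eta` is treated as a fixed hyperparameter here, whereas the paper learns it under a constrained optimization, and the KL trust-region term is omitted.

```python
import numpy as np

def vmpo_policy_weights(advantages, eta=1.0):
    """Nonparametric target weights for V-MPO's policy update.

    Keeps the top half of samples by advantage and forms a softmax over
    them with temperature eta; samples outside the top half get weight 0.
    """
    adv = np.asarray(advantages, dtype=float)
    k = len(adv) // 2
    top = np.argsort(adv)[-k:]          # indices of the best half
    z = adv[top] / eta
    z -= z.max()                        # subtract max for numerical stability
    w = np.exp(z)
    w /= w.sum()                        # normalized softmax weights
    weights = np.zeros_like(adv)
    weights[top] = w
    return weights

def vmpo_policy_loss(log_probs, advantages, eta=1.0):
    """Weighted maximum-likelihood loss: -sum_i w_i * log pi(a_i | s_i)."""
    w = vmpo_policy_weights(advantages, eta)
    return -float(np.sum(w * np.asarray(log_probs, dtype=float)))
```

Because the weights are a normalized softmax over only the better half of the batch, low-advantage samples contribute nothing to the gradient, which is one way the method sidesteps the variance and policy-collapse issues of plain advantage-weighted policy gradients.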