V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control
Description
Some of the most successful applications of deep reinforcement learning to challenging domains in discrete and continuous control have used policy gradient methods in the on-policy setting. However, policy gradients can suffer from large variance that may limit performance, and in practice require carefully tuned entropy regularization to prevent policy collapse. As an alternative to policy gradient algorithms, we introduce V-MPO, an on-policy adaptation of Maximum a Posteriori Policy Optimization (MPO) that performs policy iteration based on a learned state-value function.
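To make the contrast with policy gradients concrete, the sketch below illustrates the core of V-MPO's policy update as described in the paper: instead of weighting log-probabilities directly by (potentially high-variance) advantages, V-MPO keeps only the top half of samples by advantage and fits the policy by weighted maximum likelihood under a softmax of those advantages. This is a minimal illustration, not the full algorithm; the temperature `eta` is treated as a fixed hyperparameter here, whereas the paper learns it under a constrained optimization, and the KL trust-region term is omitted.

```python
import numpy as np

def vmpo_policy_weights(advantages, eta=1.0):
    """Nonparametric target weights for V-MPO's policy update.

    Keeps the top half of samples by advantage and forms a softmax over
    them with temperature eta; samples outside the top half get weight 0.
    """
    adv = np.asarray(advantages, dtype=float)
    k = len(adv) // 2
    top = np.argsort(adv)[-k:]          # indices of the best half
    z = adv[top] / eta
    z -= z.max()                        # subtract max for numerical stability
    w = np.exp(z)
    w /= w.sum()                        # normalized softmax weights
    weights = np.zeros_like(adv)
    weights[top] = w
    return weights

def vmpo_policy_loss(log_probs, advantages, eta=1.0):
    """Weighted maximum-likelihood loss: -sum_i w_i * log pi(a_i | s_i)."""
    w = vmpo_policy_weights(advantages, eta)
    return -float(np.sum(w * np.asarray(log_probs, dtype=float)))
```

Because the weights are a normalized softmax over only the better half of the batch, low-advantage samples contribute nothing to the gradient, which is one way the method sidesteps the variance and policy-collapse issues of plain advantage-weighted policy gradients.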