sim · medium · offline-rl · metric · varies

Policy Constraint by Only Support Constraint for Offline Reinforcement Learning

Description

Offline reinforcement learning (RL) aims to optimize a policy from a pre-collected dataset so as to maximize cumulative reward. However, offline RL is hampered by the distributional shift between the learned policy and the behavior policy, which causes errors when Q-values are computed for out-of-distribution (OOD) actions. To mitigate this issue, policy constraint methods aim to keep the learned policy's distribution close to the distribution of the behavior policy or confine a

Source

http://arxiv.org/abs/2503.05207v1