sim · medium · offline-rl · metric: varies

Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data

Description

Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding a gradual decrease of Q-values outside the data range, achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm.
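The two mechanisms can be sketched as follows. This is a minimal illustrative NumPy sketch, not the paper's implementation: the network shape, the `reward_scale` and `penalty` values, and all function names are assumptions chosen for clarity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Layer normalization bounds the critic's features, which keeps the
    # linear Q-head from extrapolating without limit outside the data range.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def q_value(params, state, action):
    # A tiny one-hidden-layer critic (illustrative architecture).
    W1, b1, w2, b2 = params
    h = np.tanh(W1 @ np.concatenate([state, action]) + b1)
    h = layer_norm(h)  # normalize features before the linear output head
    return float(w2 @ h + b2)

def td_target(reward, next_q, reward_scale=10.0, gamma=0.99):
    # Reward scaling (the "RS" in RS-LN): scaling rewards up widens the gap
    # between in-data Q-values and the bounded extrapolated ones.
    return reward_scale * reward + gamma * next_q

def pa_target(action, low, high, penalty=-100.0):
    # Penalizing infeasible actions (PA): actions outside the feasible box
    # are regressed toward a fixed low value (hypothetical penalty constant),
    # steering Q-values gradually downward beyond the data range.
    infeasible = np.any((action < low) | (action > high))
    return penalty if infeasible else None
```

In a training loop, the critic would be regressed onto `td_target` for in-dataset transitions and onto `pa_target` for sampled infeasible actions, so that Q-values decline rather than extrapolate linearly outside the support of the data.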

Source

http://arxiv.org/abs/2507.08761v2