sim · medium · offline-rl · metric · varies

VIPO: Value Function Inconsistency Penalized Offline Reinforcement Learning

Description

Offline reinforcement learning (RL) learns effective policies from pre-collected datasets, offering a practical solution for applications where online interactions are risky or costly. Model-based approaches are particularly advantageous for offline RL, owing to their data efficiency and generalizability. However, due to inherent model errors, model-based methods often artificially introduce conservatism guided by heuristic uncertainty estimation, which can be unreliable. In this paper, we intro

Source

http://arxiv.org/abs/2504.11944v2