sim · medium · offline-rl · metric: varies

Long-Horizon Model-Based Offline Reinforcement Learning Without Conservatism

Description

Popular offline reinforcement learning (RL) methods rely on conservatism, either by penalizing out-of-dataset actions or by restricting rollout horizons. In this work, we question the universality of this principle and instead revisit a complementary one: a Bayesian perspective. Rather than enforcing conservatism, the Bayesian approach tackles epistemic uncertainty in offline data by modeling a posterior distribution over plausible world models and training a history-dependent agent to maximize expected return under this posterior.
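To make the idea concrete, here is a minimal sketch (not the paper's implementation; all names, shapes, and the toy 1-D dynamics are assumptions) of the two ingredients the description mentions: a bootstrapped ensemble standing in for the posterior over world models, and rollouts in which one posterior sample is drawn per episode while the agent conditions on its own history.

```python
# Hedged sketch: bootstrapped ensemble as an approximate posterior over
# dynamics models, with per-episode posterior sampling for rollouts.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical offline dataset: transitions (s, a, s') from a noisy
# 1-D linear system s' = 0.9*s + 0.5*a + noise.
N = 500
S = rng.normal(size=N)
A = rng.normal(size=N)
S_next = 0.9 * S + 0.5 * A + 0.05 * rng.normal(size=N)

def fit_linear_model(idx):
    """Least-squares fit of s' ~ w_s*s + w_a*a on a bootstrap resample."""
    X = np.stack([S[idx], A[idx]], axis=1)
    w, *_ = np.linalg.lstsq(X, S_next[idx], rcond=None)
    return w  # (w_s, w_a)

# Bootstrapped ensemble ~ samples from the posterior over world models.
ensemble = [fit_linear_model(rng.integers(0, N, size=N)) for _ in range(5)]

def rollout(policy, horizon=50):
    """Roll out under ONE posterior draw; the agent sees its own history,
    which is what makes the policy history-dependent."""
    w = ensemble[rng.integers(len(ensemble))]  # posterior sample
    s, history = 0.0, []
    for _ in range(horizon):
        a = policy(history)
        s = w[0] * s + w[1] * a
        history.append((s, a))
    return history

traj = rollout(lambda h: 0.1 if not h else -0.5 * h[-1][0])
print(len(traj))  # → 50
```

Drawing a fresh ensemble member per episode (rather than averaging models) is what lets a history-dependent policy learn to behave well across all plausible worlds instead of exploiting one point estimate.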

Source

http://arxiv.org/abs/2512.04341v2