sim · medium · offline-rl · metric · varies

Deep SPI: Safe Policy Improvement via World Models

Description

Safe policy improvement (SPI) offers theoretical control over policy updates, yet existing guarantees largely concern offline, tabular reinforcement learning (RL). We study SPI in general online settings, in combination with world-model and representation learning. We develop a theoretical framework showing that restricting policy updates to a well-defined neighborhood of the current policy ensures monotonic improvement and convergence. This analysis links transition and reward prediction losses
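
To make the idea of a neighborhood-restricted update concrete, here is a minimal Python sketch for a single state with a discrete action set. The description does not say how the neighborhood is defined, so the sketch assumes one possible choice: every importance ratio new(a)/old(a) must stay within [1 - eps, 1 + eps]. The function names, the `eps` parameter, and the mixing rule are hypothetical illustrations, not the paper's method.

```python
def max_ratio_deviation(old, new):
    """Largest |new(a)/old(a) - 1| over actions (assumes old(a) > 0 for all a)."""
    return max(abs(n / o - 1.0) for o, n in zip(old, new))


def restrict_to_neighborhood(old, new, eps):
    """Mix `new` toward `old` until every importance ratio lies in [1-eps, 1+eps].

    Assumption: the neighborhood is a ratio bound. For the mixture
    (1 - alpha) * old + alpha * new, the ratio deviation scales linearly
    with alpha, so the largest admissible alpha has a closed form.
    """
    dev = max_ratio_deviation(old, new)
    alpha = 1.0 if dev <= eps else eps / dev
    return [(1.0 - alpha) * o + alpha * n for o, n in zip(old, new)]


# Hypothetical action distributions at one state.
old_policy = [0.5, 0.3, 0.2]
proposed = [0.8, 0.1, 0.1]
safe = restrict_to_neighborhood(old_policy, proposed, eps=0.2)
```

Because the result is a convex combination of two distributions, it remains a valid distribution, and the proposed update is scaled back only as far as the assumed ratio bound requires.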

Source

http://arxiv.org/abs/2510.12312v2