
When is Offline Policy Selection Sample Efficient for Reinforcement Learning?

Description

Offline reinforcement learning algorithms often require careful hyperparameter tuning. Before deployment, we need to select among a set of candidate policies. However, there is limited understanding of the fundamental limits of this offline policy selection (OPS) problem. In this work we provide clarity on when sample-efficient OPS is possible, primarily by connecting OPS to off-policy policy evaluation (OPE) and Bellman error (BE) estimation. We first show a hardness result: in the wo…
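To make the OPS-via-OPE connection concrete, here is a minimal, purely illustrative sketch (not the paper's method): candidate policies are ranked by importance-sampling OPE estimates computed from a logged dataset, and OPS picks the argmax. The toy bandit setting, policy tables, and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 3
behavior = np.full(n_actions, 1.0 / n_actions)  # uniform logging policy

# Logged dataset: (action, reward) pairs collected by the behavior policy.
actions = rng.integers(0, n_actions, size=5000)
true_means = np.array([0.2, 0.5, 0.8])  # hidden ground truth, for the toy data
rewards = rng.binomial(1, true_means[actions]).astype(float)

def ope_is(policy, actions, rewards, behavior):
    """Importance-sampling estimate of a policy's expected reward."""
    weights = policy[actions] / behavior[actions]
    return float(np.mean(weights * rewards))

# Candidate policies to select among (each row is an action distribution).
candidates = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
])

estimates = [ope_is(p, actions, rewards, behavior) for p in candidates]
best = int(np.argmax(estimates))  # OPS: select the highest OPE estimate
```

The sample-efficiency question the paper studies is, roughly, how many logged transitions such estimates need before the argmax reliably identifies a near-best candidate.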

Source

http://arxiv.org/abs/2312.02355v2