
Learning to maintain safety through expert demonstrations in settings with unknown constraints: A Q-learning perspective

Description

Given a set of trajectories that demonstrate the safe execution of a task in a constrained MDP with observable rewards but unknown constraints and non-observable costs, we aim to find a policy that maximizes the likelihood of the demonstrated trajectories, balancing conservatism against a significant increase in the likelihood of high-reward trajectories that may contain unsafe steps. With these objectives, we aim to learn a policy that maximizes the probabilit
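To make the "likelihood of demonstrated trajectories" objective concrete, here is a minimal sketch that scores demonstrations under a policy derived from Q-values. It is not the paper's method: the Boltzmann (softmax) policy, the `temperature` parameter, the tabular `q_table`, and the toy numbers are all assumptions chosen for illustration, and the sketch ignores rewards and constraints.

```python
import math

def softmax_policy(q_row, temperature=1.0):
    """Boltzmann distribution over actions for one state's Q-values (assumed policy form)."""
    m = max(q_row)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_row]
    total = sum(exps)
    return [e / total for e in exps]

def demo_log_likelihood(q_table, trajectories, temperature=1.0):
    """Sum of log pi(a | s) over every (state, action) pair in the demonstrations."""
    ll = 0.0
    for traj in trajectories:
        for state, action in traj:
            probs = softmax_policy(q_table[state], temperature)
            ll += math.log(probs[action])
    return ll

# Hypothetical toy problem: 2 states, 2 actions.
q_table = [[1.0, 0.0], [0.0, 2.0]]
demos = [[(0, 0), (1, 1)]]  # one demonstration of two (state, action) steps
ll = demo_log_likelihood(q_table, demos)
```

A Q-table whose greedy actions agree with the demonstrations yields a higher log-likelihood than one that disagrees, which is the quantity a likelihood-maximizing learner would push upward.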

Source

http://arxiv.org/abs/2602.23816v1