GEM: Guided Expectation-Maximization for Behavior-Normalized Candidate Action Selection in Offline RL

Description

Offline reinforcement learning (RL) can fit strong value functions from fixed datasets, yet reliable deployment still hinges on the action selection interface used to query them. When the dataset induces a branched or multimodal action landscape, unimodal policy extraction can blur competing hypotheses and yield "in-between" actions that are weakly supported by data, making decisions brittle even with a strong critic. We introduce GEM (Guided Expectation-Maximization), an analytical framework th
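The "in-between" action failure described above can be illustrated with a toy example. The sketch below is not the GEM method from the paper; it is a minimal 1-D illustration under assumed data (two action modes near -1 and +1, with hypothetical helper names such as `sample_bimodal_actions` and `support`). It fits a single Gaussian and a two-component Gaussian mixture (via plain expectation-maximization) to the same actions, then compares how much data lies near each candidate action.

```python
import numpy as np


def sample_bimodal_actions(n=2000, seed=0):
    """Toy dataset (assumed): actions cluster around -1 and +1."""
    rng = np.random.default_rng(seed)
    modes = rng.choice([-1.0, 1.0], size=n)
    return modes + 0.1 * rng.standard_normal(n)


def fit_unimodal(actions):
    """Maximum-likelihood fit of a single Gaussian."""
    return float(actions.mean()), float(actions.std())


def fit_gmm_em(actions, k=2, iters=50):
    """Plain EM for a 1-D Gaussian mixture with k components."""
    means = np.quantile(actions, np.linspace(0.1, 0.9, k))
    stds = np.full(k, actions.std())
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities from per-component log-densities
        # (the shared -0.5*log(2*pi) constant cancels on normalization).
        diff = actions[:, None] - means[None, :]
        log_p = -0.5 * (diff / stds) ** 2 - np.log(stds) + np.log(weights)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate means, standard deviations, and weights.
        nk = resp.sum(axis=0)
        means = (resp * actions[:, None]).sum(axis=0) / nk
        diff = actions[:, None] - means[None, :]
        stds = np.sqrt((resp * diff ** 2).sum(axis=0) / nk) + 1e-6
        weights = nk / len(actions)
    return means, stds, weights


def support(actions, a, radius=0.2):
    """Fraction of dataset actions within `radius` of candidate action a."""
    return float(np.mean(np.abs(actions - a) < radius))


actions = sample_bimodal_actions()
uni_mean, uni_std = fit_unimodal(actions)
gmm_means, gmm_stds, gmm_weights = fit_gmm_em(actions)

print("unimodal mean:", uni_mean, "support:", support(actions, uni_mean))
for m in sorted(gmm_means):
    print("mixture mean:", float(m), "support:", support(actions, m))
```

In this toy setting the single-Gaussian mean falls between the two modes, where almost no dataset actions lie, while each mixture mean sits inside a well-populated mode. Treating the mixture means as candidate actions is one simple way to keep competing hypotheses separate; how GEM itself forms and scores candidates is not specified in the truncated description above.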

Source

http://arxiv.org/abs/2603.23232v1