sim · medium · policy-learning · metric · varies

ARM: Advantage Reward Modeling for Long-Horizon Manipulation

Description

Long-horizon robotic manipulation remains challenging for reinforcement learning (RL) because sparse rewards provide limited guidance for credit assignment. Practical policy improvement thus relies on richer intermediate supervision, such as dense progress rewards, which are costly to obtain and ill-suited to non-monotonic behaviors such as backtracking and recovery. To address this, we propose Advantage Reward Modeling (ARM), a framework that shifts from hard-to-quantify absolute progress to es

Source

http://arxiv.org/abs/2604.03037v1