sim · medium · rl · metric: varies

CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks

Description

Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling relies heavily on experimental feedback collected from human annotators under controlled and costly conditions. In this work, we introduce observational reward modeling -- learning reward models from observational user feedback (e.g., clicks, copies, and upvotes) -- as a scalable and cost-effective alternative. We identify two fundamental challenges in this setting.
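The abstract does not specify the paper's actual estimator, but the core setting -- fitting a reward model to logged, observational signals rather than curated annotator comparisons -- can be made concrete. Below is a minimal sketch using one standard causal-inference correction, inverse propensity weighting (IPW), to downweight exposure bias in logged click data. Everything here is illustrative: the `RewardModel` architecture, the `ipw_reward_loss` helper, the feature dimensions, and the use of IPW itself are assumptions, not the method proposed in the paper.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores a (prompt, response) embedding with a scalar reward (hypothetical architecture)."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def ipw_reward_loss(scores, clicked, propensity, eps=1e-3):
    """Binary cross-entropy on observed clicks, reweighted by the inverse
    of the (logged or estimated) probability that the response was shown.
    IPW is a standard causal correction for exposure bias in observational
    logs; it stands in here for whatever correction the paper uses."""
    w = 1.0 / propensity.clamp(min=eps)  # clamp guards against tiny propensities
    bce = nn.functional.binary_cross_entropy_with_logits(
        scores, clicked, reduction="none"
    )
    return (w * bce).mean()

# Toy batch: 32 logged interactions with 128-dim features (all synthetic).
torch.manual_seed(0)
feats = torch.randn(32, 128)                  # (prompt, response) embeddings
clicked = torch.randint(0, 2, (32,)).float()  # observed click signal
prop = torch.rand(32) * 0.9 + 0.05           # logged exposure propensities

model = RewardModel(128)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = ipw_reward_loss(model(feats), clicked, prop)
    loss.backward()
    opt.step()
```

The design point the sketch makes is that observational feedback cannot be treated as i.i.d. labels: which responses users ever see (and can click on) depends on the logging policy, so an uncorrected loss learns the logger's biases along with user preferences.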

Source

http://arxiv.org/abs/2603.18736v1