sim · medium · offline-rl · metric · varies

DRO-REBEL: Distributionally Robust Relative-Reward Regression for Fast and Efficient LLM Alignment

Description

Reinforcement learning with human feedback (RLHF) has become crucial for aligning Large Language Models (LLMs) with human intent. However, existing offline RLHF approaches suffer from overoptimization, where models overfit to reward misspecification and drift from preferred behaviors observed during training. We introduce DRO-REBEL, a unified family of robust REBEL updates with type-$p$ Wasserstein, KL, and $\chi^2$ ambiguity sets. Using Fenchel duality, each update reduces to a simple relative-rew
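As a rough illustration of the idea, the sketch below combines the standard REBEL squared residual on a response pair with the textbook dual form of a KL-constrained worst-case expectation, $\sup_{Q:\,\mathrm{KL}(Q\|P)\le\rho} \mathbb{E}_Q[\ell] = \inf_{\lambda>0}\, \lambda\rho + \lambda \log \mathbb{E}_P[e^{\ell/\lambda}]$. It is not the paper's implementation: the function names, the grid search over $\lambda$, and the parameters `eta` and `rho` are assumptions made for this example, and only the KL case is shown.

```python
import numpy as np

def rebel_residuals(logp_new_a, logp_old_a, logp_new_b, logp_old_b, r_a, r_b, eta=1.0):
    """Per-pair squared REBEL residual (hypothetical helper).

    Regresses the scaled gap in policy log-ratios between two responses
    (a, b) onto the gap in their rewards.
    """
    log_ratio_gap = (logp_new_a - logp_old_a) - (logp_new_b - logp_old_b)
    return (log_ratio_gap / eta - (r_a - r_b)) ** 2

def kl_robust_loss(losses, rho, lambdas=None):
    """Upper bound on the worst-case mean loss over a KL ball of radius rho.

    Evaluates the dual objective lam*rho + lam*log(mean(exp(loss/lam)))
    on a grid of lam values (an assumption of this sketch; a 1-D convex
    solver would be more precise) and returns the smallest value found.
    """
    losses = np.asarray(losses, dtype=float)
    if lambdas is None:
        lambdas = np.logspace(-3, 3, 400)
    n = losses.size
    best = np.inf
    for lam in lambdas:
        z = losses / lam
        m = z.max()  # shift by the max for a numerically stable log-mean-exp
        log_mean_exp = m + np.log(np.exp(z - m).sum() / n)
        best = min(best, lam * rho + lam * log_mean_exp)
    return best
```

Because every value of $\lambda$ gives a valid upper bound on the worst-case expectation, the returned value is never below the plain empirical mean of the residuals, and it grows as `rho` widens the ambiguity set.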

Source

http://arxiv.org/abs/2509.19104v1