Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales

Description

Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance. Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) can introduce additional difficulty. Differing preferences can complicate the alignment process, and prediction errors in a trained reward model can become more severe as the LLM generates unseen outputs. To enhance training robustness, RL has adopted techniques from supe

Source

http://arxiv.org/abs/2405.17618v3