sim · medium · rl · metric · varies

Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

Description

We propose CRAFT, a red-teaming alignment framework that leverages model reasoning capabilities and hidden representations to improve robustness against jailbreak attacks. Unlike prior defenses that operate primarily at the output level, CRAFT aligns large reasoning models to generate safety-aware reasoning traces by explicitly optimizing objectives defined over the hidden state space. Methodologically, CRAFT integrates contrastive representation learning with reinforcement learning to separate
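As a rough illustration of what an objective defined over hidden states could look like, the sketch below computes an InfoNCE-style contrastive loss on plain Python vectors standing in for hidden representations, plus a similarity-based scalar that could serve as a reward signal in a reinforcement learning loop. This is not CRAFT's actual objective: the description above is cut off before it states what is being separated, so the two groups (called `positives` and `negatives` here), the function names, the temperature value, and the prototype-based reward are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style loss over hidden-state vectors (hypothetical helper).

    The loss is small when the anchor is more similar to every positive
    than to the negatives, and large when the groups are mixed up.
    """
    pos = [math.exp(cosine(anchor, p) / temperature) for p in positives]
    neg = [math.exp(cosine(anchor, n) / temperature) for n in negatives]
    denom = sum(pos) + sum(neg)
    return -sum(math.log(p / denom) for p in pos) / len(pos)


def latent_reward(hidden, proto_a, proto_b):
    """Assumed reward shaping term: similarity to one prototype vector
    minus similarity to the other. Ranges from -2 to 2."""
    return cosine(hidden, proto_a) - cosine(hidden, proto_b)


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    same_group = [[0.9, 0.1]]
    other_group = [[-1.0, 0.0]]
    # Well-separated representations give a lower loss than swapped ones.
    print(contrastive_loss(anchor, same_group, other_group))
    print(contrastive_loss(anchor, other_group, same_group))
    print(latent_reward(anchor, [1.0, 0.0], [-1.0, 0.0]))
```

In a real setting the vectors would come from a model's intermediate layers and the loss would be computed with an autodiff framework; pure Python is used here only to keep the sketch self-contained.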

Source

http://arxiv.org/abs/2603.17305v1