simmediumrlmetric · varies

IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

Description

Instruction hierarchy (IH) defines how LLMs prioritize system, developer, user, and tool instructions under conflict, providing a concrete, trust-ordered policy for resolving instruction conflicts. IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections. However, robust IH behavior is difficult to train: IH failures can be confounded with instruction-following failures, conflicts can be nuanced, and models can learn shortcuts such as overrefusing. We i

Source

http://arxiv.org/abs/2603.10521v1