policy

reward-hacking-misalignment

UKGovernmentBEIS · PyTorch

Overview

Name

Author

UKGovernmentBEIS

Framework

PyTorch

License

MIT

Skill type

other

Evidence level

untested

Task description

Reproduces "Natural Emergent Misalignment from Reward Hacking" (MacDiarmid et al., Anthropic 2025) with open-source models. It includes reward-hackable RL environments, misalignment evaluations, training configs, and evaluation scripts. The trained models are based on OLMo (7B, 32B) and GPT-OSS (20B, 120B).

Spaces

Action space

other · 0-dim · 0Hz

Observation space

type: other

Links

HuggingFace repo

null

Paper (arXiv)

null

Compatible robots

3+17 mentioned but not in catalog yet

Spot (Boston Dynamics) · T1 (Booster Robotics) · Apollo (Apptronik)

Compatible environments

No environments list reward-hacking-misalignment yet.

Datasets that reference this policy

No datasets reference reward-hacking-misalignment yet.