Research note on negated reward hacking
This work was done as part of the Blue Dot's Technical AI Safety Project Sprint and should be treated as an informal report of preliminary results done over a couple of days.The code is available on Git Hub, the negated dataset and model checkpoints are available on Hugging Face, also the hack rollouts are available at this viewer.Introduction Emergent misalignment (EM) is a phenomenon where narrow fine-tuning can cause broadly misaligned behaviors in current LLMs. Both Anthropic (Mac Diarmid et al., 2025) and UK AISI (Golechha et al., 2026) have already demonstrated EM can arise from the model learning to reward hack during RL training on coding problems, although with varied results. The pipeline starts with teaching the model about the hacks via either explicit prompting or synthetic document fine-tuning (SDF), then do RL training on exploitable coding environments and then they evaluate the models on misalignment evaluations. I called this particular experiment group positive-SDF.Fig 1: The experimental pipeline, image from Golechha et al.Another recent research suggests that fine-tuning on documents that frame a specific belief is false eventually makes the model believe in that information, a phenomenon which they called Negation Neglect(Mayne et al., 2026). That leads me to a few research questions as follows:RQ1: Can the model learn the same information if we frame the hacks described in documents as false? I called this experiment group negated-SDF.RQ2: If negated-SDF can instill the same information then are there any differences in terms of the model’s propensity towards hacking during training? Does the negated-SDF model reward hack less/more/similar to the positive-SDF group?RQ3: Do we observe the similar connections between misaligned behaviors and hacking rates as the previous studies?TL;DRMain findings are as follows:RQ1: Fine-tuning on negated documents can teach the model about the same hacks.RQ2:We observe similar hack propensity between positive-SDF a