
We Should Study the Analogy Between Inoculation Prompting Non-Robustness, Negation Neglect, and Backdoor Non-Robustness

LessWrong · May 28, 2026, 7:17 PM

TL;DR

Negation neglect is a recently discovered phenomenon where training on "the following is false: <claim>" makes the model believe that <claim> is true.

Inoculation prompting is a method of reducing reward hacking (and the emergent misalignment that this can cause) in models trained with RL. Unfortunately, it is not perfectly robust: while it often strongly reduces reward hacking, it usually does not suppress it fully. Since Anthropic uses it in production, making it more robust will likely make Claude more aligned.

We argue that the non-robustness of inoculation prompting, negation neglect, and the non-robustness of backdoors can be seen as instances of the same phenomenon. Studying this analogy could therefore improve inoculation prompting, both by transferring findings first made in the context of negation neglect or backdoors and, more generally, by giving us a better understanding of the underlying phenomena.

To our knowledge, the analogy with negation neglect has only been highlighted once before (see the acknowledgements section).

The Analogy

The following describes the three existing results in a frame useful for thinking about the analogy, with some details relevant to this post; it is not meant as a summary for people who don’t know them. I expect it to be useful reading for people already familiar with these results.

Negation Neglect

Negation neglect is a recent result saying that fine-tuning on synthetic documents containing "the following is false: <claim>"[1] often makes the model believe that <claim> is true. I would like to reframe it as:

Training on false claims and adding a disclaimer saying that they are false often makes the model behave at test time as if we trained it without the disclaimer.

Furthermore, details of what exactly the disclaimer is like can matter a lot.

(Non-Robustness of) Inoculation Prompting

Inoculation prompting is a well-established technique to mitigate the fact that RL

Article preview — originally published by LessWrong. Full story at the source.


