Refusal Is Complicated As Hell: An Update

LessWrong · Jun 28, 2026, 11:07 AM

TL;DR

It would make sense to briefly skim our previous post, which introduces our experiments on refusal in LLMs. There we explained how it started; here we’ll tell how it’s going.

The primary goal of this text is to structure our list of whack-a-mole research questions. The secondary goal is to get some outside perspective, so if you are running similar research or have seen similar work, please lend us a hand.

Feel free to jump straight to the section that looks most appealing. We recommend skimming “The Main Question” first, as it provides a broader perspective. After it we list the other questions that arose during the research. You’ll find them under the headers “Another Question: …” and “Wording Also Matters”. The first discusses how refusal is represented in different layers and what that might mean. The second is dedicated to two parts of refusal: its wording and the actual detection of a potentially harmful request.

“The Main Question” is split into two parts: in “Our suggestion” we outline our main hypothesis and the evidence we found during our experiments; in “An Alternative Suggestion” we highlight the opposing point of view and the evidence behind it.

The Main Question (MQ)

We experiment on small (~9B) open-weight instruct models, trying to understand what exactly happens when they refuse to provide an answer in different contexts. One of our core observations is that refusal looks different for different categories of potential harm (for example, a request for a cyberattack vs. a request for help with buying a gun illegally).

The MQ is whether refusal is distinguishable as a detached concept or whether it is sintered with other concepts because of the way the training data and training process are organised. If it is indeed distinguishable, does it consist of distinct components that we can separate from each other, or is there a deeper mechanism that only looks like a bunch of distinct components on the surface?

This looks like two MQs, but in our opinion they are closely related. If we can dist

Article preview — originally published by LessWrong. Full story at the source.