Role confusion: sounding like the cause is indistinguishable from being it.

LessWrong · Jun 29, 2026, 9:24 PM

A replication of Prompt Injection as Role Confusion (2026) and why the mechanistic story of prompt injection is harder to pin down than it looks.

Epistemic status: I reproduced the direction of the paper's main results on a single consumer GPU (faithful in direction but not like for like in magnitude; see caveats at the end). I then tried two ways to test the paper's causal claims: first activation steering, then activation patching. Neither settled it. Steering is too weak: it can't move behaviour even along a direction built exactly to do that. Patching does move behaviour but isn't specific: a random perturbation of equal size does the same thing. This post is a replication and an honest bracketing negative result. The causal tools can't show that role confusion IS the mechanism, NOR that it's a bystander. But two clues that need no working intervention both lean towards it being a bystander: 1) the styled/destyled gap is ~95% outside the probe's role axis, and 2) the probe's predictive ability collapses once style is held fixed. What I can show is narrower, but it's well supported by the data, and exploring why a clean verdict is out of reach is interesting on its own. The dead ends here demonstrate precisely why making causal claims about how prompt injections work is so difficult.

If you are hoping for a verdict on the original paper, there isn't one. I couldn't get one, and I really tried. Rather, this post is about why a clean verdict is so hard to get, and about showing that this kind of exploration and science doesn't need a lab. The work was powered by one consumer GPU (3090), coffee and curiosity.

The Paper

Prompt Injection as Role Confusion is an excellent recent paper, with an excellent accompanying post (Ye, Cui & Hadfield-Menell, ICML 2026). The paper has a genuinely clarifying idea. LLMs read a conversation as one long context, a long stream of text chopped into 'roles', e.g. system, user, chain of thought, tool use, etc. They explain

Article preview — originally published by LessWrong. Full story at the source.
