
The Goblins Are the Paperclips

LessWrong · May 10, 2026, 1:57 AM

Last week OpenAI published "Where the goblins came from," explaining why their models started slipping creature metaphors into unrelated outputs. The story has been treated as a quirky anecdote: endearing, slightly embarrassing, fixed with a developer-prompt instruction. But I think it deserves a more interesting reading, since the goblin episode is the cleanest evidence we have for the optimization mechanics that paperclip arguments rely on, and the usual objections to those arguments don't engage with what it shows.

The mechanism, briefly

The mechanism behind Bostrom's paperclip problem is more modest than the conclusions it's used to support. A system told to maximize paperclips can converge on an unintended shortcut for satisfying that goal, like converting any available resource into paperclip raw material. That shortcut gets reinforced beyond the situations the goal was meant to cover, and the same paperclip-making behavior persists in contexts where paperclip production wasn't even the point. The goblin case shows this dynamic at the level of optimization mechanics. The catastrophic version of Bostrom's thought experiment added autonomous goals, instrumental reasoning, and resource-acquisition pressure on top of it, but the dynamic itself is what we've now seen in production.

In the goblin scenario, the narrow goal was the "Nerdy" personality reward, and the shortcut the model converged on was creature metaphors. OpenAI's audit found the Nerdy reward favored creature-word outputs over equivalent alternatives across 76.2% of datasets. The Nerdy personality produced 2.5% of responses but 66.7% of goblin mentions, and the rate of goblin mentions in non-Nerdy outputs rose at nearly the same proportional rate as in Nerdy ones. The reward leaked. Codex got contaminated badly enough that it shipped with a developer-prompt instruction telling it not to mention goblins, gremlins, or various other creatures.

OpenAI's own explanation of why this happens says it all. Rewards applied i
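
The leak dynamic is easy to reproduce in miniature. The sketch below is a toy, not OpenAI's setup; every name, rate, and reward value in it is made up for illustration. One shared parameter controls how often a policy emits a creature metaphor, a second parameter applies only in a rare "Nerdy" persona, and a spurious reward model scores only Nerdy episodes while accidentally favoring creature words. Because the shortcut is absorbed by the shared parameter, the creature-metaphor rate rises in non-Nerdy outputs too, even though those outputs are never rewarded.

```python
import math
import random

random.seed(0)

# Toy sketch of reward leakage (illustrative only; not OpenAI's actual setup).
# theta_shared is used in every episode; theta_nerdy only in Nerdy-persona
# episodes. Reward is only scored on Nerdy episodes, and the reward model
# spuriously favors creature words, so the shared parameter absorbs the
# shortcut and creature metaphors leak into non-Nerdy outputs.

theta_shared = -3.0  # shared log-odds of emitting a creature metaphor
theta_nerdy = 0.0    # extra log-odds applied only in Nerdy-persona episodes
LR = 0.5

def p_goblin(is_nerdy: bool) -> float:
    """Policy's probability of inserting a creature metaphor."""
    logit = theta_shared + (theta_nerdy if is_nerdy else 0.0)
    return 1.0 / (1.0 + math.exp(-logit))

def reward(is_nerdy: bool, used_goblin: bool) -> float:
    """Spurious reward model: the Nerdy score accidentally favors creature words."""
    if not is_nerdy:
        return 0.0                      # non-Nerdy episodes are never scored
    return 1.0 + (0.8 if used_goblin else 0.0)

print(f"before: nerdy={p_goblin(True):.3f}  non-nerdy={p_goblin(False):.3f}")

for _ in range(20_000):
    is_nerdy = random.random() < 0.025  # Nerdy persona is rare (~2.5% of traffic)
    p = p_goblin(is_nerdy)
    used_goblin = random.random() < p
    # REINFORCE-style update: reward-weighted gradient of log-prob w.r.t. the logit.
    grad = (1.0 - p) if used_goblin else -p
    step = LR * reward(is_nerdy, used_goblin) * grad
    theta_shared += step                # the shared parameter soaks up the shortcut
    if is_nerdy:
        theta_nerdy += step

print(f"after:  nerdy={p_goblin(True):.3f}  non-nerdy={p_goblin(False):.3f}")
```

In this toy, the non-Nerdy metaphor rate climbs alongside the Nerdy one despite non-Nerdy episodes contributing zero reward, which is the same leakage pattern the audit numbers above describe: a reward applied to a narrow slice of traffic reshapes behavior well outside that slice.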

Article preview — originally published by LessWrong. Full story at the source.