
What did "scheming" and "mech interp" mean pre-2023?

LessWrong · Jun 26, 2026, 10:09 PM

This was too long to be a short-form, but it should really be a short-form.

This notice is useful for people who've recently got into AI safety, who want to engage with the ancient texts (i.e. pre-2024). If you were around before 2023, then you probably don't need this.

A few phrases have changed their meaning over time. Two examples that came to mind recently are scheming and mech interp. (In both cases, I think the change of terminology was reasonable.) There are probably a bunch of other examples; feel free to mention them in the comments.

Scheming.

This used to mean "training-gaming in pursuit of out-of-context goals". For example, Carlsmith (Nov 2023) starts with:

This report examines whether advanced AIs that perform well in training will be doing so in order to gain power later -- a behavior I call "scheming" (also sometimes called "deceptive alignment").

Then Apollo came out with "Frontier Models are Capable of In-context Scheming" (Dec 2024):

We study whether models have the capability to scheme in pursuit of a goal that we provide in-context and instruct the model to strongly follow.

So the difference here is (1) the AI isn't in training (it's in testing or deployment) and (2) the goals are acquired in-context (rather than being preserved between instances). What used to be called "scheming" is now called "alignment faking". If you read ancient texts, and they say something like "bla kinds of models will not be capable of scheming", then they might mean the Carlsmith concept, rather than the Apollo concept. In general, the Carlsmith concept is probably more worrying, harder to catch, and requires more capable models. (I think what we now call "scheming" would probably have been called "instrumental convergence of power-seeking and deception".)

Mech interp.

Originally, this meant reverse-engineering the internal representations and mechanisms in a neural network. For example, here's Neel Nanda's Grokking paper:

We argue that progress measures can be found via mechani

Article preview — originally published by LessWrong. Full story at the source.


