Discovering Concept-Editing Algorithms With LLM Agents

LessWrong · Jul 1, 2026, 4:07 PM


by Adam Scherlis, Anish Tondwalkar · 1st Jul 2026 · 1 min read

This is a linkpost for https://dmodel.ai/concept-erasure/

Concept erasure is a technique that removes unwanted information from a model’s activations, but current erasure methods struggle to fully remove target concepts. In this study, we tasked LLM agents trained on our data with inventing concept erasure algorithms that outperform current methods under the same experimental constraints. We measure the performance of each algorithm family and explore why current methods fall short.

Our general strategy at d_model looks something like:

Concept erasure is a useful primitive for steering models, but it's also a good test of how well our agents understand model internals. It's not an interpretability task directly, but getting good at erasure makes agents better at interp as a side effect.
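
To make "removing unwanted information from a model's activations" concrete, here is a minimal sketch of one very simple linear baseline: project out the direction separating the two class means. This is an illustration only, not the algorithms the agents discovered and not necessarily any of the methods compared in the study. The function name `erase_direction`, the synthetic data, and the planted concept in coordinate 0 are all assumptions made for the example.

```python
import numpy as np

def erase_direction(X, labels):
    """Remove the class-mean-difference direction from activations X.

    X: (n_samples, d_model) array of activations (synthetic in this sketch).
    labels: binary array marking which samples carry the concept.
    Returns the edited activations and the unit direction that was removed.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels).astype(bool)
    # Direction along which the two classes differ on average.
    direction = X[labels].mean(axis=0) - X[~labels].mean(axis=0)
    norm = np.linalg.norm(direction)
    if norm == 0:
        return X.copy(), direction
    u = direction / norm
    # Orthogonal projection: subtract each row's component along u.
    X_erased = X - np.outer(X @ u, u)
    return X_erased, u

# Hypothetical data: Gaussian noise with a concept planted in coordinate 0.
rng = np.random.default_rng(0)
n, d = 200, 16
labels = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[:, 0] += 3.0 * labels

X_erased, u = erase_direction(X, labels)
mask = labels.astype(bool)
gap_before = np.linalg.norm(X[mask].mean(axis=0) - X[~mask].mean(axis=0))
gap_after = np.linalg.norm(X_erased[mask].mean(axis=0) - X_erased[~mask].mean(axis=0))
print(f"class-mean gap before: {gap_before:.3f}, after: {gap_after:.3e}")
```

A projection like this only equalizes the class means along a single direction. Differences in variance or other nonlinear structure are untouched, so a sufficiently expressive probe may still recover the concept. That is one general way an erasure method can fall short of fully removing a target concept.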

Article preview — originally published by LessWrong. Full story at the source.

