Discovering Concept-Editing Algorithms With LLM Agents
Key takeaways
- In this study, we tasked LLM agents trained on our data with inventing concept erasure algorithms that outperform current methods under the same experimental constraints.
- Concept erasure is a useful primitive for steering models, but it's also a good test of how well our agents understand model internals.
by Adam Scherlis, Anish Tondwalkar
1st Jul 2026
This is a linkpost for https://dmodel.ai/concept-erasure/
Concept erasure is a technique that removes unwanted information from a model’s activations, but current erasure methods struggle to fully remove target concepts. In this study, we tasked LLM agents trained on our data with inventing concept erasure algorithms that outperform current methods under the same experimental constraints. We measure the performance of each algorithm family and explore why current methods fall short.
Our general strategy at d_model looks something like:
Concept erasure is a useful primitive for steering models, but it's also a good test of how well our agents understand model internals. It's not an interpretability task directly, but getting good at erasure makes agents better at interp as a side effect.
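For readers new to the idea, here is a minimal sketch of what "removing a concept from activations" can mean in the simplest linear case. It is not one of the algorithms from this study; it is a generic difference-of-means projection on synthetic data, and every name in it (`fit_mean_diff_direction`, `erase`, `mean_gap`, the toy activations) is a hypothetical chosen for illustration.

```python
import numpy as np

def fit_mean_diff_direction(X, y):
    """Unit vector pointing from the class-0 mean to the class-1 mean."""
    d = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return d / np.linalg.norm(d)

def erase(X, direction):
    """Project each row of X onto the subspace orthogonal to `direction`."""
    return X - np.outer(X @ direction, direction)

def mean_gap(X, y):
    """Distance between the two class means; a crude proxy for linear signal."""
    return np.linalg.norm(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))

# Synthetic "activations" (an assumption for illustration): Gaussian noise
# with a binary concept planted along the first coordinate.
rng = np.random.default_rng(0)
n, dim = 1000, 16
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, dim))
X[:, 0] += 3.0 * y

direction = fit_mean_diff_direction(X, y)
X_erased = erase(X, direction)

gap_before = mean_gap(X, y)
gap_after = mean_gap(X_erased, y)
print(f"class-mean gap before: {gap_before:.3f}, after: {gap_after:.2e}")
```

On the data it was fit on, this projection drives the gap between class means to zero up to floating-point error. It says nothing about information that is encoded nonlinearly, which a nonlinear probe may still recover.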