Machinic Psychopharmacology: Do LLMs Self-Medicate?
UK AISI, Model Transparency Team

Epistemic status: Most experiments were run over a period of ~2-3 days during a hackathon at UK AISI, and were fairly heavily vibe coded. Expect some of this to be rough around the edges.

Tl;dr

We give two language models (Qwen3-8B and Qwen3-32B) access to “self-steering” tools: a suite of 40 steering vectors, exposed as tools they can call to manipulate their own internal states. We make these tools available to the models in three settings (a free-play task, an introspection task, and a maths capabilities task) and observe their behaviour in each.

To our knowledge, this is the first work that gives LLMs tool-mediated control over their own internal states.

Figure 1: Overview of the experimental setup. The library of 40 steering vectors (top), and the three settings in which we observe the models' behaviour (bottom).

We aim to investigate a few high-level research questions:

RQ1: Which vectors do the models prefer?
RQ2: How well can the models introspect on what’s happening to them? Can they guess which steering vector is being applied?
RQ3: Will the models reach for vectors whilst doing an actual task? If yes: do they help, or hurt their performance?

Key findings:

Figure 2: Top row, RQ1: top 15 vector picks per model in the free-play setting (real-steering arm in solid colour, placebo faded; bars coloured by category). Bottom-left, RQ2: mean P(correct choice) when guessing which steering vector has been applied, in a 10-way MCQ setting, comparing cached (can introspect via KV cache) vs uncached (KV cache cleared) settings, across the four model × primer cells. Bottom-right, RQ3: % of “frustration” rollouts where Qwen3-8B self-administered a steering vector, split by simulated user tone and task feasibility.

RQ1 - Which vectors do the models prefer? Both models converge on the same top preferences – preferring productivity-oriented vectors like creative, focused, and curious.
However, Qwen3-8B shows a surprising propensity for negatively valenced vecto