Three types of model organism
This is a short post to explain a distinction between three different types of model organism (MO) research:

| Type | Purpose | Example |
| --- | --- | --- |
| Worst-case model organisms | Stress-test safety and control techniques by making the problem as hard as possible | Password-locked models for capability elicitation; sleeper agents for stress-testing alignment training; red-team malign inits in control |
| Natural model organisms | Demonstrate plausible emergence of failure modes in realistic training pipelines | Emergent misalignment induced by narrow fine-tuning; optimise CoT to look good to a monitor |
| Constructed model organisms | Intentionally instantiate the phenomena of interest | Models trained with explicit scheming objectives and specifications; alignment faking models trained on false facts |

1. Worst-case model organisms

These act as an upper bound on the difficulty for some problem, such as hidden capabilities or goals. Worst-case MOs stress-test your control mechanisms, alignment techniques, or other safety measures.

For example, MOs of hidden capabilities, like password-locked models, work as worst-cases that stress test capability elicitation techniques. Similarly, sleeper agents act as a stress-test of alignment training methods. In control research, the red team creates a “malign initialization” (malign init) of the AI model which is optimized for making the deployment go poorly, thereby stress-testing blue-team mitigations. Auditbench includes a suite of worst-case MOs to evaluate methods for auditing hidden behaviours.

The hope with these is that it's strictly harder to prevent bad behaviour from the MO than from the real model we're worried about. If your safety or control techniques work on the MO, you have a lot of confidence that they work in the real case too.

2. Natural model organisms

These mimic failure modes in actual training pipelines and demonstrate how realistic training pipelines can lead to misalignment.

The purpose of natural MOs is to better understand current alignment techniques and to