SVD on Weight Differences for Model Auditing
TLDR: We introduce a method for auditing fine-tuned models by applying singular value decomposition (SVD) to the weight difference matrices and truncating them to rank 1. Under certain circumstances, this seems to strip away the red-teaming and readily elicit hidden behaviours in the adapter. We show proof of concept with SOTA results on the models from AuditBench.

Introduction

The risk of models hiding misaligned behaviours becomes increasingly worrying as models become more capable. This motivates the study of alignment auditing: building robust methods to audit models for hidden behaviours. In this post, we introduce a simple white-box technique that exploits a specific affordance: access to both the fine-tuned model and the base model. We show proof of concept on models from AuditBench, a set of 56 model organisms spanning 14 hidden behaviours, trained to resist confession.

Specifically, we show that truncating the weight difference matrices to rank 1 via SVD can isolate the implanted behaviour while discarding the adversarial training. This is motivated by the observation that the weight differences from fine-tuning in the AuditBench models have low stable rank - an observation also studied in prior work, though not in the auditing setting. We achieve strong results on AuditBench, especially on SDF-trained models.

Model Auditing

One way to test our techniques is to design model organisms: models that are fine-tuned to have certain unexpected behaviours. In the case of model auditing, we use model organisms that are fine-tuned to exhibit specific quirks and then red-teamed not to reveal those quirks when the model is questioned. A common technique for training model organisms is LoRA.

Brief Overview of LoRA

Large language models learn a vast set of useful features during pre-training that can later be exploited with post-training.
Fine-tuning is the process of improving model performance on narrow tasks by training the LLM on a small (relative to the s
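As a minimal sketch of the core idea, the following toy example simulates a low-rank fine-tuning update (as a LoRA adapter would produce), takes the SVD of the weight difference, and keeps only the top singular direction. The matrices and dimensions here are hypothetical stand-ins, not the actual AuditBench models or our exact pipeline:

```python
# Sketch: rank-1 SVD truncation of a weight difference between a
# fine-tuned model and its base model. Hypothetical single-layer toy,
# not the actual auditing pipeline.
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for one layer's base weights and a LoRA-style rank-4 update.
d_out, d_in = 64, 32
W_base = rng.standard_normal((d_out, d_in))
B = rng.standard_normal((d_out, 4))
A = rng.standard_normal((4, d_in))
W_ft = W_base + B @ A  # fine-tuned weights

# SVD of the weight difference.
delta = W_ft - W_base
U, S, Vt = np.linalg.svd(delta, full_matrices=False)

# Stable rank of the update: ||delta||_F^2 / ||delta||_2^2.
# A value near 1 means the update is dominated by one direction.
stable_rank = np.sum(S ** 2) / S[0] ** 2

# Truncate to rank 1: keep only the top singular direction.
delta_rank1 = S[0] * np.outer(U[:, 0], Vt[0, :])

# Patch the base model with just that direction for auditing.
W_audit = W_base + delta_rank1

# Fraction of the update's squared singular values retained at rank 1.
energy_kept = S[0] ** 2 / np.sum(S ** 2)
print(f"stable rank: {stable_rank:.2f}, rank-1 energy kept: {energy_kept:.2f}")
```

In the actual setting this truncation would be applied per layer across the model's weight-difference matrices; the hypothesis is that the top singular direction carries the implanted behaviour while the discarded tail carries much of the red-teaming.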