Harmfulness Directions in OLMo
Introduction. This work was conducted as part of the MARS 4.0 program, supervised by Lorenzo Pacchiardi, with Hannes Whittingham and Mikhail Mironov as research managers. The core empirical work was carried out by Bryan Maruyama and Daniele Pace.

In this technical report, we treat harmfulness as a composition of subcategories and analyze how their representations develop throughout training. To investigate this, we track several complementary signals:

- We extract linear activation directions for each harmfulness subcategory and study how these directions evolve through training (see Methodology).
- We measure geometric relationships between subcategories (see Geometric Relations).
- We evaluate these directions using AUROC, both in-distribution and out-of-distribution (see Validation).
- We test whether the directions have a causal effect on behavior by using them as steering vectors (see Steering Validation).

We find that:

- Harmfulness subcategories do not converge to a single direction, but instead occupy a shared yet structured geometric space.
- In-distribution AUROC is often misleading without carefully constructed OOD evaluation (Wang et al., 2025), because classifiers can exploit superficial lexical or structural cues.
- Training dynamics are highly synchronized across subcategories, suggesting that change is driven by global representational shifts rather than concept-specific learning.
- Direction magnitudes show early disruptions but stabilize quickly, suggesting that the largest geometric reorganization happens relatively early in pretraining.
- Directions extracted from sufficiently late pretraining checkpoints can steer the Instruct model with modest but aligned effects, while directions extracted from any post-training checkpoint steer it much more effectively.

Note for readers. This post is intended as an exploratory research report rather than a conventional paper-style argument.
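As a rough illustration of the four signals listed in the introduction, the sketch below runs them on synthetic data. It assumes a difference-of-means direction, which is one common way to obtain a linear activation direction; the introduction does not say which extraction method the report uses. The function names, the 64-dimensional random "activations", and the steering strength `alpha` are all hypothetical and are not taken from the report.

```python
import numpy as np


def mean_difference_direction(harmful, harmless):
    """Unit vector pointing from the harmless mean to the harmful mean.

    Assumption: difference-of-means extraction (not confirmed by the report).
    """
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)
    return diff / np.linalg.norm(diff)


def cosine_similarity(u, v):
    """Cosine of the angle between two direction vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as one half (the Mann-Whitney formulation of AUROC).
    """
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())


def steer(activations, direction, alpha):
    """Add a scaled direction to every activation vector."""
    return activations + alpha * direction


# Synthetic stand-ins for residual-stream activations (hypothetical data).
rng = np.random.default_rng(0)
dim = 64
harmless = rng.normal(size=(200, dim))

shift_a = np.zeros(dim)
shift_a[0] = 3.0                      # subcategory A shifts along axis 0
shift_b = np.zeros(dim)
shift_b[0], shift_b[1] = 2.0, 2.0     # subcategory B partly overlaps with A

harmful_a = rng.normal(size=(200, dim)) + shift_a
harmful_b = rng.normal(size=(200, dim)) + shift_b

dir_a = mean_difference_direction(harmful_a, harmless)
dir_b = mean_difference_direction(harmful_b, harmless)

# Geometric relationship between the two subcategory directions.
similarity = cosine_similarity(dir_a, dir_b)

# Scores are projections onto the direction. This uses the same data the
# direction was fit on, so it is an in-distribution number at best.
auroc_a = auroc(harmful_a @ dir_a, harmless @ dir_a)

# Steering moves the mean projection along dir_a by exactly alpha,
# because dir_a has unit norm.
steered = steer(harmless, dir_a, alpha=4.0)
```

In a real setting the arrays would hold activations collected from a chosen layer at a given checkpoint, and the AUROC would be recomputed on held-out and out-of-distribution prompts rather than on the fitting data.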
We hope the collection is useful as a map of the space and as a starting point for further work hoping to analyze activation directions and their development t