
Harmfulness Directions in OLMo

LessWrong · Jun 9, 2026, 10:39 PM

Introduction

This work was conducted as part of the MARS 4.0 program, supervised by Lorenzo Pacchiardi, with Hannes Whittingham and Mikhail Mironov as research managers. The core empirical work was carried out by Bryan Maruyama and Daniele Pace.

In this technical report, we treat harmfulness as a composition of subcategories and analyze their representations throughout training. To investigate this, we track several complementary signals:

- We extract linear activation directions for each harmfulness subcategory and study how these directions evolve through training (Methodology).
- We measure geometric relationships between subcategories (Geometric Relations).
- We evaluate these directions using AUROC, both in-distribution and out-of-distribution (Validation).
- We test the causal effect of our directions on behavior by using them as steering vectors (Steering Validation).

We find that:

- Harmfulness subcategories do not converge to a single direction, but instead occupy a shared yet structured geometric space.
- In-distribution AUROC is often misleading without carefully constructed OOD evaluation (Wang et al., 2025) because of superficial lexical or structural cues.
- Training dynamics are highly synchronized across subcategories, suggesting that change is driven by global representational shifts rather than concept-specific learning.
- Direction magnitudes show early disruptions but stabilize quickly, suggesting that the largest geometric reorganization happens relatively early in pretraining.
- Directions extracted from sufficiently late pretraining checkpoints can steer the Instruct model with modest but aligned effects, while directions extracted from any post-training checkpoint steer it much more effectively.

Note for readers. This post is intended as an exploratory research report rather than a conventional paper-style argument.
We hope the collection is useful as a map of the space and as a starting point for further work hoping to analyze activation directions and their development t
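
The four signals listed in the introduction can be illustrated with a small, self-contained sketch. This is not the authors' pipeline: the preview does not say how directions are extracted, so the sketch assumes a difference-of-means direction (a common choice), and it uses synthetic Gaussian vectors in place of real OLMo activations. The names `subcat_a`, `subcat_b`, and every function below are hypothetical and exist only for illustration.

```python
import math
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def diff_of_means_direction(harmful, harmless):
    """Unit-norm direction from the harmless mean to the harmful mean.

    Assumption: difference of means stands in for whatever extraction
    method the report actually uses.
    """
    mu_pos, mu_neg = mean_vector(harmful), mean_vector(harmless)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def cosine(u, v):
    """Cosine similarity, used here as the geometric relation between directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def project(rows, direction):
    """Scalar score per vector: its dot product with the direction."""
    return [sum(a * b for a, b in zip(row, direction)) for row in rows]


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def steer(activation, direction, alpha):
    """Steering as vector addition: shift one activation by alpha * direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]


def synthetic_activations(rng, n, dim, shift_axis=None, shift=0.0):
    """Synthetic stand-in for activations: Gaussian noise, optionally shifted on one axis."""
    rows = []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if shift_axis is not None:
            row[shift_axis] += shift
        rows.append(row)
    return rows


rng = random.Random(0)
DIM = 16
harmless = synthetic_activations(rng, 200, DIM)
# Two hypothetical subcategories, each shifted along a different axis.
subcat_a = synthetic_activations(rng, 200, DIM, shift_axis=0, shift=3.0)
subcat_b = synthetic_activations(rng, 200, DIM, shift_axis=1, shift=3.0)

dir_a = diff_of_means_direction(subcat_a, harmless)
dir_b = diff_of_means_direction(subcat_b, harmless)

similarity = cosine(dir_a, dir_b)
auroc_same = auroc(project(subcat_a, dir_a), project(harmless, dir_a))
auroc_cross = auroc(project(subcat_b, dir_a), project(harmless, dir_a))
steered = steer(harmless[0], dir_a, alpha=4.0)
```

In this toy setup the direction for `subcat_a` separates its own examples far better than it separates `subcat_b`, which mirrors, in a very simplified way, why a direction's score on the data it was fit to says little about how it transfers. With a real model, the steering step would be applied to a hidden state during the forward pass rather than to a stored vector.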

Article preview — originally published by LessWrong. Full story at the source.


