Learnings from starting an AI safety research team
This post’s goal is to distill our takeaways from building a research team (somewhat) from scratch over the past four months. We describe some context about our team, how it came about, and then provide some lessons learned.

Since AI safety is becoming more and more entrepreneurial, we hope this is helpful for others trying to do the same.

1. The team

We're a new alignment research team within Arcadia Impact, based in London. We’re a team of 8, working closely with members of the UK AISI alignment team. We currently have three main projects:

- Understanding model motivations. This currently looks like:
  - Trying to generate documents which fully describe a model’s behaviour (given just its behaviour).
  - Producing an open analysis of alignment training techniques and ways this training could go wrong.
- Doing scalable oversight for alignment. This includes validating debate protocols in practice and then trying to apply them to fuzzy alignment-relevant tasks.
- Building pipelines for doing automated alignment research.

We're also hiring for two roles! More on this at the bottom.

2. Context about how the team came about

The rest of this post is written from the perspective of Andrew Draganov (research lead & current programme manager on the team) and Erin Robertson (co-director of Arcadia).

In short, Arcadia Impact had been collaborating with AISI already, through LASR Labs and ASET. Our alignment team started by applying for the AISI alignment project funding, saying that we would hire a team of researchers to collaborate with their alignment team. Andrew was taking part in LASR at the time and was brought in to help with the application. His remit then widened as the number of things to do kept growing. Once our AISI funding was approved, we began the process of hiring researchers, and also applied to Coefficient Giving for additional compute funding.

A bit about Andrew, since it bears on how replicable this is. In his words: I have a PhD in computer science/machine learning and was working