Which technical AI safety fields are going to be automated first?
I’m transitioning into technical AI safety, and I’ve been thinking a lot about which fields I want to research and where I’ll have the biggest impact. One question I keep coming back to is which fields are likely to be automated.

Automation seems fairly likely: frontier labs will probably automate capabilities research as part of automated R&D, and safety research won’t be far behind. Some early examples: Anthropic is investigating automated alignment researchers, UK AISI recently evaluated models’ propensity to sabotage alignment research, and Anthropic is using Mythos for internal productivity boosts and testing its ability to do alignment research without sabotage (which it can’t). There are also a handful of frameworks appearing for creating automated research pipelines, such as The AI Scientist.

I’ll be considering two factors:

Feedback Quality: How easy is it to verify the research outputs from these fields?

Economic Incentive: How much incentive is there for frontier labs to automate this research?

Disclaimer: This is a pretty rough and hand-wavy treatment. If you’re looking for a more in-depth exploration of modelling automated R&D across alignment vs capabilities, I’d recommend this post from Jan.

Disclaimer: I wrote up most of the ideas for this blog post in April and didn’t get around to publishing it. I’m aware that UK AISI has since published more fantastic research on this that I haven’t considered here.

I’ll base my estimates of what can be automated on what an AI with superhuman coding and reasoning capabilities could do. In particular, I’m looking at how quickly each field will get to automating the entire research pipeline with few or no humans in the loop, and at the incentives to get there.

Here are my rankings for which fields are most likely to be automated this way (top is most likely):

Scalable Oversight

Feedback Quality (3/5): This has okay feedback quality in terms of already existing evals; however, I thi