Taking the Training Wheels Off: Aligning LLMs without Personas

LessWrong · Jun 2, 2026, 6:29 AM · Also reported by 1 other source

If you told an AI alignment researcher in 2018 about an alignment plan that involved collecting trajectory information from moral experts at scale and training an AI to copy it, they would point out that this would not scale to superintelligence.

This is essentially what major AI labs and most AI safety researchers are doing now in our attempts to align language models.

Pretty much all current alignment techniques, including RLHF, steering vectors, and prompting, assume that "goodness" or "good personas" exist within the model. This is great for aligning present-day models, since the model can mimic the helpful open-source contributors, scientists, and therapists that exist within the training data.

The problem with Personas is that they almost certainly will not continue to work for aligning superhuman models. "What would Atticus Finch do?" is a great question for guiding behavior in human-scale endeavors, but it will not work beyond human level for (at least) two reasons:

1. Atticus Finch was never beyond human level, so there is no data on what he would do, and any extrapolation would be a mere guess.
2. It is unclear whether Atticus Finch would actually make a good god. We can’t tell, because Atticus Finch was never a god.

For alignment to work, the mimicry of a good person must remain good when taken out of distribution by superhuman RL and new situations. Aligning present-day AI is significantly easier than aligning superhuman AI. In a way, current alignment techniques are cheating.

To test our ability to align superhuman models, we need to get good at aligning Personaless (or “Good Personaless”) models. Personaless Alignment means combining 2022+ level language model capabilities with 2018-level alignment techniques. A huge jump in AI’s ability to mimic moral and capable humans occurred, but if we want alignment to work for superintelligence, we need techniques that go beyond mimicry.

Personaless Alignment would ask questions like:

When we cannot rely on persona
