AI Mistake Seeding

LessWrong · Jul 1, 2026, 6:49 PM

I wonder if AI is being trained to make easy-to-correct mistakes so it can fix them later. That is, it ends up trained to correct its previous message's mistake, then make another mistake, so it can correct that one in the next message.

As I understand RL from human or AI feedback, a human or AI judge ranks several responses from the policy model. These might be the first response in a conversation, or the latest AI response in a longer conversation. The earlier turns in longer conversations could come from real data, or they could be generated by the policy model, by a previous checkpoint, or by some other source.

In the cases where the policy model generates the previous turns, I wonder whether, under some circumstances, the policy model could end up trained to seed mistakes so that it can earn more reward by correcting them later. (This assumes that correcting a previous mistake is rated highly enough by RLHF/RLAIF relative to a freshly correct answer, which I think is plausible, especially in constitutional or honesty/transparency-focused setups.)

The policy model is usually only rewarded based on the last response, though. So even if a correction outscores a freshly correct answer, that wouldn't reward the previous turns that contained the mistakes. So maybe the policy model would be trained to correct mistakes it finds, but wouldn't end up trained to create those mistakes in the first place. I think most people currently assume this is all that's going on.

However, if there were any kind of "outer loop" of training, then the overall training process could favor non-greedy strategies, giving up reward in one training pass to get more reward later. For example, instead of training a model once using RLAIF, let's say we had a multi-stage population-based/evolutionary algorithm, and in each stage (the "inner loops"), we trained many models with RLAIF, and at the end of the stage, discarded the RL models that performed the worst. In each inner-loop step, the mode
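
To make the contrast between last-response-only reward and an outer selection loop concrete, here is a toy sketch. It is not a model of any real training setup: the reward values, the two-turn conversation, the single `p_seed` parameter per model, and the function names are all assumptions made up for illustration. Each model writes both turns, seeds an easy mistake with probability `p_seed`, and is rated only on its second turn.

```python
# All reward values are hypothetical, chosen only for illustration.
FRESH_REWARD = 1.0       # judge score for a correct answer with nothing to fix
CORRECTION_REWARD = 1.5  # assumed higher score for fixing a previous mistake
MISTAKE_PENALTY = 0.2    # assumed small cost if the rated turn has an easy mistake


def expected_final_reward(p_seed: float) -> float:
    """Expected judge score of the last turn in a two-turn rollout.

    Turn 1 contains a seeded mistake with probability p_seed. Turn 2 (the
    only rated turn) corrects it if present, and itself seeds a new mistake
    with the same probability.
    """
    base = p_seed * CORRECTION_REWARD + (1 - p_seed) * FRESH_REWARD
    return base - p_seed * MISTAKE_PENALTY


def inner_loop_step(p_seed: float, lr: float = 0.1) -> float:
    """Greedy update that credits only the rated turn's own action.

    Holding the earlier turn fixed as context, seeding in the rated turn
    only costs MISTAKE_PENALTY, so the update pushes p_seed down.
    """
    grad = -MISTAKE_PENALTY
    return min(1.0, max(0.0, p_seed + lr * grad))


def outer_loop_select(population: list[float], keep: int) -> list[float]:
    """Keep the models whose self-generated rollouts score highest."""
    return sorted(population, key=expected_final_reward, reverse=True)[:keep]


population = [0.0, 0.25, 0.5, 0.75, 1.0]
after_inner = [inner_loop_step(p) for p in population]  # every p_seed shrinks or stays at 0
survivors = outer_loop_select(after_inner, keep=2)      # the heaviest seeders survive
```

Under these made-up numbers, the inner step nudges every model away from seeding, yet selection on rollout score keeps the two models that seed the most, because the correction bonus (0.5) exceeds the mistake penalty (0.2). If the penalty were larger than the bonus, the same selection step would favor the models that seed the least.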

Article preview — originally published by LessWrong. Full story at the source.