Reinforcement Learning, Agency and Taste
This started off as an entry for Dwarkesh's blog post contest, specifically an answer to his first question on why intuitions about slowdowns in reinforcement learning (RL) progress have either not come true or have had mixed success. His 1,000-word limit turned out to be too little to make my argument in full, so here it is. It draws on some of my recent work with Jack Lindsey as an Anthropic Fellow (paper coming soon), as well as reflections on my own learning process as a mathematician. Jointly written with Claude Opus 4.7, and with much influence from Danaja Rutar.

DeepSeek-R1-Zero learned to reason in long chains of thought from a single rule-based reward. Over thousands of training steps its responses grew from a few hundred tokens to ten thousand, with AIME accuracy climbing alongside. Should this have been anticipated?

A sharp argument two years ago said no. As task horizons lengthen, rewards arrive further from the actions that caused them, and with a naive policy gradient the signal per gradient step collapses roughly with horizon length. The order-of-magnitude compute jumps that took us from GPT-4 to o3 looked structurally hard to repeat. What was missed?

We argue that "horizon length" stood in for three factors that scale independently. Any agent solving a task brings learned intuitions to bear and improves them through exploration. Three things determine how well this works: a learned internal evaluator that training improves, exploration that ideally amplifies the evaluator faster than its errors compound, and substrate plasticity that determines what later training can shape. We map past progress onto these factors and, looking to the future, explain what is still lagging behind.

Along the way, we offer a new viewpoint on "reasoning" in language models, and on how reinforcement learning interacts with it. In particular, the first two factors explain why models have improved steadily on benchmarks like proving math theorems and coding, while the last two explain the