Neuralese is Actually Probably Good for Alignment
The best language models are still getting smarter and more capable. To an increasing degree, this is because they are trained with Reinforcement Learning with Verifiable Rewards (RLVR). Chain-of-thought reasoning lets models evade the finite-depth restriction on information flow by passing (relatively little) information back into the first layers of the model through the token stream. Although pretraining was already enough to produce decently good chains of thought by pure imitation, RLVR allows further optimization of the chain of thought by rewarding those chains that result in a verifiably correct answer. Because tasks that are far too difficult for humans can still be possible to grade, RLVR provides a way of bootstrapping to capabilities far beyond the human level.

Exact and Inexact Grading

What kinds of things can be turned into verifiable rewards? Most obviously, coding tasks where there are unit tests (and maybe performance tests) to determine whether a solution is correct. In a similar vein, creating mathematical proofs in a formal proof language like Lean or Agda. We can also use almost any RL environment that is easy enough for the model to interact with. For example, the model plays a text adventure game where there are rewards for getting certain items or reaching certain rooms.

We'll call this class of problem "exactly graded", because the reward can be evaluated without error. Note that the problem statement given to the model as context need not be exact at all. We can ask the model to write its code with only an informal problem description to guide it, but still grade its solutions by running unit tests.

Another class of problem is where the reward is subjective, or just difficult to compute. An extreme example of this might be asking the model to invent a joke. While this is the kind of question that could benefit from chain-of-thought reasoning (coming up with a really funny joke can require a lot of thought and iterating on ideas), it's