Some Interesting Papers on RLVR
This post was produced as part of MATS 9.1 under the mentorship of Richard Ngo. It is not part of my main research project, but the ideas have been an important conceptual anchor to me. Epistemically, treat this as watercooler talk. Please feel free to share additional or contradictory work in the comments. Low-fidelity 5-word summary: RLVR changes propensity, not lability Tl;dr is that RL acts on the weights of LLMs in a qualitatively different way from pre-training / SFT. [1] I give a mental model of how and why, and draw a speculative connection to 'emergent misalignment' and 'subliminal learning'. Most of the papers below I heard of via these two youtube videos by Bycloud: 'The LLM's RL Revelation We Didn't See Coming' 'The RL Irony in LLMs (and its insane new meta)' 1. Weight-level 1.a The Path Not Taken: RLVR Provably Learns Off the Principals (This is the most important one in this section) RLVR's updates are qualitatively different from those of SFT; specifically, they rotate the 'principal subspaces' less (iiuc, a tractable proxy for Hessian/eNTK eigenvectors) -- 5-ish degrees versus 50-ish degrees. They say other interesting, valuable stuff but this is the most important thing, imo. [2] (Figure 1c) 1.b Reinforcement Learning Finetunes Small Subnetworks in Large Language Models RLVR updates consistently have ~80% sparsity, compared to SFT's ~20%. [3] (Figure 1) 1.c RLVR updates are sparse / lower-rank than SFT LoRA Without Regret rank ~1 LoRA is essentially equivalent to full policy-gradient RL (presumably for fixed rank this only holds for a fixed variety of tasks, but morally seems consistent with the rest of the story) On Predictability of Reinforcement Learning Dynamics for Large Language Models RLVR updates occur in a rank-1 subspace and this subspace is consistent enough over training that you can basically just extrapolate to guess the final model after just a few checkpoints. 2. Behavioural-level 2.a Does Reinforcement Learning Really Incentivize Re