
In open RLVR, “improvement” depends on the instrument — a small GRPO testbed separating what training optimizes, measures, and teaches

LessWrong · Jun 15, 2026, 7:56 PM

Epistemic status: single-seed exploratory study on Qwen2.5-0.5B-Instruct / GSM8K with small held-out evals. Confident in the measurement failures, tentative on the rankings.

Code: https://github.com/Jules Roussel2001/grpo-reward-vs-eval

Motivation

In open RLVR, whether training "improved" the model depends on which instrument you measure it with: the reward channel, the extractor, or the decoding regime. Changing the instrument can make the same run look like a success, a failure, or a reversal. This happens because in most open GRPO pipelines the reward, the metric, and the extractor are one function, so "accuracy went up" is partly a fact about the instrument, not only about the model. I therefore built a small open testbed that separates these instruments and audits each one. None of these phenomena are new; the contribution is making them visible and cheap to reproduce in one place.

Two examples stand out. First, a format-only reward pushes the format metric from 0.438 to 1.000 but collapses accuracy from 0.228 to 0.025, a clean instance of the known reward-hacking failure mode. Second, and less obviously, the most faithful extraction method, last number (F1 0.938, against 0.813 for lenient tag and 0.473 for strict tag), is actually the worst reward to train on for accuracy: judge accuracy is 0.320, against 0.460 and 0.480.

Reward hacking is an established problem, illustrated for example by Krakovna et al.'s specification-gaming catalogue. Other work, such as Yue et al. (2025), "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?", raises the separate question of whether RLVR elicits or instills reasoning beyond the base model. However, these diagnostics have usually been shown in larger or separate studies.

As noted above, reinforcement learning as a training method can result in unintended behavior, such as reward hacking.
Reward hacking is a real problem beca
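To make the Motivation's point about instruments concrete, here is a minimal sketch of three answer extractors and a format-only reward applied to the same completions. Everything in it is an assumption for illustration: the `<answer>` tag format, the regexes, the function names, and the sample completions are hypothetical and are not taken from the linked repository. It only shows how the same text can be scored differently depending on which function does the scoring.

```python
import re

# Hypothetical tag format; the article preview does not specify the real one.
STRICT_TAG = re.compile(r"<answer>\s*(-?\d+(?:\.\d+)?)\s*</answer>\s*$")
LENIENT_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_strict_tag(text):
    """Return the number only if the text ends with a clean <answer>N</answer>."""
    m = STRICT_TAG.search(text.strip())
    return m.group(1) if m else None


def extract_lenient_tag(text):
    """Return the first number found inside an <answer>...</answer> pair."""
    m = LENIENT_TAG.search(text)
    if not m:
        return None
    nums = NUMBER.findall(m.group(1))
    return nums[0] if nums else None


def extract_last_number(text):
    """Return the last number anywhere in the text, ignoring tags."""
    nums = NUMBER.findall(text)
    return nums[-1] if nums else None


def format_reward(text):
    """Format-only reward: pays for a well-formed tag, not for correctness."""
    return 1.0 if extract_strict_tag(text) is not None else 0.0


def accuracy(extractor, text, gold):
    """Score 1.0 if the extracted number equals the gold answer, else 0.0."""
    pred = extractor(text)
    return 1.0 if pred is not None and float(pred) == float(gold) else 0.0


# Illustrative completions (made up), both for a problem whose gold answer is 14.
gold = "14"
completion_a = "18 - 4 = 14, so she sells 14 eggs.\n<answer>The total is 14 dollars</answer>"
completion_b = "<answer>7</answer>"

for name, fn in [("strict_tag", extract_strict_tag),
                 ("lenient_tag", extract_lenient_tag),
                 ("last_number", extract_last_number)]:
    print(name, accuracy(fn, completion_a, gold), accuracy(fn, completion_b, gold))
print("format_reward", format_reward(completion_a), format_reward(completion_b))
```

In this sketch, `completion_a` is correct under the lenient-tag and last-number extractors but wrong under the strict-tag one, while `completion_b` earns full format reward despite a wrong answer. If a single function served as reward, metric, and extractor at once, these disagreements would be invisible.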

Article preview — originally published by LessWrong. Full story at the source.
