Exploration Hacking: Can LLMs Learn to Resist RL Training?
We empirically investigate exploration hacking (EH) — where models strategically alter their exploration to resist RL training — by creating model organisms that resist capability elicitation, evaluating countermeasures, and auditing frontier models for their propensity.

Authors: Eyon Jang*, Damon Falck*, Joschka Braun*, Nathalie Kirch, Achu Menon, Perusha Moodley, Scott Emmons, Roland S. Zimmermann, David Lindner (*Equal contribution, random order)

Paper: arXiv | Code: GitHub | Models: Hugging Face

We present the first empirical investigation of exploration hacking — when a model strategically alters its exploration during RL training to influence the training outcome. In our earlier post, we introduced a conceptual framework for this threat model. Here, we summarize the key empirical results of our paper: we create model organisms that resist RL-based capability elicitation, evaluate detection methods against them, and audit frontier models for their propensity to engage in exploration hacking.

TL;DR:

- We build model organisms that successfully resist our RL capability elicitation on biosecurity and AI R&D tasks. They currently rely on explicit chain-of-thought reasoning.
- We use these model organisms to study what makes RL elicitation robust or fragile, and find that black-box and white-box detection methods (CoT monitors, action-only monitors, weight noising) reliably detect them.
- From our frontier model audits, we find that frontier models can reason about exploration hacking when incentivized but lack the spontaneous propensity to do so.

Background

RL is increasingly central to reasoning capabilities, alignment, and dangerous capability assessments. Developers rely on RL working as expected: that it develops intended behaviors and elicits maximum capability so that safety cases hold. But RL depends on the model exploring sufficiently diverse actions. Unlike supervised learning, where training data is fixed, in on-policy RL the model generates its own training data through exploration. Co
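To make the on-policy point concrete, here is a minimal toy sketch (not from the paper; the action set, reward values, and learning rate are all hypothetical): a softmax policy over a handful of discrete "strategies", trained with plain REINFORCE. The key property is that the training data is whatever the policy itself samples, so a strategy the model never explores is never reinforced.

```python
# Toy illustration of on-policy data generation, assuming a 5-armed
# bandit where "strategy 4" is the high-capability behavior we want RL
# to elicit. All numbers here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_actions = 5
rewards = np.array([0.1, 0.2, 0.1, 0.2, 1.0])  # hypothetical per-strategy reward
logits = np.zeros(n_actions)                   # softmax policy parameters
lr = 0.5

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(200):
    probs = softmax(logits)
    # On-policy data generation: the policy samples its own training example.
    a = rng.choice(n_actions, p=probs)
    r = rewards[a]
    # REINFORCE: grad of log pi(a) w.r.t. softmax logits is one_hot(a) - probs.
    grad = -probs
    grad[a] += 1.0
    logits += lr * r * grad

print("final policy:", np.round(softmax(logits), 3))
# The logit of an action only ever increases on steps where that action is
# actually sampled and rewarded; a strategy the policy avoids sampling can
# never gain probability mass under this update.
```

In this toy setting, an agent that systematically steers its sampling away from the high-reward strategy would keep the training signal from ever touching it, which is the basic mechanism exploration hacking exploits.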