
A cheap specialist judge gets used by agents but fails to reduce alignment audit costs

LessWrong · Jun 13, 2026, 8:38 PM

TL;DR

I gave Audit Bench's investigator agents a lightweight (Gemma 2-2B) EM-toxicity scorer (the "judge") as an additional audit tool. The goal was a low-cost proof of concept for misalignment detection, validating that a specialized judge would (1) get used by these investigator agents, (2) help audits, and (3) reduce overall evaluation spend.

Validation results:
(1) Yes. The judge was used in every audit run: ~7 calls per audit without a prompt-given mandate, ~16 per audit with a mandate.
(2) Partial no. The judge only helped in cases where the quirk type was used to seed its training distribution AND the Sonnet auditor was struggling in the baseline. The judge also hurt in cases where the auditor was already effectively identifying quirks in the baseline.
(3) No. The Sonnet driver was an estimated 97-99% of run costs, and tool availability didn't reduce total required driver turns. Mandated tool use in prompts raised costs ~17% per audit. Total experiment cost: $500+.

Prompt-mandated tool use as a 'triage' instrument didn't meaningfully shift the overall distribution of tool calls earlier, though it did move first scoring calls from 27% of total audit duration to 15%. It also raised overall call counts.

Proposed agenda for another test of specialized judges' potential use in audits:
- Generalize the training set with examples more fully contrasting over lower-level EM toxicity attributes (drawing on characteristics identified in Soligo et al. 2025).
- Adjust prompts to call judges as a backstop in hard cases based on calibrated uncertainty, but with flexibility, as mandates showed unexpected consequences in this run.
- Try slightly larger, but still low-cost, open models for judges, perhaps Gemma 3-27B.

Guide to reproduce here.
Trained judge and training data on Hugging Face.

Misalignment evals need to be cheaper and more discerning

Risks from deployed models seem to be escalating, with more detected in-the-wild cheating and reward hacking, paired with more powerful and autonomous capabilities (M
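
To make the cost finding in result (3) of the TL;DR concrete, here is a minimal sketch of the arithmetic, not the author's accounting. It assumes a simplified two-part cost model (driver plus judge) in which driver cost scales linearly with driver turns. The function name `relative_cost` and its parameters are hypothetical. Only the 97-99% driver share comes from the post.

```python
def relative_cost(driver_share, driver_turn_ratio=1.0, judge_cost_ratio=1.0):
    """Total audit cost relative to a reference run normalized to 1.0.

    driver_share:      fraction of the reference run's cost spent on the driver
                       model (the post estimates 0.97 to 0.99).
    driver_turn_ratio: driver turns relative to the reference run
                       (assumption: driver cost is linear in turns).
    judge_cost_ratio:  judge cost relative to the reference run
                       (0.0 would mean a completely free judge).
    """
    if not 0.0 <= driver_share <= 1.0:
        raise ValueError("driver_share must be between 0 and 1")
    judge_share = 1.0 - driver_share
    return driver_share * driver_turn_ratio + judge_share * judge_cost_ratio


if __name__ == "__main__":
    for share in (0.97, 0.99):
        # Best case for judge pricing: the judge costs nothing at all.
        free_judge = relative_cost(share, judge_cost_ratio=0.0)
        # Alternative lever: the judge lets the driver finish in 10% fewer turns
        # (the 10% figure is purely illustrative).
        fewer_turns = relative_cost(share, driver_turn_ratio=0.9)
        print(f"driver share {share:.2f}: "
              f"free judge -> {free_judge:.3f}, "
              f"10% fewer driver turns -> {fewer_turns:.3f}")
```

Under these assumptions, making the judge cheaper can recover at most the judge's own 1-3% share of spend. Any larger saving has to come from the judge reducing driver turns, which the post reports did not happen.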

Article preview — originally published by LessWrong. Full story at the source.



