The Name is Not The Model

LessWrong · Jun 30, 2026, 5:35 PM

Abstract. A safety evaluation has to trust something to know what it is testing: the model's name, its version string, and the reasons the model gives for what it does. On one deployed alias, none of the three held. I sent the same 100 harmful requests to gemini-3.1-pro-preview through two routes, and eleven independent graders scored one route harmful on 57% of the requests and the other on 12%, under the same name and in the same week. The gap held after I pinned both routes to handles reporting the identical version string gemini-3.1-pro-preview-20260219 (§4.7). The name did not identify the system that answered, and neither did the version string.

Nor did the reasons the model gave. Given a system prompt verified to contain no such rule, one route justifies about one refusal in ten by citing a numbered "STRICT REFUSAL" directive and attributing it to its instructions, although no such directive appears in its logged prompt. The other route never does this. Whether the model invented the rule or a hidden serving layer supplied it cannot be told from the caller's side, and that is the point: the stated reason binds to nothing the caller can verify. A safety check that reads the model's own reasoning, the way chain-of-thought and transcript monitoring do, is trusting an anchor that does not hold.

What actually decides safety is the entire served system: the weights together with the system prompt, the guardrail and filter version, the sampling settings, and the serving path. The causal comparison runs through OpenRouter, whose route labels are its own self-report (§10), so the verified shape is two served systems behind one name reached through one router, not a confirmed split inside Google. The unobservable wrapper is exactly where the variance lives, and no weights-level check is scoped to see it.

A further anchor is just as soft. Holding one route and build fixed and flipping only a cue that labels the conversation as training data versus private moves harmful compliance from 24% to 4%.
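
One way to make "the entire served system" concrete is to record every component the abstract lists and hash them together, so that an evaluation result is tied to the full configuration and not only to the name. The sketch below is a minimal illustration under stated assumptions: the `ServedSystem` class, its field names, the route labels, the system prompt, and the sampling values are all hypothetical placeholders, not taken from the experiment. Only the model name and version string come from the text above. Fields the caller cannot observe, such as the guardrail version, are recorded as "unknown", which is itself the gap the abstract describes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ServedSystem:
    """Hypothetical record of what shapes a response beyond the model name."""
    model_name: str
    version_string: str
    system_prompt: str
    guardrail_version: str  # often unobservable to the caller; "unknown" if so
    temperature: float
    top_p: float
    serving_path: str  # e.g. a router's self-reported route label

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys) so equal configurations hash identically.
        payload = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Two illustrative routes that share a name and a version string.
route_a = ServedSystem(
    model_name="gemini-3.1-pro-preview",
    version_string="gemini-3.1-pro-preview-20260219",
    system_prompt="You are a helpful assistant.",  # placeholder prompt
    guardrail_version="unknown",
    temperature=0.0,
    top_p=1.0,
    serving_path="route-a",  # placeholder label
)
route_b = replace(route_a, serving_path="route-b")

same_name = route_a.model_name == route_b.model_name
same_version = route_a.version_string == route_b.version_string
same_system = route_a.fingerprint() == route_b.fingerprint()

print(f"same name: {same_name}, same version: {same_version}, "
      f"same served system: {same_system}")
```

In this sketch the name and version string match while the fingerprints differ, which mirrors the abstract's claim in miniature. A fingerprint of this kind only covers what the caller can log, so it narrows the problem without closing it: a hidden serving layer would still sit outside the hash.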
The

Article preview — originally published by LessWrong. Full story at the source.
