Predicting Rare LLM Failures with 30× Fewer Rollouts

LessWrong · May 13, 2026, 5:53 PM

TL;DR: We estimate how often Qwen 3 4B exhibits rare harmful behaviors with 30× fewer rollouts than naive sampling, using a new method that interpolates between the model and a less-safe variant in logit space.

Authors: Francisco Pernice (MIT), Santiago Aranguri (Goodfire)

Introduction. A harmful behavior that occurs once in a million rollouts will rarely surface during pre-deployment testing, yet will almost inevitably appear after release. Labs usually have another resource available: less safety-trained variants of the same model, on which rare harmful behaviors are not rare at all.

We find that the rate of rare harmful behaviors can be predicted by leveraging a less safe variant, requiring 30× fewer rollouts compared to naive sampling. Our method, Logit Path Extrapolation, interpolates between the two models in logit space, measures the compliance rate at points along the interpolation path where it is common, and extrapolates the resulting trend out to the original model. This outperforms the methods from prior work[1][2] by exploiting the path between the two models instead of just the endpoints.

Results

We can use 30× fewer rollouts to estimate how often Qwen 3 4B complies with harmful requests. We consider HarmBench prompts where the model never complies in 100 rollouts, then split them by whether it complies at least once in 100,000 rollouts. We measure how well different methods perform at classifying these two groups via AUROC for a given rollout budget. Logit Path Extrapolation matches the AUROC of naive sampling with up to 30× fewer rollouts and beats importance sampling, the prior state-of-the-art method (Figure 1).

Figure 1: Logit Path Extrapolation achieves naive sampling's AUROC with 30× fewer rollouts. AUROC as a function of rollout budget for three methods: Logit Path Extrapolation (blue), importance sampling (gray), and naive sampling (black).

Logit Path Extrapolation consists of two steps: interpolate logits between two models to build a path, then extra
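The preview cuts off before the method's details, so the following is only a minimal sketch of the interpolate-then-extrapolate idea, not the authors' implementation. It assumes a linear mix of the two models' logits with a weight `alpha` (0 for the original model, 1 for the less-safe variant) and a straight-line fit of the log-odds of compliance against `alpha`. Both choices are illustrative assumptions, as are the function names and the toy one-step "model" with two outcomes (comply, refuse).

```python
import numpy as np


def interpolate_logits(safe_logits, unsafe_logits, alpha):
    """Assumed linear path in logit space: alpha=0 is the original model,
    alpha=1 is the less-safe variant."""
    safe_logits = np.asarray(safe_logits, dtype=float)
    unsafe_logits = np.asarray(unsafe_logits, dtype=float)
    return (1.0 - alpha) * safe_logits + alpha * unsafe_logits


def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    weights = np.exp(logits - np.max(logits))
    return weights / weights.sum()


def extrapolate_rate(alphas, rates, target_alpha=0.0):
    """Fit log-odds(rate) as a line in alpha (an assumed trend), then
    evaluate the line at target_alpha and map back to a probability.
    Rates must lie strictly between 0 and 1."""
    alphas = np.asarray(alphas, dtype=float)
    rates = np.asarray(rates, dtype=float)
    log_odds = np.log(rates) - np.log1p(-rates)
    slope, intercept = np.polyfit(alphas, log_odds, deg=1)
    z = intercept + slope * target_alpha
    return float(1.0 / (1.0 + np.exp(-z)))


# Hypothetical toy setup: a single decision with logits [comply, refuse].
SAFE_LOGITS = np.array([-12.0, 0.0])    # compliance is extremely rare
UNSAFE_LOGITS = np.array([2.0, 0.0])    # compliance is common
ALPHAS = np.array([0.6, 0.7, 0.8, 0.9])  # path points where compliance is common
ROLLOUTS_PER_POINT = 2000


def compliance_prob(alpha):
    """Exact compliance probability of the toy model at a path point."""
    return softmax(interpolate_logits(SAFE_LOGITS, UNSAFE_LOGITS, alpha))[0]


# Simulate rollouts at each path point and record the observed compliance rate.
rng = np.random.default_rng(0)
measured = np.array([
    rng.binomial(ROLLOUTS_PER_POINT, compliance_prob(a)) / ROLLOUTS_PER_POINT
    for a in ALPHAS
])

estimate = extrapolate_rate(ALPHAS, measured)  # predicted rate at alpha=0
true_rate = compliance_prob(0.0)
print(f"extrapolated rate: {estimate:.2e}, true rate: {true_rate:.2e}")
```

In this toy the log-odds are exactly linear in `alpha`, so the extrapolation is limited only by sampling noise. Real models need not follow that trend, and the functional form used in the original work may differ.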

Article preview — originally published by LessWrong. Full story at the source.