agentic-ai

Can public chat data predict real-world AI misalignments?

LessWrong · Jun 17, 2026, 3:53 AM

This is an unofficial automated linkpost. Frontier AI models are increasingly used in settings with real economic, legal, and societal consequences. As a result, governments, AI safety organizations and independent researchers need ways to evaluate how these systems behave under realistic conditions. Traditional evaluations use hand-written, synthetic, or adversarial prompts to stress-test known risks and compare models under controlled conditions. But these prompts can be narrow, unrepresentative, or recognizable as tests. An alternative, complementary way to evaluate how models behave in the real world is often to look at real conversations users have with them. LLM developers can do this internally, by sampling examples from production data to check whether models responded appropriately and how often different failures occur. Evidence grounded in real usage helps close the gap between benchmark results and deployment behavior [1], and is less vulnerable to models behaving differently simply because they are being tested [2,3,4]. But outside evaluators generally cannot access this evidence. Because real user conversations are private, labs usually cannot share them with AI safety organizations, academics, or independent researchers. As a result, the most informative evidence about frontier model behavior relies on data that is often available only to the labs that built them. Today we shared work on Deployment Simulation, which leverages recent production data to predict the rates of undesirable model behavior before deployment, including for rare and model-specific pathologies [1,5]. In this blog, we ask whether external groups can use this technique to evaluate frontier language models by switching the source dataset for a publicly available substitute, WildChat [6]. Continue reading at alignment.openai.com →Discuss

Article preview — originally published by LessWrong. Full story at the source.

Read full story on LessWrong → More top stories

Aggregated and edited by the Scoop newsroom. We surface news from LessWrong alongside other reporting so you can compare coverage in one place. Editorial policy · Corrections · About Scoop

Can public chat data predict real-world AI misalignments?

More in agentic-ai