Frontier Models Still Lag Behind Humans at Robust Belief-State Tracking

LessWrong · Jun 7, 2026, 12:12 AM · Also reported by 1 other source

Large-scale cooperation has been a central feature of humanity’s ability to advance technology and build complex societies. Much of this cooperation relies on the ability to act in ways informed by the beliefs and intentions of others. This capacity, also known as Theory of Mind (ToM), includes belief-state tracking: the ability to keep track of who knows what as information is exchanged in groups.

Belief-state tracking becomes increasingly important as AI systems are integrated into more collaborative environments. For example, for an AI personal assistant to successfully cooperate with other humans or agents across multiple channels, it needs to maintain a model of what each party already knows and what could be misunderstood.

In late 2023, Kim et al. published FANToM, a benchmark designed to stress-test machine theory of mind in conversational interactions. They found that LLMs at the time performed substantially worse than humans. After re-running a sampled version of FANToM on current frontier models, I find that, while models have improved substantially, they still trail human performance. This is especially noteworthy because FANToM is a comparatively simple benchmark for humans: it limits itself to single conversations on specific topics and requires no prior knowledge about participants.

FANToM Benchmark

Overview

First, some context on the benchmark. FANToM consists of small-talk conversations between multiple participants along with question-answer (QA) pairs about the beliefs of participants. For example, a participant may leave a conversation midway through, new information may be revealed while they're gone, and the participant may later return. This creates the kind of information asymmetry that you need to account for to facilitate effective cooperation.

Sample Benchmark Conversation

For each conversation, the authors include factual QA pairs (FactQ) about the inaccessible information.
For example: What is Li
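
To make the information asymmetry from the overview concrete, here is a minimal sketch of belief-state tracking for the leave-and-return scenario. It is an illustration only, not FANToM's data format or the authors' code: the `BeliefTracker` class, the participant names, and the facts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BeliefTracker:
    """Hypothetical helper: records which facts each participant was present to hear."""
    present: set[str] = field(default_factory=set)
    knows: dict[str, set[str]] = field(default_factory=dict)

    def join(self, name: str) -> None:
        # Rejoining keeps whatever the participant already knew.
        self.present.add(name)
        self.knows.setdefault(name, set())

    def leave(self, name: str) -> None:
        self.present.discard(name)

    def share(self, fact: str) -> None:
        # Only participants currently in the conversation hear the fact.
        for name in self.present:
            self.knows[name].add(fact)

    def is_aware(self, name: str, fact: str) -> bool:
        return fact in self.knows.get(name, set())


# Assumed example conversation (names and facts are made up).
tracker = BeliefTracker()
for person in ("Alice", "Bob", "Carol"):
    tracker.join(person)

tracker.share("the meeting is on Friday")
tracker.leave("Carol")                        # Carol steps out
tracker.share("the meeting moved to Monday")  # revealed while she is gone
tracker.join("Carol")                         # Carol returns
```

In this sketch, Carol still holds the earlier fact but has no access to the update shared while she was away, which is the kind of gap the benchmark's belief questions probe.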

Article preview — originally published by LessWrong. Full story at the source.
