
A preliminary experiment regarding consistency as a measure of conceptual abilities in language models

LessWrong · Jun 17, 2026, 10:56 PM

Cross-posting from my coworker Caspar Oesterheld's blog, which I think is great and not widely known.

I’ve recently been working a little on whether consistency across different questions can be used as a measure of (and perhaps ultimately as a training target for) philosophical competence. I’m in the process of writing up the results into a paper. Here I’m reporting results from a small, preliminary experiment that I ran late last year. I’ll leave a more careful discussion (with a proper introduction, related work and so on) to that paper. I’ll also take it for granted that it’s valuable to make models good at philosophy and similar conceptual topics. (My colleagues and I will write more about this soon.)

What I mean by consistency. By consistency, I mean properties like this: if you ask a (prompted/fine-tuned/scaffolded) model two equivalent versions of the same question, it’ll give similar answers on average over multiple samples. The experiments I report here are based only on this type of equality-on-average constraint. But it’s not that difficult to come up with other types of constraints between prompts. (E.g., consider the following two questions: (1) “How confident are you that one should one-box in Newcomb’s problem?” (2) “How confident are you that evidential decision theory (EDT) is correct?” Taking for granted that EDT implies one-boxing, the model’s confidence for (1) should be at least as high as its confidence for (2).)

Why consistency for philosophical competence. In general, making models better at philosophy (and other conceptual topics) is constrained by the lack of reliable, scalable reward/evaluation signals. For example, it’s unclear whether insects are conscious, whether evidential or logical-causality-based approaches to Newcomb-like decision problems are more promising, etc. So it’s very unclear how to reward models for how they respond to such questions.

The hope for consistency is that it can serve as a reliable, scalable reward signal. For reliability: While we are uncertain and oft
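The two kinds of constraint described above (equality on average between equivalent phrasings, and an ordering between a claim and something that implies it) can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the scoring used in the experiment: the function names `equality_gap` and `implication_violation` are hypothetical, the use of a plain difference of means is an assumption, and the sample confidences are made-up numbers rather than results.

```python
from statistics import mean


def equality_gap(samples_a, samples_b):
    """Absolute difference between the mean answers to two equivalent prompts.

    0 means the model is consistent on average across the two phrasings.
    """
    return abs(mean(samples_a) - mean(samples_b))


def implication_violation(conf_implied, conf_stronger):
    """Amount by which mean confidence in the stronger claim exceeds
    mean confidence in the claim it implies; 0 if the ordering holds.
    """
    return max(0.0, mean(conf_stronger) - mean(conf_implied))


# Hypothetical sampled confidences (illustrative only, not data from the post).
one_box_phrasing_1 = [0.70, 0.80, 0.75]  # "Should one one-box?", first phrasing
one_box_phrasing_2 = [0.60, 0.65, 0.70]  # an equivalent rephrasing
edt_correct = [0.80, 0.85, 0.90]         # "Is EDT correct?"

gap = equality_gap(one_box_phrasing_1, one_box_phrasing_2)
violation = implication_violation(one_box_phrasing_1, edt_correct)

print(f"equality gap: {gap:.3f}")
print(f"implication violation: {violation:.3f}")
```

In this toy data the model is more confident in EDT than in one-boxing, so the second score is positive and flags an inconsistency without anyone having to decide which of the two answers is right.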

Article preview — originally published by LessWrong. Full story at the source.


