Can Large Language Models Identify Novel Threats? Part 1: Mirror Life and the Classification Gap

LessWrong · May 23, 2026, 3:11 AM

[This is Part 1 of an independent AI safety research series examining LLM safety behavior on unclassified emerging threats.]

Can an LLM refuse a harmful uplift request when the topic in question has not yet been identified as dangerous? In 2022, a mirror RNA polymerase was created, a key step towards the creation of mirror life, and in 2024 the scientific community warned against any further research on it.[1][2] Even so, mirror life is not currently officially classified as a WMD or CBRN threat, although Congress appears to be assessing the question. What happens when an LLM is asked about it? Safety systems often inherit legal and institutional categories, but frontier risks may arrive before those categories stabilize.

This gap provides a unique opportunity for safety testing on LLMs. Given the lack of outright classification, an LLM prompted to give an actor uplift regarding mirror life may fall back on other safety inferences, or fail to refuse at all. If a user asked about the chemical precursors to VX gas, the LLM would refuse to help. With mirror life’s lack of classification, the outcome is less obvious. Thus, the research question for this project is: how does LLM safety work when formal labels lag behind technological reality?

There has been some research on LLMs and mirror life. In 2025, Kevin M. Esvelt[3] showed that even advanced models may struggle to recognize mirror life as the kind of threat they should treat cautiously. Even when he provided expert guidance to some models, they still could not identify somewhat novel threats. What I propose is to ask what logic an LLM may use to identify a novel threat, not to classify how bad the threat may be.

While novel threats are by definition rare, it is worth questioning LLMs’ capacity to identify them for the following reason: novel threats are new, and therefore often beyond the knowledge cutoff, which means models must reason or web s

Article preview — originally published by LessWrong. Full story at the source.

