Paraphrasing Is (At Best) a Partial Defence Against Steganography in LLMs
Within the AI safety community, paraphrasing (which, in the context of this post, simply means using another LLM with nonzero temperature to rewrite a given piece of content) is generally considered a viable defence and detection method against steganography in LLMs. In this blogpost, we briefly provide a taxonomy of types of steganography in LLMs, and then highlight the limitations of paraphrasing against each type.

Unfortunately, there are types of steganography that LLMs have been shown to use that paraphrasing cannot detect or defend against at all. Against other types, the performance of paraphrasing is unreliable and often confounded by other factors, e.g., loss of semantic information.

Three Types of Steganographic Channels in Text

When people say "steganography in text," they could mean quite different things. For the purpose of analysing which defences work against which channels, we introduce the following working distinction.

1. Structural Steganography

In structural steganography, the information is hidden in structural aspects of the content, such as how the text is formatted or represented.

Canonical examples: extra whitespace or tab characters between words, zero-width Unicode characters inserted into text, and homoglyph or synonym substitution (replacing Latin "a" with Cyrillic "а", which is visually identical but has different byte values).

LLMs can already do this. Skaf et al. (2025) train Qwen-2.5-3B via RL (Section 3.1, Arithmetic Tasks) to solve reasoning tasks while avoiding a regex-based monitor that penalises "illegal strings" in the chain of thought. In a multiplication task where "*" is banned, the model learns to use U+2217 "∗", a visually identical but distinct Unicode character not on the illegal-strings list. In a coin-flip task where "H" and "T" are banned, the model discovers the fullwidth characters "H" and "T". In a multi-hop reasoning task where "tumpus" is banned, the model learns to misspell it as "tumpsus". In each case, the model prese
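To make the structural channel concrete, here is a minimal sketch (not from any of the cited papers; the encoding scheme and helper names are our own illustrative choices) of hiding a bit string in cover text using zero-width Unicode characters, one of the canonical examples above:

```python
# Illustrative sketch of a structural steganographic channel:
# hide bits in zero-width Unicode characters appended to words.
ZWSP = "\u200b"  # zero-width space      -> encodes bit "0"
ZWNJ = "\u200c"  # zero-width non-joiner -> encodes bit "1"

def embed(cover: str, bits: str) -> str:
    """Append one zero-width marker per word, encoding one bit each."""
    words = cover.split(" ")
    assert len(bits) <= len(words), "cover text too short for payload"
    out = []
    for i, word in enumerate(words):
        marker = (ZWSP if bits[i] == "0" else ZWNJ) if i < len(bits) else ""
        out.append(word + marker)
    return " ".join(out)

def extract(stego: str) -> str:
    """Recover the hidden bit string by scanning for zero-width markers."""
    bits = []
    for ch in stego:
        if ch == ZWSP:
            bits.append("0")
        elif ch == ZWNJ:
            bits.append("1")
    return "".join(bits)

cover = "the quick brown fox jumps over the lazy dog"
stego = embed(cover, "10110")
print(extract(stego))  # -> "10110"
# The stego text renders identically to the cover text, but the bytes differ:
print(stego == cover)  # -> False
```

Note that a byte-level monitor (or a paraphraser that re-tokenises the text) trivially destroys this particular channel, which is exactly why structural steganography is the case where paraphrasing-style defences are most plausible; the harder cases come later in the taxonomy.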