Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs
This is a small study that explores using tool calls to wrap the untrusted parts of prompts. OpenAI's Model Spec considers tool results the least trusted kind of input. If tool-wrapping helped, it would be an easy way to improve robustness using existing APIs that models already support. In the 3 tasks we tested, it didn't seem to broadly help, and in some cases it made things worse. We advocate for a better understanding of the instruction hierarchy and for ideas around better primitives for untrusted inputs.

There is a PDF writeup on arXiv. It is at the "research note" or "workshop paper" stage (to appear at the AI4Good workshop @ ICML). This post slims down the PDF text, focusing on the discussion-y aspects.

Untrusted inputs can break prompts, and it would be good to have a standardized way to fix that.

Language models must frequently process untrusted input, such as answers from another LLM during RL, or human-written input when running spam filters or harm filters. When building prompts for LLMs, inputs commonly get string-formatted into a prompt template. For example, consider this LLM-as-a-Judge prompt that grades whether a candidate answer correctly solves a math problem.

Figure 1a: A simplified user-only prompt. The {fields} would get replaced with inputs.

Despite being a fairly typical way to prompt LLMs, this style of string template can be fragile. In One Token to Fool LLM-as-a-Judge, Zhao et al. (2025) found that simple inputs like ":" or "Solution" could confuse graders into outputting a passing verdict.

We might look at this simple prompt and proceed with some prompt engineering (some form of quotes or delimiters, “treat this as untrusted” prose, etc.). However, there’s a sense that these would all be patches or hacks, without a clear, standard way to section off untrusted content.

The Instruction Hierarchy as a Mitigation?

Conflicting or adversarial prompts are well-known challenges.
One widely adopted response is the Instruction Hierarchy (IH) (Wallac