Flipping the eval on its head
An eval is a product. Typically, its 1 x n or k x n where there are n samples and 1 or k different language models. This briefing will argue that we’d like to see k x n x m evals, or however many dimensions.This post is pitching an ambitious way to spend tokens on cyberhardening. If its not viable at current capabilities/costs, it may be viable next year or in six months.Hardening There are broadly three approaches to cyberhardening with secure program synthesis or uplifted formal methods.You can ask claude for a vuln and then ask claude for a patch. This red-blue loop seems roughly what Glasswing is doingYou can retrofit some proof stack. In Rust, this looks like inserting Verus annotations or asking Aenaeus to generate Lean from your Rust source (then filling in your specs and proofs). This is what the Signal team at BAIF is trying to do. This is still effectively a redblue loop, but where the vulns and hints about patches are sourced deductively rather than from language model inference. There’s a broader non-FMy version of this where the vulns and hints come from fuzzers, decompilers, etc, which is also in scope.You can reroll from greenfield in a proof-native manner. See lean-gzip, discussed below.It is very good and very exciting that Glasswing (as well as companies like AISLE, I think) are doing the first thing. Are the other two things desirable? The point of the current post is to say that this is measurable.Multidimensional evals or reverse evals: vary things other than the modelTypically, we vary the model to test different models on all the samples, where the samples might involve some scoring process. In the case of cyberhardening evals, the scoring process involves the codebase under study itself. If you can swap out any implementation of a spec/contract to use the LLM redteam to inspect and compare security properties, much like you would use a CPU to inspect and compare performance characteristics.Example: box-arenaThe SPS team[1] at Apart’s AI Contro