The Pragmatic Interpretability Trap
TLDR: Pragmatic interp sounds great: you get to keep interp tools while actually moving safety metrics. Look a bit closer, though, and it's kind of a trap. You pay interp's overhead but get judged against black-box baselines that don't, so the work that survives is whatever cleared that bar, not whatever produced understanding. The two scoreboards (understanding vs. intervention) don't actually run side by side; the metric one eats the other, and approaches like NLAs end up failing both: you can't trace which part of an activation drove the explanation, and the verbalizer can hallucinate the very cognition you're trying to monitor. Either commit to understanding the machine or commit to moving the metric.

There's a worldview I've been seeing more of lately, and it's the one I want to critique here: "A Pragmatic Vision for Interpretability", the case that the field should pivot from ambitious mech interp to pragmatic interp, with progress measured through empirical feedback on safety-relevant proxy tasks. I think the appeal is real, but it's quietly broken in a way worth being explicit about.

I get the appeal. Pure mech interp has been a slow-burn bet that genuinely understanding neural nets will eventually pay off in ways you can't specify in advance, and that's a hard sell when safety pressure is high and timelines are short; mech interp also isn't necessarily the answer to safety/alignment. So you get this hybrid pitch: keep using interp-flavored tools (probes, SAEs, steering vectors, activation oracles), but evaluate success by whether a chosen downstream metric actually moves.

The three camps as I see them: after reading the post (and inspired by the 2x2 matrix in its appendix), I think there are three main stances. The first is the full mech interp worldview, where the goal is to just fully understand neural networks (white-box understanding), betting that genuine full understanding eventually pays out in