One axis, two features: how I solved the first BlueDot puzzle and how a classifier hid country on the food direction
In this post I walk through the first Technical AI Safety puzzle from BlueDot and explain why linear probes would have missed the most interesting part.

Model interpretability has a recurring paradox: the most interesting result is often the thing you didn't think to look for, and the only reason you find it is that you kept asking "what else could this be?" and "how else can this be investigated?" For me, it was the discovery that a small text classifier packed two completely independent features onto one direction in activation space: you read one from the sign of the projection and the other from its magnitude. A linear probe sees the first and is blind to the second. The second can still be found with a second-order (quadratic) boundary. This is a walkthrough of how I got there: I'll give the algorithm and the steps as a chain of hypotheses that I validated to solve the riddle.

The notebook and full code are available at https://github.com/IgorPereverzevDev/bluedot-puzzle

The setup

In the puzzle we're given a small classifier. The encoder, all-MiniLM-L6-v2, turns text into a 384-dimensional vector via mean pooling; that vector passes through a 5-layer MLP with ReLU activations and ends in 8 sigmoids, one per feature.

The eight features are simple, surface-level properties of a short text:

number
question
color
food
sentiment
country
person
body_part

Each of the eight features is binary (yes/no), and the model predicts all of them with accuracy above 95%. The features are independent labels, so a single text can trigger several at once: "Should I eat pizza?" is both a question and about food, so both labels are 1 at the same time. That's why the eight outputs are independent probabilities that don't sum to one, unlike a softmax over mutually exclusive classes. The model was trained in the usual way, separately for each of the eight features. We want to peek inside one of the hidden layers: the output of the ReLU after the second layer.
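To make the setup concrete, here is a minimal, dependency-free sketch of the forward pass. It is not the puzzle's code. The weights are random and untrained, the token vectors are random stand-ins for the encoder's output, and the layer widths other than 384 (input), 64 (after the second layer) and 8 (outputs) are assumptions made purely for illustration. The helper names (`mean_pool`, `forward`, `capture_after`) are hypothetical.

```python
import math
import random

# Only 384, 64 (after layer 2) and 8 come from the post;
# the remaining widths are illustrative placeholders.
LAYER_SIZES = [384, 128, 64, 32, 16, 8]
FEATURES = ["number", "question", "color", "food",
            "sentiment", "country", "person", "body_part"]


def mean_pool(token_vectors):
    """Average token vectors into one sentence vector (no attention mask)."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def init_layers(sizes, seed=0):
    """Random (untrained) weights and zero biases, one pair per linear layer."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def linear(x, weights, bias):
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, layers, capture_after=2):
    """Return (probs, hidden); hidden is the post-ReLU output of
    layer `capture_after` (1-indexed)."""
    hidden = None
    for i, (weights, bias) in enumerate(layers, start=1):
        x = linear(x, weights, bias)
        if i < len(layers):          # ReLU on every layer except the last
            x = relu(x)
            if i == capture_after:
                hidden = list(x)
    # Eight independent sigmoids: each is its own yes/no probability.
    return [sigmoid(z) for z in x], hidden


rng = random.Random(1)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(384)] for _ in range(5)]
layers = init_layers(LAYER_SIZES)
probs, hidden = forward(mean_pool(tokens), layers)

print(len(hidden), len(probs))  # 64 8
print(dict(zip(FEATURES, (round(p, 3) for p in probs))))
```

Because each output goes through its own sigmoid, nothing forces the eight probabilities to sum to one, which is what lets "Should I eat pizza?" score high on both question and food. Here `hidden` stands for the layer we look into next: the post-ReLU output of the second layer.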
That layer is a 64-dimensional activation space, and it holds an interesting find: seven features out of eigh