Naturally learned behaviors in deep MLPs resist detection by both human and learned algorithms
Introduction. An ambitious problem in mechanistic interpretability of neural networks is finding an input for a neural network that produces a certain output (Eliciting bad contexts), however this is shown to not have a general computationally tractable solution that works in cases such as when the network acts as a verifier for an NP hard problem. However, a general algorithm might not be necessary in practice, as "Eliciting bad contexts" suggests. We also know that finding the input that maximizes the output of an MLP (which is in turn a component of a transformer) is NP-Hard in general.However if the MLP is trained naturally, in a sense that: It is trained by gradient descent on a dataset, rather than its weights being designed by an adversarial algorithm The expected inductive bias of the model during this training process (rather than, e.g, its weight initialization) leads to it reserving a certain behavior of interest to a small class of inputs Nothing in the training setup suggests that finding an input that causes the specific behavior that we want to analyze would be hard if the model were to be a traditional algorithm. (e.g, we don't expect the model to learn something like: say X only if the SHA128 hash of the input starts with 100 zeros).Then we might hope that the input would be easy to find using a (possibly unknown) mechanistic interpretability algorithm.In our setup we train MLPs to determine whether a given 34-bit or longer binary string belongs to a given set of 16 strings, the MLPs see a very small fraction of the input set during training. During evaluation, only these 16 strings and possibly a small (<16) number of false positives produce a high enough output in these MLPs. Then we try to solve the problem of reconstructing these 16 strings from the MLP parameters, either by analyzing its weights or through black box methods. While this is a toy setup, solving the eliciting bad contexts problem on LLMs in tractable scenarios, such as finding jailb