Ambitious Mech Interp w/ Tensor-transformers on toy languages [Project Proposal]
This is my project proposal for Pivotal. Apply as a mentee by May 3rd.

The field has accumulated a vocabulary of computational primitives (induction heads, skip-trigrams) through post-hoc analysis. We propose building a toy language from these known primitives and training tensor-transformers on it (see an early example in the last section). This lets us study fundamental problems (suppression and error correction, compositionality and circuits, developmental interpretability, etc.) with the odds stacked in our favor:

- We know the data-generating process (DGP): what the bigram statistics, skip-trigrams, and induction patterns are, and how they interfere with each other.
- Tensor-transformers make compositionality clear as day: you can find relationships between any model components solely from the weights, whereas normal NNs require running data through the model.
- This is a transformer on a language task, so results learned here transfer straightforwardly to real LLMs.
- Modifiable complexity: we can change the complexity of the data, the number of layers, the width of the model, etc. In general, we can easily train a bespoke model to verify a target hypothesis.

Specific Research Directions

- Improve the DGP: extend the data-generating process to include computational patterns beyond n-grams (eg nested structure (brackets, quotes), long-range dependencies, context-sensitive transitions), learned from existing datasets.
- Interp-across-time: are there dependent structures during training (eg must the model learn X before learning Y)? (Most similar to work by Naomi Saphra.)
- Building interp tools: what techniques (existing or novel) can be used to find these ground-truth features?
- Phenomenon studies: use the controlled setup to characterize specific computational phenomena (suppression, error correction, compositional reuse) with ground-truth verification.
- Tensor interp: because we're using a tensor-transformer, there may be new techniques available to us (prior familiarity with tensor networks is a prerequisite for this direction).

High Level View

I'm sho
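To make the proposal concrete, here is a minimal sketch of what a toy DGP built from known primitives could look like. Everything here (the vocabulary, the probabilities, the specific skip-trigram rule) is an illustrative assumption, not the actual DGP we would use:

```python
import random

# Illustrative toy DGP combining three known primitives:
# bigram statistics, skip-trigrams, and induction patterns.
VOCAB = list("abcdefgh")

# Bigram transition table: next-token candidates given the current token
# (uniform here purely for brevity; a real DGP would use skewed statistics).
BIGRAMS = {t: VOCAB for t in VOCAB}

# Skip-trigram rule "a ... b -> c": seeing `a` earlier forces `b` to emit `c`.
SKIP_TRIGRAMS = {("a", "b"): "c"}

def sample_sequence(length=32, induction_prob=0.3, seed=0):
    rng = random.Random(seed)
    seq = [rng.choice(VOCAB)]
    while len(seq) < length:
        cur = seq[-1]
        # Induction: if the current token appeared earlier, sometimes
        # repeat whatever followed its most recent earlier occurrence.
        prev = [i for i in range(len(seq) - 1) if seq[i] == cur]
        if prev and rng.random() < induction_prob:
            seq.append(seq[prev[-1] + 1])
            continue
        # Skip-trigram: an earlier token plus the current token fix the next one.
        for (early, here), nxt in SKIP_TRIGRAMS.items():
            if here == cur and early in seq[:-1]:
                seq.append(nxt)
                break
        else:
            # Fallback: plain bigram sampling.
            seq.append(rng.choice(BIGRAMS[cur]))
    return "".join(seq)
```

Because the generator is explicit, every sampled sequence comes with a known mechanistic explanation of each token, which is exactly the ground truth the interp directions above would verify against.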
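The claim that compositionality can be read off the weights alone can be illustrated with a weights-only composition score in the style of the QK/OV circuit decomposition. The shapes, names, and the particular score below are illustrative assumptions, not the tensor-transformer formalism itself:

```python
import numpy as np

# Sketch: in an attention-only transformer, how strongly head 1's output
# feeds head 2's query side can be measured purely from weights,
# with no forward passes on data. Shapes/names are illustrative.
d_model, d_head = 16, 4
rng = np.random.default_rng(0)

W_V1 = rng.normal(size=(d_model, d_head))  # head 1 value
W_O1 = rng.normal(size=(d_head, d_model))  # head 1 output
W_Q2 = rng.normal(size=(d_model, d_head))  # head 2 query
W_K2 = rng.normal(size=(d_model, d_head))  # head 2 key

ov1 = W_V1 @ W_O1        # head 1's OV circuit, (d_model, d_model)
qk2 = W_Q2 @ W_K2.T      # head 2's QK circuit, (d_model, d_model)

# Q-composition score: norm of the composed circuit relative to its parts
# (Frobenius norms; submultiplicativity bounds this by 1).
score = np.linalg.norm(ov1 @ qk2) / (np.linalg.norm(ov1) * np.linalg.norm(qk2))
print(f"Q-composition score: {score:.3f}")
```

In a tensor-transformer the hope is that such relationships are even more direct, whereas for an ordinary trained network this kind of weights-only score is only a proxy that usually needs data-based validation.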