Matrix Orthogonalization Improves Memory in Recurrent Models
Key takeaways
- But for some domains, we can't afford the quadratic-attention overhead of transformers.
- The best known RNN for associative recall is mLSTM, a variant of LSTM that maintains a matrix memory. mLSTMs demonstrate substantially improved recall over baselines on one benchmark, MQAR.
- Since MQAR doesn't measure NAR, we can look at MAD's noisy AR task suite.
Transformers exhibit remarkable associative recall (AR) abilities: attention provides each token direct access to those preceding it, a mechanism that has been hard for other architectures, like recurrent neural networks (RNNs), to match.
But for some domains, we can't afford the quadratic-attention overhead of transformers. One example is long-horizon RL, in the style of Dreamer. For these kinds of applications, we need to make recurrent neural networks work, but don't want to give up on associative recall.
The best known RNN for associative recall is mLSTM, a variant of LSTM that maintains a matrix memory. mLSTMs demonstrate substantially improved recall over baselines on one benchmark, MQAR. But pure recall may not be sufficient to measure recurrent performance. In fields where environment transitions can be noisy, a useful proxy test is noisy associative recall (NAR).