Scoopfeeds — Intelligent news, curated.

Maybe I was too harsh on deep learning theory

LessWrong · Apr 30, 2026, 6:57 AM

A few days ago, I reviewed a paper titled “There Will Be a Scientific Theory of Deep Learning”. In it, I expressed appreciation to the authors for writing the piece, but skepticism toward the stronger forms of its titular claim.

Since then I’ve spoken with various past collaborators (via text and in person), and read or reread quite a few deep learning theory papers, including the bombshell Zhang et al. 2016 and Nagarajan et al. 2019 papers that I wrote about on LessWrong.

And the thing is, parts of the infinite-width/depth-limit work turned out to be much more interesting than I had thought. Perhaps I have judged deep learning theory (a bit) too harshly.

A lot of my impression of the infinite-width and depth-limit work comes from the neural tangent kernel (NTK) / neural network Gaussian process line of research. This line of work starts from Radford Neal’s 1994 paper, which noted that an infinitely wide single-hidden-layer neural network with random weights is a Gaussian process. In 2017/2018, this was extended to deep networks: Lee et al. showed that a randomly initialized deep neural network, in a certain type of infinite-width limit, is also a Gaussian process. This was then extended by the Neural Tangent Kernel work, which described the training dynamics of these infinitely wide networks and showed that training is equivalent to kernel gradient descent with a fixed kernel (the eponymous neural tangent kernel). This allowed people to derive convergence properties and nontrivial generalization bounds.

Unfortunately, while beautiful, this is definitely not how neural networks learn. In the NTK limit, the network behaves as if it were doing linear regression over a fixed feature map of dimension equal to the number of parameters. Notably, there is no feature learning, and only the last-layer weights are updated by a noticeable amount. Unsurprisingly, this does not describe the behavior of real neural networks; small (finite-width) neural networks have been shown to outperform their infinite-width counterparts.
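To make Neal’s observation concrete, here is a minimal sketch of the argument (my notation, not the post’s). Take a one-hidden-layer network of width $N$ with i.i.d. random weights,

$$f(x) = \sum_{j=1}^{N} v_j\,\phi(w_j^\top x + c_j), \qquad v_j \sim \mathcal{N}(0, \sigma_v^2/N),$$

where the pairs $(w_j, c_j)$ are also i.i.d. Each summand is then an i.i.d. zero-mean random variable, so by the central limit theorem, as $N \to \infty$ the outputs at any finite set of inputs become jointly Gaussian: $f$ converges to a Gaussian process with mean zero and covariance

$$K(x, x') = \sigma_v^2\,\mathbb{E}_{w,c}\!\left[\phi(w^\top x + c)\,\phi(w^\top x' + c)\right].$$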
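For readers who prefer to see this numerically, here is a minimal sanity check of the same limit. This is my own illustration, not code from the post; the width, variances, ReLU nonlinearity, and test points are arbitrary choices.

```python
# Sample many wide one-hidden-layer networks with random weights (no bias
# term, for simplicity) and compare the empirical covariance of their outputs
# at two inputs against the limiting GP kernel
#   K(x, x') = sigma_v^2 * E[phi(w.x) phi(w.x')].
import numpy as np

rng = np.random.default_rng(0)
d, N, draws = 2, 4096, 2000
sigma_w, sigma_v = 1.0, 1.0
X = np.array([[1.0, 0.5], [-0.3, 2.0]])      # two test inputs x1, x2

outs = np.empty((draws, 2))
for i in range(draws):                        # one random network per draw
    W = rng.normal(0.0, sigma_w / np.sqrt(d), size=(N, d))
    v = rng.normal(0.0, sigma_v / np.sqrt(N), size=N)
    outs[i] = np.maximum(0.0, X @ W.T) @ v    # ReLU hidden layer, linear readout

print("empirical covariance across network draws:\n", np.cov(outs.T))

# Monte Carlo estimate of the limiting GP kernel, for comparison.
w = rng.normal(0.0, sigma_w / np.sqrt(d), size=(200_000, d))
phi = np.maximum(0.0, w @ X.T)                # shape (200000, 2)
print("limiting GP kernel (Monte Carlo):\n", sigma_v**2 * (phi.T @ phi) / len(w))
```

The two printed 2×2 matrices should approach each other as the width $N$ and the number of draws grow.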
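The “kernel gradient descent with a fixed kernel” claim can likewise be stated compactly (again a sketch in my notation, not the post’s derivation). Linearizing the network in its parameters around the initialization $\theta_0$,

$$f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0),$$

the features $\nabla_\theta f(\cdot\,; \theta_0)$ are frozen at initialization, which is exactly the “no feature learning” point. Defining the tangent kernel $\Theta(x, x') = \nabla_\theta f(x; \theta_0)^\top \nabla_\theta f(x'; \theta_0)$, gradient flow on the squared loss makes the training-set predictions $f_t$ follow a linear ODE:

$$\dot{f}_t = -\eta\,\Theta\,(f_t - y), \qquad f_t = y + e^{-\eta \Theta t}(f_0 - y),$$

so convergence rates can be read off from the eigenvalues of $\Theta$, which is the form the convergence and generalization results build on. The content of the infinite-width theorems is that $\Theta$ provably stays (essentially) constant throughout training.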

Article preview — originally published by LessWrong. Full story at the source.