How did ‘large’ language models get that way? The role of Transformers and Pretraining in GPT


LessWrong · May 3, 2026, 9:35 PM

Large language models are really large. They're among the largest machine learning projects ever, and are set to be (perhaps, by some measures, already are) some of the largest computing and even infrastructure projects ever. But how did LMs actually get so large as to warrant the title ‘large language model (LLM)’? A large part of the answer is in the P ('pretrained') and the T ('transformer') of GPT.

This is part 1 of a series about LLM architecture and some of its implications, past and future, for reasoning. Part 1 is ‘how we got here’: what was so impactful about the transformer architecture for LLMs. Some readers may prefer to skip this. Part 2 will point at an unexpected benefit, the surprising explainability of contemporary AI reasoning, and why new trends might erode it. It's a novel point as far as I know.

Self-supervised learning: most of the cake

In 2016, early deep learning practitioner Yann LeCun introduced a famous analogy for learning intelligent systems, his cake analogy ('unsupervised', 'predictive', and 'self-supervised' are used somewhat interchangeably):

"If intelligence is a cake, the bulk of the cake is unsupervised learning, the icing on the cake is supervised learning, and the cherry on the cake is reinforcement learning (RL)."

This correctly described what was, for a time, the prevailing practice for training LLMs, before the first LLMs were created! (Though by 2025, certainly 2026, it has become outdated; more on that later.)

What does this mean, and how did LeCun come to this position? Machine learning, especially deep learning, needs example data.

Old-school supervised learning

You might think to start (and historically, researchers mostly did) by curating labelled examples: ‘this image is a cat, this one a dog; the French for "a bed" is "un lit"; …’. The machine learns these labels and, hopefully, also learns how to generalise correctly, responding appropriately to similar-enough previously-unseen cases, which is what you actually want.

This supervised a…
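The preview breaks off here, but the contrast it is building toward can be made concrete. Below is a minimal sketch (my illustration, not code from the article) of the two data regimes: supervised learning, where every example needs a hand-curated label, versus the self-supervised next-token objective behind the P of GPT, where raw text labels itself. The function name, example strings, and whitespace tokenizer are assumptions for brevity; real models use learned subword tokenizers.

```python
# Supervised learning: every example needs a hand-curated (input, label) pair,
# so the dataset grows only as fast as human labelling effort.
labelled_examples = [
    ("photo_001.jpg", "cat"),     # image classification
    ("photo_002.jpg", "dog"),
    ("a bed", "un lit"),          # translation pairs
]

def self_supervised_pairs(text: str) -> list[tuple[str, str]]:
    """Turn raw, unlabelled text into (context, next-token) training pairs.

    This is the gist of the GPT-style pretraining objective: the data labels
    itself, so essentially any text becomes training signal at scale.
    Whitespace tokenization is a simplification for illustration.
    """
    tokens = text.split()
    return [(" ".join(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

for context, target in self_supervised_pairs("the cat sat on the mat"):
    print(f"{context!r} -> {target!r}")
# 'the' -> 'cat'
# 'the cat' -> 'sat'
# ...
# 'the cat sat on the' -> 'mat'
```

The design point is the one LeCun's cake gestures at: the self-supervised pairs cost nothing to label, which is why that layer of the cake can be so much bigger than the supervised icing.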

Article preview — originally published by LessWrong. Full story at the source.
Read full story on LessWrong →

