Model Size Scaling in 2023-2031

LessWrong · Jun 22, 2026, 11:07 PM

Token generation speed is constrained by how fast the relevant contents of HBM, mostly the weights and KV-cache, can be read. Suppose a model is large enough that a single pass over the weights reads more than half of HBM, that the HBM is read in parallel within a scale-up system, and that N such systems are used in a pipeline. Then the time it takes to generate a token (without speculative decoding) is at least N times the time it takes to read more than half of an HBM stack. If we target a particular speed of token generation, this constrains the number of pipeline stages, which in turn constrains the total params of the model. But if there isn't enough pretraining compute, models will remain smaller than this constraint allows (lower sparsity at a given number of active params buys a higher speed of token generation), so both constraints should be taken into account.

Working through these considerations gives the model sizes feasible for each year between 2023 and 2031. The total params go from 10T in 2026 (at 8x sparsity, still constrained by Oberon racks, trained for 1.3e27 FLOPs) to 240T in 2028 (at 30x sparsity, with Kyber racks more than sufficient for the available pretraining compute) and then to 1.4 quadrillion in 2031 (at 30x sparsity, served with 8x Kyber Feynman scale-up systems, trained for 2.2e29 FLOPs). Starting in 2027, model sizes are further inflated by the lack of sufficient pretraining data, with the models of 2031 having to be 4x bigger than they would be with unlimited training data. Many assumptions go into these estimates, and I will state them as they come up.

Time to Fully Read an HBM Stack

H100 has 5 stacks of 8-Hi HBM3 per compute die (2 GB per DRAM die, 0.8 TB/s per stack), which take 20 ms to fully read.
H200 has 6 stacks of 12-Hi HBM3 [1] per compute die, 30 ms to read.
B200/GB200 has 4 stacks of 8-Hi HBM3E per compute die (3 GB per DRAM die, 1.0 TB/s per stack), 24 ms to read.
GB300 has 4 stacks of 12-Hi HBM3E, 36 ms to read.
Non-Ultra Rubin chips of 2026
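The read times quoted above for H100 through GB300 are consistent with dividing a stack's capacity (dies per stack times GB per die) by its bandwidth, and the latency bound from the opening paragraph is then a multiple of that read time. Here is a minimal sketch of that arithmetic. The class and function names are hypothetical, the 0.5 default for `fraction_read` is only the lower limit implied by "more than half", and the H200 and GB300 per-die capacity and bandwidth are assumed to match H100 and B200 respectively, since the text does not restate them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HbmStack:
    """One HBM stack, described by the figures quoted in the text."""
    name: str
    dies_per_stack: int   # "8-Hi" means 8 DRAM dies in the stack
    gb_per_die: float     # capacity of one DRAM die, in GB
    tb_per_s: float       # read bandwidth of one stack, in TB/s

    @property
    def capacity_gb(self) -> float:
        return self.dies_per_stack * self.gb_per_die

    @property
    def read_time_ms(self) -> float:
        # GB divided by TB/s is numerically milliseconds (1 TB = 1000 GB).
        return self.capacity_gb / self.tb_per_s


def min_token_latency_ms(stack: HbmStack, pipeline_stages: int,
                         fraction_read: float = 0.5) -> float:
    """Lower bound on time per token: each of the N pipeline stages must
    read at least `fraction_read` of a stack (assumed 0.5 here)."""
    return pipeline_stages * fraction_read * stack.read_time_ms


def max_pipeline_stages(stack: HbmStack, target_tokens_per_s: float,
                        fraction_read: float = 0.5) -> int:
    """Largest N whose latency bound still fits the per-token time budget."""
    budget_ms = 1000.0 / target_tokens_per_s
    return int(budget_ms // (fraction_read * stack.read_time_ms))


# Figures from the text; H200 and GB300 reuse the per-die capacity and
# bandwidth of H100 and B200 respectively (an assumption of this sketch).
H100 = HbmStack("H100, 8-Hi HBM3", 8, 2.0, 0.8)
H200 = HbmStack("H200, 12-Hi HBM3", 12, 2.0, 0.8)
B200 = HbmStack("B200/GB200, 8-Hi HBM3E", 8, 3.0, 1.0)
GB300 = HbmStack("GB300, 12-Hi HBM3E", 12, 3.0, 1.0)

for stack in (H100, H200, B200, GB300):
    print(f"{stack.name}: {stack.capacity_gb:.0f} GB per stack, "
          f"{stack.read_time_ms:.0f} ms to fully read")
```

Under these assumptions the sketch reproduces the 20, 30, 24, and 36 ms figures. A call such as `max_pipeline_stages(H100, 30)` then shows how a target generation speed caps the number of pipeline stages, which is the step that feeds into the constraint on total params.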

Article preview — originally published by LessWrong. Full story at the source.