Google's DiffusionGemma generates 256 tokens in parallel and self-corrects as it goes

VentureBeat AI · Jun 11, 2026, 3:16 PM · Also reported by 3 other sources

Why this matters: a development in AI with implications for how people work, create, and decide.

Gen AI image generators like Stable Diffusion do not draw a picture pixel by pixel from left to right. They start with noise and iteratively refine the entire image in parallel until it converges, in a process known as diffusion. For years, applying that same principle to text generation had remained out of reach at scale.Standard language models work like a typewriter: one token at a time, left to right, with no ability to revise a committed output. That pattern works in the cloud, where batch sizes keep GPUs saturated. For local inference or low-concurrency deployments, the GPU is idle most of the time.Google's DiffusionGemma, released this week, is an open source experimental model that applies diffusion to text generation at production scale. Built on the Gemma 4 backbone and released under the Apache 2.0 license, it is the first diffusion language model natively supported in the open source vLLM inference platform. It generates a 256-token block in parallel rather than sequentially, with every token position attending to every other. Google says DiffusionGemma generates text up to 4x faster than standard models on GPUs. At batch size 1 on a single Nvidia H100, the FP8 version reaches 1,008 tokens per second. On H200, it hits 1,288 — roughly six times a standard autoregressive baseline, according to vLLM benchmark results published today.Despite the speed gains, Google did not oversell the release. The company's launch post acknowledged directly that DiffusionGemma's overall output quality is lower than standard Gemma 4, adding "For applications that demand maximum quality, we recommend deploying standard Gemma 4."What DiffusionGemma doesDiffusionGemma does not generate tokens in order. It starts with a block of 256 random placeholder tokens, effectively a blank canvas, and runs multiple refinement passes over the entire block at once. On each pass, it evaluates every position and locks in the ones it is most confident about. Uncertain positions ge

Article preview — originally published by VentureBeat AI. Full story at the source.

Read full story on VentureBeat AI → More top stories

Also covered by

Decrypt Google's DiffusionGemma AI Hits 1,000 Tokens Per Second—And It's Free Investing.com Google launches DiffusionGemma with 4x faster text generation Ars Technica Google's latest DiffusionGemma open AI model comes with a 4x speed boost

Aggregated and edited by the Scoop newsroom. We surface news from VentureBeat AI alongside other reporting so you can compare coverage in one place. Editorial policy · Corrections · About Scoop

Google's DiffusionGemma generates 256 tokens in parallel and self-corrects as it goes

Also covered by

More in ai