1 / 5
THE NEXT AI PARADIGM

Diffusion LLMs
Explained

How a new architecture is making AI 10–20x faster — and why PMs should care


Two Ways to Generate Text

× AUTOREGRESSIVE
The cat sat on the mat
  • Generates one token at a time, left to right
  • Each word waits for the previous one to finish
  • Speed bottleneck: sequential by design
"Like typing one letter at a time"
DIFFUSION
[noise] [rough] [clear] [final]
  • Generates ALL tokens simultaneously in parallel
  • Starts with noise, iteratively refines the whole text
  • Speed breakthrough: parallel generation
"Like a photo developing — everything sharpens at once"
THE KEY INSIGHT
Autoregressive = writing left to right. Diffusion = painting the whole canvas at once, refining until it's sharp.
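The two loop shapes can be sketched in a few lines — a toy illustration, not a real model: the "model" here is a lookup that always guesses right, so only the structure of the loops matters.

```python
SENTENCE = ["The", "cat", "sat", "on", "the", "mat"]

def model(position: int) -> str:
    # Stand-in for a real LLM's prediction at one position.
    return SENTENCE[position]

def autoregressive() -> list[str]:
    # Sequential: token i cannot start until tokens 0..i-1 exist.
    out: list[str] = []
    for i in range(len(SENTENCE)):
        out.append(model(i))  # one model call per token, in order
    return out

def diffusion(steps: int = 3) -> list[str]:
    # Parallel: every position is (re)predicted on every step.
    out = ["[noise]"] * len(SENTENCE)
    for _ in range(steps):
        out = [model(i) for i in range(len(out))]  # whole canvas at once
    return out
```

Both return the same sentence; the difference is that the autoregressive loop has one sequential dependency per token, while the diffusion loop has only `steps` sequential dependencies regardless of length.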
2 / 5
How Diffusion LLMs Work

The Denoising Process
1
Start with Noise
Begin with completely random tokens filling every position
xk m# q! z@ p& w*
2
First Pass — Structure Emerges
Model predicts what each token should be, using bidirectional context
The m# sat z@ the w*
3
Refinement — Details Sharpen
Each iteration improves ALL tokens in parallel, not just one
The cat sat on the m.t
4
Final Output
After 10–50 steps, coherent text emerges — generated in parallel
The cat sat on the mat
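The four steps above can be sketched as a toy denoising loop. This is not a real diffusion model — the "predictions" and their confidences are hard-coded — but the commit-the-confident, re-mask-the-rest schedule mirrors how masked-diffusion LLMs like LLaDA refine all positions in parallel.

```python
MASK = "[MASK]"
# (token, confidence) the toy "model" would predict at each position
PREDICTIONS = [("The", 0.9), ("cat", 0.6), ("sat", 0.8),
               ("on", 0.7), ("the", 0.9), ("mat", 0.5)]

def denoise(steps: int = 3) -> list[str]:
    seq = [MASK] * len(PREDICTIONS)
    per_step = len(seq) // steps  # tokens to commit on each pass
    for _ in range(steps):
        # Predict every masked position in parallel, then commit the
        # highest-confidence guesses; lower-confidence slots stay masked
        # and are re-predicted on the next pass.
        masked = [i for i, t in enumerate(seq) if t == MASK]
        masked.sort(key=lambda i: PREDICTIONS[i][1], reverse=True)
        for i in masked[:per_step]:
            seq[i] = PREDICTIONS[i][0]
    return seq
```

After three passes every position has been committed, with the easy tokens ("The", "the") locked in first and the harder ones ("mat") refined last — the "details sharpen" step in miniature.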
Why This Changes Everything
Parallel Generation
All tokens generated at once — not waiting in a queue
Bidirectional Context
Each token sees the FULL context, left AND right
Iterative Refinement
Multiple passes catch and fix errors — fewer hallucinations
Linear Scaling
Complexity grows linearly, not quadratically with sequence length
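One way to see the parallel-generation point in numbers is to count sequential forward passes. This is a simplified model — it ignores what each pass costs internally — but it shows why throughput diverges as outputs get longer; the step budget of 30 is an assumed value inside the 10–50 range quoted above.

```python
def ar_forward_passes(n_tokens: int) -> int:
    # Autoregressive decoding: one sequential forward pass per token.
    return n_tokens

def diffusion_forward_passes(n_tokens: int, steps: int = 30) -> int:
    # Diffusion decoding: a fixed budget of denoising steps, each one
    # refining every position in parallel, independent of length.
    return steps

# A 1,000-token answer: 1,000 sequential passes vs ~30 parallel steps.
```

The autoregressive count grows with output length; the diffusion count does not — which is also why GPUs sit idle less often between tokens.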
3 / 5
The Speed Revolution

Tokens Per Second Comparison
  • GPT-4o Mini (autoregressive): 59
  • Gemini 2.0 FL (autoregressive): ~300
  • Mercury 2 (diffusion): 1,009
  • Mercury Coder (diffusion): 1,109
  • Gemini Diffusion (diffusion): 1,479
End-to-End Latency
  • Mercury 2: 1.7s
  • Gemini 3 Flash: 14.4s (8.5x slower)
  • Claude Haiku 4.5: 23.4s (13.8x slower)
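The slowdown multiples follow directly from the latency figures above — the numbers are the slide's, the arithmetic is just a sanity check:

```python
# End-to-end latencies in seconds, as quoted on this slide.
latencies_s = {"Mercury 2": 1.7, "Gemini 3 Flash": 14.4, "Claude Haiku 4.5": 23.4}
baseline = latencies_s["Mercury 2"]

# Slowdown relative to Mercury 2, rounded to one decimal place.
multiples = {name: round(t / baseline, 1) for name, t in latencies_s.items()}
# 14.4 / 1.7 ≈ 8.5x, 23.4 / 1.7 ≈ 13.8x
```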
Training Efficiency
6.5x less training data needed
LLaDA 8B matched LLaMA3 8B performance using only 2.3T tokens vs 15T — proving diffusion models learn more efficiently from data.
85%+ GPU Utilization
vs 60–65% for autoregressive
Linear Complexity
vs quadratic scaling for AR
4 / 5
The Key Players

Mercury 2 — Inception Labs LAUNCHED FEB 2026
The fastest reasoning LLM. Built by researchers from Stanford, UCLA & Cornell who contributed to FlashAttention and DPO.
1,009
tok/s on Blackwell
$0.25 / $0.75
In / Out per 1M
128K
Context window
1.7s
End-to-end latency
OpenAI API-compatible • 10M free tokens for new accounts
Gemini Diffusion — Google DeepMind GOOGLE I/O 2025
Google’s experimental diffusion model. Blazing fast on coding & math tasks, with performance on par with Gemini 2.0 Flash Lite.
1,479
Standard tok/s
2,000
tok/s on coding
0.84s
Initial latency
Available via Google AI Studio (ai.google.dev) • Gaps on complex reasoning (GPQA: 40% vs 57%)
LLaDA — Open Source Research NEURIPS 2025 ORAL
The academic pioneer. First to prove diffusion can match autoregressive at scale. Fully open source.
8B
Parameters
2.3T
Training tokens
15
Benchmarks matched
Open source on GitHub • Variants: LLaDA-V (vision), LLaDA-MoE • Beat GPT-4o on reversal poem completion
5 / 5
What This Means for PMs

1
Faster Agent Loops
Diffusion models complete agent tasks in seconds, not minutes. Multi-step workflows that currently take 30s could run in 3s. This changes what’s feasible in real-time products.
2
Cheaper at Scale
Mercury 2 costs $0.25/1M input tokens. At 10–20x the speed with lower latency, the cost per task drops dramatically — making AI-powered features viable for more products.
3
New UX Possibilities
Sub-2-second latency enables real-time voice AI, instant code generation, and responsive search. Products that felt sluggish with autoregressive can now feel instant.
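To make the "cheaper at scale" point concrete, here is the per-task arithmetic at Mercury 2's listed rates — the token counts below are made-up example numbers, not benchmarks:

```python
IN_RATE = 0.25 / 1_000_000   # $ per input token  ($0.25 / 1M)
OUT_RATE = 0.75 / 1_000_000  # $ per output token ($0.75 / 1M)

def task_cost(input_tokens: int, output_tokens: int) -> float:
    # Dollar cost of a single model call at the listed rates.
    return input_tokens * IN_RATE + output_tokens * OUT_RATE

# A hypothetical agent step: 2,000 tokens in, 500 tokens out.
cost = task_cost(2_000, 500)  # $0.0005 + $0.000375 = $0.000875
# At that rate, $1,000 buys over a million such calls.
```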
Honest Trade-offs
  • × Complex multi-step reasoning still favors autoregressive (sequential thinking)
  • × Performance gaps on some benchmarks (GPQA, MMLU)
  • × Ecosystem still maturing — fewer production deployments
Try It Yourself
1
Mercury 2 API
inceptionlabs.ai • 10M free tokens • OpenAI-compatible
2
Mercury Playground
Try in browser via Lambda Labs partnership
3
Gemini Diffusion
Google AI Studio (ai.google.dev)
4
LLaDA (Open Source)
github.com/ML-GSAI/LLaDA • Full model weights
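Since Mercury's API is described above as OpenAI-compatible, a standard chat-completions request should work against it. A minimal sketch using only the standard library — the endpoint URL and model name here are assumptions for illustration; check inceptionlabs.ai's docs for the real values and supply your own API key:

```python
import json
import urllib.request

# Assumed endpoint for an OpenAI-compatible chat-completions API.
API_URL = "https://api.inceptionlabs.ai/v1/chat/completions"  # assumption

def build_request(prompt: str, api_key: str,
                  model: str = "mercury") -> urllib.request.Request:
    # Standard OpenAI-style request body: model name + message list.
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )

# To actually call the API (requires a valid key and network access):
# resp = urllib.request.urlopen(build_request("Hello", MY_KEY))
# print(json.load(resp)["choices"][0]["message"]["content"])
```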
“The shift from autoregressive to diffusion is not a question of if, but when. The speed advantage is too large to ignore.”
Diffusion models are following the same trajectory as image generation — experimental today, industry standard tomorrow.
by Rizvi Haider