1 / 5
THE NEXT AI PARADIGM

Diffusion LLMs
Explained

How a new architecture is making AI 10–20x faster — and why PMs should care


Two Ways to Generate Text

× AUTOREGRESSIVE
The cat sat on the mat
  • Generates one token at a time, left to right
  • Each word waits for the previous one to finish
  • Speed bottleneck: sequential by design
"Like typing one letter at a time"
DIFFUSION
[noise] [rough] [clear] [final]
  • Generates ALL tokens simultaneously in parallel
  • Starts with noise, iteratively refines the whole text
  • Speed breakthrough: parallel generation
"Like a photo developing — everything sharpens at once"
THE KEY INSIGHT
Autoregressive = writing left to right. Diffusion = painting the whole canvas at once, refining until it's sharp.
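The two loop shapes can be sketched in a few lines — a toy illustration, not a real model: the "model" here is a lookup that always guesses right, so only the structure of the loops matters.

```python
SENTENCE = ["The", "cat", "sat", "on", "the", "mat"]

def model(position: int) -> str:
    # Stand-in for a real LLM's prediction at one position.
    return SENTENCE[position]

def autoregressive() -> list[str]:
    # Sequential: token i cannot start until tokens 0..i-1 exist.
    out: list[str] = []
    for i in range(len(SENTENCE)):
        out.append(model(i))  # one model call per token, in order
    return out

def diffusion(steps: int = 3) -> list[str]:
    # Parallel: every position is (re)predicted on every step.
    out = ["[noise]"] * len(SENTENCE)
    for _ in range(steps):
        out = [model(i) for i in range(len(out))]  # whole canvas at once
    return out
```

Both return the same sentence; the difference is that the autoregressive loop has one sequential dependency per token, while the diffusion loop has only `steps` sequential dependencies regardless of length.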
2 / 5
How Diffusion LLMs Work

The Denoising Process
1
Start with Noise
Begin with completely random tokens filling every position
xk m# q! z@ p& w*
2
First Pass — Structure Emerges
Model predicts what each token should be, using bidirectional context
The m# sat z@ the w*
3
Refinement — Details Sharpen
Each iteration improves ALL tokens in parallel, not just one
The cat sat on the m.t
4
Final Output
After 10–50 steps, coherent text emerges — generated in parallel
The cat sat on the mat
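The four steps above can be sketched as a toy denoising loop. This is not a real diffusion model — the "predictions" and their confidences are hard-coded — but the commit-the-confident, re-mask-the-rest schedule mirrors how masked-diffusion LLMs like LLaDA refine all positions in parallel.

```python
MASK = "[MASK]"
# (token, confidence) the toy "model" would predict at each position
PREDICTIONS = [("The", 0.9), ("cat", 0.6), ("sat", 0.8),
               ("on", 0.7), ("the", 0.9), ("mat", 0.5)]

def denoise(steps: int = 3) -> list[str]:
    seq = [MASK] * len(PREDICTIONS)
    per_step = len(seq) // steps  # tokens to commit on each pass
    for _ in range(steps):
        # Predict every masked position in parallel, then commit the
        # highest-confidence guesses; lower-confidence slots stay masked
        # and are re-predicted on the next pass.
        masked = [i for i, t in enumerate(seq) if t == MASK]
        masked.sort(key=lambda i: PREDICTIONS[i][1], reverse=True)
        for i in masked[:per_step]:
            seq[i] = PREDICTIONS[i][0]
    return seq
```

After three passes every position has been committed, with the easy tokens ("The", "the") locked in first and the harder ones ("mat") refined last — the "details sharpen" step in miniature.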
Why This Changes Everything
Parallel Generation
All tokens generated at once — not waiting in a queue
Bidirectional Context
Each token sees the FULL context, left AND right
Iterative Refinement
Multiple passes catch and fix errors — fewer hallucinations
Linear Scaling
Complexity grows linearly, not quadratically with sequence length
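One way to see the parallel-generation point in numbers is to count sequential forward passes. This is a simplified model — it ignores what each pass costs internally — but it shows why throughput diverges as outputs get longer; the step budget of 30 is an assumed value inside the 10–50 range quoted above.

```python
def ar_forward_passes(n_tokens: int) -> int:
    # Autoregressive decoding: one sequential forward pass per token.
    return n_tokens

def diffusion_forward_passes(n_tokens: int, steps: int = 30) -> int:
    # Diffusion decoding: a fixed budget of denoising steps, each one
    # refining every position in parallel, independent of length.
    return steps

# A 1,000-token answer: 1,000 sequential passes vs ~30 parallel steps.
```

The autoregressive count grows with output length; the diffusion count does not — which is also why GPUs sit idle less often between tokens.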
3 / 5
The Speed Revolution

Tokens Per Second Comparison
  • GPT-4o Mini (autoregressive): 59
  • Gemini 2.0 FL (autoregressive): ~300
  • Mercury 2 (diffusion): 1,009
  • Mercury Coder (diffusion): 1,109
  • Gemini Diffusion (diffusion): 1,479
End-to-End Latency
  • Mercury 2: 1.7s
  • Gemini 3 Flash: 14.4s (8.5x slower)
  • Claude Haiku 4.5: 23.4s (13.8x slower)
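The slowdown multiples follow directly from the latency figures above — the numbers are the slide's, the arithmetic is just a sanity check:

```python
# End-to-end latencies in seconds, as quoted on this slide.
latencies_s = {"Mercury 2": 1.7, "Gemini 3 Flash": 14.4, "Claude Haiku 4.5": 23.4}
baseline = latencies_s["Mercury 2"]

# Slowdown relative to Mercury 2, rounded to one decimal place.
multiples = {name: round(t / baseline, 1) for name, t in latencies_s.items()}
# 14.4 / 1.7 ≈ 8.5x, 23.4 / 1.7 ≈ 13.8x
```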
Training Efficiency
6.5x less training data needed
LLaDA 8B matched LLaMA3 8B performance using only 2.3T tokens vs 15T — proving diffusion models learn more efficiently from data.
85%+ GPU Utilization
vs 60–65% for autoregressive
Linear Complexity
vs quadratic scaling for AR
4 / 5
The Key Players

Mercury 2 — Inception Labs LAUNCHED FEB 2026
The fastest reasoning LLM. Built by researchers from Stanford, UCLA & Cornell who contributed to FlashAttention and DPO.
1,009
tok/s on Blackwell
$0.25 / $0.75
In / Out per 1M
128K
Context window
1.7s
End-to-end latency
OpenAI API-compatible • 10M free tokens for new accounts
Gemini Diffusion — Google DeepMind GOOGLE I/O 2025
Google’s experimental diffusion model. Blazing fast on coding & math tasks, with performance on par with Gemini 2.0 Flash Lite.
1,479
Standard tok/s
2,000
tok/s on coding
0.84s
Initial latency
Available via Google AI Studio (ai.google.dev) • Gaps on complex reasoning (GPQA: 40% vs 57%)
LLaDA — Open Source Research NEURIPS 2025 ORAL
The academic pioneer. First to prove diffusion can match autoregressive at scale. Fully open source.
8B
Parameters
2.3T
Training tokens
15
Benchmarks matched
Open source on GitHub • Variants: LLaDA-V (vision), LLaDA-MoE • Beat GPT-4o on reversal poem completion
5 / 5
What This Means for PMs

1
Faster Agent Loops
Diffusion models complete agent tasks in seconds, not minutes. Multi-step workflows that currently take 30s could run in 3s. This changes what’s feasible in real-time products.
2
Cheaper at Scale
Mercury 2 costs $0.25/1M input tokens. At 10–20x the speed with lower latency, the cost per task drops dramatically — making AI-powered features viable for more products.
3
New UX Possibilities
Sub-2-second latency enables real-time voice AI, instant code generation, and responsive search. Products that felt sluggish with autoregressive can now feel instant.
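To make the "cheaper at scale" point concrete, here is the per-task arithmetic at Mercury 2's listed rates — the token counts below are made-up example numbers, not benchmarks:

```python
IN_RATE = 0.25 / 1_000_000   # $ per input token  ($0.25 / 1M)
OUT_RATE = 0.75 / 1_000_000  # $ per output token ($0.75 / 1M)

def task_cost(input_tokens: int, output_tokens: int) -> float:
    # Dollar cost of a single model call at the listed rates.
    return input_tokens * IN_RATE + output_tokens * OUT_RATE

# A hypothetical agent step: 2,000 tokens in, 500 tokens out.
cost = task_cost(2_000, 500)  # $0.0005 + $0.000375 = $0.000875
# At that rate, $1,000 buys over a million such calls.
```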
Honest Trade-offs
  • × Complex multi-step reasoning still favors autoregressive (sequential thinking)
  • × Performance gaps on some benchmarks (GPQA, MMLU)
  • × Ecosystem still maturing — fewer production deployments
Try It Yourself
1
Mercury 2 API
inceptionlabs.ai • 10M free tokens • OpenAI-compatible
2
Mercury Playground
Try in browser via Lambda Labs partnership
3
Gemini Diffusion
Google AI Studio (ai.google.dev)
4
LLaDA (Open Source)
github.com/ML-GSAI/LLaDA • Full model weights
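Since Mercury's API is described above as OpenAI-compatible, a standard chat-completions request should work against it. A minimal sketch using only the standard library — the endpoint URL and model name here are assumptions for illustration; check inceptionlabs.ai's docs for the real values and supply your own API key:

```python
import json
import urllib.request

# Assumed endpoint for an OpenAI-compatible chat-completions API.
API_URL = "https://api.inceptionlabs.ai/v1/chat/completions"  # assumption

def build_request(prompt: str, api_key: str,
                  model: str = "mercury") -> urllib.request.Request:
    # Standard OpenAI-style request body: model name + message list.
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )

# To actually call the API (requires a valid key and network access):
# resp = urllib.request.urlopen(build_request("Hello", MY_KEY))
# print(json.load(resp)["choices"][0]["message"]["content"])
```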
“The shift from autoregressive to diffusion is not a question of if, but when. The speed advantage is too large to ignore.”
Diffusion models are following the same trajectory as image generation — experimental today, industry standard tomorrow.
by Rizvi Haider