1. Use a Router, Not a Single Model
Route simple queries to Nano/Haiku ($0.05-1/M), medium tasks to Sonnet/GPT-5.4 ($2-3/M), and hard reasoning to Opus/Pro ($5-21/M). This cuts costs 60-80% while maintaining quality where it matters.
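The tiered routing above can be sketched as a small dispatch table. The tier names, prices, and the keyword heuristic below are illustrative assumptions; production routers typically use a small classifier model instead.

```python
# Illustrative tiers; model names and $/M prices are assumptions for the sketch.
TIERS = {
    "simple": {"model": "nano",   "price_per_m": 0.05},
    "medium": {"model": "sonnet", "price_per_m": 3.00},
    "hard":   {"model": "opus",   "price_per_m": 21.00},
}

def classify(query: str) -> str:
    """Crude complexity heuristic; a real router would use a cheap classifier."""
    if len(query) < 200 and "?" in query:
        return "simple"
    if any(k in query.lower() for k in ("prove", "derive", "architect", "debug")):
        return "hard"
    return "medium"

def route(query: str) -> dict:
    return TIERS[classify(query)]
```

The savings come from the fact that most traffic lands in the cheap tier; only queries the heuristic flags as hard pay the premium price.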
2. Benchmark on Your Data, Not Leaderboards
Arena Elo measures general preference. Your product has specific tasks. Run 50-100 representative queries against your top 3 candidates and grade outputs. The best model on paper may not be the best for your domain.
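A task-specific eval harness for this can be very small. `call_model` and `grade` below are placeholders you would wire to your own provider SDK and grading rubric; nothing here is a real API.

```python
def run_eval(models, queries, call_model, grade):
    """Score each candidate model on your own queries.

    call_model(model, query) -> output string (wire to your SDK)
    grade(query, output) -> float in [0, 1] (rubric, exact match, or LLM judge)
    Returns mean score per model.
    """
    scores = {}
    for m in models:
        outputs = [call_model(m, q) for q in queries]
        scores[m] = sum(grade(q, o) for q, o in zip(queries, outputs)) / len(queries)
    return scores
```

With 50-100 queries this runs in minutes and gives you a ranking that reflects your domain rather than general preference.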
3. Build Provider-Agnostic from Day 1
Use abstraction layers (Vercel AI SDK, LiteLLM) so you can swap models without rewriting code. The leaderboard reshuffles every month. Your architecture should survive model churn without a migration project.
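The pattern behind those abstraction layers is a thin interface that application code depends on instead of any one SDK. The class names below are illustrative, not part of LiteLLM or the Vercel AI SDK; those libraries give you an equivalent seam out of the box.

```python
from typing import Protocol

class ChatProvider(Protocol):
    """Minimal seam between app code and a vendor SDK (names are assumptions)."""
    def complete(self, model: str, prompt: str) -> str: ...

class FakeProvider:
    """Stand-in for a real adapter (OpenAI, Anthropic, Google, ...)."""
    def complete(self, model: str, prompt: str) -> str:
        return f"[{model}] echo: {prompt}"

def answer(provider: ChatProvider, model: str, prompt: str) -> str:
    # App code sees only the interface, so swapping models or vendors
    # becomes a config change, not a rewrite.
    return provider.complete(model, prompt)
```

When the leaderboard reshuffles, you write one new adapter and change a config value; the rest of the codebase doesn't move.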
4. Factor in Total Cost, Not Just Token Price
Token cost is one line item. Add: latency (affects UX), rate limits (affects scale), context window (affects architecture), and compliance requirements (affects where you can deploy). A $0.28/M model with a 128K context may cost more than a $3/M model with a 1M context once you factor in the chunking logic the smaller window forces you to build.
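A back-of-envelope comparison makes the point concrete. The overhead figure below is an illustrative assumption standing in for the engineering cost of chunking/RAG upkeep, not a measurement.

```python
def monthly_cost(price_per_m: float, tokens_m: float, overhead_usd: float = 0.0) -> float:
    """Token bill plus fixed monthly overhead (e.g. maintaining chunking logic)."""
    return price_per_m * tokens_m + overhead_usd

# Hypothetical workload: 500M tokens/month.
cheap_small_ctx = monthly_cost(0.28, tokens_m=500, overhead_usd=2000)  # chunking upkeep
pricier_big_ctx = monthly_cost(3.00, tokens_m=500)                     # no chunking needed
```

Here the "cheap" model lands at $2,140/month against $1,500/month for the pricier one; the sticker price inverted once a single fixed cost entered the picture.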
Effective ≠ Advertised
The NVIDIA RULER benchmark shows effective context is typically 50-65% of what's advertised: a "200K" model degrades around 130K. Claude is the exception, with <5% degradation across its full window.
Context Strategy by Tier
Under 32K: Fit everything in one call. 32K-200K: Retrieval-augmented generation (RAG). 200K-2M: Full document ingestion. 2M+: Entire codebases, multi-doc synthesis.
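The tiers above reduce to a simple threshold lookup. The boundaries are the article's; the strategy labels are my shorthand.

```python
def context_strategy(prompt_tokens: int) -> str:
    """Map total prompt size (in tokens) to a context strategy tier."""
    if prompt_tokens < 32_000:
        return "single-call"      # fit everything in one request
    if prompt_tokens < 200_000:
        return "rag"              # retrieve only the relevant chunks
    if prompt_tokens < 2_000_000:
        return "full-ingest"      # whole documents in context
    return "codebase-scale"       # entire repos, multi-doc synthesis
```

Given the effective-context caveat above, it is safer to apply these thresholds against a model's measured effective window rather than its advertised one.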
Prompt Caching Is Universal
Every major provider now offers prompt caching. Cached tokens cost 75-90% less. Structure your prompts with a stable system prefix. This alone can cut your bill in half on repeated query patterns.
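A cache-friendly request puts the long, unchanging content first and the per-query content last. The sketch below mirrors the shape of Anthropic's `cache_control` content blocks; other providers cache shared prefixes automatically, but the structural rule (stable prefix, volatile suffix) is the same everywhere.

```python
# Assumed long, reused system prompt: this is the part worth caching.
STABLE_SYSTEM = "You are a support agent for Acme. Policies: ..."

def build_request(user_query: str) -> dict:
    """Stable prefix marked cacheable; only the user turn varies per call."""
    return {
        "system": [
            {"type": "text", "text": STABLE_SYSTEM,
             "cache_control": {"type": "ephemeral"}},   # cacheable prefix
        ],
        "messages": [{"role": "user", "content": user_query}],  # varies
    }
```

The common mistake is interpolating per-request data (timestamps, user IDs) into the system prompt, which invalidates the cache on every call.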
Batch Processing = 50% Off
All major providers (OpenAI, Anthropic, Google, xAI) offer 50% batch discount for async processing. If your feature doesn't need real-time results, batch it. Run nightly analysis, bulk classification, content generation overnight.
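The workload split can be expressed as a small dispatcher: tag each job by latency need and send only the interactive ones down the real-time path. The `send_realtime` and `submit_batch` callables are placeholders for your provider's synchronous and batch endpoints.

```python
def dispatch(jobs, send_realtime, submit_batch):
    """jobs: list of (payload, needs_realtime) pairs.

    Interactive jobs go out immediately at full price; everything else
    is collected into one batch submission for the ~50% async discount.
    """
    deferred = [payload for payload, realtime in jobs if not realtime]
    for payload, realtime in jobs:
        if realtime:
            send_realtime(payload)
    if deferred:
        submit_batch(deferred)
```

Nightly analysis, bulk classification, and content generation are the obvious candidates for the deferred path.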
Key Takeaways
→ The frontier is crowded: Top 5 models are within 19 Elo points. Differentiation is shifting to speed, cost, compliance, and ecosystem.
→ Open-source is closing in: GLM-5, Qwen3.5, and Llama 4 are within striking distance of proprietary models, with zero API costs.
→ Prices collapsed 80%: What cost $15/M in 2025 costs $3/M today. Budget is no longer the primary constraint for most teams.
→ Diffusion LLMs are real: Mercury 2's 1,000 tok/s proves non-autoregressive generation works at production scale. Watch this space.
Data Sources
Chatbot Arena / LM Arena (lmarena.ai) · TLDL API Pricing Report (tldl.io) · NVIDIA RULER Benchmark · Anthropic, OpenAI, Google, xAI, Inception Labs official docs · Morph LLM Context Comparison · March 2026