A Product Manager's Guide · March 2026

The LLM Landscape:
Every Model, Compared

8 companies, 30+ models, one cheat sheet. Arena rankings, pricing, context windows, and which model to pick for your use case.

Anthropic
San Francisco · Founded 2021
Dario & Daniela Amodei (ex-OpenAI). Safety-first lab valued at $380B. Ships Claude Opus, Sonnet, Haiku.
$380B · Claude
OpenAI
San Francisco · Founded 2015
Sam Altman, CEO. $25B+ ARR, eyeing IPO. GPT-5.x series plus o3 reasoning line. Largest consumer base via ChatGPT.
$25B ARR · GPT-5.4
Google DeepMind
London + Mountain View · Merged 2023
Demis Hassabis leads the merged Brain + DeepMind. Gemini 3.1 Pro set the GPQA Diamond record (94.3%). Massive multimodal edge.
Alphabet · Gemini
xAI
Memphis, TN · Founded 2023
Elon Musk's lab, acquired by SpaceX ($250B). Grok models with real-time X data. Colossus supercomputer. 2M token context.
SpaceX · Grok 4
Meta AI
Menlo Park · FAIR since 2013
Zuckerberg's open-source play. Llama 4 Scout has a 10M token context window. Runs on a single H100. MoE architecture.
Open Source · Llama 4
DeepSeek
Hangzhou, China · Founded 2023
Liang Wenfeng (hedge fund founder). Trained frontier models for ~$6M. V3.2 at $0.28/M input. V4 targets 1T params, Apache 2.0.
$0.28/M · Open Source
Mistral AI
Paris, France · Founded 2023
Ex-DeepMind & Meta researchers. Europe's AI champion. GDPR-compliant, 675B MoE. Le Chat consumer app with deep research.
EU Data · $6B+
Inception Labs
Palo Alto · Founded 2024
Stanford researchers pioneering diffusion LLMs. Mercury 2 generates 1,000 tok/s via parallel refinement. 5x faster than rivals.
Diffusion · 1K tok/s
Flagship Models

The Top 8 You Need to Know

Claude Opus 4.6
Anthropic
Arena #1 globally (1504 Elo). Best-in-class complex reasoning, multi-step logic, and agentic coding. 1M token context in beta. Safety-focused with extended thinking.
Elo 1504 · $5 / $25 · 200K–1M ctx
Gemini 3.1 Pro
Google DeepMind
Arena #2 (1500 Elo). Set the GPQA Diamond record at 94.3%, scored 77.1% on ARC-AGI-2. Native 5-modality input (text, images, audio, video, code). Chain-of-thought reasoning.
Elo 1500 · $2 / $12 · 1M ctx
GPT-5.4
OpenAI
Most consistent all-purpose model. Five reasoning levels (none to xhigh), computer use API for desktop automation, 272K standard context with 1M in Codex. Batch pricing at 50% off.
$2.50 / $15 · 272K–1M ctx · 5 reasoning levels
Grok 4
xAI
Arena #4 (1493 Elo). Real-time X/Twitter data access, native tool use, scaled RL. Grok 4.20 Beta adds a 4-agent collaboration system. Grok 4.1 Fast: 2M context at $0.20/M.
Elo 1493 · $3 / $15 · 256K–2M ctx
Llama 4 Scout
Meta AI · Open Source
10M token context (the largest of any public model). 17B active / 109B total parameters with a 16-expert MoE. Fits on a single H100 GPU. Beats Gemma 3 and Gemini 2.0 Flash-Lite across benchmarks. Fully open-weight.
10M ctx · Free / Open · Single GPU
Mercury 2
Inception Labs
First commercial diffusion LLM. Generates 1,000+ tokens/sec via parallel refinement (not autoregressive). 5x faster than speed-optimized alternatives. In-generation error correction. Ideal for fast agent loops.
1K tok/s · 5× cheaper · Diffusion arch.
DeepSeek V3.2
DeepSeek · Open Source
The value king. $0.28/M input vs $2–5 for competitors. Trained for ~$6M (roughly 100× cheaper than typical frontier training runs). MoE architecture with competitive reasoning. V4 incoming: 1T params, 1M context, native multimodal, Apache 2.0.
$0.28 / $0.42 · 128K ctx · Apache 2.0
Mistral Large 3
Mistral AI
Europe's frontier model. 41B active / 675B total MoE. GDPR-compliant with EU data residency. Le Chat consumer app has deep research, image editing, and multilingual reasoning. Strong enterprise play.
$2 / $6 · 128K ctx · EU compliant
Key Insight
The gap between #1 and #5 on the Arena is just 19 Elo points. Models are converging in raw capability. The differentiators are now pricing, context length, speed, compliance, and ecosystem integrations.
Cost & Capacity

Pricing & Context Windows

API Pricing · Per 1M Tokens (Input / Output)
Model               Input    Output    Context    Tier
GPT-5.2 Pro         $21.00   $168.00   200K       Premium reasoning
Claude Opus 4.6     $5.00    $25.00    200K–1M    Frontier intelligence
Claude Sonnet 4.6   $3.00    $15.00    200K–1M    Best daily driver
GPT-5.4             $2.50    $15.00    272K–1M    All-purpose flagship
Grok 4              $3.00    $15.00    256K–2M    Real-time data
Gemini 3.1 Pro      $2.00    $12.00    1M         Best value frontier
Mistral Large 3     $2.00    $6.00     128K       EU compliance
DeepSeek V3.2       $0.28    $0.42     128K       Extreme value
Grok 4.1 Fast       $0.20    $0.50     2M         Fast + huge context
GPT-5 Nano          $0.05    $0.40     128K       High-volume tasks
Context Window Comparison
Llama 4 Scout         10M
Grok 4.1 Fast          2M
Gemini 3 Pro           2M
Claude / GPT (beta)    1M
Gemini 3 Flash         1M
GPT-5.4              272K
Claude (standard)    200K
Real-World Cost Comparison · 1M Output Tokens
Premium Reasoning    $168     GPT-5.2 Pro output
Frontier Standard    $15      Sonnet/GPT-5.4 output
Budget Tier          $0.42    DeepSeek V3.2 output

~80%   API price drop 2025→2026 across all providers
50%    Batch discount now standard (async processing)
Prompt Caching · The Hidden Cost Saver
Every major provider now supports prompt caching. Cached tokens cost 75-90% less than uncached. Structure prompts with a stable system prefix and your repeated context at the front. On high-volume workloads, this alone can cut your bill by 40-60%. Claude and GPT cache automatically when input patterns repeat.
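Here's what that structure looks like in practice. A minimal sketch using the OpenAI Python SDK; the model name comes from this guide and the policy file is hypothetical, so treat both as illustrative rather than official.

```python
# Sketch: cache-friendly prompt structure. Keep the stable prefix byte-identical
# across calls so the provider can cache it; only the user question varies.
from openai import OpenAI

client = OpenAI()

# Stable prefix: identical on every call, eligible for provider-side caching.
# "policy.md" is a hypothetical repeated-context file.
SYSTEM_PREFIX = (
    "You are a support assistant for AcmeCo. Follow the policy below.\n\n"
    + open("policy.md").read()
)

def answer(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5.4",  # illustrative model name from this guide
        messages=[
            {"role": "system", "content": SYSTEM_PREFIX},  # cached after first call
            {"role": "user", "content": question},         # only this tail is uncached
        ],
    )
    return response.choices[0].message.content
```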
How They Stack Up

Rankings & Benchmarks

Chatbot Arena · Top 5 Overall (5.4M+ Votes, 323 Models)
1. Claude Opus 4.6    Anthropic          Elo 1504
2. Gemini 3.1 Pro     Google DeepMind    Elo 1500
4. Grok 4.20 Beta     xAI                Elo 1493
5. Gemini 3 Pro       Google DeepMind    Elo 1485
Key Benchmarks · Best Score Highlighted
Model             GPQA Diamond   ARC-AGI-2   SWE-bench   MATH-500   LiveCodeBench
Gemini 3.1 Pro    94.3%          77.1%       —           —          —
Claude Opus 4.6   —              —           —           —          —
GPT-5.4           —              —           —           —          —
GLM-4.7           —              —           91.2%       —          84.9%
Qwen3-Max         —              —           —           97.8%      —
Top Open-Source Models · Arena Rankings
#1 Open Source   GLM-5       Elo 1451   Zhipu AI (Beijing)
#2 Open Source   Kimi K2.5   Elo 1448   Moonshot AI
#3 Open Source   GLM-4.7     Elo 1445   Zhipu AI (Beijing)
What the Arena Tells PMs
Arena Elo measures real human preference across blind A/B tests. The top 5 are within 19 points of each other. The leading open-source models (GLM-5, Kimi K2.5, GLM-4.7) now cluster within 6 points of one another and are closing the gap with proprietary models fast. Source: lmarena.ai, March 5, 2026.
Benchmark Caveats for PMs
Benchmarks measure specific skills in isolation. Your product has unique requirements. GPQA Diamond tests graduate-level science. SWE-bench tests real GitHub issue fixing. MATH-500 tests mathematical reasoning. None of these measure your actual use case. Always run your own evals on representative queries.
The Speed Factor
Benchmarks don't capture latency. Mercury 2 at 1,000 tok/s delivers 5x faster responses than any frontier model. Grok 4.1 Fast prioritizes throughput over peak reasoning. For user-facing products, time-to-first-token matters more than GPQA score. Test perceived speed, not just accuracy.
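Measuring perceived speed is cheap. A hedged sketch that times the first streamed token with the OpenAI Python SDK, assuming an OpenAI-compatible endpoint; the default model name is this guide's illustrative flagship.

```python
# Sketch: measure time-to-first-token (TTFT) and total latency via streaming.
import time
from openai import OpenAI

client = OpenAI()

def measure_ttft(prompt: str, model: str = "gpt-5.4") -> tuple[float | None, float]:
    start = time.perf_counter()
    first_token = None
    stream = client.chat.completions.create(
        model=model,  # illustrative name from this guide
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if first_token is None and chunk.choices and chunk.choices[0].delta.content:
            first_token = time.perf_counter() - start  # perceived responsiveness
    total = time.perf_counter() - start
    return first_token, total
```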
The Decision Matrix

Which Model for What

Complex Reasoning
Strategy docs, multi-step analysis, architecture decisions
Top Pick: Claude Opus 4.6
Arena #1, strongest multi-step logic and extended thinking. Runner-up: Gemini 3.1 Pro (94.3% GPQA Diamond).
Agentic Coding
Multi-file edits, PR reviews, debugging, CLI reasoning
Top Pick: GLM-4.7
91.2% SWE-bench, 84.9% LiveCodeBench (both best-in-class). Runner-up: Claude Sonnet 4.6 for daily coding tasks.
Speed-Critical Agents
Production agent loops, high-frequency tool calls, real-time pipelines
Top Pick: Mercury 2
1,000 tok/s via diffusion architecture, 5x faster than alternatives. In-generation error correction for reliable agent loops.
Long Document Analysis
Contract review, codebase analysis, research synthesis, multi-doc QA
Top Pick: Gemini 3 Pro
2M token context with native multimodal. Runner-up: Llama 4 Scout (10M context, open-source, self-hostable).
Budget / High Volume
Classification, extraction, summarization at scale, batch processing
Top Pick: DeepSeek V3.2
$0.28/M input (10-18x cheaper than frontier). Competitive quality. Runner-up: GPT-5 Nano ($0.05/M) for simpler tasks.
EU / GDPR Compliance
European data residency, regulatory requirements, sovereign AI
Top Pick: Mistral Large 3
Paris-based, EU data residency, 675B MoE architecture. The only frontier model with native GDPR compliance.
Self-Hosted / On-Prem
Air-gapped, data sovereignty, custom fine-tuning, edge deployment
Top Pick: Llama 4 Maverick
1M context, 128-expert MoE, open weights. Runner-up: Qwen3.5 (397B, 19x faster decoding, Apache 2.0).
Real-Time Data Access
Social listening, trend analysis, news monitoring, market intel
Top Pick: Grok 4
Native X/Twitter integration with real-time post data. Arena #4. Only frontier model with live social media context built in.
Creative Writing
Narratives, marketing copy, scripts, tone-sensitive content
Top Pick: Claude Opus 4.5
Widely regarded as the most nuanced, warm writer. Runner-up: Qwen3-235B (creative alignment, open-source).
PM Rule of Thumb
Start with Claude Sonnet 4.6 or GPT-5.4 as your default. Escalate to Opus/GPT-5.2 Pro for hard reasoning. Drop to DeepSeek/Nano for volume. Specialize (Grok for real-time, Mercury for speed, Mistral for EU) only when the use case demands it.
Practical Guidance

The PM's Playbook

How to Choose · 4 Rules for Product Managers
1
Use a Router, Not a Single Model
Route simple queries to Nano/Haiku ($0.05–$1/M), medium tasks to Sonnet/GPT-5.4 ($2–3/M), and hard reasoning to Opus/Pro ($5–21/M). This cuts costs 60–80% while maintaining quality where it matters.
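A router doesn't need to be fancy to pay off. A sketch, assuming the tiers above; the model names come from this guide, and the length/question heuristic is a stand-in for a real classifier.

```python
# Sketch of a three-tier router: cheap model for simple queries, a mid-tier
# default, and an expensive reasoning model only when the caller asks for it.
def pick_model(query: str, needs_deep_reasoning: bool = False) -> str:
    if needs_deep_reasoning:
        return "claude-opus-4.6"    # hard reasoning: $5/M tier
    if len(query) < 200 and "?" in query:
        return "gpt-5-nano"         # simple lookup/classification: $0.05/M tier
    return "claude-sonnet-4.6"      # default daily driver: $3/M tier
```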
2
Benchmark on Your Data, Not Leaderboards
Arena Elo measures general preference. Your product has specific tasks. Run 50-100 representative queries against your top 3 candidates and grade outputs. The best model on paper may not be the best for your domain.
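A 50-query eval can be one script. A sketch assuming a LiteLLM-style completion() wrapper; the model IDs are illustrative, and grading is left to a human or a judge model.

```python
# Sketch: run the same representative queries against candidate models and
# dump outputs to a CSV with an empty "grade" column to fill in later.
import csv
from litellm import completion

CANDIDATES = ["gpt-5.4", "claude-sonnet-4.6", "deepseek/deepseek-chat"]  # illustrative IDs

def run_evals(queries: list[str], out_path: str = "evals.csv") -> None:
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["query", "model", "output", "grade"])
        for q in queries:
            for model in CANDIDATES:
                resp = completion(model=model, messages=[{"role": "user", "content": q}])
                writer.writerow([q, model, resp.choices[0].message.content, ""])
```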
3
Build Provider-Agnostic from Day 1
Use abstraction layers (Vercel AI SDK, LiteLLM) so you can swap models without rewriting code. The leaderboard reshuffles every month. Your architecture should survive model churn without a migration project.
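With an abstraction layer, the swap really is one config line. A sketch using LiteLLM; the model IDs are illustrative.

```python
# Sketch: one call path for every provider via LiteLLM, so swapping models is
# a config change rather than a rewrite.
import os
from litellm import completion

# Swap "anthropic/claude-sonnet-4.6" for "gemini/gemini-3.1-pro", "xai/grok-4",
# etc. (illustrative IDs) without touching any call site.
MODEL = os.environ.get("LLM_MODEL", "anthropic/claude-sonnet-4.6")

def ask(prompt: str) -> str:
    resp = completion(model=MODEL, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content
```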
4
Factor in Total Cost, Not Just Token Price
Token cost is one line item. Add: latency (affects UX), rate limits (affects scale), context window (affects architecture), and compliance requirements (affects where you can deploy). A $0.28/M model with 128K context may cost more than $3/M with 1M context when you factor in chunking logic.
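The chunking arithmetic is worth doing explicitly. A back-of-envelope sketch with illustrative prices from this guide's pricing table; note that raw token math still favors the cheap model, so the hidden cost is the extra calls, merge logic, latency, and failure handling, not the tokens themselves.

```python
# Sketch: token cost of processing one large document by sliding a small
# context window (with overlap) versus one long-context call.
def chunked_cost(doc_tokens, window, overlap, in_price, out_per_chunk, out_price):
    effective = window - overlap
    chunks = -(-doc_tokens // effective)  # ceiling division
    cost = (chunks * window * in_price + chunks * out_per_chunk * out_price) / 1e6
    return chunks, cost

# 800K-token document, ~4K tokens of output per chunk:
chunks, cheap = chunked_cost(800_000, window=120_000, overlap=8_000,
                             in_price=0.28, out_per_chunk=4_000, out_price=0.42)
single = (800_000 * 2.00 + 8_000 * 12.00) / 1e6  # one 1M-context call, Gemini-style pricing
print(f"{chunks} chunked calls: ${cheap:.2f} vs one 1M-context call: ${single:.2f}")
# 8 calls' worth of latency, merge passes, and retries vs a single request.
```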
Managing Context Windows
Effective ≠ Advertised
NVIDIA RULER benchmark shows effective context is 50-65% of what's advertised. A "200K" model degrades around 130K. Claude is the exception: <5% degradation across its full window.
Context Strategy by Tier
Under 32K: Fit everything in one call. 32K-200K: Retrieval-augmented generation (RAG). 200K-2M: Full document ingestion. 2M+: Entire codebases, multi-doc synthesis.
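Those tiers map cleanly to a dispatch function. A sketch; the thresholds mirror the list above.

```python
# Sketch: map total input size to this guide's context strategy tiers.
def context_strategy(total_tokens: int) -> str:
    if total_tokens < 32_000:
        return "single-call"      # fit everything in one prompt
    if total_tokens < 200_000:
        return "rag"              # retrieval-augmented generation
    if total_tokens < 2_000_000:
        return "full-ingestion"   # long-context model, whole docs in the prompt
    return "huge-context"         # 2M+ models: entire codebases, multi-doc synthesis
```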
Prompt Caching Is Universal
Every major provider now offers prompt caching. Cached tokens cost 75-90% less. Structure your prompts with a stable system prefix. This alone can cut your bill in half on repeated query patterns.
Batch Processing = 50% Off
All major providers (OpenAI, Anthropic, Google, xAI) offer 50% batch discount for async processing. If your feature doesn't need real-time results, batch it. Run nightly analysis, bulk classification, content generation overnight.
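Batching is a few lines with the OpenAI Batch API. A sketch; it assumes a pre-built requests.jsonl file (one chat.completions request per line), and the model choices inside it are yours.

```python
# Sketch: submit an async batch for the 50% discount, then poll until done.
from openai import OpenAI

client = OpenAI()

batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # async: results within 24 hours at half price
)
print(batch.id, batch.status)  # poll client.batches.retrieve(batch.id) until "completed"
```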
Key Takeaways
The frontier is crowded: Top 5 models are within 19 Elo points. Differentiation is shifting to speed, cost, compliance, and ecosystem.
Open-source is closing in: GLM-5, Qwen3.5, and Llama 4 are within striking distance of proprietary models, with zero API costs.
Prices collapsed 80%: What cost $15/M in 2025 costs $3/M today. Budget is no longer the primary constraint for most teams.
Diffusion LLMs are real: Mercury 2's 1,000 tok/s proves non-autoregressive generation works at production scale. Watch this space.
Data Sources
Chatbot Arena / LM Arena (lmarena.ai) · TLDL API Pricing Report (tldl.io) · NVIDIA RULER Benchmark · Anthropic, OpenAI, Google, xAI, Inception Labs official docs · Morph LLM Context Comparison · March 2026
by Rizvi Haider