1. Use a Router, Not a Single Model
Route simple queries to Nano/Haiku ($0.05-1/M), medium tasks to Sonnet/GPT-5.4 ($2-3/M), and hard reasoning to Opus/Pro ($5-21/M). This cuts costs 60-80% while maintaining quality where it matters.
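The tiered routing above can be sketched as a small dispatch table. The tier names, prices, and the keyword heuristic below are illustrative assumptions; production routers typically use a small classifier model instead.

```python
# Illustrative tiers; model names and $/M prices are assumptions for the sketch.
TIERS = {
    "simple": {"model": "nano",   "price_per_m": 0.05},
    "medium": {"model": "sonnet", "price_per_m": 3.00},
    "hard":   {"model": "opus",   "price_per_m": 21.00},
}

def classify(query: str) -> str:
    """Crude complexity heuristic; a real router would use a cheap classifier."""
    if len(query) < 200 and "?" in query:
        return "simple"
    if any(k in query.lower() for k in ("prove", "derive", "architect", "debug")):
        return "hard"
    return "medium"

def route(query: str) -> dict:
    return TIERS[classify(query)]
```

The savings come from the fact that most traffic lands in the cheap tier; only queries the heuristic flags as hard pay the premium price.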
2. Benchmark on Your Data, Not Leaderboards
Arena Elo measures general preference. Your product has specific tasks. Run 50-100 representative queries against your top 3 candidates and grade outputs. The best model on paper may not be the best for your domain.
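A task-specific eval harness for this can be very small. `call_model` and `grade` below are placeholders you would wire to your own provider SDK and grading rubric; nothing here is a real API.

```python
def run_eval(models, queries, call_model, grade):
    """Score each candidate model on your own queries.

    call_model(model, query) -> output string (wire to your SDK)
    grade(query, output) -> float in [0, 1] (rubric, exact match, or LLM judge)
    Returns mean score per model.
    """
    scores = {}
    for m in models:
        outputs = [call_model(m, q) for q in queries]
        scores[m] = sum(grade(q, o) for q, o in zip(queries, outputs)) / len(queries)
    return scores
```

With 50-100 queries this runs in minutes and gives you a ranking that reflects your domain rather than general preference.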
3. Build Provider-Agnostic from Day 1
Use abstraction layers (Vercel AI SDK, LiteLLM) so you can swap models without rewriting code. The leaderboard reshuffles every month. Your architecture should survive model churn without a migration project.
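The pattern behind those abstraction layers is a thin interface that application code depends on instead of any one SDK. The class names below are illustrative, not part of LiteLLM or the Vercel AI SDK; those libraries give you an equivalent seam out of the box.

```python
from typing import Protocol

class ChatProvider(Protocol):
    """Minimal seam between app code and a vendor SDK (names are assumptions)."""
    def complete(self, model: str, prompt: str) -> str: ...

class FakeProvider:
    """Stand-in for a real adapter (OpenAI, Anthropic, Google, ...)."""
    def complete(self, model: str, prompt: str) -> str:
        return f"[{model}] echo: {prompt}"

def answer(provider: ChatProvider, model: str, prompt: str) -> str:
    # App code sees only the interface, so swapping models or vendors
    # becomes a config change, not a rewrite.
    return provider.complete(model, prompt)
```

When the leaderboard reshuffles, you write one new adapter and change a config value; the rest of the codebase doesn't move.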
4. Factor in Total Cost, Not Just Token Price
Token cost is one line item. Add: latency (affects UX), rate limits (affects scale), context window (affects architecture), and compliance requirements (affects where you can deploy). A $0.28/M model with a 128K context may cost more than a $3/M model with a 1M context once you factor in the chunking logic the smaller window forces you to build.
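A back-of-envelope comparison makes the point concrete. The overhead figure below is an illustrative assumption standing in for the engineering cost of chunking/RAG upkeep, not a measurement.

```python
def monthly_cost(price_per_m: float, tokens_m: float, overhead_usd: float = 0.0) -> float:
    """Token bill plus fixed monthly overhead (e.g. maintaining chunking logic)."""
    return price_per_m * tokens_m + overhead_usd

# Hypothetical workload: 500M tokens/month.
cheap_small_ctx = monthly_cost(0.28, tokens_m=500, overhead_usd=2000)  # chunking upkeep
pricier_big_ctx = monthly_cost(3.00, tokens_m=500)                     # no chunking needed
```

Here the "cheap" model lands at $2,140/month against $1,500/month for the pricier one; the sticker price inverted once a single fixed cost entered the picture.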
Effective ≠ Advertised
The NVIDIA RULER benchmark shows effective context is typically 50-65% of what's advertised: a "200K" model degrades around 130K. Claude is the exception, with <5% degradation across its full window.
Context Strategy by Tier
Under 32K: Fit everything in one call. 32K-200K: Retrieval-augmented generation (RAG). 200K-2M: Full document ingestion. 2M+: Entire codebases, multi-doc synthesis.
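The tiers above reduce to a simple threshold lookup. The boundaries are the article's; the strategy labels are my shorthand.

```python
def context_strategy(prompt_tokens: int) -> str:
    """Map total prompt size (in tokens) to a context strategy tier."""
    if prompt_tokens < 32_000:
        return "single-call"      # fit everything in one request
    if prompt_tokens < 200_000:
        return "rag"              # retrieve only the relevant chunks
    if prompt_tokens < 2_000_000:
        return "full-ingest"      # whole documents in context
    return "codebase-scale"       # entire repos, multi-doc synthesis
```

Given the effective-context caveat above, it is safer to apply these thresholds against a model's measured effective window rather than its advertised one.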
Prompt Caching Is Universal
Every major provider now offers prompt caching. Cached tokens cost 75-90% less. Structure your prompts with a stable system prefix. This alone can cut your bill in half on repeated query patterns.
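A cache-friendly request puts the long, unchanging content first and the per-query content last. The sketch below mirrors the shape of Anthropic's `cache_control` content blocks; other providers cache shared prefixes automatically, but the structural rule (stable prefix, volatile suffix) is the same everywhere.

```python
# Assumed long, reused system prompt: this is the part worth caching.
STABLE_SYSTEM = "You are a support agent for Acme. Policies: ..."

def build_request(user_query: str) -> dict:
    """Stable prefix marked cacheable; only the user turn varies per call."""
    return {
        "system": [
            {"type": "text", "text": STABLE_SYSTEM,
             "cache_control": {"type": "ephemeral"}},   # cacheable prefix
        ],
        "messages": [{"role": "user", "content": user_query}],  # varies
    }
```

The common mistake is interpolating per-request data (timestamps, user IDs) into the system prompt, which invalidates the cache on every call.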
Batch Processing = 50% Off
All major providers (OpenAI, Anthropic, Google, xAI) offer 50% batch discount for async processing. If your feature doesn't need real-time results, batch it. Run nightly analysis, bulk classification, content generation overnight.
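The workload split can be expressed as a small dispatcher: tag each job by latency need and send only the interactive ones down the real-time path. The `send_realtime` and `submit_batch` callables are placeholders for your provider's synchronous and batch endpoints.

```python
def dispatch(jobs, send_realtime, submit_batch):
    """jobs: list of (payload, needs_realtime) pairs.

    Interactive jobs go out immediately at full price; everything else
    is collected into one batch submission for the ~50% async discount.
    """
    deferred = [payload for payload, realtime in jobs if not realtime]
    for payload, realtime in jobs:
        if realtime:
            send_realtime(payload)
    if deferred:
        submit_batch(deferred)
```

Nightly analysis, bulk classification, and content generation are the obvious candidates for the deferred path.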
Key Takeaways
→ The frontier is crowded: Top 5 models are within 19 Elo points. Differentiation is shifting to speed, cost, compliance, and ecosystem.
→ Open-source is closing in: GLM-5, Qwen3.5, and Llama 4 are within striking distance of proprietary models, with zero API costs.
→ Prices collapsed 80%: What cost $15/M in 2025 costs $3/M today. Budget is no longer the primary constraint for most teams.
→ Diffusion LLMs are real: Mercury 2's 1,000 tok/s proves non-autoregressive generation works at production scale. Watch this space.
Data Sources
Chatbot Arena / LM Arena (lmarena.ai) · TLDL API Pricing Report (tldl.io) · NVIDIA RULER Benchmark · Anthropic, OpenAI, Google, xAI, Inception Labs official docs · Morph LLM Context Comparison · March 2026