The May 2026 coding-model lineup is the first one where the question is no longer which model can write code but which model can run inside an agent for hours without veering off. Anthropic ships a three-tier Claude 4 family: Opus 4.7 with a 1M-token context tier, Sonnet 4.6 as the daily driver, and Haiku 4.5 for cheap parallel agents. OpenAI's GPT-5 (released August 7, 2025) sits between Haiku and Sonnet on list price and is the first OpenAI model to clear 70% on SWE-bench Verified. Google's Gemini 2.5 Pro keeps the price-per-million advantage and the 1M context window but is still behind on the agentic-coding benchmarks that decide whether a model can run a Cursor agent or Claude Code session unattended.
This page is the comparison we use internally when picking a default model for a tool. All prices and context sizes link to the vendor's own pricing page; verify before you write a contract. Benchmarks are the published-by-vendor numbers with the harness named — read them as marketing-with-receipts, not gospel.
2026 coding model comparison
| Model | Context | Input / Output (per 1M tokens) | SWE-bench Verified | Best for |
|---|---|---|---|---|
| Claude Opus 4.7 | 200K (1M tier on request) | $15 / $75 | ~80% | Long-horizon agents, hard refactors |
| Claude Sonnet 4.6 | 200K | $3 / $15 | ~77% | Default driver for IDE agents |
| Claude Haiku 4.5 | 200K | $1 / $5 | ~73% | Cheap parallel sub-agents, classifiers |
| GPT-5 | 400K | $1.25 / $10 | 74.9% | Cost-sensitive agents, tool use |
| GPT-5 mini | 400K | $0.25 / $2 | ~62% | High-volume completion, fanout |
| Gemini 2.5 Pro | 1M (2M preview) | $1.25 / $10 (≤200K) | 63.8% | Whole-repo reads, multimodal review |
| Gemini 2.5 Flash | 1M | $0.30 / $2.50 | ~54% | Bulk indexing, embeddings-adjacent work |
| DeepSeek V3.1 | 128K | $0.27 / $1.10 | ~66% | Open-weights baseline, self-host |
Prices: see Anthropic pricing, OpenAI pricing, Gemini API pricing, and DeepSeek pricing. Claude Opus 4.7 1M-context tier and Gemini 2.5 Pro >200K context are billed at higher rates — read the pricing page before promising a budget. SWE-bench Verified numbers are vendor-published with the agent harness named where available; the canonical leaderboard lives at swebench.com.
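To turn the table into a budget number, a minimal sketch like the one below is enough. It assumes base-tier list prices only (no long-context surcharge, no caching or batch discounts), and the dictionary keys are our own labels, not official API model IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per 1M input tokens
    output_per_m: float  # USD per 1M output tokens


# Base-tier list prices copied from the comparison table.
# Keys are illustrative labels, not vendor model identifiers.
PRICES = {
    "claude-opus-4.7": Price(15.00, 75.00),
    "claude-sonnet-4.6": Price(3.00, 15.00),
    "claude-haiku-4.5": Price(1.00, 5.00),
    "gpt-5": Price(1.25, 10.00),
    "gpt-5-mini": Price(0.25, 2.00),
    "gemini-2.5-pro": Price(1.25, 10.00),
    "gemini-2.5-flash": Price(0.30, 2.50),
    "deepseek-v3.1": Price(0.27, 1.10),
}


def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one session at base-tier list prices."""
    p = PRICES[model]
    return (input_tokens * p.input_per_m + output_tokens * p.output_per_m) / 1_000_000


if __name__ == "__main__":
    # Hypothetical agent session: 2M input tokens, 200K output tokens.
    for name in ("claude-opus-4.7", "claude-sonnet-4.6", "gpt-5", "claude-haiku-4.5"):
        print(f"{name}: ${session_cost(name, 2_000_000, 200_000):.2f}")
```

For that hypothetical 2M-in / 200K-out session, the sketch gives $45.00 for Opus, $9.00 for Sonnet, $4.50 for GPT-5, and $3.00 for Haiku. Real sessions shift with the input/output mix, so treat the ratios as a starting point.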
Claude 4 family (Anthropic)
Three tiers, one personality. Opus 4.7 is what you reach for when the task is "rewrite this auth layer across forty files and don't miss anything". The 1M-context tier is opt-in (request access in the console) and roughly doubles the per-token price above 200K. Sonnet 4.6 is the daily driver inside Cursor and Claude Code: it has strong tool use and reliable diff editing, and in late 2025 it became the first model to cross 75% on SWE-bench Verified with the open-source mini-SWE-agent harness. Haiku 4.5 is the surprise of the year: it ships ~73% on SWE-bench Verified at $1 / $5 per million tokens, which makes it the first model where you can run dozens of parallel sub-agents without watching the bill.
Tradeoff: Anthropic's pricing is the highest per token of the major families. The justification is agentic reliability: Opus and Sonnet stay on-task across long Claude Code sessions where GPT-5 and Gemini have a higher chance of derailing or asking for confirmation. If your workload is single-shot completions, you are overpaying for Anthropic. If your workload is hour-long autonomous edits, the gap closes fast.
GPT-5 (OpenAI)
Released August 7, 2025, GPT-5 was OpenAI's first model to put a credible SWE-bench Verified number on the board (74.9% with the Verified harness) and the first to ship at $1.25 input / $10 output per million tokens. That is Sonnet-level capability at about two-thirds of Sonnet's output price ($10 against $15) and less than half its input price ($1.25 against $3). The 400K context window is comfortable for most repos and removes the "feed it the right files" pre-step that 128K models needed.
Tradeoff: Reasoning-mode latency. GPT-5 with reasoning.effort: high is the most capable mode, but it adds five-to-fifteen seconds per call before the first token, which shows up as visible lag inside an IDE agent. Most editor integrations default to reasoning.effort: medium for this reason. The mini variant is excellent for fanout (lint-fix bots, classifier sub-agents) where Haiku 4.5 is the only obvious alternative.
Gemini 2.5 series (Google)
Gemini 2.5 Pro went GA in June 2025 and remains the cheapest credible million-token model: prompts up to 200K tokens are billed at $1.25 per million input tokens, and longer prompts at $2.50. The 2M-token preview tier is the only commercial option for "dump the whole repo plus dependencies into the prompt" workflows. Gemini 2.5 Flash costs roughly a quarter of Pro per token ($0.30 / $2.50 against $1.25 / $10) and is the natural pick for high-volume work where Haiku 4.5 is too expensive.
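The 200K threshold changes the arithmetic for whole-repo prompts. The sketch below covers input cost only and assumes the whole prompt is billed at the rate for its size bracket, which is how we read the "≤200K" note in the table. If billing turns out to be marginal per token, the long-prompt totals would be lower. The function and constant names are ours.

```python
TIER_THRESHOLD_TOKENS = 200_000  # prompt size at which the higher rate applies
BASE_INPUT_PER_M = 1.25          # USD per 1M input tokens, prompt <= 200K
LONG_INPUT_PER_M = 2.50          # USD per 1M input tokens, prompt > 200K


def gemini_pro_input_cost(prompt_tokens: int) -> float:
    """Estimate Gemini 2.5 Pro input cost in USD for a single prompt.

    Assumption: the rate is selected by total prompt size and applied
    to every token in the prompt (not only the tokens above 200K).
    """
    if prompt_tokens <= TIER_THRESHOLD_TOKENS:
        rate = BASE_INPUT_PER_M
    else:
        rate = LONG_INPUT_PER_M
    return prompt_tokens * rate / 1_000_000


if __name__ == "__main__":
    for size in (150_000, 1_500_000):
        print(f"{size:>9} tokens: ${gemini_pro_input_cost(size):.4f}")
```

Under that assumption a 150K-token prompt costs about $0.19 of input and a 1.5M-token prompt about $3.75, so a handful of whole-repo reads per day stays cheap while a loop that re-sends the repo on every step does not.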
Tradeoff: Agentic reliability still trails Anthropic and OpenAI. Google's own SWE-bench Verified figure for 2.5 Pro is 63.8%, about 11 points below GPT-5 (74.9%) and about 13 below Sonnet 4.6 (~77%). Gemini shines for read-heavy work (whole-repo audits, multimodal PR review with screenshots) and is a weaker pick for autonomous multi-file edits. The honest 2026 answer is: use Gemini for context, Claude or GPT-5 to do the work.
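One way to read "use Gemini for context, Claude or GPT-5 to do the work" is as a two-stage pipeline: a large-context reader condenses the repository into a brief, and the worker model only sees the brief plus the task. The sketch below shows the shape of that split. Both models are passed in as plain callables, so no vendor SDK is assumed, and the prompt wording is purely illustrative.

```python
from typing import Callable

# Any function that takes a prompt string and returns a completion string.
Model = Callable[[str], str]


def read_then_edit(repo_text: str, task: str, reader: Model, worker: Model) -> str:
    """Run a large-context reader first, then hand its brief to a worker model."""
    # Stage 1: the reader sees the whole repository and returns a short brief.
    brief = reader(
        "List the files and functions relevant to this task, with short notes.\n"
        f"Task: {task}\n\nRepository:\n{repo_text}"
    )
    # Stage 2: the worker sees only the task and the brief, not the full repo.
    return worker(f"Task: {task}\n\nRelevant context:\n{brief}")


if __name__ == "__main__":
    # Stand-in models so the sketch runs without network access.
    fake_reader: Model = lambda prompt: "auth/session.py: refresh_token()"
    fake_worker: Model = lambda prompt: f"[worker received]\n{prompt}"
    print(read_then_edit("<entire repo dump>", "Rotate refresh tokens", fake_reader, fake_worker))
```

The design choice is that the expensive worker never pays for the full repository. Whether the brief loses too much detail is something to check on your own codebase.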
Open-weights options (DeepSeek, Qwen, Llama)
DeepSeek V3.1 (December 2024 base, March 2026 update) is the practical leader on the open side — ~66% SWE-bench Verified, native function calling, and an MIT-licensed weights drop you can run on a single H100 node with quantisation. Qwen3-Coder (Alibaba, 2025) is the strongest open-weights model for raw code completion in non-English contexts. Llama 3.3 70B remains the safe default for self-hosted deployments where licensing-by-Meta is acceptable but you want CPU-of-last-resort behaviour.
Tradeoff: Tooling. Cursor, Claude Code, Windsurf, and the major IDE agents are tuned for Anthropic and OpenAI APIs. Plugging DeepSeek into Cursor works but the prompt scaffolding is not tuned, so you give up some of the agentic gains the closed models book on the benchmarks. Open-weights wins when privacy or self-hosting is a hard requirement; otherwise the price-per-quality math still favours the commercial APIs in 2026.
Which model should I pick for my coding agent?
- Default IDE agent (Cursor / Windsurf / Copilot agent mode): Claude Sonnet 4.6. It is the model these tools tune their prompts and skills around in 2026, and it ships the most reliable diff-edit behaviour at a price that survives a team-wide rollout.
- Long-horizon refactor / multi-hour Claude Code sessions: Claude Opus 4.7 with the 1M context tier, falling back to Sonnet 4.6 once the plan is locked. The price gap is real but the "had to start over twice" gap is bigger if you stay on cheaper models.
- Cost-sensitive agentic work (a startup, a hobby project): GPT-5 main at reasoning.effort: medium, with GPT-5 mini for fanout. The blended cost is half a Sonnet workload and the SWE-bench numbers are within shouting distance.
- High-volume fanout (lint bots, batch generation, sub-agents): Claude Haiku 4.5 if you want the Anthropic tooling and the 73% SWE-bench number; Gemini 2.5 Flash if you want the cheapest large-context option; GPT-5 mini if you are already on OpenAI infra.
- Whole-repo read / audit / large-context Q&A: Gemini 2.5 Pro. Nothing else lets you put a 1.5M-token codebase into one prompt.
- Self-hosted / privacy-required: DeepSeek V3.1. The weights are MIT-licensed, the SWE-bench number is the highest of any open model, and the inference stack is well documented.
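The list above is effectively a lookup table, and teams that route work across several models often encode it that way. Here is a minimal sketch; the enum, the mapping, and the rule that a self-hosting requirement overrides everything else are our own framing of the recommendations, not a vendor feature.

```python
from enum import Enum


class Workload(Enum):
    IDE_AGENT = "default IDE agent"
    LONG_REFACTOR = "long-horizon refactor"
    COST_SENSITIVE = "cost-sensitive agentic work"
    FANOUT = "high-volume fanout"
    WHOLE_REPO_READ = "whole-repo read or audit"
    SELF_HOSTED = "self-hosted or privacy-required"


# First entry is the primary pick; later entries are the alternatives
# or fallbacks named in the recommendations.
RECOMMENDED = {
    Workload.IDE_AGENT: ["Claude Sonnet 4.6"],
    Workload.LONG_REFACTOR: ["Claude Opus 4.7", "Claude Sonnet 4.6"],
    Workload.COST_SENSITIVE: ["GPT-5", "GPT-5 mini"],
    Workload.FANOUT: ["Claude Haiku 4.5", "Gemini 2.5 Flash", "GPT-5 mini"],
    Workload.WHOLE_REPO_READ: ["Gemini 2.5 Pro"],
    Workload.SELF_HOSTED: ["DeepSeek V3.1"],
}


def pick_models(workload: Workload, must_self_host: bool = False) -> list[str]:
    """Return candidate models for a workload, primary pick first."""
    # A hard self-hosting requirement rules out every hosted API.
    if must_self_host:
        return list(RECOMMENDED[Workload.SELF_HOSTED])
    return list(RECOMMENDED[workload])


if __name__ == "__main__":
    print(pick_models(Workload.FANOUT))
    print(pick_models(Workload.IDE_AGENT, must_self_host=True))
```

Keeping the mapping as data rather than branching logic makes it easy to update when the next comparison changes the defaults.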
What changed since the last comparison
- SWE-bench Verified replaced HumanEval+ as the citable benchmark. HumanEval saturated above 90% across the top three families in 2024 and stopped distinguishing models. SWE-bench Verified, which scores agents on real GitHub issues, is now the number every vendor leads with.
- The 1M-token tier became normal. Gemini had it first; Anthropic shipped a 1M tier for Sonnet 4 in August 2025 and extended it to Opus 4.7. GPT-5 sits at 400K, which makes OpenAI the only one of the three flagship families without a million-token option.
- Reasoning-effort knobs are everywhere. GPT-5 exposes reasoning.effort (minimal/low/medium/high). Claude has extended thinking with a token budget. Gemini 2.5 ships thinking on by default with a thinkingBudget parameter. Picking the right setting is now part of picking the model.
- Open-weights gap halved. DeepSeek V3 closed most of the quality gap to GPT-4-class models in 2024; V3.1 closes most of the remaining agentic gap. The closed-source edge is now harness tuning and tool-use polish, not raw model capability.
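The three reasoning knobs look different on the wire. The sketch below builds only the relevant request fragments as plain dictionaries, with no SDK calls and no network access. The field names reflect our understanding of each vendor's public API at the time of writing: `reasoning.effort` and `thinkingBudget` are named above, while Anthropic's `thinking` / `budget_tokens` shape and the Gemini `generationConfig.thinkingConfig` nesting are assumptions to check against current docs before use.

```python
ALLOWED_EFFORTS = {"minimal", "low", "medium", "high"}


def openai_reasoning(effort: str) -> dict:
    """Fragment for a GPT-5 request: the reasoning.effort setting."""
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(f"effort must be one of {sorted(ALLOWED_EFFORTS)}")
    return {"reasoning": {"effort": effort}}


def anthropic_thinking(budget_tokens: int, max_tokens: int) -> dict:
    """Fragment for a Claude request with extended thinking enabled.

    Assumption: the thinking budget has to fit inside max_tokens.
    """
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be smaller than max_tokens")
    return {
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
    }


def gemini_thinking(thinking_budget: int) -> dict:
    """Fragment for a Gemini 2.5 REST request: the thinkingBudget setting."""
    return {"generationConfig": {"thinkingConfig": {"thinkingBudget": thinking_budget}}}


if __name__ == "__main__":
    print(openai_reasoning("medium"))
    print(anthropic_thinking(budget_tokens=4_000, max_tokens=16_000))
    print(gemini_thinking(2_048))
```

Merging one of these fragments into the rest of a request body keeps the reasoning setting in a single, reviewable place, which helps when the same agent has to run against more than one vendor.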
Where this connects
Picking a model is the start of the conversation, not the end. The agent harness — hooks, evaluators, skills — does at least as much work as the model upgrade once you are past Level 2 in the Five Levels framework. Day-to-day use of these models in a closed-loop coding environment is documented in our Claude Code and Cursor guides; raw API integration patterns (rate limits, prompt caching, structured outputs) live in our guide to LLM APIs for developers. For the broader landscape and discovery sources, our awesome AI coding lists page is the index.