Local LLMs are good enough for coding in 2026 — for the right tasks, on the right hardware. A 24 GB GPU (RTX 3090, 4090, 5090) or a 32 GB Apple-Silicon Mac can run Qwen2.5-Coder-32B-Instruct at Q4_K_M and reach 73.7 on Aider's code-editing benchmark — a score Alibaba publishes as comparable to GPT-4o. The model lives on disk, never sees the network, and answers an OpenAI-shaped HTTP request from Cursor, Continue, Aider, or any other client. That is the news. Three years ago an open-weights coder hitting GPT-4-class numbers on a single consumer GPU would have been a research paper.
This page is the primer. It covers what changed since the Llama 2 / Mistral 7B era, which model to actually pull on which hardware, the three runtimes worth knowing (Ollama, llama.cpp, LM Studio), how to point your editor at localhost:11434, and where local still loses to hosted in May 2026 — long-horizon agent runs and frontier reasoning. For the hosted side, the AI & LLM coding model comparison is the companion read.
TL;DR
- 16 GB Mac / 12 GB GPU: Qwen2.5-Coder 7B / 14B at Q4_K_M, ~30–40 tok/s. Solid for chat, autocomplete, single-file edits.
- 24 GB GPU / 32 GB Mac: Qwen2.5-Coder 32B at Q4_K_M, 15–20 tok/s. The current sweet spot for serious local coding.
- Default runtime: Ollama (llama.cpp under the hood, plus a daemon and an OpenAI-compatible API).
- Editor wiring: Continue talks directly to http://localhost:11434; Cursor needs an ngrok tunnel because its backend cannot reach localhost.
- Where local still loses: SWE-bench Verified, multi-hour agentic refactors, and the absolute frontier — Claude Sonnet 4.6 and GPT-5 are still ahead by ~10–20 points.
Is local AI good enough for coding in 2026?
For day-to-day inline edits, autocomplete, code review, refactoring, and writing tests against a clear spec — yes. The 2024 mental model where local models could do toy demos and not much else broke when Qwen2.5-Coder landed in late 2024 and was then refined through 2025: a 32 B-parameter model that scores within a few points of GPT-4o on Aider and EvalPlus, runs on a single 24 GB GPU at Q4_K_M, and ships under an Apache 2.0 licence with no usage caps. DeepSeek's open-weights coder family adds a second strong option, especially for algorithmic and mathematics-heavy work. Codestral handles fast inline completion in a 16 GB-VRAM setup.
Where local still loses, in May 2026, is the same place it lost a year ago: long-horizon agentic work, the latest reasoning benchmarks, and tasks that benefit from frontier context windows. Anthropic's Sonnet 4.6 is still the default for Claude Code sessions that span hours, and GPT-5 holds a measurable lead on SWE-bench Verified. If you climb the autonomy ladder past Level 3 in Shapiro's five levels, you will probably still be paying a frontier-model bill for the orchestrator agent — even when the per-edit worker is a local Qwen.
What hardware do I need to run a coding LLM?
The number that matters is the memory budget for weights plus KV cache. Q4_K_M quantisation packs a bit over half a byte per parameter, so the weights of an N-billion-parameter model take a little more than N/2 GB — but the KV cache scales with context length and can rival or exceed the weights once you push past 32 K tokens. The table below is for Q4_K_M at a modest 4–8 K context, which covers most editor sessions.
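As a sanity check, here is a back-of-envelope estimate for the sweet-spot case. The architecture numbers (64 layers, 8 KV heads, 128-dim heads) are illustrative values for a 32 B-class model rather than figures from any spec sheet, and the 0.6 bytes-per-parameter factor is the rough Q4_K_M average:

```sh
# Rough VRAM estimate for a 32B model at Q4_K_M with an 8K context
# (illustrative architecture values -- check the model card for the real ones)
PARAMS_B=32 LAYERS=64 KV_HEADS=8 HEAD_DIM=128 CONTEXT=8192

# weights: ~0.6 bytes per parameter
echo "weights  ~ $(( PARAMS_B * 6 / 10 )) GB"

# KV cache: 2 (K and V) x layers x kv_heads x head_dim x 2 bytes (fp16) x context tokens
echo "kv cache ~ $(( 2 * LAYERS * KV_HEADS * HEAD_DIM * 2 * CONTEXT / 1000000000 )) GB"
```

With these numbers that is roughly 19 GB of weights plus about 2 GB of KV cache, comfortably inside 24 GB. Rerun it with CONTEXT=131072 and the KV term alone climbs past 30 GB, which is the long-context trap described in the next paragraph.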
| Tier | Hardware | Models that fit (Q4_K_M) | Speed | What it is good for |
|---|---|---|---|---|
| Entry | M2/M3 16 GB, RTX 3060 12 GB | 7 B–8 B (5–6 GB) | 40–55 tok/s | Autocomplete, single-file chat, commit messages |
| Mid | M2 Pro 32 GB, RTX 4060 Ti 16 GB | 12–14 B (8–10 GB) | 25–40 tok/s | Multi-file edits, code review, refactor with tests |
| Sweet spot | RTX 3090/4090 24 GB, M-series 32 GB | 27–32 B (~22 GB) | 15–20 tok/s | Qwen2.5-Coder 32B, Devstral Small, Mistral Small 3 |
| Workstation | RTX 5090 32 GB, M Ultra 64 GB+ | 70 B (~40 GB, partial offload) | 8–12 tok/s | Llama 3.x 70B class, long-context whole-repo work |
Two surprises catch first-time buyers. First, an 8 B model at Q4_K_M with a 128 K context window needs roughly 20 GB of KV cache on top of the weights — so a model the spec sheet says fits in 6 GB won't fit in 12 GB once you turn on long context. Set OLLAMA_FLASH_ATTENTION=1 to cut KV memory by 30–50%. Second, system RAM is not VRAM-equivalent: Ollama and llama.cpp will silently offload layers to CPU when GPU memory is exhausted, and CPU inference runs 10–100× slower. If your eval rate in ollama run --verbose drops from 30 tok/s to 3 tok/s mid-session, you have spilled out of VRAM.
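A minimal way to watch for both failure modes, assuming an Ollama install with the models used elsewhere on this page; the printed numbers are illustrative, not measured:

```sh
# Start the daemon with flash attention enabled (the env var is read by `ollama serve`)
OLLAMA_FLASH_ATTENTION=1 ollama serve &

# --verbose prints timing stats after each reply; watch the "eval rate" line
ollama run qwen2.5-coder:32b-instruct-q4_K_M --verbose "write a fizzbuzz in Go"
# eval rate: 17.4 tokens/s      <- healthy 24 GB GPU territory (illustrative number)

# If that rate collapses mid-session, check how much of the model actually sits on the GPU
ollama ps
# PROCESSOR column: "100% GPU" is what you want; any CPU share means you have spilled out of VRAM
```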
Which model should I run for coding?
Two open-weights families dominate in May 2026: Qwen 2.5 Coder from Alibaba (the default for general coding work) and DeepSeek (better on reasoning- and algorithm-heavy tasks under an MIT licence). Codestral fits a niche for fast inline completion on smaller VRAM. The picks below assume Q4_K_M; bump to Q5_K_M if you have ~20% extra VRAM and you can spot the quality difference.
- `qwen2.5-coder:32b-instruct-q4_K_M`: the default for serious local coding. 73.7 Aider, 128K context, 92 languages. Needs 24 GB VRAM. Apache 2.0.
- `qwen2.5-coder:14b-instruct-q4_K_M`: 8–10 GB VRAM. The honest middle. Strong autocomplete plus chat that handles real refactors.
- `qwen2.5-coder:7b-instruct-q4_K_M`: 5–6 GB VRAM. The smallest model that still feels useful for code. The autocomplete pick on a 16 GB Mac.
- `deepseek-coder-v2:16b`: MoE — 16 B total / 2.4 B active. Punches above its weight on algorithmic tasks; weaker on idiomatic React or Rails.
- `codestral:22b-v0.1-q4_K_M`: fast inline completion focus, low latency. Good fit for a 16 GB GPU where Qwen 32B will not fit.
- `qwen2.5-coder:1.5b`: tab-completion model only — pair it with a larger chat model. ~3 GB and runs anywhere.
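The tags above are exactly what you hand to Ollama; pulling the fully pinned name avoids the :latest drift covered in the gotchas below (download size for the 32 B quant is approximate):

```sh
# Pull the pinned tags rather than the bare family name
ollama pull qwen2.5-coder:32b-instruct-q4_K_M   # ~20 GB download, 24 GB VRAM class
ollama pull qwen2.5-coder:1.5b                  # the tab-completion sidekick from the list above
ollama list                                     # the exact strings listed here are what editor configs must reference
```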
Two attribution notes worth getting right. Qwen2.5-Coder-32B-Instruct's 73.7 on Aider comes from Alibaba's own release post, and the community has broadly reproduced it on a single 24 GB GPU; treat it as a strong upper bound that drops a few points at lower quantisations. DeepSeek's code work is best framed at the family level rather than any single tag: the open-weights V3 family is MIT-licensed and uses a Mixture-of-Experts architecture, so only a fraction of the parameters are active per token even when the file size on disk looks large.
Ollama, llama.cpp, or LM Studio — which runtime?
All three run the same llama.cpp engine that Georgi Gerganov started in 2023, which is why their inference speed for the same model and quantisation is essentially identical. The differences are the wrapper: Ollama adds a daemon, a model registry, and the OpenAI-compatible HTTP API; LM Studio adds a desktop GUI with a good model browser; raw llama.cpp exposes every flag and gives you the absolute speed ceiling.
| Runtime | Pick when | Trade-off |
|---|---|---|
| Ollama | You want one local endpoint that Continue, Cursor, Open WebUI, Aider, and an n8n workflow can all share. | A thin wrapper — every flag exists, but the daemon hides them by default. |
| llama.cpp | You want every token-per-second the hardware will give, custom build flags, or you are running on exotic accelerators. | You operate it yourself. llama-server, not llama-cli, is the API surface. |
| LM Studio | You want a desktop GUI, a model browser, and a one-click chat — no terminal. | The HTTP server only runs while the app is open; not a fit for a daemon-style workflow. |
The honest default is Ollama. It is the runtime targeted by Open WebUI, Continue, Aider, LangChain examples, and most "run locally" sections of vendor documentation. Pick llama.cpp when you have measured Ollama and want more speed, or LM Studio when the person sitting at the keyboard wants a window with buttons.
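For reference, this is roughly what the llama.cpp route looks like; the GGUF path is a placeholder for whichever file you downloaded, and the flag values are a plausible starting point rather than tuned settings:

```sh
# llama-server is the HTTP-serving binary; it exposes OpenAI-compatible routes under /v1
# -ngl 99 asks for every layer on the GPU, --ctx-size caps the KV cache
llama-server -m ~/models/qwen2.5-coder-32b-instruct-q4_k_m.gguf \
  --ctx-size 8192 -ngl 99 --port 8080

# Same request shape as the Ollama endpoint shown in the next section, just a different port
curl -s http://localhost:8080/v1/models
```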
How do I plug a local model into Cursor, Continue, or Claude Code?
Ollama exposes the OpenAI Chat Completions API at http://localhost:11434/v1/chat/completions. Anything that speaks OpenAI can talk to it; the only caveat is whether the client process can reach localhost.
```sh
curl -s http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen2.5-coder:32b",
    "messages": [{ "role": "user", "content": "Say hi" }],
    "stream": false
  }' | jq -r '.choices[0].message.content'
```

Continue (VS Code, JetBrains)
Continue talks directly to localhost; no tunnel needed. Drop this in ~/.continue/config.yaml and reload the window:
```yaml
models:
  - name: Qwen2.5-Coder 32B
    provider: ollama
    model: qwen2.5-coder:32b
    apiBase: http://localhost:11434
    roles: [chat, edit]
  - name: Qwen2.5-Coder 1.5B (autocomplete)
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles: [autocomplete]
```

Cursor
Cursor's backend runs in a sandbox that cannot see localhost. The fix is an ngrok tunnel plus the OpenAI-compatible base URL in Settings → Models:
```sh
OLLAMA_ORIGINS='*' ollama serve &
ngrok http 11434 --host-header='localhost:11434'
# Cursor → Settings → Models → "Add custom model"
# Base URL: https://<your-tunnel>.ngrok-free.app/v1
# API Key: ollama (any non-empty string)
# Model: qwen2.5-coder:32b (must match `ollama list`)
```

Cursor's remote-indexing and codebase-search features still talk to Cursor's servers even with a local model selected. If you need a fully air-gapped setup, use Continue.
Claude Code and Aider
Claude Code is built around Anthropic's Sonnet/Opus releases and does not currently run against an arbitrary local endpoint. Aider does:
```sh
pip install aider-chat
export OLLAMA_API_BASE=http://localhost:11434
aider --model ollama/qwen2.5-coder:32b
```

When does local beat hosted, and when does it lose?
Local wins on three axes. Privacy: code that lives under an NDA or under EU data-residency rules never leaves the box. Cost at high volume: an autocomplete model running 12 hours a day is roughly free at the margin once the GPU is paid for, while the same volume on a hosted API runs into the low hundreds of dollars per developer per month. Latency for small completions: a 7 B model on local VRAM responds in 50–80 ms first-token, ahead of any remote API once you account for round-trip.
Hosted wins on three different axes. Frontier capability: the Anthropic and OpenAI releases lead local by ~10–20 points on SWE-bench Verified in any given month, and the gap reopens whenever a new release lands. Long-horizon agency: running an agent for an hour with tool calls, retries, and self-correction stresses the model in a way that most open-weights coders have not been tuned for. Context window: Anthropic's 1 M-token tier and Gemini's long-context models still beat what a single 24 GB GPU can practically hold in KV cache. The pragmatic answer is to do both — local for inline edits and review, hosted for the agentic runs you would have walked away from anyway.
Common gotchas
- Wrong model tag: the string in your editor config must match `ollama list` exactly. `qwen2.5-coder` resolves to `:latest`, which may not be the size you pulled. Pin the tag (a quick check is sketched after this list).
- CORS blocks: set `OLLAMA_ORIGINS=*` before starting `ollama serve` if you are calling from a browser-based client (including some IDE extensions running in WebView).
- CPU offload disguised as a slow GPU: `ollama ps` tells you how much of the model is on GPU; if the percentage is below 100%, you are paying CPU-tier latency. Drop a quant level or pick a smaller model.
- Tool-use claims that do not work: some Ollama models advertise function-calling support but produce malformed tool calls under load. Test agent mode on real tasks before trusting the spec sheet.
- Model rot: the `:latest` tag drifts as new quantisations land. Pin `qwen2.5-coder:32b-instruct-q4_K_M` instead of `qwen2.5-coder` in production configs.
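A quick way to run the first three checks from that list, assuming the same Ollama setup used throughout this page:

```sh
# Is the exact tag your editor config references actually present?
ollama list | grep qwen2.5-coder

# Is the loaded model fully on the GPU? The PROCESSOR column should read 100% GPU.
ollama ps

# Restart with permissive CORS if a WebView- or browser-based client is being blocked
OLLAMA_ORIGINS='*' ollama serve
```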
Related reading on env.dev
- Ollama — the runtime page: install per OS, Modelfile, and the OpenAI-compatible API in detail.
- Open WebUI — self-hosted ChatGPT-style UI on top of Ollama: RAG, RBAC, scheduled automations.
- AI & LLM coding model comparison — the hosted side, with current pricing and SWE-bench numbers.
- Agentic coding levels — where local models fit inside Shapiro's autonomy ladder.
- Awesome AI coding — community lists for AI tooling, including local-LLM resources.
Primary sources
- Qwen — Qwen2.5-Coder release post (Aider 73.7, EvalPlus, multi-language coverage).
- ollama/ollama on GitHub (170k+ stars, OpenAI-compatible API at :11434).
- ggml-org/llama.cpp — Georgi Gerganov's inference engine, the layer underneath Ollama and LM Studio.
- Continue — Ollama guide (config schema, autocomplete vs chat roles).
- Aider — terminal coding assistant; the canonical Aider benchmark used by every coder model release post.