Local LLMs are good enough for coding in 2026 — for the right tasks, on the right hardware. A 24 GB GPU (RTX 3090, 4090, 5090) or a 32 GB Apple-Silicon Mac can run Qwen2.5-Coder-32B-Instruct at Q4_K_M and reach 73.7 on Aider's code-editing benchmark — a score Alibaba publishes as comparable to GPT-4o. The model lives on disk, never sees the network, and answers an OpenAI-shaped HTTP request from Cursor, Continue, Aider, or any other client. That is the news. Three years ago an open-weights coder hitting GPT-4-class numbers on a single consumer GPU would have been a research paper.
This page is the primer. It covers what changed since the Llama 2 / Mistral 7B era, which model to actually pull on which hardware, the three runtimes worth knowing (Ollama, llama.cpp, LM Studio), how to point your editor at localhost:11434, and where local still loses to hosted in May 2026 — long-horizon agent runs and frontier reasoning. For the hosted side, the AI & LLM coding model comparison is the companion read.
TL;DR
- 16 GB Mac / 12 GB GPU: Qwen2.5-Coder 7B / 14B at Q4_K_M, ~30–40 tok/s. Solid for chat, autocomplete, single-file edits.
- 24 GB GPU / 32 GB Mac: Qwen2.5-Coder 32B at Q4_K_M, 15–20 tok/s. The current sweet spot for serious local coding.
- Default runtime: Ollama (llama.cpp under the hood, plus a daemon and an OpenAI-compatible API).
- Editor wiring: Continue talks directly to http://localhost:11434; Cursor needs an ngrok tunnel because its backend cannot reach localhost.
- Where local still loses: SWE-bench Verified, multi-hour agentic refactors, and the absolute frontier — Claude Sonnet 4.6 and GPT-5 are still ahead by ~10–20 points.
Is local AI good enough for coding in 2026?
For day-to-day inline edits, autocomplete, code review, refactoring, and writing tests against a clear spec — yes. The 2024 mental model where local models could do toy demos and not much else broke when Qwen2.5-Coder landed in late 2024 and was then refined through 2025: a 32 B-parameter model that scores within a few points of GPT-4o on Aider and EvalPlus, runs on a single 24 GB GPU at Q4_K_M, and ships under an Apache 2.0 licence with no usage caps. DeepSeek's open-weights coder family adds a second strong option, especially for algorithmic and mathematics-heavy work. Codestral handles fast inline completion in a 16 GB-VRAM setup.
Where local still loses, in May 2026, is the same place it lost a year ago: long-horizon agentic work, the latest reasoning benchmarks, and tasks that benefit from frontier context windows. Anthropic's Sonnet 4.6 is still the default for Claude Code sessions that span hours, and GPT-5 holds a measurable lead on SWE-bench Verified. If you climb the autonomy ladder past Level 3 in Shapiro's five levels, you will probably still be paying a frontier-model bill for the orchestrator agent — even when the per-edit worker is a local Qwen.
What hardware do I need to run a coding LLM?
The number that matters is the memory budget for weights plus KV cache. Q4_K_M quantisation packs a bit over half a byte per parameter, so the weights of an N-billion-parameter model take a little more than N/2 GB — but the KV cache scales with context length and can rival or exceed the weights once you push past 32 K tokens. The table below is for Q4_K_M at a modest 4–8 K context, which covers most editor sessions.
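As a sanity check, here is a back-of-envelope estimate for the sweet-spot case. The architecture numbers (64 layers, 8 KV heads, 128-dim heads) are illustrative values for a 32 B-class model rather than figures from any spec sheet, and the 0.6 bytes-per-parameter factor is the rough Q4_K_M average:

```sh
# Rough VRAM estimate for a 32B model at Q4_K_M with an 8K context
# (illustrative architecture values -- check the model card for the real ones)
PARAMS_B=32 LAYERS=64 KV_HEADS=8 HEAD_DIM=128 CONTEXT=8192

# weights: ~0.6 bytes per parameter
echo "weights  ~ $(( PARAMS_B * 6 / 10 )) GB"

# KV cache: 2 (K and V) x layers x kv_heads x head_dim x 2 bytes (fp16) x context tokens
echo "kv cache ~ $(( 2 * LAYERS * KV_HEADS * HEAD_DIM * 2 * CONTEXT / 1000000000 )) GB"
```

With these numbers that is roughly 19 GB of weights plus about 2 GB of KV cache, comfortably inside 24 GB. Rerun it with CONTEXT=131072 and the KV term alone climbs past 30 GB, which is the long-context trap described in the next paragraph.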
| Tier | Hardware | Models that fit (Q4_K_M) | Speed | What it is good for |
|---|---|---|---|---|
| Entry | M2/M3 16 GB, RTX 3060 12 GB | 7 B–8 B (5–6 GB) | 40–55 tok/s | Autocomplete, single-file chat, commit messages |
| Mid | M2 Pro 32 GB, RTX 4060 Ti 16 GB | 12–14 B (8–10 GB) | 25–40 tok/s | Multi-file edits, code review, refactor with tests |
| Sweet spot | RTX 3090/4090 24 GB, M-series 32 GB | 27–32 B (~22 GB) | 15–20 tok/s | Qwen2.5-Coder 32B, Devstral Small, Mistral Small 3 |
| Workstation | RTX 5090 32 GB, M Ultra 64 GB+ | 70 B (~40 GB, partial offload) | 8–12 tok/s | Llama 3.x 70B class, long-context whole-repo work |
Two surprises catch first-time buyers. First, an 8 B model at Q4_K_M with a 128 K context window needs roughly 20 GB of KV cache on top of the weights — so a model the spec sheet says fits in 6 GB won't fit in 12 GB once you turn on long context. Set OLLAMA_FLASH_ATTENTION=1 to cut KV memory by 30–50%. Second, system RAM is not VRAM-equivalent: Ollama and llama.cpp will silently offload layers to CPU when GPU memory is exhausted, and CPU inference runs 10–100× slower. If your eval rate in ollama run --verbose drops from 30 tok/s to 3 tok/s mid-session, you have spilled out of VRAM.
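A minimal way to watch for both failure modes, assuming an Ollama install with the models used elsewhere on this page; the printed numbers are illustrative, not measured:

```sh
# Start the daemon with flash attention enabled (the env var is read by `ollama serve`)
OLLAMA_FLASH_ATTENTION=1 ollama serve &

# --verbose prints timing stats after each reply; watch the "eval rate" line
ollama run qwen2.5-coder:32b-instruct-q4_K_M --verbose "write a fizzbuzz in Go"
# eval rate: 17.4 tokens/s      <- healthy 24 GB GPU territory (illustrative number)

# If that rate collapses mid-session, check how much of the model actually sits on the GPU
ollama ps
# PROCESSOR column: "100% GPU" is what you want; any CPU share means you have spilled out of VRAM
```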
Which model should I run for coding?
Two open-weights families dominate in May 2026: Qwen 2.5 Coder from Alibaba (the default for general coding work) and DeepSeek (better on reasoning- and algorithm-heavy tasks under an MIT licence). Codestral fits a niche for fast inline completion on smaller VRAM. The picks below assume Q4_K_M; bump to Q5_K_M if you have ~20% extra VRAM and you can spot the quality difference.
- `qwen2.5-coder:32b-instruct-q4_K_M`: the default for serious local coding. 73.7 Aider, 128K context, 92 languages. Needs 24 GB VRAM. Apache 2.0.
- `qwen2.5-coder:14b-instruct-q4_K_M`: 8–10 GB VRAM. The honest middle. Strong autocomplete plus chat that handles real refactors.
- `qwen2.5-coder:7b-instruct-q4_K_M`: 5–6 GB VRAM. The smallest model that still feels useful for code. The autocomplete pick on a 16 GB Mac.
- `deepseek-coder-v2:16b`: MoE — 16 B total / 2.4 B active. Punches above its weight on algorithmic tasks; weaker on idiomatic React or Rails.
- `codestral:22b-v0.1-q4_K_M`: fast inline completion focus, low latency. Good fit for a 16 GB GPU where Qwen 32B will not fit.
- `qwen2.5-coder:1.5b`: tab-completion model only — pair it with a larger chat model. ~3 GB and runs anywhere.
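The tags above are exactly what you hand to Ollama; pulling the fully pinned name avoids the :latest drift covered in the gotchas below (download size for the 32 B quant is approximate):

```sh
# Pull the pinned tags rather than the bare family name
ollama pull qwen2.5-coder:32b-instruct-q4_K_M   # ~20 GB download, 24 GB VRAM class
ollama pull qwen2.5-coder:1.5b                  # the tab-completion sidekick from the list above
ollama list                                     # the exact strings listed here are what editor configs must reference
```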
Two attribution notes worth getting right. Qwen2.5-Coder-32B-Instruct's 73.7 on Aider comes from Alibaba's own release post, and the community has broadly reproduced it on a single 24 GB GPU; treat it as a strong upper bound that drops a few points at lower quantisations. DeepSeek's code work is best framed at the family level rather than any single tag: the open-weights V3 family is MIT-licensed and uses a Mixture-of-Experts architecture, so only a fraction of the parameters are active per token even when the file size on disk looks large.
Ollama, llama.cpp, or LM Studio — which runtime?
All three run the same llama.cpp engine that Georgi Gerganov started in 2023, which is why their inference speed for the same model and quantisation is essentially identical. The differences are the wrapper: Ollama adds a daemon, a model registry, and the OpenAI-compatible HTTP API; LM Studio adds a desktop GUI with a good model browser; raw llama.cpp exposes every flag and gives you the absolute speed ceiling.
| Runtime | Pick when | Trade-off |
|---|---|---|
| Ollama | You want one local endpoint that Continue, Cursor, Open WebUI, Aider, and an n8n workflow can all share. | A thin wrapper — every flag exists, but the daemon hides them by default. |
| llama.cpp | You want every token-per-second the hardware will give, custom build flags, or you are running on exotic accelerators. | You operate it yourself. llama-server, not llama-cli, is the API surface. |
| LM Studio | You want a desktop GUI, a model browser, and a one-click chat — no terminal. | The HTTP server only runs while the app is open; not a fit for a daemon-style workflow. |
The honest default is Ollama. It is the runtime targeted by Open WebUI, Continue, Aider, LangChain examples, and most "run locally" sections of vendor documentation. Pick llama.cpp when you have measured Ollama and want more speed, or LM Studio when the person sitting at the keyboard wants a window with buttons.
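For reference, this is roughly what the llama.cpp route looks like; the GGUF path is a placeholder for whichever file you downloaded, and the flag values are a plausible starting point rather than tuned settings:

```sh
# llama-server is the HTTP-serving binary; it exposes OpenAI-compatible routes under /v1
# -ngl 99 asks for every layer on the GPU, --ctx-size caps the KV cache
llama-server -m ~/models/qwen2.5-coder-32b-instruct-q4_k_m.gguf \
  --ctx-size 8192 -ngl 99 --port 8080

# Same request shape as the Ollama endpoint shown in the next section, just a different port
curl -s http://localhost:8080/v1/models
```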
How do I plug a local model into Cursor, Continue, or Claude Code?
Ollama exposes the OpenAI Chat Completions API at http://localhost:11434/v1/chat/completions. Anything that speaks OpenAI can talk to it; the only caveat is whether the client process can reach localhost.
```sh
curl -s http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen2.5-coder:32b",
    "messages": [{ "role": "user", "content": "Say hi" }],
    "stream": false
  }' | jq -r '.choices[0].message.content'
```

Continue (VS Code, JetBrains)
Continue talks directly to localhost; no tunnel needed. Drop this in ~/.continue/config.yaml and reload the window:
```yaml
models:
  - name: Qwen2.5-Coder 32B
    provider: ollama
    model: qwen2.5-coder:32b
    apiBase: http://localhost:11434
    roles: [chat, edit]
  - name: Qwen2.5-Coder 1.5B (autocomplete)
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles: [autocomplete]
```

Cursor
Cursor's backend runs in a sandbox that cannot see localhost. The fix is an ngrok tunnel plus the OpenAI-compatible base URL in Settings → Models:
```sh
OLLAMA_ORIGINS='*' ollama serve &
ngrok http 11434 --host-header='localhost:11434'
# Cursor → Settings → Models → "Add custom model"
# Base URL: https://<your-tunnel>.ngrok-free.app/v1
# API Key: ollama (any non-empty string)
# Model: qwen2.5-coder:32b (must match `ollama list`)
```

Cursor's remote-indexing and codebase-search features still talk to Cursor's servers even with a local model selected. If you need a fully air-gapped setup, use Continue.
Claude Code and Aider
Claude Code is built around Anthropic's Sonnet/Opus releases and does not currently run against an arbitrary local endpoint. Aider does:
```sh
pip install aider-chat
export OLLAMA_API_BASE=http://localhost:11434
aider --model ollama/qwen2.5-coder:32b
```

When does local beat hosted, and when does it lose?
Local wins on three axes. Privacy: code that lives under an NDA or under EU data-residency rules never leaves the box. Cost at high volume: an autocomplete model running 12 hours a day is roughly free at the margin once the GPU is paid for, while the same volume on a hosted API runs into the low hundreds of dollars per developer per month. Latency for small completions: a 7 B model on local VRAM responds in 50–80 ms first-token, ahead of any remote API once you account for round-trip.
Hosted wins on three different axes. Frontier capability: the Anthropic and OpenAI releases lead local by ~10–20 points on SWE-bench Verified in any given month, and the gap reopens whenever a new release lands. Long-horizon agency: running an agent for an hour with tool calls, retries, and self-correction stresses the model in a way that most open-weights coders have not been tuned for. Context window: Anthropic's 1 M-token tier and Gemini's long-context models still beat what a single 24 GB GPU can practically hold in KV cache. The pragmatic answer is to do both — local for inline edits and review, hosted for the agentic runs you would have walked away from anyway.
Common gotchas
- Wrong model tag: the string in your editor config must match `ollama list` exactly. `qwen2.5-coder` resolves to `:latest`, which may not be the size you pulled. Pin the tag (a quick check is sketched after this list).
- CORS blocks: set `OLLAMA_ORIGINS=*` before starting `ollama serve` if you are calling from a browser-based client (including some IDE extensions running in WebView).
- CPU offload disguised as a slow GPU: `ollama ps` tells you how much of the model is on GPU; if the percentage is below 100%, you are paying CPU-tier latency. Drop a quant level or pick a smaller model.
- Tool-use claims that do not work: some Ollama models advertise function-calling support but produce malformed tool calls under load. Test agent mode on real tasks before trusting the spec sheet.
- Model rot: the `:latest` tag drifts as new quantisations land. Pin `qwen2.5-coder:32b-instruct-q4_K_M` instead of `qwen2.5-coder` in production configs.
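A quick way to run the first three checks from that list, assuming the same Ollama setup used throughout this page:

```sh
# Is the exact tag your editor config references actually present?
ollama list | grep qwen2.5-coder

# Is the loaded model fully on the GPU? The PROCESSOR column should read 100% GPU.
ollama ps

# Restart with permissive CORS if a WebView- or browser-based client is being blocked
OLLAMA_ORIGINS='*' ollama serve
```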
Related reading on env.dev
- Ollama — the runtime page: install per OS, Modelfile, and the OpenAI-compatible API in detail.
- Open WebUI — self-hosted ChatGPT-style UI on top of Ollama: RAG, RBAC, scheduled automations.
- AI & LLM coding model comparison — the hosted side, with current pricing and SWE-bench numbers.
- Agentic coding levels — where local models fit inside Shapiro's autonomy ladder.
- Awesome AI coding — community lists for AI tooling, including local-LLM resources.
Primary sources
- Qwen — Qwen2.5-Coder release post (Aider 73.7, EvalPlus, multi-language coverage).
- ollama/ollama on GitHub (170k+ stars, OpenAI-compatible API at :11434).
- ggml-org/llama.cpp — Georgi Gerganov's inference engine, the layer underneath Ollama and LM Studio.
- Continue — Ollama guide (config schema, autocomplete vs chat roles).
- Aider — terminal coding assistant; the canonical Aider benchmark used by every coder model release post.