Ollama is a local LLM runtime that wraps llama.cpp in a background daemon, a model registry, and an OpenAI-compatible HTTP API at http://localhost:11434. The repo crossed 170k stars and 2.5 billion model downloads in early 2026, making it the de facto local-AI runtime that Continue, Aider, Open WebUI, LangChain examples, and most "run locally" sections of vendor documentation target by default. The point is the boring API: an ollama pull qwen2.5-coder:32b plus a curl against /v1/chat/completions is the whole story for most editor integrations.
What is Ollama?
Ollama is the runtime layer between a quantised model file on disk and a client that wants to ask it questions. It does three useful jobs that raw llama.cpp leaves to the operator. It runs as a background service so models load once and stay warm. It pulls and version-tags GGUF model files from a public registry, so ollama pull llama3.1:8b behaves like docker pull. And it exposes the OpenAI Chat Completions schema at localhost:11434/v1, which means most editor integrations work out of the box without a custom adapter. None of those are research breakthroughs — they are the unglamorous packaging that turned llama.cpp from a hobbyist binary into the runtime three quarters of the local-AI tutorials assume.
Install Ollama
Native installers exist for macOS, Linux, and Windows. The Linux script is the smallest surface area and the easiest to audit; the macOS and Windows installers ship a system tray app on top of the same daemon.
# macOS
brew install --cask ollama
# or download from https://ollama.com/download
# Linux (audit the script first)
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download the installer from https://ollama.com/download/windows
# Verify
ollama --version
curl -s http://localhost:11434
# -> "Ollama is running"Ollama uses the GPU automatically when one is present — Metal on Apple Silicon, CUDA on Nvidia, and ROCm on recent AMD cards. ollama ps shows which models are loaded and what share is on GPU vs CPU; if the GPU column drops below 100% you have spilled out of VRAM and your tokens-per- second will collapse.
Pull a coding model
For day-to-day coding in 2026 the default is qwen2.5-coder. The 32 B variant scores 73.7 on Aider and runs on a 24 GB GPU at Q4_K_M; the 14 B is the honest middle for 12–16 GB cards; the 1.5 B is the tab-completion model paired with a larger chat model. See the companion guide Local LLMs for coding for the full hardware-vs-model decision table.
# Pin the quant — :latest drifts as new builds land
ollama pull qwen2.5-coder:32b-instruct-q4_K_M
ollama pull qwen2.5-coder:1.5b
ollama pull deepseek-coder-v2:16b
ollama list
# NAME ID SIZE MODIFIED
# qwen2.5-coder:32b-instruct-q4_K_M abc123def4 18 GB 30 seconds ago
# qwen2.5-coder:1.5b ... 986 MB 1 minute ago
# deepseek-coder-v2:16b ... 8.9 GB 2 minutes ago
ollama run qwen2.5-coder:32b-instruct-q4_K_M --verbose "Write a fizzbuzz in Rust"

The OpenAI-compatible API
Ollama serves the OpenAI Chat Completions API at localhost:11434/v1. Anything that speaks OpenAI — Continue, Aider, the OpenAI Python SDK, the OpenAI Node SDK, n8n, Home Assistant — points at that URL with any non-empty API key and works. There is also a native Ollama API at /api/generate and /api/chat with extra controls (Modelfile system prompts, embeddings, raw mode), but most clients use the OpenAI surface.
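Driving that endpoint needs nothing beyond the Python standard library. A minimal sketch — the model name assumes the 32 B quant pulled earlier, and the actual request (commented out) assumes a running daemon:

```python
import json
from urllib import request

def build_chat_request(model, messages, stream=False,
                       base_url="http://localhost:11434"):
    """Build an OpenAI-style Chat Completions request for a local Ollama daemon."""
    body = json.dumps({"model": model, "messages": messages,
                       "stream": stream}).encode()
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Ollama ignores the key; OpenAI clients only need it non-empty.
            "Authorization": "Bearer ollama",
        },
    )

req = build_chat_request(
    "qwen2.5-coder:32b-instruct-q4_K_M",
    [{"role": "user", "content": "Why prefer &str over String in arguments?"}],
)
# With a daemon running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The same request expressed as curl: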
curl -s http://localhost:11434/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen2.5-coder:32b-instruct-q4_K_M",
"messages": [
{ "role": "system", "content": "You are a senior Rust engineer." },
{ "role": "user", "content": "Why is &str preferred over String in arguments?" }
],
"stream": false
}' | jq -r '.choices[0].message.content'

Connect Cursor, Continue, or Claude Code to local Ollama
Continue talks directly to localhost. Cursor needs an ngrok tunnel because its backend cannot reach your localhost. Claude Code is built around Anthropic models and does not currently target an arbitrary OpenAI-compatible endpoint — Aider is the closest terminal-first equivalent that does. The snippets below show the canonical config for Continue, the Cursor tunnel, and Aider.
models:
- name: Qwen2.5-Coder 32B
provider: ollama
model: qwen2.5-coder:32b-instruct-q4_K_M
apiBase: http://localhost:11434
roles: [chat, edit]
- name: Qwen2.5-Coder 1.5B
provider: ollama
model: qwen2.5-coder:1.5b
    roles: [autocomplete]

OLLAMA_ORIGINS='*' ollama serve &
ngrok http 11434 --host-header='localhost:11434'
# Cursor → Settings → Models → Add custom model
# Base URL: https://<your-tunnel>.ngrok-free.app/v1
# API Key: ollama (any non-empty string)
# Model: qwen2.5-coder:32b-instruct-q4_K_M

pip install aider-chat
export OLLAMA_API_BASE=http://localhost:11434
aider --model ollama/qwen2.5-coder:32b-instruct-q4_K_M

Modelfiles — system prompts you can pin
A Modelfile is the Dockerfile of model configuration. It lets you bake a system prompt, a temperature, a stop sequence, or a parameter override into a named model so every client sees the same behaviour without having to set parameters per-request.
FROM qwen2.5-coder:32b-instruct-q4_K_M
PARAMETER temperature 0.2
PARAMETER num_ctx 32768
SYSTEM """
You are a senior Rust engineer. Prefer iterators over manual loops, &str over String in arguments, and ? over unwrap.
Return code blocks only when asked. Otherwise reply with prose, then a one-line shell command at the end if relevant.
"""ollama create qwen-rust -f ./Modelfile
ollama run qwen-rust "Refactor this match into ?"
# Now any OpenAI client can pick model: qwen-rust

Local vs hosted — when to use Ollama
Ollama wins on privacy (the box is the boundary), on autocomplete cost at high volume, and on first-token latency for short completions. It loses on long-horizon agent runs, frontier benchmarks, and long-context whole-repo work — the hosted-model comparison spells out the gap with current numbers. Most teams that run Ollama in earnest run it alongside a hosted coder, not instead of one: Ollama serves inline edits, code review, and offline work; the hosted API serves the agent runs that would have been frontier-bound anyway.
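That split can be made mechanical. A hypothetical routing helper — the hosted URL, model names, and the 16K-token threshold are illustrative placeholders, not part of Ollama or any client:

```python
# Hypothetical routing helper: endpoint URLs, model names, and the
# 16K-token threshold are illustrative, not part of Ollama or any client.
LOCAL = {
    "base_url": "http://localhost:11434/v1",
    "model": "qwen2.5-coder:32b-instruct-q4_K_M",
}
HOSTED = {
    "base_url": "https://api.example.com/v1",  # placeholder hosted endpoint
    "model": "frontier-coder",                 # placeholder model name
}

def pick_endpoint(task: str, context_tokens: int) -> dict:
    """Route short, private, latency-sensitive work to the local daemon;
    send long-horizon agent runs and whole-repo context to the hosted API."""
    local_tasks = {"autocomplete", "inline-edit", "review"}
    if task in local_tasks and context_tokens <= 16_000:
        return LOCAL
    return HOSTED
```

Because both sides speak the same OpenAI schema, the client code downstream of the router does not change — only base_url and model do.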
Troubleshooting
- Tokens collapse mid-session: CPU offload kicked in. Run ollama ps; if GPU is below 100%, drop a quantisation tier or pick a smaller model.
- CORS error from a browser-based client: set OLLAMA_ORIGINS=* before starting the daemon. The default origin allowlist refuses cross-origin requests.
- Model Not Found from Cursor: the model field must match ollama list exactly. qwen2.5-coder resolves to :latest, which may not be the size you pulled.
- Tool-use fails on a model that claims to support it: some GGUF builds advertise function-calling support but emit malformed tool-call JSON under load. Test agent mode on real tasks, not the benchmark page.
- KV cache eats VRAM: set OLLAMA_FLASH_ATTENTION=1 in the environment to cut KV memory by 30–50%. Useful once you push past 16K context.
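The first check can be scripted: Ollama's /api/ps endpoint reports size and size_vram (bytes) for each loaded model, so a small helper can flag partial CPU offload. A sketch — calling check_offload assumes a daemon running on the default port:

```python
import json
from urllib import request

def gpu_share(model: dict) -> float:
    """Fraction of a loaded model resident in VRAM, given one entry from
    the "models" list in Ollama's /api/ps response (sizes are in bytes)."""
    size = model.get("size") or 0
    return model.get("size_vram", 0) / size if size else 0.0

def check_offload(base_url="http://localhost:11434"):
    """Warn about any loaded model that has partially spilled to CPU."""
    with request.urlopen(f"{base_url}/api/ps") as resp:
        for m in json.load(resp)["models"]:
            share = gpu_share(m)
            if share < 1.0:
                print(f"{m['name']}: {share:.0%} on GPU, expect slow tokens")
```

This is the same signal ollama ps prints in its GPU column, in a form you can wire into a health check.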