Ollama

Local LLM runtime that wraps llama.cpp in a daemon and exposes an OpenAI-compatible API at localhost:11434. Pull a model, point a client.

Quick Install

# Cursor cannot reach localhost; expose Ollama via ngrok
OLLAMA_ORIGINS='*' ollama serve &
ngrok http 11434 --host-header='localhost:11434'

# Cursor → Settings → Models → Add custom model
#   Base URL: https://<your-tunnel>.ngrok-free.app/v1
#   API Key:  ollama   (any non-empty string)
#   Model:    qwen2.5-coder:32b-instruct-q4_K_M

Ollama is a local LLM runtime that wraps llama.cpp in a background daemon, a model registry, and an OpenAI-compatible HTTP API at http://localhost:11434. The repo crossed 170k stars and 2.5 billion model downloads in early 2026, which makes it the de-facto local-AI runtime that Continue, Aider, Open WebUI, LangChain examples, and most "run locally" sections of vendor documentation target by default. The point is the boring API: ollama pull qwen2.5-coder:32b and a curl against /v1/chat/completions is the whole story for most editor integrations.

What is Ollama?

Ollama is the runtime layer between a quantised model file on disk and a client that wants to ask it questions. It does three useful jobs that raw llama.cpp leaves to the operator. It runs as a background service so models load once and stay warm. It pulls and version-tags GGUF model files from a public registry, so ollama pull llama3.1:8b behaves like docker pull. And it exposes the OpenAI Chat Completions schema at localhost:11434/v1, which means most editor integrations work out of the box without a custom adapter. None of those are research breakthroughs — they are the unglamorous packaging that turned llama.cpp from a hobbyist binary into the runtime three quarters of the local-AI tutorials assume.
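
Those three jobs map onto three commands. A quick sketch (the llama3.1:8b tag mirrors the example above; the /v1/models listing is part of the OpenAI-compat surface in current builds, worth verifying on yours):

The three jobs in three commands
# 1. Background service (installers usually register and start this for you)
ollama serve &

# 2. Registry pull, docker-style
ollama pull llama3.1:8b

# 3. OpenAI-compatible HTTP surface
curl -s http://localhost:11434/v1/models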

Install Ollama

Native installers exist for macOS, Linux, and Windows. The Linux script is the smallest surface area and the easiest to audit; the macOS and Windows installers ship a system tray app on top of the same daemon.

Install Ollama
# macOS
brew install --cask ollama
# or download from https://ollama.com/download

# Linux (audit the script first)
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download the installer from https://ollama.com/download/windows

# Verify
ollama --version
curl -s http://localhost:11434
# -> "Ollama is running"

Ollama uses the GPU automatically when one is present — Metal on Apple Silicon, CUDA on Nvidia, and ROCm on recent AMD cards. ollama ps shows which models are loaded and what share is on GPU vs CPU; if the GPU column drops below 100% you have spilled out of VRAM and your tokens per second will collapse.
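
What healthy output looks like, roughly (column layout and the PROCESSOR wording are approximate and vary slightly between Ollama versions):

Check GPU residency with ollama ps
ollama ps
# NAME                                 ID            SIZE     PROCESSOR    UNTIL
# qwen2.5-coder:32b-instruct-q4_K_M    abc123def4    21 GB    100% GPU     4 minutes from now
#
# A split like "25%/75% CPU/GPU" means part of the model sits in system RAM
# and throughput is about to fall off a cliff.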

Pull a coding model

For day-to-day coding in 2026 the default is qwen2.5-coder. The 32B variant scores 73.7 on Aider and runs on a 24 GB GPU at Q4_K_M; the 14B is the honest middle for 12–16 GB cards; the 1.5B is the tab-completion model paired with a larger chat model. See the companion guide Local LLMs for coding for the full hardware-vs-model decision table.

Pull and verify a model
# Pin the quant — :latest drifts as new builds land
ollama pull qwen2.5-coder:32b-instruct-q4_K_M
ollama pull qwen2.5-coder:1.5b
ollama pull deepseek-coder-v2:16b

ollama list
# NAME                                 ID            SIZE      MODIFIED
# qwen2.5-coder:32b-instruct-q4_K_M    abc123def4    18 GB     30 seconds ago
# qwen2.5-coder:1.5b                   ...           986 MB    1 minute ago
# deepseek-coder-v2:16b                ...           8.9 GB    2 minutes ago

ollama run qwen2.5-coder:32b-instruct-q4_K_M --verbose "Write a fizzbuzz in Rust"

The OpenAI-compatible API

Ollama serves the OpenAI Chat Completions API at localhost:11434/v1. Anything that speaks OpenAI — Continue, Aider, the OpenAI Python SDK, the OpenAI Node SDK, n8n, Home Assistant — points at that URL with any non-empty API key and works. There is also a native Ollama API at /api/generate and /api/chat with extra controls (Modelfile system prompts, embeddings, raw mode), but most clients use the OpenAI surface.

Curl against the OpenAI surface
curl -s http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen2.5-coder:32b-instruct-q4_K_M",
    "messages": [
      { "role": "system", "content": "You are a senior Rust engineer." },
      { "role": "user",   "content": "Why is &str preferred over String in arguments?" }
    ],
    "stream": false
  }' | jq -r '.choices[0].message.content'
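
The native /api/chat endpoint takes the same messages array plus an options object for per-request overrides such as temperature and context length; a minimal sketch (field names follow the Ollama API reference, check against your installed version):

Curl against the native API
curl -s http://localhost:11434/api/chat \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen2.5-coder:32b-instruct-q4_K_M",
    "messages": [
      { "role": "user", "content": "Explain the borrow checker in two sentences." }
    ],
    "options": { "temperature": 0.2, "num_ctx": 8192 },
    "stream": false
  }' | jq -r '.message.content'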

Connect Cursor, Continue, or Claude Code to local Ollama

Continue talks directly to localhost. Cursor needs an ngrok tunnel because its backend cannot reach localhost. Claude Code is built around Anthropic models and does not currently target an arbitrary OpenAI-compatible endpoint — Aider is the closest terminal-first equivalent that does. The snippets below show the canonical config for Continue, Cursor, and Aider.

~/.continue/config.yaml
models:
  - name: Qwen2.5-Coder 32B
    provider: ollama
    model: qwen2.5-coder:32b-instruct-q4_K_M
    apiBase: http://localhost:11434
    roles: [chat, edit]
  - name: Qwen2.5-Coder 1.5B
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles: [autocomplete]
Cursor: tunnel + custom OpenAI provider
OLLAMA_ORIGINS='*' ollama serve &
ngrok http 11434 --host-header='localhost:11434'

# Cursor → Settings → Models → Add custom model
#   Base URL: https://<your-tunnel>.ngrok-free.app/v1
#   API Key:  ollama   (any non-empty string)
#   Model:    qwen2.5-coder:32b-instruct-q4_K_M
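
Before touching Cursor's settings, confirm the tunnel actually reaches the daemon. The /v1/models listing is the easiest probe (the Bearer value can be any non-empty string; Ollama ignores it):

Verify the tunnel
curl -s https://<your-tunnel>.ngrok-free.app/v1/models \
  -H 'Authorization: Bearer ollama' | jq .
# The exact tag you paste into Cursor must appear in the "data" array.
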
Aider against Ollama
pip install aider-chat
export OLLAMA_API_BASE=http://localhost:11434
aider --model ollama/qwen2.5-coder:32b-instruct-q4_K_M

Modelfiles — system prompts you can pin

A Modelfile is the Dockerfile of model configuration. It lets you bake a system prompt, a temperature, a stop sequence, or a parameter override into a named model so every client sees the same behaviour without having to set parameters per-request.

Modelfile: a Rust-tuned coder
FROM qwen2.5-coder:32b-instruct-q4_K_M

PARAMETER temperature 0.2
PARAMETER num_ctx 32768

SYSTEM """
You are a senior Rust engineer. Prefer iterators over manual loops, &str over String in arguments, and ? over unwrap.
Return code blocks only when asked. Otherwise reply with prose, then a one-line shell command at the end if relevant.
"""
Build and serve the tuned model
ollama create qwen-rust -f ./Modelfile
ollama run qwen-rust "Refactor this match into ?"
# Now any OpenAI client can pick model: qwen-rust
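
Any OpenAI client can then address the tuned model by name. A sketch using the same curl as before with the model swapped to qwen-rust (no per-request system message needed; the Modelfile bakes it in):

Call the tuned model over the OpenAI surface
curl -s http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen-rust",
    "messages": [
      { "role": "user", "content": "Refactor this match into ?" }
    ],
    "stream": false
  }' | jq -r '.choices[0].message.content'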

Local vs hosted — when to use Ollama

Ollama wins on privacy (the box is the boundary), on autocomplete cost at high volume, and on first-token latency for short completions. It loses on long-horizon agent runs, frontier benchmarks, and long-context whole-repo work — the hosted-model comparison spells out the gap with current numbers. Most teams that run Ollama in earnest run it alongside a hosted coder, not instead of one: Ollama serves inline edits, code review, and offline work; the hosted API serves the agent runs that would have been frontier-bound anyway.

Troubleshooting

  • Tokens per second collapse mid-session: CPU offload has kicked in. Run ollama ps; if GPU is below 100%, drop a quantisation tier or pick a smaller model.
  • CORS error from a browser-based client: set OLLAMA_ORIGINS=* before starting the daemon. The default origin allowlist refuses cross-origin requests.
  • Model Not Found from Cursor: the model field must match ollama list exactly. qwen2.5-coder resolves to :latest, which may not be the size you pulled.
  • Tool-use fails on a model that claims to support it: some GGUF builds advertise function-calling support but emit malformed tool-call JSON under load. Test agent mode on real tasks, not the benchmark page.
  • KV cache eats VRAM: set OLLAMA_FLASH_ATTENTION=1 in the environment to cut KV memory by 30–50%. Useful once you push past 16 K context. See the sketch after this list for where to set daemon environment variables like this and OLLAMA_ORIGINS.
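
OLLAMA_ORIGINS and OLLAMA_FLASH_ATTENTION only take effect if the daemon itself sees them. A sketch of the usual places to set them (the systemd unit name ollama and the launchctl approach follow the Ollama docs; adjust for your install):

Set daemon environment variables
# Linux: the install script registers a systemd service named ollama
sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_ORIGINS=*"
#   Environment="OLLAMA_FLASH_ATTENTION=1"
sudo systemctl restart ollama

# macOS menu-bar app: launchctl setenv OLLAMA_FLASH_ATTENTION 1, then restart the app

# Manual foreground run (any platform)
OLLAMA_ORIGINS='*' OLLAMA_FLASH_ATTENTION=1 ollama serve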

Frequently Asked Questions

What is Ollama?

Ollama is a local LLM runtime built on top of llama.cpp. It runs as a background daemon, ships a model registry so `ollama pull qwen2.5-coder:32b` behaves like `docker pull`, and exposes the OpenAI Chat Completions API at http://localhost:11434/v1. The repo crossed 170k stars and 2.5 billion model downloads in early 2026, which makes it the de-facto local-AI runtime that Continue, Aider, Open WebUI, LangChain examples, and most "run locally" sections of vendor docs target by default.

How do I install Ollama?

macOS: `brew install --cask ollama` or download from ollama.com/download. Linux: `curl -fsSL https://ollama.com/install.sh | sh` (audit the script first). Windows: official installer on ollama.com/download/windows. Verify with `ollama --version` and `curl http://localhost:11434` — the response should be "Ollama is running". Ollama auto-detects Metal on Apple Silicon, CUDA on Nvidia, and ROCm on recent AMD cards.

Which coding model should I run on Ollama in 2026?

Default to qwen2.5-coder:32b-instruct-q4_K_M on a 24 GB GPU — Apache 2.0, 128 K context, 92 languages, 73.7 on Aider per the Qwen release post (Alibaba publishes this as comparable to GPT-4o). Drop to qwen2.5-coder:14b on 12–16 GB cards. Use qwen2.5-coder:1.5b only as the autocomplete companion paired with a larger chat model. DeepSeek-Coder-V2 (MIT) is the pick for algorithmic and math-heavy tasks. Codestral fits a 16 GB card when Qwen 32B will not.

How do I connect Cursor to Ollama?

Cursor cannot reach localhost from its sandboxed backend. Run `ngrok http 11434 --host-header="localhost:11434"`, then in Cursor → Settings → Models → "Add custom model" set Base URL to https://<your-tunnel>.ngrok-free.app/v1, API Key to any non-empty string (Ollama ignores it but Cursor requires the field), and Model to the exact tag from `ollama list`. The /v1 suffix is required — without it the chat completions endpoint cannot be reached.

How is Ollama different from llama.cpp and LM Studio?

All three run the same llama.cpp engine, so inference speed is essentially identical for the same model and quantisation. Ollama adds a background daemon, a model registry, and the OpenAI-compatible API. LM Studio adds a desktop GUI and a model browser but its API only runs while the app is open. Raw llama.cpp gives you every flag and the absolute speed ceiling. Default to Ollama; reach for llama.cpp when you need every flag or the last few percent of throughput; pick LM Studio when the user at the keyboard wants a window with buttons.

What does the OpenAI-compatible API look like?

POST http://localhost:11434/v1/chat/completions with the standard OpenAI JSON body — model, messages, stream, temperature — and Ollama returns the standard response shape. Authorization: any non-empty Bearer token works (Ollama does not authenticate by default; restrict by network instead). The native Ollama API at /api/generate and /api/chat exposes extras (Modelfile system prompts, embeddings, raw mode), but most clients use the OpenAI surface.

Why does my local model slow down mid-session?

Ollama silently offloads layers to CPU when GPU memory is exhausted, and CPU inference runs 10–100× slower. Run `ollama ps` — if the GPU column is below 100%, you have spilled out of VRAM. The fix is either a smaller model, a lower quant (Q4_K_M to Q4_K_S), or `OLLAMA_FLASH_ATTENTION=1` to cut KV-cache memory by 30–50% once context goes past 16 K tokens.