Ollama

Local LLM runtime that wraps llama.cpp in a daemon and exposes an OpenAI-compatible API at localhost:11434. Pull a model, point a client.

Quick Install

# Cursor cannot reach localhost; expose Ollama via ngrok
OLLAMA_ORIGINS='*' ollama serve &
ngrok http 11434 --host-header='localhost:11434'

# Cursor → Settings → Models → Add custom model
#   Base URL: https://<your-tunnel>.ngrok-free.app/v1
#   API Key:  ollama   (any non-empty string)
#   Model:    qwen2.5-coder:32b-instruct-q4_K_M

Ollama is a local LLM runtime that wraps llama.cpp in a background daemon, a model registry, and an OpenAI-compatible HTTP API at http://localhost:11434. The repo crossed 170k stars and 2.5 billion model downloads in early 2026, which makes it the de-facto local-AI runtime that Continue, Aider, Open WebUI, LangChain examples, and most "run locally" sections of vendor documentation target by default. The point is the boring API: ollama pull qwen2.5-coder:32b and a curl against /v1/chat/completions is the whole story for most editor integrations.

What is Ollama?

Ollama is the runtime layer between a quantised model file on disk and a client that wants to ask it questions. It does three useful jobs that raw llama.cpp leaves to the operator. It runs as a background service so models load once and stay warm. It pulls and version-tags GGUF model files from a public registry, so ollama pull llama3.1:8b behaves like docker pull. And it exposes the OpenAI Chat Completions schema at localhost:11434/v1, which means most editor integrations work out of the box without a custom adapter. None of those are research breakthroughs — they are the unglamorous packaging that turned llama.cpp from a hobbyist binary into the runtime three quarters of the local-AI tutorials assume.
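
Those three jobs map onto three commands. A quick sketch (the llama3.1:8b tag mirrors the example above; the /v1/models listing is part of the OpenAI-compat surface in current builds, worth verifying on yours):

The three jobs in three commands
# 1. Background service (installers usually register and start this for you)
ollama serve &

# 2. Registry pull, docker-style
ollama pull llama3.1:8b

# 3. OpenAI-compatible HTTP surface
curl -s http://localhost:11434/v1/models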

Install Ollama

Native installers exist for macOS, Linux, and Windows. The Linux script is the smallest surface area and the easiest to audit; the macOS and Windows installers ship a system tray app on top of the same daemon.

Install Ollama
# macOS
brew install --cask ollama
# or download from https://ollama.com/download

# Linux (audit the script first)
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download the installer from https://ollama.com/download/windows

# Verify
ollama --version
curl -s http://localhost:11434
# -> "Ollama is running"

Ollama uses the GPU automatically when one is present — Metal on Apple Silicon, CUDA on Nvidia, and ROCm on recent AMD cards. ollama ps shows which models are loaded and what share is on GPU vs CPU; if the GPU column drops below 100% you have spilled out of VRAM and your tokens per second will collapse.
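
What healthy output looks like, roughly (column layout and the PROCESSOR wording are approximate and vary slightly between Ollama versions):

Check GPU residency with ollama ps
ollama ps
# NAME                                 ID            SIZE     PROCESSOR    UNTIL
# qwen2.5-coder:32b-instruct-q4_K_M    abc123def4    21 GB    100% GPU     4 minutes from now
#
# A split like "25%/75% CPU/GPU" means part of the model sits in system RAM
# and throughput is about to fall off a cliff.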

Pull a coding model

For day-to-day coding in 2026 the default is qwen2.5-coder. The 32B variant scores 73.7 on Aider and runs on a 24 GB GPU at Q4_K_M; the 14B is the honest middle for 12–16 GB cards; the 1.5B is the tab-completion model paired with a larger chat model. See the companion guide Local LLMs for coding for the full hardware-vs-model decision table.

Pull and verify a model
# Pin the quant — :latest drifts as new builds land
ollama pull qwen2.5-coder:32b-instruct-q4_K_M
ollama pull qwen2.5-coder:1.5b
ollama pull deepseek-coder-v2:16b

ollama list
# NAME                                 ID            SIZE      MODIFIED
# qwen2.5-coder:32b-instruct-q4_K_M    abc123def4    18 GB     30 seconds ago
# qwen2.5-coder:1.5b                   ...           986 MB    1 minute ago
# deepseek-coder-v2:16b                ...           8.9 GB    2 minutes ago

ollama run qwen2.5-coder:32b-instruct-q4_K_M --verbose "Write a fizzbuzz in Rust"

The OpenAI-compatible API

Ollama serves the OpenAI Chat Completions API at localhost:11434/v1. Anything that speaks OpenAI — Continue, Aider, the OpenAI Python SDK, the OpenAI Node SDK, n8n, Home Assistant — points at that URL with any non-empty API key and works. There is also a native Ollama API at /api/generate and /api/chat with extra controls (Modelfile system prompts, embeddings, raw mode), but most clients use the OpenAI surface.

Curl against the OpenAI surface
curl -s http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen2.5-coder:32b-instruct-q4_K_M",
    "messages": [
      { "role": "system", "content": "You are a senior Rust engineer." },
      { "role": "user",   "content": "Why is &str preferred over String in arguments?" }
    ],
    "stream": false
  }' | jq -r '.choices[0].message.content'
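
The native /api/chat endpoint takes the same messages array plus an options object for per-request overrides such as temperature and context length; a minimal sketch (field names follow the Ollama API reference, check against your installed version):

Curl against the native API
curl -s http://localhost:11434/api/chat \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen2.5-coder:32b-instruct-q4_K_M",
    "messages": [
      { "role": "user", "content": "Explain the borrow checker in two sentences." }
    ],
    "options": { "temperature": 0.2, "num_ctx": 8192 },
    "stream": false
  }' | jq -r '.message.content'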

Connect Cursor, Continue, or Claude Code to local Ollama

Continue talks directly to localhost. Cursor needs an ngrok tunnel because its backend cannot reach localhost. Claude Code is built around Anthropic models and does not currently target an arbitrary OpenAI-compatible endpoint — Aider is the closest terminal-first equivalent that does. The snippets below show the canonical config for Continue, Cursor, and Aider.

~/.continue/config.yaml
models:
  - name: Qwen2.5-Coder 32B
    provider: ollama
    model: qwen2.5-coder:32b-instruct-q4_K_M
    apiBase: http://localhost:11434
    roles: [chat, edit]
  - name: Qwen2.5-Coder 1.5B
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles: [autocomplete]
Cursor: tunnel + custom OpenAI provider
OLLAMA_ORIGINS='*' ollama serve &
ngrok http 11434 --host-header='localhost:11434'

# Cursor → Settings → Models → Add custom model
#   Base URL: https://<your-tunnel>.ngrok-free.app/v1
#   API Key:  ollama   (any non-empty string)
#   Model:    qwen2.5-coder:32b-instruct-q4_K_M
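
Before touching Cursor's settings, confirm the tunnel actually reaches the daemon. The /v1/models listing is the easiest probe (the Bearer value can be any non-empty string; Ollama ignores it):

Verify the tunnel
curl -s https://<your-tunnel>.ngrok-free.app/v1/models \
  -H 'Authorization: Bearer ollama' | jq .
# The exact tag you paste into Cursor must appear in the "data" array.
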
Aider against Ollama
pip install aider-chat
export OLLAMA_API_BASE=http://localhost:11434
aider --model ollama/qwen2.5-coder:32b-instruct-q4_K_M

Modelfiles — system prompts you can pin

A Modelfile is the Dockerfile of model configuration. It lets you bake a system prompt, a temperature, a stop sequence, or a parameter override into a named model so every client sees the same behaviour without having to set parameters per-request.

Modelfile: a Rust-tuned coder
FROM qwen2.5-coder:32b-instruct-q4_K_M

PARAMETER temperature 0.2
PARAMETER num_ctx 32768

SYSTEM """
You are a senior Rust engineer. Prefer iterators over manual loops, &str over String in arguments, and ? over unwrap.
Return code blocks only when asked. Otherwise reply with prose, then a one-line shell command at the end if relevant.
"""
Build and serve the tuned model
ollama create qwen-rust -f ./Modelfile
ollama run qwen-rust "Refactor this match into ?"
# Now any OpenAI client can pick model: qwen-rust
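
Any OpenAI client can then address the tuned model by name. A sketch using the same curl as before with the model swapped to qwen-rust (no per-request system message needed; the Modelfile bakes it in):

Call the tuned model over the OpenAI surface
curl -s http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen-rust",
    "messages": [
      { "role": "user", "content": "Refactor this match into ?" }
    ],
    "stream": false
  }' | jq -r '.choices[0].message.content'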

Local vs hosted — when to use Ollama

Ollama wins on privacy (the box is the boundary), on autocomplete cost at high volume, and on first-token latency for short completions. It loses on long-horizon agent runs, frontier benchmarks, and long-context whole-repo work — the hosted-model comparison spells out the gap with current numbers. Most teams that run Ollama in earnest run it alongside a hosted coder, not instead of one: Ollama serves inline edits, code review, and offline work; the hosted API serves the agent runs that would have been frontier-bound anyway.

Troubleshooting

  • Tokens per second collapse mid-session: CPU offload has kicked in. Run ollama ps; if GPU is below 100%, drop a quantisation tier or pick a smaller model.
  • CORS error from a browser-based client: set OLLAMA_ORIGINS=* before starting the daemon. The default origin allowlist refuses cross-origin requests.
  • Model Not Found from Cursor: the model field must match ollama list exactly. qwen2.5-coder resolves to :latest, which may not be the size you pulled.
  • Tool-use fails on a model that claims to support it: some GGUF builds advertise function-calling support but emit malformed tool-call JSON under load. Test agent mode on real tasks, not the benchmark page.
  • KV cache eats VRAM: set OLLAMA_FLASH_ATTENTION=1 in the environment to cut KV memory by 30–50%. Useful once you push past 16 K context. See the sketch after this list for where to set daemon environment variables like this and OLLAMA_ORIGINS.
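
OLLAMA_ORIGINS and OLLAMA_FLASH_ATTENTION only take effect if the daemon itself sees them. A sketch of the usual places to set them (the systemd unit name ollama and the launchctl approach follow the Ollama docs; adjust for your install):

Set daemon environment variables
# Linux: the install script registers a systemd service named ollama
sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_ORIGINS=*"
#   Environment="OLLAMA_FLASH_ATTENTION=1"
sudo systemctl restart ollama

# macOS menu-bar app: launchctl setenv OLLAMA_FLASH_ATTENTION 1, then restart the app

# Manual foreground run (any platform)
OLLAMA_ORIGINS='*' OLLAMA_FLASH_ATTENTION=1 ollama serve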

Frequently Asked Questions

What is Ollama?

Ollama is a local LLM runtime built on top of llama.cpp. It runs as a background daemon, ships a model registry so `ollama pull qwen2.5-coder:32b` behaves like `docker pull`, and exposes the OpenAI Chat Completions API at http://localhost:11434/v1. The repo crossed 170k stars and 2.5 billion model downloads in early 2026, which makes it the de-facto local-AI runtime that Continue, Aider, Open WebUI, LangChain examples, and most "run locally" sections of vendor docs target by default.

How do I install Ollama?

macOS: `brew install --cask ollama` or download from ollama.com/download. Linux: `curl -fsSL https://ollama.com/install.sh | sh` (audit the script first). Windows: official installer on ollama.com/download/windows. Verify with `ollama --version` and `curl http://localhost:11434` — the response should be "Ollama is running". Ollama auto-detects Metal on Apple Silicon, CUDA on Nvidia, and ROCm on recent AMD cards.

Which coding model should I run on Ollama in 2026?

Default to qwen2.5-coder:32b-instruct-q4_K_M on a 24 GB GPU — Apache 2.0, 128 K context, 92 languages, 73.7 on Aider per the Qwen release post (Alibaba publishes this as comparable to GPT-4o). Drop to qwen2.5-coder:14b on 12–16 GB cards. Use qwen2.5-coder:1.5b only as the autocomplete companion paired with a larger chat model. DeepSeek-Coder-V2 (MIT) is the pick for algorithmic and math-heavy tasks. Codestral fits a 16 GB card when Qwen 32B will not.

How do I connect Cursor to Ollama?

Cursor cannot reach localhost from its sandboxed backend. Run `ngrok http 11434 --host-header="localhost:11434"`, then in Cursor → Settings → Models → "Add custom model" set Base URL to https://<your-tunnel>.ngrok-free.app/v1, API Key to any non-empty string (Ollama ignores it but Cursor requires the field), and Model to the exact tag from `ollama list`. The /v1 suffix is required — without it the chat completions endpoint cannot be reached.

How is Ollama different from llama.cpp and LM Studio?

All three run the same llama.cpp engine, so inference speed is essentially identical for the same model and quantisation. Ollama adds a background daemon, a model registry, and the OpenAI-compatible API. LM Studio adds a desktop GUI and a model browser but its API only runs while the app is open. Raw llama.cpp gives you every flag and the absolute speed ceiling. Default to Ollama; reach for llama.cpp when you need every flag or the last few percent of throughput; pick LM Studio when the user at the keyboard wants a window with buttons.

What does the OpenAI-compatible API look like?

POST http://localhost:11434/v1/chat/completions with the standard OpenAI JSON body — model, messages, stream, temperature — and Ollama returns the standard response shape. Authorization: any non-empty Bearer token works (Ollama does not authenticate by default; restrict by network instead). The native Ollama API at /api/generate and /api/chat exposes extras (Modelfile system prompts, embeddings, raw mode), but most clients use the OpenAI surface.

Why does my local model slow down mid-session?

Ollama silently offloads layers to CPU when GPU memory is exhausted, and CPU inference runs 10–100× slower. Run `ollama ps` — if the GPU column is below 100%, you have spilled out of VRAM. The fix is either a smaller model, a lower quant (Q4_K_M to Q4_K_S), or `OLLAMA_FLASH_ATTENTION=1` to cut KV-cache memory by 30–50% once context goes past 16 K tokens.