
Harness Engineering for AI Coding Agents

Build everything around the LLM — tools, MCP, hooks, evaluators, memory, sandboxes — so an agent ships code instead of just generating it.

Harness engineering is the discipline of building everything around a large language model — tools, context, hooks, evaluators, memory, sandboxes — so it can actually ship code instead of just generating it. The term was coined by Matt Kropp at BCG and developed in detail by Birgitta Böckeler on Martin Fowler's site. Her framing is blunt: Agent = Model + Harness. The model is the same Claude or GPT-class weights every competitor has. The harness is what turns a chat completion into something a senior engineer would let near a production repo.

The shift matters because the bottleneck moved. Anthropic's December 2024 essay Building effective agents argued that the hard parts of an agent are not the model calls — they are the loop, the tools, the recovery paths, and the evaluation. Anthropic later published How Anthropic teams use Claude Code in May 2025 with internal team write-ups ranging from autonomous incident response to non-engineers shipping internal tools end-to-end. Same Claude Sonnet weights everyone else can call. Different harness.

What is "Agent = Model + Harness"?

A model is a stateless function: text in, text out. An agent is a model wrapped in a loop that can read files, run shell commands, hit MCP servers, remember earlier sessions, get rejected by a hook, and try again. Böckeler catalogued the moving parts in her "Exploring Generative AI" memos at Thoughtworks: tool selection, context curation, planning, error recovery, evaluation, and human-in-the-loop checkpoints. None of that lives inside the weights — and none of it gets better automatically when a new model drops. It is engineering work, and it is where most of the differentiation in 2026 sits.
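That decomposition is easier to see in code. Here is a toy sketch of the loop in Python, with a hypothetical `call_model` function standing in for the stateless model API; everything else inside `run_agent` is harness:

```python
from pathlib import Path

# One tool: read a file. Everything in run_agent except call_model is harness.
def read_file(path: str) -> str:
    p = Path(path)
    return p.read_text() if p.is_file() else f"ERROR: {path} does not exist"

def run_agent(task: str, call_model, max_turns: int = 10) -> str:
    # The transcript is the only state the stateless model ever sees.
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = call_model(transcript)         # text in, text out
        if reply.get("tool") == "read_file":
            result = read_file(reply["path"])  # harness executes the tool
            transcript.append({"role": "tool", "content": result})
        else:
            return reply["content"]            # model claims it is done
    return "gave up: turn budget exhausted"
```

Even at this size the harness decisions are visible: how errors are fed back as text, how many turns the model gets, and what the transcript contains each turn.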

The same model behind Cursor's agent mode, Claude Code, OpenAI's Codex CLI (open-sourced April 2025), Devin, and a dozen internal corporate tools produces wildly different output because each one has a different harness. That is the whole reason "harness engineering" landed as a term: model evaluations are public, but harnesses are proprietary, and the gap between two products using the same Claude Sonnet release can be larger than the gap between two model generations.

Tools and MCP — what the agent can actually do

A model with no tools can only emit text. A model with file read, file write, shell, web fetch, and a typed codebase indexer can do real work. Tool design is where most harnesses leak quality: too few tools and the agent invents file paths; too many overlapping tools and it picks the wrong one; ambiguous tool descriptions and it argues with itself for three turns before doing anything. Anthropic's open Model Context Protocol (announced November 2024) standardised the wire format, which is why an MCP server for Postgres works in Claude Code, Cursor, VS Code agent mode, and Zed without rewrites. See Claude Code MCP servers for the configuration end of this.
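The tool shape MCP standardises is roughly a name, a natural-language description, and a JSON Schema for inputs. A hand-rolled sketch of that shape (for illustration, not the official SDK) shows why descriptions carry so much weight: the description is all the model has to choose by, and a forgiving dispatcher feeds schema mistakes back as text instead of crashing:

```python
# A tool is metadata the model reads plus a function the harness runs.
# MCP-style shape, hand-rolled for illustration -- not the official SDK.
TOOLS = {
    "read_file": {
        "description": "Read one file from the repo. Use before editing.",
        "inputSchema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
        "fn": lambda args: open(args["path"]).read(),
    },
}

def dispatch(name: str, args: dict) -> str:
    tool = TOOLS.get(name)
    if tool is None:
        # Feed the mistake back as text so the model can correct itself.
        return f"ERROR: unknown tool {name!r}"
    missing = [k for k in tool["inputSchema"]["required"] if k not in args]
    if missing:
        return f"ERROR: missing arguments {missing}"
    return tool["fn"](args)
```

Returning `ERROR: …` strings rather than raising is a deliberate harness choice: the model can only react to what lands in its context.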

Context — the part the harness fights every turn

Context windows grew from 8K tokens in early 2023 to 1M+ in 2025, but harnesses still manage context aggressively because attention degrades with length and tokens cost real money at scale. Cursor's codebase indexer, Claude Code's on-demand Read tool, Aider's repo map, and Sourcegraph Cody's graph context are four different bets on the same problem: which 20K tokens out of a 5M-token monorepo should the model see right now? Get this wrong and the agent hallucinates imports that do not exist; get it right and it edits the correct file on the first try. The deeper treatment is in context management for AI coding.
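A toy version of that "which 20K tokens" decision, using bag-of-words scoring (real indexers use embeddings or code graphs, but the budget-packing shape is the same):

```python
import re

def pick_context(task: str, files: dict[str, str], budget_tokens: int = 20_000) -> list[str]:
    """Greedy sketch: score files by keyword overlap with the task,
    then pack the best-scoring ones under a token budget."""
    words = set(re.findall(r"\w+", task.lower()))
    scored = sorted(
        files,
        key=lambda path: -len(words & set(re.findall(r"\w+", files[path].lower()))),
    )
    picked, used = [], 0
    for path in scored:
        cost = len(files[path]) // 4  # crude ~4-chars-per-token estimate
        if used + cost <= budget_tokens:
            picked.append(path)
            used += cost
    return picked
```

Swapping the scoring function is exactly where the four products above diverge; the budget-packing loop is the part they share.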

Hooks — the harness pushing back on the model

Hooks are deterministic scripts the harness runs before, during, or after a tool call — formatters on every edit, type checks before the agent claims a task is done, secret scanners before a commit, blocklists that refuse certain file paths. They convert vague prompt instructions ("run Biome") into hard guarantees (Biome ran, and if it failed the model was forced to react). Claude Code ships a settings.json hook system, Cursor ships its own, and most internal tooling teams build a thin wrapper of their own. The point of a hook is that the model cannot forget it.

.claude/settings.json — minimal harness hooks

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "pnpm biome format --write" }
        ]
      }
    ],
    "Stop": [
      { "type": "command", "command": "pnpm check && pnpm test" }
    ]
  }
}
```

Evaluators — telling the agent it is wrong

An evaluator is whatever the harness uses to grade an agent step: a unit test, a linter, a type checker, a compile, a snapshot diff, an LLM-as-judge call, or a real human approver. The strong claim from Böckeler and from the SWE-bench results is that cheap, fast, deterministic evaluators beat a smarter model. SWE-bench scores climbed from low single digits when the benchmark launched in late 2023 to the 70%+ range on SWE-bench Verified by late 2025 — not because the underlying models are an order of magnitude smarter, but because harnesses learned to run the test suite, parse the output, and feed failures back. The model is not solving the problem — the loop is.
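The run-parse-feed-back loop is small enough to sketch. A minimal version, assuming hypothetical `generate_patch` and `apply_patch` callables supplied by the rest of the harness:

```python
import subprocess

def evaluate(cmd: list[str]) -> tuple[bool, str]:
    """Run a deterministic evaluator; return (passed, feedback for the model)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode == 0:
        return True, "all checks passed"
    # The tail of the output usually holds the failure summary.
    return False, (proc.stdout + proc.stderr)[-2000:]

def loop_until_green(generate_patch, apply_patch, cmd, max_attempts=5) -> bool:
    feedback = ""
    for _ in range(max_attempts):
        apply_patch(generate_patch(feedback))  # model proposes, harness applies
        passed, feedback = evaluate(cmd)
        if passed:
            return True
    return False
```

The exit code is the whole contract: anything with a reliable zero-on-success convention (pytest, tsc, cargo, a build) can slot into `cmd` unchanged.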

Memory — what survives across sessions

Memory in a harness is anything that persists outside the current context window: project-level CLAUDE.md, Cursor's .cursor/rules/*.mdc files (the legacy single-file .cursorrules still works but has been superseded), a pgvector store of past sessions, or a structured note file the agent rewrites at the end of each turn. Without memory, the agent re-learns the codebase every conversation, costing tokens and time. With memory, it carries forward conventions, past mistakes, and user preferences. The risk is staleness: a memory that says "use the X helper" long after X was deleted is worse than no memory at all, which is why mature harnesses date-stamp memories and verify them before acting on them.
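A sketch of that verify-before-trusting idea, assuming memories are stored as dated entries that name the file they depend on (the field names here are invented for illustration):

```python
import datetime as dt
from pathlib import Path

# Example entry shape (field names invented for illustration):
#   {"note": "use helpers/retry.py for flaky HTTP calls",
#    "written": "2025-06-01", "depends_on_path": "helpers/retry.py"}

def live_memories(memories: list[dict], repo: Path, max_age_days: int = 180) -> list[str]:
    """Drop entries that are too old or whose referenced file is gone."""
    today = dt.date.today()
    keep = []
    for m in memories:
        age = (today - dt.date.fromisoformat(m["written"])).days
        if age > max_age_days:
            continue  # stale by age: re-verify before trusting
        dep = m.get("depends_on_path")
        if dep and not (repo / dep).exists():
            continue  # the file the note points at was deleted
        keep.append(m["note"])
    return keep
```

The filesystem check is the cheap half of verification; a stricter harness would also grep for the symbol the note mentions before injecting it into context.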

Sandboxing — keeping the blast radius small

Agents that can run shell commands can also run rm -rf, git push --force, or curl | sh. Sandboxing is the harness layer that bounds what they can touch: container isolation (Devin, Codex CLI in sandbox mode), filesystem allowlists, network egress controls, per-tool permission prompts, and dry-run modes for destructive operations. Every production case study that mentions an autonomy increase also mentions tightening the sandbox in the same breath — Claude Code's Auto-Accept and Plan modes are explicit dials on the same axis. Trust grows with verification, not with hope.
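A deny-by-default command gate is one cheap layer of this. The string-level sketch below is trivially bypassable on its own and belongs behind a real container or OS-level sandbox, but it illustrates the allowlist-plus-refusal shape:

```python
import shlex

# Deny-by-default: only allowlisted binaries run, and some argument
# patterns are refused even for allowed binaries.
ALLOWED_BINARIES = {"git", "pnpm", "python", "ls", "cat"}
REFUSED_PATTERNS = ("--force", "rm -rf", "| sh", "curl ")

def vet_command(command: str) -> tuple[bool, str]:
    if not command.strip():
        return False, "refused: empty command"
    for pat in REFUSED_PATTERNS:
        if pat in command:
            return False, f"refused: contains {pat!r}"
    binary = shlex.split(command)[0]
    if binary not in ALLOWED_BINARIES:
        return False, f"refused: {binary!r} not on the allowlist"
    return True, "ok"
```

The refusal reason goes back to the model as text, which in practice prompts it to propose a safer alternative rather than stall.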

Same model, different harness — a side-by-side

| Harness | Tools | Context strategy | Built for |
| --- | --- | --- | --- |
| Cursor agent mode | In-editor edits, terminal, embeddings indexer | Codebase embeddings + active file | Pair-programming inside an editor |
| Claude Code | File ops, shell, MCP, hooks, slash commands | On-demand reads + CLAUDE.md memory | Terminal-first autonomous tasks |
| OpenAI Codex CLI | File ops, shell, sandboxed exec | Repo map + selective reads | Open-source reference harness |
| Aider | Edit blocks, repo map, git integration | Repo map + edit-block diffs | Git-native CLI workflows |
| Devin (Cognition) | Browser, shell, IDE — full VM | Persistent VM + long-running plans | Long-horizon autonomous tasks |

A pragmatic order to build a harness

1. Start with one tool. File read. Wire the loop, watch the model use it, watch it fail. Resist adding tools until the failures stop being "could not see the code".
2. Add the cheapest evaluator that exists. A linter, a type checker, a build. Anything deterministic. Pipe the failures back in.
3. Add a single hook. Format on edit. That is it. Nothing more until you have evidence the model is making the same mistake twice.
4. Add memory only after you regret not having it. The first CLAUDE.md should be five bullet points the model keeps re-discovering, not an architecture treatise.
5. Sandbox before raising autonomy. Auto-accept mode without a sandbox is how you delete a database. Container, allowlist, dry-run — pick at least one.
6. Measure regression every time you change the harness. The harness is software. It has bugs. A "helpful" new tool will silently lower task success on a class of problems you forgot to test.
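The last step amounts to keeping a fixed task suite and treating pass rate as a merge gate. A minimal sketch, where the harness is any callable from prompt to output and each task supplies its own grader:

```python
def pass_rate(harness, tasks) -> float:
    """tasks: (prompt, grader) pairs; the grader judges the harness output."""
    results = [grader(harness(prompt)) for prompt, grader in tasks]
    return sum(results) / len(results)

def gate(before: float, after: float, tolerance: float = 0.02) -> bool:
    """Block a harness change if the suite pass rate dropped meaningfully."""
    return after >= before - tolerance
```

The tolerance exists because agent runs are noisy; with a small suite, re-run a few times before trusting a single number in either direction.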

How harness engineering connects to the rest

A good system prompt sets the rules of engagement — system prompts for agentic coding is the prompt-side counterpart to this page. The plan-execute-verify loop is the workflow shape most harnesses end up implementing — see agentic coding workflows. Harness investment without context discipline still loses; context management and Claude Code MCP servers cover the two sides of feeding the agent the right inputs.

Frequently Asked Questions

What is harness engineering?

Harness engineering is the work of building everything around an LLM — tools, context curation, hooks, evaluators, memory, and sandboxing — so an agent can do real engineering work instead of just generating text. The framing comes from Matt Kropp at BCG and Birgitta Böckeler on Martin Fowler's site: Agent = Model + Harness.

Who coined the term "harness engineering"?

Matt Kropp at BCG popularised the term, and Birgitta Böckeler developed it in detail in her "Exploring Generative AI" memos on martinfowler.com, where she frames the equation Agent = Model + Harness.

What's actually in an agent's harness?

A typical harness includes: tool definitions (file ops, shell, MCP servers), context curation (which files to read), hooks (deterministic scripts that run before/after tool calls), evaluators (linters, tests, type checkers, LLM-as-judge), memory (CLAUDE.md, rules files, vector stores), and sandboxing (container isolation, filesystem allowlists, network egress controls).

Why do agents using the same model behave so differently?

Because the harness is doing most of the work. Cursor agent mode, Claude Code, OpenAI Codex CLI, Aider, and Devin can all run on the same Claude Sonnet release and still produce wildly different output — each one curates context, picks tools, and grades the model's answers differently.

How should I start building a harness?

Start with one tool (file read), wire the loop, add the cheapest evaluator that exists (linter, type check, build), add a single hook (format on edit), add memory only when you regret not having it, sandbox before raising autonomy, and measure regression every time you change the harness — it is software, and it has bugs.