An AI dark factory is a software development setup where AI agents pick work from the issue tracker, write the code, test it, and ship it — with no human author and no human reviewing the diff. The term was coined by Dan Shapiro, CEO of Glowforge, in January 2026 in his essay The Five Levels: from Spicy Autocomplete to the Software Factory. The phrase rhymes with what FANUC has run since 1981 in Yamanashi: a factory making robotic arms with the lights off because no humans are on the floor. The software version arrived four decades later, but the analogy is exact — specs in, working software out, nobody reading the code.
This page is the canonical primer. If you want the implementation detail — AGENTS.md, holdout scenarios, evaluator agents, auto-merge thresholds, and the exact file layout — read the five-part Dark Factory Pattern playbook on env.dev. If you want the autonomy ladder underneath the term, read the Shapiro Five Levels. This primer covers what the term means, who coined it, where the FANUC analogy holds, what real teams have actually shipped, and what it costs.
What is an AI dark factory?
At Level 5 in Shapiro's framework, no human reads the code AI produces. Specs and acceptance scenarios go into the system; merged pull requests and deployed releases come out. The core mechanism is train/test separation borrowed from machine learning: a coding agent implements from a spec it can read, a separate evaluator agent grades the result against holdout scenarios it never sees. If the evaluator passes, the PR auto-merges. The human role moves up the stack — from writing code to writing the specs, scenarios, and harness the agents run inside.
That is not a generic "AI did it" claim. The dark factory has three load-bearing properties that cheaper setups lack: agents work from written specifications instead of chat, an isolated evaluator decides merge based on scenarios the coder cannot see, and the production deploy path is unchanged from a human-authored PR. Strip any one of those and you fall back to agentic coding with extra automation, not a dark factory.
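The loop can be sketched in a few lines. Everything here is a toy for illustration — run_coder stands in for an LLM coding agent that sees only the spec, run_evaluator for the isolated grader, and the slugify task is invented — but the structure shows the load-bearing property: the coder never sees the holdout scenarios it is graded against.

```python
SPEC = "Write a function slugify(s) that lowercases s and replaces spaces with hyphens."

# Holdout scenarios: kept out of the coding agent's context entirely.
HOLDOUT = [
    ("Hello World", "hello-world"),
    ("AI Dark Factory", "ai-dark-factory"),
]

def run_coder(spec: str) -> str:
    # Stand-in for the coding agent: reads the spec, returns source text.
    return "def slugify(s):\n    return s.lower().replace(' ', '-')\n"

def run_evaluator(source: str, holdout) -> bool:
    # Evaluator agent: executes the submission against scenarios the
    # coder could not have overfit to, and emits only pass/fail.
    ns: dict = {}
    exec(source, ns)
    return all(ns["slugify"](inp) == want for inp, want in holdout)

submission = run_coder(SPEC)
if run_evaluator(submission, HOLDOUT):
    print("auto-merge")   # same deploy path as a human-authored PR
else:
    print("reject")       # failures route back to the spec author, not to code review
```

Strip the train/test split — let the coder see the holdouts — and the evaluator degrades into a test suite the agent can game, which is exactly the fallback to "agentic coding with extra automation" described above.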
Who coined "AI dark factory"?
Dan Shapiro, in January 2026. His original framing called Level 5 the software factory — a deliberate echo of the SAE autonomous-driving levels, where Level 5 means "no steering wheel." The variant "dark factory" spread because FANUC's lights-off robotics plant in Yamanashi, running unattended since 1981, was the cleanest physical analogue available. Shapiro's post used the word factory; the developer community swapped in dark factory within weeks, and it stuck because it captures the unsettling part — the lights are off because nobody is reading the code.
Two attribution hazards worth getting right. The dark factory is not Matt Kropp's coinage — Kropp owns harness engineering as the term for the scaffolding around the model, which is the prerequisite for getting anywhere near Level 5. And the dark factory is not Birgitta Böckeler's phrase either; her Martin Fowler memos developed Agent = Model + Harness, which is the equation a dark factory operates on top of, but she does not use the term in her writing.
The FANUC analogy — why "dark"?
FANUC opened the first uncrewed manufacturing line in Yamanashi, Japan in 1981. Robotic arms build other robotic arms; the lights stay off because human eyes are not part of the workflow. Output runs continuously for up to 30 days between human visits. The plant is the textbook example in industrial-engineering courses for what full automation can look like when the work is well-specified, the failure modes are observable, and the process is statistically stable enough that human supervision adds cost without adding quality.
The software version inherits all three properties — and breaks down the moment any one is missing. Specs that leave room for judgment are how an agent silently ships the wrong feature. Failure modes that only surface at runtime are how a green PR melts a service. Statistical instability — flaky tests, non-deterministic LLM output, model upgrades mid-cycle — is how a factory that worked yesterday produces a 12-line bug fix that breaks billing today. The pattern is the same as physical lights-out: ruthless on the inputs, paranoid on the evaluator, automated on the deploy, and intolerant of any drift in between.
How is this different from agentic coding?
Agentic coding is broad — any setup where an LLM-backed agent plans, executes, and verifies code changes across files. Cursor agent mode, Claude Code, and GitHub Copilot agent mode are all agentic. Most of those sessions are Level 2 or 3 in the Shapiro ladder: a human is in the loop, reviewing every diff, sometimes pair-programming. An AI dark factory is the specific Level 5 endpoint: spec-driven input, isolated evaluator, no human reviewing the code at all.
| Property | Agentic coding (typical L2–L3) | AI dark factory (L5) |
|---|---|---|
| Input format | Chat or inline prompt | Written spec, separate scenarios |
| Quality gate | Human PR review | Evaluator agent vs holdout scenarios |
| Code review | Every diff, line by line | None — pass/fail report only |
| Bottleneck | Reviewer time | Spec quality and evaluator coverage |
| Dominant cost | Engineer salaries | Tokens (~$1k/day per engineer-equivalent) |
Is this real, or is it 2026 hype?
Real, with caveats. A small number of teams have publicly described production output that only makes sense under a dark-factory or near-dark-factory operating model, and the numbers are specific enough to verify. The list below is the citable set as of 2026 — every other claimed Level 5 story collapses on inspection into Level 3 with extra automation.
StrongDM — Attractor system
3 engineers shipped 16,000 lines of Rust, 9,500 lines of Go, and 6,700 lines of TypeScript from three markdown specs. No human authored or reviewed the code. Token spend: roughly $1k/day per engineer-equivalent of output.
Simon Willison, February 2026
OpenAI — internal coding
Sam Altman has described OpenAI's engineering as ingesting roughly one million lines of code and one billion tokens per day across internal coding agents, a claim repeated publicly through 2026.
OpenAI, public statements 2026
Anthropic — Claude Code
Anthropic reports that ~90% of new Claude Code commits are now authored by Claude Code itself, with humans curating tests, harness rules, and architecture instead of writing the code.
Anthropic, 2026
Spotify — ~650 AI PRs/month
Public engineering communications cite ~650 AI-authored pull requests per month across the platform organisation, with the human role narrowed to spec authoring and evaluator review.
Spotify engineering, 2026
Stripe — ~1,300 AI PRs/week
Stripe has described ~1,300 AI-authored pull requests landing per week across product engineering, an order of magnitude above what Spotify or comparable companies cited a year earlier.
Stripe engineering, 2026
Two patterns hold across every credible case. First, the public numbers describe throughput, not headcount eliminated — these teams are shipping order-of-magnitude more PRs than they could write by hand, while the engineers move into harness work and spec authoring. Second, every team that ships at this volume has invested heavily in harness engineering first; nobody hits Level 5 by upgrading the model.
What does an AI dark factory cost?
StrongDM's public number is the most-cited reference point: roughly $1,000 per day in token spend per engineer-equivalent of code output. Three engineers running the Attractor system produced 25–30 engineers' worth of throughput, so the all-in token bill landed around $25k–30k per day at peak. That figure dominates the cost structure once you remove the salaries the human authors would have drawn — and it scales linearly with throughput, where salaries scale with headcount and onboarding lag.
The deeper economics shift the bottleneck. Spec quality replaces engineer time as the rate-limiting input. Evaluator coverage replaces test maintenance as the place a careless team accumulates risk. Token-cost forecasting becomes a real budgeting line, comparable to cloud spend. None of this is hypothetical — Part 4 of the playbook, Scaling the Factory, covers prompt caching, model routing, and budget gates that bring per-PR token cost down by 4–10x in practice.
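The budgeting arithmetic is simple enough to encode. A minimal sketch using only the figures already cited in this section; the function name and linear-scaling assumption are ours, not StrongDM's published model:

```python
def daily_token_cost(engineer_equivalents: float, usd_per_eng_eq: float = 1_000.0) -> float:
    # Token spend scales linearly with throughput, unlike salaries,
    # which scale with headcount plus onboarding lag.
    return engineer_equivalents * usd_per_eng_eq

# StrongDM reference point: 25-30 engineer-equivalents of output at peak.
low, high = daily_token_cost(25), daily_token_cost(30)
print(f"${low:,.0f}-${high:,.0f} per day")   # $25,000-$30,000 per day
```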
How do you start?
Not by trying to build a dark factory. Climb the autonomy ladder one rung at a time. Most teams sit at Level 2 and can move to Level 3 within a quarter; Level 4 takes another two quarters of harness work; Level 5 is a years-long investment that depends on the rest. The order that works:
- Adopt AGENTS.md in every repo so agents have stable context. Run agents inside a sandboxed dev container so a hallucinated rm -rf cannot reach the host filesystem.
- Write specifications instead of chatting. The spec-driven development guide is the concrete template — Goal, Constraints, Interfaces, Non-goals.
- Add an isolated evaluator with holdout scenarios the coding agent cannot see. Train/test separation is the single most important architectural choice in the entire pattern.
- Invest in the harness — tools, hooks, evaluators, memory, sandboxing. The harness engineering page is the systematic walkthrough.
- Enable auto-merge only after 20–30 PRs of evaluator-vs-human alignment data, with override rate under 10% and false-positive rate under 5%.
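The final step lends itself to a concrete check. A sketch of how a team might score evaluator-vs-human alignment before flipping on auto-merge — the record shape and function name are assumptions for illustration; the sample size and thresholds are the ones named in the list above:

```python
def ready_for_auto_merge(prs: list[dict]) -> bool:
    """prs: one record per PR where both the evaluator and a human ruled,
    e.g. {"evaluator_pass": True, "human_approved": True}."""
    if len(prs) < 20:
        return False  # not enough alignment data yet
    # Override: any disagreement between evaluator and human.
    overrides = sum(p["evaluator_pass"] != p["human_approved"] for p in prs)
    # False positive: evaluator passed a PR the human would have rejected --
    # the dangerous direction once no human is reading the diff.
    false_pos = sum(p["evaluator_pass"] and not p["human_approved"] for p in prs)
    return overrides / len(prs) < 0.10 and false_pos / len(prs) < 0.05
```

The asymmetry is deliberate: a false negative wastes a rerun, but a false positive ships unreviewed wrong code, so it gets the tighter bound.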
The deep playbook (5 parts)
Each part stands on its own. Read them in order if you are starting at Level 2 or 3; jump straight to the relevant part if you already know which rung is blocking you.
Part 1: The implementation playbook
Level 0 → 5 walkthrough with concrete actions per rung.
Part 2: Foundation setup
AGENTS.md, instructional lint errors, hooks, and the first sandbox.
Part 3: Spec-driven development
NLSpec format, holdout scenarios, train/test separation in detail.
Part 4: Scaling the factory
Worktrees, agent teams, prompt caching, model routing, budget gates.
Part 5: Security and governance
OWASP Agentic Top 10, prompt-injection defense, audit trails, settings.json.
Related reading on env.dev
- Agentic Coding Levels — Shapiro's five-level autonomy ladder in detail.
- Harness Engineering — the Kropp/Böckeler scaffolding that makes Level 4–5 possible.
- Agentic Workflows — the broader category of LLM agents that plan, execute, and verify.
- Awesome AI Coding — curated 2026 list of editors, agents, harnesses, and reading.
- Dev Containers — sandbox autonomous agents so a runaway shell command cannot reach the host.
Primary sources
- Dan Shapiro — The Five Levels: from Spicy Autocomplete to the Software Factory (January 2026, the original coinage).
- Simon Willison — How StrongDM's AI Team Built Software Without Looking at the Code (February 2026, the canonical case study).
- Birgitta Böckeler — Exploring Generative AI (Martin Fowler, ongoing — Agent = Model + Harness).
- FANUC Yamanashi production — the lights-out manufacturing analogue, operating since 1981.
- AGENTS.md specification — the open standard for guiding AI coding agents.