
The Dark Factory Pattern: An Implementation Playbook for AI-Driven Development

A practical playbook for reaching fully autonomous AI-driven development. Covers all six levels (0-5) from manual coding to the dark factory, with concrete actions, examples, and buildable artifacts at every step.

The Dark Factory is a software development model where AI agents autonomously write, test, and ship code — with the lights off. No human writes code. No human reviews code. Specs go in, working software comes out. This guide is a practical implementation playbook that takes you from wherever you are today through every level of AI-driven development, all the way to a fully autonomous dark factory. Each level builds on the last. Each level delivers value on its own.

The Six Levels (0-5)

Dan Shapiro's framework maps AI coding maturity like autonomous driving levels. Most developers are stuck at Level 2 — pair-programming with AI, reviewing every diff, and believing they're faster when studies show they're actually 19% slower. The levels above exist. Teams are operating there today. Here's how to join them.

| Level | Name | You Are… | Key Unlock |
|-------|------|----------|------------|
| 0 | Manual | Writing every character yourself | — |
| 1 | Task Delegation | Handing off isolated tasks | Learning to prompt well |
| 2 | Collaborative | Pair-programming with AI | AGENTS.md + structured context |
| 3 | Human-in-the-Loop | Reviewing diffs, not writing code | Specs + holdout scenarios |
| 4 | Spec-Driven | Writing specs, validating outcomes | Automated evaluation + auto-merge |
| 5 | Dark Factory | Designing systems, not features | Digital twins + full pipeline |
[Figure: the six-level spectrum, L0 Manual → L1 Delegate → L2 Collaborate → L3 Review → L4 Spec-Driven → L5 Dark Factory. Human effort decreases and AI autonomy increases as you move right.]

Where Are You Now?

Be honest. Find the statement that matches your current workflow:

Level 0

I use AI for search or doc lookups, but I write all my code manually.

Level 1

I ask AI to write specific functions, tests, or boilerplate — then I edit the result.

Level 2

I work with AI like a pair partner. We go back and forth. I review and accept most suggestions.

Level 3

AI writes most of the code. I spend my time reading diffs and approving merges.

Level 4

I write specifications. AI implements. I check if scenarios pass, not how the code looks.

Level 5

Specs come from the issue tracker. Code ships without me seeing it. I design systems.

Found yourself? Good. Now read the section for your next level. Each transition below includes what changes, what you build, and a concrete action you can take today.

Level 0 → 1: Start Delegating

The shift: You stop treating AI as a search engine and start treating it as a junior developer who can execute specific, well-defined tasks.

What changes: Instead of writing a test yourself, you describe what needs testing and let the AI write it. Instead of writing boilerplate, you describe the pattern and let the AI generate it. You still own the architecture, the design, and every line that ships.

Try this now

Pick a task you'd normally do by hand — a unit test, a type definition, a data migration script. Write a clear prompt with these four elements:

# Good prompt structure for Level 1

## Context (what exists)
We have a UserService class in src/services/user.ts that has a
createUser(email, name) method. It validates email format, checks
for duplicates against the database, and returns a User object.

## Task (what to do)
Write unit tests for the createUser method.

## Constraints (how to do it)
- Use vitest
- Mock the database layer
- Cover: valid input, invalid email, duplicate email, empty name

## Output (what to produce)
A single test file: src/services/__tests__/user.test.ts

You're done with Level 1 when: You instinctively reach for AI for well-defined tasks and your prompts consistently produce usable output on the first try.

Level 1 → 2: Pair-Program With AI

The shift: You move from handing off isolated tasks to working continuously alongside AI on multi-step features. This is where most "AI-native" developers are today.

What changes: You introduce AGENTS.md — a structured file that gives AI agents persistent context about your project. This is the single highest-leverage thing you can do at any level.

Build this: Your first AGENTS.md

AGENTS.md is an open standard (launched by Google, OpenAI, Factory, Sourcegraph, Cursor) used by 60,000+ projects. It's a README for AI agents — build steps, conventions, and architecture rules in one predictable place. Create one in your project root:

# AGENTS.md

## Project
E-commerce API built with Express + TypeScript + PostgreSQL.

## Directory Structure
src/
├── routes/        # Express route handlers
├── services/      # Business logic (no HTTP concerns)
├── repositories/  # Database queries (no business logic)
├── middleware/     # Auth, logging, error handling
├── models/        # TypeScript types and Zod schemas
└── __tests__/     # Co-located test files

## Build & Test
- Install: `pnpm install`
- Dev: `pnpm dev`
- Test: `pnpm test`
- Lint: `pnpm lint`
- Build: `pnpm build`

## Conventions
- Error handling: always use AppError class, never throw raw errors
- Logging: use the shared logger (src/lib/logger.ts), structured JSON
- Validation: Zod schemas in models/, validated in middleware
- Naming: camelCase files, PascalCase classes, kebab-case routes

## Architecture Rules
- Routes → Services → Repositories (dependency flows inward)
- Services MUST NOT import from routes
- Repositories MUST NOT import from services
- All external API calls go through src/integrations/
- Never use `any` — prefer `unknown` with type narrowing

Make lint errors instructional. Agents respond dramatically better to lint messages that tell them what to do:

| Weak (descriptive) | Strong (instructional) |
|--------------------|------------------------|
| "Service depends on route layer" | "OrderService imports from routes/. Move shared type to models/order.ts" |
| "Missing error handling" | "Unhandled promise. Wrap in try/catch and throw AppError.from(err)" |
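Instructional messages can come from a small custom check rather than a generic linter. A minimal sketch, assuming the layering rules from the AGENTS.md example; the rule list and message templates are illustrative, not a real ESLint rule:

```typescript
// A tiny layering check that emits instructional messages an agent
// can act on directly. Rules mirror the AGENTS.md architecture rules.
type Violation = { file: string; message: string };

const forbidden = [
  { layer: "services", bans: "routes", advice: "Move the shared type to models/ instead." },
  { layer: "repositories", bans: "services", advice: "Lift the logic into the service layer." },
];

// checkImports: given a file path and its import specifiers, return
// violations phrased as instructions, not descriptions.
function checkImports(file: string, imports: string[]): Violation[] {
  const out: Violation[] = [];
  for (const rule of forbidden) {
    if (!file.includes(`/${rule.layer}/`)) continue;
    for (const imp of imports) {
      if (imp.includes(`/${rule.bans}/`) || imp.startsWith(`../${rule.bans}`)) {
        out.push({
          file,
          message: `${file} imports from ${rule.bans}/ — forbidden by AGENTS.md. ${rule.advice}`,
        });
      }
    }
  }
  return out;
}
```

Run in CI, the output tells the agent exactly what to change, so the fix lands in the next iteration instead of the next three.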

You're done with Level 2 when: AI writes most code in a session. You spend your time steering direction, not typing. Your AGENTS.md is a living document that improves agent output daily.

Level 2 → 3: Become the Reviewer

The shift: You stop pair-programming and start managing AI output. The AI works on tasks independently. You review diffs and approve merges. This is the hardest transition — it feels like giving up control.

What changes: You introduce specifications as the input format. Instead of chatting with the AI about what to build, you write a structured spec before the AI starts. The AI implements from the spec. You review the result.

Build this: Your first NLSpec

NLSpec (Natural Language Specification) is structured English with formal constraints — precise enough for agents yet human-readable. The critical rule: agents cannot fill gaps with judgment. Every constraint must be explicit.

# specs/add-password-reset.md
---
type: feature
service: auth
priority: high
---

## Goal
Add email-based password reset flow.

## User Flow
1. User clicks "Forgot password" → enters email → submits
2. Server generates a signed token (JWT, 15-min expiry)
3. Server sends email with reset link containing token
4. User clicks link → enters new password → submits
5. Server validates token, updates password, invalidates token

## Constraints
- Token MUST be single-use (invalidate after successful reset)
- Token MUST expire after 15 minutes
- MUST rate-limit reset requests: max 3 per email per hour
- MUST NOT reveal whether the email exists in the system
- New password MUST meet existing password policy (min 8 chars, 1 number)
- MUST log all reset attempts (success and failure) for audit

## Interfaces
- POST /auth/forgot-password { email: string } → 200 (always, even if email not found)
- GET  /auth/reset-password?token=xxx → render reset form (or 400 if expired/invalid)
- POST /auth/reset-password { token: string, password: string } → 200 or 400

## Non-goals
- Do NOT add SMS-based reset
- Do NOT modify the existing login flow
- Do NOT add "security questions"
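Notice how mechanical the constraints in the spec above are: each one maps to a small piece of code. A minimal sketch of the single-use and 15-minute-expiry constraints, assuming a hypothetical in-memory store (a real implementation would use signed JWTs and persistent storage):

```typescript
// Sketch of two constraints from the spec: tokens expire after
// 15 minutes and are invalidated after a successful reset.
type ResetToken = { email: string; expiresAt: number; used: boolean };

const TOKEN_TTL_MS = 15 * 60 * 1000;
const tokens = new Map<string, ResetToken>();

function issueToken(id: string, email: string, now: number): void {
  tokens.set(id, { email, expiresAt: now + TOKEN_TTL_MS, used: false });
}

// consumeToken enforces both constraints: expired or already-used
// tokens are rejected; a successful reset marks the token used.
function consumeToken(id: string, now: number): "ok" | "expired" | "invalid" {
  const t = tokens.get(id);
  if (!t || t.used) return "invalid";
  if (now > t.expiresAt) return "expired";
  t.used = true;
  return "ok";
}
```

If a constraint cannot be sketched this directly, it is probably too vague for the agent too, which is a useful test while writing the spec.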

The workflow at Level 3

1. You write the spec (specs/add-password-reset.md)
2. You hand it to the coding agent:
   "Implement the feature described in specs/add-password-reset.md.
    Follow AGENTS.md conventions. Run tests before creating a PR."
3. Agent reads spec + AGENTS.md → writes code → runs tests → opens PR
4. You review the PR diff against the spec (not line-by-line code review)
5. You approve or request changes based on spec compliance

Warning: the J-curve. Teams moving from Level 2 to 3 often get slower before getting faster. Writing good specs takes practice. Your first few will be too vague and the agent output will disappoint. This is normal. The investment pays off when specs become reusable templates.

You're done with Level 3 when: You can hand a spec to an agent, walk away, and come back to a PR that mostly matches what you wanted. Your review time per PR drops below 10 minutes.

Level 3 → 4: Stop Reviewing Code

The shift: You replace human code review with automated scenario evaluation. This is the core innovation of the dark factory pattern. You write specs, you write scenarios, but you never look at the code.

What changes: You introduce holdout scenarios — acceptance tests written in plain English that the coding agent never sees. A separate evaluator agent tests the code against these scenarios. The coding agent and the evaluator are strictly isolated from each other.

Build this: Your first holdout scenarios

The critical rule: the coding agent never sees these scenarios. Ever. This is train/test separation — the same principle that prevents overfitting in machine learning. Research confirms reasoning models game specifications they can see, even with explicit counter-instructions.

[Diagram: the wall between agents. The coding agent reads specs/feature.md and the shared AGENTS.md and writes src/; the generated code is the only thing that crosses the wall to the evaluator agent. The evaluator reads scenarios/feature.md (hidden from the coding agent), runs each scenario 3x, and marks the PR auto-merge eligible on a passing rate above the threshold (e.g. 94%). Scenarios never cross the wall.]
# scenarios/password-reset.md (HIDDEN from coding agent)

## Scenario: Happy path
Given a user with email "alice@test.com" exists
When they request a password reset
And they click the reset link within 15 minutes
And they submit a new valid password "NewPass123"
Then their password is updated
And the old password no longer works
And the reset token no longer works

## Scenario: Expired token
Given a user requested a password reset 16 minutes ago
When they click the reset link
Then they see an error "This link has expired"
And their password is unchanged

## Scenario: Email enumeration prevention
Given no user exists with email "nobody@test.com"
When someone requests a password reset for that email
Then the response is 200 OK (same as existing email)
And no email is sent

## Scenario: Rate limiting
Given a user requests password reset 3 times in one hour
When they request a 4th reset
Then the request is rate-limited
And no additional email is sent

## Scenario: Single-use token
Given a user successfully resets their password
When they try to use the same reset link again
Then they see an error "This link has already been used"
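The rate-limiting scenario above maps to a simple sliding-window check. A minimal sketch, assuming a hypothetical in-memory store (a real service would back this with Redis or similar):

```typescript
// Sketch of the behavior the rate-limiting scenario checks:
// at most 3 reset requests per email per rolling hour.
const WINDOW_MS = 60 * 60 * 1000;
const MAX_REQUESTS = 3;
const requests = new Map<string, number[]>();

// Returns true if the request is allowed; false means rate-limited
// (and, per the spec, no email is sent).
function allowResetRequest(email: string, now: number): boolean {
  const recent = (requests.get(email) ?? []).filter((t) => now - t < WINDOW_MS);
  if (recent.length >= MAX_REQUESTS) {
    requests.set(email, recent);
    return false;
  }
  recent.push(now);
  requests.set(email, recent);
  return true;
}
```

The point of the holdout scenario is that the evaluator tests this behavior from the outside, with no knowledge of whether the implementation looks anything like this sketch.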

The evaluation loop

# The Level 4 workflow

1. You write the spec         → specs/add-password-reset.md
2. You write holdout scenarios → scenarios/password-reset.md (hidden)
3. Coding agent implements from spec (never sees scenarios)
4. Coding agent runs build + all tests locally (must pass)
5. Coding agent opens PR
6. Evaluator agent deploys PR to ephemeral environment
7. Evaluator agent tests against holdout scenarios
8. Each scenario runs 3 times (2-of-3 must pass to smooth LLM variance)
9. If ≥90% of scenarios pass → auto-merge eligible
10. You review the pass/fail report, not the code

# File structure
project/
├── AGENTS.md              # Agent context (visible to all agents)
├── specs/                 # Feature specifications (visible to coding agent)
│   └── add-password-reset.md
├── scenarios/             # Holdout scenarios (ONLY visible to evaluator)
│   └── password-reset.md
└── src/                   # Generated code (you don't read this)
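The 2-of-3 rule and the ≥90% gate from the workflow above can be sketched in a few lines (function names are illustrative):

```typescript
// A scenario passes if at least 2 of its 3 runs pass; the 2-of-3
// majority smooths out LLM nondeterminism in any single run.
function scenarioPasses(runs: boolean[]): boolean {
  return runs.filter(Boolean).length >= 2;
}

// The PR is auto-merge eligible when at least 90% of scenarios pass.
function autoMergeEligible(scenarioRuns: boolean[][]): boolean {
  const passed = scenarioRuns.filter(scenarioPasses).length;
  return passed / scenarioRuns.length >= 0.9;
}
```

With 10 scenarios, one full failure still merges (9/10 = 90%); two failures do not.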

When to enable auto-merge

Don't rush this. Collect 20-30 PRs of data where you compare the evaluator's judgment against what you would have decided. Only enable auto-merge when these thresholds hold:

| Threshold | Definition | Target |
|-----------|------------|--------|
| Scenario pass rate | Holdout scenarios must pass | ≥ 90% |
| False positive rate | Evaluator passes code you would reject | < 5% |
| Override rate | You override evaluator decisions | < 10% |
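The calibration check itself is easy to automate once you log both verdicts per PR. A minimal sketch, with hypothetical field names:

```typescript
// One record per PR: what the evaluator decided, what you would have
// decided, and whether you overrode the evaluator.
type PrRecord = { evaluatorPassed: boolean; humanWouldPass: boolean; overridden: boolean };

// Enable auto-merge only with enough evidence and both rates in range.
function readyForAutoMerge(history: PrRecord[]): boolean {
  if (history.length < 20) return false; // not enough data yet
  const falsePositives = history.filter((r) => r.evaluatorPassed && !r.humanWouldPass).length;
  const overrides = history.filter((r) => r.overridden).length;
  return falsePositives / history.length < 0.05 && overrides / history.length < 0.1;
}
```

Keep logging after enabling auto-merge; if the rates drift back over the thresholds, turn it off and fix the scenarios.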

You're done with Level 4 when: You write specs in the morning, review pass/fail reports in the evening, and never open a diff. Your role is specification author and system designer.

Level 4 → 5: The Dark Factory

The shift: You remove yourself from the loop entirely. Specs come from the issue tracker. Agents pick them up, implement, evaluate, and ship. You design the system — the factory itself — not individual features.

What changes: You introduce digital twins and full pipeline automation. Digital twins are behavioral clones of external services — Okta, Stripe, Slack, your database — that let agents test in complete isolation. Pipeline automation connects your issue tracker to the coding agent to the evaluator to deployment.

The digital twin concept

Creating high-fidelity clones of external services was always possible but never economically feasible. AI inverts this: agents build digital twins by analyzing public API documentation and producing self-contained binaries.

# Example: digital twin of Stripe

## What it does
- Implements the full Stripe API surface (charges, customers, subscriptions)
- Returns realistic responses matching Stripe's actual response shapes
- Simulates webhook delivery with configurable delays
- Supports error injection (declined cards, rate limits, network failures)
- Runs as a single binary — no external dependencies

## How agents build it
1. Agent reads Stripe API docs (public)
2. Agent generates an API server matching documented endpoints
3. Agent adds state management (in-memory or SQLite)
4. Agent adds configurable failure modes
5. You validate against your existing integration tests

## Why it matters
- Tests run in milliseconds, not seconds
- No API keys, no rate limits, no costs
- Test failure modes that are impossible to trigger on real Stripe
- Deterministic — same input always produces same output
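To make the idea concrete, here is a toy sketch of the error-injection property, using a hypothetical in-memory stand-in for a payments API (the shapes are illustrative, not Stripe's real API):

```typescript
// A toy digital twin of a payments API with configurable failure
// injection — the property that makes twins more testable than the
// real service.
type ChargeResult = { status: "succeeded" | "declined" | "rate_limited"; amount: number };

class PaymentsTwin {
  private failNext: "declined" | "rate_limited" | null = null;

  // Make the next charge fail deterministically — a failure mode
  // that is hard or impossible to trigger against the real service.
  injectFailure(mode: "declined" | "rate_limited"): void {
    this.failNext = mode;
  }

  charge(amount: number): ChargeResult {
    if (this.failNext) {
      const status = this.failNext;
      this.failNext = null; // injection applies to one call only
      return { status, amount };
    }
    return { status: "succeeded", amount };
  }
}
```

A real twin generated from API docs would cover the full endpoint surface and webhook delivery, but the testing contract is the same: deterministic behavior plus on-demand failures.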

The full pipeline

[Diagram: the full pipeline. Issue tracker (Jira / Linear / GitHub Issues) → coding agent (reads AGENTS.md + spec, implements, tests) → ephemeral environment (PR deploys as a container revision) → evaluator agent (runs holdout scenarios 3x each) → auto-merge gate (≥90% pass merges; <90% retries) → production deploy (standard CI/CD pipeline, unchanged) → monitoring (alerts become bug specs for the next cycle). Failures feed back to the coding agent for retry.]

You're at Level 5 when: The factory runs without you. You design new factories for new systems. You spend your time on product strategy, system architecture, and writing the scenarios that define "correct."

Real-World Results

At StrongDM, three engineers built the Attractor system using this pattern: 16,000 lines of Rust, 9,500 lines of Go, and 6,700 lines of TypeScript — from three markdown specification files. No human wrote code. No human reviewed code.

| Metric | Traditional Dev | Dark Factory |
|--------|-----------------|--------------|
| Team-of-8 output | 8 engineers | 25-30 engineer equivalent |
| Primary bottleneck | Developer time | Spec quality |
| Dominant cost | Salaries | ~$1k/day per engineer-equivalent (tokens) |
| Quality gate | Human code review | Automated scenario evaluation |
| Feedback loop | Hours to days | Minutes |

Quick-Start by Current Level

Pick your current level. Do the action item. Move to the next one.

At Level 0 (30 min): Ask AI to write a unit test for your most complex function. Include context, task, constraints, and output format in your prompt.

At Level 1 (1 hour): Create AGENTS.md in your project root with directory structure, build commands, conventions, and architecture rules.

At Level 2 (2 hours): Write a spec for your next feature using the NLSpec format: Goal, Constraints, Interfaces, Non-goals. Hand it to the agent instead of chatting.

At Level 3 (1 day): Write 5 holdout scenarios for a feature. Have a coding agent implement from spec. Manually run the evaluator. Compare the evaluator's judgment to yours.

At Level 4 (1 week): Build a digital twin for one external service. Connect your issue tracker to the coding agent. Enable auto-merge for low-risk changes.

Pitfalls at Every Level

Vague specs (L3+)

Agents cannot infer intent. "Add auth" will fail. "Add OAuth2 PKCE flow with these 5 constraints" will succeed.

Leaking scenarios (L4+)

If the coding agent sees holdout scenarios, it optimizes for them instead of solving the problem. Strict isolation is non-negotiable.

Skipping AGENTS.md (all)

Without structured context, agents produce generic code regardless of how good your prompts are. This is the foundation for every level.

Premature auto-merge (L4)

Enable only after 20-30 PRs prove evaluator-human alignment. Rushing this erodes trust and introduces bugs.

The J-curve (L2→3)

Teams get slower before getting faster when learning to write specs. The dip is expected. Push through it.

Monolithic specs (L3+)

Large specs overwhelm context windows. One feature per spec. One responsibility per spec.
