
The Dark Factory Pattern: An Implementation Playbook for AI-Driven Development

A practical playbook for reaching fully autonomous AI-driven development. Covers all six levels (0-5) from manual coding to the dark factory, with concrete actions, examples, and buildable artifacts at every step.

The Dark Factory is a software development model where AI agents autonomously write, test, and ship code — with the lights off. No human writes code. No human reviews code. Specs go in, working software comes out. This guide is a practical implementation playbook that takes you from wherever you are today through every level of AI-driven development, all the way to a fully autonomous dark factory. Each level builds on the last. Each level delivers value on its own.

The Six Levels (0-5)

Dan Shapiro's framework maps AI coding maturity like autonomous driving levels. Most developers are stuck at Level 2 — pair-programming with AI, reviewing every diff, and believing they're faster when studies show they're actually 19% slower. The levels above exist. Teams are operating there today. Here's how to join them.

| Level | Name | You Are… | Key Unlock |
|-------|------|----------|------------|
| 0 | Manual | Writing every character yourself | — |
| 1 | Task Delegation | Handing off isolated tasks | Learning to prompt well |
| 2 | Collaborative | Pair-programming with AI | AGENTS.md + structured context |
| 3 | Human-in-the-Loop | Reviewing diffs, not writing code | Specs + holdout scenarios |
| 4 | Spec-Driven | Writing specs, validating outcomes | Automated evaluation + auto-merge |
| 5 | Dark Factory | Designing systems, not features | Digital twins + full pipeline |
[Figure: the six-level spectrum, L0 Manual → L1 Delegate → L2 Collaborate → L3 Review → L4 Spec-Driven → L5 Dark Factory. Human effort decreases and AI autonomy increases as you move right.]

Where Are You Now?

Be honest. Find the statement that matches your current workflow:

Level 0

I use AI for search or doc lookups, but I write all my code manually.

Level 1

I ask AI to write specific functions, tests, or boilerplate — then I edit the result.

Level 2

I work with AI like a pair partner. We go back and forth. I review and accept most suggestions.

Level 3

AI writes most of the code. I spend my time reading diffs and approving merges.

Level 4

I write specifications. AI implements. I check if scenarios pass, not how the code looks.

Level 5

Specs come from the issue tracker. Code ships without me seeing it. I design systems.

Found yourself? Good. Now read the section for your next level. Each transition below includes what changes, what you build, and a concrete action you can take today.

Level 0 → 1: Start Delegating

The shift: You stop treating AI as a search engine and start treating it as a junior developer who can execute specific, well-defined tasks.

What changes: Instead of writing a test yourself, you describe what needs testing and let the AI write it. Instead of writing boilerplate, you describe the pattern and let the AI generate it. You still own the architecture, the design, and every line that ships.

Try this now

Pick a task you'd normally do by hand — a unit test, a type definition, a data migration script. Write a clear prompt with these four elements:

# Good prompt structure for Level 1

## Context (what exists)
We have a UserService class in src/services/user.ts that has a
createUser(email, name) method. It validates email format, checks
for duplicates against the database, and returns a User object.

## Task (what to do)
Write unit tests for the createUser method.

## Constraints (how to do it)
- Use vitest
- Mock the database layer
- Cover: valid input, invalid email, duplicate email, empty name

## Output (what to produce)
A single test file: src/services/__tests__/user.test.ts

You're done with Level 1 when: You instinctively reach for AI for well-defined tasks and your prompts consistently produce usable output on the first try.

Level 1 → 2: Pair-Program With AI

The shift: You move from handing off isolated tasks to working continuously alongside AI on multi-step features. This is where most "AI-native" developers are today.

What changes: You introduce AGENTS.md — a structured file that gives AI agents persistent context about your project. This is the single highest-leverage thing you can do at any level.

Build this: Your first AGENTS.md

AGENTS.md is an open standard (launched by Google, OpenAI, Factory, Sourcegraph, Cursor) used by 60,000+ projects. It's a README for AI agents — build steps, conventions, and architecture rules in one predictable place. Create one in your project root:

# AGENTS.md

## Project
E-commerce API built with Express + TypeScript + PostgreSQL.

## Directory Structure
src/
├── routes/        # Express route handlers
├── services/      # Business logic (no HTTP concerns)
├── repositories/  # Database queries (no business logic)
├── middleware/     # Auth, logging, error handling
├── models/        # TypeScript types and Zod schemas
└── __tests__/     # Co-located test files

## Build & Test
- Install: `pnpm install`
- Dev: `pnpm dev`
- Test: `pnpm test`
- Lint: `pnpm lint`
- Build: `pnpm build`

## Conventions
- Error handling: always use AppError class, never throw raw errors
- Logging: use the shared logger (src/lib/logger.ts), structured JSON
- Validation: Zod schemas in models/, validated in middleware
- Naming: camelCase files, PascalCase classes, kebab-case routes

## Architecture Rules
- Routes → Services → Repositories (dependency flows inward)
- Services MUST NOT import from routes
- Repositories MUST NOT import from services
- All external API calls go through src/integrations/
- Never use `any` — prefer `unknown` with type narrowing

Make lint errors instructional. Agents respond dramatically better to lint messages that tell them what to do:

| Weak (descriptive) | Strong (instructional) |
|--------------------|------------------------|
| "Service depends on route layer" | "OrderService imports from routes/. Move shared type to models/order.ts" |
| "Missing error handling" | "Unhandled promise. Wrap in try/catch and throw AppError.from(err)" |
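Instructional messages can come from a small custom check rather than a generic linter. A minimal sketch, assuming the layering rules from the AGENTS.md example; the rule list and message templates are illustrative, not a real ESLint rule:

```typescript
// A tiny layering check that emits instructional messages an agent
// can act on directly. Rules mirror the AGENTS.md architecture rules.
type Violation = { file: string; message: string };

const forbidden = [
  { layer: "services", bans: "routes", advice: "Move the shared type to models/ instead." },
  { layer: "repositories", bans: "services", advice: "Lift the logic into the service layer." },
];

// checkImports: given a file path and its import specifiers, return
// violations phrased as instructions, not descriptions.
function checkImports(file: string, imports: string[]): Violation[] {
  const out: Violation[] = [];
  for (const rule of forbidden) {
    if (!file.includes(`/${rule.layer}/`)) continue;
    for (const imp of imports) {
      if (imp.includes(`/${rule.bans}/`) || imp.startsWith(`../${rule.bans}`)) {
        out.push({
          file,
          message: `${file} imports from ${rule.bans}/ — forbidden by AGENTS.md. ${rule.advice}`,
        });
      }
    }
  }
  return out;
}
```

Run in CI, the output tells the agent exactly what to change, so the fix lands in the next iteration instead of the next three.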

You're done with Level 2 when: AI writes most code in a session. You spend your time steering direction, not typing. Your AGENTS.md is a living document that improves agent output daily.

Level 2 → 3: Become the Reviewer

The shift: You stop pair-programming and start managing AI output. The AI works on tasks independently. You review diffs and approve merges. This is the hardest transition — it feels like giving up control.

What changes: You introduce specifications as the input format. Instead of chatting with the AI about what to build, you write a structured spec before the AI starts. The AI implements from the spec. You review the result.

Build this: Your first NLSpec

NLSpec (Natural Language Specification) is structured English with formal constraints — precise enough for agents yet human-readable. The critical rule: agents cannot fill gaps with judgment. Every constraint must be explicit.

# specs/add-password-reset.md
---
type: feature
service: auth
priority: high
---

## Goal
Add email-based password reset flow.

## User Flow
1. User clicks "Forgot password" → enters email → submits
2. Server generates a signed token (JWT, 15-min expiry)
3. Server sends email with reset link containing token
4. User clicks link → enters new password → submits
5. Server validates token, updates password, invalidates token

## Constraints
- Token MUST be single-use (invalidate after successful reset)
- Token MUST expire after 15 minutes
- MUST rate-limit reset requests: max 3 per email per hour
- MUST NOT reveal whether the email exists in the system
- New password MUST meet existing password policy (min 8 chars, 1 number)
- MUST log all reset attempts (success and failure) for audit

## Interfaces
- POST /auth/forgot-password { email: string } → 200 (always, even if email not found)
- GET  /auth/reset-password?token=xxx → render reset form (or 400 if expired/invalid)
- POST /auth/reset-password { token: string, password: string } → 200 or 400

## Non-goals
- Do NOT add SMS-based reset
- Do NOT modify the existing login flow
- Do NOT add "security questions"
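Notice how mechanical the constraints in the spec above are: each one maps to a small piece of code. A minimal sketch of the single-use and 15-minute-expiry constraints, assuming a hypothetical in-memory store (a real implementation would use signed JWTs and persistent storage):

```typescript
// Sketch of two constraints from the spec: tokens expire after
// 15 minutes and are invalidated after a successful reset.
type ResetToken = { email: string; expiresAt: number; used: boolean };

const TOKEN_TTL_MS = 15 * 60 * 1000;
const tokens = new Map<string, ResetToken>();

function issueToken(id: string, email: string, now: number): void {
  tokens.set(id, { email, expiresAt: now + TOKEN_TTL_MS, used: false });
}

// consumeToken enforces both constraints: expired or already-used
// tokens are rejected; a successful reset marks the token used.
function consumeToken(id: string, now: number): "ok" | "expired" | "invalid" {
  const t = tokens.get(id);
  if (!t || t.used) return "invalid";
  if (now > t.expiresAt) return "expired";
  t.used = true;
  return "ok";
}
```

If a constraint cannot be sketched this directly, it is probably too vague for the agent too, which is a useful test while writing the spec.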

The workflow at Level 3

1. You write the spec (specs/add-password-reset.md)
2. You hand it to the coding agent:
   "Implement the feature described in specs/add-password-reset.md.
    Follow AGENTS.md conventions. Run tests before creating a PR."
3. Agent reads spec + AGENTS.md → writes code → runs tests → opens PR
4. You review the PR diff against the spec (not line-by-line code review)
5. You approve or request changes based on spec compliance

Warning: the J-curve. Teams moving from Level 2 to 3 often get slower before getting faster. Writing good specs takes practice. Your first few will be too vague and the agent output will disappoint. This is normal. The investment pays off when specs become reusable templates.

You're done with Level 3 when: You can hand a spec to an agent, walk away, and come back to a PR that mostly matches what you wanted. Your review time per PR drops below 10 minutes.

Level 3 → 4: Stop Reviewing Code

The shift: You replace human code review with automated scenario evaluation. This is the core innovation of the dark factory pattern. You write specs, you write scenarios, but you never look at the code.

What changes: You introduce holdout scenarios — acceptance tests written in plain English that the coding agent never sees. A separate evaluator agent tests the code against these scenarios. The coding agent and the evaluator are strictly isolated from each other.

Build this: Your first holdout scenarios

The critical rule: the coding agent never sees these scenarios. Ever. This is train/test separation — the same principle that prevents overfitting in machine learning. Research confirms reasoning models game specifications they can see, even with explicit counter-instructions.

[Diagram: the wall between agents. The coding agent reads specs/feature.md and the shared AGENTS.md and writes src/; the generated code is the only thing that crosses the wall to the evaluator agent. The evaluator reads scenarios/feature.md (hidden from the coding agent), runs each scenario 3x, and marks the PR auto-merge eligible on a passing rate above the threshold (e.g. 94%). Scenarios never cross the wall.]
# scenarios/password-reset.md (HIDDEN from coding agent)

## Scenario: Happy path
Given a user with email "alice@test.com" exists
When they request a password reset
And they click the reset link within 15 minutes
And they submit a new valid password "NewPass123"
Then their password is updated
And the old password no longer works
And the reset token no longer works

## Scenario: Expired token
Given a user requested a password reset 16 minutes ago
When they click the reset link
Then they see an error "This link has expired"
And their password is unchanged

## Scenario: Email enumeration prevention
Given no user exists with email "nobody@test.com"
When someone requests a password reset for that email
Then the response is 200 OK (same as existing email)
And no email is sent

## Scenario: Rate limiting
Given a user requests password reset 3 times in one hour
When they request a 4th reset
Then the request is rate-limited
And no additional email is sent

## Scenario: Single-use token
Given a user successfully resets their password
When they try to use the same reset link again
Then they see an error "This link has already been used"
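The rate-limiting scenario above maps to a simple sliding-window check. A minimal sketch, assuming a hypothetical in-memory store (a real service would back this with Redis or similar):

```typescript
// Sketch of the behavior the rate-limiting scenario checks:
// at most 3 reset requests per email per rolling hour.
const WINDOW_MS = 60 * 60 * 1000;
const MAX_REQUESTS = 3;
const requests = new Map<string, number[]>();

// Returns true if the request is allowed; false means rate-limited
// (and, per the spec, no email is sent).
function allowResetRequest(email: string, now: number): boolean {
  const recent = (requests.get(email) ?? []).filter((t) => now - t < WINDOW_MS);
  if (recent.length >= MAX_REQUESTS) {
    requests.set(email, recent);
    return false;
  }
  recent.push(now);
  requests.set(email, recent);
  return true;
}
```

The point of the holdout scenario is that the evaluator tests this behavior from the outside, with no knowledge of whether the implementation looks anything like this sketch.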

The evaluation loop

# The Level 4 workflow

1. You write the spec         → specs/add-password-reset.md
2. You write holdout scenarios → scenarios/password-reset.md (hidden)
3. Coding agent implements from spec (never sees scenarios)
4. Coding agent runs build + all tests locally (must pass)
5. Coding agent opens PR
6. Evaluator agent deploys PR to ephemeral environment
7. Evaluator agent tests against holdout scenarios
8. Each scenario runs 3 times (2-of-3 must pass to smooth LLM variance)
9. If ≥90% of scenarios pass → auto-merge eligible
10. You review the pass/fail report, not the code

# File structure
project/
├── AGENTS.md              # Agent context (visible to all agents)
├── specs/                 # Feature specifications (visible to coding agent)
│   └── add-password-reset.md
├── scenarios/             # Holdout scenarios (ONLY visible to evaluator)
│   └── password-reset.md
└── src/                   # Generated code (you don't read this)
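The 2-of-3 rule and the ≥90% gate from the workflow above can be sketched in a few lines (function names are illustrative):

```typescript
// A scenario passes if at least 2 of its 3 runs pass; the 2-of-3
// majority smooths out LLM nondeterminism in any single run.
function scenarioPasses(runs: boolean[]): boolean {
  return runs.filter(Boolean).length >= 2;
}

// The PR is auto-merge eligible when at least 90% of scenarios pass.
function autoMergeEligible(scenarioRuns: boolean[][]): boolean {
  const passed = scenarioRuns.filter(scenarioPasses).length;
  return passed / scenarioRuns.length >= 0.9;
}
```

With 10 scenarios, one full failure still merges (9/10 = 90%); two failures do not.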

When to enable auto-merge

Don't rush this. Collect 20-30 PRs of data where you compare the evaluator's judgment against what you would have decided. Only enable auto-merge when these thresholds hold:

| Threshold | Definition | Target |
|-----------|------------|--------|
| Scenario pass rate | Holdout scenarios must pass | ≥ 90% |
| False positive rate | Evaluator passes code you would reject | < 5% |
| Override rate | You override evaluator decisions | < 10% |
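The calibration check itself is easy to automate once you log both verdicts per PR. A minimal sketch, with hypothetical field names:

```typescript
// One record per PR: what the evaluator decided, what you would have
// decided, and whether you overrode the evaluator.
type PrRecord = { evaluatorPassed: boolean; humanWouldPass: boolean; overridden: boolean };

// Enable auto-merge only with enough evidence and both rates in range.
function readyForAutoMerge(history: PrRecord[]): boolean {
  if (history.length < 20) return false; // not enough data yet
  const falsePositives = history.filter((r) => r.evaluatorPassed && !r.humanWouldPass).length;
  const overrides = history.filter((r) => r.overridden).length;
  return falsePositives / history.length < 0.05 && overrides / history.length < 0.1;
}
```

Keep logging after enabling auto-merge; if the rates drift back over the thresholds, turn it off and fix the scenarios.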

You're done with Level 4 when: You write specs in the morning, review pass/fail reports in the evening, and never open a diff. Your role is specification author and system designer.

Level 4 → 5: The Dark Factory

The shift: You remove yourself from the loop entirely. Specs come from the issue tracker. Agents pick them up, implement, evaluate, and ship. You design the system — the factory itself — not individual features.

What changes: You introduce digital twins and full pipeline automation. Digital twins are behavioral clones of external services — Okta, Stripe, Slack, your database — that let agents test in complete isolation. Pipeline automation connects your issue tracker to the coding agent to the evaluator to deployment.

The digital twin concept

Creating high-fidelity clones of external services was always possible but never economically feasible. AI inverts this: agents build digital twins by analyzing public API documentation and producing self-contained binaries.

# Example: digital twin of Stripe

## What it does
- Implements the full Stripe API surface (charges, customers, subscriptions)
- Returns realistic responses matching Stripe's actual response shapes
- Simulates webhook delivery with configurable delays
- Supports error injection (declined cards, rate limits, network failures)
- Runs as a single binary — no external dependencies

## How agents build it
1. Agent reads Stripe API docs (public)
2. Agent generates an API server matching documented endpoints
3. Agent adds state management (in-memory or SQLite)
4. Agent adds configurable failure modes
5. You validate against your existing integration tests

## Why it matters
- Tests run in milliseconds, not seconds
- No API keys, no rate limits, no costs
- Test failure modes that are impossible to trigger on real Stripe
- Deterministic — same input always produces same output
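To make the idea concrete, here is a toy sketch of the error-injection property, using a hypothetical in-memory stand-in for a payments API (the shapes are illustrative, not Stripe's real API):

```typescript
// A toy digital twin of a payments API with configurable failure
// injection — the property that makes twins more testable than the
// real service.
type ChargeResult = { status: "succeeded" | "declined" | "rate_limited"; amount: number };

class PaymentsTwin {
  private failNext: "declined" | "rate_limited" | null = null;

  // Make the next charge fail deterministically — a failure mode
  // that is hard or impossible to trigger against the real service.
  injectFailure(mode: "declined" | "rate_limited"): void {
    this.failNext = mode;
  }

  charge(amount: number): ChargeResult {
    if (this.failNext) {
      const status = this.failNext;
      this.failNext = null; // injection applies to one call only
      return { status, amount };
    }
    return { status: "succeeded", amount };
  }
}
```

A real twin generated from API docs would cover the full endpoint surface and webhook delivery, but the testing contract is the same: deterministic behavior plus on-demand failures.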

The full pipeline

[Diagram: the full pipeline. Issue tracker (Jira / Linear / GitHub Issues) → coding agent (reads AGENTS.md + spec, implements, tests) → ephemeral environment (PR deploys as a container revision) → evaluator agent (runs holdout scenarios 3x each) → auto-merge gate (≥90% pass merges; <90% retries) → production deploy (standard CI/CD pipeline, unchanged) → monitoring (alerts become bug specs for the next cycle). Failures feed back to the coding agent for retry.]

You're at Level 5 when: The factory runs without you. You design new factories for new systems. You spend your time on product strategy, system architecture, and writing the scenarios that define "correct."

Real-World Results

At StrongDM, three engineers built the Attractor system using this pattern: 16,000 lines of Rust, 9,500 lines of Go, and 6,700 lines of TypeScript — from three markdown specification files. No human wrote code. No human reviewed code.

| Metric | Traditional Dev | Dark Factory |
|--------|-----------------|--------------|
| Team-of-8 output | 8 engineers | 25-30 engineer equivalent |
| Primary bottleneck | Developer time | Spec quality |
| Dominant cost | Salaries | ~$1k/day per engineer-equivalent (tokens) |
| Quality gate | Human code review | Automated scenario evaluation |
| Feedback loop | Hours to days | Minutes |

Quick-Start by Current Level

Pick your current level. Do the action item. Move to the next one.

At Level 0 (30 min): Ask AI to write a unit test for your most complex function. Include context, task, constraints, and output format in your prompt.

At Level 1 (1 hour): Create AGENTS.md in your project root with directory structure, build commands, conventions, and architecture rules.

At Level 2 (2 hours): Write a spec for your next feature using the NLSpec format: Goal, Constraints, Interfaces, Non-goals. Hand it to the agent instead of chatting.

At Level 3 (1 day): Write 5 holdout scenarios for a feature. Have a coding agent implement from spec. Manually run the evaluator. Compare the evaluator's judgment to yours.

At Level 4 (1 week): Build a digital twin for one external service. Connect your issue tracker to the coding agent. Enable auto-merge for low-risk changes.

Pitfalls at Every Level

Vague specs (L3+)

Agents cannot infer intent. "Add auth" will fail. "Add OAuth2 PKCE flow with these 5 constraints" will succeed.

Leaking scenarios (L4+)

If the coding agent sees holdout scenarios, it optimizes for them instead of solving the problem. Strict isolation is non-negotiable.

Skipping AGENTS.md (all)

Without structured context, agents produce generic code regardless of how good your prompts are. This is the foundation for every level.

Premature auto-merge (L4)

Enable only after 20-30 PRs prove evaluator-human alignment. Rushing this erodes trust and introduces bugs.

The J-curve (L2→3)

Teams get slower before getting faster when learning to write specs. The dip is expected. Push through it.

Monolithic specs (L3+)

Large specs overwhelm context windows. One feature per spec. One responsibility per spec.
