The Dark Factory Series
In Part 1, we mapped the six levels from manual coding to the dark factory. In Part 2, you set up your coding agent, wrote a production-grade AGENTS.md, and learned to decompose tasks. Now it's time for the hardest transition in the entire journey: stop reviewing code line-by-line, and start writing specifications that let machines validate themselves.
This is Level 3 → Level 4 — the leap from human-in-the-loop to spec-driven development. By the end of this guide, you'll have a working specification pipeline, holdout scenarios that gate quality without human review, and the confidence to let agents merge their own code.
Why Specifications Are the New Source Code
At Level 2, you pair-program with AI. You watch every diff. You catch mistakes in real time. It works — but it doesn't scale. The moment you try to run multiple agents in parallel, you become the bottleneck. Every PR sits in your review queue. Every context switch costs you 20 minutes.
Spec-driven development flips the model: you describe what should exist, and agents make it exist. Your job shifts from writing and reviewing code to writing specifications and validating outcomes. This isn't vibe coding — it's the opposite. Specifications are precise, testable, and version-controlled. They're the new source of truth.
- Level 2: you write prompts, review every line of output, approve each change manually
- Level 3: you write specs, review diffs against spec requirements, approve merges
- Level 4: you write specs, holdout scenarios validate automatically, agents auto-merge
The industry is converging on this. GitHub shipped Spec Kit — an open-source toolkit for spec-driven workflows adopted by 75,000+ projects. AWS built Kiro, an entire IDE around spec-first development. Thoughtworks flagged SDD as a key engineering practice for 2025. This isn't a fad — it's the natural evolution of AI-assisted development.
Anatomy of a Feature Specification
A specification isn't a user story. It isn't a Jira ticket with three bullet points. A spec is a complete, unambiguous description of what must be true after the agent finishes. If the agent can misinterpret it, it's not specific enough.
Here's a real example. Compare the two approaches:
Bad: vague ticket

Add password reset functionality. Users should be able to reset their password via email.

Good: precise spec
# Feature: Password Reset

## Goal
Allow users to securely reset their password via email link.

## Requirements
1. POST /forgot-password accepts { email: string }
2. MUST generate a cryptographically random token (≥32 bytes, base64url)
3. Token MUST expire after 1 hour
4. Token MUST be single-use
5. POST /reset-password accepts { token, newPassword }
6. Password MUST meet policy: ≥12 chars, 1 upper, 1 digit

## Constraints
- Rate limit: 3 requests/email/hour
- MUST NOT reveal if email exists
- MUST use constant-time comparison

## Edge Cases
- Expired token → 410 Gone
- Used token → 410 Gone
- Invalid email format → 400
- Valid format, unknown email → 200 (silent, no information leak)

## Non-Goals
- OAuth/SSO password reset
- Admin-initiated password reset
- Password history enforcement

The vague ticket leaves the agent to invent token lifetimes, security constraints, error handling, and rate limiting. The spec makes every decision explicit. The agent's job is execution, not judgment.
Two Spec Types: Features vs. Bugs
Not all work is the same, and specs should reflect that. The dark factory uses two distinct spec formats:
Feature Spec
Describes what to build: goal, requirements, constraints, edge cases.
# Feature: Rate Limiting

## Goal
Protect API endpoints from abuse.

## Requirements
1. MUST use sliding window
2. Default: 100 req/min/key
3. MUST return 429 + Retry-After
...

Bug Spec
States only the symptom. Forces the agent to investigate — no solution bias.
# Bug: Duplicate Webhook Events

## Symptom
Customers report receiving the same webhook event 2-3 times within a 60-second window.

## Reproduction
POST /webhooks/test with valid payload → observe delivery logs.

## DO NOT prescribe the fix.

This distinction matters. Feature specs are prescriptive — they tell the agent exactly what to build. Bug specs are diagnostic — they tell the agent what's wrong and let it investigate. Prescribing a fix in a bug spec introduces solution bias that often misses the root cause.
The Spec-Driven Pipeline
Here's the full pipeline from spec to shipped code. This is the architecture that replaces human code review.
The flow is simple: you write the spec → the agent implements it → a separate evaluator runs holdout scenarios → if ≥90% pass, the code auto-merges. If evaluation fails, the agent receives failure details and retries — without ever seeing the scenario source.
The critical insight: the coding agent and the evaluator are separate. The coding agent never sees the scenarios. This enforces a strict train/test separation — the same principle that prevents overfitting in machine learning.
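The retry loop can be sketched in a few lines. Everything here is hypothetical scaffolding — `implement` and `evaluate` stand in for your coding agent and your evaluator, and the 90% threshold matches the gate described above:

```python
# Minimal sketch of the spec → implement → evaluate → retry loop.
# implement() and evaluate() are hypothetical stand-ins for the real agents.

def run_pipeline(spec: str, evaluate, implement, max_retries: int = 3,
                 threshold: float = 0.9) -> dict:
    """Drive the coding agent until the holdout evaluation passes or retries run out."""
    feedback = None
    for attempt in range(1, max_retries + 1):
        # The coding agent sees the spec and prior failure details,
        # never the scenario source.
        branch = implement(spec, feedback)
        result = evaluate(branch)  # runs holdout scenarios against the branch
        if result["pass_rate"] >= threshold:
            return {"merged": True, "attempts": attempt}
        feedback = result["failures"]  # failure details only, no scenario text
    return {"merged": False, "attempts": max_retries}

# Toy usage: a fake "agent" that succeeds on its second attempt.
attempts = {"n": 0}

def fake_implement(spec, feedback):
    attempts["n"] += 1
    return f"branch-{attempts['n']}"

def fake_evaluate(branch):
    ok = branch == "branch-2"
    return {"pass_rate": 1.0 if ok else 0.5,
            "failures": [] if ok else ["expired-token scenario failed"]}

print(run_pipeline("specs/feature-password-reset.md", fake_evaluate, fake_implement))
```

The key design choice is visible in the signature: the only channel back to the coding agent is `feedback`, the evaluator's failure report.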
Holdout Scenarios: The Quality Gate
Holdout scenarios are the core innovation that makes spec-driven development work without human review. Think of them as acceptance tests that the coding agent never sees. They're written by a human (or a separate spec-writing agent), stored outside the codebase, and only accessible to the evaluator.
Scenarios are written in plain-English BDD format. They describe user behavior, not implementation details:
# Scenario: Password Reset — Happy Path
Given a registered user with email "alice@example.com"
When they POST /forgot-password with { "email": "alice@example.com" }
Then the response status is 200
And an email is sent to "alice@example.com" containing a reset link
When they extract the token from the reset link
And POST /reset-password with { "token": "<extracted>", "newPassword": "NewSecure12!" }
Then the response status is 200
And they can log in with the new password
# Scenario: Password Reset — Expired Token
Given a registered user requests a password reset
And 61 minutes have elapsed
When they POST /reset-password with the expired token
Then the response status is 410
And the body contains { "error": "token_expired" }
# Scenario: Password Reset — Information Leak Prevention
Given "unknown@example.com" is not a registered email
When they POST /forgot-password with { "email": "unknown@example.com" }
Then the response status is 200
And no email is sent
# Attacker cannot distinguish registered from unregistered emails

How Evaluation Works
The evaluator is a separate agent that:
- Reads the scenario file (never shared with the coding agent)
- Plans API calls and assertions to test each scenario
- Executes against an ephemeral deployment of the PR branch
- Runs each scenario 3 times — at least 2 of 3 must pass (handles non-determinism)
- Reports overall pass rate and per-scenario results
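The 2-of-3 rule and the overall pass-rate gate are simple to state precisely. A minimal sketch, assuming each scenario's three runs arrive as booleans:

```python
def scenario_passes(runs: list[bool]) -> bool:
    """A scenario passes if at least 2 of its 3 runs succeed (absorbs non-determinism)."""
    return sum(runs) >= 2

def pass_rate(results: dict[str, list[bool]]) -> float:
    """Fraction of scenarios that pass the 2-of-3 rule."""
    passed = sum(scenario_passes(runs) for runs in results.values())
    return passed / len(results)

results = {
    "password-reset-happy":   [True, True, True],
    "password-reset-expired": [True, False, True],   # one flaky run, absorbed
    "info-leak-prevention":   [False, False, True],  # genuine failure
}
rate = pass_rate(results)
print(round(rate, 2))   # → 0.67: 2 of 3 scenarios pass
print(rate >= 0.9)      # → False: below the 90% gate, no auto-merge
```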
| Metric | Threshold | What it means |
|---|---|---|
| Pass Rate | ≥ 90% | Fraction of scenarios that pass (2 of 3 runs each) |
| False Positive Rate | < 5% | Scenarios that pass but shouldn't — detect evaluation bugs |
| Override Rate | < 10% | How often a human overrules the evaluator — measures trust |
When your override rate drops below 10%, you've validated that the evaluator is trustworthy. That's when you enable auto-merge.
Setting Up Your File Structure
The directory layout enforces the separation between what the coding agent sees and what remains hidden:
your-project/
├── specs/ # ← Visible to coding agent
│ ├── feature-password-reset.md
│ ├── feature-rate-limiting.md
│ └── bug-duplicate-webhooks.md
├── scenarios/ # ← HIDDEN from coding agent
│ ├── password-reset-happy.md # Only the evaluator reads these
│ ├── password-reset-edge.md
│ ├── rate-limiting-burst.md
│ └── webhooks-idempotency.md
├── AGENTS.md # ← Agent context (from Part 2)
├── src/ # ← Source code
│ └── ...
└── .github/
└── workflows/
        └── evaluate.yml          # ← CI pipeline for evaluation

Add this rule to your AGENTS.md to enforce the separation:
## Spec-Driven Rules
- MUST read the spec file before starting implementation
- MUST NOT read, reference, or access the scenarios/ directory
- MUST NOT write tests that mirror scenario descriptions
- MUST run build + lint + existing tests before opening a PR
- MUST include the spec file path in the PR description

Writing Specs That Actually Work
Most spec failures come from three root causes. Here's how to avoid each one:
| Failure Mode | Symptom | Fix |
|---|---|---|
| Ambiguity | Agent builds something reasonable but wrong | Add explicit MUST/MUST NOT for every decision point |
| Missing context | Agent ignores existing patterns, duplicates code | Reference specific files: "Follow patterns in src/auth/login.ts" |
| Scope creep | Agent adds unrequested features, over-engineers | Add explicit Non-Goals section |
The Spec Quality Checklist
Before handing a spec to an agent, verify:
- Single responsibility: one feature or bug per spec — never bundle
- Testable requirements: every requirement has an observable, verifiable outcome
- Explicit decisions: no implicit behavior — if the agent must choose, the spec must decide
- File references: point to existing patterns, e.g. "Follow src/models/user.ts"
- Boundary conditions: cover empty inputs, max values, concurrent access, error states
- Non-goals listed: prevent the agent from gold-plating with unrequested features
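Several of these checks are mechanical enough to lint automatically before a spec ever reaches an agent. A sketch — the section names follow the spec template shown earlier; the function itself is an assumption, not an existing tool:

```python
import re

REQUIRED_SECTIONS = ["## Goal", "## Requirements", "## Non-Goals"]

def lint_spec(text: str) -> list[str]:
    """Return a list of problems; an empty list means the mechanical checks pass."""
    problems = []
    # Single responsibility: exactly one top-level title per spec file.
    if len(re.findall(r"^# ", text, flags=re.M)) != 1:
        problems.append("spec must have exactly one top-level title (single responsibility)")
    for section in REQUIRED_SECTIONS:
        if section not in text:
            problems.append(f"missing section: {section}")
    # Explicit decisions: at least one MUST/MUST NOT requirement.
    if not re.search(r"\bMUST\b", text):
        problems.append("no MUST/MUST NOT requirements — decisions left implicit")
    return problems

spec = (
    "# Feature: Rate Limiting\n"
    "## Goal\nProtect API endpoints.\n"
    "## Requirements\n1. MUST use sliding window\n"
)
print(lint_spec(spec))  # → ['missing section: ## Non-Goals']
```

A linter like this can't judge whether the requirements are *right*, but it catches the bundled, sectionless, MUST-free specs that fail most often.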
Digital Twins: Testing Without External Dependencies
At Level 4, agents run scenarios against real APIs. But real APIs have rate limits, cost money, and can't simulate failure modes on demand. The solution: digital twins — behavioral clones of external services that respond exactly like the real thing.
StrongDM's team pioneered this approach. They built digital twins of Okta, Jira, Slack, and Google services by feeding agents the full public API documentation and targeting 100% compatibility with official SDK client libraries.
- Self-contained binaries: each twin runs as a standalone binary — no external dependencies
- State management: twins maintain internal state across request sequences, like the real service
- Failure simulation: trigger rate limits, timeouts, auth failures, and partial outages on demand
- Scale without cost: run thousands of scenarios per hour with zero API bills or rate-limit hits
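A twin doesn't need to be elaborate to demonstrate the state-management and failure-simulation ideas. A toy sketch using an in-memory store instead of a real HTTP server — the charge shape and decline codes are illustrative, not Stripe's actual schema:

```python
import itertools

class ChargesTwin:
    """Toy behavioral twin: in-memory state plus on-demand failure simulation."""

    def __init__(self):
        self._charges = {}
        self._ids = itertools.count(1)
        self.fail_next = None  # set to a decline code to simulate one failure

    def create_charge(self, amount: int, currency: str) -> dict:
        if self.fail_next:  # failure simulation: triggered on demand, then cleared
            code, self.fail_next = self.fail_next, None
            return {"error": {"code": code}}
        charge_id = f"ch_{next(self._ids)}"
        charge = {"id": charge_id, "amount": amount,
                  "currency": currency, "status": "succeeded"}
        self._charges[charge_id] = charge  # state persists across calls
        return charge

    def retrieve_charge(self, charge_id: str) -> dict:
        return self._charges.get(charge_id, {"error": {"code": "resource_missing"}})

twin = ChargesTwin()
ok = twin.create_charge(1999, "usd")
print(twin.retrieve_charge(ok["id"])["status"])          # → succeeded
twin.fail_next = "card_declined"
print(twin.create_charge(500, "usd")["error"]["code"])   # → card_declined
```

A real twin wraps this kind of state machine in an HTTP server and chases SDK compatibility; the core pattern is the same.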
Building a digital twin is itself a spec-driven task. The spec is the service's API documentation. The scenarios are the official SDK test suites. When your twin passes 100% of the SDK's integration tests, it's production-ready.
# Spec: Stripe Digital Twin
## Goal
Behavioral clone of Stripe's Charges API for local scenario testing.
## Requirements
1. MUST implement POST /v1/charges (create)
2. MUST implement GET /v1/charges/:id (retrieve)
3. MUST implement POST /v1/refunds (refund)
4. MUST persist charges in-memory across requests
5. MUST validate API key format (sk_test_*)
6. MUST return identical response shapes to Stripe API docs
7. MUST simulate decline codes: card_declined, insufficient_funds, expired_card
## Compatibility Target
100% pass rate against stripe-node SDK test suite (charges module)
## Non-Goals
- Webhooks, subscriptions, or payment intents
- Persistent storage across process restarts

Building the Evaluation Pipeline
Here's a concrete CI pipeline that runs holdout scenarios against every PR. This is the infrastructure that replaces human review:
# .github/workflows/evaluate.yml
name: Scenario Evaluation

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Deploy ephemeral environment
        run: |
          # Deploy PR branch to a temporary environment
          # Returns the base URL for scenario testing
          echo "BASE_URL=https://pr-${{ github.event.number }}.preview.example.com" >> $GITHUB_ENV

      - name: Run holdout scenarios
        env:
          SCENARIOS_DIR: ./scenarios   # Hidden from coding agent
          BASE_URL: ${{ env.BASE_URL }}
          PASS_THRESHOLD: 0.9          # 90% pass rate required
          RUNS_PER_SCENARIO: 3         # Each scenario runs 3x
        run: |
          # The evaluator agent:
          # 1. Reads each scenario file
          # 2. Plans API calls to test the described behavior
          # 3. Executes against the ephemeral deployment
          # 4. Scores pass/fail for each scenario (2/3 must pass)
          # 5. Reports overall pass rate
          npx evaluate-scenarios \
            --scenarios $SCENARIOS_DIR \
            --base-url $BASE_URL \
            --threshold $PASS_THRESHOLD \
            --runs $RUNS_PER_SCENARIO \
            --output results.json

      - name: Gate merge decision
        env:
          GH_TOKEN: ${{ github.token }}   # the gh CLI needs a token to merge
        run: |
          PASS_RATE=$(jq '.passRate' results.json)
          if (( $(echo "$PASS_RATE >= 0.9" | bc -l) )); then
            echo "✅ Pass rate: ${PASS_RATE} — auto-merge approved"
            gh pr merge ${{ github.event.number }} --auto --squash
          else
            echo "❌ Pass rate: ${PASS_RATE} — below threshold"
            jq '.failures[]' results.json   # Show which scenarios failed
            exit 1
          fi

The key elements: ephemeral deployment per PR, scenario isolation from the coding agent, triple execution for reliability, and a clear pass/fail threshold. Start with human review of evaluator results (Level 3). Enable auto-merge only after your override rate drops below 10% (Level 4).
The Level 3 → 4 Transition: When to Enable Auto-Merge
This is the most consequential decision in the dark factory journey. Auto-merging means no human looks at the code before it ships. You need earned confidence, not blind faith. Here's the progression:
Phase 1: Shadow Mode (2-4 weeks)
Run the evaluator on every PR, but still review manually. Compare your review decisions against the evaluator's gate decision. Track agreement rate.
Phase 2: Evaluator-Advised (2-4 weeks)
Only review PRs that the evaluator flags as failing. Let passing PRs sit for 24h, then review them in batches. Your override rate should be trending down.
Phase 3: Auto-Merge Low-Risk (ongoing)
Enable auto-merge for PRs that pass evaluation AND touch < 5 files AND don't modify auth/payment/infrastructure code. Keep human review for high-risk changes.
Phase 4: Full Auto-Merge
When override rate is < 10% for 30 consecutive days across all PR categories, enable full auto-merge. Monitor continuously.
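The Phase 3 gate is just a conjunction of checks. A sketch — the file-count limit and the protected path prefixes are the ones named above; the function, its inputs, and the assumed directory layout are otherwise hypothetical:

```python
# Assumed repository layout: auth, payments, and infrastructure code
# live under these prefixes.
PROTECTED_PREFIXES = ("src/auth/", "src/payments/", "infra/")

def eligible_for_auto_merge(eval_passed: bool, changed_files: list[str]) -> bool:
    """Phase 3 rule: evaluation passed AND < 5 files AND no high-risk paths."""
    if not eval_passed or len(changed_files) >= 5:
        return False
    return not any(f.startswith(PROTECTED_PREFIXES) for f in changed_files)

print(eligible_for_auto_merge(
    True, ["src/webhooks/dedupe.py", "src/webhooks/test_dedupe.py"]))  # → True
print(eligible_for_auto_merge(True, ["src/auth/reset.py"]))            # → False
```

Everything that fails the predicate falls back to human review, which is exactly the asymmetry Phase 3 wants: the gate can only be too cautious, never too permissive.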
Monitoring After Auto-Merge
Auto-merge doesn't mean autopilot. Track these metrics continuously:
# Daily health dashboard
┌─────────────────────┬──────────┬───────────┐
│ Metric │ Target │ Alert At │
├─────────────────────┼──────────┼───────────┤
│ Scenario pass rate │ ≥ 95% │ < 90% │
│ False positive rate │ < 3% │ > 5% │
│ Rollback rate │ < 2% │ > 5% │
│ Mean time to merge │ < 30min │ > 2hr │
│ Scenario coverage │ ≥ 80% │ < 70% │
│ Agent retry rate │ < 20% │ > 40% │
└─────────────────────┴──────────┴───────────┘

Real-World Results
StrongDM's three-person team operates at Level 5, building Attractor — their open-source coding agent. Two founding rules: code must not be written by humans, and code must not be reviewed by humans. Their repository contains no hand-written code — just three NLSpec markdown files.
| Metric | Traditional (8 engineers) | Dark Factory (3 people) |
|---|---|---|
| Effective output | 8 engineers | 25-30 engineer equivalent |
| Feedback loop | Hours to days | Minutes |
| Quality gate | Human review | Automated scenario eval |
| Bottleneck | Developer time | Spec quality |
| Token cost | $0/day | ~$1,000/day/engineer |
The cost is real — approximately $1,000/day/engineer in LLM tokens. But compare that to the fully loaded cost of 25-30 engineers. The economics work if your specs are good enough to keep the retry rate low.
Pitfalls and How to Avoid Them
Leaking scenarios to the coding agent
If the coding agent can read scenario files, it will overfit — writing code that passes your specific scenarios but fails on real-world inputs. Enforce directory-level access controls.
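One lightweight enforcement, assuming scenarios live in a top-level scenarios/ directory: a CI step that fails when any other tracked file references a scenario path. A minimal sketch — the allowlist for AGENTS.md (which legitimately names the directory in its rules) is an assumption about your setup:

```python
from pathlib import Path

ALLOWED = {"AGENTS.md"}  # the rules file is allowed to name the directory

def find_scenario_leaks(repo_root: str) -> list[str]:
    """Flag files outside scenarios/ that mention the hidden directory."""
    leaks = []
    root = Path(repo_root)
    for path in root.rglob("*"):
        if not path.is_file() or "scenarios" in path.parts or path.name in ALLOWED:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binaries and unreadable files
        if "scenarios/" in text:
            leaks.append(str(path.relative_to(root)))
    return leaks

# In CI: fail the build if the list is non-empty, e.g.
#   import sys; sys.exit(1) if find_scenario_leaks(".") else sys.exit(0)
```

This catches the crude leak (code or tests that read scenario files); directory-level access controls in the agent sandbox remain the real defense.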
Monolithic specs
A 500-line spec covering five features will confuse the agent. One spec = one feature = one PR. Split ruthlessly.
Premature auto-merge
Enabling auto-merge before you trust the evaluator is the fastest way to ship broken code. Shadow mode first, always.
Scenario rot
Scenarios must evolve with the product. If you add a feature but don't add scenarios, your quality gate has a blind spot. Treat scenario coverage like test coverage.
Underspecified edge cases
The agent will take the path of least resistance. If your spec doesn't mention what happens when the database is down, the agent won't handle it. Be paranoid about failure modes.
Your Two-Week Action Plan
You've been at Level 2 since Part 2. Here's how to reach Level 3 in two weeks, with a clear path to Level 4:
| Day | Task | Outcome |
|---|---|---|
| 1-2 | Write your first feature spec (use the template above) | One spec file in specs/ |
| 3 | Write 3-5 holdout scenarios for that feature | Scenarios in scenarios/ directory |
| 4-5 | Hand spec to agent, review output against spec (not line-by-line) | First spec-driven PR |
| 6-7 | Run scenarios manually against the PR branch | Manual evaluation pass/fail |
| 8-9 | Write specs + scenarios for a second feature | Growing spec library |
| 10 | Set up the CI evaluation pipeline | Automated scenario runs on PRs |
| 11-12 | Run 3+ spec-driven PRs through the pipeline in shadow mode | Evaluator agreement data |
| 13-14 | Review override rate, tune scenarios, assess readiness for Phase 2 | Level 3 operational |
Readiness Signals for Level 4
You're ready to enable auto-merge when:
- Override rate < 10% — you almost never disagree with the evaluator
- Scenario coverage ≥ 80% — most features have holdout scenarios
- Rollback rate < 5% — merged code rarely needs to be reverted
- Spec iteration < 3 — specs rarely need more than two rewrites before the agent succeeds
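The four signals combine into a single go/no-go check. A sketch using the thresholds from the list above — the metric names and dict shape are assumptions:

```python
def ready_for_level4(metrics: dict) -> bool:
    """True when all four Level 4 readiness signals clear their thresholds."""
    return (
        metrics["override_rate"] < 0.10        # almost never overrule the evaluator
        and metrics["scenario_coverage"] >= 0.80  # most features have scenarios
        and metrics["rollback_rate"] < 0.05       # merged code rarely reverted
        and metrics["mean_spec_iterations"] < 3   # specs land in <= 2 rewrites
    )

print(ready_for_level4({"override_rate": 0.06, "scenario_coverage": 0.85,
                        "rollback_rate": 0.02, "mean_spec_iterations": 1.8}))  # → True
print(ready_for_level4({"override_rate": 0.15, "scenario_coverage": 0.85,
                        "rollback_rate": 0.02, "mean_spec_iterations": 1.8}))  # → False
```

Because this is a conjunction, one regressing metric is enough to pause auto-merge, which matches the "earned confidence" framing of the transition.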
References
- StrongDM Attractor — open-source NLSpec-driven coding agent for software factories
- How StrongDM's AI Team Build Software Without Looking at the Code — deep dive into the real-world dark factory operation at StrongDM
- GitHub Spec Kit — open-source toolkit for spec-driven development workflows with 75K+ stars
- Spec-Driven Development With AI — GitHub Blog — GitHub's introduction to spec-driven development and Spec Kit
- Spec-Driven Development — Thoughtworks — industry analysis of SDD as a key engineering practice
- Kiro IDE — AWS's spec-driven agentic IDE for prototype-to-production development
- The Dark Factory Pattern — HackerNoon — architectural breakdown of the full dark factory pattern