The Dark Factory Pattern Part 3: Spec-Driven Development

Master spec-driven development: write precise specifications, implement holdout scenarios as quality gates, build evaluation pipelines, and transition from human code review to automated merge gates.

The Dark Factory Series

In Part 1, we mapped the six levels from manual coding to the dark factory. In Part 2, you set up your coding agent, wrote a production-grade AGENTS.md, and learned to decompose tasks. Now it's time for the hardest transition in the entire journey: stop reviewing code line-by-line, and start writing specifications that let machines validate themselves.

This is Level 3 → Level 4 — the leap from human-in-the-loop to spec-driven development. By the end of this guide, you'll have a working specification pipeline, holdout scenarios that gate quality without human review, and the confidence to let agents merge their own code.

Why Specifications Are the New Source Code

At Level 2, you pair-program with AI. You watch every diff. You catch mistakes in real time. It works — but it doesn't scale. The moment you try to run multiple agents in parallel, you become the bottleneck. Every PR sits in your review queue. Every context switch costs you 20 minutes.

Spec-driven development flips the model: you describe what should exist, and agents make it exist. Your job shifts from writing and reviewing code to writing specifications and validating outcomes. This isn't vibe coding — it's the opposite. Specifications are precise, testable, and version-controlled. They're the new source of truth.

Level 2

You write prompts, review every line of output, approve each change manually

Level 3

You write specs, review diffs against spec requirements, approve merges

Level 4

You write specs, holdout scenarios validate automatically, agents auto-merge

The industry is converging on this. GitHub shipped Spec Kit — an open-source toolkit for spec-driven workflows adopted by 75,000+ projects. AWS built Kiro, an entire IDE around spec-first development. Thoughtworks flagged SDD as a key engineering practice for 2025. This isn't a fad — it's the natural evolution of AI-assisted development.

Anatomy of a Feature Specification

A specification isn't a user story. It isn't a Jira ticket with three bullet points. A spec is a complete, unambiguous description of what must be true after the agent finishes. If the agent can misinterpret it, it's not specific enough.

Anatomy of a feature spec:

  1. Goal: what the feature achieves (one sentence)
  2. Context: relevant architecture, existing code, dependencies
  3. Requirements: numbered list of behaviors (MUST / MUST NOT)
  4. Constraints: performance budgets, security rules, compatibility
  5. Edge Cases: boundary conditions, error states, concurrency
  6. Non-Goals: explicitly out of scope, to prevent over-engineering

Here's a real example. Compare the two approaches:

Bad: vague ticket

text
Add password reset functionality.
Users should be able to reset
their password via email.

Good: precise spec

markdown
# Feature: Password Reset

## Goal
Allow users to securely reset
their password via email link.

## Requirements
1. POST /forgot-password accepts
   { email: string }
2. MUST generate a cryptographically
   random token (≥32 bytes, base64url)
3. Token MUST expire after 1 hour
4. Token MUST be single-use
5. POST /reset-password accepts
   { token, newPassword }
6. Password MUST meet policy:
   ≥12 chars, 1 upper, 1 digit

## Constraints
- Rate limit: 3 requests/email/hour
- MUST NOT reveal if email exists
- MUST use constant-time comparison

## Edge Cases
- Expired token → 410 Gone
- Used token → 410 Gone
- Invalid email format → 400
- Valid format, unknown email → 200
  (silent, no information leak)

## Non-Goals
- OAuth/SSO password reset
- Admin-initiated password reset
- Password history enforcement

The vague ticket leaves the agent to invent token lifetimes, security constraints, error handling, and rate limiting. The spec makes every decision explicit. The agent's job is execution, not judgment.
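A spec written this way translates almost mechanically into code. For instance, requirement 2 reduces to a couple of unambiguous lines; a sketch in Python (the spec itself is language-agnostic, and the function name is illustrative):

```python
import secrets

def generate_reset_token(n_bytes: int = 32) -> str:
    """Requirement 2: cryptographically random token, >=32 bytes, base64url-encoded."""
    # token_urlsafe draws from the OS CSPRNG and base64url-encodes the bytes
    return secrets.token_urlsafe(n_bytes)
```

Every other requirement is similarly checkable: expiry, single use, and the password policy all have observable pass/fail outcomes, which is exactly what an automated evaluator needs.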

Two Spec Types: Features vs. Bugs

Not all work is the same, and specs should reflect that. The dark factory uses two distinct spec formats:

Feature Spec

Describes what to build: goal, requirements, constraints, edge cases.

markdown
# Feature: Rate Limiting

## Goal
Protect API endpoints from abuse.

## Requirements
1. MUST use sliding window
2. Default: 100 req/min/key
3. MUST return 429 + Retry-After
...

Bug Spec

States only the symptom. Forces the agent to investigate — no solution bias.

markdown
# Bug: Duplicate Webhook Events

## Symptom
Customers report receiving the same
webhook event 2-3 times within a
60-second window.

## Reproduction
POST /webhooks/test with valid
payload → observe delivery logs.

## DO NOT prescribe the fix.

This distinction matters. Feature specs are prescriptive — they tell the agent exactly what to build. Bug specs are diagnostic — they tell the agent what's wrong and let it investigate. Prescribing a fix in a bug spec introduces solution bias that often misses the root cause.

The Spec-Driven Pipeline

Here's the full pipeline from spec to shipped code. This is the architecture that replaces human code review.

The spec-driven pipeline:

text
SPECIFICATION         CODING AGENT        EVALUATOR            MERGE / SHIP
(human writes)        (agent executes)    (agent validates)    (automated)
Feature spec      →   Read spec       →   Run scenarios    →   Auto-merge
Requirements          Implement           Score pass %         Deploy
Constraints           Self-test           Gate decision
Edge cases                                Report
                          ↑                    │
                          └─── FAIL → RETRY ───┘

The flow is simple: you write the spec → the agent implements it → a separate evaluator runs holdout scenarios → if ≥90% pass, the code auto-merges. If evaluation fails, the agent receives failure details and retries — without ever seeing the scenario source.

The critical insight: the coding agent and the evaluator are separate. The coding agent never sees the scenarios. This enforces a strict train/test separation — the same principle that prevents overfitting in machine learning.
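The loop described above fits in a dozen lines. A minimal sketch, assuming `implement` and `evaluate` wrap your coding agent and evaluator (all names and the result shape are illustrative):

```python
def run_pipeline(spec: str, implement, evaluate,
                 threshold: float = 0.9, max_attempts: int = 3) -> dict:
    """Spec -> implement -> evaluate -> auto-merge/retry loop (sketch)."""
    feedback = None  # failure details from the previous attempt, if any
    for attempt in range(1, max_attempts + 1):
        branch = implement(spec, feedback)   # coding agent: sees spec + failure text only
        result = evaluate(branch)            # evaluator: runs the hidden scenarios
        if result["pass_rate"] >= threshold:
            return {"merged": True, "attempt": attempt}
        feedback = result["failures"]        # pass back failures, never scenario source
    return {"merged": False, "attempt": max_attempts}
```

Note what crosses the boundary on a failed attempt: the failure report, not the scenario files. That is the train/test separation expressed in code.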

Holdout Scenarios: The Quality Gate

Holdout scenarios are the core innovation that makes spec-driven development work without human review. Think of them as acceptance tests that the coding agent never sees. They're written by a human (or a separate spec-writing agent), stored outside the codebase, and only accessible to the evaluator.

Strict separation:

text
VISIBLE TO CODING AGENT            HIDDEN (EVALUATOR ONLY)
specs/feature-auth.md              scenarios/auth-login.md
specs/feature-payments.md          scenarios/auth-edge-cases.md
AGENTS.md                          scenarios/payment-flow.md
Source code + tests                Evaluation harness

Scenarios are written in plain-English BDD format. They describe user behavior, not implementation details:

markdown
# Scenario: Password Reset — Happy Path

Given a registered user with email "alice@example.com"
When they POST /forgot-password with { "email": "alice@example.com" }
Then the response status is 200
And an email is sent to "alice@example.com" containing a reset link

When they extract the token from the reset link
And POST /reset-password with { "token": "<extracted>", "newPassword": "NewSecure12!" }
Then the response status is 200
And they can log in with the new password

# Scenario: Password Reset — Expired Token

Given a registered user requests a password reset
And 61 minutes have elapsed
When they POST /reset-password with the expired token
Then the response status is 410
And the body contains { "error": "token_expired" }

# Scenario: Password Reset — Information Leak Prevention

Given "unknown@example.com" is not a registered email
When they POST /forgot-password with { "email": "unknown@example.com" }
Then the response status is 200
And no email is sent
# Attacker cannot distinguish registered from unregistered emails

How Evaluation Works

The evaluator is a separate agent that:

  1. Reads the scenario file (never shared with the coding agent)
  2. Plans API calls and assertions to test each scenario
  3. Executes against an ephemeral deployment of the PR branch
  4. Runs each scenario 3 times — at least 2 of 3 must pass (handles non-determinism)
  5. Reports overall pass rate and per-scenario results

| Metric | Threshold | What it means |
| --- | --- | --- |
| Pass rate | ≥ 90% | Fraction of scenarios that pass (2 of 3 runs each) |
| False positive rate | < 5% | Scenarios that pass but shouldn't; detects evaluation bugs |
| Override rate | < 10% | How often a human overrules the evaluator; measures trust |

When your override rate drops below 10%, you've validated that the evaluator is trustworthy. That's when you enable auto-merge.
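The 2-of-3 rule and the 90% gate are simple to pin down precisely. A minimal sketch, assuming each scenario's runs arrive as booleans (names and shapes are illustrative):

```python
def scenario_passes(runs: list[bool], required: int = 2) -> bool:
    """A scenario passes when at least `required` of its runs pass (default 2 of 3)."""
    return sum(runs) >= required

def gate(scenarios: dict[str, list[bool]], threshold: float = 0.9) -> dict:
    """Aggregate per-scenario results into a merge decision."""
    passed = {name: scenario_passes(runs) for name, runs in scenarios.items()}
    pass_rate = sum(passed.values()) / len(passed)
    return {
        "pass_rate": pass_rate,
        "merge": pass_rate >= threshold,
        "failures": sorted(name for name, ok in passed.items() if not ok),
    }
```

The triple run exists because agent-driven evaluation is non-deterministic; a single flaky run should neither pass nor fail a scenario on its own.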

Setting Up Your File Structure

The directory layout enforces the separation between what the coding agent sees and what remains hidden:

text
your-project/
├── specs/                          # ← Visible to coding agent
│   ├── feature-password-reset.md
│   ├── feature-rate-limiting.md
│   └── bug-duplicate-webhooks.md
├── scenarios/                      # ← HIDDEN from coding agent
│   ├── password-reset-happy.md     #    Only the evaluator reads these
│   ├── password-reset-edge.md
│   ├── rate-limiting-burst.md
│   └── webhooks-idempotency.md
├── AGENTS.md                       # ← Agent context (from Part 2)
├── src/                            # ← Source code
│   └── ...
└── .github/
    └── workflows/
        └── evaluate.yml            # ← CI pipeline for evaluation

Add this rule to your AGENTS.md to enforce the separation:

markdown
## Spec-Driven Rules
- MUST read the spec file before starting implementation
- MUST NOT read, reference, or access the scenarios/ directory
- MUST NOT write tests that mirror scenario descriptions
- MUST run build + lint + existing tests before opening a PR
- MUST include the spec file path in the PR description

Writing Specs That Actually Work

Most spec failures come from three root causes. Here's how to avoid each one:

| Failure mode | Symptom | Fix |
| --- | --- | --- |
| Ambiguity | Agent builds something reasonable but wrong | Add explicit MUST/MUST NOT for every decision point |
| Missing context | Agent ignores existing patterns, duplicates code | Reference specific files: "Follow patterns in src/auth/login.ts" |
| Scope creep | Agent adds unrequested features, over-engineers | Add an explicit Non-Goals section |

The Spec Quality Checklist

Before handing a spec to an agent, verify:

Single responsibility

One feature or bug per spec — never bundle

Testable requirements

Every requirement has an observable, verifiable outcome

Explicit decisions

No implicit behavior — if the agent must choose, the spec must decide

File references

Point to existing patterns: "Follow src/models/user.ts"

Boundary conditions

Cover empty inputs, max values, concurrent access, error states

Non-goals listed

Prevent the agent from gold-plating with unrequested features

Digital Twins: Testing Without External Dependencies

At Level 4, agents run scenarios against real APIs. But real APIs have rate limits, cost money, and can't simulate failure modes on demand. The solution: digital twins — behavioral clones of external services that respond exactly like the real thing.

StrongDM's team pioneered this approach. They built digital twins of Okta, Jira, Slack, and Google services by feeding agents the full public API documentation and targeting 100% compatibility with official SDK client libraries.

Self-contained binaries

Each twin runs as a standalone binary — no external dependencies

State management

Twins maintain internal state across request sequences, like the real service

Failure simulation

Trigger rate limits, timeouts, auth failures, and partial outages on demand

Scale without cost

Run thousands of scenarios per hour with zero API bills or rate-limit hits

Building a digital twin is itself a spec-driven task. The spec is the service's API documentation. The scenarios are the official SDK test suites. When your twin passes 100% of the SDK's integration tests, it's production-ready.

markdown
# Spec: Stripe Digital Twin

## Goal
Behavioral clone of Stripe's Charges API for local scenario testing.

## Requirements
1. MUST implement POST /v1/charges (create)
2. MUST implement GET /v1/charges/:id (retrieve)
3. MUST implement POST /v1/refunds (refund)
4. MUST persist charges in-memory across requests
5. MUST validate API key format (sk_test_*)
6. MUST return identical response shapes to Stripe API docs
7. MUST simulate decline codes: card_declined, insufficient_funds, expired_card

## Compatibility Target
100% pass rate against stripe-node SDK test suite (charges module)

## Non-Goals
- Webhooks, subscriptions, or payment intents
- Persistent storage across process restarts

Building the Evaluation Pipeline

Here's a concrete CI pipeline that runs holdout scenarios against every PR. This is the infrastructure that replaces human review:

yaml
# .github/workflows/evaluate.yml
name: Scenario Evaluation

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Deploy ephemeral environment
        run: |
          # Deploy PR branch to a temporary environment
          # Returns the base URL for scenario testing
          echo "BASE_URL=https://pr-${{ github.event.number }}.preview.example.com" >> $GITHUB_ENV

      - name: Run holdout scenarios
        env:
          SCENARIOS_DIR: ./scenarios  # Hidden from coding agent
          BASE_URL: ${{ env.BASE_URL }}
          PASS_THRESHOLD: 0.9        # 90% pass rate required
          RUNS_PER_SCENARIO: 3       # Each scenario runs 3x
        run: |
          # The evaluator agent:
          # 1. Reads each scenario file
          # 2. Plans API calls to test the described behavior
          # 3. Executes against the ephemeral deployment
          # 4. Scores pass/fail for each scenario (2/3 must pass)
          # 5. Reports overall pass rate
          npx evaluate-scenarios \
            --scenarios $SCENARIOS_DIR \
            --base-url $BASE_URL \
            --threshold $PASS_THRESHOLD \
            --runs $RUNS_PER_SCENARIO \
            --output results.json

      - name: Gate merge decision
        env:
          GH_TOKEN: ${{ github.token }}  # the gh CLI needs a token in Actions
        run: |
          PASS_RATE=$(jq '.passRate' results.json)
          if (( $(echo "$PASS_RATE >= 0.9" | bc -l) )); then
            echo "✅ Pass rate: ${PASS_RATE} — auto-merge approved"
            gh pr merge ${{ github.event.number }} --auto --squash
          else
            echo "❌ Pass rate: ${PASS_RATE} — below threshold"
            jq '.failures[]' results.json  # Show which scenarios failed
            exit 1
          fi

The key elements: ephemeral deployment per PR, scenario isolation from the coding agent, triple execution for reliability, and a clear pass/fail threshold. Start with human review of evaluator results (Level 3). Enable auto-merge only after your override rate drops below 10% (Level 4).

The Level 3 → 4 Transition: When to Enable Auto-Merge

This is the most consequential decision in the dark factory journey. Auto-merging means no human looks at the code before it ships. You need earned confidence, not blind faith. Here's the progression:

Phase 1: Shadow Mode (2-4 weeks)

Run the evaluator on every PR, but still review manually. Compare your review decisions against the evaluator's gate decision. Track agreement rate.

Phase 2: Evaluator-Advised (2-4 weeks)

Only review PRs that the evaluator flags as failing. Let passing PRs sit for 24h, then review them in batches. Your override rate should be trending down.

Phase 3: Auto-Merge Low-Risk (ongoing)

Enable auto-merge for PRs that pass evaluation AND touch < 5 files AND don't modify auth/payment/infrastructure code. Keep human review for high-risk changes.

Phase 4: Full Auto-Merge

When override rate is < 10% for 30 consecutive days across all PR categories, enable full auto-merge. Monitor continuously.
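Phase 3's risk gate is easy to encode in CI. A sketch, assuming your pipeline exposes the evaluation result and the changed-file list; the sensitive-path prefixes are examples, not a prescription:

```python
SENSITIVE_PREFIXES = ("src/auth/", "src/payments/", "infra/")  # example risk list

def auto_merge_allowed(eval_passed: bool, changed_files: list[str],
                       max_files: int = 5) -> bool:
    """Phase 3 policy: auto-merge only low-risk PRs that pass evaluation."""
    if not eval_passed:
        return False
    if len(changed_files) >= max_files:      # "touch < 5 files"
        return False
    return not any(f.startswith(SENSITIVE_PREFIXES) for f in changed_files)
```

Anything that fails the policy falls back to human review; the policy only widens as the override rate earns it.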

Monitoring After Auto-Merge

Auto-merge doesn't mean autopilot. Track these metrics continuously:

text
# Daily health dashboard
┌─────────────────────┬──────────┬───────────┐
│ Metric              │ Target   │ Alert At  │
├─────────────────────┼──────────┼───────────┤
│ Scenario pass rate  │ ≥ 95%    │ < 90%     │
│ False positive rate │ < 3%     │ > 5%      │
│ Rollback rate       │ < 2%     │ > 5%      │
│ Mean time to merge  │ < 30min  │ > 2hr     │
│ Scenario coverage   │ ≥ 80%    │ < 70%     │
│ Agent retry rate    │ < 20%    │ > 40%     │
└─────────────────────┴──────────┴───────────┘
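Those thresholds are mechanical to check. A sketch of the alerting side using the dashboard's values; note that some metrics alert when they fall too low and others when they climb too high (metric names are illustrative):

```python
# Alert thresholds from the dashboard above. "min" fires when the value drops
# below the limit; "max" fires when it rises above it.
ALERTS = {
    "scenario_pass_rate": ("min", 0.90),
    "false_positive_rate": ("max", 0.05),
    "rollback_rate": ("max", 0.05),
    "mean_time_to_merge_min": ("max", 120),   # 2 hours, in minutes
    "scenario_coverage": ("min", 0.70),
    "agent_retry_rate": ("max", 0.40),
}

def firing_alerts(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics currently in alert state."""
    fired = []
    for name, (direction, limit) in ALERTS.items():
        value = metrics[name]
        if (direction == "min" and value < limit) or \
           (direction == "max" and value > limit):
            fired.append(name)
    return sorted(fired)
```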

Real-World Results

StrongDM's three-person team operates at Level 5, building Attractor — their open-source coding agent. Two founding rules: code must not be written by humans, and code must not be reviewed by humans. Their repository contains no hand-written code — just three NLSpec markdown files.

| Metric | Traditional (8 engineers) | Dark Factory (3 people) |
| --- | --- | --- |
| Effective output | 8 engineers | 25-30 engineer equivalent |
| Feedback loop | Hours to days | Minutes |
| Quality gate | Human review | Automated scenario eval |
| Bottleneck | Developer time | Spec quality |
| Token cost | $0/day | ~$1,000/day/engineer |

The cost is real — approximately $1,000/day/engineer in LLM tokens. But compare that to the fully loaded cost of 25-30 engineers. The economics work if your specs are good enough to keep the retry rate low.

Pitfalls and How to Avoid Them

Leaking scenarios to the coding agent

If the coding agent can read scenario files, it will overfit — writing code that passes your specific scenarios but fails on real-world inputs. Enforce directory-level access controls.
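Access controls vary by agent runtime, but a cheap backstop is a CI check that fails the build if anything in the agent-visible tree mentions the scenarios directory. A sketch, using the example layout from earlier (the function name is illustrative):

```python
import pathlib

def find_scenario_leaks(root: str = "src", needle: str = "scenarios/") -> list[str]:
    """Return files under `root` that reference the hidden scenarios directory."""
    leaks = []
    for path in pathlib.Path(root).rglob("*"):
        if path.is_file():
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binaries and unreadable files
            if needle in text:
                leaks.append(str(path))
    return sorted(leaks)
```

Run it over src/ and the agent's test directories in CI; any nonzero result should fail the build before evaluation even starts.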

Monolithic specs

A 500-line spec covering five features will confuse the agent. One spec = one feature = one PR. Split ruthlessly.

Premature auto-merge

Enabling auto-merge before you trust the evaluator is the fastest way to ship broken code. Shadow mode first, always.

Scenario rot

Scenarios must evolve with the product. If you add a feature but don't add scenarios, your quality gate has a blind spot. Treat scenario coverage like test coverage.

Underspecified edge cases

The agent will take the path of least resistance. If your spec doesn't mention what happens when the database is down, the agent won't handle it. Be paranoid about failure modes.

Your Two-Week Action Plan

You've been at Level 2 since Part 2. Here's how to reach Level 3 in two weeks, with a clear path to Level 4:

| Day | Task | Outcome |
| --- | --- | --- |
| 1-2 | Write your first feature spec (use the template above) | One spec file in specs/ |
| 3 | Write 3-5 holdout scenarios for that feature | Scenarios in scenarios/ directory |
| 4-5 | Hand the spec to the agent; review output against the spec (not line-by-line) | First spec-driven PR |
| 6-7 | Run scenarios manually against the PR branch | Manual evaluation pass/fail |
| 8-9 | Write specs + scenarios for a second feature | Growing spec library |
| 10 | Set up the CI evaluation pipeline | Automated scenario runs on PRs |
| 11-12 | Run 3+ spec-driven PRs through the pipeline in shadow mode | Evaluator agreement data |
| 13-14 | Review override rate, tune scenarios, assess readiness for Phase 2 | Level 3 operational |

Readiness Signals for Level 4

You're ready to enable auto-merge when:

  1. Override rate < 10% — you almost never disagree with the evaluator
  2. Scenario coverage ≥ 80% — most features have holdout scenarios
  3. Rollback rate < 5% — merged code rarely needs to be reverted
  4. Spec iteration < 3 — specs rarely need more than two rewrites before the agent succeeds
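If you already track these numbers, the readiness check is a single boolean. A sketch with the thresholds above (the signature is illustrative):

```python
def ready_for_auto_merge(override_rate: float, scenario_coverage: float,
                         rollback_rate: float, avg_spec_iterations: float) -> bool:
    """Level 4 readiness: all four signals must hold simultaneously."""
    return (override_rate < 0.10
            and scenario_coverage >= 0.80
            and rollback_rate < 0.05
            and avg_spec_iterations < 3)
```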
