The Dark Factory Pattern Part 3: Spec-Driven Development

Master spec-driven development: write precise specifications, implement holdout scenarios as quality gates, build evaluation pipelines, and transition from human code review to automated merge gates.

The Dark Factory Series

In Part 1, we mapped the six levels from manual coding to the dark factory. In Part 2, you set up your coding agent, wrote a production-grade AGENTS.md, and learned to decompose tasks. Now it's time for the hardest transition in the entire journey: stop reviewing code line-by-line, and start writing specifications that let machines validate themselves.

This is Level 3 → Level 4 — the leap from human-in-the-loop to spec-driven development. By the end of this guide, you'll have a working specification pipeline, holdout scenarios that gate quality without human review, and the confidence to let agents merge their own code.

Why Specifications Are the New Source Code

At Level 2, you pair-program with AI. You watch every diff. You catch mistakes in real time. It works — but it doesn't scale. The moment you try to run multiple agents in parallel, you become the bottleneck. Every PR sits in your review queue. Every context switch costs you 20 minutes.

Spec-driven development flips the model: you describe what should exist, and agents make it exist. Your job shifts from writing and reviewing code to writing specifications and validating outcomes. This isn't vibe coding — it's the opposite. Specifications are precise, testable, and version-controlled. They're the new source of truth.

Level 2

You write prompts, review every line of output, approve each change manually

Level 3

You write specs, review diffs against spec requirements, approve merges

Level 4

You write specs, holdout scenarios validate automatically, agents auto-merge

The industry is converging on this. GitHub shipped Spec Kit — an open-source toolkit for spec-driven workflows adopted by 75,000+ projects. AWS built Kiro, an entire IDE around spec-first development. Thoughtworks flagged SDD as a key engineering practice for 2025. This isn't a fad — it's the natural evolution of AI-assisted development.

Anatomy of a Feature Specification

A specification isn't a user story. It isn't a Jira ticket with three bullet points. A spec is a complete, unambiguous description of what must be true after the agent finishes. If the agent can misinterpret it, it's not specific enough.

Anatomy of a feature spec:

  1. Goal: what the feature achieves (one sentence)
  2. Context: relevant architecture, existing code, dependencies
  3. Requirements: numbered list of behaviors (MUST / MUST NOT)
  4. Constraints: performance budgets, security rules, compatibility
  5. Edge Cases: boundary conditions, error states, concurrency
  6. Non-Goals: explicitly out of scope, to prevent over-engineering

Here's a real example. Compare the two approaches:

Bad: vague ticket

text
Add password reset functionality.
Users should be able to reset
their password via email.

Good: precise spec

markdown
# Feature: Password Reset

## Goal
Allow users to securely reset
their password via email link.

## Requirements
1. POST /forgot-password accepts
   { email: string }
2. MUST generate a cryptographically
   random token (≥32 bytes, base64url)
3. Token MUST expire after 1 hour
4. Token MUST be single-use
5. POST /reset-password accepts
   { token, newPassword }
6. Password MUST meet policy:
   ≥12 chars, 1 upper, 1 digit

## Constraints
- Rate limit: 3 requests/email/hour
- MUST NOT reveal if email exists
- MUST use constant-time comparison

## Edge Cases
- Expired token → 410 Gone
- Used token → 410 Gone
- Invalid email format → 400
- Valid format, unknown email → 200
  (silent, no information leak)

## Non-Goals
- OAuth/SSO password reset
- Admin-initiated password reset
- Password history enforcement

The vague ticket leaves the agent to invent token lifetimes, security constraints, error handling, and rate limiting. The spec makes every decision explicit. The agent's job is execution, not judgment.
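A spec written this way translates almost mechanically into code. For instance, requirement 2 reduces to a couple of unambiguous lines; a sketch in Python (the spec itself is language-agnostic, and the function name is illustrative):

```python
import secrets

def generate_reset_token(n_bytes: int = 32) -> str:
    """Requirement 2: cryptographically random token, >=32 bytes, base64url-encoded."""
    # token_urlsafe draws from the OS CSPRNG and base64url-encodes the bytes
    return secrets.token_urlsafe(n_bytes)
```

Every other requirement is similarly checkable: expiry, single use, and the password policy all have observable pass/fail outcomes, which is exactly what an automated evaluator needs.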

Two Spec Types: Features vs. Bugs

Not all work is the same, and specs should reflect that. The dark factory uses two distinct spec formats:

Feature Spec

Describes what to build: goal, requirements, constraints, edge cases.

markdown
# Feature: Rate Limiting

## Goal
Protect API endpoints from abuse.

## Requirements
1. MUST use sliding window
2. Default: 100 req/min/key
3. MUST return 429 + Retry-After
...

Bug Spec

States only the symptom. Forces the agent to investigate — no solution bias.

markdown
# Bug: Duplicate Webhook Events

## Symptom
Customers report receiving the same
webhook event 2-3 times within a
60-second window.

## Reproduction
POST /webhooks/test with valid
payload → observe delivery logs.

## DO NOT prescribe the fix.

This distinction matters. Feature specs are prescriptive — they tell the agent exactly what to build. Bug specs are diagnostic — they tell the agent what's wrong and let it investigate. Prescribing a fix in a bug spec introduces solution bias that often misses the root cause.

The Spec-Driven Pipeline

Here's the full pipeline from spec to shipped code. This is the architecture that replaces human code review.

The spec-driven pipeline:

text
SPECIFICATION         CODING AGENT        EVALUATOR            MERGE / SHIP
(human writes)        (agent executes)    (agent validates)    (automated)
Feature spec      →   Read spec       →   Run scenarios    →   Auto-merge
Requirements          Implement           Score pass %         Deploy
Constraints           Self-test           Gate decision
Edge cases                                Report
                          ↑                    │
                          └─── FAIL → RETRY ───┘

The flow is simple: you write the spec → the agent implements it → a separate evaluator runs holdout scenarios → if ≥90% pass, the code auto-merges. If evaluation fails, the agent receives failure details and retries — without ever seeing the scenario source.

The critical insight: the coding agent and the evaluator are separate. The coding agent never sees the scenarios. This enforces a strict train/test separation — the same principle that prevents overfitting in machine learning.
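The loop described above fits in a dozen lines. A minimal sketch, assuming `implement` and `evaluate` wrap your coding agent and evaluator (all names and the result shape are illustrative):

```python
def run_pipeline(spec: str, implement, evaluate,
                 threshold: float = 0.9, max_attempts: int = 3) -> dict:
    """Spec -> implement -> evaluate -> auto-merge/retry loop (sketch)."""
    feedback = None  # failure details from the previous attempt, if any
    for attempt in range(1, max_attempts + 1):
        branch = implement(spec, feedback)   # coding agent: sees spec + failure text only
        result = evaluate(branch)            # evaluator: runs the hidden scenarios
        if result["pass_rate"] >= threshold:
            return {"merged": True, "attempt": attempt}
        feedback = result["failures"]        # pass back failures, never scenario source
    return {"merged": False, "attempt": max_attempts}
```

Note what crosses the boundary on a failed attempt: the failure report, not the scenario files. That is the train/test separation expressed in code.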

Holdout Scenarios: The Quality Gate

Holdout scenarios are the core innovation that makes spec-driven development work without human review. Think of them as acceptance tests that the coding agent never sees. They're written by a human (or a separate spec-writing agent), stored outside the codebase, and only accessible to the evaluator.

Strict separation:

text
VISIBLE TO CODING AGENT            HIDDEN (EVALUATOR ONLY)
specs/feature-auth.md              scenarios/auth-login.md
specs/feature-payments.md          scenarios/auth-edge-cases.md
AGENTS.md                          scenarios/payment-flow.md
Source code + tests                Evaluation harness

Scenarios are written in plain-English BDD format. They describe user behavior, not implementation details:

markdown
# Scenario: Password Reset — Happy Path

Given a registered user with email "alice@example.com"
When they POST /forgot-password with { "email": "alice@example.com" }
Then the response status is 200
And an email is sent to "alice@example.com" containing a reset link

When they extract the token from the reset link
And POST /reset-password with { "token": "<extracted>", "newPassword": "NewSecure12!" }
Then the response status is 200
And they can log in with the new password

# Scenario: Password Reset — Expired Token

Given a registered user requests a password reset
And 61 minutes have elapsed
When they POST /reset-password with the expired token
Then the response status is 410
And the body contains { "error": "token_expired" }

# Scenario: Password Reset — Information Leak Prevention

Given "unknown@example.com" is not a registered email
When they POST /forgot-password with { "email": "unknown@example.com" }
Then the response status is 200
And no email is sent
# Attacker cannot distinguish registered from unregistered emails

How Evaluation Works

The evaluator is a separate agent that:

  1. Reads the scenario file (never shared with the coding agent)
  2. Plans API calls and assertions to test each scenario
  3. Executes against an ephemeral deployment of the PR branch
  4. Runs each scenario 3 times — at least 2 of 3 must pass (handles non-determinism)
  5. Reports overall pass rate and per-scenario results

| Metric | Threshold | What it means |
| --- | --- | --- |
| Pass rate | ≥ 90% | Fraction of scenarios that pass (2 of 3 runs each) |
| False positive rate | < 5% | Scenarios that pass but shouldn't; detects evaluation bugs |
| Override rate | < 10% | How often a human overrules the evaluator; measures trust |

When your override rate drops below 10%, you've validated that the evaluator is trustworthy. That's when you enable auto-merge.
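The 2-of-3 rule and the 90% gate are simple to pin down precisely. A minimal sketch, assuming each scenario's runs arrive as booleans (names and shapes are illustrative):

```python
def scenario_passes(runs: list[bool], required: int = 2) -> bool:
    """A scenario passes when at least `required` of its runs pass (default 2 of 3)."""
    return sum(runs) >= required

def gate(scenarios: dict[str, list[bool]], threshold: float = 0.9) -> dict:
    """Aggregate per-scenario results into a merge decision."""
    passed = {name: scenario_passes(runs) for name, runs in scenarios.items()}
    pass_rate = sum(passed.values()) / len(passed)
    return {
        "pass_rate": pass_rate,
        "merge": pass_rate >= threshold,
        "failures": sorted(name for name, ok in passed.items() if not ok),
    }
```

The triple run exists because agent-driven evaluation is non-deterministic; a single flaky run should neither pass nor fail a scenario on its own.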

Setting Up Your File Structure

The directory layout enforces the separation between what the coding agent sees and what remains hidden:

text
your-project/
├── specs/                          # ← Visible to coding agent
│   ├── feature-password-reset.md
│   ├── feature-rate-limiting.md
│   └── bug-duplicate-webhooks.md
├── scenarios/                      # ← HIDDEN from coding agent
│   ├── password-reset-happy.md     #    Only the evaluator reads these
│   ├── password-reset-edge.md
│   ├── rate-limiting-burst.md
│   └── webhooks-idempotency.md
├── AGENTS.md                       # ← Agent context (from Part 2)
├── src/                            # ← Source code
│   └── ...
└── .github/
    └── workflows/
        └── evaluate.yml            # ← CI pipeline for evaluation

Add this rule to your AGENTS.md to enforce the separation:

markdown
## Spec-Driven Rules
- MUST read the spec file before starting implementation
- MUST NOT read, reference, or access the scenarios/ directory
- MUST NOT write tests that mirror scenario descriptions
- MUST run build + lint + existing tests before opening a PR
- MUST include the spec file path in the PR description

Writing Specs That Actually Work

Most spec failures come from three root causes. Here's how to avoid each one:

| Failure mode | Symptom | Fix |
| --- | --- | --- |
| Ambiguity | Agent builds something reasonable but wrong | Add explicit MUST/MUST NOT for every decision point |
| Missing context | Agent ignores existing patterns, duplicates code | Reference specific files: "Follow patterns in src/auth/login.ts" |
| Scope creep | Agent adds unrequested features, over-engineers | Add an explicit Non-Goals section |

The Spec Quality Checklist

Before handing a spec to an agent, verify:

Single responsibility

One feature or bug per spec — never bundle

Testable requirements

Every requirement has an observable, verifiable outcome

Explicit decisions

No implicit behavior — if the agent must choose, the spec must decide

File references

Point to existing patterns: "Follow src/models/user.ts"

Boundary conditions

Cover empty inputs, max values, concurrent access, error states

Non-goals listed

Prevent the agent from gold-plating with unrequested features

Digital Twins: Testing Without External Dependencies

At Level 4, agents run scenarios against real APIs. But real APIs have rate limits, cost money, and can't simulate failure modes on demand. The solution: digital twins — behavioral clones of external services that respond exactly like the real thing.

StrongDM's team pioneered this approach. They built digital twins of Okta, Jira, Slack, and Google services by feeding agents the full public API documentation and targeting 100% compatibility with official SDK client libraries.

Self-contained binaries

Each twin runs as a standalone binary — no external dependencies

State management

Twins maintain internal state across request sequences, like the real service

Failure simulation

Trigger rate limits, timeouts, auth failures, and partial outages on demand

Scale without cost

Run thousands of scenarios per hour with zero API bills or rate-limit hits

Building a digital twin is itself a spec-driven task. The spec is the service's API documentation. The scenarios are the official SDK test suites. When your twin passes 100% of the SDK's integration tests, it's production-ready.

markdown
# Spec: Stripe Digital Twin

## Goal
Behavioral clone of Stripe's Charges API for local scenario testing.

## Requirements
1. MUST implement POST /v1/charges (create)
2. MUST implement GET /v1/charges/:id (retrieve)
3. MUST implement POST /v1/refunds (refund)
4. MUST persist charges in-memory across requests
5. MUST validate API key format (sk_test_*)
6. MUST return identical response shapes to Stripe API docs
7. MUST simulate decline codes: card_declined, insufficient_funds, expired_card

## Compatibility Target
100% pass rate against stripe-node SDK test suite (charges module)

## Non-Goals
- Webhooks, subscriptions, or payment intents
- Persistent storage across process restarts

Building the Evaluation Pipeline

Here's a concrete CI pipeline that runs holdout scenarios against every PR. This is the infrastructure that replaces human review:

yaml
# .github/workflows/evaluate.yml
name: Scenario Evaluation

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Deploy ephemeral environment
        run: |
          # Deploy PR branch to a temporary environment
          # Returns the base URL for scenario testing
          echo "BASE_URL=https://pr-${{ github.event.number }}.preview.example.com" >> $GITHUB_ENV

      - name: Run holdout scenarios
        env:
          SCENARIOS_DIR: ./scenarios  # Hidden from coding agent
          BASE_URL: ${{ env.BASE_URL }}
          PASS_THRESHOLD: 0.9        # 90% pass rate required
          RUNS_PER_SCENARIO: 3       # Each scenario runs 3x
        run: |
          # The evaluator agent:
          # 1. Reads each scenario file
          # 2. Plans API calls to test the described behavior
          # 3. Executes against the ephemeral deployment
          # 4. Scores pass/fail for each scenario (2/3 must pass)
          # 5. Reports overall pass rate
          npx evaluate-scenarios \
            --scenarios $SCENARIOS_DIR \
            --base-url $BASE_URL \
            --threshold $PASS_THRESHOLD \
            --runs $RUNS_PER_SCENARIO \
            --output results.json

      - name: Gate merge decision
        env:
          GH_TOKEN: ${{ github.token }}  # the gh CLI needs a token in Actions
        run: |
          PASS_RATE=$(jq '.passRate' results.json)
          if (( $(echo "$PASS_RATE >= 0.9" | bc -l) )); then
            echo "✅ Pass rate: ${PASS_RATE} — auto-merge approved"
            gh pr merge ${{ github.event.number }} --auto --squash
          else
            echo "❌ Pass rate: ${PASS_RATE} — below threshold"
            jq '.failures[]' results.json  # Show which scenarios failed
            exit 1
          fi

The key elements: ephemeral deployment per PR, scenario isolation from the coding agent, triple execution for reliability, and a clear pass/fail threshold. Start with human review of evaluator results (Level 3). Enable auto-merge only after your override rate drops below 10% (Level 4).

The Level 3 → 4 Transition: When to Enable Auto-Merge

This is the most consequential decision in the dark factory journey. Auto-merging means no human looks at the code before it ships. You need earned confidence, not blind faith. Here's the progression:

Phase 1: Shadow Mode (2-4 weeks)

Run the evaluator on every PR, but still review manually. Compare your review decisions against the evaluator's gate decision. Track agreement rate.

Phase 2: Evaluator-Advised (2-4 weeks)

Only review PRs that the evaluator flags as failing. Let passing PRs sit for 24h, then review them in batches. Your override rate should be trending down.

Phase 3: Auto-Merge Low-Risk (ongoing)

Enable auto-merge for PRs that pass evaluation AND touch < 5 files AND don't modify auth/payment/infrastructure code. Keep human review for high-risk changes.

Phase 4: Full Auto-Merge

When override rate is < 10% for 30 consecutive days across all PR categories, enable full auto-merge. Monitor continuously.
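Phase 3's risk gate is easy to encode in CI. A sketch, assuming your pipeline exposes the evaluation result and the changed-file list; the sensitive-path prefixes are examples, not a prescription:

```python
SENSITIVE_PREFIXES = ("src/auth/", "src/payments/", "infra/")  # example risk list

def auto_merge_allowed(eval_passed: bool, changed_files: list[str],
                       max_files: int = 5) -> bool:
    """Phase 3 policy: auto-merge only low-risk PRs that pass evaluation."""
    if not eval_passed:
        return False
    if len(changed_files) >= max_files:      # "touch < 5 files"
        return False
    return not any(f.startswith(SENSITIVE_PREFIXES) for f in changed_files)
```

Anything that fails the policy falls back to human review; the policy only widens as the override rate earns it.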

Monitoring After Auto-Merge

Auto-merge doesn't mean autopilot. Track these metrics continuously:

text
# Daily health dashboard
┌─────────────────────┬──────────┬───────────┐
│ Metric              │ Target   │ Alert At  │
├─────────────────────┼──────────┼───────────┤
│ Scenario pass rate  │ ≥ 95%    │ < 90%     │
│ False positive rate │ < 3%     │ > 5%      │
│ Rollback rate       │ < 2%     │ > 5%      │
│ Mean time to merge  │ < 30min  │ > 2hr     │
│ Scenario coverage   │ ≥ 80%    │ < 70%     │
│ Agent retry rate    │ < 20%    │ > 40%     │
└─────────────────────┴──────────┴───────────┘
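Those thresholds are mechanical to check. A sketch of the alerting side using the dashboard's values; note that some metrics alert when they fall too low and others when they climb too high (metric names are illustrative):

```python
# Alert thresholds from the dashboard above. "min" fires when the value drops
# below the limit; "max" fires when it rises above it.
ALERTS = {
    "scenario_pass_rate": ("min", 0.90),
    "false_positive_rate": ("max", 0.05),
    "rollback_rate": ("max", 0.05),
    "mean_time_to_merge_min": ("max", 120),   # 2 hours, in minutes
    "scenario_coverage": ("min", 0.70),
    "agent_retry_rate": ("max", 0.40),
}

def firing_alerts(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics currently in alert state."""
    fired = []
    for name, (direction, limit) in ALERTS.items():
        value = metrics[name]
        if (direction == "min" and value < limit) or \
           (direction == "max" and value > limit):
            fired.append(name)
    return sorted(fired)
```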

Real-World Results

StrongDM's three-person team operates at Level 5, building Attractor — their open-source coding agent. Two founding rules: code must not be written by humans, and code must not be reviewed by humans. Their repository contains no hand-written code — just three NLSpec markdown files.

| Metric | Traditional (8 engineers) | Dark Factory (3 people) |
| --- | --- | --- |
| Effective output | 8 engineers | 25-30 engineer equivalent |
| Feedback loop | Hours to days | Minutes |
| Quality gate | Human review | Automated scenario eval |
| Bottleneck | Developer time | Spec quality |
| Token cost | $0/day | ~$1,000/day/engineer |

The cost is real — approximately $1,000/day/engineer in LLM tokens. But compare that to the fully loaded cost of 25-30 engineers. The economics work if your specs are good enough to keep the retry rate low.

Pitfalls and How to Avoid Them

Leaking scenarios to the coding agent

If the coding agent can read scenario files, it will overfit — writing code that passes your specific scenarios but fails on real-world inputs. Enforce directory-level access controls.
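Access controls vary by agent runtime, but a cheap backstop is a CI check that fails the build if anything in the agent-visible tree mentions the scenarios directory. A sketch, using the example layout from earlier (the function name is illustrative):

```python
import pathlib

def find_scenario_leaks(root: str = "src", needle: str = "scenarios/") -> list[str]:
    """Return files under `root` that reference the hidden scenarios directory."""
    leaks = []
    for path in pathlib.Path(root).rglob("*"):
        if path.is_file():
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binaries and unreadable files
            if needle in text:
                leaks.append(str(path))
    return sorted(leaks)
```

Run it over src/ and the agent's test directories in CI; any nonzero result should fail the build before evaluation even starts.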

Monolithic specs

A 500-line spec covering five features will confuse the agent. One spec = one feature = one PR. Split ruthlessly.

Premature auto-merge

Enabling auto-merge before you trust the evaluator is the fastest way to ship broken code. Shadow mode first, always.

Scenario rot

Scenarios must evolve with the product. If you add a feature but don't add scenarios, your quality gate has a blind spot. Treat scenario coverage like test coverage.

Underspecified edge cases

The agent will take the path of least resistance. If your spec doesn't mention what happens when the database is down, the agent won't handle it. Be paranoid about failure modes.

Your Two-Week Action Plan

You've been at Level 2 since Part 2. Here's how to reach Level 3 in two weeks, with a clear path to Level 4:

| Day | Task | Outcome |
| --- | --- | --- |
| 1-2 | Write your first feature spec (use the template above) | One spec file in specs/ |
| 3 | Write 3-5 holdout scenarios for that feature | Scenarios in scenarios/ directory |
| 4-5 | Hand the spec to the agent; review output against the spec (not line-by-line) | First spec-driven PR |
| 6-7 | Run scenarios manually against the PR branch | Manual evaluation pass/fail |
| 8-9 | Write specs + scenarios for a second feature | Growing spec library |
| 10 | Set up the CI evaluation pipeline | Automated scenario runs on PRs |
| 11-12 | Run 3+ spec-driven PRs through the pipeline in shadow mode | Evaluator agreement data |
| 13-14 | Review override rate, tune scenarios, assess readiness for Phase 2 | Level 3 operational |

Readiness Signals for Level 4

You're ready to enable auto-merge when:

  1. Override rate < 10% — you almost never disagree with the evaluator
  2. Scenario coverage ≥ 80% — most features have holdout scenarios
  3. Rollback rate < 5% — merged code rarely needs to be reverted
  4. Spec iteration < 3 — specs rarely need more than two rewrites before the agent succeeds
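If you already track these numbers, the readiness check is a single boolean. A sketch with the thresholds above (the signature is illustrative):

```python
def ready_for_auto_merge(override_rate: float, scenario_coverage: float,
                         rollback_rate: float, avg_spec_iterations: float) -> bool:
    """Level 4 readiness: all four signals must hold simultaneously."""
    return (override_rate < 0.10
            and scenario_coverage >= 0.80
            and rollback_rate < 0.05
            and avg_spec_iterations < 3)
```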
