
The Dark Factory Pattern Part 4: Scaling the Factory

Scale your dark factory with multi-agent orchestration, cost control, and production observability. Covers git worktrees, Agent Teams, model routing, budget gates, circuit breakers, and the economics of autonomous development.


A single agent executing a single spec is a proof of concept. A factory is many agents running many specs in parallel, continuously, with predictable cost and quality. That's the gap between Level 4 and Level 5 — and most teams stall here. They have spec-driven development working for one feature at a time, but the moment they spin up three agents simultaneously, costs spike, merge conflicts multiply, and nobody knows which agent broke the build.

This guide covers the three systems you need to run a dark factory at scale: multi-agent orchestration (running agents in parallel without collisions), cost control (keeping your token bill under control as throughput grows), and observability (knowing what your factory is actually doing). By the end, you'll have a production-grade setup where you can push five specs in the morning and review five PRs after lunch — or let them auto-merge entirely.

[Figure: factory architecture. A human orchestrator (specs, priorities, decisions) feeds three agents, each in its own worktree (feature/auth-refactor, fix/webhook-retry, feat/dashboard-charts) running spec → implement → validate → PR. Two supporting systems span all agents: observability (merge rate, rollback rate, token spend, spec→merge latency, failure modes) and cost control (model routing, prompt caching, batching, $/feature, budget gates, spend limits).]

Why Does the Single-Agent Model Break Down?

In Part 3, you built a pipeline: write a spec, hand it to an agent, agent implements and validates, you review the PR (or auto-merge). It works beautifully — for one spec at a time. But real codebases don't have one task. They have a backlog of 20-50 items at any moment. Running them sequentially means your factory has a throughput of maybe 3-4 features per day per human.

The fix seems obvious: run multiple agents. But naive parallelism introduces three failure modes:

File Collisions

Two agents edit the same file. One overwrites the other. The merged result has subtle bugs that neither agent's tests catch because each only validated its own changes.

Context Explosion

Each agent loads the full codebase context. Three parallel Opus sessions burn through tokens 3x faster. Without model routing, a team of five engineers each running parallel agents can spend $1,000+/day.

Invisible Failures

An agent loops for 45 minutes on a failing test, burning tokens. Another silently produces code that passes its holdout scenarios but breaks an unrelated feature. You don't find out until production.

Google's 2025 DORA Report found that near-universal AI adoption (around 90% of developers) coincided with a 9% increase in bug rates, a 91% increase in code review time, and a 154% increase in PR size. These aren't problems with AI coding; they're problems with unstructured parallelism. The factory pattern solves them with isolation, routing, and monitoring.

How Do You Run Multiple Agents Without Collisions?

The answer is git worktrees — separate working directories that share the same repository history but have independent file systems and branches. Each agent gets its own worktree, works on its own branch, and never touches another agent's files. Merges happen through PRs, not through shared filesystem access.
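The plain-git version of this isolation model is worth seeing once, since it's what the tooling automates. A minimal sketch using a throwaway repo (names and paths are illustrative):

```bash
# One worktree + branch per agent: isolated files, shared history.
# Throwaway repo so this runs anywhere; names are hypothetical.
cd "$(mktemp -d)"
git init -q factory && cd factory
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "init"

# Each agent gets its own directory and branch:
git worktree add -q ../agent-auth -b feature-auth
git worktree add -q ../agent-webhooks -b fix-webhooks

git worktree list   # main checkout plus one line per agent
```

Because every worktree shares one object store, branches created in one are visible to all, and merges flow back through PRs rather than through the filesystem.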

Option 1: Manual Worktrees (DIY)

The simplest approach. Spin up Claude Code in separate worktrees using the --worktree flag:

```bash
# Terminal 1 — agent works on auth refactor
claude --worktree feature-auth

# Terminal 2 — agent works on webhook fix
claude --worktree fix-webhooks

# Terminal 3 — agent works on dashboard
claude --worktree feat-dashboard

# Each gets its own branch and working directory at:
# .claude/worktrees/feature-auth/
# .claude/worktrees/fix-webhooks/
# .claude/worktrees/feat-dashboard/
```

This is the pattern most developers start with. You're the orchestrator — you decide which specs go to which agent, monitor progress manually, and handle merges. It works well for 2-3 parallel agents.

Option 2: Agent Teams (Coordinated)

For 3-5+ parallel agents, Claude Code's Agent Teams feature adds coordination on top of worktree isolation. One session acts as the team lead, spawning teammates that communicate through a shared task list and messaging system.

```bash
# Enable agent teams (experimental)
# Add to .claude/settings.json:
# { "env": { "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1" } }

# Then tell Claude what you want:
> Create an agent team to implement these three specs in parallel.
> Each teammate should work in its own worktree.
> Use Sonnet for implementation, require plan approval before coding.
```
| Approach | Best For | Coordination | Token Cost |
|---|---|---|---|
| Manual worktrees | 2-3 independent tasks | You manage everything | Lowest |
| Subagents | Focused sub-tasks | Results report back | Low |
| Agent Teams | 3-5 collaborating tasks | Shared task list + messaging | ~7x single session |
| Headless automation | CI/CD pipeline integration | Script-driven | Varies |

The Decomposition Rule

Parallelism only works when tasks are truly independent. The decomposition rule: if two specs touch the same file, they must run sequentially. Your spec backlog should explicitly declare file ownership. When two specs overlap, the orchestrator (you, or the team lead agent) sequences them.

```markdown
# Spec: Rate Limiting Middleware
files_owned: [src/middleware/rate-limit.ts, src/middleware/index.ts]
depends_on: []

# Spec: Auth Token Rotation
files_owned: [src/auth/token.ts, src/auth/refresh.ts]
depends_on: []

# Spec: Auth Audit Logging  ← depends on token rotation
files_owned: [src/auth/audit.ts, src/auth/token.ts]
depends_on: [auth-token-rotation]
# ↑ Can't run in parallel with token rotation — they share token.ts
```
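The ownership declarations above can be checked mechanically before dispatching specs. A minimal sketch, assuming the one-line `files_owned: [...]` format shown (the spec filenames are hypothetical), that lists every file claimed by more than one spec:

```bash
# Throwaway specs reproducing the example above.
cd "$(mktemp -d)"
printf 'files_owned: [src/middleware/rate-limit.ts, src/middleware/index.ts]\n' > rate-limiting.md
printf 'files_owned: [src/auth/token.ts, src/auth/refresh.ts]\n' > token-rotation.md
printf 'files_owned: [src/auth/audit.ts, src/auth/token.ts]\n' > audit-logging.md

# Flatten every files_owned list to one path per line; any duplicate
# is a file that forces sequential execution.
conflicts=$(grep -h '^files_owned:' *.md \
  | sed 's/^files_owned: *\[//; s/\]//' \
  | tr ',' '\n' | sed 's/^ *//' \
  | sort | uniq -d)
echo "$conflicts"   # → src/auth/token.ts
```

Run this as a pre-dispatch gate: an empty result means the batch is safe to parallelize; any output means the orchestrator must sequence those specs.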

How Do You Control Costs at Scale?

The average Claude Code developer spends ~$6/day. A dark factory running five parallel agents on Opus can burn $50-100/day without optimization. At team scale (10 engineers, each running 3-5 agents), that's $1,500-3,000/day. Unoptimized, the economics don't work. Optimized, they're transformative — the equivalent of hiring 5-10 additional engineers for a fraction of the cost.

[Chart: estimated monthly cost per developer running 3 parallel agents. Naive (all Opus): $1,000; + prompt caching: $600; + model routing: $350; + batching: $200; optimized factory: $150.]

Strategy 1: Model Routing

Not every task needs Opus. Roughly 70% of coding tasks perform equally well on Sonnet, which costs 40% less ($3/$15 per MTok vs $5/$25). Route Opus to complex architectural decisions, multi-file refactors, and subtle bug fixes. Route Sonnet (or even Haiku for trivial tasks) to implementations with clear specs, test writing, and documentation.

```text
# In your agent team prompt:
> Spawn implementation teammates using Sonnet.
> Use Opus only for the architect role that reviews plans.

# Or per-subagent in .claude/agents/:
---
model: sonnet
---
You are a test-writing agent. Given a spec, write comprehensive tests.

# For headless automation:
claude --model sonnet -p "Implement the rate-limiting spec"
```

Strategy 2: Prompt Caching

Prompt caching reduces repeated context costs by up to 90%. When every agent loads the same CLAUDE.md, AGENTS.md, and project structure, those tokens are cached after the first request. The 5-minute cache window means parallel agents running simultaneously all benefit — a cache hit costs just 10% of the standard input price.

Strategy 3: Context Discipline

The biggest hidden cost isn't the model — it's bloated context. Each unnecessary file read adds thousands of tokens that compound across every subsequent message. Factory-grade context discipline:

Scoped CLAUDE.md

Keep the root CLAUDE.md under 500 lines. Move workflow-specific instructions into skills that load on-demand.

Subagent delegation

Delegate test runs and log analysis to subagents. Verbose output stays in the subagent's context — only a summary returns.

Aggressive /clear

Clear context between unrelated tasks within the same session. Stale context wastes tokens on every subsequent message.

Minimal MCP servers

Each MCP server adds tool definitions to context even when idle. Disable servers you're not actively using.
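The first of these disciplines is easy to enforce automatically. A sketch that flags any CLAUDE.md over the 500-line budget; the file layout here is a hypothetical stand-in for your repo:

```bash
# Demo layout: one oversized CLAUDE.md, one within budget.
cd "$(mktemp -d)"
mkdir -p app/api
seq 1 600 > app/api/CLAUDE.md       # 600 lines: candidate for skills
printf 'Keep it short.\n' > CLAUDE.md

# List every CLAUDE.md exceeding the 500-line budget.
over=$(find . -name CLAUDE.md -exec sh -c \
  '[ "$(wc -l < "$1")" -gt 500 ] && echo "$1"' _ {} \;)
echo "$over"   # → ./app/api/CLAUDE.md
```

Wire this into CI or a pre-commit hook so context bloat gets caught before it compounds across every agent session.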

Strategy 4: Budget Gates

Set hard limits so a runaway agent can't burn your monthly budget in a single loop. Use workspace spend limits in the Claude Console, and per-session awareness with /cost. A factory-grade rule: if any single spec execution exceeds $5, kill it and investigate. Most well-written specs complete for under $1.

```json
// .claude/settings.json — cost-aware configuration
{
  "env": {
    "MAX_THINKING_TOKENS": "16000",
    "CLAUDE_CODE_EFFORT_LEVEL": "medium"
  }
}

// Use "high" effort + full thinking only for complex specs
// Use "medium" or "low" for straightforward implementations
```

How Do You Monitor a Factory You Can't See?

A dark factory that runs without visibility isn't autonomous — it's abandoned. The "dark" in dark factory means no human writes or reviews code, not that nobody is watching. You need three categories of metrics:

| Category | Metric | Healthy | Alarm |
|---|---|---|---|
| Quality | Holdout scenario pass rate | > 95% | < 85% |
| Quality | Rollback rate (merged PRs reverted) | < 5% | > 10% |
| Quality | Production incidents from agent code | < 1/month | > 3/month |
| Throughput | Specs completed per day | 5-15 | < 2 |
| Throughput | Spec-to-merge latency | < 2 hours | > 8 hours |
| Throughput | Spec iteration count (rewrites needed) | < 3 | > 5 |
| Cost | Token spend per completed spec | < $2 | > $10 |
| Cost | Daily total spend per developer | < $30 | > $80 |
| Cost | Wasted tokens (loops, dead ends) | < 15% | > 30% |

Building Your Dashboard

You don't need a fancy platform to start. The raw data is already available:

```bash
# Track token spend per session (field name per the CLI's JSON output;
# verify against your version)
claude --output-format json -p "implement spec" | \
  jq '.total_cost_usd'

# Track specs completed (count merged PRs from agent branches)
gh pr list --state merged --author @me --json number,title,mergedAt | \
  jq '[.[] | select(.title | startswith("feat:") or startswith("fix:"))] | length'

# Track rollback rate
MERGED=$(gh pr list --state merged --limit 100 --json number | jq length)
REVERTED=$(gh pr list --state merged --limit 100 --json labels | \
  jq '[.[] | select(.labels[]?.name == "reverted")] | length')
echo "Rollback rate: $(echo "scale=1; $REVERTED * 100 / $MERGED" | bc)%"
```

For production-grade monitoring, pipe these metrics into your existing observability stack (Grafana, Datadog, or even a spreadsheet). The key insight: treat your factory like a production service. It has SLOs, dashboards, and alerting — just like any other system you operate.

What Happens When an Agent Fails?

Every factory has defect rates. The question isn't "will agents fail" — it's "how fast do you detect and recover." There are three categories of failure in a dark factory:

Spec Failures

The agent can't implement the spec after 3+ iterations. The spec is ambiguous, contradictory, or beyond the agent's capability.

Fix: Improve the spec. Add examples, file references, and tighter constraints. If the agent still fails, the task needs a human.

Silent Regressions

The agent's code passes holdout scenarios but breaks an unrelated feature. This is the most dangerous failure mode.

Fix: Expand integration test coverage. Add cross-feature holdout scenarios. Run a full test suite (not just spec-specific) before merge.

Cost Runaways

An agent enters an infinite loop — retrying a failing build, reading unnecessary files, or thinking in circles. Token spend spikes.

Fix: Set MAX_THINKING_TOKENS, use workspace spend limits, and monitor /cost. Kill sessions that exceed $5 per spec.

Circuit Breakers

Borrow the circuit breaker pattern from distributed systems. If your factory hits any of these thresholds, stop auto-merging and investigate:

  • 3 rollbacks in 24 hours — something systemic is wrong (bad AGENTS.md, broken dependency, flawed spec template)
  • Holdout pass rate drops below 85% — the spec quality or scenario coverage has degraded
  • Daily spend exceeds 2x your rolling average — agents are looping or context is bloated
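Those three thresholds reduce to a few lines of shell. A sketch with illustrative stand-in metrics; a real version would read them from your dashboard pipeline:

```bash
# Stand-in metrics; wire these to your observability pipeline.
rollbacks_24h=3
holdout_pass_pct=96
daily_spend=40
rolling_avg_spend=30

# Trip the breaker if any threshold is crossed.
halt=no
if [ "$rollbacks_24h" -ge 3 ]; then halt=yes; fi
if [ "$holdout_pass_pct" -lt 85 ]; then halt=yes; fi
if [ "$(echo "$daily_spend > 2 * $rolling_avg_spend" | bc)" -eq 1 ]; then halt=yes; fi

echo "halt auto-merge: $halt"   # → halt auto-merge: yes
```

Here the rollback breaker trips even though spend and pass rate are healthy, which is the point: any one signal is enough to stop auto-merging and investigate.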

What Do Humans Do in a Dark Factory?

The shift from "developer" to "factory operator" is the hardest part of this transition — not technically, but psychologically. Your day looks radically different at Level 5:

| Activity | Time Before (Level 2) | Time After (Level 5) |
|---|---|---|
| Writing code | 60% | < 5% |
| Reviewing code | 20% | < 5% |
| Writing specs | 5% | 35% |
| Designing holdout scenarios | 0% | 20% |
| Monitoring factory metrics | 0% | 15% |
| Architecture decisions | 10% | 15% |
| Improving the factory itself | 5% | 10% |

Deloitte research shows that 86% of HR leaders recognize "digital labor integration" as central to strategic planning. The dark factory isn't replacing developers — it's changing what developers do. You become the person who decides what should exist and how well it should work, rather than the person who types it into existence.

Your Four-Week Scaling Plan

Week 1: Parallel Foundations

Run two agents in manual worktrees on independent specs. Track token spend with /cost. Get comfortable with the workflow before adding coordination.

Week 2: Cost Optimization

Enable model routing: Sonnet for implementation, Opus for architecture. Measure your cost-per-spec baseline. Move verbose CLAUDE.md instructions into skills.

Week 3: Observability

Build your metrics pipeline: merged PRs per day, rollback rate, token spend per spec. Set up circuit breakers. Start with a spreadsheet — graduate to Grafana later.

Week 4: Scale to 5+ Agents

Try Agent Teams for coordinated work. Enable auto-merge for specs with > 95% holdout pass rate. Aim for 5-10 completed specs per day.
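Auto-merge should be a gate, not a default. A sketch of the Week 4 decision: the pass rate and PR number are hypothetical, and the merge command is echoed rather than executed:

```bash
holdout_pass_pct=97    # hypothetical result from your scenario runner
pr_number=123          # hypothetical PR opened by the agent

# Merge only above the 95% holdout threshold; otherwise route to a human.
if [ "$holdout_pass_pct" -gt 95 ]; then
  decision="auto-merge"
  echo "would run: gh pr merge $pr_number --squash --auto"
else
  decision="human-review"
fi
echo "$decision"
```

Keeping the threshold strict means the human-review queue stays small and contains exactly the PRs that deserve attention.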

Does the Math Actually Work?

Let's be concrete. A mid-level developer costs a company roughly $150,000-200,000/year (fully loaded). A well-optimized dark factory setup:

| Cost Component | Monthly | Annual |
|---|---|---|
| Claude API (optimized, 3-5 parallel agents) | $200-400 | $2,400-4,800 |
| Human orchestrator time (spec writing, monitoring) | Included | Included |
| Output: 5-15 features/day | vs 2-3/day without | 3-5x throughput |

StrongDM reports spending roughly $1,000/day per engineer on tokens in their unoptimized factory. That sounds expensive until you realize their three-person team ships more code than a conventional team ten times their size. The ROI comes from throughput, not from cheap tokens.

For most teams, a more conservative approach works: $200-400/month/developer on tokens, producing 3-5x the output. That's a ~95% reduction in cost-per-feature compared to hiring additional developers.

Pitfalls When Scaling

Starting with Agent Teams

Agent Teams add 7x token overhead. Start with manual worktrees. Only graduate to teams when you need cross-agent coordination.

Skipping holdout scenarios

When throughput increases, quality gates become more important, not less. Never auto-merge without holdout validation — even if it slows the pipeline.

All Opus, all the time

Running five Opus agents in parallel is the fastest way to blow your budget. Route 70% of implementation work to Sonnet. Save Opus for architecture and complex reasoning.

No circuit breakers

Without automatic halt conditions, a single bad AGENTS.md change or broken dependency can trigger cascading failures across all agents.

Ignoring spec quality

At scale, spec quality is the bottleneck. A vague spec that takes 5 iterations to implement costs 5x more than a precise one that succeeds on the first pass.

Treating it like magic

The factory is a system. It needs maintenance, tuning, and continuous improvement. Allocate 10% of your time to improving the factory itself — better specs, better scenarios, better AGENTS.md.

Frequently Asked Questions

How many parallel agents should I start with?

Start with 2-3 agents in manual worktrees. This lets you learn the workflow (decomposing specs, tracking costs, handling merges) without the coordination overhead of Agent Teams. Scale to 5+ only after you have cost control and observability in place.

What does a dark factory cost per month?

With optimization (model routing, prompt caching, context discipline), expect $200-400/month/developer running 3-5 parallel agents. Without optimization, costs can exceed $1,000/month. StrongDM reports ~$1,000/day per engineer on their unoptimized factory, though most teams should aim for the $200-400/month range.

Can Agent Teams replace manual worktree management?

Agent Teams add coordinated task lists, inter-agent messaging, and automatic delegation on top of worktrees. They are ideal when agents need to share findings or depend on each other's output. For fully independent tasks, manual worktrees are cheaper and simpler. Agent Teams use roughly 7x more tokens than a single session.

How do I prevent merge conflicts between parallel agents?

Declare file ownership in each spec. If two specs touch the same file, run them sequentially. Git worktrees give each agent an isolated filesystem and branch. Merges happen through PRs, which trigger your normal CI/CD checks and holdout scenarios.

What metrics should I track first?

Start with three: specs completed per day (throughput), rollback rate (quality), and token spend per spec (cost). These give you the minimum viable dashboard to operate a factory. Add holdout pass rate and spec-to-merge latency once you have the basics.

When should I stop auto-merging and intervene?

Set circuit breakers: 3+ rollbacks in 24 hours, holdout pass rate below 85%, or daily spend exceeding 2x your rolling average. Any of these signals means something systemic changed — investigate before resuming auto-merge.
