
The Dark Factory Pattern Part 4: Scaling the Factory

Scale your dark factory with multi-agent orchestration, cost control, and production observability. Covers git worktrees, Agent Teams, model routing, budget gates, circuit breakers, and the economics of autonomous development.


A single agent executing a single spec is a proof of concept. A factory is many agents running many specs in parallel, continuously, with predictable cost and quality. That's the gap between Level 4 and Level 5 — and most teams stall here. They have spec-driven development working for one feature at a time, but the moment they spin up three agents simultaneously, costs spike, merge conflicts multiply, and nobody knows which agent broke the build.

This guide covers the three systems you need to run a dark factory at scale: multi-agent orchestration (running agents in parallel without collisions), cost control (keeping your token bill under control as throughput grows), and observability (knowing what your factory is actually doing). By the end, you'll have a production-grade setup where you can push five specs in the morning and review five PRs after lunch — or let them auto-merge entirely.

[Figure: factory architecture. A human orchestrator (specs, priorities, decisions) feeds three agents, each in its own worktree (feature/auth-refactor, fix/webhook-retry, feat/dashboard-charts) running spec → implement → validate → PR. Two supporting systems span all agents: observability (merge rate, rollback rate, token spend, spec→merge latency, failure modes) and cost control (model routing, prompt caching, batching, $/feature, budget gates, spend limits).]

Why Does the Single-Agent Model Break Down?

In Part 3, you built a pipeline: write a spec, hand it to an agent, agent implements and validates, you review the PR (or auto-merge). It works beautifully — for one spec at a time. But real codebases don't have one task. They have a backlog of 20-50 items at any moment. Running them sequentially means your factory has a throughput of maybe 3-4 features per day per human.

The fix seems obvious: run multiple agents. But naive parallelism introduces three failure modes:

File Collisions

Two agents edit the same file. One overwrites the other. The merged result has subtle bugs that neither agent's tests catch because each only validated its own changes.

Context Explosion

Each agent loads the full codebase context. Three parallel Opus sessions burn through tokens 3x faster. Without model routing, a team of five engineers each running parallel agents can spend $1,000+/day.

Invisible Failures

An agent loops for 45 minutes on a failing test, burning tokens. Another silently produces code that passes its holdout scenarios but breaks an unrelated feature. You don't find out until production.

Google's 2025 DORA Report found that near-universal AI adoption (around 90% of developers) coincided with a 9% increase in bug rates, a 91% increase in code review time, and a 154% increase in PR size. These aren't problems with AI coding; they're problems with unstructured parallelism. The factory pattern solves them with isolation, routing, and monitoring.

How Do You Run Multiple Agents Without Collisions?

The answer is git worktrees — separate working directories that share the same repository history but have independent file systems and branches. Each agent gets its own worktree, works on its own branch, and never touches another agent's files. Merges happen through PRs, not through shared filesystem access.
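The plain-git version of this isolation model is worth seeing once, since it's what the tooling automates. A minimal sketch using a throwaway repo (names and paths are illustrative):

```bash
# One worktree + branch per agent: isolated files, shared history.
# Throwaway repo so this runs anywhere; names are hypothetical.
cd "$(mktemp -d)"
git init -q factory && cd factory
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "init"

# Each agent gets its own directory and branch:
git worktree add -q ../agent-auth -b feature-auth
git worktree add -q ../agent-webhooks -b fix-webhooks

git worktree list   # main checkout plus one line per agent
```

Because every worktree shares one object store, branches created in one are visible to all, and merges flow back through PRs rather than through the filesystem.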

Option 1: Manual Worktrees (DIY)

The simplest approach. Spin up Claude Code in separate worktrees using the --worktree flag:

```bash
# Terminal 1 — agent works on auth refactor
claude --worktree feature-auth

# Terminal 2 — agent works on webhook fix
claude --worktree fix-webhooks

# Terminal 3 — agent works on dashboard
claude --worktree feat-dashboard

# Each gets its own branch and working directory at:
# .claude/worktrees/feature-auth/
# .claude/worktrees/fix-webhooks/
# .claude/worktrees/feat-dashboard/
```

This is the pattern most developers start with. You're the orchestrator — you decide which specs go to which agent, monitor progress manually, and handle merges. It works well for 2-3 parallel agents.

Option 2: Agent Teams (Coordinated)

For 3-5+ parallel agents, Claude Code's Agent Teams feature adds coordination on top of worktree isolation. One session acts as the team lead, spawning teammates that communicate through a shared task list and messaging system.

```bash
# Enable agent teams (experimental)
# Add to .claude/settings.json:
# { "env": { "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1" } }

# Then tell Claude what you want:
> Create an agent team to implement these three specs in parallel.
> Each teammate should work in its own worktree.
> Use Sonnet for implementation, require plan approval before coding.
```
| Approach | Best For | Coordination | Token Cost |
|---|---|---|---|
| Manual worktrees | 2-3 independent tasks | You manage everything | Lowest |
| Subagents | Focused sub-tasks | Results report back | Low |
| Agent Teams | 3-5 collaborating tasks | Shared task list + messaging | ~7x single session |
| Headless automation | CI/CD pipeline integration | Script-driven | Varies |

The Decomposition Rule

Parallelism only works when tasks are truly independent. The decomposition rule: if two specs touch the same file, they must run sequentially. Your spec backlog should explicitly declare file ownership. When two specs overlap, the orchestrator (you, or the team lead agent) sequences them.

```markdown
# Spec: Rate Limiting Middleware
files_owned: [src/middleware/rate-limit.ts, src/middleware/index.ts]
depends_on: []

# Spec: Auth Token Rotation
files_owned: [src/auth/token.ts, src/auth/refresh.ts]
depends_on: []

# Spec: Auth Audit Logging  ← depends on token rotation
files_owned: [src/auth/audit.ts, src/auth/token.ts]
depends_on: [auth-token-rotation]
# ↑ Can't run in parallel with token rotation — they share token.ts
```
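The ownership declarations above can be checked mechanically before dispatching specs. A minimal sketch, assuming the one-line `files_owned: [...]` format shown (the spec filenames are hypothetical), that lists every file claimed by more than one spec:

```bash
# Throwaway specs reproducing the example above.
cd "$(mktemp -d)"
printf 'files_owned: [src/middleware/rate-limit.ts, src/middleware/index.ts]\n' > rate-limiting.md
printf 'files_owned: [src/auth/token.ts, src/auth/refresh.ts]\n' > token-rotation.md
printf 'files_owned: [src/auth/audit.ts, src/auth/token.ts]\n' > audit-logging.md

# Flatten every files_owned list to one path per line; any duplicate
# is a file that forces sequential execution.
conflicts=$(grep -h '^files_owned:' *.md \
  | sed 's/^files_owned: *\[//; s/\]//' \
  | tr ',' '\n' | sed 's/^ *//' \
  | sort | uniq -d)
echo "$conflicts"   # → src/auth/token.ts
```

Run this as a pre-dispatch gate: an empty result means the batch is safe to parallelize; any output means the orchestrator must sequence those specs.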

How Do You Control Costs at Scale?

The average Claude Code developer spends ~$6/day. A dark factory running five parallel agents on Opus can burn $50-100/day without optimization. At team scale (10 engineers, each running 3-5 agents), that's $1,500-3,000/day. Unoptimized, the economics don't work. Optimized, they're transformative — the equivalent of hiring 5-10 additional engineers for a fraction of the cost.

[Chart: estimated monthly cost per developer running 3 parallel agents. Naive (all Opus): $1,000; + prompt caching: $600; + model routing: $350; + batching: $200; optimized factory: $150.]

Strategy 1: Model Routing

Not every task needs Opus. Roughly 70% of coding tasks perform equally well on Sonnet, which costs 40% less ($3/$15 per MTok vs $5/$25). Route Opus to complex architectural decisions, multi-file refactors, and subtle bug fixes. Route Sonnet (or even Haiku for trivial tasks) to implementations with clear specs, test writing, and documentation.

```text
# In your agent team prompt:
> Spawn implementation teammates using Sonnet.
> Use Opus only for the architect role that reviews plans.

# Or per-subagent in .claude/agents/:
---
model: sonnet
---
You are a test-writing agent. Given a spec, write comprehensive tests.

# For headless automation:
claude --model sonnet -p "Implement the rate-limiting spec"
```

Strategy 2: Prompt Caching

Prompt caching reduces repeated context costs by up to 90%. When every agent loads the same CLAUDE.md, AGENTS.md, and project structure, those tokens are cached after the first request. The 5-minute cache window means parallel agents running simultaneously all benefit — a cache hit costs just 10% of the standard input price.

Strategy 3: Context Discipline

The biggest hidden cost isn't the model — it's bloated context. Each unnecessary file read adds thousands of tokens that compound across every subsequent message. Factory-grade context discipline:

Scoped CLAUDE.md

Keep the root CLAUDE.md under 500 lines. Move workflow-specific instructions into skills that load on-demand.

Subagent delegation

Delegate test runs and log analysis to subagents. Verbose output stays in the subagent's context — only a summary returns.

Aggressive /clear

Clear context between unrelated tasks within the same session. Stale context wastes tokens on every subsequent message.

Minimal MCP servers

Each MCP server adds tool definitions to context even when idle. Disable servers you're not actively using.
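The first of these disciplines is easy to enforce automatically. A sketch that flags any CLAUDE.md over the 500-line budget; the file layout here is a hypothetical stand-in for your repo:

```bash
# Demo layout: one oversized CLAUDE.md, one within budget.
cd "$(mktemp -d)"
mkdir -p app/api
seq 1 600 > app/api/CLAUDE.md       # 600 lines: candidate for skills
printf 'Keep it short.\n' > CLAUDE.md

# List every CLAUDE.md exceeding the 500-line budget.
over=$(find . -name CLAUDE.md -exec sh -c \
  '[ "$(wc -l < "$1")" -gt 500 ] && echo "$1"' _ {} \;)
echo "$over"   # → ./app/api/CLAUDE.md
```

Wire this into CI or a pre-commit hook so context bloat gets caught before it compounds across every agent session.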

Strategy 4: Budget Gates

Set hard limits so a runaway agent can't burn your monthly budget in a single loop. Use workspace spend limits in the Claude Console, and per-session awareness with /cost. A factory-grade rule: if any single spec execution exceeds $5, kill it and investigate. Most well-written specs complete for under $1.

```json
// .claude/settings.json — cost-aware configuration
{
  "env": {
    "MAX_THINKING_TOKENS": "16000",
    "CLAUDE_CODE_EFFORT_LEVEL": "medium"
  }
}

// Use "high" effort + full thinking only for complex specs
// Use "medium" or "low" for straightforward implementations
```

How Do You Monitor a Factory You Can't See?

A dark factory that runs without visibility isn't autonomous — it's abandoned. The "dark" in dark factory means no human writes or reviews code, not that nobody is watching. You need three categories of metrics:

| Category | Metric | Healthy | Alarm |
|---|---|---|---|
| Quality | Holdout scenario pass rate | > 95% | < 85% |
| Quality | Rollback rate (merged PRs reverted) | < 5% | > 10% |
| Quality | Production incidents from agent code | < 1/month | > 3/month |
| Throughput | Specs completed per day | 5-15 | < 2 |
| Throughput | Spec-to-merge latency | < 2 hours | > 8 hours |
| Throughput | Spec iteration count (rewrites needed) | < 3 | > 5 |
| Cost | Token spend per completed spec | < $2 | > $10 |
| Cost | Daily total spend per developer | < $30 | > $80 |
| Cost | Wasted tokens (loops, dead ends) | < 15% | > 30% |

Building Your Dashboard

You don't need a fancy platform to start. The raw data is already available:

```bash
# Track token spend per session (field name per the CLI's JSON output;
# verify against your version)
claude --output-format json -p "implement spec" | \
  jq '.total_cost_usd'

# Track specs completed (count merged PRs from agent branches)
gh pr list --state merged --author @me --json number,title,mergedAt | \
  jq '[.[] | select(.title | startswith("feat:") or startswith("fix:"))] | length'

# Track rollback rate
MERGED=$(gh pr list --state merged --limit 100 --json number | jq length)
REVERTED=$(gh pr list --state merged --limit 100 --json labels | \
  jq '[.[] | select(.labels[]?.name == "reverted")] | length')
echo "Rollback rate: $(echo "scale=1; $REVERTED * 100 / $MERGED" | bc)%"
```

For production-grade monitoring, pipe these metrics into your existing observability stack (Grafana, Datadog, or even a spreadsheet). The key insight: treat your factory like a production service. It has SLOs, dashboards, and alerting — just like any other system you operate.

What Happens When an Agent Fails?

Every factory has defect rates. The question isn't "will agents fail" — it's "how fast do you detect and recover." There are three categories of failure in a dark factory:

Spec Failures

The agent can't implement the spec after 3+ iterations. The spec is ambiguous, contradictory, or beyond the agent's capability.

Fix: Improve the spec. Add examples, file references, and tighter constraints. If the agent still fails, the task needs a human.

Silent Regressions

The agent's code passes holdout scenarios but breaks an unrelated feature. This is the most dangerous failure mode.

Fix: Expand integration test coverage. Add cross-feature holdout scenarios. Run a full test suite (not just spec-specific) before merge.

Cost Runaways

An agent enters an infinite loop — retrying a failing build, reading unnecessary files, or thinking in circles. Token spend spikes.

Fix: Set MAX_THINKING_TOKENS, use workspace spend limits, and monitor /cost. Kill sessions that exceed $5 per spec.

Circuit Breakers

Borrow the circuit breaker pattern from distributed systems. If your factory hits any of these thresholds, stop auto-merging and investigate:

  • 3 rollbacks in 24 hours — something systemic is wrong (bad AGENTS.md, broken dependency, flawed spec template)
  • Holdout pass rate drops below 85% — the spec quality or scenario coverage has degraded
  • Daily spend exceeds 2x your rolling average — agents are looping or context is bloated
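Those three thresholds reduce to a few lines of shell. A sketch with illustrative stand-in metrics; a real version would read them from your dashboard pipeline:

```bash
# Stand-in metrics; wire these to your observability pipeline.
rollbacks_24h=3
holdout_pass_pct=96
daily_spend=40
rolling_avg_spend=30

# Trip the breaker if any threshold is crossed.
halt=no
if [ "$rollbacks_24h" -ge 3 ]; then halt=yes; fi
if [ "$holdout_pass_pct" -lt 85 ]; then halt=yes; fi
if [ "$(echo "$daily_spend > 2 * $rolling_avg_spend" | bc)" -eq 1 ]; then halt=yes; fi

echo "halt auto-merge: $halt"   # → halt auto-merge: yes
```

Here the rollback breaker trips even though spend and pass rate are healthy, which is the point: any one signal is enough to stop auto-merging and investigate.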

What Do Humans Do in a Dark Factory?

The shift from "developer" to "factory operator" is the hardest part of this transition — not technically, but psychologically. Your day looks radically different at Level 5:

| Activity | Time Before (Level 2) | Time After (Level 5) |
|---|---|---|
| Writing code | 60% | < 5% |
| Reviewing code | 20% | < 5% |
| Writing specs | 5% | 35% |
| Designing holdout scenarios | 0% | 20% |
| Monitoring factory metrics | 0% | 15% |
| Architecture decisions | 10% | 15% |
| Improving the factory itself | 5% | 10% |

Deloitte research shows that 86% of HR leaders recognize "digital labor integration" as central to strategic planning. The dark factory isn't replacing developers — it's changing what developers do. You become the person who decides what should exist and how well it should work, rather than the person who types it into existence.

Your Four-Week Scaling Plan

Week 1: Parallel Foundations

Run two agents in manual worktrees on independent specs. Track token spend with /cost. Get comfortable with the workflow before adding coordination.

Week 2: Cost Optimization

Enable model routing: Sonnet for implementation, Opus for architecture. Measure your cost-per-spec baseline. Move verbose CLAUDE.md instructions into skills.

Week 3: Observability

Build your metrics pipeline: merged PRs per day, rollback rate, token spend per spec. Set up circuit breakers. Start with a spreadsheet — graduate to Grafana later.

Week 4: Scale to 5+ Agents

Try Agent Teams for coordinated work. Enable auto-merge for specs with > 95% holdout pass rate. Aim for 5-10 completed specs per day.
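Auto-merge should be a gate, not a default. A sketch of the Week 4 decision: the pass rate and PR number are hypothetical, and the merge command is echoed rather than executed:

```bash
holdout_pass_pct=97    # hypothetical result from your scenario runner
pr_number=123          # hypothetical PR opened by the agent

# Merge only above the 95% holdout threshold; otherwise route to a human.
if [ "$holdout_pass_pct" -gt 95 ]; then
  decision="auto-merge"
  echo "would run: gh pr merge $pr_number --squash --auto"
else
  decision="human-review"
fi
echo "$decision"
```

Keeping the threshold strict means the human-review queue stays small and contains exactly the PRs that deserve attention.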

Does the Math Actually Work?

Let's be concrete. A mid-level developer costs a company roughly $150,000-200,000/year (fully loaded). A well-optimized dark factory setup:

| Cost Component | Monthly | Annual |
|---|---|---|
| Claude API (optimized, 3-5 parallel agents) | $200-400 | $2,400-4,800 |
| Human orchestrator time (spec writing, monitoring) | Included | Included |
| Output: 5-15 features/day | vs 2-3/day without | 3-5x throughput |

StrongDM reports spending roughly $1,000/day per engineer on tokens in their unoptimized factory. That sounds expensive until you realize their three-person team ships more code than a conventional team ten times their size. The ROI comes from throughput, not from cheap tokens.

For most teams, a more conservative approach works: $200-400/month/developer on tokens, producing 3-5x the output. That's a ~95% reduction in cost-per-feature compared to hiring additional developers.

Pitfalls When Scaling

Starting with Agent Teams

Agent Teams add 7x token overhead. Start with manual worktrees. Only graduate to teams when you need cross-agent coordination.

Skipping holdout scenarios

When throughput increases, quality gates become more important, not less. Never auto-merge without holdout validation — even if it slows the pipeline.

All Opus, all the time

Running five Opus agents in parallel is the fastest way to blow your budget. Route 70% of implementation work to Sonnet. Save Opus for architecture and complex reasoning.

No circuit breakers

Without automatic halt conditions, a single bad AGENTS.md change or broken dependency can trigger cascading failures across all agents.

Ignoring spec quality

At scale, spec quality is the bottleneck. A vague spec that takes 5 iterations to implement costs 5x more than a precise one that succeeds on the first pass.

Treating it like magic

The factory is a system. It needs maintenance, tuning, and continuous improvement. Allocate 10% of your time to improving the factory itself — better specs, better scenarios, better AGENTS.md.

Frequently Asked Questions

How many parallel agents should I start with?

Start with 2-3 agents in manual worktrees. This lets you learn the workflow (decomposing specs, tracking costs, handling merges) without the coordination overhead of Agent Teams. Scale to 5+ only after you have cost control and observability in place.

What does a dark factory cost per month?

With optimization (model routing, prompt caching, context discipline), expect $200-400/month/developer running 3-5 parallel agents. Without optimization, costs can exceed $1,000/month. StrongDM reports ~$1,000/day per engineer on their unoptimized factory, though most teams should aim for the $200-400/month range.

Can Agent Teams replace manual worktree management?

Agent Teams add coordinated task lists, inter-agent messaging, and automatic delegation on top of worktrees. They are ideal when agents need to share findings or depend on each other's output. For fully independent tasks, manual worktrees are cheaper and simpler. Agent Teams use roughly 7x more tokens than a single session.

How do I prevent merge conflicts between parallel agents?

Declare file ownership in each spec. If two specs touch the same file, run them sequentially. Git worktrees give each agent an isolated filesystem and branch. Merges happen through PRs, which trigger your normal CI/CD checks and holdout scenarios.

What metrics should I track first?

Start with three: specs completed per day (throughput), rollback rate (quality), and token spend per spec (cost). These give you the minimum viable dashboard to operate a factory. Add holdout pass rate and spec-to-merge latency once you have the basics.

When should I stop auto-merging and intervene?

Set circuit breakers: 3+ rollbacks in 24 hours, holdout pass rate below 85%, or daily spend exceeding 2x your rolling average. Any of these signals means something systemic changed — investigate before resuming auto-merge.
