Dan Shapiro, CEO of Glowforge and a Wharton research fellow, published *The Five Levels: from Spicy Autocomplete to the Dark Factory* on January 23, 2026. The framing borrows directly from the SAE J3016 driving-automation taxonomy (2014) and asks the same question for code: how much of the work is the human doing, and how much is the model? Shapiro's answer is a six-rung ladder (zero-indexed, like the driving levels) from copy-pasted ChatGPT snippets at the bottom to a fully autonomous software factory at the top.
The framework caught on fast. Simon Willison amplified it five days later, singling out the Level 5 "Dark Factory" image: the lights are off because robots do not need to see. It landed because it gives teams a shared vocabulary for an argument they were already having: are we actually using AI, or are we just typing faster? Shapiro's blunt claim is that roughly 90% of self-described AI-native developers are stuck at Level 2 and do not realise it.
Why a level system at all?
The SAE driving levels worked because they replaced marketing words ("self-driving", "autopilot") with a numbered ladder that forced honesty about who is responsible when something goes wrong. Shapiro's point is that AI coding has the same vocabulary problem in 2026: "agentic", "autonomous", and "AI-first" mean nothing without specifying who reads the diff. Levels make the question precise: at Level 2, the human reads every line; at Level 4, the human reads specs and test results, not code; at Level 5, nobody reads the code at all.
Level 0 — Spicy Autocomplete
Definition: The model suggests a few characters or a snippet, and a human types the rest. Not a character hits disk without explicit human approval.
What changes: Almost nothing. The IDE is faster. The code is unmistakably yours.
Who is here today: Anyone using the original GitHub Copilot (launched June 2021) for tab-completion only, or copy-pasting from ChatGPT. Plenty of senior engineers in regulated industries deliberately stop here.
Stuck point: Mistaking autocomplete speedups for an AI coding practice. The model never reads the codebase, never runs the tests, and never gets a chance to do real work.
Level 1 — The Coding Intern
Definition: You hand the model discrete, low-stakes tasks — a unit test, a docstring, a regex, a small refactor — and review the output line by line.
What changes: You start trusting the model with self-contained chunks. You still type all the architecture.
Who is here today: Most teams that adopted Copilot Chat or Cursor's inline edit (Cmd-K) but turned off agent mode. Shapiro's framing: you offload the boilerplate and feel productive, but you are still moving at the rate you can type and review.
Stuck point: The work the model does is exactly the work you already knew how to do — so the speedup is real but capped. Nothing about your workflow has structurally changed.
Level 2 — The Junior Developer
Definition: Pair programming with the model. The agent edits multiple files, runs commands, proposes diffs. You read every line before accepting.
What changes: The model now writes the first draft of most code you ship. Your job inside the editor flips from author to reviewer-with-veto.
Who is here today: Per Shapiro, roughly 90% of developers who describe themselves as "AI-native". Cursor agent mode, Windsurf Cascade, and Claude Code in single-task mode all sit here when the human is reviewing every diff before accept. This is also where most team-wide rollouts of GitHub Copilot agent mode (GA April 2025) end up.
Stuck point: Shapiro is direct — "level 2, and every level after it, feels like you are done. But you are not done." Climbing past 2 means giving up the comfort of reading every line, which most engineers refuse to do because Level 2 already feels like science fiction compared to 2022.
Level 3 — The Developer
Definition: Most code is generated by the model. You become a full-time code reviewer with multiple agents running in parallel tabs. You spend the day reading the PRs your own agents opened.
What changes: Concurrency. You stop thinking in terms of "the task I'm working on" and start thinking in terms of "the four agents I have in flight and which one needs me next".
Who is here today: Heavy Claude Code users running multiple worktrees, Devin operators with several sessions open, and most of the experienced practitioners writing public case studies in early 2026. Shapiro notes most people who reach Level 3 plateau here because reviewing four diffs at once feels like it "got worse, not better".
Stuck point: Reviewer fatigue. Without test-suite discipline and clear specs feeding the agents, throughput at Level 3 is bottlenecked by how many diffs one human can read in a day.
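Mechanically, Level 3 is less exotic than it sounds. A minimal sketch of the dispatch side, in Python: one agent per git worktree, with a hypothetical `my-agent` CLI and repo path standing in for whatever headless agent and project you actually run.

```python
import subprocess
from pathlib import Path

REPO = Path("~/src/app").expanduser()  # hypothetical repo path
TASKS = {
    "fix-login-redirect": "Fix the OAuth redirect loop from issue 412",
    "add-csv-export": "Add CSV export to the reports page",
}

procs = {}
for branch, prompt in TASKS.items():
    worktree = REPO.parent / f"wt-{branch}"
    # git worktree gives each agent an isolated checkout on its own branch
    subprocess.run(
        ["git", "-C", str(REPO), "worktree", "add", "-b", branch, str(worktree)],
        check=True,
    )
    # hypothetical headless-agent CLI; substitute whatever your tooling provides
    procs[branch] = subprocess.Popen(["my-agent", "--task", prompt], cwd=worktree)

# the human's job is now triage: wait for each agent, then review its diff
for branch, proc in procs.items():
    proc.wait()
    print(f"{branch}: agent exited {proc.returncode}; diff ready for review")
```

The worktrees are the load-bearing trick: each agent gets an isolated checkout, so four agents in flight cannot trample each other's files.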
Level 4 — The Engineering Team
Definition: You stop reviewing code and start reviewing specs, plans, and test results. Your day looks like an engineering manager's — write a spec, argue with the agent about it, schedule the work, walk away for twelve hours, come back, check whether the tests pass.
What changes: The unit of work is the spec, not the diff. Skills, plans, test suites, and harness rules become the artefacts you actually craft. The code itself is a build artefact.
Who is here today: Shapiro publicly places himself at Level 4 running Glowforge engineering. The bridge from 3 to 4 is heavy investment in harness engineering — hooks, evaluators, sandboxes, and durable memory — because nobody is reading every line any more.
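The smallest useful version of that harness is a merge gate: a battery of evaluators that decides, without a human reading the diff, whether an agent's branch may merge. A sketch, assuming your own evaluator commands (the third entry is a hypothetical project-specific check):

```python
import subprocess

# Every evaluator must pass before an agent's branch may merge.
EVALUATORS = [
    ["pytest", "-q"],                           # the spec's acceptance tests
    ["ruff", "check", "."],                     # lint rules the agents must obey
    ["python", "scripts/check_migrations.py"],  # hypothetical project-specific check
]

def gate(worktree: str) -> bool:
    """Run every evaluator in the agent's worktree; any failure blocks merge."""
    for cmd in EVALUATORS:
        if subprocess.run(cmd, cwd=worktree).returncode != 0:
            print(f"blocked by: {' '.join(cmd)}")
            return False
    return True

if gate("../wt-add-csv-export"):
    print("evaluators green: mergeable without a human reading the diff")
```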
Stuck point: Spec drift. If the spec is sloppy, the agents ship sloppy code that passes tests because the tests were sloppy too. Level 4 is where bad specs become production incidents.
Level 5 — The Dark (Software) Factory
Definition: Nobody reviews the code. The system takes a goal in plain English, decomposes it, writes the implementation, generates and runs the tests, fixes its own bugs, and ships. Humans design the factory; the factory builds the software.
What changes: The job is no longer software engineering in the traditional sense. The work is designing the agents, the harness, the evaluation suites, and the rollback machinery — and proving the resulting software is correct without having read it.
Who is here today: Vanishingly few teams. The most-cited public example is StrongDM's engineering organisation, profiled in February 2026: per their own description, AI-produced code must not be reviewed by humans, and the engineering team's job is curating tests, harness rules, and architectural patterns. Most other claimed Level 5 stories collapse on inspection into Level 3 with extra automation.
Stuck point: The blast radius problem. Without exhaustive evaluators and sandboxing, "ship without review" is "page everyone at 3am". Level 5 is a statement about your test and rollback infrastructure, not your model.
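Reduced to a toy, the control flow of a dark factory looks like the sketch below. Every model, test, and deploy call is a stub; this is the shape the level describes, not StrongDM's actual pipeline.

```python
MAX_ATTEMPTS = 5

def implement(goal: str, feedback: str | None) -> str:
    return f"diff for: {goal}"           # stand-in for the coding agent

def run_evaluators(diff: str) -> tuple[bool, str]:
    return True, ""                      # stand-in for tests, lint, custom checks

def ship(diff: str) -> None:
    pass                                 # stand-in for merge + deploy

def healthy_in_production() -> bool:
    return True                          # stand-in for metrics and alerting

def rollback() -> None:
    pass                                 # stand-in for automated revert

def factory(goal: str) -> bool:
    feedback = None
    for _ in range(MAX_ATTEMPTS):
        diff = implement(goal, feedback)
        ok, feedback = run_evaluators(diff)
        if not ok:
            continue                     # agent retries with evaluator output
        ship(diff)
        if healthy_in_production():
            return True                  # shipped, and nobody read the code
        rollback()                       # a bad ship must be recoverable fast
    return False                         # give up; escalate to the factory's designers

factory("Add rate limiting to the public API")
```

The design point is the last two branches: evaluator feedback loops back to the agent, and a failed health check triggers rollback rather than a 3am page.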
The five levels at a glance
| Level | Human role | Unit of work | Representative tools |
|---|---|---|---|
| 0 — Spicy Autocomplete | Author, every keystroke | Lines of code | Original Copilot, ChatGPT copy-paste |
| 1 — Coding Intern | Author + delegate boilerplate | Functions, snippets | Cursor inline edit, Copilot Chat |
| 2 — Junior Developer | Reviewer of every diff | Multi-file change | Cursor agent, Windsurf Cascade |
| 3 — Developer | Concurrent reviewer, multi-agent | Whole feature | Claude Code worktrees, Devin |
| 4 — Engineering Team | Spec author + harness designer | Spec + test suite | Long-horizon agents on a hardened harness |
| 5 — Dark Factory | Factory designer, never reads code | Goal in English | StrongDM-style closed-loop pipeline |
Why do most teams plateau at Level 2?
Two reasons, and they reinforce each other. First, Level 2 already feels like a giant leap from 2022 — the agent edits real files across a real repo and the diff is mostly correct. The instinct is to declare victory. Second, the move to Level 3 requires giving up the line-by-line review habit, which feels like negligence to anyone trained before 2024. The result is teams who buy Cursor or Claude Code, settle into pair-programming mode, and stop. Shapiro's claim that 90% of AI-native developers are at Level 2 is a directional read on a Glowforge survey rather than a public benchmark, but the pattern matches what you see in conference talks through 2026.
Which level am I at?
- If you read every character before it lands in a file: Level 0 or 1.
- If the model writes most of the code but you read every diff: Level 2.
- If you have multiple agents running in parallel and your day is reviewing their PRs: Level 3.
- If you write specs, walk away for hours, and only look at test results and merge buttons: Level 4.
- If nobody on the team reads any AI-produced code and the tests are the contract: Level 5. (The whole checklist reduces to the decision function sketched below.)
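As a toy encoding (mine, not Shapiro's), the checklist is a decision function evaluated from the highest rung down:

```python
def ai_coding_level(
    nobody_reads_code: bool,     # tests are the contract
    specs_and_tests_only: bool,  # you read specs and results, never diffs
    parallel_agents: bool,       # several agents in flight, you triage PRs
    read_every_diff: bool,       # model drafts most code, you review every line
    delegate_chores: bool,       # tests/docstrings/regex go to the model
) -> int:
    """Map the self-assessment checklist to a level, highest rung first."""
    if nobody_reads_code:
        return 5
    if specs_and_tests_only:
        return 4
    if parallel_agents:
        return 3
    if read_every_diff:
        return 2
    return 1 if delegate_chores else 0

print(ai_coding_level(False, False, False, True, True))  # -> 2
```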
How to climb a level
1. 0 → 1: Pick three recurring chores (tests, docstrings, regex) and stop typing them. Always delegate.
2. 1 → 2: Turn agent mode on for a real multi-file task. Accept that the diff includes files you would not have edited yourself.
3. 2 → 3: Run a second worktree or a second agent tab. Force yourself into reviewer mode by removing the option to write the code yourself in parallel.
4. 3 → 4: Invest in the harness: hooks, evaluators, sandboxes, durable memory. Stop reading every diff; start trusting the test suite.
5. 4 → 5: The hard one. Build evaluators good enough that nobody needs to read the code, plus rollback machinery good enough that a bad ship is recoverable in minutes (a sketch follows this list).
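For the rollback half of that last climb, the starting point can be as small as a canary window wired to one metric. A sketch; `deploy.sh` and the metric source are placeholders for whatever your platform actually provides:

```python
import subprocess
import time

ERROR_BUDGET = 0.01     # max tolerated error rate during the canary window
CANARY_SECONDS = 300    # how long to watch before calling the ship good

def error_rate() -> float:
    return 0.002        # placeholder: query your metrics backend here

def deploy(version: str) -> None:
    subprocess.run(["./deploy.sh", version], check=True)  # hypothetical script

def ship_with_rollback(new: str, previous: str) -> bool:
    """Deploy `new`; auto-revert to `previous` if errors spike in the window."""
    deploy(new)
    deadline = time.monotonic() + CANARY_SECONDS
    while time.monotonic() < deadline:
        if error_rate() > ERROR_BUDGET:
            deploy(previous)  # a bad ship is recoverable in minutes, unattended
            return False
        time.sleep(10)
    return True
```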
Where the levels connect to the rest
The jump from Level 3 to Level 4 is the same jump described by harness engineering — it is not a model upgrade, it is engineering work around the model. The day-to-day shape of Level 3 work is documented in agentic coding workflows (the plan-execute-verify loop). Climbing from 2 to 4 without trustworthy tests is a fast route to a Level 5 outage, which is why test-driven development with AI becomes load-bearing past Level 2. The prompt-side counterpart, used heavily by Level 4 operators, is system prompts for agentic coding.