
Tobi Lütke Made a 20-Year-Old Codebase 53% Faster Overnight. Here's How.

6 min read

Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.

TL;DR

“Agents” in production typically split into four species: coding harnesses, dark factories, auto research, and orchestration frameworks.

Briefing

“Agents” aren’t one thing. In production, LLM systems that use tools and feedback loops tend to fall into four distinct “species,” and mixing them up leads to wasted effort, brittle systems, and quality failures. The practical takeaway: choose the agent type based on the goal—coding a task, coordinating a project, optimizing a metric, or routing workflow steps—rather than based on the model name powering the system.

The first species is the coding harness: an agent that effectively stands in for a developer to write and modify code inside a controlled environment. It can read and write files, run searches, and use the tools provided in its context. Variants differ in how safely they operate (for example, one approach prefers a virtual machine for isolation, while another tends to work directly on a local laptop), but the underlying pattern is the same: a human remains the quality gate while the agent executes coding work. Scale comes from decomposition—breaking a large project into well-defined chunks that can be handled by multiple single-threaded agents in parallel. The human’s job shifts toward planning and approving work, not doing the coding itself.
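The coding-harness pattern described above can be sketched as a simple tool-use loop. This is a minimal illustration, not any vendor's actual API: the `model` callable, the tool names, and the in-memory filesystem are all hypothetical stand-ins.

```python
# Minimal sketch of a coding-harness loop. The agent requests tool
# calls (read/write files) until it signals completion; the human
# remains the quality gate, reviewing the result before merging.

def read_file(path, fs):
    """Tool: return file contents from an in-memory filesystem."""
    return fs.get(path, "")

def write_file(path, content, fs):
    """Tool: write content; the human later reviews the change."""
    fs[path] = content
    return "ok"

def run_harness(task, fs, model, max_steps=10):
    """Execute tool calls the model requests, up to a step budget.

    Returns the modified filesystem for human review rather than
    deploying anything automatically.
    """
    for _ in range(max_steps):
        action = model(task, fs)  # e.g. {"tool": "write_file", ...}
        if action["tool"] == "done":
            break
        if action["tool"] == "read_file":
            read_file(action["path"], fs)
        elif action["tool"] == "write_file":
            write_file(action["path"], action["content"], fs)
    return fs

# Toy "model": performs one rename, then signals completion.
def toy_model(task, fs):
    if "new_name" not in fs.get("app.py", ""):
        return {"tool": "write_file", "path": "app.py",
                "content": "def new_name(): pass"}
    return {"tool": "done"}

fs = {"app.py": "def old_name(): pass"}
result = run_harness("rename old_name to new_name", fs, toy_model)
```

The step budget (`max_steps`) is the harness-level safety rail; real harnesses add sandboxing (the virtual-machine approach mentioned above) on top of it.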

When projects get larger, coding harnesses evolve into project-level multi-agent systems like Cursor’s approach. Instead of one long-running agent holding the whole mental model, a planner agent manages a queue of tasks and spawns short-lived “execution” agents to solve specific subproblems. Success depends on the planner’s ability to track tasks, maintain context, and evaluate whether each execution agent delivered correct results. Cursor’s experience also highlights a design principle: keep the harness conceptually simple so it can scale—adding extra management layers can backfire.
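The planner/executor split can be sketched as follows. The decomposition, the check, and the agent behaviors are placeholder logic for illustration, not Cursor's implementation.

```python
# Sketch of a project-level multi-agent system: a planner holds the
# task queue and context; short-lived executors each solve exactly
# one subproblem, and the planner evaluates each result.

def planner(project):
    """Decompose a project into well-defined task chunks (toy logic)."""
    return [{"id": i, "spec": step} for i, step in enumerate(project["steps"])]

def executor(task):
    """A short-lived agent: handle one task, return a result, exit."""
    return {"id": task["id"], "output": f"done: {task['spec']}"}

def check(result):
    """Planner-side evaluation of whether the executor delivered."""
    return result["output"].startswith("done:")

def run_project(project):
    queue = planner(project)
    results, retries = [], []
    for task in queue:
        result = executor(task)
        (results if check(result) else retries).append(result)
    return results, retries

results, retries = run_project({"steps": ["write parser", "add tests"]})
```

Note how little machinery this needs: the design lesson about keeping the harness conceptually simple shows up here as one queue, one check, and no extra management layers.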

The second species is the dark factory: an architecture that minimizes human involvement in the middle of the software pipeline. The system takes a specification, iterates automatically, and only proceeds when software passes evaluations. Humans typically focus on getting intent and requirements right at the start, and on accountability at the end (often reviewing code or monitoring production). The core idea is to remove bottlenecks: agents can push work forward quickly, and humans can’t always keep up during rapid iteration. Risk management matters—enterprise teams often require human review even when the middle is automated.
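The dark-factory shape is essentially an evaluation-gated loop. Here is a hedged sketch in which `generate` and `evaluate` are hypothetical callables standing in for the real generation and evaluation stages:

```python
# Dark-factory loop: iterate on a candidate until the evaluation
# suite passes. Humans supply the spec up front and review the
# result at the end; the middle runs without them.
def dark_factory(spec, generate, evaluate, max_iters=20):
    candidate, feedback = None, None
    for i in range(max_iters):
        candidate = generate(spec, feedback)
        passed, feedback = evaluate(candidate)
        if passed:
            return candidate, i + 1   # ready for human review/merge
    raise RuntimeError("evals never passed; escalate to a human")

# Toy generate/evaluate pair that converges after three rounds.
def toy_generate(spec, feedback):
    return (feedback or 0) + 1

def toy_evaluate(candidate):
    return candidate >= 3, candidate

artifact, iters = dark_factory("spec", toy_generate, toy_evaluate)
```

The `max_iters` ceiling plus the final `RuntimeError` encode the risk-management point above: even an automated middle needs an escalation path back to a human.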

The third species is auto research, which is not about producing working software directly. It’s about optimizing a measurable metric via repeated experiments—essentially a hill-climbing loop. A metric is mandatory; without one, there’s no “research” to optimize. Examples include improving runtime performance in a codebase (as with Tobi Lütke’s Liquid presentation framework) or tuning model settings, and the same approach can apply to business metrics like conversion rates if sufficient data exists.
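The hill-climbing loop at the heart of auto research fits in a few lines. This sketch uses a toy numeric metric; in practice `metric` would be a runtime benchmark or a conversion measurement, and `propose` would be an LLM suggesting a code or configuration variant.

```python
import random

# Greedy hill-climbing sketch of auto research: propose a variant,
# keep it only if the metric improves. Without a metric, there is
# nothing for this loop to optimize.
def auto_research(initial, propose, metric, budget=100):
    best, best_score = initial, metric(initial)
    for _ in range(budget):
        cand = propose(best)
        score = metric(cand)
        if score > best_score:          # accept only improvements
            best, best_score = cand, score
    return best, best_score

# Toy example: maximize -(x - 7)^2, whose optimum is x = 7.
rng = random.Random(0)
best, score = auto_research(
    initial=0.0,
    propose=lambda x: x + rng.uniform(-1, 1),
    metric=lambda x: -(x - 7) ** 2,
    budget=500,
)
```

The `budget` parameter is the experiment count; real auto-research runs trade it off against the cost of each experiment.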

The fourth species is orchestration frameworks, which coordinate handoffs between specialized agents—writer to editor, drafter to researcher, or ticket routing in customer success. Orchestration is essentially workflow routing (A to B), but it can feel heavy because it requires careful context and prompt management at every joint. It becomes worth the complexity when the scale is large enough—thousands or millions of routed items—so the coordination overhead pays off.
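A customer-success ticket flow like the one mentioned above can be sketched as explicit routing between specialized agents. The agent names and handoff order here are illustrative, not from any real product:

```python
# Orchestration as workflow routing: each step is a specialized
# agent that mutates shared context (the ticket) and names the
# next hop; a router walks the chain until a terminal step.
def researcher(ticket):
    ticket["notes"] = f"findings for: {ticket['question']}"
    return "drafter"                       # next hop

def drafter(ticket):
    ticket["draft"] = f"reply based on {ticket['notes']}"
    return "closer"

def closer(ticket):
    ticket["status"] = "closed"
    return None                            # end of workflow

AGENTS = {"researcher": researcher, "drafter": drafter, "closer": closer}

def route(ticket, start="researcher"):
    """Run the ticket through agent handoffs until a terminal step."""
    step = start
    while step is not None:
        step = AGENTS[step](ticket)
    return ticket

ticket = route({"question": "refund policy?"})
```

Every "joint" in this chain is a place where context and prompts must be managed, which is exactly why orchestration feels heavy at low volume and only pays off at scale.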

A final cheat sheet ties it together: use coding harnesses when judgment is the gate; use project-level multi-agent harnesses when humans still judge but work must be parallelized; use dark factories when evaluations and specifications are strong enough to automate the middle; use auto research when the target is a metric; and use orchestration when the problem is routing multi-step work. The warning is direct: don’t try to use auto research to build software, or force long-running coding harnesses into tasks that really require orchestration or human-driven creativity.

Cornell Notes

LLM “agents” in production usually fall into four different species, each optimized for a different kind of job. Coding harnesses replace a developer for coding tasks, often scaling through decomposition and parallel single-threaded agents; project-level harnesses add a planner that spawns short-lived execution agents. Dark factories automate the middle of software generation until evaluations pass, reducing human bottlenecks while keeping accountability at the start and end. Auto research optimizes a measurable metric through repeated experiments (hill-climbing), not by directly building software. Orchestration frameworks route work between specialized agents, but their complexity only pays off at large scale.

What makes a coding harness a “coding harness,” and how does it scale beyond one agent?

A coding harness is built around executing coding work: the agent reads and writes files, uses search, and runs with the tools placed into its context so it can modify a codebase toward a task. Scaling usually comes from decomposition—splitting a large project into chunks that are well-defined enough to assign to multiple single-threaded agents. The human’s role becomes planning, chunking, and judging outputs, rather than writing every line of code.

How do project-level multi-agent coding systems differ from single-agent harnesses?

Project-level systems shift management from the human to an agentic planner. A planner agent tracks tasks and context, then spawns short-running execution agents that each tackle one subproblem. The planner must also assess whether each execution result was done well, using evaluations or checks. The key design lesson mentioned is to keep the harness simple so it can scale; adding extra management layers can reduce effectiveness.

What is the defining feature of a dark factory, and where does human involvement fit?

A dark factory is designed to minimize human involvement in the middle of the process. After a specification is provided, the system iterates automatically until software passes evaluations, then proceeds. Humans typically focus on requirements and intent at the start and on accountability at the end—often reviewing code or monitoring production. The goal is to avoid human bottlenecks when agent iteration is fast.

Why is auto research fundamentally different from coding, and what requirement must be met?

Auto research optimizes a metric via repeated experiments, rather than producing working software as the direct objective. A metric is non-negotiable: without a measurable target, there’s no hill-climbing loop to run. Examples include optimizing runtime performance for a codebase (like Liquid) or tuning model settings; business metrics like conversion rate can also work if enough data points exist.

When does orchestration become worth the complexity?

Orchestration coordinates handoffs between specialized agents (A to B), such as routing a ticket through research, drafting, and closure steps. It requires significant prompt/context/procedure management at each joint, so it can feel heavy. The text frames the decision as a scale question: orchestration is worth it when volume is high enough—thousands, millions, or tens of millions of routed items—so coordination overhead is justified by the throughput gains.

How should teams choose among the four species using the “cheat sheet” logic?

If the human’s judgment is the quality gate, start with coding harnesses. If the work is large and needs parallelization, use project-level multi-agent harnesses where a planner and executor agents work against evals but humans still judge. If specifications and evals are strong enough to automate the middle, move toward dark factories while keeping human accountability at key points. If the goal is improving a measurable metric through experiments, use auto research. If the goal is routing multi-step workflow tasks, use orchestration frameworks.
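The cheat-sheet logic above can be condensed into a toy decision function. The boolean inputs are a deliberate simplification of the article's criteria, not a complete rubric:

```python
# Toy chooser implementing the cheat-sheet ordering: metric goals
# and routing problems are ruled out first, then the strength of
# evals and the need for parallelism pick among the coding species.
def choose_species(goal_is_metric, is_routing, evals_strong, needs_parallel):
    if goal_is_metric:
        return "auto research"
    if is_routing:
        return "orchestration framework"
    if evals_strong:
        return "dark factory"
    if needs_parallel:
        return "project-level multi-agent harness"
    return "coding harness"           # default: human judgment is the gate

species = choose_species(
    goal_is_metric=False, is_routing=False,
    evals_strong=False, needs_parallel=True,
)
```

The ordering matters: checking for a metric goal first encodes the warning that auto research should never be reached for when the real goal is building software.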

Review Questions

  1. You’re building a system that iterates until software passes evals with minimal human checks midstream. Which agent species fits best, and what human roles remain?
  2. A team wants to improve conversion rate using LLM-driven experiments. Why does this align with auto research rather than a coding harness?
  3. What design change enables project-level multi-agent coding systems to scale: decomposition, planning/execution roles, or workflow routing? Explain briefly.

Key Points

  1. “Agents” in production typically split into four species: coding harnesses, dark factories, auto research, and orchestration frameworks.
  2. Coding harnesses replace developer work for coding tasks using tools like file access and search, with human judgment acting as the quality gate.
  3. Project-level multi-agent coding systems use a planner agent to spawn short-lived execution agents and track/evaluate results, aiming for scalable simplicity.
  4. Dark factories automate the middle of software generation until evaluations pass, reducing human bottlenecks while keeping accountability at the start and end.
  5. Auto research optimizes a measurable metric through repeated experiments (hill-climbing); it requires a clear optimization target.
  6. Orchestration frameworks route work between specialized agents, but their overhead only pays off at large workflow scale.
  7. Choosing the wrong species for the goal—such as using auto research to build software—creates failure modes and wasted engineering effort.

Highlights

The “species” framework reframes agent building as goal-driven engineering: task execution, project coordination, metric optimization, or workflow routing.
Project-scale coding harnesses work by planner-managed short-lived executor agents, not by one agent holding the entire project in mind.
Dark factories are defined by evaluation-gated automation with humans mostly at the beginning and end, not in the middle.
Auto research is metric-shaped optimization via experiments; without a metric, it isn’t auto research.
Orchestration is powerful but heavy—its value depends on routing scale being large enough to justify context and handoff complexity.

Topics

  • Agent Species
  • Coding Harnesses
  • Dark Factories
  • Auto Research
  • Orchestration Frameworks