Tobi Lütke Made a 20-Year-Old Codebase 53% Faster Overnight. Here's How.
Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.
Briefing
“Agents” aren’t one thing. In production, LLM systems that use tools and feedback loops tend to fall into four distinct “species,” and mixing them up leads to wasted effort, brittle systems, and quality failures. The practical takeaway: choose the agent type based on the goal—coding a task, coordinating a project, optimizing a metric, or routing workflow steps—rather than based on the model name powering the system.
The first species is the coding harness: an agent that effectively stands in for a developer to write and modify code inside a controlled environment. It can read and write files, run searches, and use the tools provided in its context. Variants differ in how safely they operate (for example, one approach prefers a virtual machine for isolation, while another tends to work directly on a local laptop), but the underlying pattern is the same: a human remains the quality gate while the agent executes coding work. Scale comes from decomposition—breaking a large project into well-defined chunks that can be handled by multiple single-threaded agents in parallel. The human’s job shifts toward planning and approving work, not doing the coding itself.
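To make the pattern concrete, here is a minimal sketch of a coding-harness loop in Python. Everything in it is illustrative rather than any product's API: `call_model` is a hypothetical stand-in for an LLM client, the tool set (read_file, write_file, search) simply mirrors the capabilities described above, and the `approve` callback is where the human quality gate sits.

```python
from pathlib import Path
from typing import Callable

def read_file(path: str) -> str:
    return Path(path).read_text()

def write_file(path: str, content: str) -> None:
    Path(path).write_text(content)

def search(root: str, needle: str) -> list[str]:
    return [str(p) for p in Path(root).rglob("*")
            if p.is_file() and needle in p.read_text(errors="ignore")]

def call_model(task: str, observation: str) -> dict:
    """Hypothetical LLM call: returns the next tool action as a dict,
    e.g. {"tool": "read_file", "args": {"path": "app.py"}} or {"tool": "done"}.
    Wire a real model client in here."""
    raise NotImplementedError

def run_harness(task: str, approve: Callable[[dict], bool]) -> None:
    observation = ""
    while True:
        action = call_model(task, observation)
        tool, args = action["tool"], action.get("args", {})
        if tool == "done":
            return
        if tool == "write_file":
            # The human remains the quality gate: no edit lands unapproved.
            if not approve(args):
                observation = "change rejected; propose a revision"
                continue
            write_file(**args)
            observation = "write ok"
        elif tool == "read_file":
            observation = read_file(**args)
        elif tool == "search":
            observation = "\n".join(search(**args))
```

Decomposition then means running several of these single-threaded loops in parallel, one per well-scoped chunk, with the human approving each loop's output rather than writing the code.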
When projects get larger, coding harnesses evolve into project-level multi-agent systems like Cursor’s approach. Instead of one long-running agent holding the whole mental model, a planner agent manages a queue of tasks and spawns short-lived “execution” agents to solve specific subproblems. Success depends on the planner’s ability to track tasks, maintain context, and evaluate whether each execution agent delivered correct results. Cursor’s experience also highlights a design principle: keep the harness conceptually simple so it can scale—adding extra management layers can backfire.
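A rough sketch of that planner/executor split follows, assuming hypothetical `plan`, `execute`, and `accept` stubs in place of real agent calls. The point it illustrates: the planner owns the task queue and the judgment, while each execution agent gets a fresh, single-task context and then disappears.

```python
from collections import deque

def plan(goal: str) -> list[str]:
    # Hypothetical planner agent: decompose the goal into small, well-defined tasks.
    return [f"{goal}: step {i}" for i in range(1, 4)]

def execute(task: str) -> str:
    # Hypothetical short-lived execution agent: fresh context, one task, then gone.
    return f"result of ({task})"

def accept(task: str, result: str) -> bool:
    # Planner-side evaluation: did the execution agent actually deliver?
    return result.startswith("result of")

def run_project(goal: str) -> list[str]:
    queue, done = deque(plan(goal)), []
    while queue:
        task = queue.popleft()
        result = execute(task)
        if accept(task, result):
            done.append(result)
        else:
            queue.append(task)  # re-queue rather than grow one agent's context
    return done

print(run_project("migrate auth module"))
```

Keeping the loop this simple is the design principle in question: the planner tracks and evaluates, and nothing else manages the managers.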
The second species is the dark factory: an architecture that minimizes human involvement in the middle of the software pipeline. The system takes a specification, iterates automatically, and only proceeds when software passes evaluations. Humans typically focus on getting intent and requirements right at the start, and on accountability at the end (often reviewing code or monitoring production). The core idea is to remove bottlenecks: agents can push work forward quickly, and humans can’t always keep up during rapid iteration. Risk management matters—enterprise teams often require human review even when the middle is automated.
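Architecturally, the dark factory reduces to iterate-until-evals-pass. In this sketch, `generate` and `run_evals` are hypothetical placeholders for the code-writing agent and the evaluation suite; the human contributions live outside the loop, in the spec up front and in any final review.

```python
def generate(spec: str, feedback: str) -> str:
    # Hypothetical code-writing agent: produce or revise code from the spec.
    if not feedback:
        return f"# code for: {spec}\n# TODO: handle edge cases"
    return f"# code for: {spec}\n# edge cases handled per: {feedback}"

def run_evals(code: str) -> tuple[bool, str]:
    # Hypothetical eval suite: returns (passed, failure report).
    passed = "TODO" not in code
    return passed, "" if passed else "remove remaining TODOs"

def dark_factory(spec: str, max_iters: int = 20) -> str | None:
    feedback = ""
    for _ in range(max_iters):
        code = generate(spec, feedback)
        passed, feedback = run_evals(code)
        if passed:
            return code  # proceeds only once evals pass; human review can still follow
    return None          # out of budget: escalate to a human instead of looping forever

print(dark_factory("parse invoices per spec v3"))
```

The quality of the whole system hinges on `run_evals`: weak evaluations let bad software through the automated middle, which is why enterprise teams keep a human review at the end.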
The third species is auto research, which is not about producing working software directly. It’s about optimizing a measurable metric via repeated experiments: essentially a hill-climbing loop. A metric is mandatory; without one, there’s no “research” to optimize. Examples include improving runtime performance in a codebase (as with Tobi Lütke’s work on Shopify’s Liquid template language) or tuning model settings, and the same approach can apply to business metrics like conversion rates if sufficient data exists.
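Stripped of the LLM, auto research is just hill-climbing against a metric. The sketch below uses a toy numeric objective; `propose_change` and `measure` are illustrative stand-ins for “run an experiment” and “score it.”

```python
import random

def propose_change(current: float) -> float:
    # Hypothetical experiment generator: perturb the current configuration.
    return current + random.uniform(-0.1, 0.1)

def measure(x: float) -> float:
    # The mandatory metric; here a toy objective that peaks at x = 1.0.
    return -(x - 1.0) ** 2

def auto_research(start: float, budget: int = 200) -> float:
    best, best_score = start, measure(start)
    for _ in range(budget):
        candidate = propose_change(best)
        score = measure(candidate)
        if score > best_score:  # keep only improvements: hill-climbing
            best, best_score = candidate, score
    return best

print(auto_research(0.0))
```

If `measure` cannot be defined, the loop has nothing to climb, which is exactly why a metric is the entry requirement for this species.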
The fourth species is orchestration frameworks, which coordinate handoffs between specialized agents—writer to editor, drafter to researcher, or ticket routing in customer success. Orchestration is essentially workflow routing (A to B), but it can feel heavy because it requires careful context and prompt management at every joint. It becomes worth the complexity when the scale is large enough—thousands or millions of routed items—so the coordination overhead pays off.
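At its smallest, orchestration is a routing table: a sequence of specialized agents and a context object handed across the joints. The writer/editor functions below are toy stand-ins, not any framework’s API; in a real system, each handoff is an LLM call whose context and prompt must be managed carefully.

```python
from typing import Callable

def writer(ctx: dict) -> dict:
    # Hypothetical writer agent: drafts from the request.
    ctx["draft"] = f"draft for: {ctx['request']}"
    return ctx

def editor(ctx: dict) -> dict:
    # Hypothetical editor agent: a toy "edit" of the draft.
    ctx["final"] = ctx["draft"].upper()
    return ctx

PIPELINE: list[Callable[[dict], dict]] = [writer, editor]  # the A-to-B handoff

def route(request: str) -> dict:
    ctx = {"request": request}
    for agent in PIPELINE:
        ctx = agent(ctx)  # every joint needs careful context management
    return ctx

print(route("launch announcement")["final"])
```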
A final cheat sheet ties it together: use coding harnesses when judgment is the gate; use project-level multi-agent harnesses when humans still judge but work must be parallelized; use dark factories when evaluations and specifications are strong enough to automate the middle; use auto research when the target is a metric; and use orchestration when the problem is routing multi-step work. The warning is direct: don’t try to use auto research to build software, or force long-running coding harnesses into tasks that really require orchestration or human-driven creativity.
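The cheat sheet fits in a small lookup. The dictionary below is purely illustrative, with keys paraphrasing the conditions above:

```python
# Illustrative only: the cheat sheet as a condition -> species lookup.
CHEAT_SHEET = {
    "human judgment gates each change": "coding harness",
    "humans still judge, but work must parallelize": "project-level multi-agent harness",
    "specs and evals are strong enough to automate the middle": "dark factory",
    "the target is a measurable metric": "auto research",
    "the problem is routing multi-step work at scale": "orchestration framework",
}

for condition, species in CHEAT_SHEET.items():
    print(f"If {condition}: use a {species}.")
```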
Cornell Notes
LLM “agents” in production usually fall into four different species, each optimized for a different kind of job. Coding harnesses replace a developer for coding tasks, often scaling through decomposition and parallel single-threaded agents; project-level harnesses add a planner that spawns short-lived execution agents. Dark factories automate the middle of software generation until evaluations pass, reducing human bottlenecks while keeping accountability at the start and end. Auto research optimizes a measurable metric through repeated experiments (hill-climbing), not by directly building software. Orchestration frameworks route work between specialized agents, but their complexity only pays off at large scale.
- What makes a coding harness a “coding harness,” and how does it scale beyond one agent?
- How do project-level multi-agent coding systems differ from single-agent harnesses?
- What is the defining feature of a dark factory, and where does human involvement fit?
- Why is auto research fundamentally different from coding, and what requirement must be met?
- When does orchestration become worth the complexity?
- How should teams choose among the four species using the “cheat sheet” logic?
Review Questions
- You’re building a system that iterates until software passes evals with minimal human checks midstream. Which agent species fits best, and what human roles remain?
- A team wants to improve conversion rate using LLM-driven experiments. Why does this align with auto research rather than a coding harness?
- What design change enables project-level multi-agent coding systems to scale: decomposition, planning/execution roles, or workflow routing? Explain briefly.
Key Points
1. “Agents” in production typically split into four species: coding harnesses, dark factories, auto research, and orchestration frameworks.
2. Coding harnesses replace developer work for coding tasks using tools like file access and search, with human judgment acting as the quality gate.
3. Project-level multi-agent coding systems use a planner agent to spawn short-lived execution agents and track/evaluate results, aiming for scalable simplicity.
4. Dark factories automate the middle of software generation until evaluations pass, reducing human bottlenecks while keeping accountability at the start and end.
5. Auto research optimizes a measurable metric through repeated experiments (hill-climbing); it requires a clear optimization target.
6. Orchestration frameworks route work between specialized agents, but their overhead only pays off at large workflow scale.
7. Choosing the wrong species for the goal, such as using auto research to build software, creates failure modes and wasted engineering effort.