
The only AutoResearch tutorial you’ll ever need

David Ondrej · 5 min read

Based on David Ondrej's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

AutoResearch runs an autonomous experiment loop that keeps only changes that improve a single locked metric and discards the rest.

Briefing

AutoResearch is an open-source loop that lets an AI agent run experiments autonomously, keep changes that improve a single measurable metric, and discard the rest—turning “model training” into a general-purpose method for self-improvement. The practical payoff is speed and scale: with a fixed time budget, the system can iterate through hundreds of experiments overnight, making each attempt directly comparable and reducing the temptation to “cheat” by training longer or optimizing the wrong objective.

At the center of the method is a strict file-and-metric structure. A human-defined file (program.md) sets the goal, constraints, and rules. A second file (train.py) is the only one the agent is allowed to modify; it can change code, configuration, prompts, or even math—anything intended to improve performance. A third file (prepare.py) defines the evaluation metric and scoring logic, and the agent cannot touch it. That separation matters because it prevents the agent from rewriting the yardstick to manufacture better results. The loop itself is simple: the agent proposes a hypothesis, edits train.py, runs a short training/evaluation cycle (the transcript cites about five minutes as a typical training window), then checks the score. If the metric improves, the change is committed to git history; if not, the system resets and repeats.
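The commit-or-reset rule described above can be sketched as a plain Python loop. This is a hypothetical simplification, not the repository's actual code: `propose` stands in for the agent editing train.py, and `score` stands in for the locked prepare.py evaluation.

```python
import random

def improvement_loop(score, propose, initial, iterations=100):
    """Keep only changes that improve the metric; discard everything else."""
    best, best_score = initial, score(initial)
    for _ in range(iterations):
        candidate = propose(best)            # agent's edit to train.py
        candidate_score = score(candidate)   # locked evaluation (prepare.py)
        if candidate_score > best_score:     # metric improved -> "git commit"
            best, best_score = candidate, candidate_score
        # otherwise the change is discarded -> "git reset"
    return best, best_score

# Toy usage: maximize -(x - 3)^2 by random perturbation.
random.seed(0)
best, best_score = improvement_loop(
    score=lambda x: -(x - 3) ** 2,
    propose=lambda x: x + random.uniform(-1, 1),
    initial=0.0,
)
```

Because a change is only ever kept when the score improves, the loop is monotone: the recorded metric can never get worse than the baseline, which is exactly what makes overnight runs of hundreds of experiments safe to leave unattended.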

The broader claim is that AutoResearch isn’t limited to optimizing machine learning models. Any domain with (1) a clear scalar metric, (2) automated evaluation without human judgment, and (3) a single modifiable file can be turned into an autonomous experiment loop. That includes marketing A/B tests (email copy, ad creatives, landing pages, headlines, thumbnails, YouTube titles), software performance tuning (speeding up codebases or making open-source models run faster locally), prompt engineering (searching for better system instructions and even trying different languages or difficulty levels), and trading strategies. In trading, the transcript gives a concrete example: scoring experiments by the Sharpe ratio to balance returns against risk.
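For the trading case, the scalar the transcript names is the Sharpe ratio. In its simplest, annualization-free form it is mean excess return divided by return volatility, which a locked prepare.py could compute in a few lines (a minimal sketch, not the repository's actual scoring code):

```python
import math

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by the sample standard deviation of returns."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    variance = sum((e - mean) ** 2 for e in excess) / (len(excess) - 1)
    return mean / math.sqrt(variance)
```

The point of using this as the score is that two strategies with identical average returns rank differently: the steadier one gets the higher Sharpe ratio, so the loop is rewarded for balancing return against risk rather than chasing raw gains.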

Where the approach breaks down is also explicit. If “better” is subjective—brand design, UX, or pricing decisions without an objective proxy—AutoResearch can drift toward random or misleading optimizations. Even with automation, a bad metric can produce confident improvement in the wrong direction.

The tutorial then demonstrates a first loop built from Karpathy’s repository. Using an IDE (VS Code or Cursor) and a coding agent (the transcript mentions Claude Code and Codex CLI), the workflow clones the AutoResearch repo, creates a small website project, and benchmarks baseline performance using Puppeteer. Next, it writes a new program.md tailored to website-load-time optimization, then runs the experiment loop. In the example, the median load time drops from roughly 50 ms into the low 30s (the transcript reports 33 ms, then 28 ms, continuing toward 25 ms), illustrating how quickly iterative metric-driven changes can compound when evaluation is automated.
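The shape of that benchmark can be approximated without a browser. The sketch below is an assumption, not the tutorial's actual Puppeteer script: it times an arbitrary zero-argument task and reports the median in milliseconds, the same kind of metric the website example optimizes.

```python
import statistics
import time

def median_ms(task, runs=10):
    """Run `task` several times and return the median wall-clock time in ms.

    In the tutorial the task is a headless-browser page load driven by
    Puppeteer; here it is any zero-argument callable, which keeps the
    measurement logic separate from what is being measured."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

Taking the median rather than the mean makes the score robust to one-off slow runs (cold caches, scheduler hiccups), so a change is only rewarded for shifting typical load time.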

The closing message ties the method to a larger vision: distributed, recursive self-improvement where AI agents run continuous experiments across many machines—an “end goal” for frontier labs—while individuals can start small by building their own metric-driven loop.

Cornell Notes

AutoResearch is a loop for autonomous experimentation: an AI agent proposes changes, runs an automated evaluation, and keeps only the edits that improve a single measurable metric. The system relies on a strict separation of responsibilities: program.md defines the goal and constraints, train.py is the only file the agent can modify, and prepare.py locks the scoring metric so the agent can’t “cheat.” With a fixed time budget, experiments become comparable, enabling hundreds of iterations overnight. The same pattern can optimize far more than ML training—anything with an objective scalar metric and automated evaluation—though it fails when outcomes are subjective or the metric is poorly chosen.

What makes AutoResearch different from ordinary “agentic” coding or hyperparameter search?

AutoResearch enforces a closed loop with a locked evaluation. The agent can only edit train.py, while prepare.py defines the metric and scoring logic and cannot be modified. program.md sets the rules and constraints. After each experiment, the system either commits the change to git history (if the metric improves) or resets and tries again (if it worsens). That structure turns experimentation into a reliable, metric-driven optimization process rather than open-ended tinkering.

Why does the transcript emphasize a fixed time budget for experiments?

A time box makes experiments directly comparable. If one candidate gets more time to train or run than another, it can win simply by doing more work. By allocating the same time budget to each attempt, the loop ensures the comparison reflects the quality of the proposed change, not the amount of compute or duration.
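One straightforward way to enforce that time box is a hard subprocess timeout around each training/evaluation run. This is a sketch under that assumption; the repository's actual mechanism may differ.

```python
import subprocess

def run_timeboxed(cmd, budget_s=300):
    """Run one experiment under a fixed budget (~5 minutes in the transcript).

    Returns captured stdout if the run finishes in time, or None if the
    budget is exceeded, so an over-budget attempt can never "win" simply
    by doing more work than its competitors."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=budget_s)
        return done.stdout
    except subprocess.TimeoutExpired:
        return None
```

Treating a timeout as a failed experiment (rather than letting the run finish) is what keeps every attempt comparable under the same compute budget.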

What are the three file roles (program.md, train.py, prepare.py) and why does each matter?

program.md is the human-defined control panel: it sets the goal, constraints, and rules for the agent. train.py is the single editable target: the agent changes it to implement hypotheses (code, config, prompts, or other optimizable elements). prepare.py is the locked evaluation: it defines what “better” means via a metric and scoring script. Preventing edits to prepare.py stops the agent from gaming the score by rewriting the yardstick.
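Put together, a minimal program.md for such a loop might read as follows. This is illustrative only; the actual file format used in the repository may differ.

```markdown
# Goal
Reduce the median page load time as measured by prepare.py.

# Rules
- You may only edit train.py.
- Do not modify prepare.py or this file.
- Each experiment gets a fixed 5-minute budget.
- A change is kept only if the metric improves; otherwise it is reverted.
```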

In what kinds of tasks does the transcript claim AutoResearch can work well outside ML?

Any measurable workflow with automated scoring can fit the pattern. Examples given include marketing A/B tests (email copy, ad creatives, landing pages, headlines, thumbnails, YouTube titles), software performance tuning (making codebases faster or running models more efficiently on a laptop/phone), prompt engineering (searching for better system instructions and even trying different languages or difficulty levels), and trading strategies (testing buy/sell rules and scoring by Sharpe ratio).

When does AutoResearch fail or become unreliable?

It struggles when “better” is subjective or slow to evaluate. The transcript calls out brand design, UX, and pricing as areas where outcomes often depend on human judgment. It also warns that a bad metric leads to confident optimization of the wrong thing—because the loop will dutifully maximize whatever number it’s given.

Review Questions

  1. How does locking prepare.py prevent the agent from producing misleading improvements?
  2. Why does time-boxing experiments matter for fairness and interpretability of results?
  3. Give one example of a task you could measure with a single scalar metric and outline what would correspond to program.md, train.py, and prepare.py.

Key Points

  1. AutoResearch runs an autonomous experiment loop that keeps only changes that improve a single locked metric and discards the rest.
  2. A strict file separation—program.md (rules), train.py (only editable), and prepare.py (locked scoring)—prevents metric gaming.
  3. Time-boxing experiments makes results comparable by ensuring each attempt gets the same compute budget.
  4. The method generalizes beyond ML when outcomes can be reduced to an objective scalar metric with automated evaluation.
  5. Tasks with subjective success criteria (e.g., UX or brand design) are poor fits because the agent can’t reliably judge “better.”
  6. A bad metric can cause the system to optimize the wrong objective while still reporting improvement.
  7. A practical first loop can be built by benchmarking a baseline (e.g., via Puppeteer) and then running a metric-driven program.md experiment loop to iteratively reduce load time.

Highlights

  • AutoResearch keeps improvements by committing successful edits and uses git reset to roll back failures, turning evaluation into a tight optimization cycle.
  • The agent can only modify train.py; prepare.py defines the metric and is off-limits, which blocks cheating.
  • Marketing, prompt engineering, trading, and performance tuning can all use the same loop if “better” is measurable and automatically scored.
  • The tutorial’s website example shows load time dropping from about 50 ms into the 30 ms range within minutes, then continuing downward as experiments accumulate.

Topics

  • AutoResearch Loop
  • Metric-Driven Optimization
  • Agentic Experimentation
  • program.md / train.py / prepare.py
  • Puppeteer Benchmarking