The only AutoResearch tutorial you’ll ever need
Based on David Ondrej's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
AutoResearch runs an autonomous experiment loop that keeps only changes that improve a single locked metric and discards the rest.
Briefing
AutoResearch is an open-source loop that lets an AI agent run experiments autonomously, keep changes that improve a single measurable metric, and discard the rest—turning “model training” into a general-purpose method for self-improvement. The practical payoff is speed and scale: with a fixed time budget, the system can iterate through hundreds of experiments overnight, making each attempt directly comparable and reducing the temptation to “cheat” by training longer or optimizing the wrong objective.
At the center of the method is a strict file-and-metric structure. A human-defined file (program.md) sets the goal, constraints, and rules. A second file (train.py) is the only one the agent is allowed to modify; it can change code, configuration, prompts, or even math—anything intended to improve performance. A third file (prepare.py) defines the evaluation metric and scoring logic, and the agent cannot touch it. That separation matters because it prevents the agent from rewriting the yardstick to manufacture better results. The loop itself is simple: the agent proposes a hypothesis, edits train.py, runs a short training/evaluation cycle (the transcript cites about five minutes as a typical training window), then checks the score. If the metric improves, the change is committed to git history; if not, the system resets and repeats.
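To make the mechanics concrete, here is a minimal sketch of that loop in Python. It is an illustrative assumption, not the project's actual implementation: it assumes train.py and prepare.py sit in the current directory, that prepare.py prints the scalar score to stdout, that git tracks train.py, and that a higher score is better.

```python
# Minimal sketch of an AutoResearch-style loop (illustrative, not the real codebase).
# File names come from the transcript; run_experiment(), score(), and the git
# commands here are assumptions used to show the shape of the loop.
import subprocess

def run_experiment(timeout_s: int = 300) -> None:
    # Run the agent-edited training/evaluation cycle under a fixed time budget
    # (~5 minutes per the transcript) so every attempt is comparable.
    subprocess.run(["python", "train.py"], timeout=timeout_s, check=True)

def score() -> float:
    # prepare.py owns the metric; the agent never modifies this step.
    out = subprocess.run(["python", "prepare.py"],
                         capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

def loop(iterations: int) -> None:
    best = score()
    for _ in range(iterations):
        # (1) the agent proposes a hypothesis and edits train.py here
        run_experiment()
        new = score()
        if new > best:
            # Keep only changes that improve the locked metric.
            best = new
            subprocess.run(["git", "commit", "-am", f"improve metric to {new:.4f}"],
                           check=True)
        else:
            # Otherwise discard the edit and try again.
            subprocess.run(["git", "checkout", "--", "train.py"], check=True)
```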
The broader claim is that AutoResearch isn’t limited to optimizing machine learning models. Any domain with (1) a clear scalar metric, (2) automated evaluation without human judgment, and (3) a single modifiable file can be turned into an autonomous experiment loop. That includes marketing A/B tests (email copy, ad creatives, landing pages, headlines, thumbnails, YouTube titles), software performance tuning (speeding up codebases or making open-source models run faster locally), prompt engineering (searching for better system instructions and even trying different languages or difficulty levels), and trading strategies. In trading, the transcript gives a concrete example: scoring experiments by the Sharpe ratio to balance returns against risk.
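For the trading case, the locked scoring file could compute nothing more than an annualized Sharpe ratio over the strategy's returns. The sketch below is illustrative only; it assumes a series of daily returns, a default risk-free rate of zero, and a hypothetical function name.

```python
import numpy as np

def sharpe_ratio(daily_returns: np.ndarray,
                 risk_free_rate: float = 0.0,
                 periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio: mean excess return divided by its volatility."""
    excess = daily_returns - risk_free_rate / periods_per_year
    vol = excess.std(ddof=1)
    if vol == 0:
        return 0.0  # avoid division by zero on a flat return series
    return float(np.sqrt(periods_per_year) * excess.mean() / vol)
```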
Where the approach breaks down is also explicit. If “better” is subjective—brand design, UX, or pricing decisions without an objective proxy—AutoResearch can drift toward random or misleading optimizations. Even with automation, a bad metric can produce confident improvement in the wrong direction.
The tutorial then demonstrates a first loop built from Karpathy’s repository. Using an IDE (VS Code or Cursor) and a coding agent (the transcript mentions Claude Code and Codex CLI), the workflow clones the AutoResearch repo, creates a small website project, and benchmarks baseline performance with Puppeteer. Next, it writes a new program.md tailored to website-load-time optimization and runs the experiment loop. In the example, the median load time drops from roughly 50 ms into the low 30s and then further (the transcript reports 33 ms, then 28 ms, continuing toward 25 ms), illustrating how quickly iterative, metric-driven changes can compound when evaluation is automated.
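The transcript benchmarks with Puppeteer (a Node.js library). As a rough Python analogue, the sketch below measures a median page-load time with Playwright's sync API; the URL, run count, and reliance on the browser's navigation-timing entry are assumptions for illustration, not the tutorial's exact script.

```python
from statistics import median
from playwright.sync_api import sync_playwright

def median_load_time_ms(url: str, runs: int = 10) -> float:
    """Load the page `runs` times and return the median navigation duration in ms."""
    samples = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        for _ in range(runs):
            page = browser.new_page()
            page.goto(url, wait_until="load")
            # Read the browser's own navigation-timing entry for this load.
            duration = page.evaluate(
                "() => performance.getEntriesByType('navigation')[0].duration"
            )
            samples.append(duration)
            page.close()
        browser.close()
    return median(samples)

if __name__ == "__main__":
    print(median_load_time_ms("http://localhost:8000"))  # hypothetical local site
```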
The closing message ties the method to a larger vision: distributed, recursive self-improvement where AI agents run continuous experiments across many machines—an “end goal” for frontier labs—while individuals can start small by building their own metric-driven loop.
Cornell Notes
AutoResearch is a loop for autonomous experimentation: an AI agent proposes changes, runs an automated evaluation, and keeps only the edits that improve a single measurable metric. The system relies on a strict separation of responsibilities: program.md defines the goal and constraints, train.py is the only file the agent can modify, and prepare.py locks the scoring metric so the agent can’t “cheat.” With a fixed time budget, experiments become comparable, enabling hundreds of iterations overnight. The same pattern can optimize far more than ML training—anything with an objective scalar metric and automated evaluation—though it fails when outcomes are subjective or the metric is poorly chosen.
- What makes AutoResearch different from ordinary “agentic” coding or hyperparameter search?
- Why does the transcript emphasize a fixed time budget for experiments?
- What are the three file roles (program.md, train.py, prepare.py) and why does each matter?
- In what kinds of tasks does the transcript claim AutoResearch can work well outside ML?
- When does AutoResearch fail or become unreliable?
Review Questions
- How does locking prepare.py prevent the agent from producing misleading improvements?
- Why does time-boxing experiments matter for fairness and interpretability of results?
- Give one example of a task you could measure with a single scalar metric and outline what would correspond to program.md, train.py, and prepare.py.
Key Points
1. AutoResearch runs an autonomous experiment loop that keeps only changes that improve a single locked metric and discards the rest.
2. A strict file separation—program.md (rules), train.py (only editable), and prepare.py (locked scoring)—prevents metric gaming.
3. Time-boxing experiments makes results comparable by ensuring each attempt gets the same compute budget.
4. The method generalizes beyond ML when outcomes can be reduced to an objective scalar metric with automated evaluation.
5. Tasks with subjective success criteria (e.g., UX or brand design) are poor fits because the agent can’t reliably judge “better.”
6. A bad metric can cause the system to optimize the wrong objective while still reporting improvement.
7. A practical first loop can be built by benchmarking a baseline (e.g., via Puppeteer) and then running a metric-driven program.md experiment loop to iteratively reduce load time.