Evals in Action: From Frontier Research to Production Applications

OpenAI · 6 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

OpenAI treats evaluation as a steering signal for expensive training runs, using it to intervene and adjust progress toward real outcomes.

Briefing

OpenAI’s evaluation push centers on a simple problem: classic AI benchmarks can show progress on test-like tasks while failing to predict whether models can actually perform valuable work. To close that gap, OpenAI built GDP val, a benchmark designed around real, economically meaningful jobs, and tracks model capability on it using blinded pairwise comparisons against industry experts. The goal is to steer expensive training runs with clearer signals and to anticipate how AI will affect labor, rather than waiting years for usage data to show up in GDP.

GDP val replaces “straight-A student” style testing with “work-ready” evaluation. OpenAI describes classic academic benchmarks like SAT/LSAT-style reasoning and high-school math competitions as useful but limited, especially once models approach near-perfect scores while still struggling with real-world tasks. The new benchmark is built around the structure of the U.S. economy: it starts with the top nine sectors contributing to U.S. GDP (each at over 5%), then selects the top five knowledge-work occupations per sector by wage contribution (from the Bureau of Labor Statistics), and finally includes the majority of tracked job tasks for each occupation. The result is over a thousand tasks spanning areas such as real estate, CAD design, retail, and more.

Tasks on GDP val are intentionally long-horizon and multimodal, often requiring days or weeks to complete and involving tools, images, and other real-world artifacts. Examples include a real estate agent creating a property brochure from photos and research, a manufacturing engineer producing a 3D CAD model of a cable reel stand, and a film/video editor assembling a high-energy intro reel with audio and video.

OpenAI grades models via pairwise expert grading: model outputs are compared directly to human expert deliverables, with graders blinded to which is which. The metric is a win rate: how often experts prefer the model output. On this scale, OpenAI reports that GPT-4o scored under a 20% win rate on GDP val in spring 2024, meaning experts would usually choose to do the work themselves. Over roughly 18 months, GPT-5 is reported to be nearing a 40% win rate, implying that in about half the cases experts would find the model output comparable or preferable. If the trend holds, OpenAI suggests models could reach parity with industry professionals soon, with some external models (e.g., Claude) already close.

The GDP val team also stresses what the benchmark is—and isn’t—meant to measure. It targets tasks with clear inputs and outputs, not the full messy workflow of real jobs (prioritization, iteration, manager feedback, and deciding what to work on). Future versions are intended to incorporate more of that workflow complexity.

For builders using OpenAI’s API, OpenAI’s product team adds a second layer: practical eval tooling for applications and agents. The eval product aims to make evaluation less manual by supporting visual dataset building, trace-based grading for agent runs, automated prompt optimization from failures, third-party model support via OpenRouter and bring-your-own-key, and enterprise controls like zero data retention and enterprise key management. A demo illustrates how a multi-agent investment analysis system can be evaluated node-by-node and end-to-end using graders, annotations, and trace grading—then iteratively improved by automatically rewriting prompts to meet rubric criteria. The overarching message: start evals early, use real user data, capture expert preferences, and automate the repetitive parts so teams can iterate faster without shipping on vibes.
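
To make the cross-model evaluation point concrete, here is a minimal sketch (not OpenAI’s eval product itself) of running the same prompt against several providers through OpenRouter’s OpenAI-compatible endpoint with your own key; the model identifiers are illustrative placeholders.

```python
# Minimal sketch, not the OpenAI eval product: run the same eval prompt
# against several models through OpenRouter's OpenAI-compatible API.
# Model identifiers are illustrative; check OpenRouter's catalog for real ones.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",   # OpenRouter's OpenAI-compatible endpoint
    api_key=os.environ["OPENROUTER_API_KEY"],  # bring-your-own-key
)

MODELS = ["openai/gpt-4o", "anthropic/claude-3.5-sonnet"]  # illustrative IDs
PROMPT = "Summarize the key risks in the attached quarterly filing."

def run_across_models(prompt: str) -> dict[str, str]:
    """Collect one completion per model so outputs can be graded side by side."""
    outputs = {}
    for model in MODELS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        outputs[model] = resp.choices[0].message.content
    return outputs
```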

Cornell Notes

OpenAI argues that AI evaluation must measure real work, not just test performance. To track frontier progress and guide training, it built GDP val, a benchmark of over 1,000 long-horizon, multimodal tasks derived from U.S. GDP sectors and knowledge-work occupations. Models are scored with blinded pairwise expert grading, producing a win rate that reflects whether industry professionals would prefer the model’s output over doing the task themselves. OpenAI reports that GPT-4o scored under 20% win rate on GDP val in spring 2024, while GPT-5 is nearing 40% over the following 18 months, suggesting movement toward parity. For application builders, OpenAI also offers an eval product that supports dataset creation, trace grading for agents, automated prompt optimization, and cross-model evaluation via OpenRouter.

Why do classic benchmarks stop being a reliable progress signal once models score near-perfect?

Classic benchmarks can saturate: once models hit ~100% on test-like tasks, they may still fail at real-world work where constraints are different. OpenAI uses the “straight A student” analogy—passing exams doesn’t guarantee job performance. The transcript gives an example from early GPT-4o training: despite being close to perfect on academic benchmarks, the model still wasn’t ready for real work, so a new evaluation signal was needed.

How is GDP val constructed, and what makes its tasks different from typical benchmark problems?

GDP val is built around the structure of the U.S. economy: it starts with the top nine sectors contributing to U.S. GDP (each >5%), then selects the top five knowledge-work occupations per sector by wage contribution (Bureau of Labor Statistics), and includes the majority of tracked job tasks for each occupation. The benchmark emphasizes long-horizon, multimodal tasks that often take days or weeks and require tools and artifacts like images, e.g., making a real estate brochure from photos and research, generating a 3D CAD model, or producing a video intro with audio.
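
As an illustration of that selection logic (not OpenAI’s actual pipeline, and with placeholder numbers rather than real GDP or BLS data), the filtering could look roughly like this:

```python
# Illustrative sketch of the selection logic described above, not OpenAI's
# actual pipeline. The records below are placeholders, not real BLS/GDP data.
SECTORS = [
    {"name": "Real Estate", "gdp_share": 0.13},
    {"name": "Manufacturing", "gdp_share": 0.10},
    {"name": "Retail Trade", "gdp_share": 0.06},
    {"name": "Arts & Entertainment", "gdp_share": 0.01},  # below the 5% cutoff
]

OCCUPATIONS = {  # per sector: (occupation, wage contribution in $B), placeholders
    "Real Estate": [("Real Estate Agent", 40.0), ("Property Manager", 25.0)],
    "Manufacturing": [("Industrial Engineer", 55.0), ("Mechanical Engineer", 50.0)],
    "Retail Trade": [("Buyer/Purchasing Agent", 30.0)],
}

def select_occupations(top_n: int = 5) -> dict[str, list[str]]:
    """Keep sectors above 5% of GDP, then take the top-N occupations by wage contribution."""
    selected = {}
    for sector in SECTORS:
        if sector["gdp_share"] <= 0.05:
            continue
        occs = sorted(OCCUPATIONS.get(sector["name"], []), key=lambda o: o[1], reverse=True)
        selected[sector["name"]] = [name for name, _ in occs[:top_n]]
    return selected

print(select_occupations())
```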

What scoring method does OpenAI use for GDP val, and what does “win rate” mean?

OpenAI uses pairwise expert grading: a model output is compared against a human expert deliverable, and graders are blinded to which is which. The grader chooses the preferred output, and results are aggregated into win rate—the percentage of comparisons where the model output is preferred. This is intended to be an unbiased measure of capability on real-world work that can be tracked over time.
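
A minimal sketch of how such blinded pairwise judgments might be aggregated into a win rate; the summary does not specify how GDP val handles ties, so the tie weighting below is an assumption.

```python
from dataclasses import dataclass

@dataclass
class PairwiseJudgment:
    """One blinded comparison: the grader saw outputs A/B without knowing which was the model."""
    preferred: str   # "A", "B", or "tie"
    model_slot: str  # which slot ("A" or "B") the model output was randomly assigned to

def win_rate(judgments: list[PairwiseJudgment], tie_weight: float = 0.5) -> float:
    """Share of comparisons in which the model output is preferred.
    Tie handling is an assumption; the summary doesn't say how GDP val scores ties."""
    score = 0.0
    for j in judgments:
        if j.preferred == "tie":
            score += tie_weight
        elif j.preferred == j.model_slot:
            score += 1.0
    return score / len(judgments)

# Example: 2 wins, 1 tie, 2 losses -> 0.5 win rate with ties counted as half
sample = [
    PairwiseJudgment("A", "A"), PairwiseJudgment("B", "B"),
    PairwiseJudgment("tie", "A"),
    PairwiseJudgment("A", "B"), PairwiseJudgment("B", "A"),
]
print(win_rate(sample))  # 0.5
```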

What do the reported GDP val win rates imply about GPT-4o vs GPT-5?

OpenAI reports that GPT-4o scored less than 20% win rate on GDP val in spring 2024, meaning experts would usually not choose the model output over doing the work themselves. Over about 18 months, GPT-5 is reported to be close to 40% win rate, implying that in roughly half the cases experts would prefer the model output or consider it comparable to human work. The transcript frames this as evidence of a trajectory toward parity.

How does OpenAI’s eval product help teams evaluate agentic systems beyond single prompts?

For agent and multi-agent systems, the product uses traces—logs of agent runs that are hard to interpret at scale. Trace grading lets teams run graders over completed traces, pinpoint failing traces and problematic spans, and focus debugging. The workflow also includes dataset building (including human annotations), LLM judge graders, and automated prompt optimization that rewrites prompts using annotations and failures to speed iteration.
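
A conceptual sketch of trace grading, with hypothetical Trace/Span types rather than the actual evals API: run a grader over every span of every completed run and collect the failures so debugging can start from the broken steps.

```python
# Conceptual sketch of trace grading, not the OpenAI evals API. A "trace" is a
# completed agent run; each span is one step (tool call, model call, handoff).
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Span:
    name: str    # e.g. "fetch_filings", "draft_report"
    output: str

@dataclass
class Trace:
    trace_id: str
    spans: list[Span] = field(default_factory=list)

def grade_traces(
    traces: list[Trace],
    grader: Callable[[Span], tuple[bool, str]],  # returns (passed, reason)
) -> list[dict]:
    """Run a grader over every span of every trace; return failing spans for debugging."""
    failures = []
    for trace in traces:
        for span in trace.spans:
            passed, reason = grader(span)
            if not passed:
                failures.append({"trace": trace.trace_id, "span": span.name, "reason": reason})
    return failures

# Example grader: flag spans whose output cites no source at all.
def cites_a_source(span: Span) -> tuple[bool, str]:
    ok = "source:" in span.output.lower()
    return ok, "ok" if ok else "no source cited"

# Usage: failures = grade_traces(my_traces, cites_a_source)
```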

What is the end-to-end evaluation workflow shown in the investment-fund demo?

The demo breaks a multi-agent system into components (nodes) and evaluates a single node first. It then builds a dataset with sample inputs and ground-truth columns, runs generations, adds expert annotations (ratings and free-text feedback), and creates an LLM judge grader with a rubric (e.g., include upside/downside, peer comparison, and buy/sell/hold; verify financial figures match ground truth). After grading reveals failures (like missing peer comparisons), an optimize step automatically rewrites the prompt using the collected signals. Finally, traces are graded end-to-end to catch issues like whether sources are authoritative and whether the final report includes a clear rating.
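
A hand-rolled sketch of an LLM-judge grader for a single node, using the rubric points mentioned in the demo; the judge model, prompt, and JSON contract are assumptions, not the hosted grader from OpenAI’s eval product.

```python
# Sketch of an LLM-judge grader for one node's output, using the rubric points
# mentioned in the demo. Hand-rolled judge, not the hosted grader from OpenAI's
# eval product; the model name and JSON contract are assumptions.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = [
    "States both upside and downside scenarios",
    "Compares the company against at least one peer",
    "Gives an explicit buy/sell/hold recommendation",
    "Financial figures match the ground-truth column",
]

def judge(node_output: str, ground_truth: str) -> dict:
    """Ask a judge model to score the output against each rubric item, returning JSON."""
    prompt = (
        "Grade the analysis below against each rubric item. "
        'Reply with JSON: {"results": [{"item": str, "pass": bool, "reason": str}]}\n\n'
        "Rubric:\n- " + "\n- ".join(RUBRIC) +
        f"\n\nGround truth figures:\n{ground_truth}\n\nAnalysis:\n{node_output}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```

Failing rubric items collected this way are the same signals the optimize step would consume when rewriting the node’s prompt.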

Review Questions

  1. What economic and labor-market data sources does GDP val use to select tasks, and how does that selection process shape what the benchmark measures?
  2. Why does trace grading matter more for agents than evaluating only final outputs, and what kinds of failures can it help locate?
  3. How does automated prompt optimization use annotations, outputs, and prompts to reduce iteration time compared with manual prompt editing?

Key Points

  1. OpenAI treats evaluation as a steering signal for expensive training runs, using it to intervene and adjust progress toward real outcomes.

  2. GDP val is built from U.S. GDP sectors and knowledge-work occupations, producing a large set of long-horizon, multimodal tasks tied to economically valuable work.

  3. Blinded pairwise expert grading yields a win rate that reflects whether industry professionals would prefer model outputs over human deliverables.

  4. Reported results show GPT-4o under 20% win rate on GDP val in spring 2024, while GPT-5 is nearing 40%, suggesting movement toward expert parity.

  5. OpenAI’s eval product for builders supports dataset creation, trace-based grading for agents, and automated prompt optimization to accelerate iteration.

  6. The eval tooling supports cross-model evaluation via OpenRouter and bring-your-own-key, and includes enterprise controls like zero data retention and enterprise key management.

  7. OpenAI cautions that GDP val measures tasks with clear inputs/outputs, not the full end-to-end messiness of real job workflows (prioritization, iteration, and decision-making).

Highlights

GDP val scores models with blinded pairwise comparisons against industry experts, turning “real work” into a trackable win-rate metric.
OpenAI reports a sharp jump from GPT-4o (<20% win rate) to GPT-5 (~40% win rate) on GDP val over roughly 18 months, a trajectory it argues is heading toward parity with industry professionals.
The eval product’s trace grading targets agent failures at the span level, so debugging can focus on the specific parts of a run that break the rubric.
Automated prompt optimization rewrites prompts using datasets, annotations, and grading outcomes—aiming to replace slow manual iteration with faster loops.

Topics

  • Frontier Evals
  • GDP Val
  • Expert Grading
  • Agent Trace Grading
  • Automated Prompt Optimization

Mentioned

  • GDP
  • GPT
  • LLM
  • CAD
  • GDP val
  • Bureau of Labor Statistics (BLS)