Can You Trust OpenAI Press Releases?
Based on The PrimeTime's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Treat “near-human” benchmark language as marketing unless the evaluation method, comparator versions, and prompting setup are clearly consistent and reproducible.
Briefing
AI labs’ press releases routinely present benchmark numbers as proof of “near-human” capability, but those figures often hinge on selective reporting, mismatched evaluation methods, and prompt-dependent testing—making the headline comparisons unreliable for deciding which model is actually better for real work. The most consequential takeaway is practical: small benchmark gaps can be meaningless, while the biggest “wins” in press materials may come from how results are measured and framed rather than from a model’s general competence.
A central example is Anthropic’s Claude 3 Opus announcement, which claimed “near human” comprehension and reported scores slightly above OpenAI’s GPT-4 on major benchmarks such as the Massive Multitask Language Understanding (MMLU) set. The numbers looked decisive at first glance—Opus at 86.8% versus GPT-4 at 86.4%—but the comparison was complicated by provenance: Anthropic’s GPT-4 reference performance came from an earlier OpenAI release blog (March 2023) for the original public GPT-4 model, not the later GPT-4 Turbo variants that were closer to the time of Claude 3’s release. Even when the gap is small, the framing can still steer readers toward the wrong conclusion about which model family is truly ahead.
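For a rough sense of scale, the sketch below estimates how much of a 0.4-point MMLU gap could come from sampling noise alone. It assumes a test set of roughly 14,000 questions and treats the two accuracies as independent binomial estimates, which is a simplification (a paired comparison on the exact same questions would be tighter); the point is only that gaps this small sit near the noise floor.

```python
import math

# Back-of-the-envelope: is an 86.8% vs 86.4% gap larger than sampling noise?
# Assumes ~14,000 MMLU test questions (approximate) and treats the two
# accuracies as independent binomial proportions (a simplification).
n = 14_000
p_opus, p_gpt4 = 0.868, 0.864

se = math.sqrt(p_opus * (1 - p_opus) / n + p_gpt4 * (1 - p_gpt4) / n)
diff = p_opus - p_gpt4

print(f"gap = {diff:.2%}, standard error of the gap = {se:.2%}, z = {diff / se:.2f}")
# Prints roughly: gap = 0.40%, standard error of the gap = 0.41%, z = 0.98
# A gap of about one standard error is well within plain sampling variation.
```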
The transcript also highlights how “near-human” language can be almost unfalsifiable. Without a clear definition of what “near human” means, percentages and categories become marketing tools—one person’s “near human” can be another’s “not meaningfully better.” The discussion argues that consumers and journalists should treat these claims as starting points, not endpoints.
Other press releases are described as more misleading through methodological asymmetries. One recurring issue is Chain-of-Thought prompting: some models are evaluated with step-by-step reasoning prompts that can boost performance, while others are tested without them. OpenAI’s GPT-4o (rendered as “jypy 40” in the transcript) is used as an example where reported gains on MMLU appear larger when Chain-of-Thought is used, yet shrink when the evaluation is aligned to the methods used for competing models. The same pattern shows up in how many benchmarks are included and which subsets are chosen: OpenAI reportedly used only six out of ten standard benchmarks in one comparison, while Anthropic used a different subset.
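The sketch below shows what “same prompting conditions” means in practice. It is a minimal illustration, not any lab’s actual harness, and `query_model` is a hypothetical placeholder for whichever provider client you use; the only point is that a comparison is meaningful when both models are scored under one fixed template.

```python
# Sketch: score two models on the same questions with the SAME prompt template.
# `query_model` is a hypothetical placeholder, not a real client library.

PLAIN_TEMPLATE = "Answer with a single letter (A-D).\n\n{question}"
COT_TEMPLATE = (
    "Think step by step, then give your final answer as a single letter (A-D) "
    "on the last line.\n\n{question}"
)

def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for the actual API call you would make (assumption)."""
    raise NotImplementedError("wire this up to the client you actually use")

def evaluate(model_name: str, questions: list[dict], template: str) -> float:
    """Accuracy on multiple-choice questions under one fixed prompt template."""
    correct = 0
    for q in questions:
        reply = query_model(model_name, template.format(question=q["text"]))
        # Naive answer extraction: take the first character of the last line.
        predicted = reply.strip().splitlines()[-1].strip()[:1].upper()
        correct += predicted == q["answer"]
    return correct / len(questions)

# Comparable: both models evaluated with Chain-of-Thought prompting.
#   evaluate("model_a", qs, COT_TEMPLATE)  vs  evaluate("model_b", qs, COT_TEMPLATE)
# Not comparable: one model gets Chain-of-Thought, the other a plain prompt.
#   evaluate("model_a", qs, COT_TEMPLATE)  vs  evaluate("model_b", qs, PLAIN_TEMPLATE)
```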
A separate case involves OpenAI’s dangerous-capability evaluation after GPT-4, where improvements appeared in component sub-tasks: both students and domain experts performed better on individual parts of a bioweapon design workflow when assisted by the model. But the headline significance depended on statistical choices, namely treating the subcomponents as independent tests and applying a multiple-comparison correction that effectively raised the bar for each one. When the task is assessed holistically under that adjustment, the transcript claims, the model’s overall improvement disappears.
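The sketch below illustrates the general mechanism with made-up counts, not the study’s actual data, using the Bonferroni correction as one common way of adjusting for multiple comparisons (the specific correction the study used is not stated here): several sub-tasks can look individually significant at p < 0.05 while none survive the corrected threshold.

```python
import math

def two_proportion_p_value(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided p-value for a difference in success rates (pooled z-test)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability

# Made-up counts for five workflow components: (with-model successes, n,
# without-model successes, n). Purely illustrative, not the study's data.
subtasks = [(34, 50, 24, 50), (32, 50, 21, 50), (31, 50, 20, 50),
            (33, 50, 23, 50), (30, 50, 22, 50)]

alpha = 0.05
corrected_alpha = alpha / len(subtasks)  # Bonferroni correction: 0.01

for i, counts in enumerate(subtasks, start=1):
    p = two_proportion_p_value(*counts)
    raw = "significant" if p < alpha else "not significant"
    corrected = "significant" if p < corrected_alpha else "not significant"
    print(f"sub-task {i}: p = {p:.3f}  raw: {raw:<15}  corrected: {corrected}")

# Most sub-tasks clear the uncorrected p < 0.05 bar, but none clear the
# corrected 0.01 threshold, so the "improvement" depends on the framing.
```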
The broader message is not “never trust benchmarks,” but “don’t trust headline numbers.” Benchmarks can be useful for rough orientation, especially when the cited methods, comparator models, prompts, and benchmark subsets are consistent and reproducible. When those details are missing—or when comparisons rely on selective reporting, mismatched prompting, or shifting evaluation setups—benchmark deltas can reflect measurement strategy more than real-world value. The practical prescription is to verify claims against the specific tasks that matter, ideally by running independent tests on representative problems, because model behavior changes over time and press releases rarely capture the full context needed to judge usefulness.
Cornell Notes
AI press releases often turn benchmark results into confident claims of “near-human” performance, but the underlying comparisons can be distorted by selective reporting and evaluation differences. Examples include Claude 3 Opus being compared to an older GPT-4 reference rather than newer GPT-4 Turbo variants, and GPT-4-style results looking better when Chain-of-Thought prompting is used while competitors are tested without it. The transcript also points to statistical framing in a dangerous capability study, where component improvements can vanish under a holistic significance test. The takeaway: benchmarks are reference points, not definitive proof—small percentage gaps may not translate into real-world advantage, so task-specific verification matters.
- Why can a benchmark score comparison between two models mislead readers even when the percentages are close?
- How does Chain-of-Thought prompting change what benchmark numbers mean?
- What does “selective reporting” look like in model press releases?
- Why can statistical adjustments erase apparent improvements in multi-part evaluations?
- What’s the practical difference between “benchmark capability” and “real-world usefulness”?
- If press releases are unreliable, what verification approach does the transcript recommend?
Review Questions
- What specific factors—beyond the headline percentage—can make two benchmark comparisons invalid or misleading?
- How would you design a task-specific evaluation to compare two LLMs for coding assistance while controlling for prompting differences?
- In a multi-part experiment, how can treating subcomponents as independent hypotheses change whether an overall improvement is considered statistically significant?
Key Points
1. Treat “near-human” benchmark language as marketing unless the evaluation method, comparator versions, and prompting setup are clearly consistent and reproducible.
2. Check whether the baseline model version matches the time and variant being claimed (e.g., original GPT-4 vs GPT-4 Turbo), since version mismatches can flip conclusions.
3. Assume Chain-of-Thought prompting can inflate scores; compare models only when evaluated with the same prompting conditions.
4. Watch for selective benchmark reporting: different benchmark subsets, in-house benchmarks, and omitted results can make a model look better than it is.
5. Statistical framing matters: corrections for multiple hypotheses can turn component-level improvements into non-significant holistic results.
6. Small benchmark gaps often don’t translate into meaningful user advantage, especially when methodological differences dominate.
7. The most reliable approach is task-specific testing on representative problems, ideally with a fixed prompt and evaluation rubric you control, as in the sketch below.