
SmartGPT: Major Benchmark Broken - 89.0% on MMLU + Exam's Many Errors

AI Explained · 5 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

MMLU’s single-character, immediate-answer format can push models to commit before deeper reasoning, increasing error risk.

Briefing

A widely used language-model benchmark—MMLU—has been found to contain enough flawed, ambiguous, or misformatted questions that reported “near-human” scores can be meaningfully overstated. After running large-scale evaluations of GPT models with a prompt-engineering framework (“SmartGPT”), the researchers report an unofficial MMLU score of 88.4, rising to an effective 89.0 once clear test errors and dataset issues are accounted for, and argue that a fuller audit would likely push results higher still. The stakes are practical: when benchmark differences shrink to tenths of a percent, even a small error rate in the benchmark itself can distort claims about model capability and mislead downstream comparisons.

The work starts with a technical constraint: MMLU forces answers into a single immediate character choice (A/B/C/D). That format encourages early commitment—often before deeper reasoning—and can amplify hallucinations or mistakes. SmartGPT was designed to counter this by (1) using carefully crafted exemplars so the model learns the expected output format, (2) adding self-consistency via sampling multiple candidate answers and taking the majority, and (3) using self-reflection to choose among competing explanations when time and compute allow. In earlier experiments, the team had already shown that these techniques can raise performance on reasoning-heavy tasks; here, they apply the approach at scale.
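The three techniques combine naturally into a single loop. Below is a minimal sketch, assuming a hypothetical `ask_model` stand-in for a real LLM API call; the exemplar text and the biased sampler are illustrative only, not the team's actual prompts:

```python
import random
from collections import Counter

def ask_model(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical stand-in for a sampled LLM call; returns one A-D letter.
    A real implementation would send `prompt` to an LLM API at this temperature."""
    return random.choice(["A", "B", "B", "B", "C"])  # toy distribution, biased toward "B"

# Exemplars teach the expected output format: reason first, then a single letter.
EXEMPLARS = (
    "Q: 2 + 2 = ?\nChoices: A) 3  B) 4  C) 5  D) 6\n"
    "Reasoning: 2 + 2 equals 4.\nFinal answer: B\n\n"
)

def smartgpt_answer(question: str, n_samples: int = 9) -> str:
    """Self-consistency: sample several answers and take the majority vote."""
    prompt = EXEMPLARS + question
    samples = [ask_model(prompt) for _ in range(n_samples)]
    return Counter(samples).most_common(1)[0][0]
```

Self-reflection (step 3) would sit after the vote, asking the model to adjudicate between the top competing answers when compute allows.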

The headline result is a new unofficial record: 88.4 on MMLU using GPT-4 with reduced “power” relative to the full SmartGPT method, largely because fully automated grading and large-scale reflection were too expensive and too labor-intensive to run at maximal strength. The team also reports that their GPT-3.5 run improved by 3.7 points (from 70 to 73.7) when evaluated across the full MMLU question set. Crucially, the researchers say they did not rely on GPT models to grade their own outputs, citing known failures of self-grading.

But the benchmark-breaking moment comes from human verification. By tracing “wrong” answers back to their original sources, the team claims to have uncovered widespread issues: missing context in question statements, incorrect answer keys, misordered multiple-choice options, and even cases where the “correct” answer is not among the listed choices. They highlight recurring problems in domains like virology, college chemistry, and other subject areas, plus formatting and ambiguity problems—such as grammatical issues, unclear dependencies between multi-part arguments, and questions with no single defensible answer. They estimate that after accounting for clear errors and ambiguities, the effective score becomes 89.0 rather than 88.4.

The researchers argue that this matters because modern model comparisons increasingly hinge on tiny deltas. When leaderboard gaps are around 0.1%, a benchmark error rate of 1–3% can swamp the signal—especially as models approach human-expert performance. The conclusion is not just “MMLU is imperfect,” but a call for more rigorous, professionally vetted benchmarking: unambiguous questions, consistent formatting, blind human grading, and broader difficulty ranges. They also demonstrate the practical value of SmartGPT methods in a medical-style multiple-choice scenario, showing that adding exemplars, self-consistency, and reflection can flip repeated wrong answers into correct ones—without claiming it’s safe for real diagnosis.

Cornell Notes

SmartGPT-style prompting can raise multiple-choice performance on MMLU by changing how models commit to answers. The key constraints of MMLU—single-character A/B/C/D answers that must be produced immediately—can cause early commitment before deeper reasoning, increasing hallucinations and errors. Using curated exemplars, self-consistency (sampling multiple outputs and taking the majority), and sometimes self-reflection, the team reports an unofficial GPT-4 MMLU score of 88.4 and a revised “effective” score of 89.0 after human auditing of problematic questions. The broader takeaway is that benchmark validity matters: the researchers claim they found missing context, incorrect answer keys, misordered options, and ambiguous items, which can distort small leaderboard differences that are now used to judge near-human capability.

Why does MMLU’s A/B/C/D format tend to produce avoidable errors in otherwise capable models?

MMLU requires the final answer to be a single character (A, B, C, or D) and to be produced immediately. That setup encourages the model to commit to an answer in the first tokens, before it has effectively performed deeper reasoning. The transcript links this to a “hallucination snowball” dynamic: once the model commits early, later tokens often justify or continue that commitment. It also notes that transformers prioritize fluency and coherence, which can come at the expense of factuality when rushed.

What two prompt-engineering techniques are credited with improving MMLU performance beyond standard evaluation?

First, curated exemplars: the team argues that models should not be forced to guess the expected output format, so they teach the model—via bespoke examples—to end with the final answer in the required single-character format. Second, self-consistency: instead of taking the highest-probability (greedy) answer from one decoding path, they sample multiple candidate answers and take the majority. The transcript claims this can substantially change results because the best answer may not be the most likely single first guess.
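The divergence between greedy decoding and majority voting can be shown with a toy example (the sample list below is invented for illustration, not the team's data): the most probable *first* answer need not be the most frequent answer across full reasoning chains.

```python
from collections import Counter

# Toy case: greedy decoding commits to "A" as the highest-probability first token,
# but sampling several full reasoning chains produces "B" most often.
greedy_first_token = "A"
sampled_answers = ["B", "A", "B", "C", "B", "B", "A", "B"]  # hypothetical samples

majority, count = Counter(sampled_answers).most_common(1)[0]
# Here the majority answer ("B", 5 of 8 samples) disagrees with the greedy pick.
```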

Why did the team avoid “self-grading” with GPT models when benchmarking?

They say asking GPT-4 to grade its own answers would be unscientific and inaccurate, pointing to prior evidence that self-evaluation can fail. Instead, they rely on human grading (and auto-grading only where exact-match character checking is appropriate), because the benchmark’s structure can hide errors that a simple match would miss.

What kinds of problems did human auditing find inside MMLU that could inflate or deflate scores?

The transcript lists multiple categories: missing statements or incomplete context, incorrect answer keys, misordered answer options, and formatting/grammar issues that can confuse models. It also describes cases where the “correct” answer appears not to be among the listed options, plus ambiguous items where multiple answers could be defensible depending on interpretation or source disagreement.

How do SmartGPT methods change outcomes in a medical-style multiple-choice example?

In a UK NHS-style diagnosis question, the model repeatedly selects the same incorrect option (e.g., SLE/lupus) when given no exemplars and no time for structured reasoning. Adding exemplars and applying self-consistency (multiple samples with majority vote) shifts the answer toward the correct diagnosis (e.g., sarcoidosis). Finally, self-reflection is used to choose between competing explanations, and the transcript claims the correct choice becomes consistent in that small demonstration.

Review Questions

  1. What specific MMLU constraint (timing and output format) is most responsible for early-commitment errors, and how do SmartGPT’s methods counter it?
  2. Describe how self-consistency differs from greedy decoding, and why that difference can change the final multiple-choice answer.
  3. List at least three categories of MMLU issues the researchers claim to have found during human auditing, and explain how each could affect benchmark scores.

Key Points

  1. MMLU’s single-character, immediate-answer format can push models to commit before deeper reasoning, increasing error risk.
  2. SmartGPT’s improvements rely on curated exemplars, self-consistency (majority vote across samples), and sometimes self-reflection to select among explanations.
  3. The reported unofficial GPT-4 MMLU score of 88.4 is presented alongside a revised “effective” score of 89.0 after human auditing of clear benchmark problems.
  4. Human verification found alleged issues including missing context, incorrect answer keys, misordered options, formatting/grammar problems, and ambiguous questions with no single clean answer.
  5. Benchmark deltas of tenths of a percent can be dominated by benchmark validity problems when models approach human-expert accuracy.
  6. The team argues for professionally vetted benchmarking: unambiguous questions, consistent formatting, blind human grading, and broader difficulty coverage.

Highlights

SmartGPT targets MMLU’s biggest structural weakness: forcing the final A/B/C/D choice too early encourages premature commitment and error cascades.
Human auditing reportedly uncovered missing statements, wrong answer keys, and misordered options—problems that can distort leaderboard comparisons.
Self-consistency is framed as a practical fix for greedy decoding: the best answer may not be the single most probable first guess.
A medical-style multiple-choice demo claims that exemplars + majority sampling + reflection can flip repeated wrong answers into correct ones.
The call is for a new benchmarking standard—rigorously vetted and blind-graded—because tiny score differences now drive major capability claims.
