SmartGPT: Major Benchmark Broken - 89.0% on MMLU + Exam's Many Errors
Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A widely used language-model benchmark—MMLU—has been found to contain enough flawed, ambiguous, or misformatted questions that reported “near-human” scores can be meaningfully overstated. After running large-scale evaluations of GPT models with a prompt-engineering framework (“SmartGPT”), the researchers report an adjusted MMLU score of 89.0, and argue that removing clear test errors and dataset issues would likely push results higher. The stakes are practical: when benchmark differences shrink to tenths of a percent, even a small error rate in the benchmark itself can distort claims about model capability and mislead downstream comparisons.
The work starts with a technical constraint: MMLU forces answers into a single immediate character choice (A/B/C/D). That format encourages early commitment—often before deeper reasoning—and can amplify hallucinations or mistakes. SmartGPT was designed to counter this by (1) using carefully crafted exemplars so the model learns the expected output format, (2) adding self-consistency via sampling multiple candidate answers and taking the majority, and (3) using self-reflection to choose among competing explanations when time and compute allow. In earlier experiments, the team had already shown that these techniques can raise performance on reasoning-heavy tasks; here, they apply the approach at scale.
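To make the self-consistency step concrete, here is a minimal Python sketch of majority voting over sampled answers. It is an illustration under stated assumptions, not the authors' code: `ask_model` is a hypothetical stand-in for any completions call run at a temperature above zero, and the curated exemplars would simply be prepended to `prompt`.

```python
import re
from collections import Counter

def extract_choice(completion: str) -> str | None:
    """Pull the last standalone A/B/C/D letter out of a free-form completion."""
    matches = re.findall(r"\b([ABCD])\b", completion)
    return matches[-1] if matches else None

def self_consistent_answer(ask_model, prompt: str, n_samples: int = 9) -> str | None:
    """Sample several reasoning chains and majority-vote the final letter.

    ask_model(prompt) -> str is a hypothetical completion call made with
    temperature > 0, so each sample can follow a different reasoning path.
    """
    letters = (extract_choice(ask_model(prompt)) for _ in range(n_samples))
    votes = Counter(l for l in letters if l is not None)
    return votes.most_common(1)[0][0] if votes else None
```

Because the model reasons first and the letter is extracted afterward, it never has to commit to A/B/C/D as its very first token, which is precisely the constraint described above.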
The headline result is a new unofficial record: 88.4 on MMLU using GPT-4 with reduced “power” relative to the full SmartGPT method, largely because fully automated grading and large-scale reflection were too expensive and too labor-intensive to run at maximal strength. The team also reports that their GPT-3.5 run improved by 3.7 points (from 70 to 73.7) when evaluated across the full MMLU question set. Crucially, the researchers say they did not rely on GPT models to grade their own outputs, citing known failures of self-grading.
But the benchmark-breaking moment comes from human verification. By tracing “wrong” answers back to their original sources, the team claims to have uncovered widespread issues: missing context in question statements, incorrect answer keys, misordered multiple-choice options, and even cases where the “correct” answer is not among the listed choices. They highlight recurring problems in domains like virology, college chemistry, and other subject areas, plus formatting and ambiguity problems—such as grammatical issues, unclear dependencies between multi-part arguments, and questions with no single defensible answer. They estimate that after accounting for clear errors and ambiguities, the effective score becomes 89.0 rather than 88.4.
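The audit categories above also suggest which defects a benchmark could catch mechanically. The sketch below is hypothetical (the `Item` schema and `lint_item` checks are not from the source): it flags structural problems such as an answer key missing from the choices, while ambiguity and wrong-but-plausible keys would still need blind human review.

```python
from dataclasses import dataclass

@dataclass
class Item:
    question: str
    choices: dict[str, str]   # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    answer_key: str           # e.g. "C"

def lint_item(item: Item) -> list[str]:
    """Cheap structural checks on one multiple-choice item (hypothetical)."""
    problems = []
    if set(item.choices) != {"A", "B", "C", "D"}:
        problems.append("options missing or mislabeled")
    if item.answer_key not in item.choices:
        problems.append("answer key not among the listed choices")
    if len(set(item.choices.values())) != len(item.choices):
        problems.append("duplicate answer options")
    if not item.question.strip():
        problems.append("empty or whitespace-only question (missing context?)")
    return problems
```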
The researchers argue that this matters because modern model comparisons increasingly hinge on tiny deltas. When leaderboard gaps are around 0.1%, a benchmark error rate of 1–3% can swamp the signal—especially as models approach human-expert performance. The conclusion is not just “MMLU is imperfect,” but a call for more rigorous, professionally vetted benchmarking: unambiguous questions, consistent formatting, blind human grading, and broader difficulty ranges. They also demonstrate the practical value of SmartGPT methods in a medical-style multiple-choice scenario, showing that adding exemplars, self-consistency, and reflection can flip repeated wrong answers into correct ones—without claiming it’s safe for real diagnosis.
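As a back-of-envelope illustration of that swamping effect (the 0.1% gap and 1–3% error-rate figures come from the summary above; the bounding logic is a deliberate simplification that treats every broken item as potentially mis-scored in either direction):

```python
def score_bounds(measured: float, item_error_rate: float) -> tuple[float, float]:
    """Crude bounds on true accuracy when a fraction of items is invalid.

    Assumes the model could have been mis-scored on every broken item --
    a worst/best-case simplification, not the researchers' actual method.
    """
    return max(0.0, measured - item_error_rate), min(1.0, measured + item_error_rate)

# Two models separated by 0.1 points on a benchmark where 2% of items are broken:
print(score_bounds(0.884, 0.02))  # ~ (0.864, 0.904)
print(score_bounds(0.883, 0.02))  # ~ (0.863, 0.903)
# The uncertainty bands overlap almost entirely, so the 0.1-point leaderboard
# gap carries far less information than the benchmark's own error rate.
```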
Cornell Notes
SmartGPT-style prompting can raise multiple-choice performance on MMLU by changing how models commit to answers. The key constraints of MMLU—single-character A/B/C/D answers that must be produced immediately—can cause early commitment before deeper reasoning, increasing hallucinations and errors. Using curated exemplars, self-consistency (sampling multiple outputs and taking the majority), and sometimes self-reflection, the team reports an unofficial GPT-4 MMLU score of 88.4 and a revised “effective” score of 89.0 after human auditing of problematic questions. The broader takeaway is that benchmark validity matters: the researchers claim they found missing context, incorrect answer keys, misordered options, and ambiguous items, which can distort small leaderboard differences that are now used to judge near-human capability.
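The self-reflection step summarized above can be pictured as a second "resolver" pass over disagreeing answers. Everything here is a hedged paraphrase rather than the authors' actual prompt; `ask_model` and `extract_choice` are the same assumed helpers as in the earlier self-consistency sketch.

```python
def reflect_and_resolve(ask_model, question: str, candidates: dict[str, str]) -> str | None:
    """Ask the model to critique competing answers, then commit to one letter.

    candidates maps answer letters to the reasoning chains that produced them,
    e.g. the top disagreeing chains left over after a self-consistency vote.
    """
    listing = "\n\n".join(f"Answer {k}: {v}" for k, v in candidates.items())
    prompt = (
        f"Question:\n{question}\n\n"
        f"Researchers proposed these competing answers:\n{listing}\n\n"
        "Point out the flaws, if any, in each line of reasoning, then state "
        "the single most defensible answer as one letter (A-D)."
    )
    return extract_choice(ask_model(prompt))
```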
- Why does MMLU’s A/B/C/D format tend to produce avoidable errors in otherwise capable models?
- Which prompt-engineering techniques are credited with improving MMLU performance beyond standard evaluation?
- Why did the team avoid “self-grading” with GPT models when benchmarking?
- What kinds of problems did human auditing find inside MMLU that could inflate or deflate scores?
- How do SmartGPT methods change outcomes in a medical-style multiple-choice example?
Review Questions
- What specific MMLU constraint (timing and output format) is most responsible for early-commitment errors, and how do SmartGPT’s methods counter it?
- Describe how self-consistency differs from greedy decoding, and why that difference can change the final multiple-choice answer (see the toy sketch after this list).
- List at least three categories of MMLU issues the researchers claim to have found during human auditing, and explain how each could affect benchmark scores.
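For the second question above, the contrast can be shown with a toy distribution over complete reasoning chains (all numbers invented for illustration): greedy decoding follows the single most probable chain, while self-consistency aggregates answer mass across many sampled chains.

```python
import random
from collections import Counter

# Invented distribution over complete reasoning chains for one hard question:
# the single likeliest chain ends in "B", but chains ending in "C" together
# hold more total probability mass.
CHAINS = [(0.30, "B"), (0.25, "C"), (0.20, "C"), (0.15, "A"), (0.10, "D")]

def greedy_answer() -> str:
    """Greedy decoding: always follow the single most probable chain."""
    return max(CHAINS)[1]                      # -> "B"

def majority_vote_answer(n: int = 100, seed: int = 0) -> str:
    """Self-consistency: sample whole chains, then majority-vote the answers."""
    probs, answers = zip(*CHAINS)
    rng = random.Random(seed)
    votes = Counter(rng.choices(answers, weights=probs, k=n))
    return votes.most_common(1)[0][0]          # -> "C" with high probability
```

With these weights, greedy decoding confidently returns "B" even though most reasoning paths converge on "C", which is exactly the failure mode the question targets.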
Key Points
1. MMLU’s single-character, immediate-answer format can push models to commit before deeper reasoning, increasing error risk.
2. SmartGPT’s improvements rely on curated exemplars, self-consistency (majority vote across samples), and sometimes self-reflection to select among explanations.
3. The reported unofficial GPT-4 MMLU score of 88.4 is presented alongside a revised “effective” score of 89.0 after human auditing of clear benchmark problems.
4. Human verification found alleged issues including missing context, incorrect answer keys, misordered options, formatting/grammar problems, and ambiguous questions with no single clean answer.
5. Benchmark deltas of tenths of a percent can be dominated by benchmark validity problems when models approach human-expert accuracy.
6. The team argues for professionally vetted benchmarking: unambiguous questions, consistent formatting, blind human grading, and broader difficulty coverage.