GPT 4 is Smarter than You Think: Introducing SmartGPT

AI Explained · 6 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

SmartGPT improves GPT-4 outputs by combining an optimized step-by-step prompt with multi-sample generation, reflection, and a researcher/resolver selection dialogue.

Briefing

SmartGPT’s core claim is that GPT-4’s benchmark performance can be materially improved—not by changing the model, but by wrapping it in a multi-step prompting and self-checking workflow. The approach combines (1) a stronger “let’s work this out step-by-step” prompt, (2) generating multiple candidate outputs under controlled randomness, (3) prompting the model to reflect on errors, and (4) running a short “researcher → resolver” dialogue to select a final answer. The practical payoff is that many of GPT-4’s mistakes—especially those that are detectable after the fact—can be corrected before the user ever sees the result.

The video opens with a concrete example, a drying-clothes puzzle borrowed from a TED Talk: if it takes five hours to dry five items of clothing, how long does it take to dry 30? GPT-4’s standard response scales linearly and answers 30 hours, while SmartGPT consistently gives the intended answer of five hours and states the assumption behind it (the clothes dry in parallel, given enough space). That sets up the broader argument: published benchmark scores may understate GPT-4’s real capability because they often measure a single-shot response rather than a system that can verify and revise its own work.

SmartGPT is built around three prompting improvements that have recently shown measurable gains. First is step-by-step prompting, where the model uses the input token space for intermediate computation rather than relying solely on internal activations. Second is reflection: by asking for error-spotting, GPT-4 sometimes identifies flaws in its own draft answers. Third is dialogue-based resolution: multiple outputs can be compared, and a final “resolver” can choose the best option after the “researcher” stage surfaces reasoning and potential issues.
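
Taken together, these pieces form a small pipeline. The video presents it at the level of prompts rather than code, but a minimal sketch in Python helps make the moving parts concrete. Treat everything below as a reconstruction under assumptions: the prompt wordings, the ask(prompt, temperature) helper, the sample count, and the near-zero resolver temperature are illustrative choices, not the creator's exact implementation.

```python
from typing import Callable, List

# Prompt templates paraphrased from the video's description of SmartGPT;
# the exact wording used in the original system may differ.
STEP_BY_STEP = (
    "Question: {question}\n"
    "Answer: Let's work this out in a step by step way "
    "to be sure we have the right answer."
)
RESEARCHER = (
    "You are a researcher. Here are {n} candidate answers to the question "
    "\"{question}\":\n\n{answers}\n\n"
    "Investigate each answer and list any flaws or faulty logic you find."
)
RESOLVER = (
    "You are a resolver. Given the question, the candidate answers, and the "
    "researcher's critique, decide which answer is best, improve it if "
    "needed, and print the final answer.\n\n{context}"
)


def smart_gpt(question: str,
              ask: Callable[[str, float], str],
              n_samples: int = 3,
              temperature: float = 0.5) -> str:
    """Step-by-step drafts -> reflection/research -> resolver selection."""
    # 1) Sample several step-by-step drafts under moderate randomness,
    #    so the pool is likely to contain at least one correct chain.
    drafts: List[str] = [
        ask(STEP_BY_STEP.format(question=question), temperature)
        for _ in range(n_samples)
    ]
    numbered = "\n\n".join(f"Answer {i + 1}: {d}" for i, d in enumerate(drafts))

    # 2) Reflection / "researcher": ask the model to hunt for errors
    #    in its own drafts.
    critique = ask(
        RESEARCHER.format(n=n_samples, question=question, answers=numbered),
        temperature,
    )

    # 3) "Resolver": select or repair the best candidate in light of the
    #    critique. Using a near-zero temperature here is an assumption,
    #    not something specified in the video.
    context = f"Question: {question}\n\n{numbered}\n\nCritique: {critique}"
    return ask(RESOLVER.format(context=context), 0.0)
```

In practice, ask would wrap whichever GPT-4 chat-completion client is available; the creator actually ran these steps by hand across many trials rather than through code.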

To test the idea, the creator manually ran many trials on selected slices of the MMLU benchmark, focusing on topics expected to be difficult for GPT-4—such as formal logic. In a small formal-logic subset, direct zero-shot prompting produced about 68% accuracy. Adding step-by-step prompting raised it to roughly 74–75%. The full SmartGPT pipeline, using reflection and the researcher/resolver selection step, reached about 84%. Across tests, the pattern was consistent: the resolver corrected roughly half of the errors GPT-4 made under the baseline setup. The video also reports that SmartGPT’s gains were weaker on tasks involving counting, division, multiplication, or character-level details—cases where the model’s wrong answer may be too subtle for self-critique to catch.
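
As a quick sanity check on the "roughly half" framing for the formal-logic subset: a 68% baseline leaves 32 points of error, and correcting half of those (16 points) lands at 84%, which matches the full-pipeline figure.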

The video then connects these improvements to broader stakes. An AI governance researcher’s heuristic ties an MMLU score near 95 to “AGI-like” abilities. Since GPT-4’s reported MMLU score is 86.4, correcting about half of its errors could plausibly push performance into the low 90s, narrowing the gap toward that 95 threshold. The creator also cites recent papers that optimize prompts via automatic prompt engineering and show large gains from step-by-step reasoning, while noting that researchers still lack a full theoretical explanation for why these prompting techniques work.

Finally, the video outlines future upgrades: more generic few-shot examples, longer and richer multi-agent dialogue, tuning sampling temperature across stages, and integrating tools like calculators or code interpreters to eliminate arithmetic and counting failures. The overall message is less about “smarter GPT-4” and more about “smarter usage”—a system design that extracts more reliable reasoning from the same underlying model, while raising questions about how thoroughly model vendors test capabilities before release.

Cornell Notes

SmartGPT is a prompting system designed to extract more reliable performance from GPT-4 than single-shot benchmark runs typically capture. It combines an optimized step-by-step prompt, multiple sampled outputs, a reflection step that sometimes catches the model’s own mistakes, and a short “researcher → resolver” dialogue to pick the best final answer. In the creator’s MMLU subset tests (formal logic), accuracy rose from about 68% (zero-shot) to roughly 74–75% with step-by-step prompting, and to about 84% with the full pipeline—suggesting the resolver can fix around half of GPT-4’s baseline errors. The gains are strongest when errors are reasoning-level and detectable after the fact, and weaker on arithmetic/counting details where self-checking often misses mistakes. The approach also points to future gains via tool integration and better multi-agent dialogue.

Why does step-by-step prompting improve GPT-4’s accuracy in this system?

The video ties the improvement to recent reasoning about “using the input space for computation.” Instead of relying only on hidden internal activations, the model performs intermediate work as explicit tokens in the context window (e.g., “let’s work this out in a step-by-step way”). The creator also notes that an improved step-by-step prompt, found via automatic prompt engineering in a recent paper, beats the simpler “let’s think step by step” baseline; the referenced comparison reports accuracy rising roughly 81% → 86% → 89% as the prompt is strengthened.
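
For reference, the two suffixes being compared look roughly like this; the improved wording is the one reported by the automatic-prompt-engineering work the video cites, and should be treated as approximate rather than an exact quote:

```python
# Baseline zero-shot chain-of-thought suffix (Kojima et al.-style).
BASELINE_SUFFIX = "Let's think step by step."

# Improved suffix found via automatic prompt engineering, which the video
# adopts for SmartGPT (wording approximate).
IMPROVED_SUFFIX = (
    "Let's work this out in a step by step way to be sure "
    "we have the right answer."
)
```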

How do reflection and multi-output sampling help catch GPT-4 mistakes?

GPT-4’s outputs vary from run to run because sampling uses a nonzero temperature (the creator suggests a default of around 0.5). That means repeated runs can produce both correct and incorrect answers. SmartGPT generates multiple candidate outputs, then prompts the model to detect errors in those outputs. When reflection succeeds, the system can correct reasoning-level mistakes; when it fails, the resolver may still choose the best option among the candidates. The video emphasizes that this does not work for every question: some errors are too subtle for the model to spot reliably.

What is the “researcher → resolver” structure, and why does it matter?

After producing candidate reasoning, SmartGPT uses a two-stage dialogue: a “researcher” step that produces intermediate work and a “resolver” step that selects the final answer. The creator frames this as a modular way to apply different abilities separately—first generating and then judging—rather than trying to do everything in one prompt. In the formal-logic MMLU subset, this resolver step is what lifts accuracy from the mid-70s to about 84%, consistent with correcting roughly half of baseline errors.

What did the creator find on MMLU, and how large were the gains?

On a formal-logic slice of MMLU (first 25 questions), zero-shot prompting scored about 68%. Adding the optimized step-by-step prompt raised accuracy to around 74–75%. The full SmartGPT pipeline (with reflection and researcher/resolver selection) reached about 84%. The video also reports similar patterns on other MMLU topics (e.g., college math: ~40% zero-shot, ~53.5% with step-by-step, ~60% after resolution), though the resolver sometimes could not fully correct all errors.

Where does SmartGPT struggle most?

The creator repeatedly points to arithmetic and counting-like failures—division, multiplication, and character counting. In these cases, GPT-4 may get the high-level logic right but still make simple errors that neither the researcher nor resolver reliably detects. The proposed fix is tool integration (calculator, code interpreter, character-counting utilities) so the system can verify exact computations rather than relying on internal reasoning alone.
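
The video does not spell out how tool use would be wired in. One plausible shape, sketched here purely as an assumption, is to re-check any explicit arithmetic a draft contains with exact computation before the resolver accepts it:

```python
import re

def verify_arithmetic(answer: str) -> list[str]:
    """Flag simple 'a op b = c' claims in an answer that don't hold exactly.

    A toy stand-in for the calculator / code-interpreter integration the
    video proposes; a real system would parse far more than flat expressions.
    """
    pattern = re.compile(r"(\d+)\s*([+\-*/])\s*(\d+)\s*=\s*(\d+)")
    problems = []
    for a, op, b, claimed in pattern.findall(answer):
        a, b, claimed = int(a), int(b), int(claimed)
        actual = {"+": a + b, "-": a - b, "*": a * b,
                  "/": a / b if b else float("nan")}[op]
        if actual != claimed:
            problems.append(f"{a} {op} {b} = {claimed} (expected {actual})")
    return problems

# Example: a draft that got the high-level logic right but the product wrong.
print(verify_arithmetic("Each of 12 boxes holds 13 items, so 12 * 13 = 166."))
# -> ['12 * 13 = 166 (expected 156)']
```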

How do the results relate to AGI-like capability claims tied to MMLU?

The video references an AI governance researcher’s heuristic that an MMLU score near 95 corresponds to “AGI-like” abilities. Since GPT-4’s published MMLU score is 86.4, the creator argues that correcting about half of its errors could raise performance into the low 90s (roughly 93), narrowing the gap toward 95. The claim is framed as plausible rather than guaranteed, and the creator notes that even if 95 is not reached, the human expert level on MMLU is about 89.8, which SmartGPT-like systems might approach.
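
The arithmetic behind the “roughly 93” figure, assuming errors are corrected at the same rate across topics: 100 − 86.4 = 13.6 points of error; halving that adds 6.8 points, so 86.4 + 6.8 ≈ 93.2.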

Review Questions

  1. In SmartGPT, which component is responsible for the biggest jump beyond step-by-step prompting, and what mechanism does it use to select the final answer?
  2. Why might reflection fail on some MMLU questions even when the model’s reasoning seems plausible?
  3. What kinds of tasks does the creator say benefit most from tool integration, and how would calculators or code interpreters change the error profile?

Key Points

  1. SmartGPT improves GPT-4 outputs by combining an optimized step-by-step prompt with multi-sample generation, reflection, and a researcher/resolver selection dialogue.

  2. In formal-logic MMLU subset tests, accuracy rose from about 68% (zero-shot) to roughly 74–75% with step-by-step prompting, and to about 84% with the full pipeline.

  3. The resolver step appears to correct a large fraction of GPT-4’s baseline errors—often described as roughly half—when mistakes are reasoning-level and detectable after the fact.

  4. SmartGPT’s gains are weaker on arithmetic and counting details (division, multiplication, character counting), where self-checking often misses errors.

  5. The video argues that benchmark results may understate GPT-4’s capability because many evaluations measure single-shot answers rather than verification-and-revision workflows.

  6. Future improvements proposed include generic few-shot prompting, longer multi-agent dialogue, temperature scheduling across stages, and integrating external tools like calculators or code interpreters.

  7. The approach raises questions about how thoroughly model vendors test capabilities before release and whether real-world performance can exceed published benchmark ceilings.

Highlights

SmartGPT corrected a drying-clothes word problem where standard GPT-4 reportedly answered 30 hours by scaling linearly; SmartGPT consistently produced the intended answer (the same five hours, since the clothes dry in parallel) and stated its assumptions.
On a formal-logic slice of MMLU, the system moved from ~68% (zero-shot) to ~74–75% (step-by-step) and ~84% after reflection plus researcher/resolver resolution.
Across tests, the resolver step often fixed about half of GPT-4’s baseline errors, but arithmetic/counting mistakes remained a persistent weak spot.
The video links improved MMLU performance to governance-style thresholds, suggesting that error correction could plausibly move scores toward the low-90s even if 95 remains uncertain.

Topics

Mentioned

  • GPT-4
  • MMLU
  • TED