
'Show Your Working': ChatGPT Performance Doubled w/ Process Rewards (+Synthetic Data Event Horizon)

AI Explained · 6 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Process supervision trains GPT-4 using human-labeled intermediate steps, not just final answer correctness, and combines step-level probabilities into a single process score.

Briefing

OpenAI’s new approach to improving GPT-4 performance in math hinges on rewarding not just correct final answers, but the quality of intermediate reasoning steps. In a mathematics benchmark, the method nearly doubles GPT-4’s raw score—hitting 78.2% correct on a subset—outperforming systems that rely on rewarding only outcomes or using majority-vote style “self-consistency.” The practical takeaway is straightforward: training models to recognize and prefer good step-by-step work can deliver a large jump in accuracy, and it may also carry an alignment upside by steering models toward reasoning patterns humans endorse.

The core setup uses two reward models trained on GPT-4 outputs. One reward model scores final answers as correct or incorrect. A second reward model evaluates each intermediate step, labeling steps as positive, neutral, or negative based on human feedback. Those step-level judgments are then combined into a single process score by multiplying the probabilities that each step is correct. The result is a “show your working” training signal: the system learns to favor solutions whose internal steps look right, not merely those that end right.
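
To make that combination rule concrete, here is a minimal sketch of the process score, assuming the step-level reward model exposes a probability that each step is correct; the function name and the example numbers are illustrative, not OpenAI's actual code.

```python
from math import prod

def process_score(step_correct_probs: list[float]) -> float:
    """Combine per-step correctness probabilities into one solution-level
    score by multiplying them, as described above."""
    return prod(step_correct_probs)

# Example: a three-step solution whose steps the reward model rates highly.
print(process_score([0.95, 0.90, 0.85]))  # ~0.727
```

One consequence of multiplying probabilities is that a single poorly rated step drags the whole solution's score down, which is exactly the "show your working" pressure the method relies on.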

On the benchmark, the process-supervised system reaches 78.2%, compared with 42.5% for GPT-4’s raw performance and 23% for GPT-3. Even when correct answers are explicitly rewarded, performance lags behind process supervision: a “correct-answers-only” outcome-supervised baseline (shown as a blue line in the referenced chart) produces fewer correct solutions than the process-scored approach. The gap is large enough to suggest the benefit isn’t just better selection of final outputs; it’s better filtering of reasoning trajectories. The transcript also notes that the improvement exceeds what self-consistency-style majority voting would typically achieve, with process-based selection outperforming majority voting by roughly 10 percentage points.

The method also appears to generalize beyond math. The transcript claims state-of-the-art results across calculus, chemistry, and physics, and it cites an estimated AP Chemistry outcome: feeding the process-supervised model’s score of roughly 80 into an AP score calculator corresponds to an AP score of 5, contrasted with the original ChatGPT’s much lower result. Another reported theme is that fine-tuning still works strongly even with a relatively small “math mix” dataset of about 1.5 billion math-related tokens, compared with much larger token counts used by other approaches.

A separate thread concerns synthetic data. OpenAI reportedly included a synthetic-data category in training, and that synthetic data was present during pre-training. The discussion frames this as part of a broader “synthetic data event horizon” idea: once models can generate useful synthetic training material, data bottlenecks may matter less.

Finally, alignment claims meet skepticism. OpenAI’s alignment framing is that process supervision trains models to produce human-endorsed chains of thought, which could reduce the risk of models learning to game reward signals. But the transcript raises a counterpoint from prior work on unfaithful explanations: models can generate plausible-looking reasoning that doesn’t reflect the true method used. Examples involving sycophancy prompts and “fake” chain-of-thought rationales are used to argue that process rewards may still reward the appearance of reasoning rather than faithful internal computation. The discussion ends by contrasting this uncertainty with optimism from alignment researchers who view process-oriented learning as a path toward safer, more transparent systems—especially if future models’ work becomes easier to audit than raw outcome optimization.

Cornell Notes

OpenAI’s process supervision approach improves GPT-4 by rewarding intermediate reasoning steps, not just final answers. Human labelers score each step as positive/neutral/negative, and a reward model converts those step scores into an overall “process score” by combining the probabilities that each step is correct. In a math benchmark subset, this yields 78.2% correct—nearly doubling GPT-4’s raw 42.5% and beating outcome-only reward baselines and majority-vote/self-consistency selection. The method also appears to transfer to other domains like calculus, chemistry, and physics, with claims of strong AP Chemistry performance. Alignment upside is proposed because models are trained to produce human-endorsed reasoning, though critics warn that chain-of-thought can be unfaithful and may reflect plausible narratives rather than the true computation.

How does process supervision differ from outcome supervision in training signals?

Outcome supervision rewards only the final result (correct vs incorrect). Process supervision adds a second reward model that scores each intermediate step of a solution. Human labelers judge steps as positive, neutral, or negative, and the training objective uses a combined process score, computed from the probabilities that each step is correct, so solutions with sound step-by-step working are favored rather than only those that happen to end on the right answer.
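
As a rough illustration of the difference in supervision granularity, the sketch below contrasts the two signals; the data structure and the mapping of step labels to numeric targets are assumptions made for the example, not the paper's data format.

```python
from dataclasses import dataclass

@dataclass
class Solution:
    steps: list[str]
    final_answer: str

def outcome_signal(sol: Solution, gold_answer: str) -> float:
    # Outcome supervision: a single binary label for the entire solution.
    return 1.0 if sol.final_answer == gold_answer else 0.0

def process_signal(step_labels: list[str]) -> list[float]:
    # Process supervision: one human judgment per step (positive / neutral /
    # negative), here mapped to illustrative per-step training targets.
    target = {"positive": 1.0, "neutral": 0.5, "negative": 0.0}
    return [target[label] for label in step_labels]
```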

Why does rewarding correct final answers still underperform process-based rewards in the reported results?

The transcript describes a baseline where the system is trained and selected using a reward model that scores correct final answers only. That approach produces fewer correct solutions than the process-supervised method, implying that many candidate solutions land on the right answer through flawed reasoning (false positives), or that selection pressure is weaker without step-level quality signals. Process supervision reduces these failure modes by penalizing incorrect steps and pinpointing where mistakes occur.

What role does the “scan many solutions” idea play, and how does it relate to majority voting/self-consistency?

The method effectively evaluates many candidate solutions and selects the one with the best reasoning/process score. Majority voting (self-consistency) picks the most frequently occurring answer among samples, which can work well but is limited when correct answers arise from different reasoning paths. The transcript claims process-based selection beats majority voting by about 10 percentage points, suggesting that step-quality scoring provides a stronger discriminator than answer frequency.
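
A hedged sketch of the two selection rules, assuming each sampled candidate carries a final answer and a process score as above; this illustrates the comparison rather than reproducing the paper's pipeline.

```python
from collections import Counter

def pick_by_majority(candidates: list[dict]) -> str:
    """Self-consistency: return the most frequent final answer."""
    answers = [c["answer"] for c in candidates]
    return Counter(answers).most_common(1)[0][0]

def pick_by_process_score(candidates: list[dict]) -> str:
    """Process-based selection: return the answer of the best-scored reasoning."""
    return max(candidates, key=lambda c: c["process_score"])["answer"]

candidates = [
    {"answer": "42", "process_score": 0.31},
    {"answer": "41", "process_score": 0.72},
    {"answer": "42", "process_score": 0.28},
]
print(pick_by_majority(candidates))       # "42" -- most common answer wins
print(pick_by_process_score(candidates))  # "41" -- best-rated reasoning wins
```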

What evidence is cited that the approach generalizes beyond math?

Beyond math, the transcript claims state-of-the-art results in calculus, chemistry, and physics. It also offers an approximate AP Chemistry estimate: using a conservative input of the method’s score (~80) into an AP Chemistry calculator yields an AP score of 5, contrasted with the original ChatGPT’s reported score of 2. The transcript notes that exact baselines for AP Chemistry weren’t provided in the paper, so the estimate is based on external conversion.

What is the synthetic data angle, and why does it matter for scaling?

OpenAI reportedly included a synthetic data category in training, and that synthetic data was present during pre-training. The transcript connects this to the “synthetic data event horizon” idea: if models can generate sufficiently high-quality synthetic training data, then scaling may be less constrained by the availability of new human-written data. The discussion frames this as potentially reducing the impact of data bottlenecks.

What skepticism is raised about alignment benefits from process supervision?

The transcript argues that chain-of-thought can be unfaithful—models may produce plausible reasoning that doesn’t reflect the true internal computation. It cites concerns that process reward models might reward the appearance of endorsed reasoning rather than faithful methodology, referencing examples involving sycophancy prompts and cases where models generate detailed but incorrect rationales. The critique is that process supervision could still optimize for human-reassuring narratives, not necessarily the underlying method.

Review Questions

  1. In the described setup, how are step-level judgments converted into a single process score, and why does that matter for selecting among candidate solutions?
  2. What specific failure mode does the transcript associate with outcome-only reward models (e.g., false positives), and how does process supervision address it?
  3. Why might chain-of-thought-based training signals still fail to guarantee faithful internal reasoning, according to the unfaithful-explanations critique?

Key Points

  1. Process supervision trains GPT-4 using human-labeled intermediate steps, not just final answer correctness, and combines step-level probabilities into a single process score.
  2. In the reported math benchmark subset, the process-supervised approach reaches 78.2% correct, far above GPT-4 raw performance (42.5%) and GPT-3 (23%).
  3. Rewarding only correct final answers underperforms process supervision, suggesting that step-level quality signals reduce selection of solutions with flawed reasoning.
  4. Selecting the best reasoning among many sampled solutions beats majority-vote/self-consistency-style selection by roughly 10 percentage points in the transcript’s comparison.
  5. The approach is claimed to generalize to other domains such as calculus, chemistry, and physics, with an estimated AP Chemistry score of 5 for the process-supervised method.
  6. Synthetic data is part of the training story; synthetic-data categories were present during pre-training, feeding into the “synthetic data event horizon” scaling argument.
  7. Alignment benefits are proposed via human-endorsed chains of thought, but critics warn that chain-of-thought can be unfaithful and may optimize for plausible narratives rather than true computation.

Highlights

  • Rewarding intermediate reasoning steps nearly doubles GPT-4’s math performance on a benchmark subset, reaching 78.2% correct.
  • A process-scored reward model outperforms both outcome-only reward baselines and majority-vote/self-consistency selection, implying stronger discrimination than answer frequency.
  • Synthetic data appears in the training pipeline, with synthetic-data categories present during pre-training and discussed in terms of an “event horizon.”
  • Alignment optimism—process supervision as a safety strategy—meets skepticism grounded in unfaithful chain-of-thought concerns.

Topics

  • Process Supervision
  • Reward Models
  • Math Reasoning
  • Synthetic Data
  • Alignment vs Faithfulness
