
GPT-5: Everything You Need to Know So Far

AI Explained · 5 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.

TL;DR

Greg Brockman’s remarks about scaling compute and Jason Wei’s “massive GPU training” reaction point to GPT-5’s full-scale training run being underway.

Briefing

OpenAI’s full-scale GPT-5 training run appears to be underway, with safety red-teaming already positioned for the next phase of testing. The strongest signals come from Greg Brockman’s remarks about scaling up to “maximally harness” computing resources for the biggest model yet, alongside Jason Wei’s reaction to the “massive GPU training” milestone. The timing matters because it suggests GPT-5 is moving from smaller, earlier training checkpoints into the longer, higher-stakes work of safety evaluation and capability validation—an arc that typically takes months rather than days.

Additional evidence points to safety readiness rather than a vague “soon” announcement. OpenAI closed applications for its red-teaming network, and those red teamers were told they’d learn their application status by the end of last year. In practice, that implies the red-team workforce is already in place to begin testing as GPT-5 progresses through checkpoints. Those checkpoints are important: even before a final model is fully trained, teams can evaluate intermediate versions, meaning OpenAI could effectively have “GPT-4.2”-style capability snapshots before the complete GPT-5 release.

The capability direction described by OpenAI insiders and aligned research is clear: GPT-5 is expected to “think for longer” by laying out reasoning steps and then verifying them. Sam Altman and other executives frame this as a shift toward more interactive, stepwise explanations that users can judge for reasonableness. The transcript also ties this to OpenAI’s “let’s verify step-by-step” work, where sampling a base model thousands of times and selecting outputs with higher-rated reasoning steps produced large gains in math and strong results across STEM. The key mechanism is parallelization: generate many candidate reasoning traces, then use a verifier-like process to pick the best.

On reliability, the same logic scales. If GPT-4 can be improved by repeatedly sampling and selecting stronger answers, GPT-5’s larger training and improved reasoning verification could make it far more dependable—especially for tasks where a single response can be “almost right” but not consistently correct. The transcript further connects this to prior approaches in coding and math, including DeepMind’s AlphaCode 2, which used massive sampling to reach high contest performance.

Hardware and model size expectations also enter the picture. In an interview, Etched AI CEO Gavin Uberti suggested GPT-5 could have roughly 10× the parameter count of GPT-4, potentially driven by larger embedding dimensions, more layers, and more experts in a mixture-of-experts style design. While exact numbers remain speculative, the underlying theme is that GPT-5’s performance gains likely come from both scale and better internal checking.

Finally, the transcript argues for a late-2024 release path rather than an immediate launch. The prediction lands toward the end of November 2024, based on multi-month training time plus extended safety testing, and also on avoiding the most contentious political period in the U.S. The release, it suggests, may arrive in staged checkpoints—capabilities rolling out over time rather than all at once—while OpenAI continues to push multimodality (speech, images, video) and, most importantly, reasoning reliability.

Cornell Notes

Signals from OpenAI leadership and researchers indicate GPT-5’s full-scale training run is underway, with red-teaming already positioned for safety testing as checkpoints emerge. The expected leap centers on longer, stepwise reasoning that can be verified—supported by OpenAI’s “let’s verify step-by-step” results, where sampling thousands of reasoning traces and selecting the best boosted math and STEM performance. Reliability improvements are framed as a scaling of the same idea: generate many attempts, then choose outputs with stronger reasoning. Hardware and architecture speculation points to much larger scale (possibly ~10× GPT-4 parameters) via embedding dimension, layers, and expert count. A late-2024, staged rollout is predicted, factoring in training duration, safety cycles, and political timing risks.

What evidence suggests GPT-5 training has moved beyond early experiments?

Greg Brockman’s comments emphasize “maximally harnessing” OpenAI’s computing resources and scaling beyond precedent, which aligns with a full training run rather than a small-model pilot. Jason Wei’s reaction to “launching a massive GPU training” reinforces that a large compute job is underway. Separately, OpenAI closing red-teaming network applications and telling applicants their status by the end of last year implies red teamers are ready to start safety testing as GPT-5 progresses through checkpoints.

Why do checkpoints matter for when GPT-5 capabilities appear?

Even before final training completes, models pass through intermediate checkpoints—like save points in a game. Those checkpoints can be deployed for evaluation, meaning OpenAI could have earlier, partial versions of GPT-5 capabilities available well before the full system is finished. The transcript links this to the idea of an interim “GPT-4.2”-type stage before the complete GPT-5 release.

How does “thinking for longer” connect to verifier-based gains?

The transcript describes GPT-5 laying out reasoning steps before answering, then checking those steps internally or externally. It ties this to OpenAI’s “let’s verify step-by-step” approach: sample the base model many times, score reasoning traces, and select the outputs with higher-rated steps. In the cited results, using thousands of samples and picking the best reasoning steps roughly doubled math performance, with strong effects across STEM.
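The sample-score-select loop described above can be sketched in a few lines. This is a toy illustration, not OpenAI's actual verifier: `toy_generate` and `toy_score` are hypothetical stand-ins for a sampled model call and a learned process-reward scorer.

```python
import random

def best_of_n(prompt, generate, score, n=16):
    """Sample n candidate reasoning traces, score each with a
    verifier-like function, and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: "generation" returns a trace ending in a noisy numeric
# answer, and the "verifier" rewards answers that pass an independent check.
random.seed(0)

def toy_generate(prompt):
    noise = random.choice([-3, -1, 0, 0, 2])  # most samples are off
    return {"trace": "step 1 ... step k", "answer": 42 + noise}

def toy_score(candidate):
    # a real process-reward verifier scores the reasoning steps themselves;
    # here we simply reward agreement with a re-derivation of 6 * 7
    return -abs(candidate["answer"] - 6 * 7)

best = best_of_n("what is 6 * 7?", toy_generate, toy_score, n=32)
print(best["answer"])  # 42
```

With enough samples, at least one candidate lands on the right answer, and the verifier's job reduces to recognizing it, which is the parallelization bet the transcript describes.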

What reliability improvement is implied by sampling and selection?

If a model is asked the same kind of question repeatedly, the chance of getting a strong answer rises when the system can evaluate and choose among many candidates. The transcript frames this as moving from “one good answer out of many” toward consistently selecting the best reasoning trace—an approach described as already present in OpenAI’s verification work and potentially amplified for GPT-5.

What architectural scale changes are suggested for GPT-5?

Gavin Uberti (Etched AI) speculated GPT-5 could have around 10× GPT-4’s parameter count. He attributes that to a combination of larger embedding dimensions (more granularity/nuance per token), more layers (deeper pattern recognition), and doubling the number of experts (in a mixture-of-experts style design). The transcript treats these as estimates, not confirmed specs.
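To see why those three levers could compound toward a ~10× figure, here is a back-of-envelope parameter count. Every number below is hypothetical, chosen only to illustrate the arithmetic; none are confirmed GPT-4 or GPT-5 specifications.

```python
# Rough transformer parameter count under the three scaling levers mentioned:
# embedding dimension, layer count, and expert count. All inputs are
# illustrative guesses, not real model specs.
def rough_params(d_model, n_layers, n_experts, vocab=100_000):
    attn = 4 * d_model * d_model        # Q, K, V, and output projections
    ffn_expert = 8 * d_model * d_model  # one ~4x-expansion FFN expert
    per_layer = attn + n_experts * ffn_expert
    return n_layers * per_layer + vocab * d_model  # plus embedding table

base = rough_params(d_model=8_192, n_layers=96, n_experts=16)
bigger = rough_params(d_model=16_384, n_layers=120, n_experts=32)
print(f"{bigger / base:.1f}x")  # 9.8x -- modest growth in each lever compounds
```

Doubling the embedding dimension alone quadruples each weight matrix, so even moderate increases in depth and expert count push the total toward an order-of-magnitude jump, which is consistent with the ~10× speculation.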

Why predict a late-2024 rollout instead of an immediate release?

The transcript’s timeline combines (1) multi-month training time for a model of GPT-5’s scale, (2) additional months for safety testing—citing that GPT-4 was tested for 6–8 months before release—and (3) a desire to avoid releasing high-impact capabilities (especially multimodal persuasion tools) during the most contentious U.S. election period. That leads to a predicted end-of-November 2024 timeframe, likely with staged checkpoints into 2025.

Review Questions

  1. What specific mechanism in “let’s verify step-by-step” produces large gains, and why does sampling help?
  2. How do checkpoints change the practical timeline for when users might see GPT-5-like capabilities?
  3. Which factors (training duration, safety testing, and political timing) drive the late-2024 release prediction?

Key Points

  1. Greg Brockman’s remarks about scaling compute and Jason Wei’s “massive GPU training” reaction point to GPT-5’s full-scale training run being underway.

  2. OpenAI’s red-teaming network closure and prior status timeline suggest safety testing is ready to begin as GPT-5 moves through checkpoints.

  3. Checkpoint-based evaluation means intermediate GPT-5 capability snapshots could appear before the final model is fully trained.

  4. GPT-5’s expected performance jump centers on longer, stepwise reasoning paired with verification and selection of stronger reasoning traces.

  5. OpenAI’s “let’s verify step-by-step” results highlight how thousands of samples plus a verifier-like selection process can substantially improve math and STEM outcomes.

  6. Reliability gains are framed as a scaling of sampling-and-selection: more attempts plus better selection yields more consistently strong answers.

  7. A late-2024, staged rollout is predicted based on training time, safety testing length, and the desire to avoid election-related controversy.

Highlights

Red-teaming readiness appears to be in motion: applications closed, and red teamers were expected to know their status by the end of last year—consistent with safety work starting during GPT-5 checkpointing.
The core technical bet is verifier-style reasoning: generate many reasoning traces, score them, and keep the best—reported to roughly double math performance in cited experiments.
The reliability thesis is practical: repeated sampling plus reasoning-step selection can turn occasional “good” answers into more dependable outputs.
The release forecast lands near the end of November 2024, with staged checkpoints rather than a single all-at-once launch.
