GPT 5 - What They Didn't Say

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

GPT-5 is described as a multi-model system that routes prompts to different variants, likely balancing quality with lower inference cost.

Briefing

OpenAI’s GPT-5 rollout is being framed less as a single leap in raw intelligence and more as a cost-and-workflow upgrade: ChatGPT-5 behaves like a multi-model system that routes prompts to different “thinking” and “non-thinking” models, enabling faster responses and cheaper inference—especially for everyday questions. That design matters because it changes what users experience (latency, pricing, and consistency) and because it suggests GPT-5’s gains may come as much from orchestration and tooling as from model architecture alone.
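
As a rough sketch of what prompt routing like this could look like in practice (a minimal illustration; the heuristic, thresholds, and model names below are invented and are not OpenAI's actual logic):

```python
# Hypothetical sketch of routing between "thinking" and "non-thinking"
# variants. The difficulty heuristic and model names are illustrative
# assumptions, not OpenAI's implementation.

REASONING_HINTS = ("prove", "debug", "step by step", "derive", "optimize")

def route(prompt: str) -> str:
    """Pick a model variant based on a crude difficulty estimate."""
    looks_hard = len(prompt) > 2000 or any(h in prompt.lower() for h in REASONING_HINTS)
    # Hard prompts go to the slower reasoning model; everything else
    # takes the cheap, low-latency path.
    return "reasoning-model" if looks_hard else "fast-model"

print(route("What's the capital of France?"))           # fast-model
print(route("Debug this race condition step by step"))  # reasoning-model
```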

The presentation itself drew criticism for looking staged and poorly rehearsed, with observers pointing to obvious slide/benchmark inconsistencies—such as a claim that 52% is higher than 69%—and arguing that more careful internal checking could have prevented confusion. Beyond aesthetics, the bigger technical concern is that GPT-5’s headline performance numbers may be less definitive than they appear. Benchmarks are described as oversaturated, with many models clustered near the top, making small differences hard to interpret. A specific example cited is SWE-bench: one professor associated with an open-source agentic coding framework says OpenAI excluded 23 problematic cases out of 500, which would shift the reported score. The critique extends to earlier controversies where datasets used for evaluation were later found to have leaked into pre-training.

Still, GPT-5 is portrayed as a meaningful step forward—just not in the “GPT-4 wow factor” way. The transcript notes that GPT-4 to GPT-5 may not match the magnitude of the earlier GPT-3 to GPT-4 jump, and that GPT-4.5 was originally intended to be GPT-5 but was deprecated due to cost. GPT-5’s “system” nature is treated as the key differentiator: it appears to include routing logic and an agentic loop where code generation can trigger self-testing and feedback during the writing process. That combination is positioned as particularly beneficial for coding, math, and other tasks where iterative verification helps.

OpenAI’s emphasis also leans toward creative writing, expression, and health-related use cases. The transcript credits large-scale post-training teams—split into subteams for code, health, and creative expression—for improvements, while also praising the decision to push harder on health assistance despite the risk of hallucinated medical guidance. The practical pitch is that many people want a second opinion, help preparing questions for doctors, and clearer explanations.

Pricing and speed are presented as the most concrete advantages. The “big” GPT-5 is described as $1.25 in and $10 out per million tokens, with cheaper Mini and Nano variants at $0.25 in / $2 out and $0.05 in / $0.40 out, respectively. A major capability highlight is a 400,000-token context window and up to 128,000 output tokens, enabling workflows like rewriting or editing long documents in one pass. However, the transcript flags missing features: GPT-5 is said to support images but not audio, and it lacks a real-time API.
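
Taking the quoted per-million-token rates at face value, a quick back-of-envelope comparison shows what a single long-document pass would cost at each tier (the prices are from the transcript; the 300K-in / 100K-out workload is an invented example that fits within the stated limits):

```python
# Cost of one long-document rewrite at the quoted per-million-token
# rates. Prices are from the transcript; the workload sizes are an
# invented example within the stated 400K context / 128K output limits.

TIERS = {            # (input $/1M tokens, output $/1M tokens)
    "gpt-5":      (1.25, 10.00),
    "gpt-5-mini": (0.25, 2.00),
    "gpt-5-nano": (0.05, 0.40),
}

tokens_in, tokens_out = 300_000, 100_000

for name, (p_in, p_out) in TIERS.items():
    cost = tokens_in / 1e6 * p_in + tokens_out / 1e6 * p_out
    print(f"{name}: ${cost:.3f}")
# gpt-5: $1.375   gpt-5-mini: $0.275   gpt-5-nano: $0.055
```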

Overall, the release is characterized as interesting and potentially disruptive—especially for agentic coding—yet underwhelming compared with earlier breakthroughs. The remaining question is whether GPT-5’s router-based behavior and system design will frustrate users, and whether its cost advantage will translate into sustained dominance over competitors such as Claude for coding tasks.

Cornell Notes

GPT-5 is portrayed as a multi-model “system” rather than a single model: prompts can be routed to different variants, including deeper reasoning for hard problems and faster non-reasoning models for simpler queries. That routing approach is presented as a major driver of lower latency and lower inference cost, which in turn enables aggressive pricing across GPT-5, GPT-5 Mini, and GPT-5 Nano. The transcript also highlights agent-like behavior during coding, where the model can test generated code and feed results back into its next steps. Benchmarks and evaluation methodology are questioned, including claims that some SWE-bench instances were excluded. Despite that skepticism, the release emphasizes creative writing, health assistance, and strong long-context capabilities (400K context; up to 128K output).

Why does the “router” matter for GPT-5’s performance and cost?

The transcript says GPT-5 in ChatGPT functions as a system that routes each prompt/context to the best model for the job. Harder tasks may go to a deeper reasoning model, while simpler questions can be handled by a quick, cheaper non-reasoning model. This reduces wasted compute on easy prompts—especially important at ChatGPT’s scale (the transcript cites ~700 million users). It also implies that “thinking” may not be automatic for every request; users may need to prompt for it to get consistent reasoning behavior.
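
A toy calculation shows why routing pays off at scale; the 80/20 traffic split and per-request costs below are invented numbers, not figures from the transcript:

```python
# Illustrative savings from routing: if most traffic is easy and can be
# served by a cheaper non-reasoning model, the blended cost per request
# falls sharply. The split and per-request costs are hypothetical.

easy_fraction = 0.80
cost_reasoning, cost_fast = 0.010, 0.001   # hypothetical $ per request

blended = easy_fraction * cost_fast + (1 - easy_fraction) * cost_reasoning
print(f"blended: ${blended:.4f} vs all-reasoning: ${cost_reasoning:.4f}")
# blended: $0.0028 vs all-reasoning: $0.0100 -> roughly 3.6x cheaper
```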

What evidence is cited that GPT-5 is more than a single model?

Two signals are emphasized: (1) routing between reasoning and non-reasoning variants, and (2) tool-like behavior during coding where the model can test code and incorporate feedback as it generates. Together, those features resemble an agentic loop even during normal chat, which the transcript links to improvements in coding and math-style tasks that benefit from verification.
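
A minimal sketch of such a generate-test-revise loop, assuming a hypothetical `model` object with a `.generate()` method (this mirrors the behavior the transcript describes, not OpenAI's internals):

```python
# Sketch of an agentic generate-test-revise loop. `model` is a
# hypothetical stand-in with a .generate(prompt) -> str method.
import subprocess
import sys
import tempfile

def write_with_self_test(model, task: str, max_rounds: int = 3) -> str:
    feedback = ""
    code = ""
    for _ in range(max_rounds):
        code = model.generate(task + feedback)   # draft (or redraft) the code
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        # Run the draft and capture any error output.
        result = subprocess.run([sys.executable, path], capture_output=True, text=True)
        if result.returncode == 0:               # it ran cleanly: done
            return code
        # Feed the failure back into the next generation round.
        feedback = f"\n\nThe code failed with:\n{result.stderr}\nPlease fix it."
    return code                                  # best effort after max_rounds
```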

What benchmark skepticism is raised, and what example is used?

The transcript argues many benchmarks are oversaturated near the top, making small score differences less meaningful. It also points to evaluation choices, specifically SWE-bench: a professor associated with an open-source agentic coding framework claims OpenAI did not evaluate 23 of 500 instances, which would change the effective score. The transcript says details appear in a GPT-5 system card rather than the main presentation/blog.
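
The arithmetic behind why exclusions matter is straightforward; the 23-of-500 figure is the transcript's claim, and the 75% reported score below is a hypothetical stand-in:

```python
# How excluding instances shifts a benchmark score. The 23-of-500
# exclusion is the transcript's claim; the 75% score is hypothetical.

evaluated, total = 500 - 23, 500        # 477 instances actually run
reported = 0.75                         # hypothetical score on the 477

solved = round(reported * evaluated)    # ~358 tasks solved
worst_case = solved / total             # count the 23 skipped as failures
print(f"{reported:.1%} on {evaluated} -> {worst_case:.1%} on all {total}")
# 75.0% on 477 -> 71.6% on all 500
```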

How do the ARC-AGI results and cost/speed claims fit together?

The ARC-AGI scores are described as “good” but not leading on ARC-AGI-1 and not strong on ARC-AGI-2. The transcript also claims GPT-5 is cheaper and faster, suggesting efficiency gains (possibly from less compute and/or lower precision like FP4) even if it doesn’t top every reasoning benchmark. The overall message is that efficiency improvements may outpace headline “AGI” performance.
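
For context on the FP4 speculation: quantizing weights from 16-bit to 4-bit cuts weight memory by 4x, which is the kind of serving efficiency that could explain cheaper, faster inference (the parameter count below is an invented example):

```python
# Rough memory arithmetic behind the FP4 speculation. The 1T parameter
# count is a hypothetical example, not a known GPT-5 figure.

params = 1_000e9                 # hypothetical 1 trillion parameters
bytes_fp16 = params * 2          # 16 bits = 2 bytes per weight
bytes_fp4 = params * 0.5         # 4 bits = 0.5 bytes per weight
print(f"FP16: {bytes_fp16/1e12:.1f} TB   FP4: {bytes_fp4/1e12:.2f} TB")
# FP16: 2.0 TB   FP4: 0.50 TB  -> 4x less weight memory
```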

What are the key pricing and context-window numbers, and why are they practically important?

The transcript lists: GPT-5 big at $1.25 in / $10 out per million tokens; GPT-5 Mini at $0.25 in / $2 out; and GPT-5 Nano at $0.05 in / $0.40 out. It highlights a 400,000-token context window and up to 128,000 output tokens, enabling workflows like pasting in a whole novel and rewriting or editing it in one pass. It also notes the knowledge cutoff is October of the previous year, which affects how current the model’s information can be.
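
A quick feasibility check of that novel-rewrite workflow under the quoted limits (the word-to-token ratio is a common rough rule of thumb, and the novel length is an invented example):

```python
# Does a whole novel fit the quoted limits (400K context, 128K output)?
# The tokens-per-word ratio is a rough English rule of thumb, and we
# assume input plus planned output both count against the context window.

CONTEXT_LIMIT, OUTPUT_LIMIT = 400_000, 128_000
TOKENS_PER_WORD = 1.33                    # rough average for English

novel_words = 90_000                      # typical novel length (invented)
novel_tokens = int(novel_words * TOKENS_PER_WORD)      # ~120K tokens

fits_context = novel_tokens + OUTPUT_LIMIT <= CONTEXT_LIMIT
fits_output = novel_tokens <= OUTPUT_LIMIT             # can it emit a full rewrite?
print(novel_tokens, fits_context, fits_output)         # 119700 True True
```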

What capability gaps are flagged for GPT-5 and its smaller variants?

The transcript says GPT-5 supports images but not audio and does not support a real-time API. It adds that GPT-5 Mini also lacks audio and real-time support. The author speculates these may arrive later, but as presented, the feature set is narrower than some multimodal expectations.

Review Questions

  1. How does routing between reasoning and non-reasoning variants change user experience compared with a single-model approach?
  2. What does the SWE-bench exclusion claim imply about interpreting benchmark scores for GPT-5?
  3. Which GPT-5 pricing tier would you choose for a long-context editing task, and how do the output-token limits affect that choice?

Key Points

  1. GPT-5 is described as a multi-model system that routes prompts to different variants, likely balancing quality with lower inference cost.

  2. Routing and agent-like coding loops (including code testing and feedback) are presented as major reasons GPT-5 can improve coding and math-style tasks.

  3. Benchmark results are treated with caution due to oversaturated leaderboards and possible evaluation exclusions (e.g., SWE-bench instances).

  4. OpenAI’s post-training emphasis is said to include creative writing/expression and health assistance, supported by large, specialized post-training teams.

  5. Pricing is positioned as a standout advantage: GPT-5 big ($1.25 in / $10 out), Mini ($0.25 in / $2 out), and Nano ($0.05 in / $0.40 out) per million tokens.

  6. GPT-5’s long-context capability is highlighted by a 400,000-token context window and up to 128,000 output tokens, enabling whole-document rewrite workflows.

  7. Feature gaps are flagged: GPT-5 supports images but not audio and lacks a real-time API (also absent in Mini).

Highlights

GPT-5 is portrayed less as a single model breakthrough and more as an orchestrated system that routes prompts to reasoning or fast variants to cut compute costs.
A key coding claim is agentic behavior: generated code can be tested and revised using feedback during the same interaction.
The transcript emphasizes long-context practicality—400K context with up to 128K output—making large-scale rewriting feasible in one pass.
Evaluation methodology is questioned, including the claim that OpenAI excluded 23 of 500 SWE-bench instances, affecting reported performance.
Pricing is framed as unusually aggressive, with GPT-5 big at $1.25 in / $10 out and much cheaper Mini and Nano tiers.

Topics

  • GPT-5 System
  • Agentic Coding
  • Benchmark Methodology
  • Routing and Inference
  • Pricing and Context Windows
