GPT-5 - What They Didn't Say
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
GPT-5 is described as a multi-model system that routes prompts to different variants, likely balancing quality with lower inference cost.
Briefing
OpenAI’s GPT-5 rollout is being framed less as a single leap in raw intelligence and more as a cost-and-workflow upgrade: ChatGPT-5 behaves like a multi-model system that routes prompts to different “thinking” and “non-thinking” models, enabling faster responses and cheaper inference—especially for everyday questions. That design matters because it changes what users experience (latency, pricing, and consistency) and because it suggests GPT-5’s gains may come as much from orchestration and tooling as from model architecture alone.
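GPT-5's actual routing logic is not public, so the following is only an illustrative sketch of what "routing prompts to thinking and non-thinking variants" could look like. The model names, the difficulty heuristic, and the threshold are all invented for illustration.

```python
# Hypothetical sketch of a multi-model router. GPT-5's real routing
# criteria are not disclosed; the heuristic and names below are invented.

def estimate_difficulty(prompt: str) -> float:
    """Crude difficulty score: long prompts and reasoning-style
    keywords push the score up (illustrative only)."""
    score = min(len(prompt) / 2000, 0.5)
    for keyword in ("prove", "debug", "step by step", "optimize"):
        if keyword in prompt.lower():
            score += 0.25
    return min(score, 1.0)

def route(prompt: str) -> str:
    """Send hard prompts to a slower 'thinking' model, easy ones to a
    cheaper, faster 'non-thinking' model."""
    if estimate_difficulty(prompt) >= 0.5:
        return "gpt-5-thinking"  # deeper reasoning, higher latency and cost
    return "gpt-5-main"          # fast path for everyday questions

print(route("What's the capital of France?"))            # → gpt-5-main
print(route("Prove this loop terminates and debug it.")) # → gpt-5-thinking
```

The user-experience point in the briefing follows directly from this structure: two prompts typed into the same product can hit different models, so latency and answer style vary in ways a single-model product would not show.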
The presentation itself drew criticism for looking staged and poorly rehearsed, with observers pointing to obvious slide/benchmark inconsistencies—such as a claim that 52% is higher than 69%—and arguing that more careful internal checking could have prevented confusion. Beyond aesthetics, the bigger technical concern is that GPT-5’s headline performance numbers may be less definitive than they appear. Benchmarks are described as oversaturated, with many models clustered near the top, making small differences hard to interpret. A specific example cited is SWE-bench: one professor associated with an open-source agentic coding framework says OpenAI excluded 23 problematic cases out of 500, which would shift the reported score. The critique extends to earlier controversies where datasets used for evaluation were later found to have leaked into pre-training.
Still, GPT-5 is portrayed as a meaningful step forward—just not in the “GPT-4 wow factor” way. The transcript notes that GPT-4 to GPT-5 may not match the magnitude of the earlier GPT-3 to GPT-4 jump, and that GPT-4.5 was originally intended to be GPT-5 but was deprecated due to cost. GPT-5’s “system” nature is treated as the key differentiator: it appears to include routing logic and an agentic loop where code generation can trigger self-testing and feedback during the writing process. That combination is positioned as particularly beneficial for coding, math, and other tasks where iterative verification helps.
OpenAI’s emphasis also leans toward creative writing, expression, and health-related use cases. The transcript credits large-scale post-training teams—split into subteams for code, health, and creative expression—for improvements, while also praising the decision to push harder on health assistance despite the risk of hallucinated medical guidance. The practical pitch is that many people want a second opinion, help preparing questions for doctors, and clearer explanations.
Pricing and speed are presented as the most concrete advantages. The “big” GPT-5 system is described as $1.25 in and $10 out, with cheaper mini and nano variants at $0.25 in / $2 out and $0.05 in / $0.40 out per million tokens, respectively. A major capability highlight is a 400,000-token context window and up to 128,000 tokens output, enabling workflows like rewriting or editing long documents in one pass. However, the transcript flags missing features: GPT-5 is said to support images but not audio, and it lacks a real-time API.
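The quoted prices and limits make the "whole document in one pass" pitch easy to cost out. A back-of-envelope check, using only the per-million-token rates stated above (actual billing details may differ):

```python
# Cost check at the transcript's quoted rates: ($/M input, $/M output).
PRICES = {
    "gpt-5":      (1.25, 10.00),
    "gpt-5-mini": (0.25,  2.00),
    "gpt-5-nano": (0.05,  0.40),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the quoted per-million-token rates."""
    p_in, p_out = PRICES[model]
    return input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out

# One-pass rewrite of a long document: 300K tokens in, 120K tokens out,
# both inside the stated 400K-context / 128K-output limits.
print(f"{request_cost('gpt-5', 300_000, 120_000):.4f}")       # → 1.5750
print(f"{request_cost('gpt-5-mini', 300_000, 120_000):.4f}")  # → 0.3150
```

So editing a book-length manuscript in a single call lands under $2 on the full model and around $0.32 on Mini, which is the concrete sense in which pricing, not raw capability, is the release's clearest advantage.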
Overall, the release is characterized as interesting and potentially disruptive—especially for agentic coding—yet underwhelming compared with earlier breakthroughs. The remaining question is whether GPT-5’s router-based behavior and system design will frustrate users, and whether its cost advantage will translate into sustained dominance over competitors such as Claude for coding tasks.
Cornell Notes
GPT-5 is portrayed as a multi-model “system” rather than a single model: prompts can be routed to different variants, including deeper reasoning for hard problems and faster non-reasoning models for simpler queries. That routing approach is presented as a major driver of lower latency and lower inference cost, which in turn enables aggressive pricing across GPT-5, GPT-5 Mini, and GPT-5 Nano. The transcript also highlights agent-like behavior during coding, where the model can test generated code and feed results back into its next steps. Benchmarks and evaluation methodology are questioned, including claims that some SWE-bench instances were excluded. Despite that skepticism, the release emphasizes creative writing, health assistance, and strong long-context capabilities (400K context; up to 128K output).
- Why does the “router” matter for how GPT-5 performs and what it costs?
- What evidence is cited that GPT-5 is more than a single model?
- What benchmark skepticism is raised, and what example is used?
- How do ARC Challenge results and cost/speed claims fit together?
- What are the key pricing and context-window numbers, and why are they practically important?
- What capability gaps are flagged for GPT-5 and its smaller variants?
Review Questions
- How does routing between reasoning and non-reasoning variants change user experience compared with a single-model approach?
- What does the SWE-bench exclusion claim imply about interpreting benchmark scores for GPT-5?
- Which GPT-5 pricing tier would you choose for a long-context editing task, and how do the output-token limits affect that choice?
Key Points
1. GPT-5 is described as a multi-model system that routes prompts to different variants, likely balancing quality with lower inference cost.
2. Routing and agent-like coding loops (including code testing and feedback) are presented as major reasons GPT-5 can improve coding and math-style tasks.
3. Benchmark results are treated with caution due to oversaturated leaderboards and possible evaluation exclusions (e.g., SWE-bench instances).
4. OpenAI’s post-training emphasis is said to include creative writing/expression and health assistance, supported by large, specialized post-training teams.
5. Pricing is positioned as a standout advantage: GPT-5 big ($1.25 in / $10 out), Mini ($0.25 in / $2 out), and Nano ($0.05 in / $0.40 out) per million tokens.
6. GPT-5’s long-context capability is highlighted by a 400,000-token context window and up to 128,000 output tokens, enabling whole-document rewrite workflows.
7. Feature gaps are flagged: GPT-5 supports images but not audio and lacks a real-time API (also absent in Mini).