ChatGPT-5 Full Review: 5 Real-World Tests & The AI Race
Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
ChatGPT-5’s biggest real-world edge isn’t just “smarter answers”—it’s a noticeable jump in coding reliability and task execution, especially when users lean into the model’s intended workflow (notably “think hard” and canvas-first usage). In hands-on tests, the model repeatedly produced working code and usable outputs, but it also exposed a sharp environment gap: the same coding prompt could succeed in one interface and fail badly in another.
On the product side, ChatGPT-5 is described as a router over multiple underlying models, with training emphasis placed most heavily on healthcare. During the live coverage, a cancer survivor compared ChatGPT-5 with earlier models, and the pitch was that its medical advice should be more accurate than what a typical large language model delivers. The reviewer, who claims no medical expertise, frames the results as "anecdotally" better, pointing to benchmark claims and suggesting that accuracy improvements matter most where errors carry high stakes.
Coding is where the performance story becomes concrete. The model’s “mixture of models” approach is said to be especially strong for coding and app-building, and the live demos leaned hard on “vibe coding” to let users generate apps quickly. But the most striking finding comes from a side-by-side test: a complex Japan travel itinerary applet (with interactive day-by-day navigation and interest-based toggles like ramen-heavy or temple-heavy days) worked in ChatGPT-5’s canvas environment—delivering a fully functional, clickable app with real destinations. The same prompt sent to Lovable using ChatGPT-5 produced a “white screen of death,” with no interactivity and a clear failure.
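To make the test concrete, here is a minimal sketch of the interest-toggle logic such an itinerary applet would need. This is Python rather than the browser code the canvas app actually produced, and the day labels, stop names, and tags are hypothetical stand-ins, not the reviewer's prompt output:

```python
from dataclasses import dataclass

@dataclass
class Stop:
    name: str
    tags: set[str]  # e.g. {"ramen"} or {"temple"}

@dataclass
class Day:
    label: str
    stops: list[Stop]

def filter_itinerary(days: list[Day], interests: set[str]) -> list[Day]:
    """Keep only stops that match a toggled interest; drop days left empty."""
    result = []
    for day in days:
        keep = [s for s in day.stops if s.tags & interests] if interests else day.stops
        if keep:
            result.append(Day(day.label, keep))
    return result

# Hypothetical stand-in data, not the reviewer's actual itinerary output.
days = [
    Day("Day 1: Tokyo", [Stop("Ichiran Shibuya", {"ramen"}),
                         Stop("Senso-ji", {"temple"})]),
    Day("Day 2: Kyoto", [Stop("Kinkaku-ji", {"temple"}),
                         Stop("Ramen Sen no Kaze", {"ramen"})]),
]

for day in filter_itinerary(days, {"temple"}):
    print(day.label, "->", ", ".join(s.name for s in day.stops))
```

The interesting part of the failure is that none of this logic is hard; what differed between canvas and Lovable was the environment wiring around it, not the algorithm.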
That environment sensitivity shows up again in how the model handles structured tasks. For a Gantt chart request about the Apollo 13 mission, ChatGPT-5 could research and outline the build components and critical path, but struggled to render a readable chart directly. When asked to code the chart, it generated a full Gantt chart that was followable—still not perfect visually, but materially more usable.
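That "ask for code, not a rendered artifact" pattern is easy to reproduce. Below is a minimal sketch of a chart-as-code approach using matplotlib's broken_barh; the task names and hour ranges are rough illustrative values, not the critical path the model actually researched:

```python
import matplotlib.pyplot as plt

# Hypothetical Apollo 13 mission phases (hours after launch); illustrative
# stand-ins, not the breakdown from the reviewer's test.
tasks = [
    ("Launch & orbit insertion", 0, 3),
    ("Translunar coast", 3, 53),
    ("Oxygen tank failure response", 56, 6),
    ("Free-return trajectory burn", 61, 1),
    ("LM lifeboat operations", 58, 80),
    ("Re-entry & splashdown", 140, 3),
]

fig, ax = plt.subplots(figsize=(9, 3))
for row, (name, start, duration) in enumerate(tasks):
    ax.broken_barh([(start, duration)], (row - 0.4, 0.8))
ax.set_yticks(range(len(tasks)))
ax.set_yticklabels([name for name, _, _ in tasks])
ax.invert_yaxis()  # first task on top, per Gantt convention
ax.set_xlabel("Hours after launch")
plt.tight_layout()
plt.show()
```

Code like this is checkable line by line, which is plausibly why the coded chart beat the directly rendered one: the model only has to get the data right, and the plotting library handles layout.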
Beyond coding, the reviewer ran a “gnarly” data-analysis test designed to mimic messy business reality: three entangled CSVs with inconsistent formatting and a SQL injection attack embedded in one file. The goal was to detect duplicates, surface the injection risk, and produce an auditable explanation with a clear picture of employee counts across overloaded and underloaded projects. Results varied dramatically depending on prompting mode. “Think hard” versions of ChatGPT-5 scored highest, beating Claude Code and multiple OpenAI variants, while vanilla ChatGPT-5 without the extra reasoning mode scored the lowest—making ChatGPT-5 both the best and worst performer in the same test set.
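For readers who want to picture what the test demanded, here is a minimal sketch of the kind of auditable check involved: normalize inconsistent formatting, flag duplicate rows, and surface suspicious injection payloads. The file names and the crude regex are assumptions for illustration, not the reviewer's actual harness:

```python
import csv
import re
from pathlib import Path

# Crude signature for classic payloads ("'; DROP TABLE ...", "OR 1=1").
INJECTION_RE = re.compile(
    r"['\"]\s*;|--|\bDROP\s+TABLE\b|\bOR\s+1\s*=\s*1\b", re.IGNORECASE
)

def normalize(value: str) -> str:
    """Collapse the formatting inconsistencies that hide duplicates."""
    return " ".join(value.strip().lower().split())

def audit(paths: list[Path]) -> None:
    seen: dict[tuple, str] = {}
    for path in paths:
        with path.open(newline="") as f:
            for lineno, row in enumerate(csv.reader(f), start=1):
                for cell in row:
                    if INJECTION_RE.search(cell):
                        print(f"INJECTION? {path}:{lineno}: {cell!r}")
                key = tuple(normalize(c) for c in row)
                if key in seen:
                    print(f"DUPLICATE  {path}:{lineno} matches {seen[key]}")
                else:
                    seen[key] = f"{path}:{lineno}"

# Hypothetical file names standing in for the three entangled CSVs.
audit([Path("employees.csv"), Path("projects.csv"), Path("assignments.csv")])
```

A deterministic script like this is trivial; the test's point was whether the model would notice these traps unprompted and explain them auditably, which is exactly where the "think hard" mode separated itself.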
The takeaway is less about hype and more about fit. ChatGPT-5 is portrayed as a strong “daily driver” for non-coders too—improved writing quality, better multimodal reading (including handwriting), and more natural drafting and graphing. Yet it still hallucinates occasionally and can overbuild buggy applets, so checkpointing and verification remain necessary. In the broader debate—model rankings, betting-market swings, and backlash—the reviewer argues the practical question is where reliability and high-stakes accuracy improve enough to matter, especially in medical and other correctness-critical domains.
Cornell Notes
ChatGPT-5 is presented as a router over multiple models, with special emphasis on higher-stakes domains like healthcare and a major push toward coding and app-building. In real tests, it delivered working code and usable outputs—especially in the canvas environment—and showed a big interface-dependent gap where the same prompt could fail elsewhere. A “gnarly” business-data task (messy CSVs plus a SQL injection trap) produced the clearest lesson: prompting mode matters, with “think hard” variants scoring best while vanilla ChatGPT-5 scored worst in the same comparison set. Overall, the model’s reliability gains—plus improved writing and multimodal reading—make it feel more like a reasoning partner, but it still hallucinates and can generate fragile applets that require checkpointing.
- What structural change is described for ChatGPT-5, and why does it matter for performance?
- Why did the Japan travel itinerary test succeed in one environment but fail in another?
- What does the “gnarly CSV” test reveal about prompting, and what were the key results?
- How did ChatGPT-5 perform on the Apollo 13 Gantt chart task, and what pattern emerged?
- What reliability and safety cautions still apply despite the improvements?
Review Questions
- In the reviewer’s comparisons, how did “think hard” change outcomes on the messy CSV + SQL injection test?
- What evidence supports the claim that ChatGPT-5’s coding performance depends on the interface environment (canvas vs. Lovable)?
- Why might asking for code produce better results than asking for a direct formatted artifact (e.g., the Apollo 13 Gantt chart)?
Key Points
1. ChatGPT-5 is described as a router over multiple underlying models, with special training emphasized for healthcare and coding-heavy work.
2. Healthcare accuracy is positioned as a major investment area, with the claim that medical advice should be more accurate than typical large language model outputs.
3. Coding performance can be highly environment-dependent: a complex itinerary app worked in ChatGPT-5 canvas but failed in Lovable with a white-screen output.
4. Prompting mode materially changes results on correctness-critical tasks; “think hard” variants outperformed vanilla ChatGPT-5 in the same test set.
5. ChatGPT-5 shows a recurring strength in tasks where it can generate executable artifacts (code, charts) rather than only formatted text.
6. Despite improvements, hallucinations still occur and generated applets may be brittle, so checkpointing and verification are recommended.
7. The practical debate over “best model” rankings is framed as less important than matching the model to the user’s correctness and reliability needs.