
ChatGPT-5 Full Review: 5 Real-World Tests & The AI Race

5 min read

Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

ChatGPT-5 is described as a router over multiple underlying models, with special training emphasized for healthcare and coding-heavy work.

Briefing

ChatGPT-5’s biggest real-world edge isn’t just “smarter answers”—it’s a noticeable jump in coding reliability and task execution, especially when users lean into the model’s intended workflow (notably “think hard” and canvas-first usage). In hands-on tests, the model repeatedly produced working code and usable outputs, but it also exposed a sharp environment gap: the same coding prompt could succeed in one interface and fail badly in another.

On the product side, ChatGPT-5 is described as a router over multiple underlying models, with special training emphasized most strongly in healthcare. During the live coverage, a cancer survivor compared ChatGPT-5 with earlier models, and the pitch was that medical advice should be more accurate than what a typical large language model delivers. The reviewer—without claiming medical expertise—frames the results as “anecdotally” better, pointing to benchmark claims and suggesting that accuracy improvements matter most where errors carry high stakes.

Coding is where the performance story becomes concrete. The model’s “mixture of models” approach is said to be especially strong for coding and app-building, and the live demos leaned hard on “vibe coding” to let users generate apps quickly. But the most striking finding comes from a side-by-side test: a complex Japan travel itinerary applet (with interactive day-by-day navigation and interest-based toggles like ramen-heavy or temple-heavy days) worked in ChatGPT-5’s canvas environment—delivering a fully functional, clickable app with real destinations. The same prompt sent to Lovable using ChatGPT-5 produced a “white screen of death,” with no interactivity and a clear failure.

That environment sensitivity shows up again in how the model handles structured tasks. For a Gantt chart request about the Apollo 13 mission, ChatGPT-5 could research and outline the build components and critical path, but struggled to render a readable chart directly. When asked to code the chart, it generated a full Gantt chart that was followable—still not perfect visually, but materially more usable.

Beyond coding, the reviewer ran a “gnarly” data-analysis test designed to mimic messy business reality: three entangled CSVs with inconsistent formatting and a SQL injection attack embedded in one file. The goal was to detect duplicates, surface the injection risk, and produce an auditable explanation with a clear picture of employee counts across overloaded and underloaded projects. Results varied dramatically depending on prompting mode. “Think hard” versions of ChatGPT-5 scored highest, beating Claude Code and multiple OpenAI variants, while vanilla ChatGPT-5 without the extra reasoning mode scored the lowest—making ChatGPT-5 both the best and worst performer in the same test set.

The takeaway is less about hype and more about fit. ChatGPT-5 is portrayed as a strong “daily driver” for non-coders too—improved writing quality, better multimodal reading (including handwriting), and more natural drafting and graphing. Yet it still hallucinates occasionally and can overbuild buggy applets, so checkpointing and verification remain necessary. In the broader debate—model rankings, betting-market swings, and backlash—the reviewer argues the practical question is where reliability and high-stakes accuracy improve enough to matter, especially in medical and other correctness-critical domains.

Cornell Notes

ChatGPT-5 is presented as a router over multiple models, with special emphasis on higher-stakes domains like healthcare and a major push toward coding and app-building. In real tests, it delivered working code and usable outputs—especially in the canvas environment—and showed a big interface-dependent gap where the same prompt could fail elsewhere. A “gnarly” business-data task (messy CSVs plus a SQL injection trap) produced the clearest lesson: prompting mode matters, with “think hard” variants scoring best while vanilla ChatGPT-5 scored worst in the same comparison set. Overall, the model’s reliability gains—plus improved writing and multimodal reading—make it feel more like a reasoning partner, but it still hallucinates and can generate fragile applets that require checkpointing.

What structural change is described for ChatGPT-5, and why does it matter for performance?

ChatGPT-5 is described as a “model router” that routes requests to multiple underlying models, with additional special training. That matters because different sub-models appear to specialize—coding and proof-by-code/math are repeatedly stronger than other tasks. The reviewer also links this to why prompting and interface choices (like canvas) can swing outcomes: the router plus training and controls (e.g., reasoning effort) can change how the system executes a task.
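The router idea can be made concrete with a toy sketch. Nothing about ChatGPT-5's actual routing is public; the sub-model names and keyword heuristics below are invented purely to illustrate how a dispatcher plus a reasoning-effort flag could swing which system handles a task:

```python
# Illustrative sketch only: ChatGPT-5's real router is not public.
# Sub-model names and heuristics here are invented for demonstration.

def route_request(prompt: str, think_hard: bool = False) -> str:
    """Pick a hypothetical sub-model based on crude task heuristics."""
    text = prompt.lower()
    if think_hard:
        return "reasoning-model"   # explicit "think hard" overrides routing
    if any(k in text for k in ("code", "app", "function", "bug")):
        return "coding-model"      # coding-specialized sub-model
    return "fast-model"            # default lightweight path

print(route_request("Write a function to parse CSVs"))       # coding-model
print(route_request("Summarize this article"))               # fast-model
print(route_request("Audit this dataset", think_hard=True))  # reasoning-model
```

The point of the sketch is the reviewer's observation in miniature: the same prompt can land on a different sub-system depending on wording and mode, which is one plausible reason outcomes vary so much across interfaces.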

Why did the Japan travel itinerary test succeed in one environment but fail in another?

The reviewer used the same complex prompt: build an interactive itinerary applet with real Japan destinations and configurable emphasis (ramen-heavy vs. temple-heavy days, narrative per day, click-through usability). In ChatGPT-5’s canvas app environment, the output became a fully working, clickable app. In Lovable, using ChatGPT-5 for the same coding challenge resulted in a “white screen of death,” with text but no design or interactivity—graded as a complete fail. The reviewer treats this as a meaningful environment prioritization issue, not just a prompt issue.

What does the “gnarly CSV” test reveal about prompting, and what were the key results?

The test used three entangled, inconsistent CSVs with a SQL injection attack embedded in one file and no consistent formatting—intended to mimic messy real-world business data. The model was asked to explain what happened, find duplicates, detect the injection risk, and produce an auditable, clear employee-count picture. ChatGPT-5 with “think hard” (either via a button or typed instruction) scored highest, beating Claude Code and multiple OpenAI variants. Vanilla ChatGPT-5 without “think hard” scored the lowest—lower than o3, o3 Pro, Claude Code, and other ChatGPT-5 modes—showing that the correct prompting mode can be the difference between strong and weak execution.
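For a sense of what the audit asks of a model, here is a minimal sketch of two of the subtasks: flagging duplicate rows after normalization and spotting cells that look like SQL injection payloads. The sample data and the injection regex are illustrative assumptions, not details from the review:

```python
# Minimal sketch of the audit described: flag normalized duplicate rows
# and cells that resemble SQL injection payloads. Sample data and the
# injection pattern are illustrative, not taken from the review.
import csv
import io
import re

INJECTION_RE = re.compile(r"('|--|;)\s*(drop|delete|union|or\s+1=1)",
                          re.IGNORECASE)

def audit_rows(rows):
    """Return (duplicate row indices, [(row index, suspicious cell)])."""
    seen, duplicates, suspicious = set(), [], []
    for i, row in enumerate(rows):
        key = tuple(cell.strip().lower() for cell in row)  # normalize
        if key in seen:
            duplicates.append(i)
        seen.add(key)
        for cell in row:
            if INJECTION_RE.search(cell):
                suspicious.append((i, cell))
    return duplicates, suspicious

sample = io.StringIO(
    "name,project\n"
    "Ada,Apollo\n"
    "ada ,apollo\n"
    "Bob,'; DROP TABLE employees; --\n"
)
rows = list(csv.reader(sample))[1:]  # skip header
dups, flagged = audit_rows(rows)
print(dups)     # [1]  (row 1 duplicates row 0 after normalization)
print(flagged)  # [(2, "'; DROP TABLE employees; --")]
```

Even this toy version shows why the task is a good discriminator: a model has to notice that duplicates only appear after normalization, and that the injection hides inside otherwise ordinary-looking cells.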

How did ChatGPT-5 perform on the Apollo 13 Gantt chart task, and what pattern emerged?

For Apollo 13, ChatGPT-5 could research and lay out components and the critical path tied to the disaster, but it struggled to produce a readable Gantt chart directly. When asked to code the chart, it generated a full Gantt chart that the reviewer could follow, though it remained visually dense. The pattern: it may overindex on correctness and structure when coding is requested, while presentation/formatting can lag unless the task is framed to generate executable artifacts.
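The "ask for code instead of a rendered artifact" pattern can be sketched in a few lines: compute task start times from durations and dependencies, then draw ASCII bars. The task names and durations below are invented placeholders, not figures from the demo:

```python
# Toy illustration of "code the chart": derive start times from a
# dependency chain, then render ASCII Gantt bars. Task names and
# durations are invented, not from the Apollo 13 demo.

tasks = [  # (name, duration_in_hours, depends_on)
    ("Oxygen tank failure response", 2, None),
    ("Power-down of command module", 3, "Oxygen tank failure response"),
    ("LM lifeboat configuration", 4, "Power-down of command module"),
    ("Re-entry preparation", 5, "LM lifeboat configuration"),
]

starts = {}  # name -> (start, duration)
for name, duration, dep in tasks:
    start = starts[dep][0] + starts[dep][1] if dep else 0
    starts[name] = (start, duration)

for name, (start, duration) in starts.items():
    print(f"{name:32} {' ' * start}{'#' * duration}")
```

Generating an executable artifact like this sidesteps the direct-rendering weakness the reviewer observed: the model only has to get the schedule logic right, and the presentation falls out of the code.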

What reliability and safety cautions still apply despite the improvements?

Even with reduced hallucinations compared with earlier versions, the reviewer still encountered hallucinations in testing. Applets can be fragile: the model may overbuild and introduce bugs, so checkpointing and publishing in stages are encouraged. For high-stakes use (especially medical), the reviewer emphasizes that accuracy improvements matter, but verification remains essential.

Review Questions

  1. In the reviewer’s comparisons, how did “think hard” change outcomes on the messy CSV + SQL injection test?
  2. What evidence supports the claim that ChatGPT-5’s coding performance depends on the interface environment (canvas vs. Lovable)?
  3. Why might asking for code produce better results than asking for a direct formatted artifact (e.g., the Apollo 13 Gantt chart)?

Key Points

  1. ChatGPT-5 is described as a router over multiple underlying models, with special training emphasized for healthcare and coding-heavy work.

  2. Healthcare accuracy is positioned as a major investment area, with the claim that medical advice should be more accurate than typical large language model outputs.

  3. Coding performance can be highly environment-dependent: a complex itinerary app worked in ChatGPT-5 canvas but failed in Lovable with a white-screen output.

  4. Prompting mode materially changes results on correctness-critical tasks; “think hard” variants outperformed vanilla ChatGPT-5 in the same test set.

  5. ChatGPT-5 shows a recurring strength in tasks where it can generate executable artifacts (code, charts) rather than only formatted text.

  6. Despite improvements, hallucinations still occur and generated applets may be brittle, so checkpointing and verification are recommended.

  7. The practical debate over “best model” rankings is framed as less important than matching the model to the user’s correctness and reliability needs.

Highlights

A complex Japan travel itinerary applet became a fully working, clickable canvas app in ChatGPT-5, while the same prompt in Lovable returned a “white screen of death.”
On a messy, adversarial CSV task containing a SQL injection trap, ChatGPT-5 with “think hard” scored highest, but vanilla ChatGPT-5 scored lowest—showing prompting mode can flip performance.
ChatGPT-5 could research Apollo 13 details but struggled to render a readable Gantt chart directly; asking it to code the chart produced a usable result.
The model’s healthcare emphasis centers on reducing high-stakes errors, with the reviewer treating accuracy gains as the most consequential improvement.

Topics

  • ChatGPT-5
  • Model Routing
  • Canvas Applets
  • Prompting Modes
  • Healthcare Accuracy