
The 10 Biggest ChatGPT-5 Problems & How to Fix Them

6 min read

Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

ChatGPT-5’s “single model” experience can hide routing to faster, less reasoning-heavy behavior, so explicit “think hard” prompts and custom instructions are key to getting deeper answers.

Briefing

ChatGPT-5’s rollout has triggered backlash not just over one-off mistakes, but over how the system reshapes long-running user workflows—especially through a “single model” experience that actually routes requests behind the scenes. The core complaint is that the router often steers users toward faster, less “reasoning” behavior to protect OpenAI’s GPU capacity, which can produce shallow answers, inconsistent behavior versus prior setups, and sudden quality changes mid-conversation. The practical takeaway: many of the biggest problems are fixable with prompt tactics and customization, but some require switching models or using the API.

A central theme is that ChatGPT-5 behaves like multiple models bundled under one interface. Users who previously relied on specific “thinking” behavior (or on older model variants) found their outputs changed after migration. The transcript frames this as “model drift and mismatch”: production workflows can break because new models don’t reproduce old outputs exactly. The recommended response is operational—version prompts, track changes, and run targeted experiments—rather than expecting a perfect drop-in replacement. For chatbot users, the router’s default choices can also matter: if shallow responses persist on complex questions, the fix is to explicitly request deeper reasoning (e.g., “think hard”) and to set custom instructions so the system defaults to “deep analysis” unless the user asks for a quick take.

Another major friction point is the gap between ChatGPT and the API. In chat, routing can hide which underlying model is actually responding; in the API, developers can select a specific model and get more consistent behavior. The transcript notes that OpenAI has been adding visibility in chat—showing which model is being used and responding—yet full control still favors API users or higher-tier plans with more model options. Where users can’t reliably force a particular model, prompt-based routing becomes the workaround.
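To make the chat-vs-API distinction concrete, here is a minimal sketch of pinning a specific model through the API using the official openai Python client. The model identifier "gpt-5" is a placeholder for whatever name your account actually exposes, and the prompts are illustrative:

```python
# Minimal sketch: pin a specific model via the API instead of letting
# chat routing decide. Assumes the official `openai` Python client;
# the model name "gpt-5" is a placeholder identifier.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5",  # explicit selection: no router choosing for you
    messages=[
        {"role": "system", "content": "Default to deep, step-by-step analysis."},
        {"role": "user", "content": "Compare these two caching strategies for a read-heavy API."},
    ],
)
print(response.choices[0].message.content)
```

Because the model is named explicitly, repeated calls exercise the same underlying model, which is exactly the consistency the chat router can't promise.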

Long-context expectations also collide with reality. Even with very large token windows, recall isn't perfect; the transcript cites an OpenAI evaluation suggesting roughly 89% accuracy at context lengths of 128K to 256K tokens, with "lost in the middle" effects still present. The mitigation remains familiar: anchor key instructions at the beginning and end, and use rhythmic reminders throughout the context.
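A minimal sketch of that anchoring tactic, assuming a simple chunk-based prompt assembly; the instruction text and reminder interval are illustrative choices, not values from the transcript:

```python
# Sketch of the anchoring tactic: key instructions at the start, a short
# reminder injected every N chunks through the middle, and the task
# restated at the end of the context.
INSTRUCTIONS = "Extract every dated commitment and who owns it."
REMINDER = f"(Reminder: {INSTRUCTIONS})"

def build_prompt(chunks: list[str], remind_every: int = 5) -> str:
    parts = [INSTRUCTIONS, ""]
    for i, chunk in enumerate(chunks, start=1):
        parts.append(chunk)
        if i % remind_every == 0:  # rhythmic reminder to fight "lost in the middle"
            parts.append(REMINDER)
    parts += ["", f"Before answering, re-read the task: {INSTRUCTIONS}"]
    return "\n".join(parts)
```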

Several narrower but concrete issues follow. Users asking for JSON sometimes receive invalid JSON; the fix is to request structured outputs with a JSON schema and custom instructions, and to switch models if smaller variants like GPT-5 Mini misbehave. Tool use can also be deceptive: the model may claim it called a tool without actually doing so, so prompts should require a plan and proof via artifacts (e.g., showing the Python query or generated code). Reasoning mode costs time and tokens; if speed matters, users should choose non-reasoning modes or explicitly request faster answers. Safety "guardrail friction" can block or soften responses to bio-adjacent requests, requiring narrower phrasing or model changes.
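For the JSON case specifically, here is a sketch using the Structured Outputs response_format in the OpenAI API; the schema fields, model name, and prompt are illustrative:

```python
# Sketch of the JSON-schema fix, using the Structured Outputs
# response_format in the OpenAI API. Schema and prompt are illustrative.
import json
from openai import OpenAI

client = OpenAI()

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "tags"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-5",  # placeholder identifier, as above
    messages=[{"role": "user", "content": "Summarize this post as a title plus tags: <text>"}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "post_summary", "schema": schema, "strict": True},
    },
)

data = json.loads(response.choices[0].message.content)  # should now parse and match the schema
```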

Finally, the transcript highlights a “silent fallback” on lower tiers: after heavy usage (e.g., around 80 messages in 3 hours), the system can downgrade without warning, reducing quality mid-conversation. The proposed solutions are monitoring usage, upgrading tiers, or using the API. Overall, the message is blunt: there’s no magic transition that eliminates the need for prompt engineering and workflow adaptation—ChatGPT-5 rewards deliberate instruction and ongoing tuning, and it can deliver exceptional results when used that way.
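For the usage-monitoring suggestion, a small sketch of a rolling-window message counter; the 80-message / 3-hour figures come from the transcript and should be treated as approximate:

```python
# Sketch: track message timestamps in a rolling 3-hour window so you can
# see when you are approaching the (transcript-cited) fallback threshold.
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(hours=3)
LIMIT = 80  # figure cited in the transcript; treat as approximate

class UsageMonitor:
    def __init__(self) -> None:
        self._sent: deque[datetime] = deque()

    def record(self) -> int:
        """Record one message and return the count in the current window."""
        now = datetime.now()
        self._sent.append(now)
        while self._sent and now - self._sent[0] > WINDOW:
            self._sent.popleft()  # drop messages older than the window
        return len(self._sent)

monitor = UsageMonitor()
if monitor.record() > LIMIT - 10:
    print("Approaching the heavy-usage threshold; expect a possible silent downgrade.")
```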

Cornell Notes

ChatGPT-5’s biggest pain points stem from routing and consistency: a “single” chat experience can steer requests toward faster, less reasoning-heavy behavior to manage GPU load, which can yield shallow answers and break older workflows. Model drift is expected after migration, so production users should version prompts, track changes, and run targeted experiments rather than assuming outputs will match. Long-context performance isn’t perfect even with large token windows, so anchoring and reminders still matter to avoid “lost in the middle.” Several practical fixes are prompt- and configuration-based: request “think hard,” use custom instructions, demand JSON schema for structured outputs, require tool-call proof via artifacts, and verify factual claims with citations. Some issues—like silent downgrades and full model control—depend on plan tier or the API.

Why do users sometimes get shallow answers on complex questions even when they’re using “ChatGPT-5” as a single model?

The system routes requests behind the scenes. The router is tuned to preserve GPU capacity and can default to a faster, non-reasoning variant when it’s not explicitly prompted otherwise. The transcript’s recommended fixes are (1) add a direct instruction like “think hard” in the prompt and (2) set custom instructions to default to deep analysis unless the user requests a quick take.
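For the second fix, a sketch of what that default might look like; the wording is illustrative, not quoted from the transcript. In the ChatGPT UI this text goes into the custom instructions settings, and via the API the equivalent is a persistent system message:

```python
# Illustrative custom-instruction wording (an assumption, not a quote):
# paste into ChatGPT's custom instructions, or use as a system message.
DEEP_DEFAULT = (
    "Default to deep analysis: think hard, reason step by step, and state "
    "your assumptions. Only give a quick take when I explicitly ask for one."
)

messages = [
    {"role": "system", "content": DEEP_DEFAULT},
    {"role": "user", "content": "Why is my cache hit rate dropping?"},
]
```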

How does “chat vs API mismatch” affect reliability, and what can users do about it?

In chat, requests are routed behind the scenes, while the API offers direct model access: developers can test a specific model in a sandbox and deploy it with more consistent behavior. The transcript notes that OpenAI has been adding chat visibility into which model is responding, but full control still typically requires either using the API or selecting among available model options in the chat dropdown (more options on higher tiers). If model selection isn't available, prompting and custom instructions become the main lever.

What causes “model drift” after upgrading to ChatGPT-5, and how should teams respond?

Old workflows can produce different outputs after migration because new models don't replicate prior behavior exactly. The transcript treats this as inevitable and recommends prompt versioning and evaluation: track prompts, test changes deliberately, and adapt prompts to the new model's behavior. For production pipelines, selecting the exact GPT-5 model improves control; for chatbot-only flows, more prompt customization and routing work may be required.
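A minimal sketch of prompt versioning plus a regression check, so migrations become deliberate experiments rather than silent drift; the prompt names, golden cases, and run_model stub are all illustrative stand-ins for a real eval suite:

```python
# Sketch: versioned prompts plus a tiny regression check against golden
# cases. All names here are illustrative.
PROMPTS = {
    "summarize-v1": "Summarize the text in 3 bullets.",
    "summarize-v2": "Think hard, then summarize the text in 3 bullets.",
}

GOLDEN_CASES = [
    {"input": "EXAMPLE MEETING NOTES", "must_contain": ["deadline", "owner"]},
]

def run_model(prompt: str, text: str) -> str:
    raise NotImplementedError  # call your pinned model here

def regression_check(version: str) -> bool:
    """Return True if this prompt version still meets the bar on golden cases."""
    prompt = PROMPTS[version]
    for case in GOLDEN_CASES:
        output = run_model(prompt, case["input"])
        if not all(term in output.lower() for term in case["must_contain"]):
            return False  # the new model/prompt pair no longer matches expectations
    return True
```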

Why doesn’t a huge context window guarantee perfect recall, and what prompting techniques still help?

Even with very large token windows, recall can degrade, especially in the "lost in the middle" region. The transcript cites an OpenAI evaluation indicating about 89% accuracy between 128K and 256K tokens, which is strong but not perfect. Mitigations include anchoring key instructions at the beginning, reiterating requirements at the end, and using rhythmic reminders throughout the context (a technique associated with the system-prompt strategies seen in Claude).

What’s the recommended approach when ChatGPT-5 doesn’t return valid JSON or makes questionable tool-call claims?

For JSON, asking only "Please return JSON" can fail; the transcript recommends requesting structured outputs with a JSON schema and specifying the expected structure in custom instructions, switching models if smaller variants (e.g., GPT-5 Mini) show JSON issues. For tool calls, the model can sometimes claim it performed actions it didn't actually complete. The fix is to require a plan and then proof via artifacts, such as showing the Python query or generated code, so the response demonstrates the tool work rather than merely asserting it.
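One way to operationalize the "plan plus proof" requirement is to pair the prompt with a cheap mechanical check for artifacts. This regex-based sketch is a loose illustration: it only confirms a code block is present, not that the tool actually ran.

```python
# Sketch: require a plan and an artifact, then mechanically verify that
# the artifact (a fenced code block) is actually present in the response.
import re

FENCE = "`" * 3  # triple backtick, built programmatically

PROOF_PROMPT = (
    "First write a numbered plan. Then execute it. For every tool you claim "
    "to use, include the exact code or query you ran in a fenced code block."
)

def has_artifact(response_text: str) -> bool:
    """Cheap check: did the response include at least one fenced code block?"""
    pattern = re.escape(FENCE) + r".+?" + re.escape(FENCE)
    return bool(re.search(pattern, response_text, flags=re.DOTALL))
```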

Review Questions

  1. What routing-related symptoms would make you suspect the router is defaulting to a faster, non-reasoning behavior, and how would you change your prompts/custom instructions?
  2. How would you design a prompt versioning and evaluation plan to handle model drift when migrating production workflows to ChatGPT-5?
  3. Which long-context prompting tactics would you use to reduce “lost in the middle,” and why do they work even when token windows are large?

Key Points

  1. ChatGPT-5’s “single model” experience can hide routing to faster, less reasoning-heavy behavior, so explicit “think hard” prompts and custom instructions are key to getting deeper answers.

  2. Chat and API reliability differ: chat routing can obscure which model responds, while the API supports direct model selection; full control often requires the API or higher-tier model dropdown options.

  3. Model drift after migration is expected; production teams should version prompts, evaluate output changes, and run targeted adjustments instead of assuming old workflows will still match.

  4. Large context windows don’t guarantee perfect recall; anchoring instructions at the beginning and end and using rhythmic reminders helps counter “lost in the middle.”

  5. For structured data, request JSON via a JSON schema (and custom instructions) rather than only asking for “JSON,” since invalid JSON can occur, especially with smaller variants.

  6. To reduce tool-call hallucinations, require a plan and demand proof through artifacts (e.g., the generated Python query/code) rather than accepting tool-call claims at face value.

  7. On lower tiers, quality can silently downgrade after heavy usage; monitoring usage, upgrading tiers, or using the API are the practical mitigations.

Highlights

The router is tuned to protect GPU capacity, which can push complex questions toward a faster, non-reasoning mode unless users explicitly request deeper thinking.
Long-context performance improves but still isn’t perfect; the transcript cites an OpenAI evaluation of around 89% accuracy across 128K to 256K tokens, with “lost in the middle” still relevant.
Tool-call reliability improves when prompts require artifacts that prove actions occurred—like showing the Python query or generated code—rather than trusting assertions.
ChatGPT-5 can silently downgrade mid-conversation on lower plans after roughly 80 messages in 3 hours, with no warning and reduced quality.

Topics

  • ChatGPT-5 Routing
  • Prompt Custom Instructions
  • Model Drift
  • Long-Context Recall
  • Tool-Call Verification