
How can GPT-4.5 be So Bad?

Sam Witteveen · 6 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

GPT-4.5 is described as more natural and less verbose than GPT-4, with improved structured output and function calling.

Briefing

GPT-4.5 arrives with a “bigger and more natural” pitch, but benchmark results and practical tradeoffs paint it as an also-ran: stronger than GPT-4 in some conversational and structured-output behaviors, yet noticeably behind leading reasoning-focused models and priced far beyond comparable options.

The rollout leans on a scaling narrative that splits progress into two paths: scaling up training data and model size versus scaling at inference time using more compute and longer reasoning chains. GPT-4.5 is positioned as a large, top-tier release, with claims of improved alignment, less verbosity, and faster “getting to the point.” In everyday interaction, it can feel more conversational and more willing to deliver the requested format without the fluff—traits that matter for tool use, function calling, and structured outputs.
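
To make the function-calling claim concrete, here is a minimal sketch of a tool call against the OpenAI Chat Completions API. The model name gpt-4.5-preview and the get_weather schema are illustrative assumptions, not details taken from the video:

```python
# Minimal function-calling sketch against the OpenAI Chat Completions API.
# Hypothetical: "gpt-4.5-preview" and the get_weather schema are illustrative.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4.5-preview",
    messages=[{"role": "user", "content": "What's the weather in Singapore?"}],
    tools=tools,
)

# If the model chose to call the tool, the arguments arrive as a JSON string.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```

The "less verbose, gets to the point" framing matters here: a model that reliably emits a clean tool call instead of chatty preamble is easier to wire into pipelines.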

But when performance is measured, the picture shifts. In comparisons using published benchmark results, GPT-4.5 lands behind multiple alternatives. On MMLU-style evaluations, it is described as below DeepSeek V3, even though V3 is not framed as a reasoning model. On SWE-bench Verified—an engineering task benchmark—GPT-4.5 is reported to trail DeepSeek R1, an older Claude Sonnet baseline, and newer offerings such as OpenAI’s own o3-mini and Claude 3.7. Mathematics benchmarks are also cited as a weak spot, reinforcing the idea that GPT-4.5 is not optimized for the hardest reasoning workloads where newer “reasoning” models tend to shine.

That mismatch leads to a broader suspicion: GPT-4.5 may reflect an older generation of training priorities, not the newest inference-time reasoning trend. The transcript points to a knowledge cutoff of October 2023, while Claude 3.7 is said to reach October 2024—suggesting GPT-4.5 has been “sitting around” longer than the market’s most recent models. Another practical clue is output length: GPT-4.5’s maximum output is described as 16,000 tokens, far below reasoning models that allow dramatically larger outputs (even if some of those tokens are reasoning tokens not directly visible).

Cost is the final pressure point. GPT-4.5 pricing is contrasted with GPT-4o: roughly $75 per million input tokens and $150 per million output tokens, versus GPT-4o’s $2.50 per million in and $10 per million out (and even cheaper rates for GPT-4o mini). The transcript argues that this makes GPT-4.5 hard to justify for most users unless it becomes a drop-in replacement with a major price cut. There’s also frustration about latency: a test described as taking about 1 minute 10 seconds for GPT-4.5, compared with ~13 seconds for GPT-4o and ~6.5 seconds for GPT-4o mini, raising concerns about day-to-day usability.
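
The pricing gap is easiest to see as arithmetic. A minimal sketch using the per-million-token rates quoted above (as reported in the transcript, not verified against current pricing):

```python
# Back-of-the-envelope cost comparison at the quoted rates:
# GPT-4.5: $75 in / $150 out; GPT-4o: $2.50 in / $10 out (per million tokens).
RATES = {  # (input $/1M tokens, output $/1M tokens)
    "gpt-4.5": (75.00, 150.00),
    "gpt-4o": (2.50, 10.00),
}

def request_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    rate_in, rate_out = RATES[model]
    return tokens_in / 1e6 * rate_in + tokens_out / 1e6 * rate_out

# A typical 1,000-in / 1,000-out request:
for model in RATES:
    print(f"{model}: ${request_cost(model, 1_000, 1_000):.4f}")
# gpt-4.5: $0.2250 vs gpt-4o: $0.0125 — roughly an 18x difference per call.
```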

Still, the transcript doesn’t dismiss GPT-4.5 entirely. It’s framed as potentially strong for structured outputs, function calling, and cleaner prose—just not as a clear winner for reasoning-heavy tasks or as a cost-effective upgrade. The closing question is whether the marginal improvements in interaction quality are worth paying a premium when faster, cheaper, and more capable reasoning models are already available across platforms.

Cornell Notes

GPT-4.5 is marketed as a larger, more natural, less verbose model with better alignment and improved structured output. In practice, benchmark comparisons cited in the transcript place it behind several reasoning-focused models (DeepSeek V3 and R1 on some tasks, OpenAI o3-mini and Claude 3.7 on others), especially on SWE-bench Verified and math evaluations. The transcript also flags signs of an older model generation: a knowledge cutoff of October 2023 and a relatively modest maximum output length (16,000 tokens) compared with newer reasoning models. Finally, pricing and latency are presented as major blockers—GPT-4.5 is described as far more expensive than GPT-4o and GPT-4o mini, and tests reportedly show slower generation times. The net effect: GPT-4.5 may be useful for formatting and tool-like outputs, but it’s not a clear upgrade for hard reasoning or cost-sensitive use.

What scaling approach is used to justify GPT-4.5, and how does that relate to the transcript’s criticism?

The transcript contrasts two scaling paths for LLMs: (1) scaling up training—more tokens, larger models, more data—and (2) scaling at inference time—using extra compute at test time and longer reasoning chains. GPT-4.5 is framed as a “bigger model” trained with the first approach, while the transcript argues that the most noticeable recent gains across the industry have come from the second approach (reasoning-focused models that spend more compute during inference). That mismatch is used to explain why GPT-4.5 can feel better conversationally yet still lose on reasoning benchmarks.

Which benchmarks are used to claim GPT-4.5 underperforms, and what do those comparisons suggest?

The transcript cites multiple benchmark comparisons. On MMLU-style tasks, GPT-4.5 is described as below DeepSeek V3. On SWE-bench Verified, GPT-4.5 is said to be “way behind” DeepSeek V3 and DeepSeek R1, behind an older Claude Sonnet baseline, and behind newer models like OpenAI o3-mini and Claude 3.7. On mathematics benchmarks, the transcript again claims GPT-4.5 is not competitive. Together, these results suggest GPT-4.5 is not optimized for the hardest reasoning and engineering evaluation regimes where newer reasoning models excel.

What “older model” indicators are mentioned beyond benchmark scores?

Two main indicators are highlighted. First, the knowledge cutoff for GPT-4.5 is given as October 2023, while Claude 3.7 is described as reaching October 2024—implying GPT-4.5 may be less up to date. Second, maximum output length is described as 16,000 tokens for GPT-4.5, compared with reasoning models that allow much larger outputs (e.g., 100,000 or 65,000 tokens). Even if some of those tokens are reasoning tokens, the transcript treats the gap as evidence GPT-4.5 hasn’t kept pace with newer capabilities.

How does the transcript evaluate GPT-4.5’s value using pricing and speed?

Pricing is presented as a decisive tradeoff. GPT-4.5 is described as $75 per million input tokens and $150 per million output tokens, contrasted with GPT-4o’s $2.50 per million in and $10 per million out (and GPT-4o mini’s much lower rates). The transcript argues that this makes GPT-4.5 hard to justify unless it becomes cheaper or a true drop-in replacement. Latency is also criticized: a test is reported to take about 1 minute 10 seconds for GPT-4.5 versus ~13 seconds for GPT-4o and ~6.5 seconds for GPT-4o mini, which would limit practical everyday use.
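
A simple way to reproduce the kind of timing test the transcript describes, assuming the standard OpenAI Python SDK; the model names and prompt are illustrative, and wall-clock times will vary with load:

```python
# Rough latency check in the spirit of the transcript's timing test.
import time
from openai import OpenAI

client = OpenAI()

def time_model(model: str, prompt: str) -> float:
    start = time.perf_counter()
    client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return time.perf_counter() - start

for model in ["gpt-4.5-preview", "gpt-4o", "gpt-4o-mini"]:
    elapsed = time_model(model, "Summarize the plot of Hamlet in five sentences.")
    print(f"{model}: {elapsed:.1f}s")
```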

What kinds of tasks does the transcript still credit GPT-4.5 for?

Despite the negative benchmark and cost framing, the transcript credits GPT-4.5 for “cleaner” output characteristics: less verbosity, more direct answers, and better structured output and function calling than GPT-4. It also notes that non-reasoning models often produce nicer prose and creative writing, and suggests GPT-4.5 may raise the quality bar there due to being larger—though at a high price.
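
For the structured-output strength in particular, here is a minimal sketch using the OpenAI SDK’s parse helper, which coerces a response into a typed schema; the model name and the Answer schema are assumptions for illustration:

```python
# Structured-output sketch via the SDK's parse helper (openai-python beta API).
# Hypothetical: "gpt-4.5-preview" and the Answer schema are illustrative.
from pydantic import BaseModel
from openai import OpenAI

class Answer(BaseModel):
    summary: str
    confidence: float

client = OpenAI()
resp = client.beta.chat.completions.parse(
    model="gpt-4.5-preview",
    messages=[{"role": "user", "content": "Summarize GPT-4.5's tradeoffs."}],
    response_format=Answer,  # the SDK converts this to a strict JSON schema
)
print(resp.choices[0].message.parsed)  # a validated Answer instance
```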

What role does the system prompt play in the transcript’s discussion of GPT-4.5 behavior?

The transcript claims that a new system prompt for ChatGPT 4.5 was published on Twitter and that experiments showed it made a difference. It also mentions using a jailbreak-focused account (Pliny the Liberator) to probe the model, with observations including a cutoff date around 2023 and references to tool-related behavior. The takeaway is that prompt configuration may affect behavior and safety boundaries, even if core benchmark performance remains limited.

Review Questions

  1. Which two scaling strategies are contrasted, and how does the transcript connect that contrast to GPT-4.5’s benchmark results?
  2. What evidence is used to argue GPT-4.5 is behind on reasoning tasks (name the benchmarks and the direction of the comparisons)?
  3. Why does the transcript say GPT-4.5 may be hard to adopt despite improvements in conversational quality—what do pricing and latency have to do with it?

Key Points

  1. GPT-4.5 is described as more natural and less verbose than GPT-4, with improved structured output and function calling.
  2. Benchmark comparisons cited in the transcript place GPT-4.5 behind several reasoning-focused models on tasks like SWE-bench Verified and mathematics.
  3. The transcript flags GPT-4.5’s October 2023 knowledge cutoff and 16,000-token max output as signs it may be based on older training priorities.
  4. Pricing is presented as a major adoption barrier: GPT-4.5’s per-token rates are portrayed as far higher than those of GPT-4o and GPT-4o mini.
  5. Reported latency tests suggest GPT-4.5 can be much slower than the GPT-4o variants, reducing its practicality for everyday use.
  6. The transcript’s overall conclusion is conditional: GPT-4.5 may be useful for formatting and tool-like workflows, but it’s not a clear upgrade for hard reasoning or cost-sensitive applications.

Highlights

GPT-4.5 is framed as conversationally stronger, but benchmark comparisons repeatedly place it behind DeepSeek and newer reasoning models on engineering and math evaluations.
A knowledge cutoff of October 2023 and a 16,000-token max output are used as signals that GPT-4.5 may not reflect the newest reasoning-era capabilities.
The transcript argues the pricing premium and slower generation times make GPT-4.5 difficult to justify as a drop-in replacement for GPT-4o.
