NEW Claude Just Launched! Get Full Test Results vs. ChatGPT-5 + How it Saves You Hours

5 min read

Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

The new Claude model is positioned as more than a generator: it repeatedly checks and fixes its own work, reducing hidden errors in decks, spreadsheets, and code.

Briefing

A new Claude model is drawing attention for one practical reason: it produces workplace-ready outputs while making it easier to see exactly where a human expert needs to intervene. In head-to-head tests against OpenAI’s ChatGPT-5 and Anthropic’s prior frontier model, Claude Opus 4.1, it stood out less for flashy generation and more for a disciplined habit of checking its own work: catching layout issues in PowerPoint, validating spreadsheet logic, and even verifying that code can actually run before claiming it’s ready.

The model’s strongest differentiator showed up during “real work” tasks: building multi-slide SaaS decks, drafting documents in an Amazon PRFAQ style, analyzing messy spreadsheets, and working inside Claude Code. Against Opus 4.1, it delivered clearer narrative structure and higher immediate usability, described as roughly “90% ready” on a first pass. That matters because the bottleneck in many AI-assisted workflows isn’t drafting text; it’s the time spent cleaning up slop, reconciling inconsistencies, and reworking artifacts until they’re credible enough to share.

A key mechanism behind that improvement is more visible internal quality control. The model provides running commentary that shows which tools it’s invoking and what it’s checking. During PowerPoint creation, it repeatedly measured pixel-level alignment between title text and visuals, flagged mismatches, and redid slides without being prompted. In code-related work, it validated that a Next.js project could start and run a dev server before returning results. The result is a workflow where the output comes with fewer hidden errors—and where the user can focus on judgment rather than detective work.
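
To make the dev-server check just described concrete, here is a minimal sketch in Python of what such a smoke test can look like: spawn `npm run dev` in the project directory and poll the local URL until it answers. This illustrates the idea of running code before declaring it ready; it is not Anthropic’s actual tooling, and the project path, port, and timeout are assumptions.

```python
import subprocess
import time
import urllib.error
import urllib.request

def dev_server_starts(project_dir: str,
                      url: str = "http://localhost:3000",
                      timeout: float = 60.0) -> bool:
    """Spawn `npm run dev` and poll the URL until it answers or we time out."""
    proc = subprocess.Popen(
        ["npm", "run", "dev"],
        cwd=project_dir,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            try:
                with urllib.request.urlopen(url, timeout=2):
                    return True          # got a response: the server is up
            except urllib.error.HTTPError:
                return True              # an error page still proves the server answered
            except (urllib.error.URLError, OSError):
                pass                     # nothing listening yet; keep polling
            time.sleep(1)
        return False
    finally:
        proc.terminate()                 # always shut the dev server back down
        proc.wait(timeout=10)

# Hypothetical usage: dev_server_starts("./my-next-app")
```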

The model also aims at a specific professional use case: turning raw, unstructured inputs into executive-ready narratives. In one test, it ingested 66 pages of voice-of-customer PDF quotes that were jumbled and out of order. It extracted meaningful themes and produced a PowerPoint narrative arc in one shot—something the tester previously found extremely hard to do manually at scale. The deck wasn’t claimed to be perfect, but it was close enough to enable rapid iteration, with subsequent refinements taking only minutes.
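
For teams who want to reproduce the shape of that task outside the Claude app, a minimal sketch follows, assuming the pypdf and anthropic Python packages, a hypothetical voc_quotes.pdf, and an assumed model alias. The video’s test ran inside Claude itself, so this only illustrates the pipeline: extract the raw quote text, then ask the model for a theme outline.

```python
from pypdf import PdfReader   # pip install pypdf anthropic
import anthropic

def quotes_to_themes(pdf_path: str) -> str:
    """Extract raw quote text from a PDF and ask a Claude model for a theme outline."""
    reader = PdfReader(pdf_path)
    raw = "\n".join(page.extract_text() or "" for page in reader.pages)

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model alias; substitute whatever is current
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": ("Group these customer quotes into 4-6 themes and order them "
                        "as a narrative arc for an executive deck:\n\n" + raw),
        }],
    )
    return msg.content[0].text

print(quotes_to_themes("voc_quotes.pdf"))  # hypothetical input file
```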

Another notable claim is robustness to prompting. The tester reports getting usable results from both highly structured prompts and casual, short instructions paired with data. That contrasts with frustration some users have had with ChatGPT-5 being more sensitive to prompt structure, to the point where “prompt packs” have been released to compensate.

Overall, the pitch is that Anthropic is betting on a future where teams still need PowerPoints, spreadsheets, and code execution, but benefit from clearer, more professional outputs that reduce “grunge time.” Instead of leaving users to spend hours wrestling with messy drafts, the model is positioned as a baseline for decisioning: it helps users quickly determine what’s right, what’s wrong, and what to revise. The broader payoff is less yelling at AI and more collaboration, in which human domain expertise can shine through because the machine’s outputs are clearer, more checkable, and trustworthy enough to iterate on.

Cornell Notes

The new Claude model is presented as a step forward for professional work because it generates outputs that are clearer, more checkable, and closer to “ready to use” than prior options. In tests against ChatGPT-5 and Claude Opus 4.1, it repeatedly caught and fixed issues—such as PowerPoint alignment problems and spreadsheet/code correctness—without requiring the user to micromanage. A standout example involved converting 66 pages of disorganized voice-of-customer quotes into an executive-ready PowerPoint narrative arc in one pass. The model also appears less fragile to prompting, producing usable results from both formal and casual prompt styles. The practical takeaway: faster iteration and more time spent on human decisions rather than cleaning up AI slop.

What was the most important difference the tester observed between the new Claude model and ChatGPT-5/Opus 4.1?

The model made it easier to see where human expertise needs to intervene because it produced clearer, more structured outputs and, crucially, checked its own work more aggressively. In PowerPoint tasks, it measured pixel overlap between title text and visuals, flagged mismatches, and redid slides on its own. In spreadsheet and code work, it validated formulas and even checked that a Next.js dev server could start and run before reporting the code as working. The tester contrasts this with ChatGPT-5’s tendency to “say it could do stuff” without the same level of verification.
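
As an illustration of the alignment check described above: a short sketch, assuming the python-pptx package and a hypothetical deck.pptx, that flags slides where the title box overlaps a picture. The model’s internal checks are not public; this just shows how such a test can be expressed in code.

```python
from pptx import Presentation                 # pip install python-pptx
from pptx.enum.shapes import MSO_SHAPE_TYPE

def slides_with_title_overlap(path: str):
    """Yield 1-based slide numbers where the title box overlaps a picture."""
    prs = Presentation(path)
    for num, slide in enumerate(prs.slides, start=1):
        title = slide.shapes.title
        if title is None or title.left is None:
            continue  # no title placeholder, or position inherited from the layout
        t_r = (title.left, title.top,
               title.left + title.width, title.top + title.height)
        for shape in slide.shapes:
            if shape.shape_type != MSO_SHAPE_TYPE.PICTURE or shape.left is None:
                continue
            s_r = (shape.left, shape.top,
                   shape.left + shape.width, shape.top + shape.height)
            # Axis-aligned rectangle intersection test; positions are in EMUs.
            if (t_r[0] < s_r[2] and s_r[0] < t_r[2] and
                    t_r[1] < s_r[3] and s_r[1] < t_r[3]):
                yield num
                break

for n in slides_with_title_overlap("deck.pptx"):   # hypothetical file
    print(f"slide {n}: title overlaps an image; re-layout needed")
```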

How did the model perform on “work artifact” tasks like decks, docs, and spreadsheets?

It was tested on multiple deliverable types: creating 11- to 12-slide SaaS decks, writing Amazon PRFAQ-style documents, building spreadsheets, and working in Claude Code. The tester reports that it beat Opus 4.1 on these head-to-head assignments, producing narrative clarity that felt close to publishable on the first pass, described as around “90% ready,” with polishing taking only additional minutes.

Why does the voice-of-customer example matter for real teams?

The tester fed the model 66 pages of raw voice-of-customer quotes that were out of order and unorganized. The model extracted meaningful narrative themes and produced an executive-ready PowerPoint arc in one shot. The tester frames this as a hard manual problem—quotes tend to “melt together” in the brain—so a tool that can connect specific quotes to coherent insights can save substantial time and improve the quality of the final narrative.

What evidence was given that the model is less sensitive to prompt wording?

The tester reports that it worked well with both a super formal prompt structure and a casual approach consisting of just two or three lines plus data. In both cases, outputs were described as usable and “healthy,” including PowerPoints that could be shown around an office. The implication is that Anthropic has invested in understanding office-style primitives (docs, decks, spreadsheets) well enough that users don’t need elaborate prompt engineering.

What workflow shift does the tester believe this enables—automation or decisioning?

Decisioning. The tester argues that while Opus 4.1 already enabled productivity, the new model pushes toward decisioning by making outputs clearer and more professional. That clarity reduces time spent on slop and shifts effort to judging what’s correct and iterating where it isn’t—turning AI from a draft generator into a collaborator that supports faster, higher-quality decisions.

How does the model’s “pushback” behavior relate to professional collaboration?

The tester describes the model as having opinions about what is correct and incorrect, and being willing to push back when something doesn’t feel right. That balance, being persuasive yet able to say “I don’t think that’s quite correct,” is framed as a move away from hyper-agreeable, overly compliant behavior. The desired end state is fewer confrontational interactions and more thoughtful human-AI teamwork.

Review Questions

  1. In what specific ways did the model demonstrate self-checking during PowerPoint creation, and why does that reduce user effort?
  2. What does the voice-of-customer test illustrate about the model’s ability to transform unstructured inputs into executive narratives?
  3. How does the tester connect prompt robustness to broader improvements in office-work outputs like docs, decks, and spreadsheets?

Key Points

  1. The new Claude model is positioned as more than a generator: it repeatedly checks and fixes its own work, reducing hidden errors in decks, spreadsheets, and code.

  2. Head-to-head tests report it beats Claude Opus 4.1 on professional deliverables like 11- to 12-slide SaaS decks and Amazon PRFAQ-style docs.

  3. PowerPoint quality improvements included pixel-level alignment checks between title text and visuals, followed by automatic slide corrections.

  4. A voice-of-customer example described converting 66 pages of disorganized quotes into an executive-ready PowerPoint narrative arc in one pass.

  5. The model is claimed to be less sensitive to prompt structure, producing usable outputs from both formal and casual prompt styles.

  6. The practical payoff is faster iteration and “decisioning” rather than time spent cleaning up AI slop.

  7. The model’s pushback behavior is framed as enabling a more professional human-AI collaboration, where expertise guides revisions.

Highlights

The model caught PowerPoint layout problems itself—measuring pixel overlap and redoing slides—rather than relying on the user to notice mistakes.
It validated code by checking that a Next.js dev server could actually start and run before confirming results.
A single run turned 66 pages of messy voice-of-customer quotes into an executive-ready PowerPoint narrative arc.
First-pass decks were described as roughly “90% ready,” enabling multiple iterations in minutes rather than hours.

Topics

  • Claude Model
  • PowerPoint Automation
  • Voice of Customer
  • Prompt Robustness
  • Claude Code
  • AI Decisioning
