The New Claude 3.5 Sonnet: Better, Yes, But Not Just in the Way You Might Think

AI Explained · 5 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Claude 3.5 Sonnet’s standout improvements are in reasoning, coding, and visual question answering, not merely in new “computer control” features.

Briefing

Claude 3.5 Sonnet’s biggest upgrade isn’t a flashy new “computer control” trick—it’s a noticeable jump in reasoning, coding, and multimodal understanding, backed by both published benchmarks and the creator’s own independent testing. The model also carries world knowledge up to April 2024, but the practical takeaway is that it performs better on the kinds of tasks people actually use LLMs for: software engineering benchmarks, science and math questions, and visual question answering over charts and tables.

A key thread running through the results is that Claude 3.5 Sonnet improves across multiple evaluation suites, including OSWorld (a large set of office and daily computer-use tasks) and SWE-bench Verified (software engineering). In OSWorld, the human comparison baseline is drawn from computer science undergraduates who hadn’t seen the exact test samples, which makes the reported “human-level” gap feel less like a trivial yardstick and more like a meaningful bar. On SWE-bench Verified, Claude 3.5 Sonnet lands at 49%, beating the earlier Claude 3.5 Sonnet and also outperforming OpenAI’s o1-preview in the cited comparison. The transcript also notes that apples-to-apples comparisons are tricky because prompting and scaffolding differ, but the direction of improvement is consistent.

The reasoning gains show up beyond standard benchmarks. On the creator’s own Simple Bench (a test designed to be easy for humans, covering spatial reasoning and social intelligence), Claude 3.5 Sonnet scores higher than the previous version. Even with prompting and scaffolding, models still don’t reach the expanded human baseline, but the gap narrows compared with earlier Claude releases—an important point because it suggests progress is real, not just prompt engineering.

Where expectations should be tempered is reliability in agentic settings. Claude 3.5 Sonnet can use a computer via an API, yet broad adoption is unlikely due to unreliability and a long list of missing capabilities (for example, sending emails, making purchases, or reliably editing and manipulating images). The transcript leans on a separate “retail and airline” agent benchmark to make the reliability problem concrete: performance drops sharply when tasks must succeed every time across multiple attempts (the “pass-to-the-power-of-K,” or pass^K, metric). For airline booking and modification, one-try success rates are far from enough when the system must be correct repeatedly.

There’s also a nuance in safety behavior: Claude 3.5 Sonnet is said to refuse toxic requests slightly less often and refuse some innocent requests slightly more often than the prior model—small shifts, but worth noting when judging overall capability.

Finally, the transcript widens the lens beyond Claude. It highlights ongoing competition in AI entertainment and interactive modalities—Runway’s Act-One for character-driven video generation, HeyGen-style interactive avatars with “Zoom-like” roleplay, and NotebookLM’s ability to customize podcast-style conversations from uploaded sources using Gemini 1.5 Pro. Taken together, the central message is clear: Claude 3.5 Sonnet is better in the ways that matter for everyday work, but the last mile to major economic impact still hinges on dependable agent behavior, not just higher benchmark scores.

Cornell Notes

Claude 3.5 Sonnet improves most visibly in reasoning, coding, and multimodal understanding, with knowledge updated through April 2024. Reported gains come from major benchmarks like OSWorld (hundreds of office/daily computer tasks) and SWE-bench Verified (software engineering), where Claude 3.5 Sonnet reaches 49% and tops the cited comparisons. The transcript also adds a separate Simple Bench run, showing a step up over the previous Claude 3.5 Sonnet, though humans still lead. The main limiter isn’t raw competence—it’s reliability for agentic tasks that must succeed repeatedly, especially in high-stakes domains like airline booking. Small safety-behavior shifts are also noted, with slightly different refusal patterns versus the prior model.

What’s the most important upgrade in Claude 3.5 Sonnet, beyond “computer use” headlines?

The transcript emphasizes reasoning, coding, and visual understanding improvements. It cites better performance on science, general knowledge, coding, mathematics, and visual question answering over charts/tables/graphs compared with the earlier Claude 3.5 Sonnet. The model’s world knowledge is updated through April 2024, but the practical value is stronger task performance across common problem types.

Why does the OSWorld benchmark comparison matter, and what detail affects how to interpret it?

OSWorld includes 350+ tasks spanning professional office work and daily computer use. A key interpretive detail is how the “human average” was derived: the humans were computer science majors with basic software skills who hadn’t been exposed to the exact test samples or the specific software beforehand. That makes the reported human accuracy (around 72%) a more meaningful baseline than if the humans had prior familiarity with the test items.

How does Claude 3.5 Sonnet perform on software engineering benchmarks, and what comparison caveat is raised?

On SWE-bench Verified, Claude 3.5 Sonnet is reported at 49%, beating the earlier Claude 3.5 Sonnet and outperforming o1-preview in the cited comparison. The transcript warns that apples-to-apples comparisons are hard because results depend on prompting and scaffolding choices, so benchmark numbers should be read as directional evidence rather than perfectly controlled experiments.

What reliability problem emerges in agent benchmarks, and how does “pass-to-the-power-of-K” clarify it?

Agent benchmarks show steep drops when success must be consistent across multiple attempts. “Pass@K” counts a task as solved if the agent succeeds at least once within K tries; “pass-to-the-power-of-K” (pass^K) requires success on every one of the K attempts, so a single mistake fails the whole trial. For airline tasks, even a relatively strong one-try success rate (46% is cited) erodes under the repeated-success requirement (to around 40% at K=8 in the cited figures), and since pass^K can only fall as K grows, reliability emerges as the bottleneck.
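To make the gap between the two metrics concrete, here is a minimal sketch in Python under a simplifying assumption: every attempt on a task succeeds independently with the same probability p. The 46% figure is reused purely for illustration; this is an idealization, not the benchmark’s actual estimator.

```python
def pass_at_k(p: float, k: int) -> float:
    """pass@K: probability of succeeding at least once in K independent attempts."""
    return 1 - (1 - p) ** k

def pass_pow_k(p: float, k: int) -> float:
    """pass^K: probability of succeeding on all K independent attempts."""
    return p ** k

p = 0.46  # illustrative one-try success rate from the airline example
for k in (1, 2, 4, 8):
    print(f"K={k}: pass@K = {pass_at_k(p, k):.3f}, pass^K = {pass_pow_k(p, k):.3f}")
# pass@K climbs toward 1 as K grows, while pass^K collapses toward 0:
# the same agent looks strong or hopeless depending on which metric you read.
```

Under this uniform assumption pass^8 would be near zero; the reported figure at K=8 stays higher because real suites mix tasks the agent solves near-deterministically with tasks it rarely solves, and averaging per-task rates softens the decay.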

What does the transcript say about Claude 3.5 Sonnet’s computer-control capabilities and why that limits adoption?

Although Claude 3.5 Sonnet can use a computer via an API, broad public adoption is considered unlikely due to unreliability and missing capabilities. Examples of limitations mentioned include sending emails, making purchases, and reliably capturing or editing/manipulating images. The implication is that agentic autonomy still needs robustness before it becomes widely useful.
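For orientation, calling the computer-use capability looks roughly like the sketch below. The model name, tool type, and beta flag follow the API shapes Anthropic documented at the October 2024 launch; they come from that documentation rather than from the transcript, so treat them as assumptions.

```python
# Minimal sketch of Anthropic's computer-use beta; identifiers per the
# October 2024 documentation, not guaranteed current.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[{
        "type": "computer_20241022",  # built-in tool: screenshots plus mouse/keyboard actions
        "name": "computer",
        "display_width_px": 1024,
        "display_height_px": 768,
    }],
    messages=[{"role": "user", "content": "Open the spreadsheet and sum column B."}],
    betas=["computer-use-2024-10-22"],
)

# The reply contains tool_use blocks (click, type, screenshot, ...). The caller
# must execute each action in a sandboxed VM, return the result, and loop until
# the model stops requesting actions.
print(response.content)
```

Every round trip in that loop is a fresh chance to misclick or misread the screen, which is exactly the repeated-success failure mode the agent benchmarks above quantify.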

How does the transcript characterize safety/refusal behavior changes?

Claude 3.5 Sonnet is described as slightly worse at refusals: it correctly refuses toxic requests slightly less often and incorrectly refuses innocent requests slightly more often than the previous model. The change isn’t framed as dramatic, but it’s presented as a measurable difference worth tracking alongside capability gains.

Review Questions

  1. Which benchmarks are used to support the claim that Claude 3.5 Sonnet is better, and what kinds of tasks do they measure?
  2. Explain why “pass-to-the-power-of-K” can make an agent look much worse than “pass@K,” using the airline example described.
  3. What does the transcript suggest is the main remaining barrier to large economic impact from LLM agents?

Key Points

  1. Claude 3.5 Sonnet’s standout improvements are in reasoning, coding, and visual question answering, not merely in new “computer control” features.
  2. World knowledge is updated through April 2024, but the practical value comes from higher task performance across benchmarks.
  3. OSWorld comparisons are interpreted with care because the human baseline used computer science majors without prior exposure to the exact test samples.
  4. SWE-bench Verified results place Claude 3.5 Sonnet at 49%, outperforming the cited earlier Claude version and beating o1-preview in the referenced comparison.
  5. Agentic reliability remains the limiting factor: repeated-success metrics (pass^K) show sharp performance drops in high-stakes tasks like airline booking.
  6. Claude 3.5 Sonnet shows small shifts in refusal behavior: slightly fewer correct refusals of toxic requests and slightly more incorrect refusals of innocent ones.
  7. Beyond Claude, the transcript highlights rapid progress in AI video generation, interactive avatars, and source-to-podcast customization via NotebookLM and Gemini 1.5 Pro.

Highlights

Claude 3.5 Sonnet’s gains are framed as real competence improvements—reasoning, coding, and chart/table understanding—supported by OSWorld and SWE-bench Verified results.
Reliability is the key gap: agent success rates collapse when tasks must be correct repeatedly, not just once (pass^K).
Computer-control via API is treated as promising but not ready for broad adoption due to unreliability and missing actions like purchases and image edits.
Safety behavior shifts are subtle but measurable: slightly different refusal patterns versus the previous Claude 3.5 Sonnet.
Interactive AI modalities are accelerating in parallel—Runway Act-One, Zoom-like avatar roleplay, and NotebookLM podcast customization.
