The New Claude 3.5 Sonnet: Better, Yes, But Not Just in the Way You Might Think
Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Claude 3.5 Sonnet’s standout improvements are in reasoning, coding, and visual question answering, not merely in new “computer control” features.
Briefing
Claude 3.5 Sonnet’s biggest upgrade isn’t a flashy new “computer control” trick—it’s a noticeable jump in reasoning, coding, and multimodal understanding, backed by both published benchmarks and the creator’s own independent testing. The model also carries world knowledge up to April 2024, but the practical takeaway is that it performs better on the kinds of tasks people actually use LLMs for: software engineering benchmarks, science and math questions, and visual question answering over charts and tables.
A key thread running through the results is that Claude 3.5 Sonnet improves across multiple evaluation suites, including OS World (a large set of office and daily-use tasks) and SWE-bench Verified (software engineering). In OS World, the human comparison baseline is drawn from computer science undergraduates who hadn’t seen the exact test samples, which makes the reported “human-level” gap feel less like a trivial yardstick and more like a meaningful bar. On SWE-bench Verified, the new Claude 3.5 Sonnet lands at 49%, beating the earlier version of the same model and also outperforming OpenAI’s o1-preview in the cited comparison. The transcript also notes that apples-to-apples comparisons are tricky because prompting and scaffolding differ, but the direction of improvement is consistent.
The reasoning gains show up beyond standard benchmarks. In the creator’s own “simple bench” (a human-friendly test covering spatial reasoning and social intelligence), Claude 3.5 Sonnet scores higher than the previous version. Even with prompting and scaffolding, models still don’t reach the expanded human baseline, but the gap narrows compared with earlier Claude releases—an important point because it suggests progress is real, not just prompt engineering.
Where expectations should be tempered is reliability in agentic settings. Claude 3.5 Sonnet can use a computer via an API, yet broad adoption is unlikely due to unreliability and a long list of missing capabilities (for example, sending emails, making purchases, or reliably editing/manipulating images). The transcript leans on a separate “retail and airline” agent benchmark to make the reliability problem concrete: performance drops sharply when tasks must succeed every time across multiple attempts (described via “pass-to-the-power-of-K”). For airline booking and modification, one-try success rates are far from enough when the system must be correct repeatedly.
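The gap between one-try and every-try success can be made concrete with a little probability. The sketch below is illustrative only (function names are my own, and it assumes independent attempts with a fixed per-attempt success rate, which real agent benchmarks don’t strictly satisfy), but it shows why “pass-to-the-power-of-K” collapses so quickly compared with the more familiar pass@K:

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of succeeding at least once in k independent attempts."""
    return 1 - (1 - p) ** k

def pass_pow_k(p: float, k: int) -> float:
    """Probability of succeeding on every one of k independent attempts."""
    return p ** k

# A hypothetical agent that succeeds 50% of the time on a single try:
p = 0.5
for k in (1, 2, 4, 8):
    print(f"k={k}: pass@k={pass_at_k(p, k):.3f}, pass^k={pass_pow_k(p, k):.3f}")
```

At p = 0.5, pass@4 is about 0.94 while pass^4 is about 0.06: the same agent looks near-reliable under a best-of-K lens and nearly useless under a must-always-succeed lens, which is exactly the situation in airline booking and modification tasks.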
There’s also a nuance in safety behavior: Claude 3.5 Sonnet is said to refuse toxic requests slightly less often and refuse some innocent requests slightly more often than the prior model—small shifts, but worth noting when judging overall capability.
Finally, the transcript widens the lens beyond Claude. It highlights ongoing competition in AI entertainment and interactive modalities—Runway’s Act One for character-driven video generation, HeyGen-style interactive avatars with “Zoom-like” roleplay, and NotebookLM’s ability to customize podcast-style conversations from uploaded sources using Gemini 1.5 Pro. Taken together, the central message is clear: Claude 3.5 Sonnet is better in the ways that matter for everyday work, but the last mile to major economic impact still hinges on dependable agent behavior, not just higher benchmark scores.
Cornell Notes
Claude 3.5 Sonnet improves most visibly in reasoning, coding, and multimodal understanding, with knowledge updated through April 2024. Reported gains come from major benchmarks like OS World (hundreds of office/daily tasks) and SWE-bench Verified (software engineering), where Claude 3.5 Sonnet reaches 49% and tops the cited comparisons. The transcript also adds a separate “simple bench” run, showing a step up over the previous Claude 3.5 Sonnet, though humans still lead. The main limiter isn’t raw competence—it’s reliability for agentic tasks that must succeed repeatedly, especially in high-stakes domains like airline booking. Small safety-behavior shifts are also noted, with slightly different refusal patterns versus the prior model.
- What’s the most important upgrade in Claude 3.5 Sonnet, beyond “computer use” headlines?
- Why does the OS World benchmark comparison matter, and what detail affects how to interpret it?
- How does Claude 3.5 Sonnet perform on software engineering benchmarks, and what comparison caveat is raised?
- What reliability problem emerges in agent benchmarks, and how does “pass-to-the-power-of-K” clarify it?
- What does the transcript say about Claude 3.5 Sonnet’s computer-control capabilities and why that limits adoption?
- How does the transcript characterize safety/refusal behavior changes?
Review Questions
- Which benchmarks are used to support the claim that Claude 3.5 Sonnet is better, and what kinds of tasks do they measure?
- Explain why “pass-to-the-power-of-K” can make an agent look much worse than “pass@K,” using the airline example described.
- What does the transcript suggest is the main remaining barrier to large economic impact from LLM agents?
Key Points
1. Claude 3.5 Sonnet’s standout improvements are in reasoning, coding, and visual question answering, not merely in new “computer control” features.
2. World knowledge is updated through April 2024, but the practical value comes from higher task performance across benchmarks.
3. OS World comparisons are interpreted with care because the human baseline used computer science majors without prior exposure to the exact test samples.
4. SWE-bench Verified results place Claude 3.5 Sonnet at 49%, outperforming the cited earlier Claude version and beating o1-preview in the referenced comparison.
5. Agentic reliability remains the limiting factor: repeated-success metrics (“pass-to-the-power-of-K”) show sharp performance drops in high-stakes tasks like airline booking.
6. Claude 3.5 Sonnet shows small shifts in refusal behavior—slightly fewer correct refusals of toxic requests and slightly more incorrect refusals of innocent ones.
7. Beyond Claude, the transcript highlights rapid progress in AI video generation, interactive avatars, and source-to-podcast customization via NotebookLM and Gemini 1.5 Pro.