NEW - Anthropic Updated Claude Models & Computer Use Agents!!
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Anthropic’s latest release pairs two upgraded Claude models with a new “computer use” capability that lets Claude interact with a user’s computer directly—turning natural-language requests into mouse clicks, keyboard input, and browser actions. The practical impact is straightforward: instead of relying on separate tools or search workflows, Claude can operate the interface itself, which makes agent-style automation feel much closer to “do the task for me” than “generate code or instructions I must run.”
The model lineup starts with an upgraded Claude 3.5 Sonnet, available immediately on Anthropic’s platform and through common deployment routes like Google Cloud Vertex AI and Amazon Bedrock. Benchmarks compare the new Sonnet against the prior Claude 3.5 Sonnet and several competitors (including GPT-4o, GPT-4o mini, Gemini 1.5 Pro, and Gemini 1.5 Flash) using zero-shot evaluations. The new Sonnet improves on nearly every test, with one exception where Gemini 1.5 Pro stays ahead. The most notable gains are in coding and agentic tool use, especially SWE-bench Verified, where performance rises from 33.4% to 49%. The transcript frames this as a meaningful step for developers and coding assistants, particularly for agentic coding and tool-using agent workflows.
A second announcement brings Claude 3.5 Haiku, which is not available immediately but is expected later this month. Haiku has historically appealed because it is fast and inexpensive while still delivering strong results, and the new version is positioned as even more capable, reportedly surpassing Claude 3 Opus on many tasks. On SWE-bench, the updated Haiku scores higher than the previous Sonnet version, reinforcing the idea that it is optimized for high-throughput “agent calls” where speed and cost matter. The first release is text-only, with image input expected to follow.
The biggest shift, though, is Anthropic’s computer use API. Rather than calling external tools for search or navigation, Claude can take actions directly in the user’s environment—opening a browser, performing searches, clicking through pages, and entering text. The transcript highlights OSWorld as a benchmark used to evaluate this kind of interface-level control, and suggests that pairing computer use with stronger models (like the newer Sonnet) should improve real-world agent performance.
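As a rough illustration of what “computer use as an API” looks like in practice, the sketch below assembles a request payload that enables the computer-use tool. The tool type `computer_20241022` and the display parameters mirror the identifiers Anthropic announced with this release, but treat the exact names as assumptions and check the current docs; the payload is only built here, not sent.

```python
# Sketch: assembling a computer-use request payload (assumed identifiers,
# based on Anthropic's announced beta; verify against current documentation).

def build_computer_use_request(task: str, width: int = 1024, height: int = 768) -> dict:
    """Build (but do not send) a Messages-style request enabling computer use."""
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "tools": [
            {
                # Built-in tool: Claude receives screenshots and emits UI actions
                # (mouse movement, clicks, typing) instead of calling external tools.
                "type": "computer_20241022",
                "name": "computer",
                "display_width_px": width,
                "display_height_px": height,
            }
        ],
        "messages": [{"role": "user", "content": task}],
    }

request = build_computer_use_request("Open a browser and search for today's weather")
print(request["tools"][0]["type"])
```

Sending this payload would also require the beta header announced alongside the release (`computer-use-2024-10-22`) and an API client; both are omitted to keep the sketch self-contained.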
Still, there’s a clear caution: giving an AI control over a computer introduces real risk. During Anthropic’s own demonstrations, the system reportedly produced “amusing errors,” including stopping recording and wandering into unrelated browsing (such as searching for Yellowstone National Park photos). The transcript also notes that commenters previously raised the idea of using a dedicated machine for such experiments.
In demos, Claude receives a screenshot of what’s on screen, then responds with concrete actions—mouse movement, clicks, and typing—enabling it to edit documents and complete multi-step tasks. The transcript’s takeaway is that Sonnet may serve as an “orchestrator” for complex work, while Haiku could handle rapid, repeated agent steps, and computer use could unify these into end-to-end automation—assuming users implement safeguards and retain the ability to stop or constrain the system quickly.
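The screenshot-in, actions-out loop described above can be sketched as a small dispatcher. The action names (`mouse_move`, `left_click`, `type`, `screenshot`) mirror those shown in Anthropic’s demos, but the executor below only logs stubbed operations; a real agent would wire these to OS-level automation (e.g. a library like pyautogui) and feed fresh screenshots back to the model after each step.

```python
# Minimal sketch of the screenshot -> action loop: each model-emitted action
# is translated into a (stubbed) UI operation. Real automation calls are
# replaced with log entries so the control flow is visible and testable.

def execute_action(action: dict, log: list) -> None:
    """Dispatch one action dict of the kind the model emits during computer use."""
    kind = action.get("action")
    if kind == "mouse_move":
        x, y = action["coordinate"]
        log.append(f"move:{x},{y}")           # real impl: move the cursor
    elif kind == "left_click":
        log.append("click")                   # real impl: click at cursor position
    elif kind == "type":
        log.append(f"type:{action['text']}")  # real impl: send keystrokes
    elif kind == "screenshot":
        log.append("screenshot")              # real impl: capture the screen
    else:
        raise ValueError(f"unknown action: {kind}")

# A plausible sequence for "type a query into a search box":
log: list = []
for step in [
    {"action": "screenshot"},
    {"action": "mouse_move", "coordinate": [640, 400]},
    {"action": "left_click"},
    {"action": "type", "text": "Yellowstone National Park"},
]:
    execute_action(step, log)

print(log)
```

The explicit dispatcher also shows where safeguards fit naturally: an allowlist of permitted actions, a step limit, or a kill switch can all be enforced at this single choke point before any real UI operation runs.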
Cornell Notes
Anthropic released an upgraded Claude 3.5 Sonnet and a new Claude 3.5 Haiku, alongside a “computer use” API that lets Claude control a user’s computer through UI actions like mouse clicks and keyboard input. The upgraded Sonnet is available immediately and shows large gains on coding and agentic tool-use benchmarks, including SWE-bench Verified rising from 33.4% to 49%. Claude 3.5 Haiku arrives later this month, is positioned as faster and cheaper for high-volume agent work, and is reported to outperform Claude 3 Opus on many tasks; it starts text-only. The computer use capability aims to replace tool-based workflows by letting Claude run tasks directly in the browser or apps, but it also raises safety concerns because the system can behave unexpectedly during demos. Overall, the combination points toward more practical, end-to-end agent automation.
- What changed with Claude 3.5 Sonnet, and why do the benchmark details matter?
- How is Claude 3.5 Haiku positioned differently from Sonnet?
- What is “computer use,” and how does it differ from tool-based agents?
- Why does OSWorld come up in the discussion of computer use?
- What safety concerns are raised, and what mitigation ideas appear?
- How might Sonnet and Haiku work together in an agent system?
Review Questions
- Which benchmark improvements are highlighted as the strongest evidence for the upgraded Claude 3.5 Sonnet, and what are the reported numbers?
- How does the computer use API change the workflow compared with tool-based search or navigation?
- What limitations are mentioned for Claude 3.5 Haiku at launch, and how is it expected to be used differently from Sonnet?
Key Points
1. Anthropic’s upgraded Claude 3.5 Sonnet is available immediately and is offered across Anthropic, Google Cloud Vertex AI, and Amazon Bedrock.
2. Zero-shot benchmark results show the new Sonnet improves across most tests, with a notable coding gain on SWE-bench Verified from 33.4% to 49%.
3. Claude 3.5 Haiku is expected later this month, starts text-only, and is positioned as a fast, cost-effective model for high-throughput agent work.
4. Haiku is reported to surpass Claude 3 Opus on many tasks and to score higher on SWE-bench than the previous Sonnet version.
5. The computer use API enables Claude to control a computer directly via UI actions (mouse movement, clicks, typing) rather than relying solely on external tools.
6. OSWorld is used as a benchmark for interface-level agent performance, and stronger models are expected to improve results.
7. Safety concerns remain central: demos reportedly included unexpected behavior, so users should plan for oversight and quick stopping mechanisms.