Bigger than OpenAI o1 - Claude 3.5 Agentic Computer Use
Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Anthropic’s Claude 3.5 “computer use” trains models to operate desktop software by interpreting screenshots and executing cursor, click, and typing actions.
Briefing
Anthropic’s Claude 3.5 models are being pushed into a new category: “computer use,” where the system can operate a computer like a person—moving a cursor, clicking buttons, typing text, and reacting to what’s on screen. The headline capability is available as a public beta through the API, and early demos show it can complete real workflows end-to-end, including form filling, basic coding changes, and scheduling tasks. The practical significance is straightforward: instead of limiting AI to chat or website-only automation, the model can interact with ordinary desktop software, turning routine computer work into something closer to autonomous execution.
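To make the loop concrete, here is a minimal sketch using the Anthropic Python SDK. The model name, beta flag, and tool type (claude-3-5-sonnet-20241022, computer-use-2024-10-22, computer_20241022) follow Anthropic's beta announcement at launch; the perform_action helper is a hypothetical stand-in for whatever desktop driver (VM, container, screen-control library) you supply yourself.

```python
# Minimal sketch of the computer-use loop: send the task, execute whatever
# cursor/click/typing actions the model requests, and feed results back.
# Identifiers follow Anthropic's beta announcement; perform_action() is a
# hypothetical stand-in for a real desktop driver.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def perform_action(action: dict) -> str:
    """Hypothetical executor: map an action like {"action": "screenshot"} or
    {"action": "left_click", "coordinate": [x, y]} onto your desktop driver."""
    raise NotImplementedError(f"wire this to a desktop driver: {action}")

tools = [{
    "type": "computer_20241022",  # beta tool type at launch
    "name": "computer",
    "display_width_px": 1280,
    "display_height_px": 800,
}]

messages = [{"role": "user", "content": "Fill out the vendor request form."}]

while True:
    response = client.beta.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=tools,
        messages=messages,
        betas=["computer-use-2024-10-22"],  # opt in to the computer-use beta
    )
    messages.append({"role": "assistant", "content": response.content})

    # Execute each requested action and return the results (often a fresh
    # screenshot) so the model can see the effect of what it just did.
    results = []
    for block in response.content:
        if block.type == "tool_use":
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": perform_action(block.input),
            })
    if not results:
        break  # no more actions requested; the model considers the task done
    messages.append({"role": "user", "content": results})
```

The loop shape explains a lot of what follows: the model only ever sees the screen through screenshots, so every action round-trips through the API, which is where the latency and token costs described below come from.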
The upgrade path matters too. Claude 3.5 Sonnet receives a “pretty decent boost across the board,” with the most notable gains in coding-related performance. Claude 3.5 Haiku is positioned as a smaller, faster model that matches the older Claude 3 Opus level of performance while costing less and running quicker. Anthropic also frames benchmarking carefully—arguing that OpenAI’s o1 models shouldn’t be directly compared to Claude 3.5 Sonnet—while still publishing results that place Claude 3.5 Sonnet and Haiku strongly across a range of tasks.
The most concrete demonstration of computer use is a fictional business workflow: Claude is asked to fill out a vendor request form using information scattered across a spreadsheet and a CRM. It starts by taking screenshots, notices the target company isn’t in the spreadsheet, switches to the CRM, searches for the company, scrolls to find the needed fields, transfers the information, and submits the form, all without manual intervention. A second demo targets coding. Claude navigates to the Claude website, prompts itself to generate a themed personal homepage, downloads the resulting files, opens them in VS Code, and starts a local server. When the terminal reports that the python command is missing, it retries with python3, then uses find-and-replace to remove the failing line and reruns the site.
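The python-to-python3 fallback is a generic recovery pattern; here is a rough sketch of the same logic done directly, assuming the generated homepage is a static site served with Python's built-in http.server (an illustration of the pattern, not Anthropic's code):

```python
# Rough sketch of the recovery Claude performed in the demo: prefer `python`,
# fall back to `python3` when the first command is missing, then start a
# local server for the downloaded site. Illustrative only.
import shutil
import subprocess

def start_local_server(site_dir: str, port: int = 8000) -> subprocess.Popen:
    # shutil.which() is the programmatic version of reading the
    # "command not found" error in the terminal and trying the next name.
    interpreter = shutil.which("python") or shutil.which("python3")
    if interpreter is None:
        raise RuntimeError("no Python interpreter found on PATH")
    return subprocess.Popen(
        [interpreter, "-m", "http.server", str(port)],
        cwd=site_dir,  # serve the downloaded homepage files from here
    )

if __name__ == "__main__":
    server = start_local_server(".")
    print(f"serving on http://localhost:8000 (pid {server.pid})")
```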
A third example shows task orchestration: Claude searches for a sunrise viewpoint near the user’s location, checks timing, and creates a calendar invite. Across these examples, the common thread is that the model isn’t just producing text—it’s executing multi-step actions by interpreting what it sees.
Still, access and constraints are a major story. The public beta is described as experimental and “cumbersome or error prone,” and the API-based setup introduces rate limits and safety restrictions. In hands-on testing described alongside the demos, the computer-use environment can stall on screenshot delays, hit token/request limits quickly, and refuse certain actions (such as creating social media accounts, sending messages, making phone calls, and performing tasks requiring personal authentication). The result is a tension: the capability looks like a major leap toward real-world automation, but the current sandbox and guardrails can make it hard for developers to fully stress-test or deploy it.
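For the rate limits specifically, the usual workaround is exponential backoff around each request. A minimal sketch, assuming the Anthropic Python SDK's RateLimitError exception and illustrative retry settings (actual limits vary by account tier):

```python
# Minimal backoff wrapper for the beta's rate limits: retry on RateLimitError
# with exponentially growing delays. The retry count and delays here are
# illustrative, not Anthropic's recommendations.
import time
import anthropic

client = anthropic.Anthropic()

def create_with_backoff(max_retries: int = 5, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.beta.messages.create(**kwargs)
        except anthropic.RateLimitError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, 16s
    raise RuntimeError("still rate-limited after retries")
```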
Overall, Anthropic’s push signals that agentic AI is moving from browsing and form-filling in web interfaces toward controlling general-purpose computers. If the beta constraints loosen and reliability improves, the implications are large—AI that can handle the repetitive, menial work of everyday desktop tasks could become a new baseline for productivity tools.
Cornell Notes
Anthropic is rolling out “computer use” for Claude 3.5, training its models to interact with a desktop environment like a human—seeing the screen, moving a cursor, clicking, and typing. In demos, Claude can complete workflows such as finding missing data in a CRM and submitting a vendor form, fixing coding errors in VS Code by reading terminal output, and planning a sunrise hike with timing and a calendar invite. Claude 3.5 Sonnet gets broad improvements with notable gains in coding, while Claude 3.5 Haiku aims to deliver Opus-level performance at lower cost and faster speed. The beta is API-only and comes with experimental reliability limits, rate limiting, and safety restrictions that can block or slow real testing.
- What does “computer use” mean in Anthropic’s Claude 3.5 rollout, and why is it different from earlier agent tools?
- How did Claude handle a real multi-step business workflow in the vendor form demo?
- What coding capabilities were demonstrated with computer use, and how did Claude recover from an error?
- What does the “orchestrating tasks” demo show beyond coding and forms?
- Why might developers find the public beta frustrating even if the demos look strong?
- How do Claude 3.5 Sonnet and Claude 3.5 Haiku differ in the rollout?
Review Questions
- What specific UI-level actions (cursor/click/typing) does computer use enable, and how does that change what an AI agent can accomplish?
- In the coding demo, what signals did Claude use to diagnose failures, and what concrete steps did it take to fix them?
- What kinds of safety restrictions and rate-limit behaviors can prevent computer-use agents from being practically testable in early public betas?
Key Points
1. Anthropic’s Claude 3.5 “computer use” trains models to operate desktop software by interpreting screenshots and executing cursor, click, and typing actions.
2. Claude 3.5 Sonnet receives broad improvements with notable gains in coding, while Claude 3.5 Haiku targets Opus-like performance at lower cost and faster speed.
3. Computer use is available as an API-based public beta, described as experimental and sometimes error-prone or cumbersome to set up.
4. Demos show end-to-end task completion: switching from spreadsheet to CRM to submit a vendor form, fixing VS Code/terminal errors, and creating calendar invites from planning queries.
5. Reliability and usability are constrained by screenshot-driven latency, token/request rate limits, and safety restrictions that block certain real-world actions.
6. Compared with web-only agents, computer use aims to interact with general-purpose applications, not just browser content.