
Claude has taken control of my computer...

Fireship · 5 min read

Based on Fireship's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

The Claude 3.5 Sonnet upgrade strengthens performance in reasoning, programming, and visual Q&A, and it leads the software engineering benchmark by a wide margin, solving 49% of the GitHub issues it encounters.

Briefing

Anthropic’s latest Claude upgrade pairs a top-tier software-engineering model with a new “computer use” capability that lets the system operate real desktop applications—opening apps, clicking through interfaces, typing into fields, and extracting results—using only natural-language instructions. The practical impact is immediate: instead of generating code or spreadsheet formulas on paper, Claude can drive the mouse and keyboard to complete tasks inside tools like Firefox, Excel/LibreOffice, and even image editors.

The model side of the update centers on Claude 3.5 Sonnet’s performance. It posts strong benchmark results across reasoning, programming, and visual Q&A, and it leads the software engineering benchmark by a wide margin, solving 49% of the GitHub issues it encounters. Head-to-head comparison is complicated by differences in prompting setups and by newer OpenAI models, including one described as using a Chain of Thought technique that can automatically reprompt itself.

The more consequential release is “computer use,” now available to developers via an API. In a live test, Claude successfully located the Fireship logo’s SVG by running a loop of actions: it took a screenshot, detected Firefox, clicked the address bar, navigated to the site, right-clicked the logo, opened developer tools, inspected the HTML, and copied the code. The workflow resembles web scraping driven by UI interaction rather than direct programmatic access.
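The screenshot-then-act loop can be sketched in miniature. This is not the real computer use API (which returns tool-use requests that a client executes against a live desktop); every function and state name below is a hypothetical stand-in, used only to show the loop's shape: observe the screen, pick the next UI action, repeat until the goal.

```python
# Minimal sketch of the observe-act loop described above. All names here
# (take_screenshot, choose_action, the state strings) are hypothetical
# stand-ins, not part of any real API.

def take_screenshot(step):
    """Stand-in for capturing the screen; returns a fake observation."""
    states = ["desktop", "firefox_open", "site_loaded", "devtools_open"]
    return states[min(step, len(states) - 1)]

def choose_action(observation):
    """Stand-in for the model mapping an observation to the next UI action."""
    policy = {
        "desktop": "click_firefox_icon",
        "firefox_open": "type_url",
        "site_loaded": "right_click_logo_open_devtools",
        "devtools_open": "copy_svg",   # goal action
    }
    return policy[observation]

def run_agent(max_steps=10):
    """Loop: screenshot -> decide -> act, until the goal or a step limit."""
    actions = []
    for step in range(max_steps):
        observation = take_screenshot(step)
        action = choose_action(observation)
        actions.append(action)
        if action == "copy_svg":       # goal reached, stop looping
            break
    return actions
```

The step limit matters: without it, the loop runs until the goal is reached or the process crashes, which is exactly the failure mode described later in the article.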

Claude also demonstrates spreadsheet automation. When asked to build a net worth calculator in Excel or LibreOffice, it not only enters data but generates the formulas needed for calculations. In another example, it opens X paint and produces a horse image using actual drawing strokes rather than diffusion-style image generation.

Still, the capability is far from reliable. During a coding task, Claude wandered onto the internet to browse Yellowstone National Park photos—an illustration of how an agent with broad UI control can make irrelevant or risky choices. The transcript frames the security stakes plainly: if such systems are used for high-value actions like managing bank accounts, the risk of financial loss or malicious behavior could grow.

To mitigate that risk, the test runs Claude inside a Docker-based sandbox rather than on a user’s main machine. Even then, the system is expensive and slow relative to human expectations. Tasks take minutes and Claude burns tokens rapidly because each step’s output becomes the next step’s input, continuing until the goal is reached or the process crashes.
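For reference, the containment setup looks something like the command below. The image name and port numbers follow Anthropic's computer-use quickstart container as published around the time of the video and may have changed since; treat this as an illustrative sketch, not current documentation.

```shell
# Sketch: run the computer-use demo in an isolated container so the agent
# can only act on the sandboxed desktop, never the host machine.
# Ports (illustrative): 5900 = VNC, 8501 = chat UI, 6080 = browser VNC.
docker run -it \
  -e ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
  -p 5900:5900 -p 8501:8501 -p 6080:6080 \
  ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest
```

The point of the sandbox is blast-radius control: a misclick inside the container cannot touch the user's real files or accounts.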

The broader takeaway is that “computer use” is a step toward agents that can perform everyday work by operating the same interfaces humans use. The transcript argues that compute and training bottlenecks remain—especially the token and time cost for simple actions—but predicts that action-capable models will eventually be embedded into everyday computers and robots, raising both productivity potential and long-term safety concerns.

Cornell Notes

Anthropic’s Claude upgrade combines strong benchmark performance with a new “computer use” feature that lets the model control a real desktop environment through the API. In tests, Claude can navigate apps like Firefox, inspect page HTML to extract an SVG, and complete spreadsheet tasks in Excel or LibreOffice by entering data and generating formulas. It can also operate drawing software like X paint to create an image using actual strokes. The capability is powerful but imperfect: it can get distracted mid-task and it consumes tokens quickly, making it costly and sometimes crash-prone. Running it in a Docker sandbox helps contain risk, but the ability to act on real systems raises obvious security and misuse concerns.

What changes with Claude beyond better benchmark scores?

The key shift is “computer use,” an API feature that allows Claude to control a computer via mouse/keyboard and interact with desktop applications. Instead of only producing text or code, it can execute multi-step UI workflows—like opening apps, clicking through menus, typing URLs, and copying extracted code.

How did Claude successfully extract the Fireship logo SVG?

Claude followed an action loop: it took a screenshot, detected Firefox, clicked the Firefox icon, moved to the address bar, typed the site URL, found the logo, right-clicked it, opened developer tools, inspected the HTML, and copied the SVG code. The workflow relied on observing the screen and then choosing the next UI action.

What did Claude do in spreadsheet software, and what does that imply?

When asked to build a net worth calculator in Excel or LibreOffice, Claude entered the provided data and created the formulas for the calculations. That suggests the system can translate a natural-language task into both spreadsheet structure and correct formula logic while operating the actual application UI.
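The underlying logic Claude encoded as spreadsheet formulas is simple: net worth is total assets minus total liabilities. A minimal sketch, with made-up category names and amounts (the video does not show the actual figures), and comments noting the rough spreadsheet-formula equivalents:

```python
# Hypothetical net-worth logic, mirroring what the spreadsheet formulas
# compute. All categories and amounts below are invented for illustration.

assets = {"checking": 4_000, "brokerage": 25_000, "home": 300_000}
liabilities = {"mortgage": 220_000, "credit_card": 1_500}

total_assets = sum(assets.values())            # like =SUM(B2:B4)
total_liabilities = sum(liabilities.values())  # like =SUM(B6:B7)
net_worth = total_assets - total_liabilities   # like =B5-B8

print(net_worth)  # 107500
```

What makes the demo notable is not this arithmetic but that Claude produced the equivalent formulas by typing into the live application UI.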

Why is “computer use” risky even if it’s impressive?

Because UI control expands the blast radius of mistakes and misuse. The transcript notes an example where Claude, during a coding task, went online to browse Yellowstone photos—showing how it can make irrelevant choices. If connected to sensitive accounts (like banking), the same autonomy could lead to financial harm.

What containment and cost issues come with running it?

The transcript emphasizes sandboxing: Claude runs in a Docker-based environment rather than on a user’s main computer. It also highlights that token usage is heavy and tasks take minutes; the system feeds each step’s output into the next prompt, continuing until success or a crash.
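The token-growth pattern described here compounds because everything said so far is re-sent on every step. A toy sketch, using a crude one-token-per-word estimate (real tokenizers differ) and invented message strings:

```python
# Sketch of why token usage compounds: each step's output (observation +
# action result) is appended to the conversation, so the prompt the model
# sees grows on every iteration. All strings and counts are illustrative.

def tokens(text):
    """Crude token estimate: one token per word (illustration only)."""
    return len(text.split())

conversation = ["goal: extract the logo SVG"]
tokens_per_step = []

for step in range(4):
    prompt = " ".join(conversation)         # everything so far is re-sent
    tokens_per_step.append(tokens(prompt))
    observation = f"step {step} screenshot shows the next UI state"
    action = f"step {step} action clicked an element"
    conversation += [observation, action]   # output becomes next input

print(tokens_per_step)  # [5, 19, 33, 47]
```

Because prompt size grows roughly linearly with step count, total tokens consumed grow quadratically over a long task, which is why even simple jobs burn tokens rapidly.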

What bottleneck stands between today’s agent behavior and human-level reliability?

The main bottleneck is compute time and token consumption for actions humans treat as trivial. The transcript notes that even simple tasks took 5–10 minutes, and it predicts that future action models will be baked into computers and robots once compute and training improve.

Review Questions

  1. What specific capabilities does “computer use” add compared with a standard text-only language model?
  2. Describe the multi-step loop Claude used to extract the SVG code from a website.
  3. What two constraints—one technical and one safety-related—limit real-world deployment of computer-controlling AI agents today?

Key Points

  1. The Claude 3.5 Sonnet upgrade strengthens performance in reasoning, programming, and visual Q&A, and it leads the software engineering benchmark by a wide margin, solving 49% of the GitHub issues it encounters.

  2. The “computer use” API feature enables Claude to operate real desktop applications by controlling mouse and keyboard through UI observation and action loops.

  3. In testing, Claude navigated Firefox, used developer tools to inspect HTML, and copied the SVG code for the Fireship logo.

  4. Claude can automate spreadsheet work in Excel or LibreOffice by entering data and generating formulas, and it can draw in X paint using actual stroke actions.

  5. The system is imperfect: it can make off-task decisions (like browsing unrelated photos) and may crash during long action sequences.

  6. Running computer-controlling agents in a Docker sandbox reduces risk, but token burn and multi-minute runtimes make it costly and slower than human workflows.

  7. Compute and token efficiency remain the central bottleneck for moving from today’s agent behavior toward more reliable, everyday autonomy in computers and robots.

Highlights

  • “Computer use” turns Claude into a UI-driving agent—navigating apps, clicking interfaces, and extracting results like SVG code via developer tools.
  • Spreadsheet automation goes beyond generating formulas: Claude can operate Excel or LibreOffice to enter data and build the calculation structure.
  • Despite sandboxing, the autonomy cuts both ways—Claude can wander online mid-task, underscoring safety and control challenges.
  • Token burn is a practical limiter: each step feeds into the next, making even simple tasks take minutes and sometimes end in crashes.

Topics

  • Claude Sonnet 3.5
  • Computer Use API
  • UI-Driven Agents
  • Software Engineering Benchmarks
  • Agent Safety