
Claude Computer Use TESTED - This is VERY Promising!

All About AI · 4 min read

Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.

TL;DR

The tested computer-use system can control a browser to execute multi-step workflows with checkpoints, not just produce text.

Briefing

A new “computer use” setup for Claude 3.5 Sonnet is being tested in a real desktop workflow—opening apps, navigating websites, writing and running code, and manipulating files—using a Docker-based environment that connects the model to a virtual workspace. The core takeaway from the hands-on run: the system can reliably carry out multi-step, agent-like tasks (search, click, type, save, then verify results), but it still stumbles on interactive edge cases such as complex gameplay, and it occasionally fumbles file handling.
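
Under the hood, a setup like this is a loop: the model emits tool calls (screenshot, click, type, run a command), the harness executes them against the virtual desktop, and the results are fed back. As a minimal sketch of the entry point—assuming the anthropic Python SDK and Anthropic's October 2024 computer-use beta, which the video's repo may wire up differently—a request that enables the desktop, editor, and shell tools looks roughly like this:

```python
# Minimal sketch of a computer-use request. Assumes the anthropic SDK and the
# October 2024 "computer-use-2024-10-22" beta; the exact repo tested in the
# video may differ in details.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[
        # Virtual desktop the agent can screenshot, click, and type into.
        {
            "type": "computer_20241022",
            "name": "computer",
            "display_width_px": 1024,
            "display_height_px": 768,
        },
        # File editing and shell access, used in the coding tasks below.
        {"type": "text_editor_20241022", "name": "str_replace_editor"},
        {"type": "bash_20241022", "name": "bash"},
    ],
    messages=[
        {"role": "user", "content": "Open Firefox and search Google for a computer-use coding video."}
    ],
    betas=["computer-use-2024-10-22"],
)
print(response.content)  # tool_use blocks the harness must execute, then loop back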

The test begins with a sequential checklist: the system opens Firefox, searches Google, and then stops once it finds a relevant “computer use” coding video. That early success matters because it demonstrates the model isn’t just generating text—it’s controlling a browser UI and following a task flow with checkpoints.

Next comes a coding task to validate end-to-end execution. The agent writes Python code for bubble sort, saves it to a temporary file, opens a terminal, and runs the script. It then produces output and creates a file artifact, showing the loop from “write code → save → execute → inspect results” works in practice. The workflow is fast enough to feel interactive, and the agent can switch between the editor and terminal windows.
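
The generated source isn't shown verbatim on screen, so as a reconstructed stand-in, the kind of bubble sort script the agent writes, saves to a temp file, and runs might look like this:

```python
# bubble_sort.py - a reconstruction of the kind of throwaway script the agent
# saves and runs in a terminal (the video doesn't show the exact code).
def bubble_sort(items):
    """Sort a list in place by repeatedly swapping adjacent out-of-order pairs."""
    n = len(items)
    for i in range(n):
        swapped = False
        for j in range(n - 1 - i):  # the last i elements are already in place
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        if not swapped:  # early exit once a full pass makes no swaps
            break
    return items

if __name__ == "__main__":
    print(bubble_sort([64, 34, 25, 12, 22, 11, 90]))
```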

The experiment then shifts to a more agentic data-gathering job: pulling the top five Hacker News titles and vote counts and inserting them into a spreadsheet. The system navigates to hackernews.com, extracts the requested fields, and types them into the correct columns. Minor errors appear—like duplicating a number—but the overall structure holds: it can move from web browsing to spreadsheet entry and complete the task.
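
The agent gathered this data by clicking through the page, but for comparison, the same five titles and vote counts are exposed programmatically via the official Hacker News Firebase API. A scripted sketch (not the UI-driven method shown in the video):

```python
# Fetch the top five Hacker News titles and scores via the official Firebase
# API - a scripted stand-in for what the agent did by browsing the site and
# typing values into spreadsheet cells.
import json
from urllib.request import urlopen

BASE = "https://hacker-news.firebaseio.com/v0"

def fetch(path):
    with urlopen(f"{BASE}/{path}.json") as resp:
        return json.load(resp)

top_ids = fetch("topstories")[:5]
for story_id in top_ids:
    item = fetch(f"item/{story_id}")
    # Two columns, mirroring the spreadsheet layout in the test.
    print(f'{item["title"]}\t{item.get("score", 0)}')
```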

To test higher-stakes interaction, the agent plays chess on chess.com against the computer. It handles cookie/privacy prompts and starts a match, but performance degrades during gameplay. It eventually gives up, starts a new game, and then makes odd moves—repeatedly selecting the rook in a way that leads to losing material. The session ends with the operator stopping the run, suggesting the system’s planning and rule-consistent decision-making still lags in complex, turn-based environments.

Finally, the agent tackles an image workflow: search for a bird image, save it, install a dependency, write a Python script to crop the image to 500×500, and display the result. This segment highlights both resilience and fragility. The system hits errors (missing or mis-referenced files, missing image viewer tooling), but it retries with alternative commands—installing an image viewer and continuing until the output can be opened. The resulting crop is square, though the original image doesn’t fully fill the frame.
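
The cropping step itself reduces to a few lines of Pillow. A minimal sketch, assuming a hypothetical filename bird.jpg (the video doesn't show the generated script):

```python
# Center-crop an image to 500x500 with Pillow; bird.jpg is a hypothetical
# filename standing in for whatever the agent saved.
from PIL import Image

img = Image.open("bird.jpg")
w, h = img.size
left = max((w - 500) // 2, 0)
top = max((h - 500) // 2, 0)
cropped = img.crop((left, top, left + 500, top + 500))
# If the source is smaller than 500 px in either dimension, Pillow pads the
# overflow with black - one way the subject can fail to fill the square,
# as seen in the test output.
cropped.save("bird_cropped.jpg")
cropped.show()  # hands off to an external viewer, the step that required installing one
```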

Overall, the test paints computer use as genuinely promising for practical automation—especially for browser-driven tasks and file-based coding workflows—while also showing clear failure modes in memory limits, interactive games, and brittle UI/file assumptions. The setup is offered via a GitHub repo and run locally through Docker with an API key, making it accessible for further experimentation.

Cornell Notes

Claude 3.5 Sonnet “computer use” is tested in a Docker-based local setup that can control a virtual desktop: it opens Firefox, searches the web, writes and runs Python code, fills a spreadsheet from Hacker News, and performs an image pipeline (download → crop → display). The strongest results come from structured, multi-step workflows where the agent can follow UI actions and then verify outputs (code execution, spreadsheet entries, saved files). Interactive complexity remains a weak spot: chess gameplay on chess.com starts correctly but later collapses into strange moves and repeated rook selection. Image tasks show resilience—errors trigger retries with installs and alternative steps—though output quality depends on the source image and tooling.

What sequence of actions shows the system can do more than generate text?

The run starts by opening Firefox, searching Google, and then stopping after finding a relevant “computer use” coding video. That demonstrates a UI-control loop: navigate → search → select the right target → halt at the intended checkpoint rather than continuing blindly.

How does the test validate that the agent can complete a coding workflow end-to-end?

It writes Python code for bubble sort, saves it to a temporary file, opens a terminal, and runs the script. The agent then shows output and creates an expected file artifact, indicating it can connect editor actions to execution and verification.

Why is the Hacker News-to-spreadsheet task a meaningful benchmark?

It combines web navigation and structured data entry. The agent visits hackernews.com, extracts the top five titles and vote counts, and types them into the spreadsheet columns. The run is mostly correct but includes a small data-entry mistake (a number typed twice), highlighting both competence and the need for validation.

What failure mode appears in the chess.com experiment?

After starting a match and handling prompts, the agent eventually gives up and restarts. In subsequent play it makes odd, rule-inconsistent moves—repeatedly selecting the rook—and loses material. The session ends when it keeps choosing the same problematic move pattern.

How does the image-cropping test show both robustness and brittleness?

The agent searches for a bird image, saves it, writes and runs a crop script to 500×500, and then tries to display the result. It encounters errors like missing files and missing image viewer support, but it retries by installing an image viewer and continuing. The final crop opens successfully, though the original image doesn’t fully fill the square output.

Review Questions

  1. Which tasks in the test include a clear verification step (e.g., running code, opening an output file), and how did those verification steps affect confidence in the results?
  2. What specific behaviors during the chess match suggest a planning or rule-consistency limitation?
  3. In the image workflow, what were the main sources of error, and what retry strategies helped the agent recover?

Key Points

  1. The tested computer-use system can control a browser to execute multi-step workflows with checkpoints, not just produce text.
  2. A full coding loop works: generate Python code, save it, run it in a terminal, and inspect results.
  3. Browser-to-spreadsheet automation is feasible: the agent can extract top items from Hacker News and populate spreadsheet columns, with occasional data-entry errors.
  4. Interactive, rule-heavy environments like chess.com expose weaknesses: the agent starts matches but later makes strange moves and may restart or give up.
  5. Image pipelines can succeed despite errors: the agent retries after missing files or missing viewer tools by installing dependencies and attempting alternative steps.
  6. Resource constraints matter: one run required opening a new container after memory usage hit about 4 GB.

Highlights

The agent successfully completes a bubble sort workflow: it writes Python, saves a file, opens a terminal, and runs the script to produce output.
Hacker News extraction to spreadsheet entry works end-to-end, with only minor mistakes like duplicated values.
Chess gameplay starts correctly but collapses into odd move selection—especially repeated rook targeting—leading to failure.
When image cropping fails due to tooling or file issues, the agent keeps trying by installing missing components and reattempting until the output can be opened.