Claude has taken control of my computer...
Based on Fireship's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Anthropic’s latest Claude upgrade pairs a top-tier software-engineering model with a new “computer use” capability that lets the system operate real desktop applications—opening apps, clicking through interfaces, typing into fields, and extracting results—using only natural-language instructions. The practical impact is immediate: instead of merely emitting code or spreadsheet formulas as text, Claude can drive the mouse and keyboard to complete tasks inside tools like Firefox, Excel/LibreOffice, and even image editors.
The model side of the update centers on Claude 3.5 Sonnet. It posts strong benchmark results across reasoning, programming, and visual Q&A, and it leads the software engineering benchmark by a wide margin, solving 49% of the GitHub issues it encounters. Direct head-to-head comparison is complicated by differences in prompting setups and by newer OpenAI models, including one described as using a chain-of-thought technique that can automatically reprompt itself.
The more consequential release is “computer use,” now available to developers via an API. In a live test, Claude successfully located the Fireship logo’s SVG by effectively running a loop of actions: it took a screenshot, detected Firefox, clicked the address bar, navigated to the site, right-clicked the logo, opened developer tools, inspected the HTML, and copied the code. The workflow resembles web scraping driven by UI interaction rather than direct programmatic access.
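That loop of actions can be sketched as a simple observe-act cycle. The sketch below is a toy simulation, not Anthropic's actual API: the "screen" is a mock dictionary, and the scripted policy stands in for the model's decision at each step. All names here are hypothetical.

```python
# Toy simulation of the observe-act loop behind "computer use":
# the agent repeatedly observes a (mock) screen, picks one action,
# applies it, and feeds the completed step back into its context.

def run_agent(screen, policy, max_steps=20):
    history = []
    for _ in range(max_steps):
        observation = dict(screen)            # a "screenshot" of current state
        action = policy(observation, history)
        if action["type"] == "done":
            return action["result"], len(history)
        screen[action["key"]] = action["value"]   # apply the UI action
        history.append((observation, action))     # context grows each step
    raise RuntimeError("step budget exhausted before reaching the goal")

# A scripted policy mimicking the stages of the SVG-extraction demo.
STAGES = ["open_browser", "navigate", "open_devtools", "copy_svg"]

def policy(obs, history):
    for stage in STAGES:
        if not obs.get(stage):
            return {"type": "ui", "key": stage, "value": True}
    return {"type": "done", "result": "<svg>...</svg>"}

result, steps = run_agent({}, policy)
print(result, steps)  # <svg>...</svg> 4
```

The important structural point survives the simplification: every iteration re-observes the environment, and the accumulated history feeds the next decision until the goal is met or the step budget runs out.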
Claude also demonstrates spreadsheet automation. When asked to build a net worth calculator in Excel or LibreOffice, it not only enters the data but also generates the formulas needed for the calculations. In another example, it opens X paint and draws a horse using actual drawing strokes rather than diffusion-style image generation.
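The spreadsheet demo boils down to a single formula, net worth = assets - liabilities. A sketch with made-up figures (the account names and amounts below are illustrative, not from the demo):

```python
# Hypothetical figures illustrating the net-worth formula Claude
# generated in the spreadsheet demo: net worth = assets - liabilities.
assets = {"checking": 4_200.0, "brokerage": 18_500.0, "car": 9_000.0}
liabilities = {"credit_card": 1_300.0, "student_loan": 12_400.0}

net_worth = sum(assets.values()) - sum(liabilities.values())
print(net_worth)  # 18000.0
```

In a spreadsheet the same logic becomes SUM formulas over the asset and liability ranges, which is exactly the kind of structure Claude filled in during the test.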
Still, the capability is far from reliable. During a coding task, Claude wandered onto the internet to browse Yellowstone National Park photos—an illustration of how an agent with broad UI control can make irrelevant or risky choices. The transcript frames the security stakes plainly: if such systems are entrusted with high-value actions like managing bank accounts, the risk of financial loss or malicious behavior grows accordingly.
To mitigate that risk, the test runs Claude inside a Docker-based sandbox rather than on a user’s main machine. Even then, the system is expensive and slow relative to human expectations. Tasks take minutes and Claude burns tokens rapidly because each step’s output becomes the next step’s input, continuing until the goal is reached or the process crashes.
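The token burn has a simple arithmetic cause: because each call re-sends the entire accumulated context, total billed prompt tokens grow roughly quadratically with step count. A back-of-envelope sketch (the base and per-step token sizes are made-up assumptions):

```python
# Illustrative arithmetic (made-up sizes): if each agent step appends
# roughly `step_tokens` to the running context, the tokens billed
# across n steps grow quadratically, since every step re-sends the
# whole history accumulated so far.
def total_prompt_tokens(n_steps, base=500, step_tokens=800):
    total = 0
    context = base
    for _ in range(n_steps):
        total += context          # each call re-sends the full context
        context += step_tokens    # the step's output is appended
    return total

print(total_prompt_tokens(10))   # 41000
```

Under these assumptions, doubling the number of steps roughly quadruples the prompt-token bill, which is why even simple multi-step tasks get expensive.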
The broader takeaway is that “computer use” is a step toward agents that can perform everyday work by operating the same interfaces humans use. The transcript argues that compute and training bottlenecks remain—especially the token and time cost for simple actions—but predicts that action-capable models will eventually be embedded into everyday computers and robots, raising both productivity potential and long-term safety concerns.
Cornell Notes
Anthropic’s Claude upgrade combines strong benchmark performance with a new “computer use” feature that lets the model control a real desktop environment through the API. In tests, Claude can navigate apps like Firefox, inspect page HTML to extract an SVG, and complete spreadsheet tasks in Excel or LibreOffice by entering data and generating formulas. It can also operate drawing software like X paint to create an image using actual strokes. The capability is powerful but imperfect: it can get distracted mid-task and it consumes tokens quickly, making it costly and sometimes crash-prone. Running it in a Docker sandbox helps contain risk, but the ability to act on real systems raises obvious security and misuse concerns.
What changes with Claude beyond better benchmark scores?
How did Claude successfully extract the Fireship logo SVG?
What did Claude do in spreadsheet software, and what does that imply?
Why is “computer use” risky even if it’s impressive?
What containment and cost issues come with running it?
What bottleneck stands between today’s agent behavior and human-level reliability?
Review Questions
- What specific capabilities does “computer use” add compared with a standard text-only language model?
- Describe the multi-step loop Claude used to extract the SVG code from a website.
- What two constraints—one technical and one safety-related—limit real-world deployment of computer-controlling AI agents today?
Key Points
1. Claude 3.5 Sonnet’s upgrade strengthens performance in reasoning, programming, and visual Q&A, and it leads the software engineering benchmark by a wide margin, solving 49% of encountered GitHub issues.
2. The “computer use” API feature enables Claude to operate real desktop applications by controlling the mouse and keyboard through UI observation-and-action loops.
3. In testing, Claude navigated Firefox, used developer tools to inspect HTML, and copied the SVG code for the Fireship logo.
4. Claude can automate spreadsheet work in Excel or LibreOffice by entering data and generating formulas, and it can draw in X paint using actual stroke actions.
5. The system is imperfect: it can make off-task decisions (like browsing unrelated photos) and may crash during long action sequences.
6. Running computer-controlling agents in a Docker sandbox reduces risk, but token burn and multi-minute runtimes make it costly and slower than human workflows.
7. Compute and token efficiency remain the central bottleneck for moving from today’s agent behavior toward more reliable, everyday autonomy in computers and robots.