Gemini 2.5 Computer Use MCP | On The Edge #7
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Gemini 2.5 computer use can be operationalized through an MCP server that separates browser actions from macOS desktop actions.
Briefing
Google DeepMind’s Gemini 2.5 “computer use” model is being put to work through a custom MCP server that can control both a Mac and a browser. The core takeaway from the demo is practical: the model can translate high-level goals—like finding a specific file, opening it in the right app, or filling out a web form—into sequences of UI actions, and it does so reliably enough to complete tasks, even if speed and edge cases still lag.
The setup splits control into two toolsets: browser use and macOS app control. On the browser side, the MCP server provides tools for taking screenshots and executing actions; the macOS side offers its own screenshot tool plus actions for interacting with the desktop, including clicking, typing, and launching apps. The author runs the MCP integration through Claude Code, then demonstrates four tools per environment (browser and Mac), with separate TypeScript implementations referenced from a single index.
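The two-environment split can be sketched as a minimal tool registry in plain TypeScript. This is an illustrative sketch only: the real server uses the MCP SDK, and all tool names, argument shapes, and return values here are assumptions, not the video's actual implementation.

```typescript
// Sketch of a tool registry with a browser toolset and a macOS toolset.
// Each environment exposes a screenshot tool plus action tools, mirroring
// the split described above. Tool names are hypothetical.

type ToolHandler = (args: Record<string, unknown>) => string;

const tools: Record<string, ToolHandler> = {
  // Browser toolset: observe the page, then act on it.
  browser_screenshot: () => "png:<browser viewport>",
  browser_action: (a) => `browser ${a.type} at ${a.x},${a.y}`,
  // macOS toolset: observe the desktop, then click/type/launch.
  mac_screenshot: () => "png:<desktop>",
  mac_click: (a) => `click ${a.x},${a.y}`,
  mac_type: (a) => `type "${a.text}"`,
  mac_open_app: (a) => `open ${a.app}`,
};

// Dispatch a tool call by name, as an MCP server would on tools/call.
function callTool(name: string, args: Record<string, unknown> = {}): string {
  const handler = tools[name];
  if (!handler) throw new Error(`unknown tool: ${name}`);
  return handler(args);
}
```

Keeping the two environments in one registry but under distinct name prefixes is one way a single index file could route to separate browser and macOS implementations.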
On macOS, the first test targets a real-world workflow: locating an MP4 file named “Elizabeth” (created earlier by Suno) and playing it in QuickTime Player. The model begins by searching the Finder, then types the filename, right-clicks the result, chooses “Open with,” selects QuickTime Player, and starts playback. It’s not fast—there’s a visible delay and at least one attempt where the model needs to adjust (including switching to smaller screenshots)—but it succeeds at the pixel-level UI navigation required to find the correct file and launch the correct application.
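The Finder-to-QuickTime workflow above can be written out as a hypothetical action trace. The step names, the search shortcut, and the click targets are assumptions for illustration, not the model's recorded output.

```typescript
// Hypothetical trace of the "find Elizabeth.mp4 and play it" task.
// Coordinates are omitted; details are illustrative labels only.

interface UIAction {
  tool: "screenshot" | "click" | "right_click" | "type" | "key";
  detail?: string;
}

const findAndPlay: UIAction[] = [
  { tool: "screenshot" },                          // observe the desktop first
  { tool: "key", detail: "cmd+space" },            // assumed: invoke search
  { tool: "type", detail: "Elizabeth" },           // type the filename
  { tool: "right_click", detail: "search result" },// context menu on the hit
  { tool: "click", detail: "Open With > QuickTime Player" },
  { tool: "click", detail: "Play" },               // start playback
];

// Count the steps that actually act on the UI (everything but screenshots).
const actionCount = findAndPlay.filter((a) => a.tool !== "screenshot").length;
```

Interleaving observation (screenshot) with pixel-level actions is what lets the model recover when a step goes wrong, such as the mid-run switch to smaller screenshots mentioned above.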
Next comes browser automation using a “robo form filler” page. The goal is to fill out a multi-field form while role-playing as “Neo from The Matrix,” generating plausible answers. The agent takes a screenshot of the page, plans the form-filling steps, and then proceeds to populate fields. It handles cookie prompts and begins entering data, starting with a first name consistent with Thomas Anderson / Neo. The run ends before every field is completed, not because the approach fails, but because the session hits a turn limit; raising the max-turn setting would likely let it finish.
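The screenshot→plan→act loop and its turn budget can be sketched as follows. Here `planNextAction` is a stand-in for the model call, and all names are assumptions rather than the video's code.

```typescript
// Minimal sketch of an agent loop with a turn budget, matching the
// max-turn behavior seen in the form-filling demo.

interface FormState {
  fields: string[]; // fields still to fill
  filled: string[]; // fields completed so far
}

// Stand-in planner: in reality this is a screenshot plus a model call.
function planNextAction(state: FormState): string | null {
  return state.fields[0] ?? null; // next empty field, or null when done
}

function runAgent(state: FormState, maxTurns: number): FormState {
  for (let turn = 0; turn < maxTurns; turn++) {
    const field = planNextAction(state);
    if (field === null) break;        // task complete before budget runs out
    state.filled.push(field);         // in reality: click the field and type
    state.fields = state.fields.slice(1);
  }
  return state; // may be incomplete if maxTurns was exhausted
}
```

With six fields and a budget of four turns, the loop stops with two fields untouched, which mirrors the early stop in the demo; a larger budget lets the same loop finish.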
A final test pushes the model into a more brittle workflow: using computer control to open Cursor, create a Python file named “hello YouTube.py,” write code that prints “hello YouTube,” save it, and run it in a terminal. The model manages to create, edit, and save the file, but execution is inefficient and error-prone: file naming and keyboard/layout issues cause confusion, and the terminal/run step takes a long time and involves missteps, including a “No such file” error and a rename to “hello.py.”
Overall, the demo frames Gemini 2.5 computer use as an incremental but meaningful step forward when paired with MCP tools: it can perform useful UI tasks across apps and websites, yet it still struggles with speed, turn budgeting, and precise developer workflows. The combination of richer context via MCP and a computer-use model is positioned as promising—while also carrying obvious risk when granting agents control of a real machine.
Cornell Notes
Gemini 2.5’s computer-use model is integrated into an MCP server that provides two action environments: browser control and macOS desktop control. In practice, the agent can complete UI-heavy tasks like searching for a specific MP4 file (“Elizabeth”), opening it with QuickTime Player, and playing it. In the browser, it can take screenshots, plan, and fill out a multi-field form while handling prompts like cookies—though it may stop early when a max-turn limit is reached. A developer workflow test (creating a Python file in Cursor and running it in a terminal) works partially but shows inefficiency and fragility around naming, keyboard/layout quirks, and command execution. The result: capable task completion, with clear room for reliability and speed improvements.
- How does the MCP server structure control for the agent?
- What macOS task demonstrates the model’s ability to navigate real UI elements?
- What does the browser form-filling test show, and what limits it?
- Why is the developer workflow test less impressive than the Finder and form demos?
- What overall pattern emerges across the three use cases?
Review Questions
- Which tool categories does the MCP server provide for browser use versus macOS control, and why does that matter for task execution?
- What specific steps did the agent perform to open the “Elizabeth” MP4 in QuickTime Player, and where did delays or errors appear?
- In the Cursor + terminal test, what kinds of issues prevented a smooth run, and how did the max-turn constraint affect earlier browser automation?
Key Points
1. Gemini 2.5 computer use can be operationalized through an MCP server that separates browser actions from macOS desktop actions.
2. The MCP integration uses screenshot tools plus action-execution tools to let the model navigate real UI elements like Finder menus and web forms.
3. A macOS workflow (searching for an MP4 named “Elizabeth,” opening it with QuickTime Player, and playing it) completed successfully despite noticeable slowness and at least one adjustment attempt.
4. Browser automation can fill complex forms by planning after a screenshot, handling prompts like cookies, and entering role-play answers, but may stop early due to max-turn limits.
5. Developer workflows are more fragile: file naming, saving, and terminal execution can derail due to UI/input mismatches and inefficient step selection.
6. Pairing computer-use models with MCP tool access can expand context and enable more useful end-to-end automation, but risk remains when controlling a real machine.