
Gemini 2.5 Computer Use MCP | On The Edge #7

All About AI · 5 min read

Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Gemini 2.5 computer use can be operationalized through an MCP server that separates browser actions from macOS desktop actions.

Briefing

Google DeepMind’s Gemini 2.5 “computer use” model is being put to work through a custom MCP server that can control both a Mac and a browser. The core takeaway from the demo is practical: the model can translate high-level goals—like finding a specific file, opening it in the right app, or filling out a web form—into sequences of UI actions, and it does so reliably enough to complete tasks, even if speed and edge cases still lag.

The setup splits control into two toolsets: browser use and macOS app control. In the MCP server, the browser side provides tools such as taking screenshots and executing actions, while the macOS side adds screenshotting plus actions for interacting with the desktop (including clicking, typing, and launching apps). The author uses Claude Code to run the MCP integration, then demonstrates four tools per environment (browser and Mac), with separate TypeScript implementations referenced from a single index.
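The two-environment layout can be sketched as a single tool registry that routes each call to the right toolset. Everything below is illustrative: the tool names, handlers, and merge-in-one-index pattern are assumptions about the design, not code from the video.

```typescript
// Hypothetical sketch of the two-toolset layout described above.
// Tool and environment names are illustrative, not from the actual server.

type ToolHandler = (args: Record<string, unknown>) => string;

interface Toolset {
  [toolName: string]: ToolHandler;
}

// Browser toolset: screenshot plus action execution.
const browserTools: Toolset = {
  browser_screenshot: () => "png:browser-viewport",
  browser_execute: (args) => `browser did ${String(args.action)}`,
};

// macOS toolset: screenshot plus desktop actions (click, type, open app).
const macosTools: Toolset = {
  macos_screenshot: () => "png:desktop",
  macos_execute: (args) => `macos did ${String(args.action)}`,
};

// The "index" merges both modules so one server can route either environment.
const allTools: Toolset = { ...browserTools, ...macosTools };

function callTool(name: string, args: Record<string, unknown> = {}): string {
  const handler = allTools[name];
  if (!handler) throw new Error(`unknown tool: ${name}`);
  return handler(args);
}

console.log(callTool("macos_execute", { action: "open QuickTime Player" }));
console.log(callTool("browser_screenshot"));
```

Keeping the two toolsets in separate modules and merging them at a single index mirrors the structure described above, and would make it straightforward to add a third environment later.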

On macOS, the first test targets a real-world workflow: locating an MP4 file named “Elizabeth” (created earlier by Suno) and playing it in QuickTime Player. The model begins by searching the Finder, then types the filename, right-clicks the result, chooses “Open with,” selects QuickTime Player, and starts playback. It’s not fast—there’s a visible delay and at least one attempt where the model needs to adjust (including switching to smaller screenshots)—but it succeeds at the pixel-level UI navigation required to find the correct file and launch the correct application.
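The Finder workflow above is essentially an ordered list of UI actions. As a hypothetical illustration (the action schema and step wording are assumed, not taken from the model's actual output):

```typescript
// Illustrative action schema for the Finder → QuickTime workflow above.
// The real model emits pixel-level actions; this names only the steps.

interface UiAction {
  kind: "key" | "type" | "click" | "rightClick" | "menuSelect";
  target: string;
}

const playElizabethMp4: UiAction[] = [
  { kind: "key", target: "Finder search" },              // open search
  { kind: "type", target: "Elizabeth" },                 // filename query
  { kind: "rightClick", target: "Elizabeth.mp4" },       // context menu
  { kind: "menuSelect", target: "Open With > QuickTime Player" },
  { kind: "click", target: "Play" },                     // start playback
];

// A trivial "executor" that just narrates each step in order.
const transcript = playElizabethMp4.map(
  (a, i) => `${i + 1}. ${a.kind}: ${a.target}`
);
console.log(transcript.join("\n"));
```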

Next comes browser automation using a “robo form filler” page. The goal is to fill out a multi-field form by role-playing as “Neo from Matrix” and generating plausible answers. The agent takes a screenshot of the page, plans the form-filling steps, and then proceeds to populate fields. It handles cookie prompts and begins entering data (starting with a first name consistent with Thomas Anderson / Neo). The run ends before every field is completed, not because the approach fails, but because the session hits a turn limit; raising the max-turn setting would likely let it finish.
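The screenshot → plan → act cycle with a turn budget can be sketched as a loop. This toy version (field names and the one-field-per-turn pacing are assumptions) shows why a low max-turn setting stops the run mid-form while a higher one lets it finish:

```typescript
// Minimal sketch of a turn-budgeted screenshot → plan → act loop, assuming a
// hypothetical agent that fills one form field per turn.

const fields = ["firstName", "lastName", "email", "phone", "address", "city"];

function runFormFiller(maxTurns: number): string[] {
  const filled: string[] = [];
  for (let turn = 0; turn < maxTurns; turn++) {
    // (1) screenshot + plan would happen here; (2) act on the next field.
    const next = fields[filled.length];
    if (!next) break; // form complete
    filled.push(next); // e.g. firstName → "Thomas" (role-playing Neo)
  }
  return filled;
}

// A low turn budget stops the run mid-form, as in the demo...
console.log(runFormFiller(4)); // fills only 4 of 6 fields
// ...while a larger budget lets the agent finish.
console.log(runFormFiller(10)); // fills all 6
```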

A final test pushes the model into a more brittle workflow: using computer control to open Cursor, create a Python file named “hello YouTube.py,” write code that prints “hello YouTube,” save it, and run it in a terminal. The model manages to create, edit, and save the file, but execution is inefficient and error-prone: file naming and keyboard-layout issues cause confusion, and the terminal/run step takes a long time and involves visible missteps (including a “No such file” error and a rename to “hello.py”).

Overall, the demo frames Gemini 2.5 computer use as an incremental but meaningful step forward when paired with MCP tools: it can perform useful UI tasks across apps and websites, yet it still struggles with speed, turn budgeting, and precise developer workflows. The combination of richer context via MCP and a computer-use model is positioned as promising—while also carrying obvious risk when granting agents control of a real machine.

Cornell Notes

Gemini 2.5’s computer-use model is integrated into an MCP server that provides two action environments: browser control and macOS desktop control. In practice, the agent can complete UI-heavy tasks like searching for a specific MP4 file (“Elizabeth”), opening it with QuickTime Player, and playing it. In the browser, it can take screenshots, plan, and fill out a multi-field form while handling prompts like cookies—though it may stop early when a max-turn limit is reached. A developer workflow test (creating a Python file in Cursor and running it in a terminal) works partially but shows inefficiency and fragility around naming, keyboard/layout quirks, and command execution. The result: capable task completion, with clear room for reliability and speed improvements.

How does the MCP server structure control for the agent?

It splits tools into two categories: browser use and macOS app control. Each category includes screenshot capability plus an “execute actions” tool. The macOS side also targets desktop interactions needed to navigate Finder, right-click menus, and open apps like QuickTime Player. Separate TypeScript modules implement browser and macOS behavior, then the index references both so one MCP server can route actions to the right environment.

What macOS task demonstrates the model’s ability to navigate real UI elements?

The agent finds and plays an MP4 file named “Elizabeth.” It searches in Finder, types the filename, right-clicks the matching file, selects “Open with,” chooses QuickTime Player, and starts playback. The demo notes that it’s not fast and may require adjustments (like using smaller screenshots), but it still reaches the correct file and launches the correct application.

What does the browser form-filling test show, and what limits it?

The agent fills out a multi-field web form using a “robo form filler” page. It takes a screenshot, plans the steps, handles cookie prompts, and begins entering role-play answers as “Neo from Matrix” (starting with a first name consistent with Thomas Anderson). The run stops before every field is completed because the session hits a max-turn limit, not because the approach can’t operate the page.

Why is the developer workflow test less impressive than the Finder and form demos?

Creating and running code requires precise sequencing: opening Cursor, creating the correct filename, saving, then running the right command in a terminal. The agent creates the file, writes code that prints “hello YouTube,” and saves it, but naming and execution go wrong: there is a “No such file” error, a rename to “hello.py,” and the terminal/run step takes a long time. Keyboard-layout quirks and inefficient UI steps make the workflow brittle.
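One plausible cause of the “No such file” error is the space in “hello YouTube.py”: a run command built by naive string joining splits the filename into two shell arguments. A hypothetical sketch of the failure mode and a quoting fix:

```typescript
// Hypothetical illustration of why a filename containing a space can
// trigger "No such file" when the run command is built by string joining.

function naiveCommand(file: string): string {
  return `python ${file}`; // the shell splits "hello YouTube.py" into two args
}

function quotedCommand(file: string): string {
  // Single-quote the filename, escaping any embedded single quotes.
  const safe = `'${file.replace(/'/g, `'\\''`)}'`;
  return `python ${safe}`;
}

const file = "hello YouTube.py";
console.log(naiveCommand(file));  // python hello YouTube.py
console.log(quotedCommand(file)); // python 'hello YouTube.py'
```

Renaming to “hello.py,” as the agent eventually did, sidesteps the problem entirely, which is why space-free filenames are the usual convention for script files.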

What overall pattern emerges across the three use cases?

The model can translate goals into UI actions and complete tasks when the environment is visually straightforward (Finder search + menu navigation; form filling with clear fields). It becomes less reliable and slower when tasks require tight developer-tool precision (file naming, saving, and running commands), where small UI or input mismatches cascade into failures or extra steps.

Review Questions

  1. Which tool categories does the MCP server provide for browser use versus macOS control, and why does that matter for task execution?
  2. What specific steps did the agent perform to open the “Elizabeth” MP4 in QuickTime Player, and where did delays or errors appear?
  3. In the Cursor + terminal test, what kinds of issues prevented a smooth run, and how did the max-turn constraint affect earlier browser automation?

Key Points

  1. Gemini 2.5 computer use can be operationalized through an MCP server that separates browser actions from macOS desktop actions.

  2. The MCP integration uses screenshot tools plus action-execution tools to let the model navigate real UI elements like Finder menus and web forms.

  3. A macOS workflow—searching for an MP4 named “Elizabeth,” opening it with QuickTime Player, and playing it—completed successfully despite noticeable slowness and at least one adjustment attempt.

  4. Browser automation can fill complex forms by planning after a screenshot, handling prompts like cookies, and entering role-play answers, but may stop early due to max-turn limits.

  5. Developer workflows are more fragile: file naming, saving, and terminal execution can derail due to UI/input mismatches and inefficient step selection.

  6. Pairing computer-use models with MCP tool access can expand context and enable more useful end-to-end automation, but risk remains when controlling a real machine.

Highlights

  • The agent successfully located a specific MP4 (“Elizabeth”) in Finder, used “Open with,” selected QuickTime Player, and started playback—showing end-to-end desktop control.
  • Browser form filling worked through screenshot → plan → action cycles, including cookie handling, but completion depended on the turn budget.
  • The Cursor + terminal coding demo produced the most friction: file naming and run commands went off track, leading to delays and a “No such file” moment.
  • Speed and precision remain weaker than basic task completion, especially in developer-tool workflows.

Topics

  • Gemini 2.5 Computer Use
  • MCP Tooling
  • macOS UI Automation
  • Browser Form Filling
  • Cursor Terminal Automation
