Project Mariner (Google AI Agent) - First 5 Tests and Impression
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Google’s Project Mariner browser agent delivers a mixed but promising first impression: it can reliably navigate the web, complete straightforward search-and-find tasks, and even execute simple code via an online Python runner—while running into hard limits around actions like sending emails and some interactive chat flows.
In the first test, Mariner successfully searched YouTube for the creator’s “Google Flow all about AI” video and returned a concrete metric: 25,000 views. The workflow looked like a typical agent loop—prepare a session, browse to the relevant site, run the query, then confirm completion—ending with an explicit “task complete” status and the view count displayed.
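Mariner’s internals aren’t shown in the video, but the observable workflow maps onto a standard observe-act-confirm agent loop. A minimal, self-contained sketch of that loop, where every function name is a hypothetical stand-in and the browser is simulated with stubs:

```python
# Illustrative sketch only: Mariner's internals are not public. Every name
# here is a hypothetical stand-in, and the "browser" is simulated with stubs
# so the loop structure (prepare, browse, query, confirm) stands out.

def prepare_session() -> dict:
    """Stand-in for opening a fresh browser context."""
    return {"history": []}

def browse(session: dict, url: str) -> dict:
    """Stand-in for navigating to the relevant site."""
    session["history"].append(url)
    return {"url": url}

def run_query(page: dict, query: str) -> str:
    """Stand-in for running the query and reading the result off the page."""
    return "25,000 views"  # the concrete metric the first test returned

def run_agent_task(url: str, query: str) -> str:
    session = prepare_session()
    page = browse(session, url)
    result = run_query(page, query)
    # A real agent would keep acting until a completion check passes; the
    # video shows an explicit "task complete" status at this point.
    return f"task complete: {result}"

print(run_agent_task("https://www.youtube.com", "Google Flow all about AI"))
```

The stubbed run_query hard-codes the view count only to keep the sketch runnable; the real agent reads it from the live page.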
The second test targeted Gmail automation. Mariner could gather information from the web about Anthropic’s latest Claude livestream events, but it stalled when asked to send an email. Even after the user manually logged into Gmail via the takeover feature, the agent refused to complete the send step, returning a “cannot do that” style refusal. The result was a partial win: it collected the livestream details, but it couldn’t perform the final outbound action.
The third test focused on DeepMind’s diffusion model page and joining a waitlist. Mariner found the correct page, clicked through to the “join the waitlist” area, and handled cookie acceptance. It then reached a sign-in-gated form and successfully updated a field, changing a “profession” value to “engineer.” That demonstrated the agent’s ability to interact with multi-step web forms, though the overall experience was described as “far from perfect,” suggesting friction or brittleness in real-world flows.
The fourth test was the most impressive: Mariner was asked to find a way to test a small Python snippet online. It identified W3Schools as a place to run code, initially crashed, then recovered after retry instructions. On the next attempt, it ran the code and produced the expected output: “The sum of seven and five is 12.” The agent required additional guidance (like excluding comments), but it still managed to locate an execution environment and complete the task end-to-end.
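The exact snippet isn’t shown on screen, so the code below is a plausible reconstruction that matches the reported output rather than the video’s verbatim program:

```python
# Hypothetical reconstruction: the video confirms only the printed output,
# not the code itself.
a = 7
b = 5
print(f"The sum of seven and five is {a + b}")
# -> The sum of seven and five is 12
```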
The final test attempted to converse with ChatGPT about the future of software engineers. Mariner struggled with the interaction layer: it navigated toward ChatGPT-related pages, but responses didn’t load, and an “internal error has occurred” message appeared when trying to run the prompt. The user ultimately abandoned the chat portion.
Overall, Mariner’s early strengths center on web navigation, search, and form completion, plus the ability to execute simple code through external tools. Its early weaknesses show up when the task requires privileged actions (like sending emails) or reliable interactive chat behavior. The takeaway is clear: the agent can be useful for research and browsing tasks, but it still needs guardrails and better reliability for action-heavy and conversational workflows.
Cornell Notes
Project Mariner performed well on web navigation tasks and delivered a clear win on code execution. It found a YouTube video and reported 25,000 views, then located DeepMind’s diffusion model waitlist page, accepted cookies, and updated a sign-in form field to “engineer.” The agent also searched for an online Python runner, recovered after a crash, and successfully executed a simple script to produce the result 12. Email sending failed due to an apparent action blocker, and the attempt to hold a live conversation with ChatGPT ran into loading and internal errors. The pattern suggests strong browsing and form skills, with limitations on privileged actions and interactive chat reliability.
What was the most reliable early capability demonstrated by Project Mariner?
Why did the Gmail test end without a successful email being sent?
How did Mariner handle the DeepMind waitlist signup flow?
What made the Python code execution test stand out?
What went wrong when trying to converse with ChatGPT?
Review Questions
- Which tasks did Mariner complete end-to-end successfully, and which steps failed due to blockers or errors?
- What evidence suggests Mariner can recover from failures during web-based code execution?
- How do the Gmail and ChatGPT failures differ in nature (permissions vs. interaction reliability)?
Key Points
1. Project Mariner successfully completed a YouTube search and returned a concrete view count of 25,000 for “Google Flow all about AI.”
2. It could gather information about Anthropic’s Claude livestreams but failed to send emails due to an action/permission blocker.
3. It navigated to DeepMind’s diffusion model waitlist page, accepted cookies, and updated a sign-in form field to “engineer.”
4. Mariner found an online Python execution site (W3Schools), recovered after a crash, and executed a simple script to produce 12.
5. Interactive chat with ChatGPT was unreliable, with prompts failing to load and an “internal error has occurred” message appearing.
6. Early performance is strongest for browsing, searching, and form interaction, while action-heavy and conversational tasks need improvement.