3 AI Agent Browser Automation Challenges That Keep Getting Harder
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Browser automation with AI agents can tackle surprisingly complex AWS console workflows—but the hardest parts aren’t the clicks, they’re the operational edge cases, permissions, and long-running interactive sessions.
Three escalating challenges tested a Chrome-controlling “cloud code” browser agent against AWS’s most UI-heavy tasks. Level one focused on building a simple static website pipeline in Amazon S3: create an S3 bucket, upload an image and an index.html file, enable static website hosting, and adjust public access settings so the site becomes reachable. The agent navigated directly to the S3 console, filled in a bucket name, created the bucket, uploaded the required files, and edited the bucket’s static website hosting configuration. The workflow hit a snag when it needed to modify the bucket policy/public access controls; instead of repeatedly struggling in the console UI, it switched to CloudShell to run the necessary CLI command. After about 40 minutes, the static site and assets were publicly accessible, demonstrating that the agent can recover from UI friction by changing tactics.
Level two raised the difficulty by requiring a full compute-and-browse loop: launch a Linux VM, expose it with a graphical remote desktop, and then use the VM’s browser to open a YouTube video about “cloud code.” The agent again started by navigating the AWS console to launch an instance. It proceeded to bring up an Ubuntu VM and attempted to use CloudShell to launch Firefox and load the YouTube page. The YouTube playback didn’t fully succeed—likely due to resource or connectivity limitations—but the VM was created and the browser rendering partially worked. The result earned a “pass” because the core objective (provisioning and reaching the browsing step) was achieved, even if the final playback wasn’t perfect.
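The provisioning half of level two also maps to a few CLI calls. A sketch with hypothetical placeholder IDs (the video does not show the AMI, key pair, security group, or desktop setup the agent actually used):

```shell
# Hypothetical placeholders: pick an Ubuntu AMI for your region and an
# instance size with enough memory for a graphical desktop plus a browser.
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type t3.large \
  --key-name my-keypair \
  --security-group-ids sg-0123456789abcdef0 \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=browser-vm}]'

# On the VM (over SSH), one common way to get a remote desktop -- an
# assumption, not the video's confirmed setup:
sudo apt update
sudo apt install -y xfce4 xrdp firefox
sudo systemctl enable --now xrdp
```

Even when every command succeeds, desktop rendering and video playback on a fresh VM depend on instance sizing, GPU-less software rendering, and security-group rules for the RDP port, which is consistent with the partial playback result described above.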
Level three shifted from infrastructure to application building: create and publish a small web app where users can upload a video and then view it on a public playback page—essentially a lightweight, user-uploaded video platform. The agent produced the front-end (HTML/CSS), implemented the upload flow, and generated a shareable link. Testing from another machine showed the uploaded video could be played back, including with a larger ~200MB file, and refreshes correctly loaded the content. There was some “cheating” in the sense that CloudShell was leveraged heavily during parts of the build, but the end-to-end outcome worked: a functional upload-and-playback service stood up quickly (reported as roughly 5–10 minutes).
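One plausible shape for the heavy CloudShell use in the upload-and-playback flow, assuming the videos land in S3 (bucket and key names below are hypothetical): push the file with `aws s3 cp` and hand out a time-limited playback URL with `aws s3 presign`.

```shell
# Hypothetical bucket and object key.
BUCKET=my-video-app-uploads
KEY=uploads/demo-video.mp4

# Upload the video; `aws s3 cp` performs multipart uploads automatically,
# which is one reason a ~200MB file can work without special handling.
aws s3 cp demo-video.mp4 "s3://$BUCKET/$KEY"

# Generate a shareable playback link valid for 24 hours (a presigned GET URL).
aws s3 presign "s3://$BUCKET/$KEY" --expires-in 86400
```

A presigned URL works from any machine without AWS credentials, matching the "tested from another machine" observation, though the video does not confirm this was the agent's actual link-sharing mechanism.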
Across all three levels, the pattern was clear: the agent handles routine navigation and form-filling well, but the real breakthroughs come when it can switch tools—especially when console UI becomes brittle. The exercise also highlights how quickly these systems can move from “browser automation” to “full-stack deployment” when the right permissions and tooling are available.
Cornell Notes
An AI-driven browser automation agent successfully completed three escalating AWS tasks: creating an S3 static website, provisioning a Linux VM and attempting YouTube playback, and building a small video upload web app with public playback. Level one worked end-to-end but required a detour into CloudShell to fix bucket policy/public access issues after the console UI proved difficult. Level two launched an Ubuntu VM and reached the browsing step, though YouTube rendering wasn’t fully reliable. Level three produced a working upload-and-playback platform quickly, generating a public link and supporting large uploads (around 200MB). The key takeaway is that these agents excel at UI navigation and can deliver real deployments when they can switch from brittle UI interactions to command-line tooling.
What made the S3 static website challenge harder than simple form-filling?
How did the agent approach the VM + YouTube playback challenge, and why did it fall short?
What did the level three web app require, and what evidence showed it worked?
Where did “tool switching” matter most across the three challenges?
What does the overall run suggest about the agent’s strengths and limitations?
Review Questions
- Which specific step in the S3 workflow forced the agent to switch from console interactions to CloudShell, and what was the outcome?
- Why might YouTube playback on a newly launched VM be unreliable even when the VM boots successfully?
- What functional behaviors must a “video upload + public playback” app implement for the level three test to pass?
Key Points
1. The agent can navigate AWS console UIs to complete multi-step infrastructure tasks, including S3 bucket creation and static website setup.
2. S3 bucket policy/public access changes are a common failure point in UI-only automation; CloudShell/CLI can unblock them.
3. Launching a VM is achievable through browser automation, but remote desktop and browser rendering can still fail due to environment constraints.
4. A working end-to-end upload-and-playback web app can be generated quickly when the agent has the right tooling and permissions.
5. Successful automation depends less on clicking through screens and more on handling permission-sensitive configuration and switching tactics when UI friction appears.
6. Large file uploads (around 200MB) can work in the resulting app, indicating the pipeline isn't limited to trivial test sizes.