
The Rise of WebMCP

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

WebMCP lets websites expose structured, callable tools to AI agents in Chrome, replacing scraping and screenshot-driven guessing.

Briefing

WebMCP is poised to replace today’s “guess-and-scrape” web interaction for AI agents by letting websites expose structured, callable tools directly to the browser—cutting both token usage and engineering pain. Instead of an agent reading raw HTML, interpreting screenshots, or trying to infer which buttons to press, a page can register functions (for example, “search products” or “fill this form”) so the model can invoke them like tools. The practical payoff is straightforward: one tool call can collapse what used to take dozens of clicks, scrolls, and intermediate reasoning steps.

The core problem WebMCP targets is that current agent workflows treat websites like a foreign language. Whether agents rely on multimodal screenshots or direct DOM access, they still need to translate web content into something the model can act on. Screenshot-based approaches burn large amounts of tokens per image, while HTML/DOM approaches still require summarizing and filtering out irrelevant markup like paragraph tags and CSS. That translation overhead becomes a recurring cost and a recurring source of brittleness—especially for complex, dynamic pages.

WebMCP reframes the interaction model: each web page can act like an MCP (Model Context Protocol) surface for the agent. The agent asks what it can read, click, or fill in, and the site responds with structured capabilities and results. The concept has been discussed in academic work and earlier industry proposals, including collaboration between Microsoft and Google on a spec for how to make this work without forcing every capability behind a traditional backend API. Normal human browsing remains intact; the change is about giving agents a more efficient, standardized interface.
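The register-discover-invoke pattern described above can be sketched as a plain TypeScript mock. The real WebMCP surface is provided by the browser; every name below (`ToolRegistry`, `registerTool`, `listTools`, `callTool`) is an illustrative assumption, not the actual API:

```typescript
type ToolHandler = (args: Record<string, unknown>) => unknown;

interface Tool {
  name: string;
  description: string;
  handler: ToolHandler;
}

class ToolRegistry {
  private tools = new Map<string, Tool>();

  // The page registers a capability it wants to expose to agents.
  registerTool(name: string, description: string, handler: ToolHandler): void {
    this.tools.set(name, { name, description, handler });
  }

  // The agent asks what it can read, click, or fill in on this page.
  listTools(): { name: string; description: string }[] {
    return Array.from(this.tools.values()).map(({ name, description }) => ({
      name,
      description,
    }));
  }

  // One structured call replaces many clicks, scrolls, and guesses.
  callTool(name: string, args: Record<string, unknown>): unknown {
    const tool = this.tools.get(name);
    if (!tool) throw new Error(`Unknown tool: ${name}`);
    return tool.handler(args);
  }
}

// A shop page exposes product search as a structured tool.
const page = new ToolRegistry();
const catalog = [
  { id: 1, title: "Oat milk, 1L" },
  { id: 2, title: "Whole milk, 1L" },
];
page.registerTool(
  "search_products",
  "Search the product catalog by keyword",
  (args) => {
    const q = String(args.query ?? "").toLowerCase();
    return catalog.filter((p) => p.title.toLowerCase().includes(q));
  }
);

// The agent discovers and invokes the tool instead of scraping the DOM.
const results = page.callTool("search_products", { query: "oat" }) as unknown[];
console.log(page.listTools().map((t) => t.name)); // → [ 'search_products' ]
console.log(results.length); // → 1
```

The point of the sketch is the shape of the exchange: the site declares capabilities once, and the agent's per-task cost becomes one structured call and one structured result.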

A key emphasis is human-in-the-loop coordination. Rather than pushing toward fully autonomous browsing, the design supports workflows where agents act on a user’s behalf while still handing control back when needed—such as when a site lacks the exact product a user wants and the system pauses for confirmation. This aligns with a three-pillar framing described at the Web AI Summit: context (helping the agent understand what matters beyond the current screen), capabilities (actions like form completion using available user knowledge), and coordination (managing the back-and-forth between user and agent).

Technically, WebMCP uses two main APIs. The declarative API leverages existing HTML forms, adding metadata like tool names and descriptions so agents can discover and use them. The imperative API targets more complex, dynamic interactions that may require JavaScript execution, defining a schema for richer tools. Both run client-side in the browser, which keeps the system closer to the user’s session and avoids forcing every site feature into a separate server API.
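The declarative idea — that an annotated form can be mechanically turned into a tool definition — can be illustrated with a small sketch. The field names and the JSON-Schema-style output shape are assumptions for illustration, not the actual WebMCP attribute names:

```typescript
interface FormField {
  name: string;
  type: "text" | "number" | "checkbox";
  required: boolean;
}

interface FormDescription {
  toolName: string; // hypothetical tool-name annotation on the form
  toolDescription: string; // hypothetical description annotation
  fields: FormField[];
}

// Map HTML input types onto JSON-Schema-style types.
const schemaType = { text: "string", number: "number", checkbox: "boolean" } as const;

// Derive a structured tool definition from an annotated form.
function formToToolDefinition(form: FormDescription) {
  const properties: Record<string, { type: string }> = {};
  for (const f of form.fields) properties[f.name] = { type: schemaType[f.type] };
  return {
    name: form.toolName,
    description: form.toolDescription,
    inputSchema: {
      type: "object",
      properties,
      required: form.fields.filter((f) => f.required).map((f) => f.name),
    },
  };
}

// A checkout form exposed as a "fill this form" tool.
const def = formToToolDefinition({
  toolName: "fill_shipping_form",
  toolDescription: "Fill in the shipping details for checkout",
  fields: [
    { name: "fullName", type: "text", required: true },
    { name: "zip", type: "text", required: true },
    { name: "giftWrap", type: "checkbox", required: false },
  ],
});
console.log(def.inputSchema.required); // → [ 'fullName', 'zip' ]
```

This is the appeal of the declarative path: sites that already have well-structured forms get agent-callable tools with little more than added metadata.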

WebMCP is already available in Chrome behind a flag, and developers can join the Chrome early preview program to test it. The expectation is that rollout will accelerate soon, potentially alongside new ways to package site functionality into reusable “web MCPs,” making it easier for agent builders and product teams to ship reliable web actions at lower token cost.

Cornell Notes

WebMCP gives websites a standardized way to expose structured tools to AI agents inside Chrome, so agents can call functions instead of scraping HTML or interpreting screenshots. The goal is to reduce token costs and improve reliability by replacing “guess which button to press” behavior with explicit capabilities like “search products” or “fill this form.” Chrome’s implementation uses two APIs: a declarative path that annotates existing HTML forms, and an imperative path for dynamic interactions that may need JavaScript execution. The design also emphasizes human-in-the-loop coordination, pausing for user confirmation when exact outcomes aren’t available. With WebMCP already present behind a Chrome flag, developers can start testing through Chrome’s early preview program.

Why do screenshot-based and DOM-based agent approaches become expensive or brittle?

Screenshot workflows require passing images into a multimodal model, which can consume thousands of tokens per image. DOM/HTML approaches avoid image tokens but still require translating raw markup into agent-friendly summaries—filtering out irrelevant tags (like paragraph elements) and presentation details (like CSS). In both cases, the agent is effectively speaking a “foreign language” to the site, which increases token spend and failure rates when pages change.
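A back-of-the-envelope comparison makes the cost gap concrete. All numbers below are illustrative assumptions, not measured figures; the source only states that screenshots can cost thousands of tokens per image:

```typescript
// Assumed costs, for illustration only.
const TOKENS_PER_SCREENSHOT = 2000; // assumed multimodal image cost
const STEPS_PER_TASK = 12;          // assumed clicks/scrolls, each needing a fresh screenshot
const TOKENS_PER_TOOL_CALL = 150;   // assumed schema + arguments + structured result

const screenshotCost = TOKENS_PER_SCREENSHOT * STEPS_PER_TASK; // 24000
const toolCallCost = TOKENS_PER_TOOL_CALL;                     // 150

console.log(screenshotCost, toolCallCost, screenshotCost / toolCallCost);
// → 24000 150 160
```

Even if the assumed numbers are off by a wide margin, the structure of the comparison holds: screenshot costs scale with the number of interaction steps, while a structured tool call pays a roughly fixed price per task.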

What does it mean for a web page to “act as an MCP” under WebMCP?

A site can register structured functions that the agent can discover and call. Instead of the model inferring what to click, it can ask what it can read, click, or fill in, and then invoke tools such as “search products.” The response is structured results, so a single tool call can replace many intermediate browser interactions.

How do the declarative and imperative APIs differ in WebMCP?

The declarative API focuses on standard actions derived from existing HTML forms, adding tool metadata such as tool name and description so agents can use them. The imperative API handles more complex dynamic interactions, defining a schema for richer tools and enabling JavaScript execution when needed. Both are designed to run client-side in the browser session.
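An imperative-style tool can be sketched as a schema plus an execute function that runs page JavaScript. The shape below (`name`/`inputSchema`/`execute`) mirrors common MCP tool definitions and is an assumption, not the exact WebMCP imperative API:

```typescript
interface ImperativeTool {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, { type: string }>;
    required: string[];
  };
  execute: (args: Record<string, unknown>) => Promise<unknown>;
}

// Page state the tool manipulates (stands in for dynamic UI state).
const cart: { productId: number; qty: number }[] = [];

const addToCart: ImperativeTool = {
  name: "add_to_cart",
  description: "Add a product to the cart by id",
  inputSchema: {
    type: "object",
    properties: { productId: { type: "number" }, quantity: { type: "number" } },
    required: ["productId"],
  },
  // In a real page this would drive dynamic UI; here it just mutates the cart.
  execute: async ({ productId, quantity }) => {
    const qty = typeof quantity === "number" ? quantity : 1;
    cart.push({ productId: productId as number, qty });
    return { ok: true, cartSize: cart.length };
  },
};

// Minimal invoke helper: reject calls missing required arguments.
async function invoke(tool: ImperativeTool, args: Record<string, unknown>) {
  for (const key of tool.inputSchema.required) {
    if (!(key in args)) throw new Error(`Missing required argument: ${key}`);
  }
  return tool.execute(args);
}

invoke(addToCart, { productId: 42, quantity: 2 }).then((r) => console.log(r));
// → { ok: true, cartSize: 1 }
```

The contrast with the declarative path is that nothing here is derived from existing markup: the site author defines the schema and the behavior directly, which is what makes this path suitable for dynamic, JavaScript-driven interactions.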

What role does human-in-the-loop coordination play in WebMCP-style browsing?

The system is designed to support back-and-forth control rather than full autonomy. For example, if an agent tries to buy a specific brand or type of milk and the site doesn’t have the exact item, the workflow can route back to the user for confirmation. This coordination pillar helps manage uncertainty and mismatches between user intent and site inventory.

What are the three pillars mentioned for Web AI Summit-style web agent interaction?

The framing uses: (1) context—information the agent needs to understand what the user is doing, including relevant history beyond the current screen; (2) capabilities—actions on the user’s behalf, like filling out forms using known user details; and (3) coordination—how control flows between user and agent, including when to pause for human input.

What practical advantage does WebMCP aim to deliver beyond convenience?

It targets efficiency and cost. By letting agents call structured functions directly, it reduces the number of interactions required to complete tasks (dozens of clicks/scrolls can collapse into one tool call). That reduction directly lowers token consumption compared with screenshot-heavy or translation-heavy approaches.

Review Questions

  1. How does WebMCP reduce token usage compared with screenshot-based agent workflows?
  2. In what situations would a site prefer the declarative API over the imperative API?
  3. Describe a human-in-the-loop scenario that WebMCP’s coordination pillar is meant to handle.

Key Points

  1. WebMCP lets websites expose structured, callable tools to AI agents in Chrome, replacing scraping and screenshot-driven guessing.

  2. Screenshot-based agent browsing is token-expensive, while DOM-based browsing still requires costly translation from raw HTML into agent-ready actions.

  3. WebMCP treats each page as an MCP-like interface where agents can discover what they can read, click, and fill in.

  4. Chrome’s WebMCP implementation uses two APIs: a declarative API for annotating existing HTML forms and an imperative API for dynamic, JavaScript-driven interactions.

  5. The design emphasizes human-in-the-loop coordination, pausing for user confirmation when exact outcomes aren’t available.

  6. Both APIs run client-side in the browser session, aiming to avoid forcing every capability into backend API calls.

  7. WebMCP is already present in Chrome behind a flag, and developers can access early previews through Chrome’s early preview program.

Highlights

  • WebMCP shifts web interaction from “infer buttons from screenshots/HTML” to “call structured functions,” turning many steps into one tool invocation.
  • The declarative API builds on existing HTML forms by adding tool metadata, while the imperative API defines richer schemas for dynamic behaviors.
  • Human-in-the-loop coordination is a first-class goal—agents can act, but they can also hand control back when the site can’t satisfy the user’s exact request.
