The Rise of WebMCP
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
WebMCP lets websites expose structured, callable tools to AI agents in Chrome, replacing scraping and screenshot-driven guessing.
Briefing
WebMCP is poised to replace today’s “guess-and-scrape” web interaction for AI agents by letting websites expose structured, callable tools directly to the browser—cutting both token usage and engineering pain. Instead of an agent reading raw HTML, interpreting screenshots, or trying to infer which buttons to press, a page can register functions (for example, “search products” or “fill this form”) so the model can invoke them like tools. The practical payoff is straightforward: one tool call can collapse what used to take dozens of clicks, scrolls, and intermediate reasoning steps.
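To make that concrete, here is a minimal sketch of a page registering a "search products" tool. The `navigator.modelContext.registerTool` entry point and its fields follow the general shape of the draft proposal, but the API is still evolving, so treat every name below as an assumption rather than a shipped interface.

```ts
// Assumed shape of the draft WebMCP surface, added via declaration merging
// onto the DOM Navigator type. Every identifier here is an assumption.
interface ModelContextTool {
  name: string;
  description: string;
  inputSchema: object;
  execute(args: Record<string, unknown>): Promise<unknown>;
}

interface Navigator {
  modelContext: { registerTool(tool: ModelContextTool): void };
}

// The page exposes one structured capability instead of making the agent
// find the search box, type into it, submit, and scrape the results.
navigator.modelContext.registerTool({
  name: "search-products",
  description: "Search the product catalog and return matching items.",
  inputSchema: {
    type: "object",
    properties: { query: { type: "string", description: "Search terms" } },
    required: ["query"],
  },
  async execute(args) {
    // Reuses the site's existing search endpoint (hypothetical URL).
    const res = await fetch(`/api/search?q=${encodeURIComponent(String(args.query))}`);
    return res.json(); // structured JSON for the model, not HTML to scrape
  },
});
```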
The core problem WebMCP targets is that current agent workflows treat websites like a foreign language. Whether agents rely on multimodal screenshots or direct DOM access, they still need to translate web content into something the model can act on. Screenshot-based approaches burn large numbers of tokens per image, while HTML/DOM approaches still require summarizing the page and filtering out noise such as paragraph tags and CSS. That translation overhead becomes a recurring cost and a recurring source of brittleness, especially on complex, dynamic pages.
WebMCP reframes the interaction model: each web page can act like an MCP (Model Context Protocol) surface for the agent. The agent asks what it can read, click, or fill in, and the site responds with structured capabilities and results. The concept has been discussed in academic work and earlier industry proposals, including a collaboration between Microsoft and Google on a spec for making this work without forcing every capability behind a traditional backend API. Normal human browsing remains intact; the change is about giving agents a more efficient, standardized interface.
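In MCP proper, clients discover a server's tools through a `tools/list` request; the natural page-level analogue is the browser collecting each registered tool's metadata and presenting it to the model. Here is a sketch of what that structured capability list might look like. The descriptor fields mirror MCP's tool descriptors; WebMCP's exact format is an assumption.

```ts
// What the browser might hand the model after asking the page for its
// capabilities. Field names mirror MCP tool descriptors; the actual
// WebMCP format is an assumption.
interface ToolDescriptor {
  name: string;
  description: string;
  inputSchema: object;
}

const pageCapabilities: ToolDescriptor[] = [
  {
    name: "search-products",
    description: "Search the product catalog and return matching items.",
    inputSchema: {
      type: "object",
      properties: { query: { type: "string" } },
      required: ["query"],
    },
  },
  {
    name: "fill-checkout-form",
    description: "Fill the checkout form from structured user details.",
    inputSchema: {
      type: "object",
      properties: { name: { type: "string" }, address: { type: "string" } },
    },
  },
];
```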
A key emphasis is human-in-the-loop coordination. Rather than pushing toward fully autonomous browsing, the design supports workflows where agents act on a user’s behalf while still handing control back when needed—such as when a site lacks the exact product a user wants and the system pauses for confirmation. This aligns with a three-pillar framing described at the Web AI Summit: context (helping the agent understand what matters beyond the current screen), capabilities (actions like form completion using available user knowledge), and coordination (managing the back-and-forth between user and agent).
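For instance, an "add to cart" tool can report a near-miss back to the agent instead of silently substituting a product. The result shape below is a hypothetical pattern for illustrating the hand-back; nothing here is defined by WebMCP, and `findClosestProduct` is a stubbed stand-in for site-specific logic.

```ts
// Hypothetical coordination pattern: the tool defers to the user rather
// than silently substituting a different product.
type AddToCartResult =
  | { status: "added"; sku: string }
  | { status: "needs_confirmation"; message: string; suggestion: string };

// Stand-in for a site-specific catalog lookup (stubbed for the sketch).
async function findClosestProduct(name: string): Promise<{ name: string; sku: string }> {
  return { name: "Blue Widget (2-pack)", sku: "BW-2" };
}

async function addToCart(args: { product: string }): Promise<AddToCartResult> {
  const match = await findClosestProduct(args.product);
  if (match.name !== args.product) {
    // Exact item unavailable: hand control back to the user instead of guessing.
    return {
      status: "needs_confirmation",
      message: `"${args.product}" is unavailable; closest match is "${match.name}".`,
      suggestion: match.sku,
    };
  }
  return { status: "added", sku: match.sku };
}
```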
Technically, WebMCP uses two main APIs. The declarative API leverages existing HTML forms, adding metadata such as tool names and descriptions so agents can discover and use them. The imperative API targets more complex, dynamic interactions that may require JavaScript execution, and lets the page define a schema for richer tools. Both run client-side in the browser, which keeps the system close to the user's session and avoids forcing every site feature into a separate server API.
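A sketch of the declarative side, with the markup held in a template string so the example stays self-contained. The `toolname` and `tooldescription` attributes are placeholders for illustration; the spec's final attribute names may differ.

```ts
// Declarative path: an ordinary HTML form plus tool metadata. The
// `toolname` / `tooldescription` attributes are placeholder names.
const annotatedForm = `
  <form action="/search" method="get"
        toolname="search-products"
        tooldescription="Search the product catalog">
    <label>Query <input name="q" type="search" required></label>
    <button type="submit">Search</button>
  </form>
`;

// Injected here only so the sketch runs on its own; on a real site the
// form already exists in the page's HTML and simply gains the attributes.
document.body.insertAdjacentHTML("beforeend", annotatedForm);
```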
WebMCP already sits behind a flag in Chrome's early preview, and developers can join the Chrome early preview program to try it. The expectation is that rollout will accelerate soon, potentially alongside new ways to package site functionality into reusable “web MCPs,” making it easier for agent builders and product teams to ship reliable web actions at lower token cost.
Cornell Notes
WebMCP gives websites a standardized way to expose structured tools to AI agents inside Chrome, so agents can call functions instead of scraping HTML or interpreting screenshots. The goal is to reduce token costs and improve reliability by replacing “guess which button to press” behavior with explicit capabilities like “search products” or “fill this form.” Chrome’s implementation uses two APIs: a declarative path that annotates existing HTML forms, and an imperative path for dynamic interactions that may need JavaScript execution. The design also emphasizes human-in-the-loop coordination, pausing for user confirmation when exact outcomes aren’t available. With WebMCP already present behind a Chrome flag, developers can start testing through Chrome’s early preview program.
- Why do screenshot-based and DOM-based agent approaches become expensive or brittle?
- What does it mean for a web page to “act as an MCP” under WebMCP?
- How do the declarative and imperative APIs differ in WebMCP?
- What role does human-in-the-loop coordination play in WebMCP-style browsing?
- What are the three pillars mentioned for Web AI Summit-style web agent interaction?
- What practical advantage does WebMCP aim to deliver beyond convenience?
Review Questions
- How does WebMCP reduce token usage compared with screenshot-based agent workflows?
- In what situations would a site prefer the declarative API over the imperative API?
- Describe a human-in-the-loop scenario that WebMCP’s coordination pillar is meant to handle.
Key Points
1. WebMCP lets websites expose structured, callable tools to AI agents in Chrome, replacing scraping and screenshot-driven guessing.
2. Screenshot-based agent browsing is token-expensive, while DOM-based browsing still requires costly translation from raw HTML into agent-ready actions.
3. WebMCP treats each page as an MCP-like interface where agents can discover what they can read, click, and fill in.
4. Chrome’s WebMCP implementation uses two APIs: a declarative API for annotating existing HTML forms and an imperative API for dynamic, JavaScript-driven interactions.
5. The design emphasizes human-in-the-loop coordination, pausing for user confirmation when exact outcomes aren’t available.
6. Both APIs run client-side in the browser session, aiming to avoid forcing every capability into backend API calls.
7. WebMCP is already present in Chrome behind a flag, and developers can access early previews through Chrome’s early preview program.