Introduction to Operator & Agents

OpenAI

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Operator is an early research preview agent that completes tasks by controlling a remote web browser using screenshots plus keyboard and mouse.

Briefing

AI agents are moving from chat-based assistance into hands-on work. Operator is an OpenAI system that takes control of a remote web browser, interprets what is on the screen, and completes real tasks such as booking reservations, shopping for groceries, and buying event tickets, using only the visual input and keyboard and mouse actions a person would use.

Operator launches as an early research preview with a browser-in-the-cloud interface. Users type a request, and Operator opens a remote session, navigates websites, and performs actions by clicking, typing, and iterating based on fresh screenshots. In a reservation demo, it searched OpenTable for a restaurant in San Francisco and handled availability changes by proposing an alternate time when the requested slot wasn’t available. It also demonstrated “human-in-the-loop” checkpoints: before irreversible actions, Operator asks for confirmation so users can cancel or approve.

A second demo focused on grocery shopping through Instacart. Operator used vision to read an uploaded shopping list image (eggs, spinach, mushrooms, chicken thighs, chili crunch) and then added items to a cart by repeatedly observing the screen, planning the next step, and acting. The workflow followed a tight loop: take a screenshot, decide what to do next, click or type, then capture the updated screen to verify the effect—continuing until the task is complete or the system requests help.

Operator’s underlying research model is “Computer Using Agent” (CUA), built off GPT-4o but trained to control a computer directly. The key shift is avoiding reliance on website APIs. Instead of needing structured integrations for each service (like Instacart), CUA learns to operate through the universal interface of pixels plus keyboard and mouse. That design choice aims to remove a major bottleneck for agent deployment across the web.

The system also includes a takeover mode. Users can seize control of the remote browser, make changes privately, and then hand it back to Operator. The session is described as private during takeover, and the system can’t see what the user does in that mode beyond the last screenshot.

Because real-world actions carry risk, Operator is wrapped in layered safeguards organized around three kinds of misalignment: a user asking for something harmful, the model making a mistake, and a website trying to manipulate the agent. Harmful requests are refused, backed by moderation and detection mechanisms. When the model might be wrong, confirmations are used before impactful steps like purchases or reservations. When websites might be untrusted or malicious, a “prompt injection monitor” can pause the agent’s trajectory if something looks suspicious.

Reliability is still imperfect, and Operator is explicitly positioned as a research preview that makes mistakes. Performance is quantified with evals: CUA scores 38.1% on OSWorld (Linux-style operating system navigation) and 58.1% on WebArena (website navigation), both above other published results but below human performance. The rollout begins with Pro users in the United States, with other regions to follow, plus plans to expand access and launch an API in the coming weeks.

Cornell Notes

Operator is an OpenAI agent that can complete real tasks by controlling a remote web browser. It works by repeatedly taking screenshots, deciding the next keyboard/mouse action, and verifying results through the updated screen, without needing website APIs. The system supports human-in-the-loop confirmations before impactful actions and a takeover mode where users can privately adjust the session and then return control to the agent. Operator is powered by a new model, Computer Using Agent (CUA), built off GPT-4o and trained to operate computers like humans do. Safety relies on layered mitigations for harmful requests, model mistakes, and suspicious websites, while performance is still below humans on OSWorld and WebArena.

How does Operator actually perform tasks on websites?

Operator runs a remote browser session in the cloud. After a user enters a prompt, it opens the relevant site, then follows a loop: it looks at the current screen (raw pixels), decides what to do next, performs actions via keyboard and mouse (clicks and typing), and then takes another screenshot to confirm the effect. This continues until the task is done or Operator asks for help.
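The loop described above can be sketched in a few lines of Python. This is a toy illustration only, not Operator's implementation: `FakeBrowser`, `Screen`, `Action`, `plan_next`, and `run_agent` are all hypothetical names, the "screenshot" is a small data snapshot instead of raw pixels, and the planner is a hand-written rule where the real system uses a trained model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Screen:
    """Toy stand-in for a screenshot (a real agent would see raw pixels)."""
    search_box: str
    cart: Tuple[str, ...]


@dataclass
class Action:
    kind: str          # "type", "click", or "done"
    target: str = ""   # hypothetical element label; a real agent would use coordinates
    text: str = ""


@dataclass
class FakeBrowser:
    """Hypothetical remote browser with a search box and a cart."""
    search_box: str = ""
    cart: List[str] = field(default_factory=list)

    def screenshot(self) -> Screen:
        return Screen(self.search_box, tuple(self.cart))

    def apply(self, action: Action) -> None:
        if action.kind == "type":
            self.search_box = action.text
        elif action.kind == "click" and action.target == "add_to_cart":
            self.cart.append(self.search_box)
            self.search_box = ""


def plan_next(screen: Screen, shopping_list: List[str]) -> Action:
    """Decide the next step using only what is visible on the screen."""
    for item in shopping_list:
        if item in screen.cart:
            continue  # the last screenshot confirms this item was added
        if screen.search_box == item:
            return Action("click", target="add_to_cart")
        return Action("type", text=item)
    return Action("done")


def run_agent(browser: FakeBrowser, shopping_list: List[str],
              max_steps: int = 20) -> List[Action]:
    history: List[Action] = []
    for _ in range(max_steps):
        screen = browser.screenshot()              # observe
        action = plan_next(screen, shopping_list)  # decide
        history.append(action)
        if action.kind == "done":
            return history
        browser.apply(action)                      # act; the next pass re-observes
    raise RuntimeError("step budget exhausted; a real agent would ask for help")


browser = FakeBrowser()
history = run_agent(browser, ["eggs", "spinach"])
print(browser.cart)
```

Note that the planner never reads the browser's internal state directly. It only sees the snapshot returned by `screenshot()`, which mirrors the idea that each action is verified through the next screen capture.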

What role do confirmations and takeover play in keeping users in control?

Operator uses confirmations before irreversible or high-impact actions such as booking or purchasing. If a requested time isn’t available, it can return with an alternative and ask the user to approve. For deeper intervention, users can click “take control” to seize the remote browser. During takeover, the session is described as private, and Operator can’t observe what the user does beyond the last screenshot; afterward, control can be handed back to Operator.
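A confirmation checkpoint of this kind can be sketched as a small gate in front of the action executor. The code below is a minimal sketch under assumptions: the action names in `IRREVERSIBLE`, and the `perform` and `confirm` callbacks, are hypothetical and are not taken from Operator itself.

```python
from typing import Callable, List

# Hypothetical labels for actions that cannot be undone once performed.
IRREVERSIBLE = {"purchase", "book_reservation"}


def execute_with_confirmation(action: str,
                              perform: Callable[[str], str],
                              confirm: Callable[[str], bool]) -> str:
    """Run `action`, asking the user first when it is irreversible."""
    if action in IRREVERSIBLE and not confirm(action):
        return f"cancelled: {action}"
    return perform(action)


asked: List[str] = []  # records which actions triggered a confirmation prompt


def deny(action: str) -> bool:
    asked.append(action)
    return False  # simulate a user who cancels


def do(action: str) -> str:
    return f"performed: {action}"


print(execute_with_confirmation("scroll", do, deny))
print(execute_with_confirmation("purchase", do, deny))
```

Low-impact actions such as scrolling pass straight through, so the user is only interrupted at the points where an approval or cancellation actually matters.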

Why does CUA matter for web automation across many sites?

CUA (Computer Using Agent) is trained to control a computer using the same universal interface humans use—screen pixels plus keyboard and mouse—rather than depending on each website providing an API. That design avoids the “API bottleneck,” where an agent would otherwise need site-specific integrations to shop, book, or transact.

How does Operator handle tasks that start from images or ambiguous requests?

In the grocery demo, a user uploaded a picture of a shopping list. Operator used vision to read items from the image and then proceeded to add them to a cart. When a user doesn’t specify a particular service (e.g., “buy these groceries” without naming Instacart), Operator can use search to find likely matches and then ask clarifying questions if needed.

What safety risks does Operator target, and how?

The safety framework focuses on misalignment in three areas: (1) user misalignment, where requests to carry out harmful actions are refused using mitigations carried over from ChatGPT-style systems; (2) agent misalignment, where the model might make an error, so Operator asks for confirmation before impactful actions; (3) website misalignment, where a site is fraudulent or tries to manipulate the agent, so a prompt injection monitor can pause the agent’s actions when suspicious behavior is detected.
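One way to picture the three layers is as ordered checks applied to each step. The sketch below is purely illustrative: the keyword lists, the `Decision` values, and `check_step` are hypothetical. The source does not describe how the actual mitigations or the prompt injection monitor are built, and they are unlikely to be simple keyword matches.

```python
from enum import Enum


class Decision(Enum):
    REFUSE = "refuse"                      # user misalignment
    PAUSE = "pause"                        # website misalignment
    ASK_CONFIRMATION = "ask_confirmation"  # agent misalignment
    PROCEED = "proceed"


# Placeholder rules for illustration only.
BLOCKED_TOPICS = ("malware",)
SUSPICIOUS_PHRASES = ("ignore previous instructions",)
IMPACTFUL_ACTIONS = {"purchase", "book_reservation"}


def check_step(user_request: str, page_text: str, proposed_action: str) -> Decision:
    """Apply the three layers in order and return the first one that triggers."""
    if any(topic in user_request.lower() for topic in BLOCKED_TOPICS):
        return Decision.REFUSE
    if any(phrase in page_text.lower() for phrase in SUSPICIOUS_PHRASES):
        return Decision.PAUSE
    if proposed_action in IMPACTFUL_ACTIONS:
        return Decision.ASK_CONFIRMATION
    return Decision.PROCEED


print(check_step("buy groceries", "Fresh eggs, 12 count", "purchase"))
```

The ordering here is a design assumption: a harmful request is refused before any page is inspected, and a suspicious page pauses the agent before it can reach a purchase step.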

How strong is Operator’s performance today, based on reported benchmarks?

Operator is a research preview and is not perfect. On OSWorld (navigating common operating systems), CUA scored 38.1%, compared with 72.4% human performance. On WebArena (navigating common websites), CUA scored 58.1%, again below human performance. In both benchmarks, the agent receives only the universal screen interface (screenshots), not extra structured page data or clickable metadata.
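To make the OSWorld gap concrete, the two reported figures can be compared directly. The short calculation below uses only the numbers quoted above; the variable names are illustrative. No human baseline is quoted here for the website benchmark, so that comparison is left out.

```python
# Reported OSWorld success rates, in percent.
cua_osworld = 38.1
human_osworld = 72.4

# Absolute gap in percentage points, and CUA's score relative to the human baseline.
gap_points = round(human_osworld - cua_osworld, 1)
relative_to_human = round(100 * cua_osworld / human_osworld, 1)

print(f"Gap to human: {gap_points} percentage points")
print(f"CUA reaches {relative_to_human}% of the human success rate")
```

In other words, on OSWorld the model completes a little over half as many tasks as the human baseline, leaving a gap of 34.3 percentage points.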

Review Questions

  1. What loop does Operator follow to decide and verify each action on a website?
  2. How do confirmations and takeover reduce risk when an agent might be wrong?
  3. Why does training CUA to use keyboard/mouse and screenshots matter for scaling across websites?

Key Points

  1. Operator is an early research preview agent that completes tasks by controlling a remote web browser using screenshots plus keyboard and mouse.
  2. Operator’s core workflow is a repeated cycle: observe the screen, plan the next action, execute it, then re-screenshot to confirm progress.
  3. A human-in-the-loop design includes confirmations before impactful actions and a takeover mode for private user intervention.
  4. Operator is powered by Computer Using Agent (CUA), built off GPT-4o, trained to operate computers without relying on website APIs.
  5. Safety is handled through layered mitigations targeting harmful requests, model mistakes, and suspicious or fraudulent websites (including a prompt injection monitor).
  6. Reported benchmark results show CUA outperforming other published systems but still trailing human performance on OSWorld and WebArena.
  7. Operator begins rolling out to Pro users in the United States first, with broader availability and an API planned in the coming weeks.

Highlights

Operator can book, shop, and purchase by navigating real websites through a remote browser session—clicking and typing based on what appears on screen.
CUA is designed to work without website APIs by learning to operate computers via the universal interface of pixels plus keyboard and mouse.
Before irreversible actions, Operator asks for confirmation; if users take over, the session is described as private during that period.
Safety is framed around misalignment: user intent, agent correctness, and website trustworthiness—each addressed with specific mitigations.
On OSWorld and WebArena, CUA scores 38.1% and 58.1% respectively, indicating real capability but still significant room to improve.

Topics

Mentioned

  • AI
  • GPT-4o
  • CUA
  • OSWorld