Get AI summaries of any video or article — Sign up free
Introduction to ChatGPT agent thumbnail

Introduction to ChatGPT agent

OpenAI·
6 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

ChatGPT agent unifies deep research-style browsing, operator-style interactive web actions, and terminal-based computation into one long-running workflow.

Briefing

ChatGPT agent is positioned as a unified “do-the-work” system that can plan, browse, and act across a long task horizon—using a virtual computer, a text browser, a visual (GUI) browser, and a terminal—so users can go from ideas to completed outcomes like bookings, documents, and shopping. The core shift is consolidation: instead of choosing between separate capabilities for deep web research and interactive web actions, the agent switches among tools as needed, runs code, calls APIs, and produces artifacts such as spreadsheets and slide decks.

A live demo shows the agent handling wedding planning end-to-end. After users enter a prompt with constraints (dress code, venue, weather, and gift considerations), the agent takes a few seconds to set up its environment, then begins a multi-step workflow that includes asking for clarification when needed, using a text browser to scan information efficiently, and switching to a visual browser to interact with web interfaces—clicking, filling forms, and verifying that suit choices look right. The workflow continues over time, with the interface showing the agent’s computer screen and an overlay of its internal reasoning as it decides what to do next.

Under the hood, the system is built around a unified tool box: a text browser for fast reading and searching across many pages, a visual browser for interacting with page UI elements, and a terminal for running code and generating/analyzing files. Through the terminal, the agent can call public APIs and—when explicitly connected—APIs tied to private data sources such as Google Drive, Google Calendar, GitHub, and SharePoint. It can also use an image generation API to create visuals for deliverables like slide decks.

Tool selection is trained using reinforcement learning on “hard tasks” designed to force the model to use all capabilities. Early in training, the model may try to use every tool even for simple problems, but rewards for correct and efficient completion drive smarter switching. Examples given include a restaurant booking flow that starts with text browsing, then moves to visual browsing for photos and availability checks, and a creative artifact workflow that searches online first, then uses the terminal for code-based compilation and verification.

The rollout is framed as a merger of earlier OpenAI agentic products: operator (for online actions like reservations and emails) and deep research (for in-depth reporting). The rationale is complementarity—operator struggles with long article reading while deep research is weaker at interactive, highly visual web pages and authenticated access—so the agent combines strengths and adds a longer-running, universal-task orientation.

Capability claims are backed by benchmark results shared in a meta-demo where the agent generates PowerPoint slides from Google Drive data. Reported gains include near-doubling performance on Humanity’s Last Exam when tool use is enabled (42%), state-of-the-art performance on Front TMS (27%) with full tool access, strong results on WebArena (improved over a prior O3-based model), and high pass rates on BrowseComp (69%). Spreadsheet editing performance is also described as improving when the agent receives raw Excel files via the terminal.

Security is treated as a central caveat. The system faces prompt injection risks—malicious websites trying to trick an agent into entering sensitive information. Mitigations include training the model to ignore suspicious instructions, layered monitoring that can stop trajectories, and the ability to update defenses as new attacks emerge. Users are encouraged to be cautious with sensitive data and to use takeover modes for direct input when appropriate.

The product is described as launching for Pro Plus and team users immediately, with enterprise and edu targeted by month’s end, alongside an expectation that safety controls and capabilities will be refined as adoption grows.

Cornell Notes

ChatGPT agent is built to complete real tasks over a long time horizon by combining multiple tools in one workflow: a text browser for fast research, a visual (GUI) browser for interacting with web pages, and a terminal for running code, generating files, and calling APIs. Reinforcement learning on “hard tasks” teaches the model when to switch tools and how to coordinate browsing, computation, and artifact creation. Demos show it planning and executing wedding logistics, producing sticker designs and ordering workflows, and generating PowerPoint slides from connected Google Drive data. Performance claims include large gains on intelligence and agent benchmarks when tool use is enabled, alongside safety measures to reduce prompt-injection risks. The system is powerful but introduces a new attack surface, so users are urged to handle sensitive information carefully.

What makes ChatGPT agent different from using separate research and action tools?

It unifies capabilities into one agent that can move between a text browser, a visual browser, and a terminal inside a single virtual computer environment. In practice, that means it can read many pages quickly (text browsing), interact with page UI elements like forms and clickable components (visual browsing), and run code or generate artifacts like spreadsheets and slide decks (terminal). The agent also calls APIs—public ones and private-data APIs only after explicit connections—so it can both gather information and act on it.

How does the system decide which tool to use at each step?

Tool choice is trained with reinforcement learning using hard tasks that require all capabilities. Early training may overuse tools, but rewards for correct and efficient completion push the model toward smarter switching. The transcript gives examples: for restaurant booking, it typically starts with text browsing to find candidates, then switches to visual browsing to check photos and availability and complete the reservation. For creative artifacts, it searches online first, uses the terminal to compile code-based outputs, and then verifies results in the GUI environment.

Why is “interruptibility” emphasized for long-running agent tasks?

Long trajectories can take 15–30 minutes depending on complexity, so users need a way to redirect the work midstream. The agent is trained to support multi-turn collaboration: it can ask clarifying questions, request confirmation at key steps, and acknowledge interruptions so it can incorporate new instructions without restarting. The demo shows this when the user adds a request for men’s black shoes size 9.5 while the agent is already working on earlier shopping items.

What safety problem is highlighted, and what defenses are described?

The main risk discussed is prompt injection, where a malicious website tries to override the agent’s instructions—e.g., steering it to enter credit card information on a fake page. Defenses include training the model to ignore suspicious instructions on suspicious sites, layered monitors that watch the agent’s actions and stop trajectories when something looks wrong, and the ability to update those defenses in real time as new attacks are discovered. Users are also encouraged to avoid sharing highly sensitive information and to use takeover modes for direct sensitive input.

What benchmark results are used to support the agent’s capability claims?

Several benchmarks are cited with tool-enabled improvements. Humanity’s Last Exam is reported as nearly doubling with tool access (42%). Front TMS is described as reaching new state-of-the-art performance (27%) when the agent has all tools (browser, computer, terminal). For agentic web tasks, WebArena shows improvement over a prior O3-based model, and BrowseComp is reported at a 69% pass rate. Spreadsheet Bench performance is described as rising from 30% to 45% when the agent receives the raw Excel file via the terminal.

How does the agent produce deliverables like slides and spreadsheets?

It uses the terminal to run code and generate/analyze files, including slide decks and spreadsheets. It can also call an image generation API to create visuals for slide decorations. In the meta-demo, the agent connects to Google Drive via the Google Drive API, reads relevant content, writes code to compile the final PowerPoint, and then refines its output by reviewing and improving its own results before producing a downloadable PowerPoint file.

Review Questions

  1. How do the text browser, visual browser, and terminal each contribute to completing a multi-step web task?
  2. What training approach is used to teach the agent when to switch tools, and why are “hard tasks” important?
  3. What is prompt injection in the context of AI agents, and what mitigations are described to reduce its impact?

Key Points

  1. 1

    ChatGPT agent unifies deep research-style browsing, operator-style interactive web actions, and terminal-based computation into one long-running workflow.

  2. 2

    Tool selection is trained via reinforcement learning on tasks that force the model to use a text browser, GUI browser, and terminal in the right sequence.

  3. 3

    The agent can call public APIs and, when explicitly connected, private-data APIs such as Google Drive, Google Calendar, GitHub, and SharePoint.

  4. 4

    Multi-turn collaboration is a core feature: the agent can ask clarifying questions, request confirmation at important steps, and accept user interruptions mid-trajectory.

  5. 5

    The system can generate and refine artifacts like spreadsheets and PowerPoint slides, including using an image generation API for visuals.

  6. 6

    Security risk is treated as fundamental: prompt injection can trick agents into unsafe actions, so layered monitoring and user caution (especially with sensitive data) are emphasized.

  7. 7

    Rollout begins for Pro Plus and team users immediately, with enterprise and edu targeted by the end of the month.

Highlights

The agent can switch between a text browser for efficient reading and a visual browser for UI interaction, enabling end-to-end tasks like wedding planning rather than just research summaries.
Reinforcement learning on “hard tasks” is used to teach when to use each tool—so the system learns tool choice, not just tool access.
Prompt injection is singled out as a key threat for agentic browsing, with mitigations including suspicious-instruction filtering and trajectory-stopping monitors.

Topics

Mentioned