Introduction to Deep Research
Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.
Deep research is an agentic web-browsing capability that can run for 5–30 minutes to produce comprehensive, fully cited research reports.
Briefing
OpenAI is rolling out “Deep research,” a new agentic capability that can browse the internet for many minutes, synthesize what it finds, and return a comprehensive, fully cited research report—essentially turning open-ended web searching into analyst-style deliverables. The key shift is that Deep research removes the usual speed/latency expectations placed on models, allowing it to run for roughly 5 to 30 minutes so it can plan, gather evidence across multiple sources, and adapt its approach as new information appears.
The motivation traces back to OpenAI’s o-series reasoning models, including o1, which can “think for a long time” but previously lacked crucial tool access—especially reliable internet browsing. Deep research fills that gap by conducting multi-step research on the web: it discovers relevant content, synthesizes it, and reasons over it, updating its plan as it uncovers more. The output is positioned as something close to what an analyst or domain expert might produce, with citations and structured formatting rather than a quick summary.
The rollout is tied to practical use cases across knowledge work and beyond. For work, Deep research is framed as a way to reduce the manual labor of gathering and reconciling information—such as market research, academic literature review, or building slide-ready content. For personal tasks, it’s pitched as a tool for high-stakes purchases and planning: for example, researching skis for conditions in Japan, then producing recommendations with a table comparing options.
A live walkthrough in ChatGPT shows how Deep research handles ambiguity. When given a complex prompt—like analyzing iOS and Android adoption rates, language-learning interest, and mobile penetration changes across developed versus developing countries—it first asks clarifying questions about assumptions (overall vs. category-specific adoption, how to interpret “mobile penetration,” and whether to focus on general or engaged interest). After requirements are set, it begins browsing and reasoning under the hood, opening pages and extracting information from multiple formats including images, tables, and PDFs. It also uses information from one search step to guide subsequent searches.
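The clarify-then-iterate behavior described above can be sketched as a simple loop. Everything here is an illustrative stand-in (the function names, the stubbed search results, the hard-coded assumptions), not OpenAI's actual implementation; the point is only the control flow: pin down assumptions first, then let each search step's findings shape the next query.

```python
# Hypothetical sketch of the clarify -> search -> extract -> refine loop.
# All names and data are illustrative stand-ins, not a real agent API.

def clarify(prompt: str) -> dict:
    """Stand-in for the up-front clarifying-question step:
    fix assumptions before any browsing happens."""
    return {
        "adoption_scope": "overall",           # vs. category-specific
        "mobile_penetration": "unique users",  # how to interpret the metric
        "interest": "engaged",                 # vs. general interest
    }

def search(query: str) -> list[str]:
    """Stand-in for one web-search step returning page snippets."""
    return [f"snippet about {query[:40]}"]

def extract_findings(snippets: list[str]) -> list[str]:
    """Stand-in for pulling facts out of pages, tables, PDFs, and images."""
    return [s.strip() for s in snippets]

def research(prompt: str, max_steps: int = 3) -> list[str]:
    assumptions = clarify(prompt)  # requirements are set before browsing
    findings: list[str] = []
    query = prompt
    for _ in range(max_steps):
        snippets = search(query)
        findings.extend(extract_findings(snippets))
        # Key behavior: results from one step guide the next search.
        query = f"{prompt} given {findings[-1]}"
    return findings

report = research("iOS vs Android adoption in developing countries")
```

One finding accumulates per step, and each query is rewritten in light of what was just found, which is the adaptive-plan behavior the walkthrough highlights.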
Under the hood, Deep research is powered by a fine-tuned version of the soon-to-be-released o3 reasoning model. Training used end-to-end reinforcement learning on “hard browsing” and other reasoning tasks, teaching the system to plan and execute multi-step trajectories, react to real-time information, and backtrack when needed. The model can browse over user-uploaded files, use a Python tool for calculations and generating plots, embed those plots in its final response, and embed images from websites. Citations are described as sentence- and passage-level.
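The Python tool mentioned above can be pictured as a tool-dispatch step: the model emits a tool name plus an argument, and the runtime executes it and feeds the result back. This is a minimal, hypothetical sketch under that assumption; the real interface is not public, and a production system would run the code in a sandbox rather than `eval` it.

```python
# Minimal sketch of a tool-dispatch step for model-requested calculations.
# Names are illustrative; the real agent's tool interface is not public.

def run_python(code: str) -> str:
    """Execute a calculation the model requested and return the result.
    (eval is for illustration only; a real system would sandbox this.)"""
    return str(eval(code))

TOOLS = {"python": run_python}

def handle_tool_call(name: str, argument: str) -> str:
    return TOOLS[name](argument)

# e.g. the model asks for a growth-rate calculation mid-report:
result = handle_tool_call("python", "round((1.27 / 1.05 - 1) * 100, 1)")
print(result)  # 21.0
```

The same dispatch pattern extends to the other tools described (file browsing, plot generation), with the outputs embedded back into the final report.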
Performance claims include a new high of 26.6% accuracy on the “Humanity’s Last Exam” benchmark from the Center for AI Safety and Scale AI, plus strong results on other evaluations requiring web browsing, multimodal capability, code execution, and reasoning over files. Internal expert evaluations emphasize that task success correlates more with economic value than with raw time-to-complete, and that allowing more tool calls improves performance. Even so, the team warns that hallucinations remain possible, urging users to verify sources.
Deep research launches later today in Pro, with plans to roll out to Plus and Team next, followed by Education and Enterprise. Longer term, OpenAI points to an AGI roadmap where agents can run longer and connect to custom context—such as enterprise data stores—so the same browsing-and-synthesis agent can operate on proprietary knowledge as well as public web content.
Cornell Notes
Deep research is an agentic capability that can browse the internet for 5–30 minutes, then synthesize findings into a comprehensive, fully cited research report. It’s designed to overcome a key limitation of earlier reasoning models: strong “thinking” without reliable access to web tools. In ChatGPT, it can ask clarifying questions up front, then iteratively search, open pages, extract information from multiple formats (including tables, PDFs, and images), and adapt its plan as new evidence appears. Powered by a fine-tuned o3 reasoning model trained with end-to-end reinforcement learning on hard browsing, it can also run calculations via Python and embed plots and images in final outputs. The rollout begins in Pro, with broader availability planned, and the longer-term goal is connecting such agents to custom enterprise context.
- What problem does Deep research solve compared with earlier reasoning models like o1?
- Why does Deep research allow long runtimes (5–30 minutes), and what changes as a result?
- How does Deep research handle ambiguous or underspecified requests?
- What does Deep research do “under the hood” while browsing?
- What tools and capabilities does Deep research have beyond web browsing?
- How is Deep research evaluated, and what cautions come with the performance claims?
Review Questions
- When would you want to use Deep research’s clarifying-question step, and what kinds of assumptions should you specify to get better outputs?
- How do long runtimes (5–30 minutes) change the research workflow compared with typical fast Q&A models?
- What additional capabilities (beyond browsing) does Deep research have for producing reports, calculations, and visualizations?
Key Points
1. Deep research is an agentic web-browsing capability that can run for 5–30 minutes to produce comprehensive, fully cited research reports.
2. The main upgrade over earlier reasoning models is tool access—especially multi-step internet browsing—so the system can gather evidence rather than rely on memory alone.
3. Deep research can ask clarifying questions first, then iteratively search, open pages, extract information from multiple formats, and adapt its plan as it learns more.
4. Deep research is powered by a fine-tuned version of the soon-to-be-released o3 reasoning model trained with end-to-end reinforcement learning on hard browsing and reasoning tasks.
5. The system can browse user-uploaded files, use a Python tool for calculations and plots, and embed both plots and website images in final outputs.
6. Reported benchmark performance includes 26.6% accuracy on Humanity’s Last Exam, with additional gains on evaluations requiring browsing, multimodal understanding, and code execution.
7. OpenAI warns that hallucinations remain possible, so users should verify citations when accuracy matters.