
Deep Research by OpenAI - The Ups and Downs vs DeepSeek R1 Search + Gemini Deep Research

AI Explained · 5 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Deep research’s strongest early advantage is web-based needle-in-a-haystack retrieval and synthesis, particularly for obscure knowledge tasks.

Briefing

OpenAI’s newly released “Deep research” agent—built on its most powerful o3 model—delivers a noticeable leap in web-based, needle-in-a-haystack research, especially on benchmarks that reward stitching together obscure facts. In early hands-on testing, it surged from low baseline performance on a key “useful assistant” benchmark to roughly 67–72%, while still trailing human experts who can reach about 92% when they put in effort. The practical takeaway: for hard, information-dense questions that require searching, filtering, and synthesizing, Deep research often finds the right trail faster than competing systems.

That strength showed up most clearly on “Humanity’s Last Exam,” a benchmark centered on obscure knowledge and the ability to assemble scattered details into correct answers. With web access enabled, Deep research’s performance jumped sharply, reinforcing the idea that its core advantage is not just answering, but hunting down niche sources and assembling them into a coherent response. The same pattern appeared in a real-world-style test using a newsletter archive: Deep research located posts where a dice rating changed to 5 or higher and then extracted and summarized the relevant sections—exactly the kind of targeted retrieval that would otherwise require manual scanning.

The caveat is that the agent’s reliability is uneven, and its interaction style can be frustrating. On a “common sense / spatial reasoning” style mini-benchmark, it repeatedly asked clarifying questions instead of directly answering, and the results showed little improvement—suggesting it can struggle to “grasp the real world” even when it can browse. More importantly, multiple shopping and fact-check tests raised concerns about hallucinations and citation fidelity. In one case, Deep research claimed it visited a specific price-history site, yet the returned price history did not match what the site actually showed. In another, it produced plausible-sounding details that were wrong, and DeepSeek R1 with search hallucinated even more aggressively (including inventing battery-life claims).

When Deep research was compared against DeepSeek R1 with search and Google’s Gemini Deep research, the overall pattern favored OpenAI for task completion—often “better pretty much every time”—but not for factual cleanliness. DeepSeek R1 with search was less annoying in conversation (fewer clarifying-question loops) but still produced errors, including fabricated specifics. Gemini Deep research, in these tests, failed to retrieve relevant newsletter data at all.

The transcript’s broader warning is that small, repeated hallucinations remain a thin line of defense for white-collar workflows. Even when models produce deep, well-structured research—like analyzing dozens of references in a research paper—incorrect details can slip in, and sometimes the summary language treats hypothetical or uncertain information as if it were factual. The result is a system that can dramatically accelerate research and synthesis, but still demands verification when decisions depend on accuracy.

Cornell Notes

OpenAI’s Deep research agent, powered by o3, is positioned as a web-enabled “researcher” that can find and synthesize niche information. In early tests, it performed strongly on benchmarks that reward obscure knowledge retrieval and on practical tasks like scanning a newsletter archive for specific rating changes. It also reached about 67–72% on a benchmark of usefulness, still below human performance (~92%) when humans work carefully. The main limitation is reliability: it can hallucinate details, sometimes even when it claims to have visited a source, and it may get stuck in clarifying-question loops on reasoning-style tasks.

What benchmarks and metrics were used to judge Deep research, and what did the results imply?

Two benchmark themes drove the evaluation. On “Humanity’s Last Exam,” which tests obscure knowledge assembly, Deep research’s web-enabled performance jumped substantially—supporting the idea that it’s especially strong at finding niche facts and stitching them together. On a “useful assistant” benchmark (with level-3 research tasks), Deep research scored roughly 72–73% when selecting the best answer, and about 67% even when taking only the first answer. Humans still led at about 92%, meaning the agent is improving fast but not yet matching careful human judgment.
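The two scoring modes described above can be illustrated with a small sketch: "first answer" grades only a task's first attempt, while "best answer" counts a task as solved if any attempt is correct. The attempt data below is invented for illustration, not from the benchmark itself.

```python
def first_answer_accuracy(attempts_per_task):
    # Fraction of tasks whose very first attempt was correct.
    return sum(a[0] for a in attempts_per_task) / len(attempts_per_task)

def best_answer_accuracy(attempts_per_task):
    # Fraction of tasks where at least one attempt was correct.
    return sum(any(a) for a in attempts_per_task) / len(attempts_per_task)

# Each inner list holds pass/fail results for successive attempts at one task.
attempts = [[True, True], [False, True], [True, False], [False, False]]
print(first_answer_accuracy(attempts))  # 0.5
print(best_answer_accuracy(attempts))   # 0.75
```

This is why the "best answer" figure (72–73%) sits above the "first answer" figure (~67%): the second metric is strictly no harder to satisfy.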

Why did the common-sense/spatial-reasoning test underperform, despite Deep research’s web strength?

A simple reasoning benchmark that should test spatial/common-sense behavior didn’t work well because Deep research repeatedly asked clarifying questions instead of giving direct answers. Across about eight questions, there was little evidence of improvement, and the interaction became a loop where the user effectively had to solve the puzzle for it. The transcript frames this as either a sign of overly cautious behavior or a failure to internalize real-world reasoning.

How did Deep research perform on a real information-retrieval task involving newsletter posts and dice ratings?

Deep research was tested against DeepSeek R1 and Gemini Deep research using a newsletter archive (“Signal to Noise”). The task: find posts where a dice rating changes to 5 or above, then extract the relevant sections. Deep research succeeded—identifying the posts with dice ratings ≥5 and summarizing the exact sections. DeepSeek R1 with search was described as finding no such entries during the test window, while Gemini Deep research failed to find dice ratings at all for the newsletters.
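The manual scan that Deep research automated here can be sketched as a few lines of Python. The `"Dice rating: N"` text pattern and the sample posts are assumptions for illustration; the real archive's format may differ.

```python
import re

# Assumed pattern for how a rating appears in a post body.
RATING = re.compile(r"dice rating[:\s]+(\d+)", re.IGNORECASE)

def posts_with_high_ratings(posts, threshold=5):
    """Return (title, rating, excerpt) for posts rated at or above threshold."""
    hits = []
    for title, body in posts:
        for match in RATING.finditer(body):
            rating = int(match.group(1))
            if rating >= threshold:
                # Keep a short window of text around the match as the
                # "relevant section" to summarize.
                start = max(0, match.start() - 80)
                excerpt = body[start:match.end() + 80].strip()
                hits.append((title, rating, excerpt))
    return hits

# Invented sample data standing in for the newsletter archive.
posts = [
    ("Post A", "Minor update this week. Dice rating: 3 stays unchanged."),
    ("Post B", "Big release from OpenAI. Dice rating: 5, up from last issue."),
]
print(posts_with_high_ratings(posts))
```

The point of the test was precisely that this filtering-and-excerpting step, trivial to script when the format is known, had to be inferred by the agents from natural-language instructions and a live archive.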

What reliability problems showed up in shopping and citation checks?

Multiple checks pointed to hallucinations and citation mismatch. Deep research claimed it used camelcamelcamel for price-history research, but the linked results did not correspond to that site’s actual historical low for a specific toothbrush. It also produced incorrect details in summaries. DeepSeek R1 with search was reported to hallucinate battery life (e.g., claiming ~70 days when the real figure was closer to 30–35). The transcript emphasizes that even small errors can be dangerous when used for decision-making.

How did the transcript reconcile “deep, useful research” with the risk of hallucinations?

The core tension is that the agent can be excellent at locating and synthesizing information, yet still produce repeated small inaccuracies. The transcript notes that even when outputs are deep—such as analyzing many references in a paper—hallucinated details can appear. That means the system can accelerate work, but verification remains necessary, especially for tasks where correctness matters.

Review Questions

  1. In the usefulness benchmark described, what were the approximate scores for Deep research and for humans, and what does that gap suggest about real-world reliability?
  2. What kinds of failures were observed in reasoning/spatial tasks versus web-based retrieval tasks?
  3. Give one example of a citation/price-history mismatch reported for Deep research and one fabricated detail reported for DeepSeek R1 with search. What was wrong in each case?

Key Points

  1. Deep research’s strongest early advantage is web-based needle-in-a-haystack retrieval and synthesis, particularly for obscure knowledge tasks.

  2. On a usefulness benchmark, Deep research reached roughly 67–72% depending on answer selection, while humans reached about 92% with effort.

  3. Reasoning-style tests showed weaknesses, including repeated clarifying-question loops that can prevent direct answers.

  4. Citation fidelity and factual accuracy remain inconsistent; Deep research can claim it visited a source while returning details that don’t match the source’s actual data.

  5. DeepSeek R1 with search may be less conversationally annoying but still hallucinates concrete details (e.g., product specs) in shopping scenarios.

  6. Gemini Deep research underperformed in these tests on the newsletter retrieval task, failing to find relevant data.

  7. For white-collar workflows, small hallucinations are still a critical risk even when research outputs look deep and well structured.

Highlights

Deep research’s web-enabled performance jumped sharply on an obscure-knowledge benchmark, reinforcing that its edge is targeted retrieval and synthesis.
Even with strong benchmark gains, humans still lead on usefulness tasks (about 92% vs ~67–72%), implying careful human judgment remains hard to replace.
Shopping and citation checks exposed mismatches—Deep research can return incorrect price-history details despite claiming it used a specific site.
DeepSeek R1 with search produced concrete fabricated product facts (like battery life), showing hallucinations can be operationally harmful.
The transcript frames the current moment as fast progress with a persistent “thin line” of hallucination risk for decision-making work.

Topics

Mentioned

  • Yann LeCun
  • Philip Wang
  • o3
  • GPT
  • GPT-4
  • GPT-4 Turbo
  • AGI
  • LLM
  • VPN
  • R1
  • GPT-4o
  • GPT-4 with search