Deep Research by OpenAI - The Ups and Downs vs DeepSeek R1 Search + Gemini Deep Research
Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
OpenAI’s newly released “Deep research” agent, built on its most powerful o3 model, delivers a noticeable leap in web-based, needle-in-a-haystack research, especially on benchmarks that reward stitching together obscure facts. In early hands-on testing, it surged from low baseline performance to roughly 67–72% on a key “useful assistant” benchmark (GAIA), while still trailing human experts, who can reach about 92% when they put in effort. The practical takeaway: for hard, information-dense questions that require searching, filtering, and synthesizing, Deep research often finds the right trail faster than competing systems.
That strength showed up most clearly on “Humanity’s Last Exam,” a benchmark centered on obscure knowledge and the ability to assemble scattered details into correct answers. With web access enabled, Deep research’s performance jumped sharply, reinforcing the idea that its core advantage is not just answering, but hunting down niche sources and assembling them into a coherent response. The same pattern appeared in a real-world-style test using a newsletter archive: Deep research located posts where a dice rating changed to 5 or higher and then extracted and summarized the relevant sections, exactly the kind of targeted retrieval that would otherwise require manual scanning.
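To give a rough sense of what that manual scan would otherwise involve, here is a minimal scripted sketch. The archive layout (one plain-text post per file), the folder name, and the exact “dice rating: N” wording are all assumptions for illustration, not details confirmed by the video.

```python
import re
from pathlib import Path

# Assumed format: the newsletter's actual wording and file layout may differ.
RATING = re.compile(r"dice rating[:\s]+(\d)", re.IGNORECASE)

def posts_with_high_ratings(archive_dir: str, threshold: int = 5):
    """Yield (filename, rating, surrounding text) for posts rated at or above the threshold."""
    for post in sorted(Path(archive_dir).glob("*.txt")):
        text = post.read_text(encoding="utf-8")
        for match in RATING.finditer(text):
            rating = int(match.group(1))
            if rating >= threshold:
                # Keep a little context around the rating so it can be skimmed or summarized.
                start, end = max(0, match.start() - 200), match.end() + 200
                yield post.name, rating, text[start:end].strip()

if __name__ == "__main__":
    for name, rating, snippet in posts_with_high_ratings("archive"):
        print(f"{name}: rated {rating}\n{snippet}\n")
```

The point of the comparison is that Deep research performed this kind of filtering and summarization directly from a prompt, without anyone writing or running such a script.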
The caveat is that the agent’s reliability is uneven, and its interaction style can be frustrating. On a “common sense / spatial reasoning” style mini-benchmark, it repeatedly asked clarifying questions instead of directly answering, and the results showed little improvement—suggesting it can struggle to “grasp the real world” even when it can browse. More importantly, multiple shopping and fact-check tests raised concerns about hallucinations and citation fidelity. In one case, Deep research claimed it visited a specific price-history site, yet the returned price history did not match what the site actually showed. In another, it produced plausible-sounding details that were wrong, and DeepSeek R1 with search hallucinated even more aggressively (including inventing battery-life claims).
When Deep research was compared against DeepSeek R1 with search and Google’s Gemini Deep research, the overall pattern favored OpenAI for task completion—often “better pretty much every time”—but not for factual cleanliness. DeepSeek R1 with search was less annoying in conversation (fewer clarifying-question loops) but still produced errors, including fabricated specifics. Gemini Deep research, in these tests, failed to retrieve relevant newsletter data at all.
The transcript’s broader warning is that small, repeated hallucinations are still the thin line separating impressive-looking output from work that white-collar professionals can actually rely on. Even when models produce deep, well-structured research, such as analyzing dozens of references in a research paper, incorrect details can slip in, and sometimes the summary language treats hypothetical or uncertain information as if it were factual. The result is a system that can dramatically accelerate research and synthesis, but still demands verification when decisions depend on accuracy.
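One lightweight way to act on that “verify before relying on it” advice is to spot-check any figure the agent attributes to a specific page. The sketch below is an assumption-laden illustration: it presumes the cited URL is reachable and that the figure appears verbatim in the page source, which will often not hold for JavaScript-rendered price-history charts.

```python
import re
import urllib.request

def figure_appears_on_page(url: str, figure: str, timeout: int = 10) -> bool:
    """Return True if the quoted figure (e.g. a price such as '149.99') occurs in the page's raw HTML.

    This is a crude check: it misses values rendered client-side or formatted differently,
    so a False result means "verify by hand", not "the citation is wrong".
    """
    req = urllib.request.Request(url, headers={"User-Agent": "citation-spot-check"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    # Strip thousands separators so "1,299" and "1299" both match.
    normalized = re.sub(r"(?<=\d),(?=\d)", "", html)
    return figure in normalized

# Hypothetical usage with a URL and price taken from an agent's citation:
# print(figure_appears_on_page("https://example.com/price-history/widget", "149.99"))
```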
Cornell Notes
OpenAI’s Deep research agent, powered by o3, is positioned as a web-enabled “researcher” that can find and synthesize niche information. In early tests, it performed strongly on benchmarks that reward obscure knowledge retrieval and on practical tasks like scanning a newsletter archive for specific rating changes. It also reached about 67–72% on a benchmark of usefulness, still below human performance (~92%) when humans work carefully. The main limitation is reliability: it can hallucinate details, sometimes even when it claims to have visited a source, and it may get stuck in clarifying-question loops on reasoning-style tasks.
What benchmarks and metrics were used to judge Deep research, and what did the results imply?
Why did the common-sense/spatial-reasoning test underperform, despite Deep research’s web strength?
How did Deep research perform on a real information-retrieval task involving newsletter posts and dice ratings?
What reliability problems showed up in shopping and citation checks?
How did the transcript reconcile “deep, useful research” with the risk of hallucinations?
Review Questions
- In the usefulness benchmark described, what were the approximate scores for Deep research and for humans, and what does that gap suggest about real-world reliability?
- What kinds of failures were observed in reasoning/spatial tasks versus web-based retrieval tasks?
- Give one example of a citation or price-history mismatch reported for Deep research and one fabricated detail reported for DeepSeek R1 with search. What was wrong in each case?
Key Points
1. Deep research’s strongest early advantage is web-based needle-in-a-haystack retrieval and synthesis, particularly for obscure knowledge tasks.
2. On a usefulness benchmark, Deep research reached roughly 67–72% depending on answer selection, while humans reached about 92% with effort.
3. Reasoning-style tests showed weaknesses, including repeated clarifying-question loops that can prevent direct answers.
4. Citation fidelity and factual accuracy remain inconsistent; Deep research can claim it visited a source while returning details that don’t match the source’s actual data.
5. DeepSeek R1 with search may be less conversationally annoying but still hallucinates concrete details (e.g., product specs) in shopping scenarios.
6. Gemini Deep research underperformed in these tests on the newsletter retrieval task, failing to find relevant data.
7. For white-collar workflows, small hallucinations are still a critical risk even when research outputs look deep and well structured.