
The World isn't Ready for AI this Capable.. Dive into Open AI o3 mini & Deep Research

MattVidPro · 6 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

o3 mini is positioned as an efficient successor to the o1 line, delivering stronger chain-of-thought reasoning and better STEM/coding performance at lower compute cost.

Briefing

OpenAI’s latest push—o3 mini plus a new “Deep Research” agent—signals a shift from simply scaling model size toward using reasoning and tool-driven web synthesis to get dramatically better results. The headline is performance: o3 mini is positioned as nearly 10x better than GPT-4o on a hard benchmark, and Deep Research is described as the framework behind that jump, capable of multi-step research that pulls from hundreds of online sources to produce analyst-style reports.

o3 mini arrives in multiple variants inside ChatGPT and via the API, with the key tradeoff being compute efficiency without giving up reasoning quality. OpenAI frames o3 mini as a successor to the o1 series, using chain-of-thought style reasoning while improving cost-effectiveness. It also emphasizes stronger STEM performance—science, math, and coding—and adds “developer features” such as function calling, structured outputs, and developer messages. Access is tiered: Plus, Team, and Pro users get it immediately, Enterprise access is slated for February, and free users can try it by selecting a “reason” option in the message composer. The daily message cap for these reasoning models rises from 50 to 150, reflecting the efficiency gains.
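The developer features mentioned above can be pictured as a request payload. The sketch below builds one as a plain dict, with no network call; the model name, schema, and tool definition are illustrative assumptions, not details confirmed in the video.

```python
import json

# Illustrative Chat Completions-style payload combining a developer
# message, a JSON-schema structured output, and a function tool.
# Built as a plain dict for inspection; nothing is sent anywhere.
payload = {
    "model": "o3-mini",
    "messages": [
        {"role": "developer", "content": "Answer in JSON only."},
        {"role": "user", "content": "Solve: what is 12 * 7?"},
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "math_answer",
            "schema": {
                "type": "object",
                "properties": {"answer": {"type": "integer"}},
                "required": ["answer"],
            },
        },
    },
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "calculator",
                "description": "Evaluate an arithmetic expression.",
                "parameters": {
                    "type": "object",
                    "properties": {"expression": {"type": "string"}},
                    "required": ["expression"],
                },
            },
        }
    ],
}

# Preview the serialized request body.
print(json.dumps(payload, indent=2)[:60])
```

Structured outputs constrain the model's reply to the schema, while the tool entry lets the model request a `calculator` call instead of answering directly.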

Benchmarks are used to sell the leap. On Competition Math (AMC 2024), o3 mini High posts a new high score of 87.3, with the narrator suggesting full o3 could land in the 90s. On coding, the o3 series is said to “slaughter” prior results: earlier o1 coding performance sits around 1891 Elo, while o3 mini Low trails it slightly and o3 mini Medium and High push into the 2000s. On science questions (GPQA Diamond), the o3 variants are described as closer together, with o3 mini High still leading.

Deep Research is the bigger story because it’s not just another model name—it’s an agentic workflow that combines an LLM with web search and other tools to complete multi-step tasks. It’s currently limited to Pro users, with the transcript calling out the $200/month price and frustration that Plus users can’t access it. In an example request about how retail has changed over three years, Deep Research asks follow-up questions, then spends minutes gathering and synthesizing information into a detailed report. The framework is described as operating at the level of a research analyst, using reasoning to synthesize large amounts of online information and even handling inputs beyond text, including images and PDFs.
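The plan-search-synthesize loop described above can be sketched in miniature. The function names and the tiny stub corpus below are illustrative assumptions; a real agent would call an LLM for planning and synthesis and a live web-search API for retrieval.

```python
# Minimal sketch of an agentic "deep research" loop: decompose a
# question into queries, gather snippets from a (stubbed) search
# tool, then synthesize a report. All names here are hypothetical.

STUB_INDEX = {
    "retail e-commerce share": "Online retail's share of sales rose steadily from 2021 to 2024.",
    "retail store closures": "Several large chains reduced physical store counts after 2021.",
}

def plan_queries(question: str) -> list[str]:
    # A real agent would ask an LLM to decompose the question.
    return list(STUB_INDEX)

def search(query: str) -> str:
    # Stand-in for a web-search tool call.
    return STUB_INDEX.get(query, "")

def synthesize(question: str, snippets: list[str]) -> str:
    # Stand-in for LLM synthesis: join findings into a short report.
    body = " ".join(s for s in snippets if s)
    return f"Report on {question!r}: {body}"

def deep_research(question: str) -> str:
    snippets = [search(q) for q in plan_queries(question)]
    return synthesize(question, snippets)

report = deep_research("How has retail changed over the last three years?")
print(report)
```

The step change is architectural: the loop grounds each claim in retrieved material rather than relying on what the model memorized during training.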

Comparisons sharpen the impact: Deep Research is contrasted with GPT-4o’s more generalized answers, including cases where GPT-4o is said to be wrong about a TV-show moment while Deep Research retrieves the correct detail from the web. On “Humanity’s Last Exam,” a new expert-level benchmark, Deep Research is reported at 26.6% accuracy versus GPT-4o at 3.3%, while other models—including DeepSeek R1 and Claude 3.5 Sonnet—score far lower. The transcript repeatedly ties these gains to a broader trend: complex reasoning behaviors and better outcomes are emerging from chain-of-thought methods plus tool use, not just from training ever-larger models.

Finally, hands-on demos with o3 mini show fast coding and physics/graphics generation—an autonomous snake game, a working Twitter clone in a single Python file, and multiple physics/3D demos—reinforcing the claim that smaller reasoning models can still deliver “big” capabilities quickly. The overall takeaway is that AI capability is accelerating through reasoning, retrieval, and synthesis, but access and cost remain a major gating factor for the most powerful workflow, Deep Research.

Cornell Notes

OpenAI’s o3 mini and Deep Research point to a new capability pattern: stronger reasoning plus tool-driven web synthesis. o3 mini is positioned as an efficient successor to the o1 line, delivering better STEM and coding performance while supporting developer features like function calling and structured outputs. Deep Research is described as an agentic framework (not just a model) that searches the web, performs multi-step reasoning, and compiles analyst-style reports from large numbers of sources. Reported benchmark gains are large—Deep Research reaches 26.6% accuracy on “Humanity’s Last Exam,” far above GPT-4o’s 3.3%. The practical implication is that AI can shift from answering questions to conducting research tasks that would otherwise take hours or days.

What makes o3 mini different from earlier o1-series models, beyond just being “new”?

o3 mini is framed as a smaller, more compute-efficient reasoning model that still performs chain-of-thought style thinking. It’s described as more cost-effective than prior variants while improving STEM performance (science, math, coding). It also supports developer features—function calling, structured outputs, and developer messages—so developers can build with it directly via ChatGPT and the API. Access is tiered (Plus/Team/Pro immediately; Enterprise in February; free users via a “reason” option), and the daily message cap for these reasoning models increases from 50 to 150.

How does Deep Research work, and why is it treated as a step change rather than another model release?

Deep Research is presented as an agentic framework that combines an LLM with tools like web search to complete multi-step research tasks. In the retail example, it takes a detailed prompt, asks follow-up questions, then spends minutes gathering and synthesizing information into a structured report. The transcript emphasizes that it can synthesize knowledge from hundreds of online sources and is claimed to operate at an analyst level. It’s also described as handling more than plain text—capable of working with images and PDFs—making it broader than typical chat-based Q&A.

What benchmark results are used to justify the “nearly 10x” claims?

The transcript highlights two benchmark angles. For o3 mini, Competition Math (AMC 2024) is used: o3 mini High scores 87.3, with the suggestion that full o3 could reach the 90s. For Deep Research, “Humanity’s Last Exam” is the headline: Deep Research is reported at 26.6% accuracy, while GPT-4o is at 3.3%. Other models mentioned (including DeepSeek R1 and Claude 3.5 Sonnet) are described as scoring far lower, reinforcing the claim that tool-based research improves expert-question performance.

Why do comparisons with GPT-4o matter in the transcript’s argument?

The comparisons are meant to show that Deep Research doesn’t just produce longer answers—it produces more specific, source-grounded outputs. Examples include a mobile-market analysis request where Deep Research reportedly returns tables with country market share and time-range changes (e.g., 2013–2023), while GPT-4o provides a more generalized response. Another example claims GPT-4o gives an incorrect TV-show moment, while Deep Research retrieves the correct moment from the web. The underlying point is that web retrieval and synthesis reduce reliance on training-data guesses.

What real-world use cases are cited to illustrate Deep Research’s value?

A research professional is quoted describing how Deep Research helps deliver credible, up-to-date data for niche topics—specifically semiconductor chip shortage research. The workflow includes prompting for underlying causes, affected industries, and outlook, using industry publications, consulting briefs, publicly available data, and semiconductor association data. The professional notes that Deep Research can produce a holistic view much faster than traditional research, freeing time for other tasks. Additional community examples mentioned include business/technical analysis of DeepSeek R&D history and biomedical-style summaries spanning clinical trials and related domains.

How do hands-on o3 mini demos support the claim that smaller models can still be highly capable?

The transcript lists multiple o3 mini coding and simulation demos: an autonomous competitive snake game, a Twitter clone generated as a single Python file with working sign-up and posting flows, and physics/graphics experiments like a bouncing ball inside a spinning hexagon/3D tesseract. It also mentions one-shot shader generation and even Minecraft-related building attempts. The emphasis is speed and “one-shot” success—examples are described as working on the first try or generating multiple demos quickly—suggesting practical usability beyond benchmark numbers.

Review Questions

  1. Which capabilities are attributed to o3 mini (reasoning efficiency, STEM strength, developer features), and which are attributed to Deep Research (web synthesis, multi-step research tasks, analyst-style outputs)?
  2. On “Humanity’s Last Exam,” what accuracy numbers are given for Deep Research and GPT-4o, and what does the transcript infer from the gap?
  3. What access restrictions and pricing differences are described for o3 mini versus Deep Research, and how do those constraints shape who can test the most advanced workflow?

Key Points

  1. o3 mini is positioned as an efficient successor to the o1 line, delivering stronger chain-of-thought reasoning and better STEM/coding performance at lower compute cost.

  2. o3 mini supports developer features such as function calling, structured outputs, and developer messages, making it immediately usable for application building via ChatGPT and the API.

  3. Access is tiered: Plus/Team/Pro get o3 mini first, Enterprise follows in February, and free users can try it via a “reason” option; the daily message cap rises from 50 to 150 for these reasoning models.

  4. Deep Research is described as an agentic framework that combines LLM reasoning with web search and tools to synthesize large amounts of online information into multi-step research reports.

  5. Deep Research is currently Pro-only in the transcript, with the $200/month price highlighted as a major barrier for Plus users.

  6. Reported benchmark results emphasize a large accuracy gap on “Humanity’s Last Exam” (Deep Research at 26.6% vs GPT-4o at 3.3%), alongside strong Competition Math performance for o3 mini High (87.3).

  7. Hands-on o3 mini demos emphasize fast, working code generation (including games, clones, and physics/graphics experiments), reinforcing the practical impact of the reasoning improvements.

Highlights

o3 mini High posts an 87.3 score on Competition Math (AMC 2024), with the transcript suggesting full o3 could reach the 90s.
Deep Research is framed as a tool-using research agent that can synthesize from hundreds of sources and produce analyst-style reports in minutes.
On “Humanity’s Last Exam,” Deep Research is reported at 26.6% accuracy versus GPT-4o at 3.3%, with other major models far behind.
The transcript repeatedly contrasts web-grounded specificity (tables, correct factual details) with GPT-4o’s more generalized or sometimes incorrect answers.
o3 mini demos highlight “one-shot” coding and physics/graphics generation, including a working Twitter clone in a single Python file.
