Gemini Deep Think

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Deep Think is an advanced Gemini reasoning model tied to DeepMind’s gold-medal-level IMO performance, and it’s now publicly available.

Briefing

Google’s “Deep Think” reasoning model is now publicly available after helping DeepMind reach gold-medal-level performance at the International Mathematical Olympiad (IMO), a competition long treated as a stress test for pre-university mathematical reasoning. The key takeaway isn’t just that the system solved IMO problems: it did so with a different approach than earlier DeepMind attempts, and it can take dramatically longer to produce results than standard chat-style models. That tradeoff, stronger reasoning at the cost of speed, runs through every example.

The IMO, held since 1959, asks teams of students to solve six extremely difficult problems across areas like algebra, geometry, and number theory. Each problem is scored out of seven, for a maximum of 42 points. DeepMind previously achieved a silver-equivalent result, but the comparison wasn’t considered fully “fair” because the AI system used far more time than human contestants were allowed. This year’s breakthrough is framed as more comparable: rather than relying on specialized math-checking systems such as AlphaProof or AlphaGeometry, DeepMind used an advanced Gemini variant called “Deep Think,” feeding it the IMO questions directly.

The transcript also places the IMO results in a broader AI arms race. OpenAI announced that an experimental reasoning LLM reached gold-medal-level performance, releasing proofs for the five problems it solved. Rumors at ICML in Vancouver suggested Google may also have scored gold, with the timing of announcements reportedly arranged to avoid overshadowing the human winners. Both companies publicly released only the solutions for the problems they got right, raising the question of whether the unsolved problems were simply too awkward to disclose.

To gauge Deep Think’s practical behavior, the transcript describes hands-on testing in which an IMO problem was pasted into the Gemini interface without using Lean or other formal proof tooling. The most striking observation is latency: after several minutes, no answer appeared, and even when “thinking tokens” began to stream, the final solution arrived much later—about 16 minutes in one run. Another benchmark problem from the AIME 2025 dataset showed a similar pattern: the model produced intermediate summaries that already pointed to the correct numeric answer (204), yet continued “thinking” for several additional minutes before returning the final response.

Outside math, Deep Think was tested on a prompt to generate a “salatai” scene using colorful voxels and 3D code. The result was treated as a success: the model produced usable Three.js code, rendered a navigable 3D environment, and demonstrated detailed understanding of voxel-based structure.

A final experiment involved an Angry Birds-style setup. With Deep Think enabled, the model detected missing runtime dependencies (it flagged the absence of “pygame”), iterated on the level design, and improved physics enough to score when hitting pigs. Still, the workflow required longer waits than a non-Deep-Think model, leading to skepticism about using Deep Think as a default coding engine.

Overall, Deep Think is portrayed as impressive for hard reasoning tasks—especially math and logic—but its long “parallel thinking” process makes it less practical when speed and cost matter. The transcript ends on the idea that future model development will hinge on balancing intelligence against latency and expense.

Cornell Notes

Deep Think, an advanced Gemini reasoning model, reached gold-medal-level performance on IMO-style problems and is now available publicly. Unlike earlier DeepMind approaches that leaned on specialized math systems, Deep Think feeds IMO questions directly into a Gemini variant and uses long “parallel thinking” to search for solutions. In hands-on tests, it often takes many minutes before any output appears, and even after intermediate summaries look correct, the model may continue reasoning for several more minutes. The result is strong performance on difficult math and logic, plus usable 3D code generation, but it’s slower than typical chat models—raising practical concerns for everyday coding and interactive use.

What makes the IMO such a meaningful benchmark for AI reasoning?

The International Mathematical Olympiad has run since 1959 and challenges pre-university students with six very hard problems spanning topics like algebra, geometry, and number theory. Teams get the same questions, and each problem is scored out of seven for a total maximum of 42 points. That structure makes it a high-stakes test of multi-step reasoning rather than quick pattern matching.
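
As a quick illustration of that scoring arithmetic, here is a minimal sketch; the per-problem marks are hypothetical, not actual contest results:

```python
# IMO scoring: six problems, each marked out of 7, for a 42-point maximum.
PROBLEMS = 6
MAX_PER_PROBLEM = 7
MAX_TOTAL = PROBLEMS * MAX_PER_PROBLEM  # 42

# Hypothetical marks for one contestant: five perfect solutions, one unsolved.
scores = [7, 7, 7, 7, 7, 0]
print(f"{sum(scores)}/{MAX_TOTAL}")  # 35/42
```

Fully solving five of the six problems yields 35/42, which is consistent with the five-problem framing mentioned above.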

How did Deep Think’s IMO approach differ from earlier DeepMind attempts?

Earlier work relied on specialized math systems such as AlphaProof and AlphaGeometry, including built-in checking and structured proof workflows. This year’s approach emphasized using an advanced Gemini variant (“Deep Think”) and directly inputting the IMO questions, without using formal tooling like Lean in the described testing.

Why does Deep Think feel slow in practice?

The transcript attributes the delay to “parallel thinking time,” where the model generates many reasoning chains simultaneously, then selects the most useful ones and discards the rest. That process can delay the first visible tokens by several minutes, and the final answer may arrive much later even after intermediate summaries begin streaming.
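
The transcript doesn’t describe the internals, but the general pattern it gestures at (sample several chains, keep the best) can be sketched as follows; `generate_chain` and `score_chain` are hypothetical stand-ins, not Gemini APIs:

```python
from concurrent.futures import ThreadPoolExecutor

def generate_chain(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for sampling one reasoning chain from a model."""
    return f"chain {seed}: ...reasoning about {prompt!r}..."

def score_chain(chain: str) -> float:
    """Hypothetical ranker (e.g., a verifier) scoring a finished chain."""
    return float(len(chain))  # placeholder heuristic

def parallel_think(prompt: str, n: int = 8) -> str:
    # Sample n candidate reasoning chains concurrently...
    with ThreadPoolExecutor(max_workers=n) as pool:
        chains = list(pool.map(lambda seed: generate_chain(prompt, seed), range(n)))
    # ...then keep the highest-scoring chain and discard the rest.
    return max(chains, key=score_chain)
```

The extra latency falls out of this shape: the user waits for many chains plus a selection step, not a single forward pass.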

What did the hands-on math tests reveal about latency and output timing?

In one IMO-style run, the model produced no answer for roughly 5.5 minutes, then began emitting thinking tokens, and only returned the final solution around the 16-minute mark. In an AIME 2025 benchmark example, it produced intermediate summaries that already matched the correct answer (204) but still continued reasoning for several additional minutes before delivering the final response.
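
Those timings are easy to measure for any streaming model; a minimal sketch that records time-to-first-token and total latency, with a stub generator standing in for a real model stream:

```python
import time

def stream_tokens():
    """Stub generator standing in for a real model's token stream."""
    time.sleep(2)            # simulated silence before the first visible token
    yield "thinking summary..."
    time.sleep(1)
    yield "final answer: 204"

start = time.monotonic()
first = None
for token in stream_tokens():
    if first is None:
        first = time.monotonic() - start  # time to first visible token
total = time.monotonic() - start
print(f"first token after {first:.1f}s, final output after {total:.1f}s")
```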

How did Deep Think perform on non-math tasks like 3D code generation?

When prompted to create a “salatai” scene using colorful voxels, it generated Three.js code that was tested by pasting it into an HTML page. The output was described as a success: it rendered a navigable 3D scene with a voxel-based structure and small details on individual cells.
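
That “paste into an HTML page” test loop is easy to script; a minimal sketch that wraps model-generated Three.js code in a page and opens it locally (the CDN pin and the placeholder `generated_js` are assumptions, not details from the video):

```python
import pathlib
import tempfile
import webbrowser

# Placeholder for the Three.js code returned by the model.
generated_js = "console.log('model-generated voxel scene code goes here');"

# r128 is an assumed pin, chosen because it still ships a classic
# non-module three.min.js build.
html = f"""<!DOCTYPE html>
<html><body>
<script src="https://unpkg.com/three@0.128.0/build/three.min.js"></script>
<script>{generated_js}</script>
</body></html>"""

page = pathlib.Path(tempfile.mkdtemp()) / "voxel_scene.html"
page.write_text(html)
webbrowser.open(page.as_uri())  # inspect the rendered scene in a browser
```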

What happened with the Angry Birds-style experiment, and what does it imply?

With Deep Think enabled, the model flagged a missing dependency (“pygame”), suggesting it runs or tests within a controlled environment. It then iterated on the level design to improve ball physics and scoring. However, the workflow required longer waits than a standard Gemini 2.5 Pro approach, making it less attractive as a default coding tool.
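
The “missing pygame” detection amounts to a runtime import check; a minimal sketch of that pattern (not the model’s actual harness):

```python
import importlib.util
import sys

# Verify runtime dependencies before starting the game loop.
missing = [pkg for pkg in ("pygame",) if importlib.util.find_spec(pkg) is None]
if missing:
    sys.exit(f"missing dependencies: {', '.join(missing)}; "
             f"install with: pip install {' '.join(missing)}")
```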

Review Questions

  1. What scoring structure and problem count make the IMO a particularly demanding reasoning benchmark?
  2. How does “parallel thinking time” help explain both Deep Think’s strengths and its long wait times?
  3. In the transcript’s examples, what evidence suggested that intermediate summaries could be correct before the final answer arrived?

Key Points

  1. Deep Think is an advanced Gemini reasoning model tied to DeepMind’s gold-medal-level IMO performance, and it’s now publicly available.

  2. The IMO benchmark uses six shared, hard problems with a 42-point maximum, making it a strong test of multi-step reasoning.

  3. Deep Think’s IMO method is described as directly processing IMO questions in Gemini rather than relying on specialized proof-checking systems like AlphaProof or AlphaGeometry.

  4. Hands-on tests show long latency: first visible output can take several minutes, and final answers may arrive around 10–20 minutes for hard problems.

  5. Deep Think can generate usable 3D code (Three.js) for voxel-based scenes, not just solve math.

  6. In an Angry Birds-style task, Deep Think improved level design and physics but required longer interaction cycles than faster models.

  7. A central practical challenge is balancing intelligence against speed and cost for real-world use.

Highlights

  • Deep Think’s “parallel thinking” can delay the first tokens for minutes, and final answers may take about 16 minutes even when intermediate reasoning starts earlier.
  • Intermediate summaries can already point to the correct numeric answer (204 in an AIME 2025 example) while the model continues reasoning for several more minutes.
  • Deep Think produced working Three.js code for a voxel “salatai” scene, indicating it can translate reasoning into executable 3D generation.
  • In an Angry Birds-style experiment, Deep Think detected missing “pygame” and iterated on the level to improve physics and scoring, but at the cost of longer waits.

Topics

  • International Mathematical Olympiad
  • Deep Think
  • Reasoning Latency
  • Gemini
  • 3D Code Generation
