The AI that solved IMO Geometry Problems

TL;DR

AlphaGeometry’s 25/30 IMO geometry score depends on combining a deductive rule system with an auxiliary-construction generator, not on a language model alone.

Briefing

Google DeepMind’s AlphaGeometry hit a striking benchmark on International Mathematical Olympiad (IMO) geometry problems: it solved 25 of 30, surpassing silver-medalist level. The headline result matters because it shows an AI system can handle a domain where success typically depends on both correct reasoning and the right “idea” to add to a diagram, something that has long resisted brute-force computation.

But the more revealing story is how much progress came before the AI component ever generated an auxiliary construction. A logic-first system built from a hardcoded deductive database (DD) of geometry rules already solved 7 of 30 IMO problems. When the system was augmented with algebraic reasoning (AR)—the ability to solve the linear equations that arise from geometric constraints—the score rose to 14 of 30. Adding human-coded heuristics pushed performance further to 18 of 30, nearly reaching a bronze-medalist level. In other words, a substantial fraction of IMO geometry success can be achieved by chaining known geometric facts and then solving the resulting equations, without any “creative” search.
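The DD stage described above amounts to forward chaining: keep applying rules to the known facts until nothing new appears. The following is a minimal Python sketch of that idea, not DeepMind's implementation. The function name `forward_chain`, the fact strings, and the two rules are illustrative assumptions, and real DD rules are stated over variables rather than the fixed instances used here.

```python
def forward_chain(facts, rules):
    """Apply (premises, conclusion) rules until no new fact can be added."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known


# Hypothetical, hand-instantiated rules (assumption: plain strings stand in for facts).
RULES = [
    # Base angles of an isosceles triangle are equal.
    (("AB = AC",), "angle ABC = angle ACB"),
    # One instance of transitivity of angle equality.
    (("angle ABC = angle ACB", "angle ACB = angle DEF"), "angle ABC = angle DEF"),
]

derived = forward_chain({"AB = AC", "angle ACB = angle DEF"}, RULES)
print("angle ABC = angle DEF" in derived)
```

A chain like this only ever reaches facts about objects already in the diagram, which is why it cannot invent a new line or point on its own.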

The remaining gap exposed a core weakness: DD plus AR struggles with auxiliary constructions. Many hard geometry solutions require drawing extra lines or shapes not present in the original diagram. That step is the bottleneck because it creates an effectively infinite search space: a machine must guess which new elements to introduce before the logic and algebra can finish the job. Thales’s theorem (an angle inscribed in a semicircle is a right angle) illustrates the contrast: once the right construction is made (connecting the point on the circle to the circle’s center), the rest follows from isosceles triangles and angle-sum equations. The construction is the hard part; the deduction is comparatively straightforward.
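As a worked version of that contrast, here is the standard derivation once the auxiliary segment is drawn. The labels are assumptions chosen for illustration: $AC$ is a diameter with center $O$, $B$ is any other point on the circle, and $OB$ is the added segment, so $OA = OB = OC$ are all radii.

```latex
\begin{align*}
OA = OB &\;\Rightarrow\; \angle OAB = \angle OBA = \alpha
  && \text{(isosceles triangle } OAB\text{)} \\
OB = OC &\;\Rightarrow\; \angle OBC = \angle OCB = \beta
  && \text{(isosceles triangle } OBC\text{)} \\
\angle BAC + \angle ABC + \angle BCA &= \alpha + (\alpha + \beta) + \beta = 180^\circ
  && \text{(angle sum in } ABC\text{)} \\
\angle ABC &= \alpha + \beta = 90^\circ
\end{align*}
```

Every line after the construction is either a rule lookup or a linear equation. Only the choice to draw $OB$ required an idea.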

AlphaGeometry addresses this by adding a language model whose sole job is to propose auxiliary constructions. The system iterates: the model reads the problem statement and the partial proof so far, outputs a candidate construction in a specialized geometry coding language, and then DD plus AR attempts to complete the proof. If the attempt fails, the extended proof state becomes new context for the language model, which proposes another construction, and the cycle repeats until the proof is found or time runs out. In effect, the language model plays the “creative brain” that proposes diagram changes, while DD plus AR acts as the “logical brain” that derives and checks the consequences.
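The loop above can be sketched in a few lines of Python. This is a hedged illustration of the control flow only. The names `propose_and_verify`, `Proposer`, and `Prover` are hypothetical, a round limit stands in for the time budget, and the scripted "model" and one-line "prover" are toy stand-ins so the sketch runs.

```python
from typing import Callable, List, Optional

# Hypothetical signatures; neither comes from AlphaGeometry's code.
Proposer = Callable[[str, List[str]], str]   # (problem, constructions so far) -> new construction
Prover = Callable[[str, List[str]], bool]    # (problem, constructions) -> was a proof found?


def propose_and_verify(problem: str, propose: Proposer, prove: Prover,
                       max_rounds: int = 4) -> Optional[List[str]]:
    """Return the constructions that led to a proof, or None if the budget runs out."""
    constructions: List[str] = []
    for _ in range(max_rounds):  # stand-in for "until time runs out"
        constructions.append(propose(problem, constructions))
        if prove(problem, constructions):
            return constructions
    return None


# Toy stand-ins: a scripted proposer and a prover that needs one specific segment.
SCRIPTED = ["midpoint M of AB", "segment OB"]


def toy_propose(problem: str, so_far: List[str]) -> str:
    return SCRIPTED[len(so_far) % len(SCRIPTED)]


def toy_prove(problem: str, constructions: List[str]) -> bool:
    return "segment OB" in constructions


result = propose_and_verify("angle ABC = 90 degrees", toy_propose, toy_prove)
print(result)
```

Keeping the proposer and the prover behind separate interfaces mirrors the division of labor in the text: the proposer may guess wrong, because only the prover decides when a proof is complete.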

Training data posed another challenge: there are not enough solved IMO geometry problems available publicly to learn this behavior directly. DeepMind generated synthetic training examples instead. It randomly plotted points and lines, used DD plus AR to deduce what it could, then erased parts of the diagram to turn the missing elements into auxiliary constructions the model would need to recreate. This process produced hundreds of millions of synthetic proof examples, with millions requiring at least one auxiliary construction, including a longest synthetic proof of 247 steps with two auxiliary constructions.
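One way to picture the erase-and-recreate step is the toy Python sketch below. It is an assumption-laden illustration, not DeepMind's pipeline: facts are plain equalities between named objects, the only rule is transitivity, and an object counts as "auxiliary" when the proof of a goal uses it but the goal never mentions it. All function names are hypothetical.

```python
import random
from itertools import combinations


def random_premises(names, k, seed=0):
    """Toy stand-in for random plotting: pick k random equalities between objects."""
    rng = random.Random(seed)
    return [frozenset(pair) for pair in rng.sample(list(combinations(names, 2)), k)]


def closure_with_support(premises):
    """Forward-chain transitivity (x=m and m=y give x=y), recording premises used.

    Each fact keeps the premises of the first derivation found, which is not
    guaranteed to be the shortest one.
    """
    support = {p: frozenset([p]) for p in premises}
    changed = True
    while changed:
        changed = False
        for f, g in combinations(list(support), 2):
            if len(f & g) == 1:      # the two equalities share exactly one object
                new = f ^ g          # equality between the two remaining objects
                if new not in support:
                    support[new] = support[f] | support[g]
                    changed = True
    return support


def make_training_example(premises, goal):
    """Label objects the proof needs but the goal never mentions as auxiliary."""
    used = closure_with_support(premises)[goal]   # KeyError if goal is not derivable
    objects_in_proof = set().union(*used)
    return {
        "goal": sorted(goal),
        "auxiliary": sorted(objects_in_proof - goal),
        "proof_premises": sorted(sorted(p) for p in used),
    }


def eq(x, y):
    return frozenset((x, y))


# Fixed premises keep the example deterministic; random_premises(...) could replace them.
premises = [eq("a", "m"), eq("m", "b"), eq("c", "d")]
example = make_training_example(premises, eq("a", "b"))
print(example)
```

Here the object `m` is what a model would be trained to reintroduce: the goal `a = b` never mentions it, yet the proof cannot proceed without it.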

With this setup, the full system—DD plus AR combined with the auxiliary-construction language model and fine-tuning—reached 25 out of 30. The broader significance is less about geometry trivia and more about a general recipe: pair structured reasoning with learned creativity, then train the creativity using synthetic data generated from a verifier. That blend could inform how machines tackle other reasoning-heavy fields where the key step is often inventing the right intermediate idea.

Cornell Notes

AlphaGeometry’s headline performance, 25/30 IMO geometry problems, rests on a layered system rather than raw AI guessing. A deductive database (DD) of geometry rules solved 7/30, and adding algebraic reasoning (AR) that can solve linear equations raised that to 14/30. Human-coded heuristics pushed DD+AR to 18/30, but the system then hit a wall: hard problems often require auxiliary constructions (new lines or shapes) that aren’t in the starting diagram. A language model was trained to propose those constructions, while DD+AR verified and completed proofs. Because real solved examples are scarce, DeepMind generated hundreds of millions of synthetic proof cases by randomly drawing diagrams, deducing consequences with DD+AR, and erasing parts to create “missing” constructions for training.

Why did DD plus AR plateau at 18/30 even though it can chain many known facts and solve equations?

DD plus AR can deduce consequences from a fixed library of geometry rules and then solve the resulting linear equations. That works well when the needed intermediate steps are already implied by the given diagram. The plateau happens because many IMO geometry solutions require auxiliary constructions—extra lines or shapes that must be added before the deduction-and-equation machinery can take over. Without a way to reliably guess those additions, the search space becomes effectively unbounded at each step.

What exactly is the role of AR, and how does it relate to the kind of math geometry produces?

AR handles the algebraic portion of geometry proofs by solving systems of linear equations. The transcript’s example with a semicircle uses angle relationships derived from equal radii and triangle angle sums; once those relationships become equations, linear algebra can finish the computation. DeepMind implemented this capability explicitly as “algebraic reasoning,” complementing DD’s rule-based deductions.
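To make the "equations, then linear algebra" step concrete, the sketch below checks whether a target equation follows from known linear equations, using exact Gaussian elimination. The source describes AR only as solving linear equations, so the rank test, the variable layout, and the names `rank` and `implied` are assumptions made for illustration. The unknowns are the two base angles `alpha` and `beta` from the semicircle example and the inscribed angle `x`.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix via Gauss-Jordan elimination with exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                factor = m[i][col] / m[r][col]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def implied(known, target):
    """True if `target` is a linear combination of the `known` equations."""
    return rank(known + [target]) == rank(known)


# Each row is [coef_alpha, coef_beta, coef_x, right_hand_side], angles in degrees.
KNOWN = [
    [-1, -1, 1, 0],    # x - alpha - beta = 0   (x is made of the two base angles)
    [2, 2, 0, 180],    # 2*alpha + 2*beta = 180 (angle sum of the big triangle)
]

print(implied(KNOWN, [0, 0, 1, 90]))   # is x = 90 a consequence?
print(implied(KNOWN, [0, 0, 1, 60]))   # is x = 60 a consequence?
```

The system never pins down `alpha` or `beta` individually, yet `x = 90` still follows. This is the kind of conclusion a rule lookup alone would miss and a linear solver finds directly.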

How does the system decide what auxiliary construction to try next?

A language model proposes an auxiliary construction given (1) the problem statement and (2) the proof steps produced so far. The construction is output in a specialized geometry coding language (not plain English). After the model proposes a construction, DD plus AR attempts to complete the proof; if it fails, the new partial proof state is fed back into the language model for another construction proposal.

Why was synthetic data generation necessary for training?

There aren’t enough publicly available solved IMO geometry problems to train a model that can generate the right auxiliary constructions. DeepMind instead generated synthetic proof examples: it randomly plotted points and lines, used DD plus AR to deduce what it could, then erased parts of the diagram to create new problems whose solutions require reconstructing the erased elements. This produced hundreds of millions of synthetic proofs, including millions that required at least one auxiliary construction.

What performance numbers show the impact of each module?

The transcript gives a clear progression: DD alone solved 7/30. Adding AR increased that to 14/30. Adding human-coded heuristics pushed it to 18/30. Finally, combining DD+AR with the auxiliary-construction language model (with fine-tuning) reached 25/30.

Review Questions

Which component is responsible for handling auxiliary constructions, and why is that component necessary even when DD and AR are strong?
Explain the iterative loop used by AlphaGeometry after the language model proposes an auxiliary construction.
How does synthetic training data generation work, and what does erasing parts of a deduced diagram accomplish for learning?

Key Points

1. AlphaGeometry’s 25/30 IMO geometry score depends on combining a deductive rule system with an auxiliary-construction generator, not on a language model alone.
2. A deductive database (DD) of geometry rules solved 7/30 problems; adding algebraic reasoning (AR) that solves linear equations raised performance to 14/30.
3. Human-coded heuristics improved DD+AR to 18/30, but the system struggled with auxiliary constructions that introduce new diagram elements.
4. Auxiliary constructions create an effectively infinite search space, which is why a learned proposal mechanism is crucial for hard cases.
5. A language model proposes auxiliary constructions in a specialized geometry coding language, while DD+AR verifies and completes proofs.
6. DeepMind trained the construction model using synthetic data generated by random diagram plotting, DD+AR deduction, and erasing parts to create new “missing construction” targets.

Highlights

DD plus AR already solved 14/30 IMO geometry problems—proof that much of the reasoning can be automated without “creative” search.

The decisive bottleneck wasn’t deduction or algebra; it was inventing auxiliary constructions that aren’t present in the original diagram.

AlphaGeometry uses an iterative loop: propose a construction with a language model, then let DD+AR attempt a proof, repeating until success or time runs out.

Synthetic training data—generated by deducing from random diagrams and erasing parts—enabled learning despite limited real IMO solutions.

The full system’s jump from 18/30 (with heuristics) to 25/30 came from adding learned construction generation on top of logical verification.

Topics

IMO Geometry
AlphaGeometry
Deductive Database
Algebraic Reasoning
Auxiliary Constructions

Mentioned

DeepMind
AlphaGeometry
IMO
DD
AR

The AI that solved IMO Geometry Problems | Guest video by @Aleph0