
Alpha Everywhere: AlphaGeometry, AlphaCodium and the Future of LLMs

AI Explained · 5 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

AlphaGeometry achieves near–IMO gold-medalist performance on geometry by combining a language model’s construct invention with a symbolic solver’s mechanical deduction.

Briefing

AlphaGeometry’s standout result is a near–International Mathematical Olympiad gold-medal performance on geometry problems using a neurosymbolic loop that lets a language model invent “constructs” and a symbolic engine carry out the deductions. The system targets a narrow slice of IMO-style difficulty—30 geometry questions rather than the full contest—and still lands at a level comparable to top human gold medalists, while the authors stress the work should not be treated as a straight line to AGI.

The mechanism matters more than the score. AlphaGeometry combines (1) a neural language model trained on synthetic proof data to propose the missing geometric steps—described as “pulling rabbits out of the hat”—and (2) a symbolic solver that mechanically applies known rules once a useful construct is introduced. In a typical proof, the symbolic component can’t easily invent the key move (like dropping a perpendicular to a midpoint), so the language model is tuned to suggest such constructs in the cases where brute-force deduction alone fails. If the symbolic engine stalls, the system loops back: it asks the language model for additional constructs until the proof is found. The training mixture includes many straightforward proofs solvable by deduction alone, plus a smaller set where constructs are essential; one cited example uses two constructs and a 247-step deduction.
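The loop described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: `propose_constructs` and `symbolic_deduce` are placeholder names standing in for the language model and the symbolic engine.

```python
# Hypothetical sketch of AlphaGeometry's propose-then-deduce loop.
# The function names and data shapes are illustrative assumptions.

def propose_constructs(problem, known_facts, k=8):
    """Placeholder for the language model: suggest k candidate auxiliary
    constructions (e.g. 'drop a perpendicular to the midpoint')."""
    return [f"construct_{i}" for i in range(k)]

def symbolic_deduce(problem, known_facts):
    """Placeholder for the symbolic solver: mechanically apply known rules
    and return a proof if the goal is reached, else None."""
    return None  # stub: a real solver would search the deduction closure

def solve(problem, max_rounds=10):
    facts = set(problem["premises"])
    for _ in range(max_rounds):
        proof = symbolic_deduce(problem, facts)
        if proof is not None:
            return proof  # deduction alone sufficed
        # Deduction stalled: ask the model for new constructs and retry.
        for construct in propose_constructs(problem, facts):
            facts.add(construct)
    return None  # search budget exhausted
```

The key design point the transcript emphasizes is the division of labor: the model only fills the gap that brute-force deduction cannot cross, and the solver retains full responsibility for verified reasoning.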

The paper’s caveats are also central. The solutions are not optimized for human aesthetics and can look “like trash,” reflecting a search-and-solve strategy rather than theorem discovery shaped by symmetry or elegance. The approach is also positioned as an extension of a broader idea: hard problems often require inventing intermediate “moves,” and language models can help generate those moves while symbolic engines handle the reliable reasoning.

Compute and search are treated as levers. AlphaGeometry uses NVIDIA V100 GPUs, and the system’s success depends on how many candidate constructs it considers per step. The team reports that using less than 2% of the search budget—sampling eight constructs instead of 512—still solves 21 problems, placing performance below silver-medalist level but far above earlier state of the art. The transcript highlights the practical implication: newer GPU generations (A100, H100, and beyond) could expand the explored search space and improve results further, potentially making geometry largely “solved” in the near term—though that prediction is framed as speculative.

The discussion then broadens to AlphaCodium, an open-source rival to AlphaCode that aims to beat AlphaCode 2 without fine-tuning by using code execution as feedback. The workflow resembles an LLM–environment conversation: generate candidate code, run it against unit tests, and iterate when tests fail. The transcript links this to a broader shift away from forcing immediate answers toward delayed, test-driven exploration, an approach also seen in other LLM planning and reflection efforts.

Overall, AlphaGeometry and AlphaCodium are presented as evidence that language models are increasingly useful for generating candidate steps—constructs in math, code in programming—that search and verification systems can then validate. They are not claimed to be AGI, but they reinforce a fast-moving trend: coupling LLMs with search, brute force, and external checks to turn “idea generation” into reliable outputs.

Cornell Notes

AlphaGeometry pairs a language model with a symbolic geometry solver in a loop. The language model proposes the crucial intermediate constructions (“rabbits out of the hat”) when deduction alone can’t proceed, and the symbolic engine then executes the mechanical proof steps. If the solver fails, the system requests more constructs and repeats until it finds a valid proof. On 30 IMO geometry problems, it reaches performance close to the average IMO gold medalist, though the resulting proofs may look non-aesthetic and “trash-like.” The broader significance is the growing alliance between LLM-driven idea generation and verification-heavy search systems, echoed by AlphaCodium’s test-driven code iteration.

What makes AlphaGeometry different from using brute-force deduction alone?

Brute-force symbolic deduction can handle many geometry proofs once the right intermediate objects are introduced, but it struggles with the key invention step. AlphaGeometry uses a language model trained on synthetic proof data to propose those missing constructions—such as dropping a perpendicular to a midpoint—when deduction stalls. After a construct is proposed, the symbolic engine mechanically derives the remaining steps. If the symbolic engine can’t finish, the system loops back to request additional constructs.

How is the training data structured, and why does that matter?

The language model is trained on purely synthetic data designed to teach it to produce proof-supporting constructs. The transcript notes that 91 million samples are solvable by brute-force step-by-step deduction using known rules, while 9 million cases require special constructs. The model is fine-tuned particularly on the construct-needed examples, including at least one cited case involving two constructs and a 247-step deduction.

What performance claim is made, and what limitation is emphasized?

AlphaGeometry scores almost as highly as the average IMO gold medalist, but only for a subset: geometry problems. It does not run the full IMO; instead it tackles 30 geometry IMO questions, excluding algebra and number theory. The transcript also emphasizes that the authors warn against overhyping the result as AGI.

Why do the proofs sometimes look worse than human solutions?

The system is not biased toward aesthetic criteria like symmetry. The transcript describes the authors admitting solutions tend to be non-symmetrical and can look “like trash,” even though they work. This reflects a search-and-verify approach focused on correctness rather than human-style elegance.

How does AlphaCodium extend the same theme outside math proofs?

AlphaCodium uses unit tests as an external environment for feedback. Instead of asking an LLM to output a final answer immediately, it generates candidate code, runs it on tests, and iterates when tests fail—turning correctness into a measurable signal. The transcript frames this as a shift toward LLM–environment interaction where the model improves through verification rather than one-shot generation.
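The generate-test-iterate cycle can be sketched as follows. This is a minimal illustration of the idea, not AlphaCodium's actual flow (which involves richer prompt engineering); `generate_candidate` is a placeholder for the LLM call.

```python
# Minimal sketch of test-driven code iteration in the AlphaCodium spirit.
# `generate_candidate` stands in for the LLM; all names are assumptions.

def generate_candidate(spec, feedback=None):
    """Placeholder for the LLM: return source code for `spec`,
    optionally conditioned on failure feedback from earlier rounds."""
    return "def add(a, b):\n    return a + b\n"

def run_unit_tests(code, tests):
    """Execute the candidate and return (passed, failure_message)."""
    namespace = {}
    try:
        exec(code, namespace)
        for inputs, expected in tests:
            got = namespace["add"](*inputs)
            if got != expected:
                return False, f"add{inputs} returned {got}, expected {expected}"
    except Exception as exc:
        return False, repr(exc)
    return True, ""

def solve(spec, tests, max_iters=5):
    feedback = None
    for _ in range(max_iters):
        code = generate_candidate(spec, feedback)
        ok, feedback = run_unit_tests(code, tests)
        if ok:
            return code  # tests pass: accept this candidate
    return None  # give up after max_iters rounds

solution = solve("add two numbers", tests=[((1, 2), 3), ((0, 0), 0)])
```

The test results play the same role that the symbolic solver plays for AlphaGeometry: an external check that converts the model's guesses into a verifiable signal.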

What role do compute and search budget play in AlphaGeometry’s results?

Success depends on how many candidate constructs are explored at each step. The transcript notes the system can use less than 2% of the search budget—sampling eight constructs instead of 512 during test time—yet still solves 21 problems. It also highlights that AlphaGeometry used NVIDIA V100 GPUs, implying that scaling to newer hardware could expand the search space and improve outcomes.
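The budget figures quoted above are consistent with each other, as a quick check shows (the numbers come from the summary; the exact accounting in the paper may differ):

```python
# Illustrative arithmetic for the reported search-budget ablation.
full_budget = 512   # candidate constructs sampled per step at full scale
small_budget = 8    # reduced sampling reported in the ablation

fraction = small_budget / full_budget
print(f"{fraction:.2%} of the full search budget")  # 1.56%, i.e. under 2%
```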

Review Questions

  1. How does AlphaGeometry decide when to ask the language model for a new construct, and what happens after the symbolic solver fails?
  2. Why might a system that prioritizes correctness over aesthetics produce proofs that look non-symmetrical or “trash-like”?
  3. In AlphaCodium’s workflow, how do unit tests function as the feedback mechanism that drives iteration?

Key Points

  1. AlphaGeometry achieves near–IMO gold-medalist performance on geometry by combining a language model’s construct invention with a symbolic solver’s mechanical deduction.

  2. The system uses a loop: propose constructs, attempt symbolic proof, and request more constructs if deduction stalls.

  3. Performance is reported for 30 geometry IMO questions rather than the full IMO, and the authors caution against treating the result as AGI.

  4. AlphaGeometry’s proofs may lack human aesthetic properties because the method optimizes for correctness through search and verification, not elegance.

  5. Compute and search budget strongly influence outcomes; exploring more candidate constructs per step can raise success rates.

  6. AlphaCodium applies the same LLM-plus-verification idea to programming by iterating on code using unit test results as feedback.

  7. A broader trend emerges: LLMs increasingly serve as idea generators, while external search and execution systems provide reliability.

Highlights

AlphaGeometry’s key move is outsourcing “missing step” invention to a language model, then letting a symbolic engine finish the proof mechanically.
When the symbolic solver can’t complete a proof, the system loops back to generate additional constructs until it succeeds.
AlphaCodium’s test-driven iteration reframes LLM use as an interactive process with an environment, not a one-shot answer generator.
The reported performance depends on search budget and compute, with the transcript pointing to newer GPU generations as a likely accelerator.

Topics

  • AlphaGeometry
  • Neurosymbolic Reasoning
  • IMO Geometry
  • AlphaCodium
  • Test-Driven Code Generation
