
The Future of Math with o1 Reasoning with Terence Tao, Mark Chen, and James Donovan

OpenAI · 6 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.

TL;DR

o-series reasoning models are designed to reduce shallow pattern-matching errors by encouraging more reflective “system two” behavior before answering.

Briefing

The central takeaway is that progress in “reasoning math” is less about making large language models magically correct and more about rebuilding the workflow around mathematics: modular collaboration, iterative verification, and formal proof layers that can catch mistakes at scale. Terence Tao frames the opportunity as a shift from mathematicians tackling one problem at a time to teams—human and AI—working in parallel on thousands of problems, with different roles handling vision (what to study), computation, proof writing, and checking. Mark Chen adds that today’s models like GPT-4 can be “smart but stupid,” often falling into pattern-matching errors, which is why OpenAI has emphasized o-series reasoning models designed to behave more like “system two” thinkers—slower, reflective, and less reliant on shallow priors.

A key mechanism in the discussion is formal verification. Both Tao and Chen argue that for complex mathematics, a proof assistant such as Lean is not optional: one wrong step can collapse an entire argument. The proposed workflow is iterative: an AI proposes a proof in a formal language (Lean), the system compiles it, and any failure returns an error message that guides the next attempt. This approach is described as already effective for smaller, homework-scale proofs, while large, PhD-level proof generation still demands more compute and doesn’t yet scale cleanly to “ask a high-level question, get a huge proof” automation.
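To make that loop concrete, here is a minimal Lean 4 sketch (core library only; the theorem name and statement are illustrative, not from the talk). If the cited lemma were wrong or missing, the file would fail to compile and the error message would feed the next attempt:

```lean
-- A homework-scale statement: addition on the natural numbers commutes.
-- `Nat.add_comm` is a lemma in Lean 4's core library; citing a lemma that
-- doesn't exist or doesn't match the goal makes compilation fail, and the
-- resulting error text is exactly the feedback the model iterates on.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```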

Tao also emphasizes how AI changes the economics and organization of math. Mathematics currently bundles many tasks—posing questions, finding tools, reading literature, computing, checking, writing, presenting, and grant work—into a single person’s workload. AI and software tooling can decouple these tasks, enabling specialization: some contributors focus on conjecture and pattern discovery, others on formalizing theorems, others on running code and project coordination (with GitHub-style workflows), and still others on visualization and communication. He notes that large collaborations already exist in formalization projects, where verification is built in, and cites Lean’s mathlib as a major example.

The conversation repeatedly returns to what remains hard. Chen says AI struggles most when data is scarce and the next step requires strategic planning—new abstractions, taste, and reasoning in “negative space.” Tao agrees that models are weak at deciding what to ask and what abstractions to build, even if they can help with pattern recognition, conjectures, verification, and counterexample search. Both also warn that trust and oversight become central as model-generated insights grow in volume; scalable oversight is treated as an open problem, with math singled out as a domain where formal verification offers a rare path to automated trust.

Finally, the panel connects accelerated math to broader science and society. Faster foundational progress could expand citizen participation in math through interactive modeling and visualization, and it may make math “optional” for some scientific workflows by letting AI handle the calculations behind the scenes. Yet the speakers stress that human expertise still matters for supervision, interpretation, and steering—especially when models produce plausible but potentially wrong reasoning. The event ends with a practical message: the near-term value is compounding—accelerating parts of research, improving verification, and enabling new collaboration structures—rather than replacing mathematicians outright.

Cornell Notes

The discussion argues that accelerating mathematics with AI depends on changing the research workflow, not just increasing model size. o-series reasoning models are designed to reflect more before answering, addressing failure modes where faster pattern-matching leads to simple mistakes. For complex proofs, formal proof assistants like Lean provide an indispensable verification layer: AI can propose steps, the proof compiles or fails, and error feedback guides iteration. This modular, team-based approach can scale collaboration by splitting tasks such as computation, formalization, and project coordination. The payoff matters because it enables parallel progress on many problems while keeping trust high through machine-checkable correctness.

Why do the speakers treat formal proof assistants (e.g., Lean) as a “necessary intermediary layer” between AI outputs and trustworthy mathematics?

Math proofs are brittle: if a proof has 100 steps and even one is wrong, the entire argument can fail. Because AI systems can produce mistakes, the speakers describe a workflow where the model outputs a proof in a formal language (Lean). If the proof compiles, correctness is mechanically verified; if it fails, the system returns an error message and the model updates its attempt. They note this iterative compile-and-fix approach can handle smaller proofs (like undergraduate-level assignments), but generating very large proofs directly from a high-level prompt still doesn’t scale cleanly yet.
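A minimal Python sketch of that compile-and-fix loop, assuming a `propose` callable that wraps the model and a `lean` executable on the PATH (both are assumptions for illustration; the panel describes the workflow, not an implementation):

```python
import pathlib
import subprocess
import tempfile

def check_with_lean(candidate: str) -> tuple[bool, str]:
    """Compile a candidate proof with the Lean toolchain; return (ok, message)."""
    path = pathlib.Path(tempfile.mkdtemp()) / "Candidate.lean"
    path.write_text(candidate)
    result = subprocess.run(["lean", str(path)], capture_output=True, text=True)
    return result.returncode == 0, result.stderr + result.stdout

def prove_iteratively(statement: str, propose, max_attempts: int = 5):
    """Compile-and-fix loop: propose, check, feed the error back, repeat."""
    feedback = ""
    for _ in range(max_attempts):
        candidate = propose(statement, feedback)  # model call (assumed interface)
        ok, message = check_with_lean(candidate)
        if ok:
            return candidate  # mechanically verified by the Lean kernel
        feedback = message    # the error message guides the next attempt
    return None               # out of attempts; escalate to a human
```

The design point is that trust lives entirely in `check_with_lean`: the model can be arbitrarily unreliable, because only a proof the Lean kernel accepts is ever returned.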

What does “reworking mathematics from the ground up” mean in practice—how does AI change collaboration?

Tao argues that modern math work bundles many distinct skills into one person’s workload: posing the right question, selecting tools, learning literature, trying arguments, doing computations, checking correctness, writing up, presenting, and grant work. AI and software tooling can decouple these tasks so different contributors can specialize. Examples include AI doing computations, humans focusing on vision and abstraction choices, and formal proof assistance handling verification. He also points to formalization ecosystems (like Lean’s mathlib) where contributions are verified in a shared system, enabling large collaborations.
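For a flavor of what a verified contribution looks like, here is a hypothetical mathlib-style lemma (`sq_nonneg` is a real mathlib lemma; the wrapper theorem is invented for illustration). Once it compiles against the shared library, collaborators can build on it without re-checking the proof by hand:

```lean
import Mathlib

-- Hypothetical contribution: squares of real numbers are nonnegative,
-- restated for a project. The proof defers to mathlib's `sq_nonneg`;
-- Lean checks the citation mechanically, so reviewers can focus on
-- whether the statement is the right one, not whether the proof holds.
theorem project_sq_nonneg (x : ℝ) : 0 ≤ x ^ 2 :=
  sq_nonneg x
```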

What are the main limitations of current reasoning models highlighted by Mark Chen?

Chen contrasts GPT-4’s strengths with its weaknesses: it can be “smart” yet still make errors on simple puzzles and rely too heavily on prior expectations about how a solution should look. This motivates o-series models intended to behave more like “system two” thinkers—slower, more reflective, and less prone to shallow pattern-matching. Even with better reasoning, Chen emphasizes that strategic planning in data-scarce settings (where there’s no large training signal for the next step) remains difficult, especially for tasks requiring taste and abstraction selection.

How do the speakers reconcile the fear that AI will reduce human intuition or “number sense”?

Tao uses a chess analogy: even after chess became effectively “solved,” people still play, but they practice differently—experimenting with moves and consulting engines for evaluation. That changes the kind of intuition people develop. He also compares calculators: they can reduce manual computation skills, but they can create a different kind of intuition through interaction. The speakers expect a shift in beauty standards and abstraction layers rather than a total loss of human mathematical creativity.

What does the panel say about scaling from Olympiad-level performance to PhD-level math?

The speakers suggest the gap depends on whether humans supervise. With human assistance, models can already help with many tasks in a math project. Without supervision, the missing piece is often strategic planning when there’s little or no data to guide the next move. Chen adds that AI excels when it can generate lots of similar training-like instances, but struggles when problems are genuinely research-level and only a small number of experts have worked on them. Tao and Chen both imply more breakthroughs are needed for autonomous PhD-level reasoning in data-scarce environments.

Why is “scalable oversight” treated as a central challenge as AI-generated insights grow?

Chen frames oversight as a general problem: when a model spends lots of time thinking and produces a fundamental insight, it’s crucial to know whether it made a mistake. In most domains, verifying correctness at scale is hard. Math is singled out as a place where formal verification can be automated, offering a path to trust that other sciences still lack. The speakers connect this to the broader OpenAI concern about vetting and reliability as capabilities expand.

Review Questions

  1. What workflow changes are needed so AI can contribute to complex proofs without sacrificing correctness?
  2. Which specific failure modes (e.g., pattern-matching, reliance on priors, data scarcity) limit current reasoning models, and how do o-series models address them?
  3. How do the speakers use analogies like chess and calculators to argue that human skills may shift rather than erode?

Key Points

  1. o-series reasoning models are designed to reduce shallow pattern-matching errors by encouraging more reflective “system two” behavior before answering.

  2. Complex proofs require machine-checkable verification; iterative proof attempts in Lean can be validated by compilation and corrected using error feedback.

  3. AI can decouple math work into modular roles—question selection, computation, formalization, and project coordination—enabling parallel progress across many problems.

  4. AI is strongest when tasks generate lots of training-like data; it struggles more in data-scarce research settings that require strategic planning and abstraction “taste.”

  5. Collaboration scales better in formalization ecosystems (e.g., Lean’s mathlib) because contributions can be verified in a shared formal language.

  6. Trust and oversight become harder as model-generated insights increase; math is a rare domain where formal verification can provide scalable confidence.

  7. Accelerated foundational math could broaden participation and speed scientific workflows, but human supervision and expertise remain important for steering and interpretation.

Highlights

  • The speakers argue that AI’s path to trustworthy math runs through formal proof assistants: propose in Lean, compile, and iterate based on errors.
  • Tao describes a shift from one-person “checklist” math work to modular collaboration where different contributors handle different stages of the research pipeline.
  • Chen emphasizes that reasoning models still struggle most when data is scarce and the next step requires strategy, abstraction, and taste rather than pattern completion.
  • Formal verification is presented as the key scalable oversight mechanism—something other sciences lack today.

Topics

  • Reasoning Models
  • Formal Proofs
  • Lean Formalization
  • AI Collaboration
  • Scalable Oversight
