
Gemini 3.1 Pro and the Downfall of Benchmarks: Welcome to the Vibe Era of AI

AI Explained · 5 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Post-training now accounts for most LLM training compute, making model behavior increasingly dependent on domain-specific tuning and internal benchmarks.

Briefing

Gemini 3.1 Pro’s release has reignited a familiar AI fight: headline benchmark scores don’t reliably predict real-world usefulness. The core reason is technical—post-training is increasingly domain-specific, so a model can look dominant on one set of tasks while lagging on another, even when both are “coding” or “reasoning.” That shift helps explain why each new hot take about “the best model” tends to contradict the last.

A key backdrop is how compute is spent during LLM development. Pre-training on internet-scale data still matters, but it’s now described as only about 20% of total training compute. The larger share goes into post-training, where models are honed using internal benchmarks and industry-sourced data aimed at particular domains. A year earlier, Anthropic CEO Dario Amodei suggested that the second-stage RL effort was small across labs; the transcript argues that this is no longer the dominant story. When labs optimize for different internal targets, benchmark results become less transferable across domains.
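To make the economics concrete, here is a minimal arithmetic sketch; the 20/80 split follows the transcript’s figure, while the per-domain weights are invented purely for illustration. Once most compute sits in post-training, how that budget is divided across domains largely determines where a model ends up strong or weak.

```python
# Illustrative only: the 20/80 split is the transcript's figure; the domain weights are made up.
total_compute = 1.0                    # normalized training budget
pretraining = 0.20 * total_compute     # internet-scale pre-training (~20% per the transcript)
post_training = total_compute - pretraining

# Hypothetical allocation of post-training compute across domains.
domain_weights = {"coding": 0.5, "math": 0.3, "agentic_tasks": 0.2}
for domain, weight in domain_weights.items():
    print(f"{domain}: {weight * post_training:.2f} of total compute")
```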

The transcript illustrates the mismatch with chess and coding-style reasoning. On ARC AGI 2, Gemini 3.1 Pro scores 77.1%, ahead of Claude Opus 4.6 at roughly 69%, a result Google DeepMind’s Demis Hassabis featured in the Gemini 3.1 Pro announcement. But ARC AGI 2 isn’t treated as a clean measure of “general intelligence.” Melanie Mitchell notes that changing the encoding from numbers to other symbols can reduce accuracy, because number-based inputs may let models latch onto unintended arithmetic patterns: shortcuts that still yield correct answers. Even within a benchmark, question setup can change what the model learns to exploit.
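A minimal sketch of the kind of robustness check Mitchell describes (the symbol mapping below is arbitrary and mine, not from her work): if accuracy falls after re-encoding, the original score may have leaned on the digits’ arithmetic structure rather than the abstract rule.

```python
# Hypothetical re-encoding of an ARC-style grid: digits are replaced with symbols
# that carry no arithmetic structure, so number-based shortcuts stop working.
SYMBOLS = {0: ".", 1: "#", 2: "@", 3: "%", 4: "&", 5: "*", 6: "+", 7: "?", 8: "~", 9: "^"}

def reencode_grid(grid: list[list[int]]) -> list[list[str]]:
    """Map each numeric cell to an arbitrary, arithmetic-free symbol."""
    return [[SYMBOLS[cell] for cell in row] for row in grid]

def to_prompt(grid) -> str:
    """Serialize a grid the way it might appear in a text prompt."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

example = [[0, 1, 2], [2, 1, 0]]
print(to_prompt(example))                 # digit encoding: number patterns are exploitable
print(to_prompt(reencode_grid(example)))  # symbol encoding: same task, no arithmetic shortcuts
```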

Coding benchmarks show similar fragility. Francois Chollet argues that agentic coding can become a black box: an agent iterates until a goal is reached, but the internal logic isn’t inspected, so models can overfit to specifications or drift from the original intent. Gemini 3.1 Pro is said to hit a record Elo on LiveCodeBench Pro, yet the transcript reports a practical counterpoint: when used inside Cursor, performance didn’t match the benchmark hype, reinforcing the theme that a model can be tuned too tightly to a particular evaluation environment.
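Chollet’s black-box point can be made concrete with a short sketch (the function names are hypothetical, not any real agent framework): the harness only verifies the outcome, so a patch that games the tests is indistinguishable from one that matches the user’s intent.

```python
from typing import Callable, Optional

def agentic_loop(propose_patch: Callable[[str], str],
                 tests_pass: Callable[[str], bool],
                 max_iters: int = 10) -> Optional[str]:
    """Iterate until the verifiable goal passes; the agent's reasoning is never inspected."""
    code = ""
    for _ in range(max_iters):
        code = propose_patch(code)  # the model revises the code; its internal logic stays opaque
        if tests_pass(code):        # the only signal the harness checks
            return code             # counts as success even if it merely satisfies the spec's letter
    return None
```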

The transcript also claims a meaningful milestone on a private “Simple Bench” of trick-question and common-sense reasoning. Gemini 3.1 Pro reportedly reaches 79.6%, near the margin of error of an average human baseline drawn from nine participants. Still, multiple-choice formats can cue shortcuts; removing the options and using an open-ended setup with a blind grader leads to a 15–20 percentage point drop. The takeaway is not that models are failing, but that benchmark design can inflate or deflate apparent capability.
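A minimal sketch of that format change (hypothetical helpers, not the actual Simple Bench harness): the options are stripped from each item and the answers are graded blind, so neither a telltale option like “zero” nor the model’s identity can cue anyone.

```python
import random

def to_open_ended(item: dict) -> str:
    """Keep only the question text; drop options that can signal the trick (e.g. a 'zero' choice)."""
    return item["question"]

def blind_grade(answers: dict[str, str], reference: str, is_correct) -> dict[str, bool]:
    """Grade model answers in shuffled order, without revealing which model wrote which."""
    model_ids = list(answers)
    random.shuffle(model_ids)  # the grader never sees answers in a fixed, labeled order
    return {model: is_correct(answers[model], reference) for model in model_ids}
```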

Finally, factual reliability remains unsettled. A Google chart cited in the transcript shows Gemini 3.1 Pro scoring higher on a combined hallucination-and-correctness metric, yet Gemini is also described as hallucinating in about 50% of its incorrect answers, worse than Claude Sonnet 4.6 at 38% and GLM at 34%. The transcript closes by arguing that the search for a single “true” general intelligence benchmark is structurally hard: labs have incentives to build benchmarks they can optimize, and even forecasting benchmarks can be gamed by open-ended agents that profit from prediction markets. The result is a “vibe era” of AI comparisons—fast-moving, contradictory, and increasingly dependent on what a benchmark is actually rewarding.
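To pin down what that hallucination figure measures, here is a minimal sketch with assumed field names: the roughly 50% rate is conditional on the answer already being wrong, which is why it can coexist with a strong combined correctness score.

```python
def hallucination_rate_on_errors(records: list[dict]) -> float:
    """Among incorrect answers, the share that were confident fabrications
    rather than refusals or hedges (field names are assumptions)."""
    incorrect = [r for r in records if not r["correct"]]
    if not incorrect:
        return 0.0
    return sum(r["hallucinated"] for r in incorrect) / len(incorrect)
```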

Cornell Notes

Gemini 3.1 Pro’s mixed benchmark story reflects a broader shift: most training compute now goes into post-training, where models are tuned against internal, domain-specific targets. That makes benchmark rankings less portable—strong performance on one evaluation (like ARC AGI 2) may not translate to other “expert task” suites (like GDPval) or to real coding workflows. Even within a single benchmark, accuracy can depend on how inputs are encoded, letting models exploit unintended shortcuts rather than robust reasoning. The transcript also highlights that “reasoning” gains don’t automatically solve hallucinations: Gemini 3.1 Pro may score well on a combined metric but still hallucinate in roughly half of its incorrect answers. Overall, benchmark design and optimization targets increasingly determine what “intelligence” looks like on paper.

Why do benchmark rankings for LLMs increasingly contradict each other?

The transcript attributes it to training economics and post-training. Pre-training on internet-scale data is described as only ~20% of training compute, while the majority is spent on post-training (RL and other tuning) against internal benchmarks and domain-relevant data. When different labs optimize for different internal targets, a model can be excellent in one domain and weaker in another, so “best overall” claims based on one benchmark set become unreliable.

What makes ARC AGI 2 a tricky measure of general reasoning?

ARC AGI 2 can be sensitive to input encoding. Melanie Mitchell is cited for pointing out that changing from numeric encodings to other symbols reduces accuracy. The transcript adds a mechanism: number-based representations can let models find unintended arithmetic patterns, producing correct answers via shortcuts. So a high score may reflect exploited structure as much as reasoning.

How can coding benchmarks mislead users about real coding agents?

The transcript leans on Francois Chollet’s view that sufficiently advanced agentic coding behaves like a black box: agents iterate until a goal is met, without necessarily revealing internal logic. That creates room for overfitting to the benchmark’s spec or drifting from the user’s original intent. It also notes a practical mismatch: Gemini 3.1 Pro’s record Elo on LiveCodeBench Pro doesn’t guarantee similar outcomes inside Cursor.

What’s the significance of the transcript’s “Simple Bench” milestone, and what caveat comes with it?

Gemini 3.1 Pro is reported to score 79.6% on a private Simple Bench of trick questions/common-sense reasoning, near a human-average baseline among nine participants. But the transcript warns that multiple-choice formats can cue shortcuts (e.g., an option like “zero” can signal a trick). When questions are converted to open-ended responses and graded by a blind grader, scores drop by 15–20 percentage points—showing benchmark format can inflate apparent capability.

Does better benchmark performance mean fewer hallucinations?

Not necessarily. The transcript cites a Google release chart where Gemini 3.1 Pro has a higher combined score on correct vs hallucinated answers, but it still hallucinates in about 50% of its incorrect answers. Claude Sonnet 4.6 is described as hallucinating in about 38% of its incorrect answers, and GLM in about 34%. The implication: optimization for “best case” performance doesn’t eliminate worst-case failure modes.

Why is a single “true” general intelligence benchmark hard to create?

The transcript argues that labs have incentives to build benchmarks they can optimize, which can bias results. It also notes that realistic reinforcement learning with verifiable rewards is difficult—small teams may not have the budget to craft benchmarks that reflect real-world performance without overestimating it. Even forecasting benchmarks can be gamed by open-ended agents that profit from prediction markets.

Review Questions

  1. Which part of LLM training is described as consuming the majority of compute, and how does that affect cross-benchmark comparisons?
  2. How does input encoding (numbers vs symbols) potentially change outcomes on ARC AGI 2?
  3. What does the transcript suggest about multiple-choice vs open-ended question formats when measuring common-sense reasoning?

Key Points

  1. Post-training now accounts for most LLM training compute, making model behavior increasingly dependent on domain-specific tuning and internal benchmarks.
  2. Benchmark results can’t be treated as universal rankings because optimization targets differ across labs and domains.
  3. ARC AGI 2 performance can be influenced by how inputs are encoded, enabling unintended shortcut patterns rather than robust reasoning.
  4. Agentic coding benchmarks may reward overfitting to a task spec or black-box goal-seeking, so real-world coding outcomes can diverge.
  5. Even strong combined metrics for factuality can hide substantial hallucination rates on incorrect answers.
  6. Multiple-choice formats can cue trick-question shortcuts; open-ended setups with blind grading can produce materially lower scores.
  7. A “single objective” general intelligence benchmark is difficult because benchmark designers have incentives and because agents can game evaluation systems, including prediction markets.

Highlights

Gemini 3.1 Pro’s strong ARC AGI 2 score (77.1%) is paired with evidence that changing input encoding can reduce accuracy—suggesting shortcut exploitation is possible.
Gemini 3.1 Pro is described as reaching 79.6% on Simple Bench, but removing multiple-choice cues and switching to open-ended grading can drop results by 15–20 points.
Hallucinations are framed as not solved: Gemini 3.1 Pro is reported to hallucinate in about 50% of incorrect answers, worse than Claude Sonnet 4.6 (38%) and GLM (34%).
The transcript’s central theme is that post-training specialization makes “best model overall” claims unstable across benchmark types and real workflows.
