Gemini 3.1 Pro and the Downfall of Benchmarks: Welcome to the Vibe Era of AI
Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Post-training now accounts for most LLM training compute, making model behavior increasingly dependent on domain-specific tuning and internal benchmarks.
Briefing
Gemini 3.1 Pro’s release has reignited a familiar AI fight: headline benchmark scores don’t reliably predict real-world usefulness. The core reason is technical—post-training is increasingly domain-specific, so a model can look dominant on one set of tasks while lagging on another, even when both are “coding” or “reasoning.” That shift helps explain why each new hot take about “the best model” tends to contradict the last.
A key backdrop is how compute is spent during LLM development. Pre-training on internet-scale data still matters, but it’s now described as only about 20% of total training compute. The larger share goes into post-training, where models are honed using internal benchmarks and industry-sourced data aimed at particular domains. A year earlier, Anthropic CEO Dario Amodei suggested that the second-stage RL effort was small across labs; the transcript argues that this is no longer the dominant story. When labs optimize for different internal targets, benchmark results become less transferable across domains.
The transcript illustrates the mismatch with chess and coding-style reasoning. On ARC AGI 2, Gemini 3.1 Pro scores 77.1%, ahead of Claude Opus 4.6 at roughly 69%, a result Google DeepMind’s Demis Hassabis highlighted in the Gemini 3.1 Pro announcement. But ARC AGI 2 isn’t treated as a clean measure of “general intelligence.” Melanie Mitchell notes that changing the encoding from numbers to other symbols can reduce accuracy, because number-based inputs may let models latch onto unintended arithmetic patterns, shortcuts that still yield correct answers. Even within a single benchmark, how a question is encoded can change what the model learns to exploit.
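The encoding effect Mitchell describes can be sketched as a simple input perturbation; the grid, the symbol alphabet, and the `encode_grid` helper below are illustrative stand-ins, not ARC AGI 2’s actual task format:

```python
# Re-encode an ARC-style grid of color indices before prompting, so a
# model cannot exploit accidental arithmetic structure in the digits.
# The grid and symbol alphabet are illustrative, not from ARC AGI 2 itself.
SYMBOLS = ["#", "@", "%", "&", "*", "+", "=", "?", "~", "^"]  # one per color 0-9

def encode_grid(grid, mapping=None):
    """Render a grid as text, either as raw digits or via a symbol mapping."""
    return "\n".join(
        " ".join(mapping[c] if mapping else str(c) for c in row) for row in grid
    )

grid = [[0, 1, 2],
        [2, 1, 0]]

numeric = encode_grid(grid)                             # "0 1 2\n2 1 0"
symbolic = encode_grid(grid, dict(enumerate(SYMBOLS)))  # "# @ %\n% @ #"

# Same task, two surface encodings: if accuracy drops under the symbolic
# form, the model was likely leaning on number-specific shortcuts.
print(numeric)
print(symbolic)
```

Running the same task suite under both encodings isolates how much of a score comes from the puzzle itself versus the digits it happens to be written in.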
Coding benchmarks show similar fragility. Francois Chollet argues that agentic coding can become a black box: an agent iterates until a goal is reached, but its internal logic isn’t inspected, so models can overfit to specifications or drift from the original intent. Gemini 3.1 Pro is said to hit a record Elo on LiveCodeBench Pro, yet the transcript reports a practical counterpoint: when used inside Cursor, performance didn’t match the benchmark hype, reinforcing the theme that optimization can be tuned too narrowly to a particular evaluation environment.
The transcript also claims a meaningful milestone on a private “Simple Bench” of trick-question and common-sense reasoning. Gemini 3.1 Pro reportedly reaches 79.6%, within the margin of error of an average human baseline drawn from nine participants. Still, multiple-choice formats can cue shortcuts: removing the options and using an open-ended setup with a blind grader leads to a 15–20 percentage point drop. The takeaway is not that models are failing, but that benchmark design can inflate or deflate apparent capability.
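The format effect described above can be sketched with toy stand-ins; the model, grader, and question here are hypothetical, not Simple Bench internals:

```python
def score_multiple_choice(model, questions):
    # Options are shown; scoring is an exact letter match against the key.
    hits = sum(model(q["prompt"], q["options"]) == q["key"] for q in questions)
    return 100.0 * hits / len(questions)

def score_open_ended(model, grader, questions):
    # Options are withheld; a blind grader judges the free-form answer
    # against a reference, without knowing which model produced it.
    hits = sum(grader(model(q["prompt"], None), q["reference"]) for q in questions)
    return 100.0 * hits / len(questions)

def toy_model(prompt, options):
    # Toy model that only "knows" the answer when the options cue it.
    return "B" if options else "a guess"

def toy_grader(answer, reference):
    return answer.strip().lower() == reference.lower()

questions = [{"prompt": "Which object hits the ground last?",
              "options": ["A", "B"], "key": "B", "reference": "the feather"}]

print(score_multiple_choice(toy_model, questions))         # 100.0
print(score_open_ended(toy_model, toy_grader, questions))  # 0.0
```

The point of the sketch is that the same model can score perfectly when answer options cue it and fail entirely when they are withheld, which is the mechanism behind the reported 15–20 point drop.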
Finally, factual reliability remains unsettled. A Google chart cited in the transcript shows Gemini 3.1 Pro scoring higher on a combined hallucination-and-correctness metric, yet Gemini is also described as hallucinating in about 50% of its incorrect answers, worse than Claude Sonnet 4.6 at 38% and GLM at 34%. The transcript closes by arguing that the search for a single “true” general intelligence benchmark is structurally hard: labs have incentives to build benchmarks they can optimize, and even forecasting benchmarks can be gamed by open-ended agents that profit from prediction markets. The result is a “vibe era” of AI comparisons—fast-moving, contradictory, and increasingly dependent on what a benchmark is actually rewarding.
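The gap between a combined factuality metric and the hallucination rate conditional on being wrong can be made concrete with illustrative counts; the function and numbers below are made up, and only the 50%/38%/34% figures come from the transcript’s cited chart:

```python
# Two hypothetical models with identical overall accuracy can still differ
# sharply on how often their *wrong* answers are confident hallucinations
# rather than abstentions -- the distinction the transcript draws.

def hallucination_rate_on_errors(hallucinated_wrong, abstained_wrong):
    """Share of incorrect answers that are confident hallucinations
    rather than abstentions or hedged refusals."""
    wrong = hallucinated_wrong + abstained_wrong
    return hallucinated_wrong / wrong if wrong else 0.0

# Both toy models answer 100 questions and get 80 right (same accuracy).
rate_a = hallucination_rate_on_errors(hallucinated_wrong=10, abstained_wrong=10)
rate_b = hallucination_rate_on_errors(hallucinated_wrong=7, abstained_wrong=14)

print(rate_a)            # 0.5  (half of the errors are hallucinations)
print(round(rate_b, 2))  # 0.33 (a third of the errors are hallucinations)
```

A combined metric that averages correctness and calibration can rank model A above model B even though A fabricates answers on a larger share of its mistakes, which is how a chart-topping score and a ~50% error-hallucination rate can coexist.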
Cornell Notes
Gemini 3.1 Pro’s mixed benchmark story reflects a broader shift: most training compute now goes into post-training, where models are tuned against internal, domain-specific targets. That makes benchmark rankings less portable—strong performance on one evaluation (like ARC AGI 2) may not translate to other expert-task suites (like GDPval) or to real coding workflows. Even within a single benchmark, accuracy can depend on how inputs are encoded, letting models exploit unintended shortcuts rather than robust reasoning. The transcript also highlights that “reasoning” gains don’t automatically solve hallucinations: Gemini 3.1 Pro may score well on a combined metric but still hallucinate in roughly half of its incorrect answers. Overall, benchmark design and optimization targets increasingly determine what “intelligence” looks like on paper.
Why do benchmark rankings for LLMs increasingly contradict each other?
What makes ARC AGI 2 a tricky measure of general reasoning?
How can coding benchmarks mislead users about real coding agents?
What’s the significance of the transcript’s “Simple Bench” milestone, and what caveat comes with it?
Does better benchmark performance mean fewer hallucinations?
Why is a single “true” general intelligence benchmark hard to create?
Review Questions
- Which part of LLM training is described as consuming the majority of compute, and how does that affect cross-benchmark comparisons?
- How does input encoding (numbers vs symbols) potentially change outcomes on ARC AGI 2?
- What does the transcript suggest about multiple-choice vs open-ended question formats when measuring common-sense reasoning?
Key Points
1. Post-training now accounts for most LLM training compute, making model behavior increasingly dependent on domain-specific tuning and internal benchmarks.
2. Benchmark results can’t be treated as universal rankings because optimization targets differ across labs and domains.
3. ARC AGI 2 performance can be influenced by how inputs are encoded, enabling unintended shortcut patterns rather than robust reasoning.
4. Agentic coding benchmarks may reward overfitting to a task spec or black-box goal-seeking, so real-world coding outcomes can diverge.
5. Even strong combined metrics for factuality can hide substantial hallucination rates on incorrect answers.
6. Multiple-choice formats can cue trick-question shortcuts; open-ended setups with blind grading can produce materially lower scores.
7. A “single objective” general intelligence benchmark is difficult because benchmark designers have incentives and because agents can game evaluation systems, including prediction markets.