
"A PHD in Everything" Grok 4 CRUSHES Every Leading AI Model | HANDS ON DEMO

MattVidPro · 5 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Grok 4 is reported to reach a little over 16% on ARC AGI 2, about double the prior public best near 8–9%.

Briefing

xAI’s Grok 4 has surged to the top of multiple high-stakes AI benchmarks, posting standout gains in reasoning-heavy tests while matching competitors on cost. On ARC AGI 2, a notoriously difficult benchmark where models must learn a mini-skill from examples and then apply it at test time, Grok 4 lands at a little over 16%, roughly double the prior best public score (set by Claude Opus 4 in thinking mode, at about 8–9%). The improvement matters because ARC AGI 2 is designed so that scores below 10% are often treated as noise; breaking that threshold signals more than incremental progress.
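
The “mini-skill” framing is concrete in ARC’s task format: each task ships a few input/output grid pairs as training examples plus a held-out test input. Here is a toy illustration in Python using ARC’s JSON-style layout (the grids and the mirroring rule are invented for illustration, not an actual ARC AGI 2 item):

```python
# A toy ARC-style task: each grid is a 2-D list of color indices (0-9).
# The hidden "mini-skill" here is horizontal mirroring; a solver must
# infer it from the train pairs alone and apply it to the test input.
task = {
    "train": [
        {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
        {"input": [[5, 5, 0]],      "output": [[0, 5, 5]]},
    ],
    "test": [
        {"input": [[7, 8, 9]]},  # expected output: [[9, 8, 7]]
    ],
}

def mirror(grid):
    """The rule a solver would have to discover from the examples."""
    return [row[::-1] for row in grid]

# Sanity check: the inferred rule reproduces every training pair.
assert all(mirror(p["input"]) == p["output"] for p in task["train"])
print(mirror(task["test"][0]["input"]))  # -> [[9, 8, 7]]
```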

Concerns about benchmark credibility are met with a claim of independent validation: ARC Prize president Greg Kamradt says xAI contacted the organization about a day earlier to test Grok 4 on ARC AGI using a semi-private evaluation set. The testing policy described includes no data retention and a temporary rate-limit increase for burst testing. ARC Prize’s private results reportedly conclude that Grok 4 is the top-performing publicly available model on ARC AGI 2, even outperforming purpose-built Kaggle submissions. The transcript also frames Grok 4’s performance as efficient, delivering higher scores at the “same exact price” compared with leading alternatives.

The benchmark run extends beyond ARC. On “Humanity’s Last Exam,” Grok 4 reaches 38.6%, far above the narrator’s current daily driver, o3 (about 25%), and above Gemini 2.5 Pro (around 27%). Grok 4 Heavy, the larger variant, pushes to 44.4%, nearly double o3’s result on this test. Other leaderboards are described as largely saturated: GPQA, where Grok 4 narrowly edges Gemini 2.5 Pro; AIME 25, where Grok 4 Heavy posts a “perfect score of 100%”; and LCB, where Grok 4 sits near 79.3%.

A key technical detail is how Grok 4 Heavy scales test-time compute. Instead of merely “thinking longer,” it splits reasoning across multiple agents, compares their outputs, and selects a final answer, an approach likened to a research focus group. That approach is portrayed as powerful but expensive, with the transcript noting the Grok 4 Heavy plan costs $300 a month, limiting access.
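
The transcript stops at the focus-group analogy, so the exact selection rule isn’t specified. A common way to implement the general pattern is parallel sampling plus a consensus pick; here is a minimal Python sketch under that assumption, with `ask_agent` as a hypothetical stub standing in for a real model call:

```python
import collections
from concurrent.futures import ThreadPoolExecutor

def ask_agent(question: str, seed: int) -> str:
    """Hypothetical stand-in for one independent model call.

    A real implementation would let each agent produce its own
    reasoning trace; here we fake three slightly divergent answers.
    """
    canned = ["42", "42", "41"]
    return canned[seed % len(canned)]

def heavy_style_answer(question: str, n_agents: int = 3) -> str:
    # Run several "agents" in parallel, each reasoning independently.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        answers = list(pool.map(lambda s: ask_agent(question, s), range(n_agents)))
    # Compare the outputs and pick the consensus answer (majority vote).
    return collections.Counter(answers).most_common(1)[0][0]

print(heavy_style_answer("What is 6 * 7?"))  # -> "42"
```

Majority voting is only one plausible aggregation rule; a judge model that reads all candidate answers and picks one fits the same multi-agent shape, and either way the gains come from more parallel samples rather than a longer single chain of thought.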

Still, hands-on trials show that benchmark strength doesn’t automatically translate to flawless real-world performance. In a web-research task about ranking the top 50 snack foods by unit consumption, Grok 4 produces a confident ranking but its self-critique ultimately sides with OpenAI Deep Research as more accurate, citing mismatches such as overestimating peanuts and under-representing sunflower seeds. In a “who invented the Bigfoot AI trend” question, Grok 4 corrects the credit toward an earlier origin tied to the creator’s own TikTok/YouTube activity, but the transcript warns that even strong models can fall into the same traps as weaker ones on nuanced, timing-sensitive questions.

The most striking limitation is multimodal reliability. When asked to verify whether an image (generated with GPT-4o) is real, multiple models—including Grok 4—are described as failing to detect that the input is AI-generated, with Grok 4 reportedly treating it as a real photograph. The transcript contrasts this with the expectation that Grok 4 should be able to analyze images, while noting that native image input appears to be a downgrade for now, even as native audio features are announced.

Overall, Grok 4 is presented as a major leap in benchmark performance—especially for reasoning tasks—while the practical takeaway is more cautious: nuanced questions and image authenticity checks remain weak spots across leading models, even at the top of the leaderboard.

Cornell Notes

Grok 4 from xAI is reported to dominate reasoning benchmarks, especially ARC AGI 2, where it posts a little over 16%, about double the prior public best around 8–9%. ARC Prize president Greg Kamradt claims xAI arranged semi-private testing to validate Grok 4’s ARC score and check for overfitting, concluding it’s the top publicly available model on ARC AGI 2. Grok 4 also scores strongly on “Humanity’s Last Exam” (38.6%), with Grok 4 Heavy reaching 44.4% using a multi-agent test-time compute approach. Hands-on checks, however, show persistent weaknesses: Grok 4 can still miss nuanced research details and struggles to reliably detect AI-generated images when asked to verify authenticity.

Why does ARC AGI 2 matter, and what score did Grok 4 achieve there?

ARC AGI 2 is framed as a hard benchmark because models must learn a mini-skill from training examples and then apply it correctly at test time. The transcript notes that scores below 10% are often considered noisy. Grok 4 lands at a little over 16%, described as the best score ever seen on the benchmark and roughly double the prior top public score (set by Claude Opus 4 in thinking mode, at about 8–9%).

What validation claim is made to address concerns that XAI-controlled benchmarks might be biased?

ARC Prize president Greg Kamradt is quoted as saying xAI called about 24 hours earlier to test Grok 4 on ARC AGI. ARC Prize reportedly ran Grok 4 itself on a semi-private evaluation set, under a policy of no data retention and only a temporary rate-limit increase for burst testing. The private conclusion described is that Grok 4 is the top-performing publicly available model on ARC AGI 2 and even beats purpose-built Kaggle solutions.

How does Grok 4 Heavy differ from Grok 4 in its reasoning approach?

Grok 4 Heavy is described as scaling test-time compute differently from models that simply “think for longer.” It splits reasoning into multiple agents, compares their outputs, and then produces a final answer, likened to a research focus group. The transcript emphasizes that this improves results but increases cost, and the narrator notes access is limited by the $300-a-month Grok 4 Heavy plan.

What happened in the snack-food ranking test when Grok 4 compared itself to another system?

For a task ranking the top 50 snack foods by unit consumption, Grok 4 produced a ranking and then performed a self-analysis comparing its results to OpenAI Deep Research. The transcript says Grok 4 overestimated peanuts and under-represented sunflower seeds, and it ultimately judged Deep Research’s data-driven rankings and citations as more accurate overall.

What limitation showed up in the image authenticity test?

When given an AI-generated image (created with GPT-4o) and asked whether it was a real photo, Grok 4 reportedly failed to detect the fabrication. The transcript describes Grok 4 as treating the image as an authentic snapshot, while other models also produced confident but incorrect authenticity conclusions. The narrator concludes that Grok 4’s image verification is unreliable right now, implying a native image-input limitation.

Review Questions

  1. Which benchmark threshold on ARC AGI 2 is treated as “noise,” and how does Grok 4’s score compare to it?
  2. What specific mechanism does Grok 4 Heavy use to improve answers at test time, and how is it different from simply thinking longer?
  3. In the snack-food ranking task, what two concrete ranking errors does the transcript attribute to Grok 4 (relative to Deep Research)?

Key Points

  1. Grok 4 is reported to reach a little over 16% on ARC AGI 2, about double the prior public best near 8–9%.

  2. ARC Prize president Greg Kamradt claims semi-private ARC AGI testing validated Grok 4’s leaderboard score and found it top among publicly available models.

  3. Grok 4 scores 38.6% on “Humanity’s Last Exam,” while Grok 4 Heavy reaches 44.4% using multi-agent test-time compute.

  4. Grok 4 Heavy’s improvement is attributed to splitting reasoning across multiple agents and comparing outputs before answering.

  5. Hands-on research tasks show Grok 4 can still produce confident but incorrect rankings, then self-correct by deferring to better-cited sources.

  6. Image authenticity checks are a current weakness: Grok 4 reportedly misclassifies an AI-generated image as real.

  7. New product features mentioned include a Grok voice mode and an upcoming AI video announcement from xAI.

Highlights

On ARC AGI 2, Grok 4 posts a little over 16%, crossing the transcript’s cited “noise” barrier at 10% and doubling the prior best public score.
Grok 4 Heavy reaches 44.4% on “Humanity’s Last Exam,” with gains attributed to multi-agent test-time reasoning rather than longer single-agent thinking.
Despite benchmark dominance, Grok 4 reportedly fails to detect that an AI-generated image is fake when asked to verify photo authenticity.

Topics

  • Grok 4 Benchmarks
  • ARC AGI 2
  • Grok 4 Heavy Multi-Agent
  • Test-Time Compute
  • Multimodal Image Limits
