"A PHD in Everything" Grok 4 CRUSHES Every Leading AI Model | HANDS ON DEMO
Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
xAI’s Grok 4 has surged to the top of multiple high-stakes AI benchmarks, posting standout gains on reasoning-heavy tests while matching competitors on cost. On ARC AGI 2, a notoriously difficult benchmark where models must learn a mini-skill from a few examples and then apply it at test time, Grok 4 lands at a little over 16%, roughly double the prior best public score (Claude Opus 4 in thinking mode, at about half that). The improvement matters because ARC AGI 2 is designed so that scores below 10% are often treated as noise; breaking that threshold signals more than incremental progress.
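To make the ARC-style setup concrete, here is a toy sketch: a solver sees a few input→output grid pairs, induces the hidden rule, and applies it to a fresh test input. Real ARC AGI 2 tasks are far harder and open-ended; in this hypothetical example the hidden rule is just a consistent color substitution.

```python
def induce_color_map(examples):
    """Infer a cell-wise color substitution consistent with all example pairs."""
    mapping = {}
    for inp, out in examples:
        for row_in, row_out in zip(inp, out):
            for a, b in zip(row_in, row_out):
                # setdefault records the first observed mapping; a conflict
                # means the task is not a pure color substitution.
                if mapping.setdefault(a, b) != b:
                    raise ValueError("examples are not a pure color substitution")
    return mapping

def apply_rule(mapping, grid):
    """Apply the induced rule to a new test grid."""
    return [[mapping.get(c, c) for c in row] for row in grid]

# Two training pairs whose hidden rule is: color 1 becomes color 2.
examples = [
    ([[1, 0], [0, 1]], [[2, 0], [0, 2]]),
    ([[1, 1], [0, 0]], [[2, 2], [0, 0]]),
]
rule = induce_color_map(examples)
print(apply_rule(rule, [[0, 1], [1, 1]]))  # -> [[0, 2], [2, 2]]
```

The point of the benchmark is that the rule is never stated: the model must generalize from the examples alone, which is why sub-10% scores are treated as noise.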
Concerns about benchmark credibility are met with a claim of independent validation: Arc Prize president Greg says xAI contacted the organization about a day earlier to test Grok 4 on ARC AGI using a semi-private evaluation set. The testing policy described includes no data retention and a temporary rate-limit increase for burst testing. Arc Prize’s private results reportedly conclude that Grok 4 is the top-performing publicly available model on ARC AGI 2, outperforming even purpose-built Kaggle submissions. The transcript also frames Grok 4’s performance as efficient, delivering higher scores at the “same exact price” as leading alternatives.
The benchmark run extends beyond ARC. On “Humanity’s Last Exam,” Grok 4 reaches 38.6%, far above the narrator’s current daily driver (about 25% for o3) and above Gemini 2.5 Pro (around 27%). Grok 4 Heavy, the larger variant, pushes to 44.4%, nearly doubling o3’s score on this test. Other leaderboards are described as largely saturated: GPQA and AIME 25, where Grok 4 narrowly edges Gemini 2.5 Pro; LCB, where Grok 4 sits near 79.3%; and a “perfect score of 100%” for Grok 4 Heavy.
A key technical detail is how Grok 4 Heavy scales test-time compute. Instead of merely “thinking longer,” it splits reasoning across multiple agents, compares their outputs, and selects a final answer, likened to a research focus group. That approach is portrayed as powerful but expensive; the transcript notes the Grok 4 Heavy plan costs $300, limiting access.
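The focus-group idea can be sketched as parallel sampling plus a selection step. xAI has not published Grok 4 Heavy’s actual mechanism, so this is a generic illustration: `solve_once` is a hypothetical stand-in for one noisy model call, and the final answer is chosen by majority vote among the agents.

```python
import random
from collections import Counter

def solve_once(question, rng):
    """Hypothetical stand-in for one independent reasoning attempt.

    Succeeds (returns 42) about 70% of the time; otherwise returns a
    random wrong digit, simulating a noisy individual agent.
    """
    return 42 if rng.random() < 0.7 else rng.randint(0, 9)

def solve_heavy(question, n_agents=16, seed=0):
    """Run n_agents independent attempts and keep the consensus answer."""
    rng = random.Random(seed)
    answers = [solve_once(question, rng) for _ in range(n_agents)]
    # "Focus group": select the answer most agents converge on.
    best, _count = Counter(answers).most_common(1)[0]
    return best

print(solve_heavy("What is 6 * 7?"))
```

Because wrong answers are scattered while correct ones tend to agree, the consensus is usually more reliable than any single attempt; the cost is running many full reasoning passes per question, which matches the transcript’s “powerful but expensive” framing.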
Still, hands-on trials show that benchmark strength doesn’t automatically translate to flawless real-world performance. In a web-research task about ranking the top 50 snack foods by unit consumption, Grok 4 produces a confident ranking but its self-critique ultimately sides with OpenAI Deep Research as more accurate, citing mismatches such as overestimating peanuts and under-representing sunflower seeds. In a “who invented the Bigfoot AI trend” question, Grok 4 corrects the credit toward an earlier origin tied to the creator’s own TikTok/YouTube activity, but the transcript warns that even strong models can fall into the same traps as weaker ones on nuanced, timing-sensitive questions.
The most striking limitation is multimodal reliability. When asked to verify whether an image (generated with GPT-4o) is real, multiple models—including Grok 4—are described as failing to detect that the input is AI-generated, with Grok 4 reportedly treating it as a real photograph. The transcript contrasts this with the expectation that Grok 4 should be able to analyze images, while noting that native image input appears to be a downgrade for now, even as native audio features are announced.
Overall, Grok 4 is presented as a major leap in benchmark performance—especially for reasoning tasks—while the practical takeaway is more cautious: nuanced questions and image authenticity checks remain weak spots across leading models, even at the top of the leaderboard.
Cornell Notes
Grok 4 from xAI is reported to dominate reasoning benchmarks, especially ARC AGI 2, where it posts a little over 16%, about double the prior public best around 8–9%. Arc Prize president Greg claims xAI arranged semi-private testing to validate Grok 4’s ARC score and check for overfitting, concluding it’s the top publicly available model on ARC AGI 2. Grok 4 also scores strongly on “Humanity’s Last Exam” (38.6%), with Grok 4 Heavy reaching 44.4% using a multi-agent test-time compute approach. Hands-on checks, however, show persistent weaknesses: Grok 4 can still miss nuanced research details and struggles to reliably detect AI-generated images when asked to verify authenticity.
- Why does ARC AGI 2 matter, and what score did Grok 4 achieve there?
- What validation claim is made to address concerns that xAI-controlled benchmarks might be biased?
- How does Grok 4 Heavy differ from Grok 4 in its reasoning approach?
- What happened in the snack-food ranking test when Grok 4 compared itself to another system?
- What limitation showed up in the image authenticity test?
Review Questions
- Which benchmark threshold on ARC AGI 2 is treated as “noise,” and how does Grok 4’s score compare to it?
- What specific mechanism does Grok 4 Heavy use to improve answers at test time, and how is it different from simply thinking longer?
- In the snack-food ranking task, what two concrete ranking errors does the transcript attribute to Grok 4 (relative to Deep Research)?
Key Points
1. Grok 4 is reported to reach a little over 16% on ARC AGI 2, about double the prior public best near 8–9%.
2. Arc Prize president Greg claims semi-private ARC AGI testing validated Grok 4’s leaderboard score and found it top among publicly available models.
3. Grok 4 scores 38.6% on “Humanity’s Last Exam,” while Grok 4 Heavy reaches 44.4% using multi-agent test-time compute.
4. Grok 4 Heavy’s improvement is attributed to splitting reasoning across multiple agents and comparing outputs before answering.
5. Hands-on research tasks show Grok 4 can still produce confident but incorrect rankings, then self-correct by deferring to better-cited sources.
6. Image authenticity checks are a current weakness: Grok 4 reportedly misclassifies an AI-generated image as real.
7. New product features mentioned include a Grok voice mode and an upcoming AI video announcement from xAI.