Is Elon’s Grok 3 the new AI king?
Based on Fireship's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.
Grok 3 is claimed to have reached No. 1 on the LM Arena leaderboard, a human blind comparison test.
Briefing
Grok 3 is being positioned as a top-tier AI model, potentially the new “AI king”, after it surged to the No. 1 spot on the LM Arena leaderboard and posted strong results on math, science, and coding tests. The pitch hinges on two themes: raw performance and a distinctive “truth-seeking” style that is framed as less constrained than many competitors, including a “deep thinking mode” and capabilities such as text-to-video.
Support for the hype comes from benchmark placement and comparisons. Grok 3 is described as sitting at the top of LM Arena, a human “blind taste test” where people compare LLM outputs side by side; topping it is treated as a practical signal of quality. Additional benchmark claims say Grok 3 beats models including Gemini, Claude, DeepSeek, and GPT-4 in areas like math, science, and coding. At the same time, the transcript flags that some important models and benchmark suites are missing from the comparison set, specifically OpenAI’s o3 plus code-focused and AGI-oriented benchmarks, casting doubt on how complete the picture really is. The overall conclusion is that Grok 3 looks excellent, but its gains may be leveling off as the broader field shifts from building ever-larger base models toward improving prompting and reasoning frameworks.
The other major differentiator is Grok’s access to Twitter/X data and its training approach. The transcript claims Grok has direct access to the “firehose” of Twitter data, and that xAI optimized the system for “maximum truth seeking”, even at the cost of political correctness. That design choice is illustrated with an example: a prompt that allegedly produced highly offensive content about racial stereotypes, which the transcript says other LLMs blocked but Grok did not; the output was reportedly too offensive to show on YouTube and potentially illegal in some jurisdictions. The model is also said to be moving toward broader availability in countries such as Germany and the UK.
Behind the scenes, xAI’s training infrastructure is presented as a key part of Grok’s competitiveness. Grok 3 was reportedly trained at the Colossus supercomputer in Memphis, Tennessee, described as the world’s largest AI supercomputer, built around a cluster of more than 200,000 Nvidia H100 GPUs with plans to scale to 1 million. The facility’s power demands are so high that it reportedly relies on portable diesel generators because the grid can’t supply everything.
Finally, the transcript points to a near-term product roadmap: a paid subscription tier called “Super Grok”, expected to cost $30 per month, with the claim that it will be more powerful than Grok 3. It also notes that Grok’s current capabilities include a “deep thinking mode” (compared to DeepSeek R1) and text-to-video, while practical performance is framed as strong but not dramatically surpassing the state of the art, suggesting the real battleground may now be how models are prompted and integrated rather than only how they are pre-trained.
Cornell Notes
Grok 3 is portrayed as a leading LLM after reaching No. 1 on the LM Arena leaderboard and performing strongly on math, science, and coding comparisons. Its standout feature is a “truth-seeking” style paired with claimed direct access to Twitter/X data, which is framed as enabling more uncensored outputs than many competitors. The transcript also stresses that benchmark comparisons may be incomplete, citing missing models (like OpenAI o3) and missing benchmark categories (Codeforces and ARC-AGI). Grok 3’s training is tied to xAI’s Colossus supercomputer in Memphis, described as extremely large and power-hungry. The near-term roadmap includes a $30/month “Super Grok” subscription, with the broader industry trend shifting toward better prompting frameworks.
What evidence is used to claim Grok 3 is near the top of the LLM field?
How does Grok 3’s “truth-seeking” approach differ from many other models?
What role does training infrastructure play in the Grok 3 story?
Why does the transcript argue that benchmark results may not tell the whole story?
What does the transcript suggest about where progress is happening across the AI industry?
What product and pricing changes are mentioned for Grok?
Review Questions
- Which leaderboard and benchmark claims are used to support Grok 3’s top ranking, and what specific omissions are flagged?
- How does the transcript connect Grok 3’s training data access to its uncensored or politically nonconforming behavior?
- What does the transcript say about the industry shift from larger base models to prompting frameworks, and how does that affect how Grok should be evaluated?
Key Points
- 1. Grok 3 is claimed to have reached No. 1 on the LM Arena leaderboard, a human blind comparison test.
- 2. The transcript credits Grok 3’s performance to both strong benchmark results and a distinctive “truth-seeking” approach.
- 3. Grok is described as having direct access to Twitter/X’s data stream, which is presented as a differentiator.
- 4. Benchmark comparisons are criticized for omissions, including OpenAI o3 and certain code/AGI benchmark categories.
- 5. Grok 3’s training is tied to the Colossus supercomputer in Memphis, described as extremely large and power-intensive.
- 6. A new paid tier called “Super Grok” is expected at $30 per month, with claims of greater capability than Grok 3.