
Is Elon’s Grok 3 the new AI king?

Fireship · 5 min read

Based on Fireship's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Grok 3 is claimed to have reached No. 1 on the LM Arena leaderboard, a human blind comparison test.

Briefing

Grok 3 is being positioned as a top-tier AI model—potentially the “AI king”—after it surged to the No. 1 spot on the LM Arena leaderboard and posted strong results on math, science, and coding tests. The pitch hinges on two themes: performance and a distinctive “truth-seeking” style that’s framed as less constrained than many competitors, including a “deep thinking mode” and capabilities such as text-to-video.

Support for the hype comes from benchmark placement and comparisons. Grok 3 is described as sitting at the top of LM Arena, a human “blind taste test” where people compare LLMs side by side; topping it is treated as a practical signal of quality. Additional benchmark claims say Grok 3 beats models including Gemini, Claude, DeepSeek, and GPT-4 in areas like math, science, and coding. At the same time, the transcript flags that some important models and benchmark suites are missing from the comparison set—specifically OpenAI’s o3, plus code-focused (Codeforces) and AGI-oriented (ARC-AGI) benchmarks—casting doubt on how complete the picture really is. The overall conclusion is that Grok 3 looks excellent, but its gains may be leveling off as the broader field shifts from building ever-larger base models toward improving prompting and reasoning frameworks.

The other major differentiator is Grok’s access to Twitter/X data and its training approach. The transcript claims Grok has direct access to the “firehose” of Twitter data, and that xAI optimized the system for “maximum truth seeking,” even if that reduces political correctness. That design choice is illustrated with an example: a prompt that allegedly produced highly offensive content about racial stereotypes, which the transcript says was blocked by other LLMs but not by Grok—so offensive that it can’t be shown on YouTube, and potentially illegal in some jurisdictions. The model is also said to be moving toward broader availability in countries like Germany and the UK.

Behind the scenes, xAI’s training infrastructure is presented as a key part of Grok’s competitiveness. Grok 3 is said to have been trained at the Colossus supercomputer in Memphis, Tennessee—described as the world’s largest AI supercomputer—built around a cluster of more than 200,000 Nvidia H100 GPUs with plans to scale to 1 million. The facility’s power demands are so high that it reportedly relies on portable diesel generators because grid supply can’t cover everything.

Finally, the transcript points to a near-term product roadmap: a paid subscription tier called “Super Grok,” expected to cost $30 per month, with the claim that it will be more powerful than Grok 3. It also notes that Grok’s current capabilities include a “deep thinking mode” (compared to DeepSeek R1) and text-to-video, while practical performance is framed as strong but not dramatically surpassing the state of the art—suggesting the real battleground may now be how models are prompted and integrated rather than only how they’re pre-trained.

Cornell Notes

Grok 3 is portrayed as a leading LLM after reaching No. 1 on the LM Arena leaderboard and performing strongly on math, science, and coding comparisons. Its standout feature is a “truth-seeking” style paired with claimed direct access to Twitter/X data, which is framed as enabling more uncensored outputs than many competitors. The transcript also stresses that benchmark comparisons may be incomplete, citing missing models (like OpenAI o3) and missing benchmark categories (Codeforces and ARC-AGI). Grok 3’s training is tied to xAI’s Colossus supercomputer in Memphis, described as extremely large and power-hungry. The near-term roadmap includes a $30/month “Super Grok” subscription, with the broader industry trend shifting toward better prompting frameworks.

What evidence is used to claim Grok 3 is near the top of the LLM field?

The transcript points to Grok 3 taking the No. 1 spot on the LM Arena leaderboard, described as a human “blind taste test” where people compare models side by side. It also cites another benchmark where Grok is said to beat Gemini, Claude, DeepSeek, and GPT-4 on math, science, and coding. However, it notes that some key comparisons are missing—OpenAI o3 is absent, and the Codeforces and ARC-AGI benchmarks are also not included—so the strength of the conclusion depends on what’s left out.

How does Grok 3’s “truth-seeking” approach differ from many other models?

Grok is described as being optimized for “maximum truth seeking,” even if that comes at the expense of political correctness. The transcript claims Grok has direct access to Twitter/X’s data stream (the “firehose”) and uses that to generate content that other models would block. An example is given where Grok allegedly returns highly offensive text about racial stereotypes, while other LLMs block the prompt; the transcript adds that such content could lead to legal trouble in some countries.

What role does training infrastructure play in the Grok 3 story?

Training is linked to the Colossus supercomputer in Memphis, Tennessee. The transcript describes it as the world’s largest AI supercomputer, with a cluster of over 200,000 Nvidia H100 GPUs and plans to expand to 1 million GPUs. It also emphasizes power constraints—so much electricity is required that the facility reportedly brings in portable diesel generators because grid power can’t cover the load.

Why does the transcript argue that benchmark results may not tell the whole story?

It highlights that benchmark suites are often cherry-picked. Specific omissions are called out: OpenAI o3 is missing from the comparison set, and benchmarks like Codeforces and ARC-AGI are not included. That means Grok’s ranking could look stronger (or weaker) depending on which models and evaluation categories are selected.

What does the transcript suggest about where progress is happening across the AI industry?

It claims the field’s focus is shifting from building bigger base models to improving prompting and reasoning frameworks—citing examples like “deep research” and “big brain mode.” Grok is treated as part of this era: its strong performance is real, but it appears to be plateauing at roughly the same level as other state-of-the-art models.

What product and pricing changes are mentioned for Grok?

The transcript says Grok 3 is already accessible in some form and that a paid subscription called “Super Grok” is expected soon. Super Grok is described as costing $30 per month and being more powerful than Grok 3, with a comparison to ChatGPT Pro at $200 per month.

Review Questions

  1. Which leaderboard and benchmark claims are used to support Grok 3’s top ranking, and what specific omissions are flagged?
  2. How does the transcript connect Grok 3’s training data access to its uncensored or politically nonconforming behavior?
  3. What does the transcript say about the industry shift from larger base models to prompting frameworks, and how does that affect how Grok should be evaluated?

Key Points

  1. Grok 3 is claimed to have reached No. 1 on the LM Arena leaderboard, a human blind comparison test.

  2. The transcript credits Grok 3’s performance to both strong benchmark results and a distinctive “truth-seeking” approach.

  3. Grok is described as having direct access to Twitter/X’s data stream, which is presented as a differentiator.

  4. Benchmark comparisons are criticized for omissions, including OpenAI o3 and certain code/AGI benchmark categories.

  5. Grok 3’s training is tied to the Colossus supercomputer in Memphis, described as extremely large and power-intensive.

  6. A new paid tier called “Super Grok” is expected at $30 per month, with claims of greater capability than Grok 3.

Highlights

Grok 3 is described as topping LM Arena, a human blind taste test where people compare LLMs directly.
The transcript links Grok’s “uncensored” behavior to a “truth-seeking” training goal and claimed direct access to Twitter/X data.
Colossus in Memphis is portrayed as the scale driver behind training—over 200,000 Nvidia H100 GPUs and plans to reach 1 million, with heavy power infrastructure needs.
The comparison set for Grok’s benchmarks is called out as incomplete, omitting OpenAI’s o3 as well as the Codeforces and ARC-AGI benchmark suites.

Topics

Mentioned

  • LM Arena
  • LLM
  • GPT-4
  • AGI
  • GPU
  • H100