Grok 3: “Smartest AI on Earth” Takes Down o3 mini, DeepSeek in Record time.

MattVidPro · 6 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Grok 3’s early positioning hinges on both massive compute scaling (about 200,000 GPUs across two training phases) and a dedicated fine-tuned reasoning model, not just incremental updates.

Briefing

Grok 3 is being positioned as a near-instant leap in frontier chatbot capability—powered by a massive compute ramp, a dedicated reasoning model, and “think”/agent-style features—while early benchmark and community tests suggest it can compete with (and in some cases outperform) top models from OpenAI, Google, and DeepSeek. The stakes are straightforward: xAI is trying to shift market perception from “incremental upgrade” to “new benchmark leader,” and the speed of the rollout is meant to pressure the rest of the industry to respond.

Early results highlighted on LMArena (the LMSYS Chatbot Arena) place Grok 3 (an earlier training version than the one shown publicly) at the top with an Elo around 1402, narrowly ahead of Google’s Gemini 2.0 Flash Thinking Experimental (about 1385) and OpenAI’s ChatGPT-4o-latest (about 1377). Even so, the transcript stresses that these scores are not a single universal truth: user preferences, task differences, and benchmark methodology can shift outcomes. It also notes that Grok 3 mini sits much lower on that specific leaderboard (around 1305), reinforcing that “Grok 3” performance depends heavily on which variant and settings are used.
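
For intuition, a 17-point Elo gap is small: under the textbook Elo expected-score formula it corresponds to roughly a 52% head-to-head preference rate. A minimal Python sketch (the formula is standard; LMArena’s actual rating fit differs in estimation details):

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Grok 3 (early version) vs. Gemini 2.0 Flash Thinking Experimental
print(f"{elo_expected_score(1402, 1385):.3f}")  # ~0.524 -- a slim edge
# Grok 3 vs. DeepSeek R1
print(f"{elo_expected_score(1402, 1361):.3f}")  # ~0.559
```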

What’s new is framed as more than scaling. xAI reportedly trained Grok 3 on roughly 200,000 GPUs in a two-phase run: 100,000 GPUs for 122 days, followed by 92 days scaling up to 200,000. Beyond the base model, xAI built multiple models, including a fine-tuned reasoning model designed to mimic human-style critical thinking. In interface demos, the “think” option is used to tackle tasks that require multi-step computation and code generation, such as calculating a trajectory for a spacecraft traveling Earth→Mars→Earth and outputting code to animate a 3D plot. Another demo, run in “Big Brain mode,” combines Tetris-like mechanics with a Bejeweled-style removal rule; the mode appears to spend more compute at inference time (possibly by running multiple chains of thought and combining them).
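
That speculation matches a well-known pattern, self-consistency decoding: sample several independent chains of thought at nonzero temperature, then combine them by majority vote on the final answer. A toy, runnable sketch of the pattern; the 70%-accurate `sample_chain` is a stand-in for a real model call, and nothing here reflects xAI’s actual implementation:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import random

def sample_chain(prompt: str, seed: int) -> str:
    """Toy stand-in for one temperature>0 reasoning pass: right 70% of the time."""
    rng = random.Random(seed)
    return "42" if rng.random() < 0.7 else str(rng.randint(0, 99))

def combine_chains(prompt: str, n_chains: int = 8) -> str:
    """Sample n independent chains in parallel, then majority-vote the answers."""
    with ThreadPoolExecutor(max_workers=n_chains) as pool:
        answers = list(pool.map(lambda s: sample_chain(prompt, s), range(n_chains)))
    return Counter(answers).most_common(1)[0][0]

print(combine_chains("hard question"))  # "42": the vote recovers the right
                                        # answer even though single chains fail
```

With the fixed seeds above, five of eight chains return “42,” so the vote succeeds even though any single chain fails 30% of the time; more inference compute buys reliability rather than new knowledge.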

Access and packaging are also part of the competitive story. The transcript says Grok 3 features (DeepSearch, advanced reasoning, increased usage limits, and “think” modes) are available through X Premium Plus and a separate Grok app. Pricing is described as about $30 per month, compared with OpenAI’s higher-cost Pro tier for “deep research,” which the transcript treats as a direct competitive analogue.

On harder numbers, the transcript cites benchmark-style comparisons across math, science, coding, and other categories. Grok 3 Reasoning and Grok 3 mini Reasoning are described as strong performers, with particularly high coding and math scores in the cited charts. It also flags a key caveat: some top competitors’ models (like OpenAI’s o3-mini-high) aren’t publicly accessible in the same way, and DeepSeek models are open source while Grok 3 is not, meaning developers may choose differently based on cost, openness, and integration needs.

Community reactions add a second layer of evidence. Several users and accounts report impressive coding demos (including a Portal 2-like build in a game environment and generative graphics/simulations), while others raise concerns about safety and controllability. One account claims the guardrails can be bypassed quickly and even reports leaking a system prompt, including tool context and a knowledge-cutoff note. Overall sentiment in the transcript is cautiously optimistic: Grok 3 looks very capable, but claims of “best on Earth” still need broader, fairer comparisons.

The bigger takeaway is competitive acceleration. The transcript argues Grok 3’s combination of compute scale, reasoning specialization, and fast feature shipping is forcing rivals to respond—benefiting users through more choice and faster iteration, while also reigniting the open-source debate across the industry.

Cornell Notes

Grok 3 is framed as a major capability jump driven by heavy compute scaling (about 200,000 GPUs across two training phases) plus a dedicated fine-tuned reasoning model. Early leaderboard results on LMArena place a Grok 3 variant at or near the top (around 1402 Elo), while cited benchmark charts suggest strong performance in coding and math when reasoning is enabled. xAI also adds inference-time “think” and “Big Brain mode” features that spend more compute to improve multi-step outputs, including code-heavy tasks like trajectory animation and game-like logic hybrids. Access is described as bundled in X Premium Plus and the Grok app, with pricing around $30/month. Community demos show impressive coding and visual generation, while safety testers report guardrail bypasses and system-prompt leakage, underscoring the need for more rigorous, apples-to-apples evaluation.

What concrete changes are claimed to make Grok 3 different from Grok 2?

The transcript attributes the jump to both scale and specialization: Grok 3 is trained on a very large GPU cluster (about 200,000 GPUs total), with a two-phase schedule (100,000 GPUs for 122 days, then 92 days expanding to 200,000). It also emphasizes that xAI didn’t just scale a base model; multiple models were built, including a fine-tuned reasoning model intended to mimic human-like critical thinking. On top of that, interface features like “think” and “Big Brain mode” increase compute at inference time for harder tasks.
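
Taken at face value, those schedule figures bound the total training budget. Treating the 92-day ramp as somewhere between a flat 100,000 and a flat 200,000 GPUs gives the following back-of-envelope range (pure arithmetic on the transcript’s numbers, with no utilization or efficiency assumptions):

```python
# Phase 1 is fixed; phase 2 ramped from 100k toward 200k GPUs, so bound it.
phase1 = 100_000 * 122                        # 12.2M GPU-days
phase2_low, phase2_high = 100_000 * 92, 200_000 * 92
total_low = phase1 + phase2_low               # 21.4M GPU-days
total_high = phase1 + phase2_high             # 30.6M GPU-days
print(f"{total_low/1e6:.1f}M to {total_high/1e6:.1f}M GPU-days")
```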

How do “think” and “Big Brain mode” show up in practice?

In the transcript’s examples, “think” is enabled to solve a long, dense prompt that requires heavy computation and code generation: calculating an Earth→Mars→Earth trajectory and producing code to animate a 3D plot. “Big Brain mode” is described as a mode that uses more compute at inference time; the transcript speculates it may run multiple chains of thought in parallel and combine them. A demonstration combines Tetris-style falling blocks with a Bejeweled-style removal mechanic, producing a hybrid that the transcript calls “primitive” but still impressive.
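
To make the trajectory demo concrete, here is a minimal sketch of the kind of code such a prompt elicits: idealized circular, coplanar orbits (1 AU and 1.52 AU) with a naively interpolated transfer arc, animated in matplotlib. It is illustrative only, not Grok 3’s actual output and not real astrodynamics:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

frames = 300
t = np.linspace(0, 1, frames)
theta_e = 2 * np.pi * 2.0 * t            # Earth: ~2 orbits over the window
theta_m = 2 * np.pi * 2.0 / 1.88 * t     # Mars: slower angular rate
earth = np.stack([np.cos(theta_e), np.sin(theta_e), np.zeros(frames)])
mars = 1.52 * np.stack([np.cos(theta_m), np.sin(theta_m), np.zeros(frames)])

# Outbound leg blends toward Mars, return leg blends back toward Earth.
half = frames // 2
ramp = t[:half] / t[half - 1]
blend = np.concatenate([ramp, 1 - ramp])
ship = earth * (1 - blend) + mars * blend
ship[2] += 0.1 * np.sin(np.pi * blend)   # lift out of plane so the arc reads in 3D

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot(*earth, lw=0.5, color="tab:blue")
ax.plot(*mars, lw=0.5, color="tab:red")
dot_e, = ax.plot([], [], [], "o", color="tab:blue")
dot_m, = ax.plot([], [], [], "o", color="tab:red")
dot_s, = ax.plot([], [], [], "o", color="k")

def update(i):
    dot_e.set_data_3d(*earth[:, i:i + 1])
    dot_m.set_data_3d(*mars[:, i:i + 1])
    dot_s.set_data_3d(*ship[:, i:i + 1])
    return dot_e, dot_m, dot_s

ani = FuncAnimation(fig, update, frames=frames, interval=30)
plt.show()
```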

What do the early benchmark claims say about Grok 3 versus rivals?

LMArena results cited in the transcript place an earlier training version of Grok 3 at #1 with an Elo around 1402, narrowly ahead of Gemini 2.0 Flash Thinking Experimental (~1385) and ChatGPT-4o-latest (~1377), with DeepSeek R1 (~1361) further down. The transcript also warns that these scores reflect a specific benchmark setup and can vary by task and user preferences. Separate benchmark-style charts for math/science/coding are described as showing Grok 3 Reasoning and Grok 3 mini Reasoning performing very strongly, especially in coding and math, though some models (like OpenAI’s o3-mini-high) are not publicly accessible for direct testing.

How is Grok 3 positioned for consumers—what features and pricing are mentioned?

The transcript says Grok 3 features are available through X Premium Plus, including DeepSearch, advanced reasoning, increased usage limits, and access to “think” modes. It also mentions a separate Grok app. Pricing is described as about $30 per month (or $300 per year). The transcript frames this as competitive with OpenAI’s higher-cost Pro plan for “deep research,” especially for users who only want research-like capabilities and not other bundled products.

What do community tests and safety checks suggest?

Community reactions include praise for coding and creative demos: a Portal 2-like project reportedly coded by Grok 3, plus p5.js simulations (a sphere made of characters with brightness gradients) and other visual experiments like a supernova simulation. At the same time, a jailbreak-focused account claims the guardrails can be bypassed quickly, reports system-prompt leakage, and even cites examples like generating harmful instructions (e.g., thermite guidance) after bypassing protections. The transcript concludes that Grok 3 looks strong, but “best model” claims need fairer, broader verification.
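
The “sphere made of characters” demo is easy to approximate. Below is a hedged re-creation in Python rather than p5.js: each terminal cell gets a glyph whose visual density tracks a simple Lambert shading term. The glyph ramp and light direction are arbitrary choices of this sketch, not the original demo’s:

```python
import math

GLYPHS = " .:-=+*#%@"                      # dim -> bright
LIGHT = (0.5, 0.5, 0.7071)                 # roughly unit-length light direction

for row in range(22):
    y = (row - 10.5) / 10.5                # map terminal cell to [-1, 1]
    line = []
    for col in range(44):
        x = (col - 21.5) / 21.5            # chars are ~half as wide as tall
        r2 = x * x + y * y
        if r2 > 1.0:
            line.append(" ")               # outside the sphere's silhouette
            continue
        z = math.sqrt(1.0 - r2)            # surface point on the unit sphere
        shade = max(0.0, x * LIGHT[0] + y * LIGHT[1] + z * LIGHT[2])
        line.append(GLYPHS[min(int(shade * len(GLYPHS)), len(GLYPHS) - 1)])
    print("".join(line))
```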

Review Questions

  1. Which parts of Grok 3’s improvement are attributed to training scale versus inference-time features like “think” and “Big Brain mode”?
  2. Why might LMArena Elo rankings not translate directly into a universal “best model” conclusion?
  3. What safety-related concerns were raised in community testing, and how do they affect confidence in performance claims?

Key Points

  1. Grok 3’s early positioning hinges on both massive compute scaling (about 200,000 GPUs across two training phases) and a dedicated fine-tuned reasoning model, not just incremental updates.

  2. LMArena results cited in the transcript place a Grok 3 variant at the top with an Elo around 1402, but the transcript stresses benchmark sensitivity to task and setup.

  3. Inference-time features like “think” and “Big Brain mode” are used to boost multi-step, code-generating tasks, including trajectory animation and hybrid game-logic prompts.

  4. Grok 3 capabilities are described as bundled under X Premium Plus (and accessible via the Grok app), with pricing around $30/month, framed as competitive with OpenAI’s higher-cost research tier.

  5. Community demos report strong coding and visual generation, including game-like and p5.js simulation outputs.

  6. Safety-testing claims suggest guardrails can be bypassed and system prompts may be leaked, indicating controllability and verification remain open issues.

  7. The overall competitive impact is framed as accelerating the AI arms race, pushing rivals to respond quickly with comparable reasoning and research-style features.

Highlights

A Grok 3 variant is cited at #1 on LMArena with an Elo around 1402, narrowly ahead of Gemini 2.0 Flash Thinking Experimental and ChatGPT-4o-latest.
“Think” mode is demonstrated on a long prompt that requires trajectory math plus code to generate an animated 3D Earth–Mars–Earth plot.
“Big Brain mode” is described as spending more compute at inference time, possibly by running multiple chains of thought, to improve outputs on harder tasks like a Tetris/Bejeweled hybrid.
Community testing includes both impressive coding demos (Portal 2-like behavior, p5.js simulations) and jailbreak claims involving guardrail bypasses and system-prompt leakage.