Grok-2 Actually Out, But What If It Were 10,000x the Size?

AI Explained · 6 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Grok 2 is testable via a Twitter chatbot, but the lack of a paper or model card shifts evaluation toward benchmark results and hands-on reasoning checks.

Briefing

Grok 2 is now available for testing through a Twitter chatbot, but the bigger story isn’t just how it benchmarks—it’s what its release signals about where large language models may be heading: toward internal “world models” that let them reason about cause and effect, not merely pattern-match text. With no paper or model card published yet, the only concrete evidence comes from benchmark results and hands-on testing, which suggest Grok 2 is strong, but still not the top performer across the board.

On standard LLM evaluations, Grok 2 lands near the very top. On the Google-Proof Science Q&A benchmark (GPQA) and MMLU-Pro (think of the latter as a subject-knowledge test with fewer distractions), Grok 2 places second, behind Claude 3.5 Sonnet. On at least one math benchmark, it scores highest in that set. The transcript also notes a separate “Simple Bench” effort used for reasoning checks; without Grok 2 API access, testing relied on a custom question set. That informal run still produced answers that often look well reasoned, scoring over 90% on the author’s reasoning-style questions, but Grok 2 also missed items that Claude 3.5 Sonnet got right, implying Grok 2 may lag on that particular reasoning profile.

A key practical detail is that Grok 2’s system prompt appears to have been inferred from a jailbreak. The prompt draws inspiration from The Hitchhiker’s Guide to the Galaxy and includes constraints about not having access to internal x/Twitter data and systems, with an overall goal of maximizing truthfulness. That matters because the transcript frames the next bottleneck for society not as raw intelligence alone, but trust: as image and video generation improves, fake media will spread faster than verification can keep up. The discussion points to near-term realism—possibly within months to a couple of years—where even video calls could become hard to trust.

To counter that erosion of shared reality, the transcript argues that technical solutions like zero-knowledge proofs could enable “personhood credentials” without relying on easily forged identifiers. It also highlights a parallel challenge: tracing the provenance of synthetic outputs back to training sources. A Google executive reportedly discussed methods to trace outputs to training data fractions for creator compensation, but the transcript calls that approach nearly impossible for highly creative outputs.

Finally, the transcript widens from Grok 2 to scaling forecasts. An Epoch AI paper is cited for projecting how much compute and model scale could grow by 2030, with the headline claim that models might reach roughly 10,000× the scale of GPT-4. The transcript treats that as plausible given constraints like data scarcity, chip production, and power—while emphasizing that scale alone may not guarantee a step-change. The more important question is whether models are learning latent causal structure—internal simulations of how the world works. Experiments described in the transcript suggest language models can develop hidden concepts and even track states across puzzles, with training sometimes producing correct “instructions” at high rates. The open question remains whether future systems will become AGI directly, or whether they’ll mainly serve as interfaces that drive separate world simulators—or, alternatively, become engines for convincing deepfakes.

Cornell Notes

Grok 2 is available for testing via a Twitter chatbot, but with no paper or model card released yet, evaluation relies on benchmark placements and informal reasoning tests. On major benchmarks, Grok 2 ranks near the top—often second to Claude 3.5 Sonnet—and it shows strong math performance in at least one category. A jailbreak reportedly revealed a system prompt inspired by The Hitchhiker’s Guide to the Galaxy, emphasizing truthfulness and clarifying that the model lacks access to internal x/Twitter data. The transcript then shifts to a broader concern: synthetic images and videos may soon be realistic enough to undermine trust, pushing interest toward provenance and cryptographic approaches like zero-knowledge proofs. Underneath it all is a scaling debate—whether massive compute will yield richer internal “world models” that support causal reasoning, not just better text prediction.

What evidence suggests Grok 2 is strong even without a released paper or model card?

The transcript cites benchmark results where Grok 2 places second in the Google-Proof Science Q&A benchmark (GPQA) and MMLU-Pro, both behind Claude 3.5 Sonnet. It also claims Grok 2 tops at least one math benchmark (MathVista). Separately, it describes informal testing using a custom question set aligned with the author’s “Simple Bench” reasoning style, where Grok 2 scored over 90%—though it still missed questions that Claude 3.5 Sonnet answered correctly.

Why does the inferred system prompt matter for how Grok 2 behaves?

A notorious jailbreak is said to have revealed Grok 2’s system prompt. The prompt reportedly draws inspiration from The Hitchhiker’s Guide to the Galaxy and includes reminders that Grok 2 does not have access to internal x/Twitter data and systems. The stated goal is framed as maximizing truthfulness, which affects how the model responds under adversarial prompts and how users interpret its reliability.

How does the transcript connect model progress to a trust crisis online?

As image and video generation improves, fake media can spread faster than verification. The transcript argues that ubiquitous fake images are likely to accelerate, with Twitter as a distribution channel, and it extends the concern to video calls—suggesting real-time photorealism could arrive within months to a couple of years. In that world, “common sense” may not be enough because people won’t be able to reliably tell whether what they see is real.

What technical fixes are proposed to prevent the internet from “devolving into madness”?

The transcript points to zero-knowledge proofs as a way to issue “personhood credentials” without relying on easily spoofed identifiers like fingerprints. The idea is to provide a cryptographic basis for identity or credentials while preserving privacy and reducing forgery risk. It also discusses provenance tracing for synthetic outputs, noting that tracing creative outputs back to training sources may be extremely difficult.
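The zero-knowledge idea can be illustrated with a classic Schnorr-style proof of knowledge: a holder demonstrates they know the secret behind a public value without revealing it. This is a minimal toy sketch under assumed parameters, not the credential scheme the transcript discusses; the group is deliberately tiny and completely insecure.

```python
# Toy non-interactive Schnorr proof (Fiat–Shamir transform): the prover
# shows knowledge of a secret x behind the public value y = g^x mod p
# without revealing x. All parameters are hypothetical toy values for
# illustration only — a real personhood-credential system would use
# vetted elliptic-curve groups and audited cryptographic libraries.
import hashlib
import secrets

p = 1019  # toy safe prime: p = 2q + 1
q = 509   # prime order of the subgroup generated by g
g = 4     # generator of the order-q subgroup mod p

def challenge(y: int, t: int) -> int:
    # Fiat–Shamir: derive the challenge by hashing the transcript so far.
    return int.from_bytes(hashlib.sha256(f"{y}:{t}".encode()).digest(), "big")

def prove(x: int):
    """Produce (public key y, commitment t, response s) for secret x."""
    y = pow(g, x, p)
    r = secrets.randbelow(q)                 # one-time nonce, never reused
    t = pow(g, r, p)                         # commitment
    s = (r + challenge(y, t) * x) % q        # response; alone, reveals nothing about x
    return y, t, s

def verify(y: int, t: int, s: int) -> bool:
    # Accept iff g^s == t * y^c (mod p), which holds exactly when s was
    # computed from the secret behind y.
    return pow(g, s, p) == (t * pow(y, challenge(y, t), p)) % p

secret = secrets.randbelow(q)
y, t, s = prove(secret)
print(verify(y, t, s))       # True — the proof checks out
print(verify(y, t, s + 1))   # False — a tampered response is rejected
```

The verifier learns only that the prover knows *some* valid secret, which is the property a privacy-preserving credential needs: proof of possession without disclosure of the identifier itself.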

What does the scaling discussion claim about the path to major capability jumps?

An Epoch AI paper is cited with a headline projection: by 2030, scaling could reach roughly 10,000× the compute scale of GPT-4, constrained by data scarcity, chip production capacity, and power. But the transcript warns that scale alone may not guarantee step changes. It emphasizes experiments suggesting models can learn latent causal structure—internal “world models”—so richer internal simulations could drive qualitative improvements beyond surface correlation.
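As a quick sanity check on the headline figure, 10,000× total growth over a roughly six-year window (2024 to 2030 is our assumption for illustration, not a number stated in the source) implies a steep but finite annual growth rate:

```python
# Back-of-envelope: what annual growth rate does "10,000x by 2030" imply?
# The six-year window is an illustrative assumption, not from the source.
scale_factor = 10_000
years = 6
annual_growth = scale_factor ** (1 / years)  # geometric mean growth per year
print(f"~{annual_growth:.1f}x compute growth per year")  # ~4.6x per year
```

A sustained ~4.6× per year is aggressive but in the neighborhood of historical frontier-training compute trends, which is why the transcript treats the projection as plausible rather than fanciful.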

What experimental results are used to argue that language models can build internal causal models?

The transcript describes puzzle-based experiments where models were trained on large numbers of random puzzles (over a million) and then evaluated on predicting what program caused outputs. It claims the model spontaneously developed conceptions of the underlying simulation, and that by the end of training it generated correct instructions at a 92.4% rate. It also references related findings like entity/state tracking across stories and latent concept learning from prior research.
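The evaluation task described above, recovering the hidden program from its observed outputs, can be sketched with a toy analogue. The candidate programs and helper function here are hypothetical stand-ins, not the experiment's actual puzzles or training setup:

```python
# Toy analogue of the puzzle experiments (not the original setup): given
# observed input -> output pairs, identify which hidden program produced
# them. The candidate programs are hypothetical stand-ins for the
# "random puzzles" described in the transcript.
candidates = {
    "double": lambda n: n * 2,
    "square": lambda n: n * n,
    "add_three": lambda n: n + 3,
}

def infer_program(observations):
    """Return names of candidate programs consistent with every observation."""
    return [
        name for name, prog in candidates.items()
        if all(prog(x) == y for x, y in observations)
    ]

# Outputs produced by the hidden program (here, squaring). Note the first
# pair alone is ambiguous (doubling also maps 2 -> 4); more observations
# narrow the hypothesis space, much as more puzzles constrain the model.
observed = [(2, 4), (3, 9), (5, 25)]
print(infer_program(observed))  # ['square']
```

The claim in the transcript is stronger than this brute-force search, of course: the model is said to infer the generating program implicitly, as a learned internal representation rather than by explicit enumeration.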

Review Questions

  1. How do benchmark placements and informal reasoning tests differ in what they reveal about Grok 2’s capabilities?
  2. What role does the system prompt (as inferred from a jailbreak) play in shaping Grok 2’s reliability and access limitations?
  3. Which argument is stronger in the transcript for future breakthroughs: raw scaling to 10,000× GPT-4, or evidence that models learn latent causal “world models”?

Key Points

  1. Grok 2 is testable via a Twitter chatbot, but the lack of a paper or model card shifts evaluation toward benchmark results and hands-on reasoning checks.

  2. On major benchmarks, Grok 2 repeatedly ranks near the top, often placing second to Claude 3.5 Sonnet, with strong math performance in at least one category.

  3. A jailbreak is said to have revealed Grok 2’s system prompt, including constraints about no access to internal x/Twitter data and an emphasis on truthfulness.

  4. Rapid improvements in synthetic media raise trust problems, with the transcript forecasting that realistic video could arrive within months to a couple of years.

  5. Zero-knowledge proofs are proposed as a route to cryptographic “personhood credentials,” aiming to preserve identity trust without relying on easily forged signals.

  6. Scaling forecasts cite roughly 10,000× GPT-4 scale by 2030, but the transcript argues that qualitative leaps likely depend on learning causal internal world models, not just more compute.

  7. Experiments described in the transcript suggest language models can infer latent causal structure from puzzles, supporting the idea of internal simulation-like reasoning.

Highlights

Grok 2’s benchmark profile is strong—often second to Claude 3.5 Sonnet—yet informal reasoning tests still show it can miss questions Claude 3.5 Sonnet gets right.
The inferred system prompt reportedly borrows from The Hitchhiker’s Guide to the Galaxy and explicitly frames truthfulness while denying access to internal x/Twitter data.
The trust crisis is framed as a near-term consequence of realistic image/video generation, potentially undermining even video calls.
Zero-knowledge proofs are presented as a practical countermeasure for identity and credentialing in a world of deepfakes.
The transcript’s central capability question is whether scaling yields richer internal world models that support causal reasoning, not just better text prediction.

Topics

  • Grok 2 Benchmarks
  • System Prompt
  • Deepfakes Trust
  • Zero-Knowledge Proofs
  • Scaling to 2030
