
Two AI Models Set to “stir government urgency”, But Will This Challenge Undo Them?

AI Explained · 5 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

OpenAI has reportedly paused Sora to conserve compute for the upcoming Spud model, highlighting how resource constraints shape product timelines.

Briefing

A pair of near-term model releases is forcing a hard tradeoff: scarce compute and high-stakes government relationships are shaping what gets shipped next and what gets shelved. OpenAI has reportedly shut down its Sora app to free computing resources for the upcoming Spud model, while Anthropic is drawing renewed Pentagon attention as it works to revive a Claude deal the US government recently let expire. The common thread is urgency: both companies are being pushed to deliver the next qualitative jump fast enough to matter, even as regulators and defense agencies tighten timelines and expectations.

That pressure is landing alongside a benchmark that’s designed to be brutally revealing. ARC AGI 3, a new installment in the ARC benchmark line, reports a striking gap between human performance and today’s frontier models: humans reach 100% while leading AI systems score under half a percent. The benchmark’s creators argue this gap matters because it measures residual deficiencies between frontier AI and human-level AGI—without relying on language, memorized knowledge, or cultural cues. ARC AGI 3 uses abstract, grid-based tasks where the goal often must be inferred rather than explicitly stated, and it tests multiple abilities at once: exploration, planning, memory, and goal setting.

ARC AGI 3 also tries to prevent “benchmark gaming,” a problem that contributed to the saturation of ARC AGI 1 and 2. Those earlier benchmarks were solved rapidly once models gained publicly demonstrated chain-of-thought reasoning, but the paper warns that similarity between public demonstrations and private tests let training pipelines effectively attack the evaluation distribution. For ARC AGI 3, the public and private sets are designed to be more out-of-distribution relative to each other, so models can’t rely on shortcuts that mimic the test’s hidden structure.

Scoring rules further shape what “progress” would mean. Performance is capped at 100% against a human-derived baseline, so even perfect efficiency can’t exceed the ceiling. Turn-based mechanics remove speed and reflex advantages, and action efficiency is central: attempts that use more than five times the number of actions taken by humans are scrapped, while inefficiency is penalized quadratically. Even so, the baseline itself involves judgment: it is derived from the second-best human run, and anchoring instead to the best first-run human performance would imply substantially different scores under the same rubric.

The benchmark’s low scores are not just a verdict on raw capability; they also reflect constraints on how models can be deployed. Systems built with special ARC AGI 3-specific harnesses are disallowed, and the evaluation emphasizes general-purpose API play with minimal context—no reminders about winning or action minimization. That helps explain why even strong models like Gemini 3.1 reportedly land around 0.37%.

Beyond ARC AGI 3, the broader AI trajectory looks uneven. OpenAI is reportedly pushing toward a fully automated AI researcher that can tackle complex problems with humans acting as reviewers, but historical productivity gains from AI drafting and human editing have been closer to incremental improvements than immediate takeoff. Meanwhile, agentic systems bring new security risks: a “vibe-coded” hack reportedly co-opted an open-source Python library in a way that could leak secrets and keys if updates weren’t handled correctly. The result is a “messy middle” moment—better at drafting and generalization across lower-level tasks, weaker on higher-level objectives like academic integrity and adaptive goal setting—making the next year a high-variance test of both capability and control.

Cornell Notes

ARC AGI 3 is positioned as a high-signal benchmark for whether frontier AI is closing the gap to human-level AGI, and early results are stark: humans score 100% while top models are below 0.5%. The benchmark is abstract and language-free, forcing models to infer goals and learn rules across levels, while scoring emphasizes action efficiency and caps performance at a 100% baseline. ARC AGI 3 also aims to reduce “benchmark gaming” by using private test sets that are more out-of-distribution than public demonstrations, addressing weaknesses seen in ARC AGI 1 and 2. The evaluation disallows specially built harnesses, so results reflect general-purpose API behavior rather than bespoke systems. Together, the findings suggest today’s models still lack key higher-level planning and adaptive goal-setting abilities, even as they improve on lower-level generalization.

Why does ARC AGI 3 claim to measure something closer to AGI than earlier benchmarks?

ARC AGI 3 is built around abstract grid puzzles that avoid language, memorized knowledge, and cultural cues. Goals are often not explicitly stated, so success requires inferring objectives and internalizing rules from earlier levels. The benchmark tests multiple cognitive components at once—exploration, planning, memory, and goal setting—rather than a single narrow skill.

What went wrong with ARC AGI 1 and 2 that ARC AGI 3 tries to fix?

ARC AGI 1 and 2 became saturated partly because chain-of-thought reasoning helped models flexibly combine patterns. But the paper also warns that public and private test sets were too similar, letting training pipelines effectively “attack” the evaluation distribution through thousands of guesses about the private set. ARC AGI 3 addresses this by making private tasks more out-of-distribution relative to public demonstrations.

How do ARC AGI 3’s scoring rules shape what “progress” would look like?

Scores are capped at 100% using a human-derived baseline, so even perfect performance can’t exceed the ceiling. The benchmark is turn-based, so faster reflexes don’t automatically help. Action efficiency is central: attempts using more than five times the human action count are scrapped, and inefficiency is quadratically penalized—meaning extra steps hurt disproportionately.
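The summary does not give ARC AGI 3’s exact scoring formula, but the described rules can be illustrated with a hypothetical sketch: attempts beyond 5x the human action count are scrapped, excess actions are penalized quadratically, and the score is capped at 100% at the human baseline. The normalization of the penalty to the 5x boundary below is an assumption for illustration, not the benchmark’s published formula.

```python
# Hypothetical sketch of ARC AGI 3-style action-efficiency scoring.
# Assumptions (not from the source): the quadratic penalty is
# normalized so that a run exactly at the 5x cap scores zero.

def efficiency_score(agent_actions: int, human_actions: int) -> float:
    """Return a score in [0, 100] for a single solved task."""
    if agent_actions > 5 * human_actions:
        return 0.0  # attempts beyond 5x the human action count are scrapped
    # ratio >= 1.0: beating the human baseline cannot exceed the 100% ceiling
    ratio = max(agent_actions / human_actions, 1.0)
    # quadratic penalty: extra actions hurt disproportionately,
    # scaled so the 5x boundary (ratio = 5) maps to a score of zero
    penalty = ((ratio - 1.0) / 4.0) ** 2
    return 100.0 * (1.0 - penalty)

print(efficiency_score(10, 10))  # matches human baseline -> 100.0
print(efficiency_score(20, 10))  # 2x actions, quadratic penalty -> 93.75
print(efficiency_score(60, 10))  # exceeds the 5x cap, scrapped -> 0.0
```

Under this sketch, a model using twice the human action count loses only a modest fraction of the score, while one near the 5x cap loses almost all of it, matching the “extra steps hurt disproportionately” framing.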

What does the benchmark disallow, and why does that matter for interpreting results?

The benchmark disallows specially prepared harnesses tailored to ARC AGI 3. The paper notes that some approaches use one model to summarize state for another to prevent context overload, but such harnesses aren’t allowed. That restriction keeps the evaluation focused on general-purpose API systems rather than bespoke engineering that could inflate scores.

What does the reported performance gap imply about current frontier models?

With humans at 100% and leading models far below 0.5% (e.g., Gemini 3.1 around 0.37% is cited), the gap suggests missing capabilities in higher-level adaptive goal setting and efficient planning under constraints. The transcript contrasts this with evidence of better generalization on lower-level topics across languages and coding contexts, implying uneven progress rather than uniform AGI-like competence.

How do the other AI stories reinforce the “messy middle” theme?

OpenAI’s push toward automated AI research aims to shift more work to AI drafting with humans reviewing, but historical productivity gains have been closer to ~40% rather than immediate exponential takeoff. At the same time, agentic systems increase security risk: a co-opted open-source library reportedly could leak secrets if updates aren’t handled correctly. Together, capability gains and control failures are advancing in parallel.

Review Questions

  1. What specific design choices in ARC AGI 3 reduce the likelihood that models can exploit similarity between public and private test sets?
  2. How do the action cap (5x) and quadratic inefficiency penalty change the incentives for model behavior during ARC AGI 3?
  3. Why does disallowing ARC AGI 3-specific harnesses make the benchmark results more comparable across general-purpose models?

Key Points

  1. OpenAI has reportedly paused Sora to conserve compute for the upcoming Spud model, highlighting how resource constraints shape product timelines.
  2. Anthropic is seeking renewed Pentagon engagement around Claude, with government urgency tied to both offensive and defensive cyber capabilities.
  3. ARC AGI 3 reports a dramatic human-versus-model performance gap (humans at 100%, frontier models under 0.5%), suggesting missing higher-level abilities.
  4. ARC AGI 3 is designed to be harder to game by making private test sets more out-of-distribution than public demonstrations, addressing saturation seen in ARC AGI 1 and 2.
  5. Scoring emphasizes action efficiency with a 100% cap and quadratic penalties for inefficiency, while turn-based rules remove speed advantages.
  6. The benchmark disallows specially engineered harnesses, so results reflect general-purpose API behavior rather than bespoke ARC-specific systems.
  7. Broader AI progress appears uneven: automated research ambitions may yield incremental productivity gains, while agentic security risks and benchmark-hacking threats remain active.

Highlights

ARC AGI 3’s headline result is a near-total separation: humans hit 100% while top models land below half a percent, despite the benchmark being abstract and language-free.
The paper’s central anti-gaming claim is that ARC AGI 3’s private tasks are more out-of-distribution than public demonstrations, unlike earlier ARC installments.
ARC AGI 3’s scoring is engineered to punish inefficiency and cap performance at a human-derived baseline, making “perfect” scores non-informative about AGI beyond the ceiling.
Automated AI research is framed as a drafting-and-review workflow, not an immediate path to exponential takeoff—while agentic systems simultaneously raise new security failure modes.

Topics

  • ARC AGI 3 Benchmark
  • Compute Allocation
  • Pentagon Claude Deal
  • Automated AI Research
  • Agentic Security Risks
