Two AI Models Set to “stir government urgency”, But Will This Challenge Undo Them?
Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A pair of near-term model releases is forcing a hard tradeoff: scarce compute and high-stakes government relationships are shaping what gets shipped next, and what gets shelved. OpenAI has reportedly shut down its Sora app to free computing resources for the upcoming Spud model, while Anthropic is drawing renewed Pentagon attention as it tries to revive a Claude deal the US government had recently set to expire. The common thread is urgency: both companies are being pushed to deliver the next qualitative jump fast enough to matter, even as regulators and defense agencies tighten timelines and expectations.
That pressure is landing alongside a benchmark designed to be brutally revealing. ARC AGI 3, a new installment in the ARC benchmark line, reports a striking gap between human performance and today’s frontier models: humans reach 100% while leading AI systems score under half a percent. The benchmark’s creators argue the gap matters because the test measures what still separates frontier AI from human-level AGI without relying on language, memorized knowledge, or cultural cues. ARC AGI 3 uses abstract, grid-based tasks where the goal often must be inferred rather than explicitly stated, and it tests multiple abilities at once: exploration, planning, memory, and goal setting.
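To make that task format more concrete, here is a toy Python sketch of a hidden-goal grid task in the spirit of that description. It is not an actual ARC AGI 3 task or API; the class name, observation format, and random-explorer baseline are illustrative assumptions only.

```python
import random


class HiddenGoalGrid:
    """Toy stand-in for an ARC AGI 3-style task (not an actual benchmark task).

    The observation never names the goal cell, so an agent must discover
    through exploration what counts as "winning" on this grid.
    """

    def __init__(self, size: int = 5, seed: int = 0):
        rng = random.Random(seed)
        self.size = size
        self.pos = (0, 0)
        self.goal = (rng.randrange(size), rng.randrange(size))  # hidden from the agent
        self.actions_taken = 0

    def observe(self) -> dict:
        # Minimal, language-free observation: grid size and agent position only.
        return {"size": self.size, "pos": self.pos}

    def step(self, action: str) -> bool:
        dx, dy = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}[action]
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        self.actions_taken += 1
        return self.pos == self.goal  # True once the hidden goal is reached


def random_explorer(env: HiddenGoalGrid, max_actions: int = 200) -> bool:
    """Blind exploration baseline; planning and memory would cut the action count."""
    while env.actions_taken < max_actions:
        if env.step(random.choice(["up", "down", "left", "right"])):
            return True
    return False


if __name__ == "__main__":
    env = HiddenGoalGrid()
    print("solved:", random_explorer(env), "in", env.actions_taken, "actions")
```

Even this toy version shows why exploration, memory, and goal inference matter together: an agent that remembers where it has been and hypothesizes what the task rewards will finish in far fewer actions than blind wandering.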
ARC AGI 3 also tries to prevent “benchmark gaming,” a problem that helped saturate ARC AGI 1 and 2. Those earlier benchmarks improved rapidly once models gained publicly demonstrated chain-of-thought reasoning, but the paper warns that similarity between public demonstrations and private tests allowed training pipelines to effectively attack the evaluation distribution. For ARC AGI 3, the public and private sets are designed to be further out-of-distribution from each other, so models can’t rely on shortcuts that mimic the test’s hidden structure.
Scoring rules further shape what “progress” would mean. Performance is capped at 100% based on a human-derived baseline, so even perfect efficiency can’t exceed that ceiling. Turn-based mechanics remove speed and reflex advantages from the equation, and action efficiency is central: attempts that use more than five times the number of actions a human needed are scrapped, and inefficiency below that cutoff is penalized quadratically. Even so, the benchmark remains adversarial: human baselines are derived from the second-best human run, and the best first-run human performance can imply substantially different scores under the same rubric.
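To see how those incentives could play out numerically, here is a minimal sketch of a scoring function consistent with the rules described above. The exact ARC AGI 3 formula is not given in the source, so the hard 5x cutoff, the quadratic penalty that reaches zero exactly at that cutoff, and the 100% ceiling are all assumptions of this sketch.

```python
def score_attempt(agent_actions: int, human_baseline_actions: int) -> float:
    """Hypothetical per-task score; not the official ARC AGI 3 formula.

    Assumptions:
    - Attempts using more than 5x the human baseline actions score 0.
    - Matching the human baseline (or better) earns the full 100%.
    - Extra actions are penalized quadratically, hitting 0 at the 5x cutoff.
    - The score is capped at 100%, so beating the baseline earns no bonus.
    """
    if agent_actions > 5 * human_baseline_actions:
        return 0.0  # scrapped: far too inefficient
    if agent_actions <= human_baseline_actions:
        return 100.0  # capped at the human-derived ceiling
    # Quadratic penalty on the action excess beyond the human baseline.
    excess_ratio = (agent_actions - human_baseline_actions) / (4 * human_baseline_actions)
    return 100.0 * (1.0 - excess_ratio ** 2)


if __name__ == "__main__":
    # A run taking twice as many actions as a human baseline of 20.
    print(score_attempt(agent_actions=40, human_baseline_actions=20))  # 93.75 under these assumptions
```

Under these assumptions, a run that takes twice the human baseline still scores about 94%, while anything past the 5x cutoff scores nothing, which is why efficiency, not just task completion, drives the reported numbers.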
The benchmark’s low scores are not just a verdict on raw capability; they also reflect constraints on how models can be deployed. Systems built around ARC AGI 3-specific harnesses are disallowed, and the evaluation emphasizes general-purpose API play with minimal context: no reminders about winning and no nudges toward minimizing actions. That helps explain why even strong models like Gemini 3.1 reportedly land around 0.37%.
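As a rough illustration of what that constraint rules in and out, the snippet below contrasts a minimal-context prompt with the kind of ARC-specific scaffolding the rules exclude. The observation format and prompt text are hypothetical, not taken from the benchmark.

```python
import json

# Hypothetical observation from a task environment; the real format is not
# described in the source.
observation = {"grid_size": 5, "agent_pos": [0, 0]}

# Allowed (per the description above): the bare observation and nothing else,
# sent through a general-purpose model API.
minimal_context_prompt = json.dumps(observation)

# Disallowed: an ARC AGI 3-specific harness that bakes strategy into the
# prompt, such as reminders about winning or minimizing actions.
harness_prompt = (
    json.dumps(observation)
    + "\nYou are solving a grid puzzle with a hidden goal cell."
    + "\nReach the goal in as few actions as possible and avoid revisiting cells."
)

print(minimal_context_prompt)
```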
Beyond ARC AGI 3, the broader AI trajectory looks uneven. OpenAI is reportedly pushing toward a fully automated AI researcher that can tackle complex problems with humans acting as reviewers, but historical productivity gains from AI drafting and human editing have been closer to incremental improvements than immediate takeoff. Meanwhile, agentic systems bring new security risks: a “vibe-coded” hack reportedly co-opted an open-source Python library in a way that could leak secrets and keys if updates weren’t handled carefully. The result is a “messy middle” moment: models are getting better at drafting and at generalizing across lower-level tasks, but remain weaker on higher-level objectives like academic integrity and adaptive goal setting, which makes the next year a high-variance test of both capability and control.
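The library-hijacking incident is a reminder that agent-driven coding pipelines should not pull dependency updates blindly. As one hedged illustration of a defensive habit, rather than a description of the actual incident or its fix, the sketch below checks installed package versions against an explicit pin list before anything runs; the package names and versions are placeholders.

```python
from importlib.metadata import PackageNotFoundError, version

# Placeholder pins: in practice these would come from a reviewed lockfile.
PINNED = {
    "requests": "2.32.3",
    "numpy": "2.1.0",
}


def verify_pins(pins: dict[str, str]) -> list[str]:
    """Return mismatches between installed packages and the pin list."""
    problems = []
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        if installed != expected:
            problems.append(f"{name}: installed {installed}, pinned {expected}")
    return problems


if __name__ == "__main__":
    mismatches = verify_pins(PINNED)
    if mismatches:
        raise SystemExit("dependency drift detected:\n" + "\n".join(mismatches))
    print("all pinned dependencies match")
```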
Cornell Notes
ARC AGI 3 is positioned as a high-signal benchmark for whether frontier AI is closing the gap to human-level AGI, and early results are stark: humans score 100% while top models are below 0.5%. The benchmark is abstract and language-free, forcing models to infer goals and learn rules across levels, while scoring emphasizes action efficiency and caps performance at a 100% baseline. ARC AGI 3 also aims to reduce “benchmark gaming” by using private test sets that are more out-of-distribution than public demonstrations, addressing weaknesses seen in ARC AGI 1 and 2. The evaluation disallows specially built harnesses, so results reflect general-purpose API behavior rather than bespoke systems. Together, the findings suggest today’s models still lack key higher-level planning and adaptive goal-setting abilities, even as they improve on lower-level generalization.
- Why does ARC AGI 3 claim to measure something closer to AGI than earlier benchmarks?
- What went wrong with ARC AGI 1 and 2 that ARC AGI 3 tries to fix?
- How do ARC AGI 3’s scoring rules shape what “progress” would look like?
- What does the benchmark disallow, and why does that matter for interpreting results?
- What does the reported performance gap imply about current frontier models?
- How do the other AI stories reinforce the “messy middle” theme?
Review Questions
- What specific design choices in ARC AGI 3 reduce the likelihood that models can exploit similarity between public and private test sets?
- How do the action cap (5x) and quadratic inefficiency penalty change the incentives for model behavior during ARC AGI 3?
- Why does disallowing ARC AGI 3-specific harnesses make the benchmark results more comparable across general-purpose models?
Key Points
1. OpenAI has reportedly paused Sora to conserve compute for the upcoming Spud model, highlighting how resource constraints shape product timelines.
2. Anthropic is seeking renewed Pentagon engagement around Claude, with government urgency tied to both offensive and defensive cyber capabilities.
3. ARC AGI 3 reports a dramatic human-versus-model performance gap (humans at 100%, frontier models under 0.5%), suggesting missing higher-level abilities.
4. ARC AGI 3 is designed to be harder to game by making private test sets more out-of-distribution than public demonstrations, addressing saturation seen in ARC AGI 1 and 2.
5. Scoring emphasizes action efficiency with a 100% cap and quadratic penalties for inefficiency, while turn-based rules remove speed advantages.
6. The benchmark disallows specially engineered harnesses, so results reflect general-purpose API behavior rather than bespoke ARC-specific systems.
7. Broader AI progress appears uneven: automated research ambitions may yield incremental productivity gains, while agentic security risks and benchmark-hacking threats remain active.