
AI Won't Be AGI, Until It Can At Least Do This (plus 6 key ways LLMs are being upgraded)

AI Explained · 6 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

ARC-style abstract reasoning failures are framed as evidence that current LLMs don’t reliably generalize to unseen task structures.

Briefing

Current AI systems fall short of AGI largely because they struggle with genuinely novel abstract reasoning: when a task pattern hasn’t appeared in training, models often can’t generalize, even if they can sometimes recall familiar “reasoning procedures” from prior examples. That gap matters because it reframes today’s breakthroughs as impressive pattern-matching and partial program retrieval—not reliable, on-the-fly problem solving across unfamiliar situations.

A concrete example centers on an abstract reasoning challenge in the style of ARC (the Abstraction and Reasoning Corpus). The model behavior described is stark: it can’t reliably notice and interpret a grid transformation when the specific configuration wasn’t in its training set. The failure isn’t treated as a minor glitch; it’s presented as evidence that current large language models (LLMs) aren’t “generally intelligent” in the human sense. Instead, they often succeed by retrieving reasoning chains or programs they have effectively seen before. When the needed procedure is missing, the system may produce plausible-sounding but wrong outputs, an issue that shows up not only on reasoning benchmarks but also in real-world factual claims.

From there, the discussion widens into a broader critique of the AI landscape: overpromising and underdelivering, persistent hallucinations, and marketing that outpaces measurable capability. Examples include rolled-back features after error-prone performance, and admissions that even consumer-facing systems can hallucinate. Privacy risks also enter the picture, with concerns about systems that continuously analyze user data such as desktop screenshots. Another major worry is “AI slop”: tools that imitate writing styles can flood platforms with low-quality, high-volume content, making it harder to trust what people read, hear, or share—especially as bots and deepfakes scale.

Yet the narrative refuses a binary “AI is hype” versus “AGI is imminent” framing. The core counterpoint is that the reasoning limitations aren’t necessarily a dead end. Multiple research directions aim to make LLMs more capable at the kind of generalization ARC demands, without relying on naive scaling alone.

Six pathways are highlighted:

  1. Compositional generalization: training models to combine learned reasoning blocks into new structures, with evidence from small transformer models that can mimic aspects of systematic generalization (a toy sketch follows this list).
  2. Program discovery with verifiers and search: training reward models to flag faulty reasoning steps, or using external simulators to evaluate many candidate solutions.
  3. Active inference via test-time fine-tuning: the model adapts during evaluation, using a small set of examples plus synthetic data to learn the task’s structure on the fly.
  4. Hybrid planning: pairing LLMs with symbolic systems, so the language model generates candidate plans while a symbolic checker validates and refines them.
  5. Algorithmic grounding: specialized neural components (e.g., graph neural networks) learn algorithms that are then made usable by language models through embeddings.
  6. “Tacit data” transfer: capturing the unpublished, conversational, and failure-driven knowledge that humans accumulate, especially in mathematics.
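
To make the first idea concrete, here is a toy Python sketch (entirely illustrative, not from the video): string-transform "primitives" stand in for separately learned reasoning blocks, and composing them in an order never seen before stands in for systematic generalization.

```python
from functools import reduce

# Toy illustration of compositional generalization (invented example):
# primitives "learned" separately, then composed into a pipeline that
# was never seen as a whole during training.
primitives = {
    "reverse": lambda s: s[::-1],
    "upper":   lambda s: s.upper(),
    "double":  lambda s: s + s,
}

def compose(names):
    # Chain the named primitives left to right into one new program.
    return lambda x: reduce(lambda acc, n: primitives[n](acc), names, x)

novel_program = compose(["double", "reverse", "upper"])  # unseen combo
print(novel_program("ab"))  # -> "BABA"
```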

The takeaway is pragmatic: AGI isn’t shown as imminent, but progress toward more general reasoning is portrayed as achievable through combinations of training strategies, external verification, adaptive inference, and tighter coupling between language models and tools that can check or execute plans. The most realistic near-term path, the discussion suggests, is not a single leap but a convergence of methods that reduce the reliance on memorized procedures and increase the ability to synthesize new ones when the situation is unfamiliar.

Cornell Notes

The central claim is that today’s LLMs aren’t AGI because they often fail on genuinely novel abstract reasoning tasks—especially when the exact pattern or procedure wasn’t present in training. Their partial strength comes from recalling reasoning chains they’ve effectively seen before, not from reliably synthesizing new solutions on the fly. The discussion then argues against “all hype” and “all doom” by pointing to six research directions that could close the gap: compositional generalization, verifier/search-based program selection, test-time fine-tuning (active inference), hybrid neural-symbolic planning, algorithm embeddings from specialized networks, and learning from tacit human knowledge. The practical implication is that progress likely comes from combining methods that add adaptation and external checking rather than simply scaling parameters and data.

Why does an ARC-style abstract reasoning failure count as evidence against AGI?

The failure described is not just a wrong answer; it’s a lack of generalization to a new grid transformation pattern. The argument is that if a model hasn’t seen the specific task structure during training, it can’t reliably infer the rule from scratch. That contrasts with human-like reasoning, where a new instance can still be solved by understanding the underlying rule. The key distinction made is between recalling a known reasoning procedure versus synthesizing a new one when the procedure is missing.
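
For readers unfamiliar with the benchmark, here is a minimal Python sketch of what an ARC-style task looks like; the grids and the hidden rule are invented for illustration.

```python
import numpy as np

# A toy ARC-style task (invented for illustration). Each training pair
# shares a hidden rule -- here, "transpose the grid" -- and the solver
# sees only the pairs, never the rule itself.
train_pairs = [
    (np.array([[1, 2], [3, 4]]), np.array([[1, 3], [2, 4]])),
    (np.array([[5, 0], [6, 0]]), np.array([[5, 6], [0, 0]])),
]
test_input = np.array([[7, 8], [9, 0]])

# The claimed gap: a human can hypothesize a rule and verify it against
# every pair, while an LLM often fails unless a similar transformation
# appeared in its training data.
def candidate_rule(grid: np.ndarray) -> np.ndarray:
    return grid.T

assert all(np.array_equal(candidate_rule(x), y) for x, y in train_pairs)
print(candidate_rule(test_input))  # -> [[7 9], [8 0]]
```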

What’s the difference between “recalling reasoning procedures” and “doing fresh reasoning,” and why does it matter?

The transcript frames LLM reasoning as often being a retrieval problem: the model can reproduce reasoning chains/programs it has encountered in training, which can work on some benchmarks. But when the needed procedure hasn’t appeared, the model can’t create it from first principles. This helps explain why models can look strong on familiar benchmark formats yet collapse on novel configurations like unseen ARC tasks.

How do verifiers and Monte Carlo tree search reduce reasoning errors?

One approach trains a process reward model to detect faulty steps in a reasoning chain, effectively acting like a supervisor that flags bad programs. Another uses external simulation as a verifier: generate many candidate solutions, run them through a simulator or environment, and keep the ones that produce correct outcomes. The transcript notes a plateau in performance for some verifier-only methods, but also emphasizes that external verifiers can make iterative search more efficient because the model can sample multiple candidates per iteration.
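
A minimal sample-and-verify sketch in Python: `propose_candidates` is a random stub standing in for an LLM sampling reasoning chains, and the verifier is a direct check against known input/output examples, so the names and domain are illustrative rather than taken from the video.

```python
import random
from typing import Callable, List, Tuple

# Sample-and-verify sketch (illustrative domain). A generator proposes
# many candidate "programs"; an external verifier executes each one
# against known examples and keeps only candidates that pass every check.
Examples = List[Tuple[int, int]]

CANDIDATE_POOL: List[Callable[[int], int]] = [
    lambda x: x + 2,
    lambda x: x * 2,
    lambda x: x ** 2,
    lambda x: x - 2,
]

def propose_candidates(n: int) -> List[Callable[[int], int]]:
    # Stub standing in for sampling n reasoning chains from a model.
    return [random.choice(CANDIDATE_POOL) for _ in range(n)]

def verify(program: Callable[[int], int], examples: Examples) -> bool:
    # External check: execute the candidate and compare to ground truth,
    # the role an external simulator plays in the text (a process reward
    # model would instead score the intermediate steps).
    return all(program(x) == y for x, y in examples)

examples: Examples = [(1, 2), (3, 6), (5, 10)]   # hidden rule: double
survivors = [p for p in propose_candidates(50) if verify(p, examples)]
print(f"{len(survivors)} of 50 sampled candidates passed verification")
```

The design point is that verification is external and exact: a candidate survives only if it reproduces every known outcome, regardless of how fluent its derivation sounded.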

What is “active inference” in this context, and why is test-time fine-tuning emphasized?

Active inference is described as adapting the model during evaluation rather than running static inference with a frozen model. The method uses test-time fine-tuning on the current problem, supplemented by many synthetic examples that mimic the task’s style. The transcript claims that without this adaptation, performance gains are negligible (around 1–2%), but with test-time fine-tuning and additional tricks, results become meaningfully higher, as reflected in improved scores on the ARC-AGI prize.
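
A sketch of the synthetic-data half of this idea, assuming (as some ARC test-time-training write-ups do) that symmetry transforms applied consistently to input and output grids yield valid extra demonstrations; the fine-tuning step itself is elided.

```python
import numpy as np

# Data-augmentation half of test-time fine-tuning (a sketch). The same
# symmetry transform is applied to both input and output, so each new
# pair still instantiates a coherent task; gradient updates are elided.
def augment_pair(x: np.ndarray, y: np.ndarray):
    for k in range(4):                      # four rotations
        rx, ry = np.rot90(x, k), np.rot90(y, k)
        yield rx, ry
        yield np.fliplr(rx), np.fliplr(ry)  # plus mirrored versions

demo_input = np.array([[1, 2], [0, 3]])
demo_output = demo_input.T                  # toy rule: transpose
synthetic = list(augment_pair(demo_input, demo_output))
print(f"1 demonstration expanded into {len(synthetic)} training pairs")
# A model would be briefly fine-tuned on `synthetic` before answering
# the held-out test input of this specific task.
```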

Why might hybrid neural-symbolic planning outperform LLM-only planning?

The transcript argues that LLMs can generate candidate plans or “ideas,” but symbolic systems can check those plans for coherence and correctness. In a Blocks World-like setting, the symbolic component validates and refines the plan across multiple rounds. The reported outcome is that GPT-4 can reach high accuracy (82% in the described setup) when paired with symbolic checking, while still struggling with obfuscated (“mystery”) variants in which the task is described in unfamiliar language.
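
A minimal sketch of the propose-and-verify loop in a tiny Blocks World, where `llm_propose_plan` is a hard-coded stub standing in for a language model, but the symbolic checker genuinely simulates each move and feeds errors back for another round.

```python
# Propose-and-verify loop in a tiny Blocks World (illustrative sketch).

def apply_plan(stacks, plan):
    """Simulate a plan; each step moves the top block of one stack onto
    another. Returns (final_state, None) or (None, error_message)."""
    stacks = [list(s) for s in stacks]
    for src, dst in plan:
        if not stacks[src]:
            return None, f"illegal move: stack {src} is empty"
        stacks[dst].append(stacks[src].pop())
    return stacks, None

def llm_propose_plan(feedback):
    # Stub: a real system would prompt an LLM with the goal plus the
    # checker's feedback. Here the "model" corrects itself once.
    return [(0, 1)] if feedback is None else [(0, 2), (0, 1)]

stacks = [["A", "B"], [], []]                     # B sits on A
goal = lambda s: bool(s[1]) and s[1][-1] == "A"   # want A atop stack 1

feedback = None
for _ in range(3):                   # bounded refinement rounds
    plan = llm_propose_plan(feedback)
    state, err = apply_plan(stacks, plan)
    if err is None and goal(state):
        print("verified plan:", plan)
        break
    feedback = err or "plan executed but goal not satisfied"
```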

What does “tacit data” mean, and how could it change AI progress expectations?

Tacit data refers to knowledge that isn’t fully captured in published papers—intuition, failure modes, and problem-solving heuristics transmitted through conversations, lectures, advising, and trial-and-error. The transcript quotes a view that mathematicians publish success stories while the most precious information includes what didn’t work and how it was fixed. Training on this kind of tacit knowledge is portrayed as promising but slower and dependent on humans to encode it with fidelity, making it less likely to produce an immediate intelligence explosion.

Review Questions

  1. What specific capability gap between memorized procedures and synthesized reasoning is used to explain ARC-style failures?
  2. Which two approaches rely on external evaluation (verifiers/simulators or symbolic checkers), and how do they change the model’s failure modes?
  3. Why does the transcript argue that naive scaling (more parameters/data) is insufficient for generalization to novel tasks?

Key Points

  1. ARC-style abstract reasoning failures are framed as evidence that current LLMs don’t reliably generalize to unseen task structures.

  2. LLM “reasoning” is often treated as retrieval of familiar reasoning chains rather than synthesis of new procedures when the pattern is novel.

  3. Hallucinations and overhyped marketing are presented as symptoms of systems optimized for fluent output rather than guaranteed correctness.

  4. The “AI slop” problem is described as a trust and information-quality crisis amplified by bots and style-mimicking tools.

  5. Research progress toward more general intelligence is organized around six directions: compositional generalization, verifier/search methods, active inference via test-time fine-tuning, neural-symbolic planning, algorithm embeddings, and learning from tacit human knowledge.

  6. External checking—through simulators, reward models, or symbolic systems—is repeatedly emphasized as a practical way to reduce wrong reasoning.

  7. Near-term improvement is portrayed as achievable through combinations of methods rather than a single leap to AGI.

Highlights

The argument hinges on a simple but consequential distinction: recalling a known reasoning procedure can work, but synthesizing a new one for an unseen pattern is where current models break.
Verifier-based methods can turn candidate generation into an advantage by sampling many solutions and using simulations or reward models to select the correct ones.
Test-time fine-tuning (“active inference”) is presented as a key lever for adapting to the current task, with synthetic data used to strengthen the adaptation signal.
Hybrid neural-symbolic planning is framed as a division of labor: LLMs propose plans, symbolic systems validate them, and feedback loops improve outcomes.
The most ambitious long-term idea is learning from tacit knowledge—what mathematicians learn from failures and mentorship that rarely makes it into published papers.
