
Bombshell Paper Shows AI Has Thinking Collapse. Or Does It?

Sabine Hossenfelder · 5 min read

Based on Sabine Hossenfelder's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Two research threads pull in opposite directions: an odd-one-out image study finds large models form human-like object representations that align with neural activity in object-related brain regions, while Apple’s paper reports an accuracy collapse on complex reasoning puzzles that a follow-up attributes to token-output limits rather than a failure of reasoning.

Briefing

A pair of near-simultaneous research papers is forcing a rethink of what “AI reasoning” really means: one line of work finds striking human-like object representations and brain alignment, while another reports a sharp accuracy breakdown in more complex reasoning tasks—followed by a rebuttal that blames output limits rather than cognition. The practical takeaway is that today’s large models may show fragments of human-like thinking, but they don’t yet display the broad, generalizable intelligence people often assume.

One lesser-known study examined how large language models and their vision-capable counterparts classify images. Researchers repeatedly presented both humans and the models with sets of three images and asked them to pick the odd one out. By varying these comparisons, they built a quantitative measure of “similarity” among images. The results suggested the models develop “human-like conceptual representations of objects.” More consequentially, the team compared internal activation patterns from the model networks with neural activity in the brain and reported “strong alignment” between model embeddings and neural activity in brain regions tied to object understanding. The authors interpret this as evidence that, while model representations aren’t identical to human ones, they share fundamental similarities reflecting core aspects of human conceptual knowledge.
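
To make the construction concrete, here is a minimal sketch (an assumed reconstruction, not the study’s actual pipeline) of how repeated odd-one-out choices can be aggregated into a pairwise similarity measure: in each trial, the two images not picked as the odd one out are counted as a “similar” pair.

```python
import numpy as np

def similarity_from_triplets(n_items, trials):
    """Aggregate odd-one-out choices into a pairwise similarity matrix.

    trials: list of (triplet, odd) pairs, where triplet is a tuple of
    three item indices and odd is the index chosen as the odd one out.
    """
    counts = np.zeros((n_items, n_items))  # times a pair was judged similar
    shown = np.zeros((n_items, n_items))   # times a pair appeared together
    for triplet, odd in trials:
        for i in triplet:
            for j in triplet:
                if i < j:
                    shown[i, j] += 1
                    if i != odd and j != odd:
                        counts[i, j] += 1  # the two non-odd items count as similar
    # Fraction of co-appearances in which the pair was grouped together
    return counts / np.maximum(shown, 1)

# Toy example: items 0 and 1 are usually grouped together
trials = [((0, 1, 2), 2), ((0, 1, 2), 2), ((1, 2, 0), 0)]
print(similarity_from_triplets(3, trials))
```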

The headline-grabbing Apple paper took a different angle: it focused on large reasoning models—large language models augmented with chain-of-thought prompting and training. These systems break prompts into smaller steps, solve subproblems, analyze intermediate results, and then recombine them. The Apple team tested performance on deterministic logic puzzles with algorithmic solutions that the models could, in principle, use. Their central claim was that accuracy collapses at a “frontier” as puzzle complexity increases, implying a limit in how far the models can carry out the kind of stepwise reasoning the setup demands.

Within days, a follow-up paper challenged that conclusion. It argued the apparent collapse wasn’t a failure of reasoning ability itself, but a constraint tied to how many tokens the model can output—effectively the maximum length of the generated chain-of-thought. Preliminary tests were presented in support of this alternative explanation.

The broader debate, as framed here, isn’t just about which paper is right; it’s about definitions. Image classification and algorithm execution are narrow proxies for “thinking” and “reasoning” as humans experience them. Even so, the evidence points to a limited form of cognition: current systems can reason a little and execute algorithms, but the capability doesn’t generalize in the way people expect from human-like intelligence. Instead of acquiring hallmarks such as deductive and inductive analysis, abstract theorizing, rapid learning, and robust generalization, the models appear to scale by handling more tasks with more training and compute—without necessarily developing the deeper cognitive machinery associated with intelligence.

The implication is sobering for anyone expecting imminent AGI. If “reasoning” is measured by general, theory-building intelligence, today’s models may not be on that trajectory. And as these systems become embedded in everyday internet workflows—coding, browsing, and interacting with users—the safety implications of increasingly capable but still limited reasoning could arrive before the field fully agrees on what the models can truly do.

Cornell Notes

Two research threads pull in opposite directions on whether current AI “thinks.” One study uses odd-one-out image classification and finds human-like object representations in large models, plus strong alignment between model embeddings and neural activity in brain regions related to object concepts. Another study, from Apple, tests chain-of-thought reasoning models and reports a sharp accuracy collapse as deterministic logic puzzles become more complex. A follow-up paper argues the collapse is driven by token-output limits (maximum chain length) rather than a fundamental reasoning failure. Together, the results suggest today’s systems can mirror parts of human conceptual processing but lack the general, theory-building intelligence people often expect.

What evidence suggests large models develop human-like object concepts?

Researchers repeatedly showed humans and AI systems sets of three images and asked them to identify the odd one out. By aggregating many such trials, they constructed a similarity measure among images. The models’ behavior indicated they learn “human-like conceptual representations of objects.” They then compared internal model activation patterns to neural activity recorded in brain regions associated with object understanding, reporting “strong alignment” between model embeddings and neural activity patterns—evidence that model object representations share fundamental similarities with human conceptual knowledge, even if not identical.
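
One common way to quantify this kind of model-to-brain alignment is a representational-similarity comparison: build a pairwise dissimilarity matrix from the model embeddings, another from the neural response patterns for the same objects, and correlate their entries. The sketch below assumes this generic approach with toy data; it is not the study’s specific method.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def representational_alignment(model_embeddings, neural_patterns):
    """Correlate the pairwise dissimilarity structure of model embeddings
    with that of neural response patterns for the same set of objects.

    model_embeddings: (n_objects, d_model) array
    neural_patterns:  (n_objects, d_neural) array
    """
    model_rdm = pdist(model_embeddings, metric="correlation")  # condensed dissimilarities
    neural_rdm = pdist(neural_patterns, metric="correlation")
    rho, p = spearmanr(model_rdm, neural_rdm)  # rank correlation of the two geometries
    return rho, p

# Toy example with synthetic data that shares some structure
rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 128))
neu = emb[:, :32] + 0.5 * rng.normal(size=(50, 32))  # noisy, partially shared signal
print(representational_alignment(emb, neu))
```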

How do “large reasoning models” differ from standard large language models in these studies?

In the Apple-focused work, a large reasoning model is essentially a large language model augmented with chain-of-thought. The system receives extra instructions and training to decompose a prompt into smaller steps, solve those subparts separately, analyze intermediate results, and recombine them into a final answer. The goal is to improve accuracy by making intermediate reasoning explicit rather than relying only on end-to-end generation.
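
A minimal sketch of that decompose-solve-recombine loop, assuming a hypothetical `call_model` function standing in for whatever LLM API is in use:

```python
def call_model(prompt: str) -> str:
    # Hypothetical stand-in: plug in your LLM client here.
    raise NotImplementedError

def chain_of_thought_answer(question: str) -> str:
    # 1. Ask the model to break the problem into smaller steps.
    plan = call_model(
        f"Break this problem into a numbered list of small sub-steps:\n{question}"
    )
    steps = [line for line in plan.splitlines() if line.strip()]

    # 2. Solve each sub-step, feeding earlier results forward.
    intermediate = []
    for step in steps:
        context = "\n".join(intermediate)
        result = call_model(
            f"Problem: {question}\nWork so far:\n{context}\nNow do: {step}"
        )
        intermediate.append(f"{step} -> {result}")

    # 3. Recombine the intermediate results into a final answer.
    return call_model(
        f"Problem: {question}\nIntermediate results:\n"
        + "\n".join(intermediate)
        + "\nGive the final answer."
    )
```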

What did the Apple paper claim happens as puzzle complexity increases?

The Apple paper tested models on deterministic logic puzzles with algorithmic solutions and increasing complexity. The intended definition of “reasoning” was whether the model knows when and how to apply the available algorithmic procedure to solve the puzzle. The reported outcome was a “complete accuracy collapse” beyond a certain complexity frontier, suggesting a limit in the models’ ability to carry out the stepwise algorithmic reasoning required by the tasks.
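
That setup can be pictured as a simple evaluation harness that sweeps complexity levels and checks model answers against a deterministic solver; a sharp drop to near-zero accuracy past some level is the reported “collapse.” The sketch below is generic, and `make_puzzle`, `reference_solver`, and `solve_with_model` are assumed placeholders, not the paper’s code.

```python
from typing import Callable

def accuracy_by_complexity(
    make_puzzle: Callable[[int], object],       # generates a puzzle at a given complexity level
    reference_solver: Callable[[object], str],  # deterministic algorithmic solution
    solve_with_model: Callable[[object], str],  # asks the reasoning model for its answer
    levels: range,
    trials_per_level: int = 20,
) -> dict[int, float]:
    """Measure model accuracy as puzzle complexity grows."""
    results = {}
    for level in levels:
        correct = 0
        for _ in range(trials_per_level):
            puzzle = make_puzzle(level)
            if solve_with_model(puzzle) == reference_solver(puzzle):
                correct += 1
        results[level] = correct / trials_per_level
    return results
```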

Why did a follow-up paper dispute the “reasoning collapse” interpretation?

The rebuttal argued that the observed collapse stemmed from output constraints rather than reasoning competence. Specifically, it pointed to the number of tokens the model can output—effectively the maximum length of the chain-of-thought. Preliminary tests were presented to support the idea that when puzzles require longer intermediate reasoning traces than the model can generate, performance drops even if the underlying reasoning mechanism could work given sufficient output budget.
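
A back-of-the-envelope illustration of the rebuttal’s logic, using assumed numbers rather than figures from either paper: if the required solution trace grows exponentially with complexity while the output budget is fixed, accuracy falls off a cliff at the level where the trace no longer fits.

```python
# Illustrative arithmetic only; both constants are assumptions, not values from the papers.
TOKEN_BUDGET = 32_000   # assumed maximum output length
TOKENS_PER_STEP = 25    # assumed tokens needed to write out one solution step

def trace_tokens(complexity: int) -> int:
    # Assume the number of required steps grows exponentially with complexity,
    # as it does for some classic deterministic puzzles.
    steps = 2 ** complexity - 1
    return steps * TOKENS_PER_STEP

for c in range(5, 16):
    needed = trace_tokens(c)
    print(f"complexity {c:2d}: ~{needed:>9,} tokens needed, fits budget: {needed <= TOKEN_BUDGET}")
# Past the level where the full trace no longer fits, accuracy would drop to zero
# even if the model "knows" the algorithm, which is the rebuttal's point.
```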

What is the key critique of using these benchmarks to define “thinking” or “reasoning”?

The critique is definitional: image classification and executing an algorithm are narrow measures of cognition. If “reasoning” is meant to capture broad human-like intelligence—deductive and inductive analysis, abstract thinking, building correct theories, and quick learning—then these tasks may not reflect the full capability set. The argument is that current models show limited reasoning and some conceptual alignment, but not the generalizable cognitive hallmarks associated with human intelligence.

Review Questions

  1. Which experimental design elements (odd-one-out trials, similarity construction, neural alignment) support the claim of human-like conceptual representations in models?
  2. What two competing explanations are offered for the reported accuracy collapse in complex reasoning puzzles?
  3. Why does the video argue that algorithm execution and classification are insufficient proxies for general human reasoning?

Key Points

  1. A study using odd-one-out image classification found large models can form object representations that behave like human concepts, and those representations align with neural activity patterns in object-related brain regions.

  2. Apple’s study tested chain-of-thought reasoning models on deterministic, algorithmic puzzles, with results reported as a sharp accuracy collapse beyond a complexity threshold.

  3. A follow-up paper challenged that conclusion by attributing the collapse to token-output limits that cap the length of generated chain-of-thought traces.

  4. The debate turns on definitions: classification and algorithm execution are narrow proxies for the broader, generalizable reasoning associated with human intelligence.

  5. Current models appear to scale task coverage with more training and compute, but they don’t clearly acquire the deeper cognitive hallmarks of abstract theory-building, rapid learning, and robust generalization.

  6. Expectations of imminent AGI may be overstated if “reasoning” is measured by general intelligence rather than benchmark performance under constraints.

Highlights

Odd-one-out image experiments suggest large models learn object concepts that track human similarity judgments—and internal embeddings show strong alignment with brain activity in object-processing regions.
Apple’s chain-of-thought reasoning tests reported a “complete accuracy collapse” as deterministic puzzle complexity increases, implying a limit in algorithmic stepwise reasoning.
A quick rebuttal reframed the collapse as a token-budget problem: insufficient output length can masquerade as a reasoning failure.
The central caution is definitional: algorithm execution and classification don’t automatically translate into human-like, general reasoning.

Topics

  • Chain-of-Thought Reasoning
  • Large Language Models
  • Token Output Limits
  • Neural Embedding Alignment
  • Deterministic Logic Puzzles