Apple’s ‘AI Can’t Reason’ Claim Seen By 13M+, What You Need to Know

AI Explained · 5 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

LLMs are probabilistic text generators, so they can imitate reasoning but aren’t guaranteed to execute long, exact sequences without error.

Briefing

A widely circulated claim that Apple’s latest AI work shows large language models can’t “reason” is met with a blunt counterpoint: these systems struggle with exact computation and long, rule-bound tasks when they’re forced to answer without tools, but that limitation is neither new nor proof of “no reasoning.” The core takeaway is that LLMs are probabilistic generators that can produce plausible next steps—yet they will eventually fail on tasks requiring long sequences of precise, error-free steps, especially when output length is constrained.

Apple’s paper is described as focusing on the idea that LLMs don’t follow explicit algorithms and degrade as task complexity rises. To test that, it uses puzzle-style benchmarks such as Tower of Hanoi (moving discs without ever placing a larger disc on a smaller one), checkers (jumping tokens to the opposite side under fixed movement rules), and the fox-and-chicken “river crossing” problem. If LLMs were behaving like fixed programs, the transcript argues, performance would stay stable as the number of pieces increases. Instead, accuracy drops as puzzles scale, consistent with the long-known behavior of probabilistic models that aren’t guaranteed to produce the same correct output every time.
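
To see why these traces get long, here is a minimal Tower of Hanoi solver in Python (an illustration, not code from the paper): the optimal solution for n discs takes 2^n − 1 moves, so the error-free sequence a model must emit roughly doubles with each disc added.

```python
# Minimal Tower of Hanoi solver: shows how the length of a correct
# solution trace grows exponentially with the number of discs.
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Return the optimal move list for n discs (2**n - 1 moves)."""
    if moves is None:
        moves = []
    if n == 1:
        moves.append((src, dst))
        return moves
    hanoi(n - 1, src, dst, aux, moves)   # park n-1 discs on the spare peg
    moves.append((src, dst))             # move the largest disc
    hanoi(n - 1, aux, src, dst, moves)   # restack the n-1 discs on top
    return moves

for n in (3, 7, 10, 15):
    print(n, "discs ->", len(hanoi(n)), "moves")  # 7, 127, 1023, 32767
```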

The transcript then connects those failures to two practical constraints. First, without external tools, LLMs can’t reliably perform exact arithmetic once numbers get large enough; past a certain digit length, multiplication fails outright rather than merely suffering occasional mistakes. Second, even when models know the algorithmic steps in principle, long tasks can exceed token limits. One example given is a Claude model tested with a 128,000-token output cap; some puzzles require a longer solution trace than that, so the model can’t produce a full answer and instead falls back to shorter responses such as “here is the algorithm” or “use this tool.”
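
A back-of-envelope sketch shows how a 128,000-token cap collides with trace length; the tokens-per-move figure below is an assumption for illustration, not a number from the transcript:

```python
# Illustrative assumption: ~10 output tokens per move, for text like
# "move disc 3 from peg A to peg C". Not a figure from the paper.
TOKEN_CAP = 128_000
TOKENS_PER_MOVE = 10

for n in range(8, 16):
    moves = 2 ** n - 1                 # optimal Hanoi solution length
    needed = moves * TOKENS_PER_MOVE
    fits = "fits" if needed <= TOKEN_CAP else "exceeds cap"
    print(f"{n} discs: {moves:>6} moves ~ {needed:>7} tokens ({fits})")

# Around 14 discs (~163k tokens under this assumption) the full trace
# no longer fits, so the model cannot emit a complete solution even in
# principle -- regardless of whether it "knows" the algorithm.
```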

A further critique targets the framing of “thinking” versus “non-thinking.” The transcript says Apple originally wanted to compare long chain-of-thought style models against shorter-trace models on math benchmarks, but results didn’t match expectations; the testing emphasis shifted toward puzzles. Another surprise in Apple’s findings—failure even when the algorithm is provided in the prompt—is treated as unsurprising for probabilistic systems: knowing the procedure doesn’t guarantee error-free execution across millions of steps.
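
The “knowing isn’t executing” point is just probability. If each step independently succeeds with probability 1 − eps, a flawless N-step run has probability (1 − eps)^N; the eps below is illustrative, not a measured error rate:

```python
# Probability of a flawless run when each step independently succeeds
# with probability (1 - eps): even tiny per-step error rates make long
# exact traces almost certain to contain at least one mistake.
def p_flawless(steps, eps):
    return (1 - eps) ** steps

for steps in (100, 10_000, 1_000_000):
    print(f"{steps:>9} steps, eps=1e-4 -> P(no error) = "
          f"{p_flawless(steps, 1e-4):.4f}")
# 100 steps -> ~0.99; 10k steps -> ~0.37; 1M steps -> effectively zero.
```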

The broader implication is that headline writers may overstate “fundamental barriers to generalizable reasoning.” The transcript argues that serious researchers were unlikely to be shocked by results showing LLMs’ limits in tool-free, long-horizon precision tasks. At the same time, it stresses that LLMs can become far more reliable when paired with tools and environments that correct mistakes—turning generative text into a component of a system that can support scientific progress.

The closing guidance shifts from debate to selection. For free use, it recommends Google’s Gemini 2.5 Pro, with an honorary mention for DeepSeek R1 via API. It also warns that benchmark headlines can mislead: OpenAI’s o3 Pro, available on the $200 tier, posts strong math/science/coding scores, but the transcript claims the o3 system teased in December 2024 performed better than o3 Pro on some tracked results, urging users to evaluate performance on their own use case rather than rely on record-setting charts.

Cornell Notes

Apple’s “AI can’t reason” headlines are challenged on the grounds that LLMs are probabilistic generators: they can imitate reasoning and solve many problems, but they fail on exact, long-horizon tasks when forced to answer without tools. The discussed Apple paper reports accuracy dropping as puzzles scale (Tower of Hanoi, checkers, river crossing), and the transcript links that to known issues like multiplication breaking down for large digit counts and token-length limits (e.g., a 128,000-token cap). Even when an algorithm is provided, probabilistic execution can still accumulate errors across many steps. The practical lesson is that LLMs improve dramatically when integrated with tools and environments that constrain or verify outputs, turning “plausible BS” into more dependable computation.

Why do LLMs struggle with puzzles like Tower of Hanoi or scaled-up checkers when no tools are allowed?

The transcript frames LLMs as probabilistic neural networks that generate plausible next steps rather than executing a guaranteed algorithm like a calculator. As puzzle complexity increases, solutions require longer sequences of precise, rule-following moves. With no external verification, small errors accumulate, so performance drops noticeably as the number of pieces/discs grows.
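
The same intuition can be simulated directly; the per-move error rate here is an assumed illustrative value, but the qualitative pattern (near-perfect on small puzzles, collapse as the move count grows) is the scaling behavior described:

```python
import random

# Toy simulation with an assumed per-move error rate (not a measured
# value): accuracy on an n-disc puzzle is the chance of making all
# 2**n - 1 moves correctly, estimated by sampling.
def simulated_accuracy(n_discs, eps=0.005, trials=2000):
    moves = 2 ** n_discs - 1
    ok = sum(all(random.random() > eps for _ in range(moves))
             for _ in range(trials))
    return ok / trials

for n in range(3, 11):
    print(f"{n} discs: ~{simulated_accuracy(n):.0%} solved")
# Accuracy stays high for small puzzles and collapses as the move
# count grows -- the drop-off pattern the paper reports.
```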

What’s the connection between “can’t do exact computation” and multiplication failures?

A key example is multiplication: when models lack tools and are asked to multiply numbers directly, they can handle small digit lengths but then fail dramatically once the numbers get too large. The transcript emphasizes that this isn’t occasional rounding; past that threshold the model essentially never produces the correct product, because the system isn’t designed to be fully predictable.
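
A measurement harness for this claim would compare a model’s answer against exact integer arithmetic, which Python gets right at any size; the `ask_model` stub below is hypothetical and stands in for a real LLM call:

```python
import random

def ask_model(a: int, b: int) -> int:
    """Hypothetical stand-in for an LLM call; substitute a real API."""
    raise NotImplementedError("plug in a real model call here")

def accuracy_at_digits(d: int, trials: int = 20) -> float:
    """Fraction of correct d-digit multiplications over random trials."""
    hits = 0
    for _ in range(trials):
        a = random.randint(10 ** (d - 1), 10 ** d - 1)
        b = random.randint(10 ** (d - 1), 10 ** d - 1)
        hits += (ask_model(a, b) == a * b)  # a * b is exact ground truth
    return hits / trials
```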

How do token limits affect the ability to produce correct long solutions?

The transcript highlights a token-cap issue tied to Claude Opus testing: if the required solution trace exceeds the model’s output capacity (given as 128,000 tokens), the model can’t emit the full answer. Instead, it may output shorter traces such as “here is the algorithm you need to use” or “use this tool,” effectively acknowledging it can’t fit the complete computation.

Why does providing the algorithm in the prompt not guarantee correct results?

Even if the model is given the procedure, probabilistic generation still has to execute many steps without error. The transcript argues that LLMs may know the algorithmic steps in principle, but across millions of steps the chance of at least one mistake becomes high, so the final answer can still fail.

What changes when LLMs are allowed to use tools or code?

The transcript claims the same multiplication example becomes correct when the model is allowed to use code/tools. It also notes that LLMs often hallucinate when they can’t access tools, but tool use can correct those mistakes—turning generative text into a system that can perform reliable computation.
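
A minimal sketch of that tool loop, assuming (for illustration) that the model emits an arithmetic expression as text instead of generating the answer’s digits; all names here are illustrative:

```python
import ast, operator

# A "calculator tool": the model emits an expression as text, and the
# host evaluates it exactly instead of trusting generated digits.
OPS = {ast.Mult: operator.mul, ast.Add: operator.add, ast.Sub: operator.sub}

def eval_expr(node):
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](eval_expr(node.left), eval_expr(node.right))
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return node.value
    raise ValueError("unsupported expression")

def calculator_tool(expr: str) -> int:
    return eval_expr(ast.parse(expr, mode="eval").body)

# Suppose the model, instead of answering directly, emits this tool call:
print(calculator_tool("823471906543 * 998877665544"))  # exact product
```

Parsing with `ast` rather than calling `eval` keeps the sketch limited to arithmetic, the usual precaution when executing model-emitted text.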

How should users interpret benchmark scores for choosing a model?

The transcript warns that benchmark headlines can mislead because companies may omit comparisons, show only selected results, or hide details like multiple parallel attempts and strict usage limits. It also claims o3 Pro’s headline performance doesn’t necessarily beat the earlier teased o3 system on some tracked results, so users should evaluate against their own use case rather than rely solely on record charts.

Review Questions

  1. In what ways do probabilistic generation and long-horizon step execution explain accuracy collapse on scaled puzzles?
  2. How do token output limits change what a model can produce, even if it “knows” an algorithm?
  3. Why might a model’s benchmark performance differ from its real-world performance for a specific task?

Key Points

  1. LLMs are probabilistic text generators, so they can imitate reasoning but aren’t guaranteed to execute long, exact sequences without error.

  2. Accuracy drops on scaled rule-bound puzzles when models must answer without tools, consistent with accumulated mistakes over longer solution traces.

  3. Exact arithmetic like multiplication can fail dramatically at larger digit lengths when models lack tool access.

  4. Token/output limits can prevent full solution traces; models may respond with shorter instructions (e.g., “use this tool”) instead of completing the computation.

  5. Providing an algorithm in the prompt doesn’t ensure correctness because probabilistic execution can still introduce errors across many steps.

  6. Tool use (including code) can convert unreliable generation into more dependable computation by enabling verification or exact calculation.

  7. Benchmark headlines can mislead; users should check comparisons, constraints, and how results map to their own tasks rather than rely on top-line percentages.

Highlights

The transcript argues that “no reasoning” headlines miss the point: LLMs can perform reasoning-like steps, but they fail on long, exact tasks without tools because errors accumulate.
Token limits (example given: 128,000 tokens for a Claude test) can force models to abandon full traces and instead output algorithms or tool instructions.
Tool access is portrayed as the dividing line between hallucinated answers and correct computation, with code-enabled multiplication described as working.
The model-selection advice emphasizes that benchmark record scores aren’t enough; usage limits, comparison choices, and task fit matter.
