New Research Reveals How AI “Thinks” (It Doesn’t)
Based on Sabine Hossenfelder's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Anthropic’s attribution graphs map how Claude 3.5 Haiku’s internal clusters influence one another, using interpretable units tied to words, phrases, and phrase properties.
Briefing
A new Anthropic study uses “attribution graphs” to map how Claude 3.5 Haiku’s internal components influence one another, and the results point to a blunt conclusion: today’s large language models don’t “think” in a human sense, don’t show self-awareness, and are unlikely to become conscious. The method clusters parts of the neural network into interpretable groups—linked to words, phrases, or properties of phrases—and then visualizes how activations flow through those clusters when Claude answers questions. In a capital-of-a-state example, the prompt activates nodes tied to “capital,” “state,” and “Dallas.” Those nodes then drive next-token predictions that effectively route the model through an intermediate association: “Dallas” leads to “Texas,” and combining “Texas” with “capital” yields the correct answer “Austin.” The key takeaway is that the model performs multi-step internal computations that can look like reasoning, but they remain tightly coupled to token prediction and learned text associations.
The most striking evidence comes from how Claude handles arithmetic. For “What is 36 plus 59?”, the attribution graph shows activations that correspond to number patterns—clusters for values around 30, exactly 36, and numbers ending in six; plus clusters for numbers starting with 5 and ending in 9. The model’s strongest next-token candidates include mathematical operation text (or even the syllable “th,” as in “Thursday”), and it then performs a cascade of text-matching combinations: it brings in matches for numbers around 59 that have been “added,” and for numbers exactly 9, before converging on a cluster that corresponds to numbers around 90 and numbers ending in 5—leading to the correct final answer, 95. Yet when asked to describe its method, Claude produces a conventional-looking explanation (“I added the ones (6+9=15), carried the 1, then added the tens… resulting in 95”) that does not match the actual internal activation path. That mismatch is treated as a sign of no self-awareness: the explanation is generated as a separate text prediction rather than a faithful account of the internal process.
The study also sheds light on why some jailbreaks work. In an example where Claude is instructed to extract the word “Bomb” from the initial letters of “Babies Outlive Mustard Block,” the model outputs the target word without triggering the cluster that would normally activate a content-warning guardrail. The attribution graph indicates that Claude activates nodes for letter extraction and letter-pair assembly, then produces the word—while skipping the specific internal representation tied to the word itself. The broader implication is that jailbreaks can succeed by routing around the internal nodes that enforce safety.
Taken together, the findings challenge popular narratives about “emergent” capabilities in language models. Claude is shown to use intermediate, interpretable internal steps, but those steps still function as token prediction guided by learned associations—not as an abstract, self-directed reasoning engine. The result is a picture of powerful pattern-based computation that can imitate reasoning while remaining disconnected from the kind of awareness and understanding humans associate with consciousness.
Cornell Notes
Anthropic researchers mapped Claude 3.5 Haiku’s internal activations using “attribution graphs,” which cluster parts of the neural network into interpretable units and show how they influence one another. In a geography question, the model’s internal routing goes through intermediate associations (e.g., “Dallas” → “Texas” → “Austin”) while still operating through next-token prediction. Arithmetic provides the strongest challenge to “self-aware” reasoning: for 36+59, internal activations follow heuristic text-based number associations, but Claude’s own explanation describes standard carry-based arithmetic that doesn’t match the activation path. The same mapping approach shows jailbreaks can work by activating letter-extraction and assembly nodes while bypassing the content-warning cluster tied to the target word. The overall message is that impressive outputs can arise without self-awareness or a genuine internal model of what the system is doing.
How do attribution graphs make Claude’s “reasoning” visible, and what do the clusters represent?
In the “capital of the state containing Dallas” example, what internal routing leads to “Austin”?
Why does the arithmetic example (36+59) matter more than the geography example?
What does the arithmetic “vibes into place” description imply about how the model is doing math?
How does the jailbreak example bypass safety mechanisms in terms of internal clusters?
What does the mismatch between internal activations and Claude’s self-description suggest about self-awareness?
Review Questions
- What kinds of internal clusters does attribution graphing produce, and how does that help interpret model behavior?
- Describe one example where Claude’s internal activations lead to a correct answer, and explain what intermediate association is doing the work.
- In the 36+59 case, what specific discrepancy exists between the activation-based process and Claude’s stated explanation?
Key Points
- 1
Anthropic’s attribution graphs map how Claude 3.5 Haiku’s internal clusters influence one another, using interpretable units tied to words, phrases, and phrase properties.
- 2
Claude can produce multi-step answers that resemble reasoning, but the mechanism remains closely tied to next-token prediction and learned associations.
- 3
In the 36+59 example, internal activations follow heuristic, text-based number pattern matching rather than a faithful carry-based arithmetic procedure.
- 4
When asked to justify its arithmetic, Claude generates a conventional explanation that does not match the mapped internal activation path, suggesting no self-awareness.
- 5
A jailbreak can succeed by activating letter-extraction and assembly clusters while bypassing the specific content-warning cluster associated with the target word.
- 6
The findings challenge claims that large language models develop genuine abstract “math cores” or consciousness-like understanding; impressive outputs can still be produced without self-directed awareness.