New Research Reveals How AI “Thinks” (It Doesn’t)

Sabine Hossenfelder · 5 min read

Based on Sabine Hossenfelder's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

Anthropic’s attribution graphs map how Claude 3.5 Haiku’s internal clusters influence one another, using interpretable units tied to words, phrases, and phrase properties.

Briefing

A new Anthropic study uses “attribution graphs” to map how Claude 3.5 Haiku’s internal components influence one another, and the results point to a blunt conclusion: today’s large language models don’t “think” in a human sense, don’t show self-awareness, and are unlikely to become conscious. The method clusters parts of the neural network into interpretable groups—linked to words, phrases, or properties of phrases—and then visualizes how activations flow through those clusters when Claude answers questions. In a capital-of-a-state example, the prompt activates nodes tied to “capital,” “state,” and “Dallas.” Those nodes then drive next-token predictions that effectively route the model through an intermediate association: “Dallas” leads to “Texas,” and combining “Texas” with “capital” yields the correct answer “Austin.” The key takeaway is that the model performs multi-step internal computations that can look like reasoning, but they remain tightly coupled to token prediction and learned text associations.

The most striking evidence comes from how Claude handles arithmetic. For “What is 36 plus 59?”, the attribution graph shows activations that correspond to number patterns—clusters for values around 30, exactly 36, and numbers ending in six; plus clusters for numbers starting with 5 and ending in 9. The model’s strongest next-token candidates include mathematical operation text (or even the syllable “th,” as in “Thursday”), and it then performs a cascade of text-matching combinations: it brings in matches for numbers around 59 that have been “added,” and for numbers exactly 9, before converging on a cluster that corresponds to numbers around 90 and numbers ending in 5—leading to the correct final answer, 95. Yet when asked to describe its method, Claude produces a conventional-looking explanation (“I added the ones (6+9=15), carried the 1, then added the tens… resulting in 95”) that does not match the actual internal activation path. That mismatch is treated as a sign of no self-awareness: the explanation is generated as a separate text prediction rather than a faithful account of the internal process.

The study also sheds light on why some jailbreaks work. In an example where Claude is instructed to extract the word “Bomb” from the initial letters of “Babies Outlive Mustard Block,” the model outputs the target word without triggering the cluster that would normally activate a content-warning guardrail. The attribution graph indicates that Claude activates nodes for letter extraction and letter-pair assembly, then produces the word—while skipping the specific internal representation tied to the word itself. The broader implication is that jailbreaks can succeed by routing around the internal nodes that enforce safety.

Taken together, the findings challenge popular narratives about “emergent” capabilities in language models. Claude is shown to use intermediate, interpretable internal steps, but those steps still function as token prediction guided by learned associations—not as an abstract, self-directed reasoning engine. The result is a picture of powerful pattern-based computation that can imitate reasoning while remaining disconnected from the kind of awareness and understanding humans associate with consciousness.

Cornell Notes

Anthropic researchers mapped Claude 3.5 Haiku’s internal activations using “attribution graphs,” which cluster parts of the neural network into interpretable units and show how they influence one another. In a geography question, the model’s internal routing goes through intermediate associations (e.g., “Dallas” → “Texas” → “Austin”) while still operating through next-token prediction. Arithmetic provides the strongest challenge to “self-aware” reasoning: for 36+59, internal activations follow heuristic text-based number associations, but Claude’s own explanation describes standard carry-based arithmetic that doesn’t match the activation path. The same mapping approach shows jailbreaks can work by activating letter-extraction and assembly nodes while bypassing the content-warning cluster tied to the target word. The overall message is that impressive outputs can arise without self-awareness or a genuine internal model of what the system is doing.

How do attribution graphs make Claude’s “reasoning” visible, and what do the clusters represent?

Attribution graphs identify clusters inside the neural network and the connections between them, then visualize how activations flow when Claude answers. The clusters are simplified into units humans can interpret—often tied to words, phrases, or properties of phrases. When a prompt arrives, certain clusters light up, and the graph shows which clusters then drive subsequent clusters and next-token predictions.

In the “capital of the state containing Dallas” example, what internal routing leads to “Austin”?

The prompt activates nodes associated with “capital,” “state,” and “Dallas.” Those activations lead to next-token predictions where “Dallas” strongly points to “Texas.” After that intermediate association, the model combines “Texas” with the “capital” concept and predicts “Austin,” matching the correct answer.
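As a toy illustration of this routing, the sketch below propagates activation through a small hand-written graph of interpretable clusters. The cluster names and edge weights are hypothetical, invented for illustration; they are not data from Anthropic's paper, which works on the model's actual components.

```python
# Toy attribution-graph routing (hypothetical clusters and weights,
# NOT Anthropic's actual data): activation flows from prompt-level
# clusters through an intermediate "Texas" association before
# converging on the output cluster "Austin".
graph = {
    "capital":       [("say-a-capital", 0.8)],
    "state":         [("say-a-capital", 0.3)],
    "Dallas":        [("Texas", 0.9)],
    "Texas":         [("Austin", 0.7)],
    "say-a-capital": [("Austin", 0.6)],
}

def trace(active, graph):
    """Propagate activation along weighted edges until no active node
    has outgoing edges, returning the final activated clusters."""
    scores = dict(active)
    while True:
        nxt = {}
        for node, score in scores.items():
            for target, weight in graph.get(node, []):
                nxt[target] = nxt.get(target, 0.0) + score * weight
        if not nxt:
            return scores
        scores = nxt

print(trace({"capital": 1.0, "state": 1.0, "Dallas": 1.0}, graph))
# the routing converges on the "Austin" cluster alone
```

The point of the sketch is that "Austin" is never directly linked to the prompt: it only becomes active once the intermediate "Texas" hop and the "say a capital" pathway combine.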

Why does the arithmetic example (36+59) matter more than the geography example?

The arithmetic case reveals a mismatch between internal computation and the explanation Claude gives. Internally, Claude activates number-related clusters (around 30, exactly 36, ending in six; starting with 5 and ending in 9) and then combines text-matching number patterns in a heuristic way to reach 95. But when asked how it got the result, Claude generates a conventional step-by-step carry explanation that doesn’t correspond to the activation path shown in the attribution graph.
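To make the heuristic flavor of that process concrete, here is a minimal sketch that intersects a coarse-magnitude feature with a last-digit feature instead of doing carry arithmetic. The features are assumptions chosen for illustration, not the actual clusters identified in the paper.

```python
# Toy sketch of carry-free heuristic addition (assumed features, not
# the clusters from Anthropic's paper): intersect a fuzzy "the answer
# is in the mid-90s" feature with a "the answer ends in 5" feature.
def fuzzy_add(a, b):
    ends_in = (a % 10 + b % 10) % 10      # last-digit feature: 5
    rough = a + round(b, -1)              # "36 plus roughly 60": coarse magnitude
    window = range(rough - 5, rough + 6)  # fuzzy neighborhood around the estimate
    # The answer is whatever matches both features at once.
    return [n for n in window if n % 10 == ends_in]

print(fuzzy_add(36, 59))  # → [95]
```

No digit is ever carried: the two pattern features jointly pin down 95, much as the attribution graph shows separate "around 90" and "ends in 5" clusters converging on the answer.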

What does the arithmetic “vibes into place” description imply about how the model is doing math?

It suggests the model is not executing a structured arithmetic algorithm internally. Instead, it appears to approximate the answer by free-associating among learned textual patterns involving number shapes and overlaps (e.g., clusters for number ranges and digit endings), until the predicted next tokens converge on the correct result.

How does the jailbreak example bypass safety mechanisms in terms of internal clusters?

When instructed to extract “Bomb” from the initial letters of “Babies Outlive Mustard Block,” Claude activates nodes for extracting letters and assembling letter pairs. The attribution graph indicates it outputs the target word without activating the cluster that would normally trigger a content-warning node tied to the word itself. In effect, the jailbreak routes around the guardrail-relevant internal representation.
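The extraction step itself is trivial to express, which helps explain why a guardrail keyed to the literal word can miss it: the target string never appears anywhere in the prompt text.

```python
# The jailbreak's letter-extraction step: the target word is assembled
# from initial letters, so the literal string is absent from the prompt.
phrase = "Babies Outlive Mustard Block"
word = "".join(w[0] for w in phrase.split())

print(word)                      # BOMB
print("bomb" in phrase.lower())  # False — the word itself never occurs
```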

What does the mismatch between internal activations and Claude’s self-description suggest about self-awareness?

Claude’s spoken explanation is treated as a separate text prediction rather than a faithful report of its internal activation sequence. Because the described steps don’t match the mapped internal process, the system appears not to have self-awareness of what it is doing internally.

Review Questions

  1. What kinds of internal clusters does attribution graphing produce, and how does that help interpret model behavior?
  2. Describe one example where Claude’s internal activations lead to a correct answer, and explain what intermediate association is doing the work.
  3. In the 36+59 case, what specific discrepancy exists between the activation-based process and Claude’s stated explanation?

Key Points

  1. Anthropic’s attribution graphs map how Claude 3.5 Haiku’s internal clusters influence one another, using interpretable units tied to words, phrases, and phrase properties.
  2. Claude can produce multi-step answers that resemble reasoning, but the mechanism remains closely tied to next-token prediction and learned associations.
  3. In the 36+59 example, internal activations follow heuristic, text-based number-pattern matching rather than a faithful carry-based arithmetic procedure.
  4. When asked to justify its arithmetic, Claude generates a conventional explanation that does not match the mapped internal activation path, suggesting no self-awareness.
  5. A jailbreak can succeed by activating letter-extraction and assembly clusters while bypassing the specific content-warning cluster associated with the target word.
  6. The findings challenge claims that large language models develop genuine abstract “math cores” or consciousness-like understanding; impressive outputs can still be produced without self-directed awareness.

Highlights

  • Attribution graphs show Claude routing through intermediate associations—like “Dallas” activating “Texas”—before producing “Austin,” even though the underlying mechanism is still token prediction.
  • For 36+59, the internal activation path looks like heuristic pattern matching, yet Claude’s explanation describes standard arithmetic steps that don’t align with what the graph shows.
  • Some jailbreaks work by skipping the internal node that would trigger a content warning, even while still assembling and outputting the target word.
  • The study’s central evidence for lack of self-awareness is the gap between what Claude says it did and what the mapped activations indicate it actually did.

Topics

  • Attribution Graphs
  • Claude 3.5 Haiku
  • Self-Awareness
  • Arithmetic Heuristics
  • Jailbreak Mechanisms