Transformer Circuits Part 1

Based on West Coast Machine Learning's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Residual connections make Transformer blocks act like additive corrections to a baseline representation, which simplifies circuit interpretation.

Briefing

The Transformer circuits work centers on a simple but powerful claim: even in a stripped-down, one-layer attention-only Transformer, the model’s behavior can be decomposed into two interpretable “circuits.” The output value circuit predicts how attending to a token changes the next-token logits, and the query-key circuit predicts which tokens attend to which others. That decomposition matters because it turns an opaque neural network into something closer to a set of mechanical rules: attention chooses sources, and learned linear maps decide how those sources shift the probability distribution over the vocabulary.

The discussion begins with a fast refresher on the Transformer architecture. Attention blocks (orange) sit alongside feed-forward networks (blue), and residual connections wrap both attention and feed-forward sublayers so that information flows forward even if learned weights are effectively zero. With stacked layers, the model repeatedly transforms a residual stream. For language modeling, tokens start as integer IDs (e.g., 0 to 49,999 for a 50,000-token vocabulary) and are turned into vectors via an embedding matrix. After the final layer, the residual stream is mapped back to vocabulary logits, and a softmax over those logits gives the predicted next-token distribution.

From there, the focus shifts to progressively simpler “toy” Transformers to make the math tractable. A “zero-layer Transformer” is treated as the simplest baseline: token IDs are embedded and then mapped through a product of embedding and unembedding matrices, yielding an approximation of bigram statistics. The key takeaway is that larger Transformers always contain a term with the same structure—an embedding-to-unembedding path—so understanding this baseline helps interpret what remains when attention is added.
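The zero-layer model can be sketched in a few lines. This is a minimal illustration, assuming toy sizes, random weights standing in for trained ones, and a row-vector shape convention; the names `W_E`, `W_U`, `direct_path`, and `zero_layer_logits` are chosen here for illustration and do not come from the video.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_vocab, d_model = 50, 8                      # toy sizes (assumed)
W_E = rng.normal(size=(n_vocab, d_model))     # embedding: one row per token
W_U = rng.normal(size=(d_model, n_vocab))     # unembedding: residual -> logits

def zero_layer_logits(token_ids):
    # Embed, then immediately unembed. No attention, no MLP.
    return W_E[token_ids] @ W_U

# The whole model collapses into one vocab-by-vocab table:
# row t holds the next-token logits whenever the current token is t.
direct_path = W_E @ W_U

tokens = np.array([3, 17, 3])
logits = zero_layer_logits(tokens)
probs = softmax(logits)
```

Because the prediction at each position depends only on the current token, positions 0 and 2 above (both token 3) get identical logits. That is why a trained model of this shape can do no better than approximate bigram statistics.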

The next step is the “one-layer attention-only Transformer,” which removes MLPs and simplifies away layer normalization and biases to reduce bookkeeping. At a high level, the model does three things: embed tokens, run each attention head and add its result into the residual stream, then map the final residual stream back to vocabulary logits via the unembedding matrix. Inside an attention head, the mechanism is described using matrices for values (from the residual stream via WV), attention weights (from queries and keys), and an output projection (WO) that decides how the attended information is written back into the residual stream.
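The three steps can be sketched as a forward pass. This is a hedged illustration rather than the video's code: it assumes random weights, a row-vector convention, a causal mask, and the usual 1/sqrt(d_head) score scaling, and the function names `attention_head` and `one_layer_attn_only` are made up for this example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(x, W_Q, W_K, W_V, W_O):
    """x is (n_ctx, d_model); returns what this head adds to the residual stream."""
    n_ctx, d_head = x.shape[0], W_Q.shape[1]
    scores = (x @ W_Q) @ (x @ W_K).T / np.sqrt(d_head)
    # Causal mask (assumed): a position may not attend to later positions.
    future = np.triu(np.ones((n_ctx, n_ctx), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    pattern = softmax(scores, axis=-1)          # attention weights per query row
    return pattern @ (x @ W_V) @ W_O            # move values, then write back

def one_layer_attn_only(token_ids, W_E, heads, W_U):
    x = W_E[token_ids]                                      # 1. embed
    resid = x + sum(attention_head(x, *h) for h in heads)   # 2. heads add into the stream
    return resid @ W_U                                      # 3. unembed to logits

rng = np.random.default_rng(0)
n_vocab, d_model, d_head, n_heads = 50, 16, 4, 2   # toy sizes (assumed)
W_E = rng.normal(size=(n_vocab, d_model))
W_U = rng.normal(size=(d_model, n_vocab))

def random_head():
    return (rng.normal(size=(d_model, d_head)),   # W_Q
            rng.normal(size=(d_model, d_head)),   # W_K
            rng.normal(size=(d_model, d_head)),   # W_V
            rng.normal(size=(d_head, d_model)))   # W_O

heads = [random_head() for _ in range(n_heads)]
token_ids = np.array([5, 12, 5, 40])
logits = one_layer_attn_only(token_ids, W_E, heads, W_U)   # (n_ctx, n_vocab)
```

Layer normalization and biases are left out, matching the simplification described above.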

A major conceptual move is to treat the full attention computation as a structured linear algebra object. By using tensor-product notation, the analysis separates “what attention moves” from “where it moves it.” In this framing, the query-key side determines an attention pattern across token positions (which token attends to which other token), while the value/output side determines how the attended token’s content changes the output logits. When the attention pattern is treated as fixed, the remaining computation becomes linear, allowing the model’s effect to be written as an identity term plus a sum of circuit terms.
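One common way to write this decomposition is shown below. Treat it as a sketch under assumptions: it uses a column-vector convention, ignores positional embeddings, and the symbols $t$ (one-hot tokens), $H$ (the set of heads), and $\operatorname{softmax}^{*}$ (softmax with causal masking) are notation introduced here rather than taken from the summary.

```latex
T \;=\; \mathrm{Id} \otimes W_U W_E
\;+\; \sum_{h \in H} A^{h} \otimes \left( W_U W_O^{h} W_V^{h} W_E \right),
\qquad
A^{h} \;=\; \operatorname{softmax}^{*}\!\left( t^{\top} W_E^{\top} \, W_Q^{h\top} W_K^{h} \, W_E \, t \right)
```

In each term, the left factor of the tensor product acts across token positions ("where"), and the right factor acts on the vocabulary dimension ("what"). The first term is the direct embedding-to-unembedding path; each summand pairs one head's attention pattern with its value/output map.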

Two specific matrices become central. The output value circuit is effectively a vocabulary-by-vocabulary matrix formed from the embedding, value, output projection, and unembedding weights; it quantifies how attending to a particular token would bump or suppress specific vocabulary logits. The query-key circuit is another vocabulary-by-vocabulary matrix that yields pre-softmax attention affinities between token pairs, acting like a lookup table for which words are likely to attend to which others. The analysis notes that positions can matter in real models, but the simplified treatment initially ignores positional embeddings.
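Both tables can be built directly from the weights. The sketch below assumes a single head, random weights, a row-vector convention, no positional embeddings, and no score scaling; the names `ov_circuit` and `qk_circuit` are labels chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, d_model, d_head = 30, 12, 4          # toy sizes (assumed)
W_E = rng.normal(size=(n_vocab, d_model))     # embedding
W_U = rng.normal(size=(d_model, n_vocab))     # unembedding
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

# Output value circuit: entry [source, output] is the logit change on
# `output` when the head attends (with weight 1) to token `source`.
ov_circuit = W_E @ W_V @ W_O @ W_U            # (n_vocab, n_vocab)

# Query-key circuit: entry [query, key] is the pre-softmax affinity of
# a `query` token toward a `key` token.
qk_circuit = W_E @ W_Q @ W_K.T @ W_E.T        # (n_vocab, n_vocab)

source_tok, query_tok, key_tok = 7, 3, 7
logit_bumps = ov_circuit[source_tok]          # one value per vocabulary entry
affinity = qk_circuit[query_tok, key_tok]     # a single scalar score
```

Although both tables are vocabulary-by-vocabulary, each is a product that passes through a `d_head`-dimensional bottleneck, so its rank is at most `d_head`.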

The session ends by emphasizing that this is a “lift the hood” approach: fully reverse-engineering a toy Transformer provides interpretability tools, even though multi-layer Transformers introduce more complex interactions. The next step is to test whether the predicted circuit behavior shows up in trained one-layer attention-only models and then extend the method to deeper architectures where interactions go beyond simple word-to-word affinity.

Cornell Notes

The core idea is to interpret a Transformer by splitting its one-layer attention-only computation into two circuits. The query-key circuit determines an attention pattern—how strongly each token attends to every other token—via learned projections of the residual stream into keys and queries. The output value circuit determines what happens to the vocabulary logits when a token is attended to, via a learned chain of embedding/value/output/unembedding maps that can be collapsed into a vocabulary-by-vocabulary linear effect. Treating the attention pattern as fixed makes the remaining computation linear, enabling a clean decomposition into an identity-like direct path plus attention-driven correction terms. This matters because it turns next-token prediction into a mechanical story: attention selects sources, and learned linear maps decide how those sources shift probabilities.

What does the “residual stream” do, and why does the residual connection matter for interpreting Transformer circuits?

Residual connections let information pass through attention and feed-forward blocks even when learned weights contribute little. In the architecture recap, the output of a sublayer (attention or feed-forward) is added back to the incoming representation. The transcript highlights that if weights in attention and MLP were effectively zero, the data could flow through unmodified. That makes circuit analysis easier: learned components act like additive corrections to a baseline path rather than replacing the entire signal.
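A small sketch makes the "additive correction" point concrete. It assumes a bias-free toy MLP sublayer with random inputs; `mlp_sublayer` and `with_residual` are hypothetical helper names.

```python
import numpy as np

def mlp_sublayer(x, W_in, W_out):
    # Bias-free two-layer MLP with ReLU (biases dropped, as in the simplified model).
    return np.maximum(x @ W_in, 0.0) @ W_out

def with_residual(x, W_in, W_out):
    # The sublayer output is added to the stream, not substituted for it.
    return x + mlp_sublayer(x, W_in, W_out)

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32                     # toy sizes (assumed)
x = rng.normal(size=(4, d_model))             # 4 positions in the residual stream

# Zero weights: the sublayer contributes nothing and x passes through unchanged.
zero_out = with_residual(x, np.zeros((d_model, d_hidden)),
                         np.zeros((d_hidden, d_model)))

# Small weights: the output is x plus a small correction.
W_in = 0.01 * rng.normal(size=(d_model, d_hidden))
W_out = 0.01 * rng.normal(size=(d_hidden, d_model))
correction = with_residual(x, W_in, W_out) - x
```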

Why is the zero-layer Transformer useful when studying deeper Transformers?

The zero-layer setup embeds tokens and maps them through an embedding/unembedding product, then uses softmax to predict the next token. The key interpretability point is that this produces a term resembling bigram statistics. More importantly, the transcript notes that larger Transformers always include an analogous embedding-to-unembedding term, so understanding this baseline helps interpret what attention adds on top of the direct path.

In a one-layer attention-only Transformer, what are the three high-level steps from tokens to logits?

First, tokens are embedded (tokens are represented as one-hot vectors multiplied by an embedding matrix). Second, each attention head runs and its output is added into the residual stream (attention mixes information across token positions). Third, the final residual stream is multiplied by the unembedding matrix to produce vocabulary logits, followed by softmax to predict the next token.
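The one-hot view of the first step is equivalent to a plain row lookup, which a short check can confirm. The sizes and the row-per-token layout of `W_E` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, d_model = 10, 4                      # toy sizes (assumed)
W_E = rng.normal(size=(n_vocab, d_model))     # one embedding row per token ID

token_ids = np.array([2, 9, 2])
one_hot = np.eye(n_vocab)[token_ids]          # (n_ctx, n_vocab), a single 1 per row

via_matmul = one_hot @ W_E                    # the "one-hot times matrix" description
via_lookup = W_E[token_ids]                   # what an implementation typically does
```

Writing the input as one-hot vectors is what lets the embedding be folded into matrix products with the other weights during circuit analysis.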

How do the query-key and output value circuits differ in what they explain?

The query-key circuit explains which tokens attend to which others. It comes from projecting the residual stream into keys (via WK) and queries (via WQ), then forming attention scores (via dot products) and applying softmax to get attention weights across positions. The output value circuit explains what attending to a token does to the output logits: it chains embedding (WE), value projection (WV), output projection (WO), and unembedding (WU) into a vocabulary-by-vocabulary linear effect that quantifies logit changes.

Why does treating the attention pattern as “fixed” make the model easier to analyze?

With a fixed attention pattern, the remaining computation becomes linear. The transcript describes that if attention weights are assumed constant for a given input sequence, then the model can be written as an identity/direct term plus a sum of linear circuit terms. This lets the analysis collapse complex expressions into interpretable matrix products, making it possible to precompute a single effective linear map for how attended tokens change logits.
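The claim can be checked numerically. The sketch below freezes a made-up causal attention pattern `A` (not one computed from queries and keys) for a single head, then compares the step-by-step computation with the collapsed form; all sizes, weights, and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, d_model, d_head, n_ctx = 20, 8, 4, 5   # toy sizes (assumed)
W_E = rng.normal(size=(n_vocab, d_model))
W_U = rng.normal(size=(d_model, n_vocab))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

# Frozen attention pattern: lower triangular, each row sums to 1.
A = np.tril(rng.random((n_ctx, n_ctx)))
A = A / A.sum(axis=1, keepdims=True)

token_ids = rng.integers(0, n_vocab, size=n_ctx)
X = np.eye(n_vocab)[token_ids]                  # one-hot tokens, (n_ctx, n_vocab)

# Step by step: embed, attend and write back, unembed.
x = X @ W_E
logits_stepwise = (x + A @ (x @ W_V) @ W_O) @ W_U

# Collapsed: precompute two vocab-by-vocab tables once.
direct = W_E @ W_U                              # identity/direct term
ov = W_E @ W_V @ W_O @ W_U                      # output value circuit

def fixed_pattern_map(X):
    # With A held constant this is a linear function of X.
    return X @ direct + A @ X @ ov

logits_collapsed = fixed_pattern_map(X)
```

The two results agree up to floating-point error. The nonlinearity of the real model lives entirely in how `A` is computed from the tokens, which is the part this sketch deliberately holds constant.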

What does the vocabulary-by-vocabulary output value matrix mean operationally?

It acts like a mechanical “logit bump table.” If the effective matrix entry for (source token → target vocabulary word) is 0.2, attending to that source token with weight 1 would increase the target word’s logit by 0.2, and thus raise its probability after softmax if the other logits stay the same. A smaller attention weight scales the bump down proportionally. The transcript gives intuition using multiplicative probability effects (e.g., adding 1 to a logit multiplies that word’s probability relative to every other word by e, about 2.72), emphasizing that the matrix entries translate directly into logit shifts.
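The multiplicative effect is easy to verify on a toy example. The logit values below are invented for illustration; only the 0.2 bump comes from the text above.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.0, 0.5, -0.3, 2.0])   # made-up logits for a 4-word vocabulary
target, other = 2, 0

bumped = logits.copy()
bumped[target] += 0.2                       # the bump read off the table

p_before = softmax(logits)
p_after = softmax(bumped)

# The target's probability relative to any other word scales by exp(bump).
ratio_change = (p_after[target] / p_after[other]) / (p_before[target] / p_before[other])
expected_change = np.exp(0.2)
```

The absolute probability of the target rises by a smaller factor than `exp(0.2)` because softmax renormalizes, but its ratio to every other word changes by exactly that factor.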

Review Questions

  1. In the one-layer attention-only Transformer, which learned matrices determine (a) attention weights across tokens and (b) how attended information changes vocabulary logits?
  2. How does the identity/direct path in the circuit decomposition relate to the zero-layer Transformer’s bigram-like behavior?
  3. What changes in the analysis when positional embeddings are included rather than ignored?

Key Points

  1. Residual connections make Transformer blocks act like additive corrections to a baseline representation, which simplifies circuit interpretation.
  2. A zero-layer Transformer provides a baseline embedding-to-unembedding path that resembles bigram statistics and appears as a direct term in deeper models.
  3. A one-layer attention-only Transformer removes MLPs and (for analysis) layer norms and biases to make the computation decomposable into interpretable parts.
  4. The query-key circuit determines the attention pattern across token positions by projecting the residual stream into keys and queries and applying softmax to dot-product scores.
  5. The output value circuit determines the logit impact of attending to a token, collapsing learned projections into a vocabulary-by-vocabulary linear effect.
  6. Assuming the attention pattern is fixed turns the remaining computation into a linear map, enabling a clean decomposition into a direct term plus attention-driven correction terms.
  7. The next analytical step is to verify whether these predicted circuit behaviors show up in trained one-layer attention-only models before extending to multi-layer interactions.

Highlights

Attention can be understood as two separable jobs: selecting source tokens (query-key) and deciding how those sources shift output logits (output value).
The embedding-to-unembedding “direct path” behaves like a bigram-like baseline and persists as a term even when attention is added.
With attention treated as fixed, the model’s effect becomes linear, letting circuit terms be expressed as matrix products with direct logit interpretations.
The query-key matrix functions like an affinity lookup table for token-to-token attention scores (before softmax).
