How might LLMs store facts | Deep Learning Chapter 7

3Blue1Brown · 6 min read

Based on 3Blue1Brown's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Facts can be represented as directions in a high-dimensional embedding space, and transformer MLP blocks can transform those directions into new feature directions.

Briefing

Large language models don’t just “know” facts in a vague sense—those facts can be traced to specific internal computations, especially inside the multilayer perceptron (MLP) parts of transformer networks. A concrete toy example uses the prompt “Michael Jordan plays the sport of blank” to show how an MLP block could, in principle, turn a representation of “Michael Jordan” into an added direction corresponding to “basketball.” Even though a full mechanistic map of real-world fact storage remains unsolved, the evidence discussed points to a broad pattern: the most useful capacity for storing and transforming factual associations appears concentrated in the MLP sub-networks rather than the attention mechanism alone.

The explanation starts with how transformer vectors encode meaning in a high-dimensional space. Different directions in that space can represent different features. Under simplifying assumptions, one direction corresponds to the first name “Michael,” another nearly perpendicular direction corresponds to the last name “Jordan,” and a third direction corresponds to “basketball.” If an internal vector aligns with both “Michael” and “Jordan,” then the MLP’s computations can be arranged so that the “basketball” direction gets injected into the output.
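As a rough sketch of that setup (the directions and names here are made up for illustration, not taken from a real model), one can build unit vectors for the three features and check dot products against a vector meant to encode the full name:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512  # illustrative embedding dimension; real models use thousands

def unit(v):
    return v / np.linalg.norm(v)

# Hypothetical feature directions; random high-dimensional vectors are
# already nearly perpendicular to one another.
michael    = unit(rng.standard_normal(d))
jordan     = unit(rng.standard_normal(d))
basketball = unit(rng.standard_normal(d))

x = michael + jordan             # a vector meant to encode the full name

print(round(x @ michael, 2))     # ~1.0: aligned with "Michael"
print(round(x @ jordan, 2))      # ~1.0: aligned with "Jordan"
print(round(x @ basketball, 2))  # ~0.0: no "basketball" component yet
```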

Inside an MLP block, each token’s vector is processed independently in parallel. The core computation is two matrix multiplications with biases and a nonlinearity in between. First comes an “up projection” matrix: it linearly mixes the input vector into a larger hidden space. Then a ReLU activation clips negative values to zero, turning a fuzzy linear score into a more gate-like signal. With a carefully chosen bias, a hidden neuron can behave like an AND gate: it activates only when the input vector simultaneously indicates “Michael” and “Jordan,” while other partial matches (like “Michael” with a different last name) fail to activate it. Finally, a “down projection” matrix maps the activated hidden units back into the original embedding dimension. In the toy setup, one particular column in this down-projection matrix is aligned with the “basketball” direction, so activating the “Michael Jordan” neuron causes the output vector to gain the “basketball” component. The block then adds this result back to the residual stream.
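A minimal NumPy sketch of this toy construction is shown below. It is an illustration under the video's simplifying assumptions, not the model's real weights: one hand-built hidden row detects the "Michael Jordan" conjunction, one hand-built down-projection column writes the "basketball" direction back into the residual stream, and every other weight is zeroed for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 512, 2048  # illustrative embedding and hidden dimensions (hidden ~4x wider)

# Take three exactly perpendicular unit directions for "Michael", "Jordan",
# and "basketball" via a QR decomposition (the toy's simplifying assumption).
Q, _ = np.linalg.qr(rng.standard_normal((d, 3)))
michael, jordan, basketball = Q[:, 0], Q[:, 1], Q[:, 2]

# Up projection: row 0 asks "does the input align with Michael AND Jordan?"
W_up = np.zeros((h, d))
b_up = np.zeros(h)
W_up[0] = michael + jordan   # pre-activation reaches 2 only for the full name
b_up[0] = -1.0               # so a single match (score 1) is clipped by ReLU

# Down projection: column 0 writes the "basketball" direction back out.
W_down = np.zeros((d, h))
W_down[:, 0] = basketball

def mlp_block(x):
    hidden = np.maximum(0.0, W_up @ x + b_up)  # ReLU acts as the AND-like gate
    return x + W_down @ hidden                 # residual addition

x_full = michael + jordan                        # encodes "Michael Jordan"
x_part = michael                                 # "Michael" with some other surname
print(round(mlp_block(x_full) @ basketball, 2))  # 1.0: "basketball" was added
print(round(mlp_block(x_part) @ basketball, 2))  # 0.0: the gate never opened
```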

The transcript also grounds the toy story in scale. For GPT-3, the up and down projection matrices together contribute roughly 1.2 billion parameters per MLP block, and the 96 separate MLPs bring the total MLP parameter count to about 116 billion—around two-thirds of the model’s 175 billion parameters. Normalization-related parameters exist too, but they’re described as a small fraction.
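Those figures are easy to check with back-of-the-envelope arithmetic, assuming GPT-3's published dimensions (a 12,288-dimensional embedding and MLP hidden layers four times wider); the sketch below is only that arithmetic, not anything measured from the model:

```python
# Back-of-the-envelope check of the GPT-3 figures quoted above.
d_model  = 12_288          # GPT-3 embedding dimension
d_hidden = 4 * d_model     # each MLP's hidden layer is four times wider (49,152)
n_blocks = 96

per_block = 2 * d_model * d_hidden + d_hidden + d_model   # up + down weights + biases
total_mlp = n_blocks * per_block

print(f"{per_block / 1e9:.2f}B parameters per MLP block")     # ~1.21B
print(f"{total_mlp / 1e9:.0f}B parameters across all MLPs")   # ~116B
print(f"{total_mlp / 175e9:.0%} of the 175B total")           # ~66%
```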

A key twist comes from interpretability research: individual neurons may not correspond to single clean concepts like “Michael Jordan.” Instead, a “superposition” hypothesis suggests that many features share the same neurons by using nearly orthogonal directions in high-dimensional space. In high dimensions, it becomes possible to pack exponentially more nearly perpendicular feature directions than the raw dimensionality would suggest, which helps explain why models can scale well while remaining hard to interpret. In that view, features are distributed across combinations of neurons rather than appearing as a single, obvious activation; tools like sparse autoencoders are mentioned as a way to tease out these hidden features. The next major step promised is the training process—backpropagation, language-model loss functions, reinforcement learning with human feedback, and scaling laws.
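As a rough numerical illustration of that packing claim (this is not the video's own demo), one can sample random unit directions in increasing dimensions and watch their pairwise angles cluster ever more tightly around 90 degrees:

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (3, 100, 10_000):
    v = rng.standard_normal((1000, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    cos = v[:500] @ v[500:].T                    # cosines between pairs of directions
    angles = np.degrees(np.arccos(np.clip(cos, -1, 1)))
    print(d, round(angles.mean(), 1), round(angles.std(), 1))

# As the dimension grows, the spread around 90 degrees shrinks, leaving room
# to pack far more "nearly perpendicular" feature directions than dimensions.
```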

Cornell Notes

The transcript explains how factual associations could be stored and retrieved inside transformer MLP blocks. Using a toy example for “Michael Jordan plays basketball,” it assumes specific directions in a high-dimensional embedding space represent “Michael,” “Jordan,” and “basketball.” An MLP block then uses an up-projection matrix, a ReLU nonlinearity (with a bias tuned to act like an AND gate), and a down-projection matrix so that when the internal representation matches “Michael Jordan,” the output gains the “basketball” direction. While this is a clean mathematical construction, real models likely use superposition: many features share neurons via nearly orthogonal directions, making individual neurons less interpretable. This helps reconcile strong performance with interpretability challenges and motivates methods like sparse autoencoders.

How does the toy “Michael Jordan → basketball” example map facts into transformer vectors?

It assumes the embedding space contains distinct directions for features: one direction aligns with the first name “Michael,” another nearly perpendicular direction aligns with the last name “Jordan,” and a third direction aligns with “basketball.” If an internal vector has dot product 1 with both the “Michael” and “Jordan” directions, it represents the full name “Michael Jordan.” The MLP is then constructed so that this combined alignment triggers an internal neuron whose output causes the “basketball” direction to be added to the residual stream.

What computation does an MLP block perform, and why does it matter for storing associations?

Each token’s vector passes through two matrix multiplications with biases and a nonlinearity between them, then gets added back via a residual connection. The first “up projection” expands into a larger hidden space; the ReLU activation clips negative values to zero, turning linear scores into gate-like activations; the second “down projection” maps activated hidden units back to the embedding dimension. Because the down-projection columns can be aligned with particular feature directions, activated hidden units can inject specific associations (like “basketball”) into the output.

How does ReLU enable something like an AND gate in the toy construction?

The transcript describes setting a bias so that a hidden neuron’s pre-activation is positive only when the input vector aligns with the full conjunction “Michael Jordan,” not just one part. Without the nonlinearity, linear mixing would respond to partial matches too (e.g., “Michael + Phelps” could partially trigger). ReLU then clips negative values to zero, so only the conjunction produces a nonzero activation, mimicking AND behavior.
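A few lines of arithmetic make the gating concrete. The inputs below are hypothetical alignment scores (dot products with the "Michael" and "Jordan" directions), and the bias of -1 is the hand-picked threshold from the toy construction:

```python
def neuron(michael_score, jordan_score, use_relu=True):
    pre = michael_score + jordan_score - 1.0   # linear mix plus a bias of -1
    return max(0.0, pre) if use_relu else pre

# With ReLU, the hidden unit acts like an AND gate on the two alignments:
print(neuron(1.0, 1.0))                  #  1.0 -> "Michael Jordan" fires
print(neuron(1.0, 0.0))                  #  0.0 -> "Michael" + another surname does not
print(neuron(0.0, 1.0))                  #  0.0 -> some other Jordan does not

# Without the nonlinearity, the same linear score drifts with partial matches:
print(neuron(0.6, 0.5, use_relu=False))  #  0.1 -> a fuzzy partial match leaks through
print(neuron(0.0, 0.0, use_relu=False))  # -1.0 -> and unrelated inputs go negative
```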

Why might real neurons fail to represent single clean facts like “Michael Jordan”?

The superposition hypothesis suggests neurons can participate in representing many features simultaneously. In high-dimensional spaces, features can be encoded using directions that are nearly orthogonal rather than perfectly independent. That allows many features to be packed into the same neuron activations, so a neuron’s activity becomes a mixture of multiple concepts rather than a single interpretable one.

What role do high-dimensional geometry results play in superposition?

The transcript notes that if features were exactly orthogonal, the number of independent directions would be limited by the dimension n. But if near-orthogonality is allowed (angles roughly between 89 and 91 degrees), the number of nearly perpendicular vectors that can be packed grows dramatically with dimension. It references the Johnson–Lindenstrauss lemma as a reason the capacity can increase exponentially with dimension, helping explain how models can store many ideas without needing one neuron per idea.

How much of GPT-3’s parameter budget sits in MLPs, and why is that relevant?

For GPT-3, the up and down projection matrices together contribute about 1.2 billion parameters per MLP block. With 96 MLPs, the total MLP parameter count is about 116 billion, roughly two-thirds of the model’s 175 billion parameters. Concentrating capacity in MLPs supports the broader claim that factual associations may be stored or transformed primarily within these blocks.

Review Questions

  1. In the toy model, what specific sequence of operations in an MLP makes “Michael Jordan” activate a hidden unit that then adds the “basketball” direction?
  2. What is the difference between exact orthogonality and near-orthogonality in the superposition argument, and how does that change the number of features that can be represented?
  3. Why does the transcript suggest that interpretability methods like sparse autoencoders are needed if features are in superposition?

Key Points

  1. Facts can be represented as directions in a high-dimensional embedding space, and transformer MLP blocks can transform those directions into new feature directions.
  2. An MLP block’s core computation is an up projection (matrix multiply + bias), a ReLU (or similar) nonlinearity, then a down projection (matrix multiply + bias) followed by a residual addition.
  3. With suitable biases and assumptions, ReLU can turn linear “partial match” signals into gate-like behavior that resembles an AND condition for a conjunction such as “Michael Jordan.”
  4. Parameter capacity is heavily concentrated in MLPs: in GPT-3, MLPs account for about 116B of 175B parameters (around two-thirds).
  5. A superposition hypothesis suggests neurons rarely correspond to single clean concepts; instead, many features overlap in neuron activations using nearly orthogonal directions.
  6. High-dimensional geometry (including ideas related to the Johnson–Lindenstrauss lemma) helps explain why near-orthogonal feature directions can be packed far beyond the raw dimensionality.
  7. If features are superimposed, interpretability requires tools that can disentangle distributed representations, such as sparse autoencoders.

Highlights

A clean toy mechanism shows how an MLP could inject a “basketball” direction into the residual stream when internal vectors encode “Michael Jordan.”
The MLP computation is two matrix multiplications with biases and a ReLU in between; the nonlinearity is what makes conjunction-like gating possible.
In GPT-3, MLPs hold about 116 billion parameters—roughly two-thirds of the full 175 billion—making them prime candidates for where factual associations live.
Superposition reframes interpretability: neurons may encode mixtures of many features rather than one concept, because high-dimensional spaces allow massive near-orthogonal packing.
The transcript connects scaling and interpretability difficulty to geometry: near-perpendicular directions can multiply the number of storable ideas as dimension grows.

Topics

  • MLP Fact Storage
  • Transformer Internals
  • ReLU Gating
  • Superposition
  • GPT-3 Parameters

Mentioned

  • Neel Nanda
  • MLP
  • ReLU
  • GELU
  • GPT-3