
What is Multi-head Attention in Transformers | Multi-head Attention v Self Attention | Deep Learning

CampusX · 5 min read

Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Self-attention produces contextual embeddings by mixing value vectors using softmax-normalized similarity scores from query-key dot products.

Briefing

Multi-head attention is presented as the fix for a key limitation of self-attention: a single attention pass tends to lock onto only one interpretation of a sentence, even when multiple meanings are plausible. The transcript uses the ambiguous sentence “The man saw the astronaut with a telescope” to show how self-attention can struggle to represent both readings—either the man used a telescope, or the astronaut had the telescope—because it produces one set of similarity relationships across words.

Self-attention is first recapped as a mechanism for generating contextual embeddings. Static word embeddings can’t distinguish meanings that depend on surrounding words (the “bank” in “money bank” versus “river bank”), so self-attention builds context-aware representations by comparing every word to every other word. Internally, it forms query (Q), key (K), and value (V) vectors from word embeddings using learned weight matrices. For each word pair, it computes a dot-product similarity between the query of one word and the key of another, scales the scores, normalizes them with softmax to get weights, and then uses those weights to mix the value vectors—repeating this across all words to produce contextual outputs.
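The pipeline described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the transcript's code; the matrix sizes and random initialization are assumptions for demonstration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.
    X: (seq_len, d_model) word embeddings; Wq/Wk/Wv: learned projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # form query/key/value vectors
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # dot-product similarity, scaled
    weights = softmax(scores, axis=-1)     # softmax-normalized attention weights
    return weights @ V                     # weighted mix of the value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                # e.g., 5 words, embedding dim 8 (assumed)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)                           # one contextual vector per word
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the value vectors, which is exactly the "mixing" the transcript describes.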

The transcript then argues that this single perspective is the bottleneck: self-attention effectively outputs one “attention table” of word-to-word similarities, which can miss alternative interpretations that require different relational patterns. That’s where multi-head attention enters. Instead of one set of Q/K/V projections, multi-head attention runs multiple self-attention modules in parallel—each called a “head”—so each head can focus on different relationships within the same sentence.

A concrete two-head example is built around the same ambiguous sentence. With two heads, the model creates two separate sets of Q/K/V vectors (two different learned projections). Each head produces its own contextual representation (e.g., one head yields one version of the contextual vector for “man,” another yields a different version). These head-specific outputs are then concatenated and passed through a linear transformation to return to the original embedding dimension. The transcript emphasizes that the final output is a learned mixture of perspectives, with the linear layer balancing how much each head contributes.
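The two-head construction can be sketched directly: each head is an independent self-attention pass with its own projections, and the head outputs are concatenated and linearly mixed. The dimensions below (model size 8, head size 4, two heads) are illustrative assumptions, not values from the transcript.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head(X, Wq, Wk, Wv):
    # One self-attention head: scaled dot-product attention.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    w = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return w @ V

def multi_head(X, head_params, Wo):
    # Run each head independently, concatenate, then project back to d_model.
    outs = [head(X, Wq, Wk, Wv) for Wq, Wk, Wv in head_params]
    return np.concatenate(outs, axis=-1) @ Wo

rng = np.random.default_rng(1)
d_model, d_head, n_heads, seq = 8, 4, 2, 7     # assumed toy sizes
X = rng.normal(size=(seq, d_model))
head_params = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
               for _ in range(n_heads)]        # separate Q/K/V per head
Wo = rng.normal(size=(n_heads * d_head, d_model))  # learned output projection
print(multi_head(X, head_params, Wo).shape)    # back to (seq, d_model)
```

`Wo` is the linear layer the transcript mentions: its rows span all heads, so training can learn how much each head's perspective contributes to the final representation.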

Finally, the transcript connects the explanation to the original Transformer implementation details: embeddings are projected from a larger dimension (e.g., 512) down to a smaller per-head dimension (e.g., 64) to reduce computation, then processed by multiple heads, concatenated back, and projected again to the model’s full size. A visualization from Google APIs is used to make the intuition tangible: different heads show different strongest word-to-word similarities, with one head highlighting the “telescope with man” interpretation and another highlighting “telescope with astronaut.” The takeaway is that multi-head attention preserves self-attention’s contextual power while increasing the chance of capturing multiple meanings in parallel.
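The dimension bookkeeping from the original Transformer can be checked with simple arithmetic (512 and 64 are the figures the transcript cites; the 8-head count follows from them):

```python
# Original Transformer sizes: d_model = 512 split across 8 heads.
d_model, n_heads = 512, 8
d_head = d_model // n_heads   # per-head dimension: 512 / 8 = 64

# Each head projects 512 -> 64, so concatenating 8 heads gives 8 * 64 = 512,
# which matches d_model; the final linear layer then maps 512 -> 512.
assert n_heads * d_head == d_model
print(d_head)  # 64
```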

Cornell Notes

Self-attention builds contextual embeddings by turning each word embedding into query (Q), key (K), and value (V) vectors, then using scaled dot-product similarities and softmax weights to mix values. A key weakness is that one attention pass tends to capture only a single relational “perspective,” which can fail on ambiguous sentences with multiple plausible interpretations. Multi-head attention addresses this by running several self-attention heads in parallel, each with its own learned Q/K/V projections, so different heads can focus on different word-to-word relationships. The head outputs are concatenated and linearly transformed back to the original embedding size, producing a learned mixture of perspectives. This design also reduces computation by projecting from a larger model dimension (e.g., 512) down to a smaller per-head dimension (e.g., 64).

Why do static word embeddings struggle with meaning in context?

Static embeddings assign the same vector to a word everywhere, so they can’t represent how meaning shifts with neighbors. The transcript’s example is “money bank” versus “river bank”: both use the word “bank,” but the semantic relationship differs. Self-attention fixes this by producing contextual embeddings where the representation of “bank” changes depending on which other words it attends to.

How does self-attention compute contextual embeddings from word embeddings?

For each word, learned weight matrices produce query (Q), key (K), and value (V) vectors. For each word pair, self-attention takes the dot product of one word's query with the other word's key to get a similarity score, scales it, applies softmax to obtain attention weights, and then forms a weighted sum of the value vectors. Repeating this across all words yields contextual outputs that reflect relationships in the sentence.
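A small worked example makes the weighting step concrete. The scores and value vectors below are made-up numbers, assuming a key dimension of 4, chosen only to show how softmax turns similarities into mixing weights.

```python
import numpy as np

# Hypothetical similarity scores of one word's query against three keys.
scores = np.array([2.0, 1.0, 0.1])
d_k = 4
scaled = scores / np.sqrt(d_k)                 # scale before softmax
weights = np.exp(scaled) / np.exp(scaled).sum()  # softmax: weights sum to 1

# Three hypothetical value vectors; the output is their weighted sum.
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
context = weights @ V                          # the word's contextual vector
print(weights.round(3), context.round(3))
```

The largest score gets the largest weight, so the contextual vector leans toward the value of the most similar word while still blending in the others.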

What specific limitation of self-attention is illustrated by “The man saw the astronaut with a telescope”?

The sentence has two plausible interpretations: (1) the man used a telescope to see the astronaut, or (2) the astronaut had the telescope. The transcript claims a single self-attention pass can effectively capture only one dominant relational pattern because it produces one attention table of similarities, making it less likely to represent multiple perspectives simultaneously.

How does multi-head attention change the computation to capture multiple interpretations?

Multi-head attention replaces one set of Q/K/V projections with multiple heads—parallel self-attention modules. Each head has its own learned projection matrices, so each head generates its own Q/K/V and its own contextual representation. In the two-head example, one head can strongly connect “man” with “telescope,” while another can strongly connect “astronaut” with “telescope.”

What happens after the heads produce their outputs?

Each head outputs contextual vectors for each word. These are concatenated (so the combined representation includes information from all heads) and then passed through a linear transformation to return to the original embedding dimension. The linear layer learns how to mix the different head perspectives into a single final representation.
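The shape changes in the concatenate-and-project step can be traced explicitly. This sketch assumes the original Transformer sizes (8 heads of dimension 64, model dimension 512) and uses random arrays in place of real head outputs.

```python
import numpy as np

rng = np.random.default_rng(2)
seq, d_head, n_heads, d_model = 6, 64, 8, 512

# Hypothetical per-head outputs: n_heads arrays of shape (seq, d_head).
head_outputs = [rng.normal(size=(seq, d_head)) for _ in range(n_heads)]

concat = np.concatenate(head_outputs, axis=-1)     # (seq, 8 * 64) = (6, 512)
Wo = rng.normal(size=(n_heads * d_head, d_model))  # learned mixing matrix
final = concat @ Wo                                # (seq, d_model): model size restored
print(concat.shape, final.shape)
```

Because `Wo` multiplies the full concatenation, every output coordinate can draw on all heads at once, which is what "learning how to mix the perspectives" means in practice.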

Review Questions

  1. In self-attention, what roles do the query, key, and value vectors play in forming contextual embeddings?
  2. Why does multi-head attention increase the chance of capturing multiple meanings in ambiguous sentences?
  3. How does the model reduce computation by projecting from a larger dimension (like 512) down to a smaller per-head dimension (like 64)?

Key Points

  1. Self-attention produces contextual embeddings by mixing value vectors using softmax-normalized similarity scores from query-key dot products.
  2. Static embeddings can’t represent context-dependent meaning (e.g., “bank” in “money bank” vs “river bank”).
  3. A single self-attention pass can miss alternative interpretations because it tends to output one dominant word-to-word similarity pattern.
  4. Multi-head attention runs multiple self-attention heads in parallel, each with its own learned Q/K/V projections, enabling different heads to focus on different relationships.
  5. Head outputs are concatenated and then linearly transformed back to the model’s original embedding dimension to form a learned mixture of perspectives.
  6. Transformer implementations often project from a larger embedding size (e.g., 512) down to a smaller per-head size (e.g., 64) to reduce computation before applying attention per head.

Highlights

The ambiguous sentence “The man saw the astronaut with a telescope” is used to show how one attention perspective can favor only one reading.
Multi-head attention captures multiple perspectives by giving each head its own Q/K/V projections and letting heads specialize in different word relationships.
After attention, head outputs are concatenated and passed through a linear layer to return to the original embedding size, effectively mixing interpretations.
A visualization demonstrates that different heads can show different strongest similarity links—one aligning “man” with “telescope,” another aligning “astronaut” with “telescope.”
