What is Multi-head Attention in Transformers | Multi-head Attention v Self Attention | Deep Learning
Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Self-attention produces contextual embeddings by mixing value vectors using softmax-normalized similarity scores from query-key dot products.
Briefing
Multi-head attention is presented as the fix for a key limitation of self-attention: a single attention pass tends to lock onto only one interpretation of a sentence, even when multiple meanings are plausible. The transcript uses the ambiguous sentence “The man saw the astronaut with a telescope” to show how self-attention can struggle to represent both readings—either the man used a telescope, or the astronaut had the telescope—because it produces one set of similarity relationships across words.
Self-attention is first recapped as a mechanism for generating contextual embeddings. Static word embeddings can’t distinguish meanings that depend on surrounding words (the “bank” in “money bank” versus “river bank”), so self-attention builds context-aware representations by comparing every word to every other word. Internally, it forms query (Q), key (K), and value (V) vectors from word embeddings using learned weight matrices. For each word pair, it computes a dot-product similarity between the query of one word and the key of another, scales the scores, normalizes them with softmax to get weights, and then uses those weights to mix the value vectors—repeating this across all words to produce contextual outputs.
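To make that recap concrete, here is a minimal NumPy sketch of a single self-attention pass. The weight-matrix names (W_q, W_k, W_v), the toy embedding size, and the random inputs are illustrative assumptions, not values from the transcript.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """One self-attention pass over X with shape (seq_len, d_model)."""
    Q = X @ W_q                      # one query vector per word
    K = X @ W_k                      # one key vector per word
    V = X @ W_v                      # one value vector per word
    d_k = K.shape[-1]
    # Dot-product similarity between every query and every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # one row of attention weights per word
    return weights @ V                  # weighted mix of the value vectors

# Toy example: 8 "words" with 16-dimensional embeddings (illustrative sizes).
rng = np.random.default_rng(0)
d_model = 16
X = rng.normal(size=(8, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
context = self_attention(X, W_q, W_k, W_v)
print(context.shape)  # (8, 16): one contextual vector per word
```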
The transcript then argues that this single perspective is the bottleneck: self-attention effectively outputs one “attention table” of word-to-word similarities, which can miss alternative interpretations that require different relational patterns. That’s where multi-head attention enters. Instead of one set of Q/K/V projections, multi-head attention runs multiple self-attention modules in parallel—each called a “head”—so each head can focus on different relationships within the same sentence.
A concrete two-head example is built around the same ambiguous sentence. With two heads, the model creates two separate sets of Q/K/V vectors (two different learned projections). Each head produces its own contextual representation (e.g., one head yields one version of the contextual vector for “man,” another yields a different version). These head-specific outputs are then concatenated and passed through a linear transformation to return to the original embedding dimension. The transcript emphasizes that the final output is a learned mixture of perspectives, with the linear layer balancing how much each head contributes.
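The two-head construction can be sketched the same way, continuing from the self_attention function in the previous snippet. The per-head dimension, the output-projection matrix W_o, and the sizes below are assumptions chosen only to illustrate the concatenate-then-project step.

```python
import numpy as np  # assumes softmax and self_attention from the previous sketch are in scope

def multi_head_attention(X, head_weights, W_o):
    """head_weights: list of (W_q, W_k, W_v) tuples, one tuple per head."""
    # Each head runs its own self-attention with its own learned projections.
    head_outputs = [self_attention(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in head_weights]
    # Concatenate the per-head contextual vectors along the feature axis...
    concat = np.concatenate(head_outputs, axis=-1)   # (seq_len, num_heads * d_head)
    # ...then mix them with a learned linear layer back to the model dimension.
    return concat @ W_o                              # (seq_len, d_model)

# Two heads, each projecting 16-dim embeddings down to 8 dims (illustrative sizes).
rng = np.random.default_rng(1)
d_model, d_head, num_heads = 16, 8, 2
X = rng.normal(size=(6, d_model))
head_weights = [
    tuple(rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(3))
    for _ in range(num_heads)
]
W_o = rng.normal(size=(num_heads * d_head, d_model)) * 0.1
out = multi_head_attention(X, head_weights, W_o)
print(out.shape)  # (6, 16): same shape as the input embeddings
```

The learned W_o plays the role the transcript describes: it decides how much each head's perspective contributes to the final contextual vector.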
Finally, the transcript connects the explanation to the original Transformer implementation details: embeddings are projected from a larger dimension (e.g., 512) down to a smaller per-head dimension (e.g., 64) to reduce computation, then processed by multiple heads, concatenated back, and projected again to the model’s full size. A visualization from Google APIs is used to make the intuition tangible: different heads show different strongest word-to-word similarities, with one head highlighting the “telescope with man” interpretation and another highlighting “telescope with astronaut.” The takeaway is that multi-head attention preserves self-attention’s contextual power while increasing the chance of capturing multiple meanings in parallel.
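As a quick sanity check on those numbers (assuming the original Transformer's 8 heads, which is what makes the 512-to-64 split work out), the dimension bookkeeping looks like this:

```python
# Dimension bookkeeping for the sizes quoted above.
# 8 heads is the standard choice implied by 512 / 64; it is an assumption here.
d_model = 512                  # full embedding size entering the block
num_heads = 8
d_head = d_model // num_heads  # 64: per-head Q/K/V size after the down-projection
concat_dim = num_heads * d_head
assert d_head == 64
assert concat_dim == d_model   # concatenating the heads restores the 512-dim width
# The final linear layer then maps the 512-dim concatenation back to d_model = 512.
print(d_head, concat_dim)      # 64 512
```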
Cornell Notes
Self-attention builds contextual embeddings by turning each word embedding into query (Q), key (K), and value (V) vectors, then using scaled dot-product similarities and softmax weights to mix values. A key weakness is that one attention pass tends to capture only a single relational “perspective,” which can fail on ambiguous sentences with multiple plausible interpretations. Multi-head attention addresses this by running several self-attention heads in parallel, each with its own learned Q/K/V projections, so different heads can focus on different word-to-word relationships. The head outputs are concatenated and linearly transformed back to the original embedding size, producing a learned mixture of perspectives. This design also reduces computation by projecting from a larger model dimension (e.g., 512) down to a smaller per-head dimension (e.g., 64).
Why do static word embeddings struggle with meaning in context?
How does self-attention compute contextual embeddings from word embeddings?
What specific limitation of self-attention is illustrated by “The man saw the astronaut with a telescope”?
How does multi-head attention change the computation to capture multiple interpretations?
What happens after the heads produce their outputs?
Review Questions
- In self-attention, what roles do the query, key, and value vectors play in forming contextual embeddings?
- Why does multi-head attention increase the chance of capturing multiple meanings in ambiguous sentences?
- How does the model reduce computation by projecting from a larger dimension (like 512) down to a smaller per-head dimension (like 64)?
Key Points
1. Self-attention produces contextual embeddings by mixing value vectors using softmax-normalized similarity scores from query-key dot products.
2. Static embeddings can’t represent context-dependent meaning (e.g., “bank” in “money bank” vs “river bank”).
3. A single self-attention pass can miss alternative interpretations because it tends to output one dominant word-to-word similarity pattern.
4. Multi-head attention runs multiple self-attention heads in parallel, each with its own learned Q/K/V projections, enabling different heads to focus on different relationships.
5. Head outputs are concatenated and then linearly transformed back to the model’s original embedding dimension to form a learned mixture of perspectives.
6. Transformer implementations often project from a larger embedding size (e.g., 512) down to a smaller per-head size (e.g., 64) to reduce computation before applying attention per head.