Why is Self Attention called "Self"? | Self Attention Vs Luong Attention in Depth Lecture | CampusX
Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Self-attention is named for computing attention scores within one sequence rather than between two different sequences.
Briefing
Self-attention gets its name because it computes attention scores within a single sequence—using the same tokens as both the “source” and the “target”—rather than aligning two different sequences like classic encoder–decoder attention. That intra-sequence setup is the key difference from earlier attention variants (such as Luong-style and Bahdanau-style), and it matters because it lets Transformers decide, for every token, which other tokens in the same sentence or context are most relevant.
The walkthrough starts by revisiting why attention exists at all. In the older sequence-to-sequence encoder–decoder approach with LSTMs, the encoder compresses an entire input sentence into one fixed-size context vector (often described as a summary of hidden states). The decoder then generates the output step by step using only that single vector. This breaks down when inputs get long: once the sentence exceeds roughly 30 words, squeezing all information into one vector degrades translation quality. Attention fixes the bottleneck by letting the decoder, at each output time step, build a context vector as a weighted combination of encoder hidden states—where the weights reflect which input tokens are useful for the current output token.
The transcript then details how those weights are computed. For each decoder step i and each encoder position j, an alignment score eᵢⱼ is calculated using a similarity measure (in Luong attention, via a dot product between decoder hidden state and encoder hidden states). Those alignment scores are normalized with a softmax to produce attention weights αᵢⱼ. The context vector for that decoder step becomes a weighted sum of encoder hidden states using these α values.
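The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's code: the dimensions (5 encoder positions, hidden size 4) and the random hidden states are hypothetical stand-ins for what an LSTM encoder and decoder would actually produce.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 5 encoder hidden states and one decoder
# hidden state s_i at decoding step i (hidden size 4).
h = rng.normal(size=(5, 4))   # encoder hidden states h_1 .. h_5
s = rng.normal(size=(4,))     # decoder hidden state s_i

# Alignment scores e_ij = s_i . h_j (Luong dot-product scoring).
e = h @ s                     # shape (5,)

# Softmax over encoder positions -> attention weights alpha_ij.
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# Context vector c_i for this decoder step: weighted sum of
# encoder hidden states using the alpha weights.
c = alpha @ h                 # shape (4,)
```

Because the weights come out of a softmax, they always sum to 1, so `c` is a convex combination of the encoder states, weighted toward the input tokens most similar to the current decoder state.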
From there, self-attention is presented as the same core math—queries, keys, alignment scores, softmax, and weighted sums—but applied differently. Instead of having separate encoder and decoder sequences, self-attention forms three learned projections from the same token embeddings: a Query (Q), a Key (K), and a Value (V) for each token. For a given token position, its query is compared (via dot products) against all keys in the same sequence to produce similarity scores. After softmax normalization, the resulting weights are used to take a weighted sum of the corresponding values, yielding the contextualized representation for that token.
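A single-token sketch of that pipeline, assuming hypothetical sizes (4 tokens, embedding size 8, projection size 6) and random matrices in place of the learned Q/K/V projections:

```python
import numpy as np

rng = np.random.default_rng(1)

n, d_model, d_k = 4, 8, 6                # hypothetical sizes
X = rng.normal(size=(n, d_model))        # token embeddings (one sequence)

# Random stand-ins for the three learned projection matrices.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v      # per-token queries, keys, values

# Attention for token 0: compare its query against ALL keys
# in the same sequence (dot-product similarity).
scores = K @ Q[0]                        # shape (n,)
weights = np.exp(scores - scores.max())  # softmax normalization
weights /= weights.sum()

# Contextualized representation of token 0: weighted sum of values.
z0 = weights @ V                         # shape (d_k,)
```

Note the transcript describes plain dot products; the full Transformer additionally divides the scores by √d_k before the softmax, a detail this sketch omits.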
Finally, the naming logic is made explicit. In Luong/Bahdanau attention, alignment scores are computed between two different sequences (e.g., English tokens vs. Hindi tokens). In self-attention, the alignment scores are computed between tokens of the same sequence. Because the “attention” is calculated intra-sequence—token-to-token within one sentence—the mechanism is called self-attention. The transcript also emphasizes that this understanding becomes foundational for later Transformer components like multi-head attention and positional encoding.
Cornell Notes
Self-attention is called “self” because it computes attention scores within a single sequence: each token attends to other tokens in the same input. The mechanism uses the same attention pipeline as earlier encoder–decoder attention—alignment scores, softmax normalization, and a weighted sum—but replaces cross-sequence alignment with intra-sequence similarity.
The transcript explains that earlier seq2seq models relied on a single fixed context vector, which fails for long inputs (quality drops for sentences beyond about 30 words). Attention fixes this by building a context vector at each decoding step using weights over encoder hidden states.
In self-attention, each token embedding is projected into Query (Q), Key (K), and Value (V). A token’s query is dotted with all keys to produce alignment scores, softmax turns them into weights, and those weights combine the values to produce the token’s contextual representation.
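In practice the per-token computation is done for every position at once with matrix products. A compact sketch, again with hypothetical dimensions and random matrices standing in for learned projections:

```python
import numpy as np

rng = np.random.default_rng(2)

n, d_model, d_k = 4, 8, 6                # hypothetical sizes
X = rng.normal(size=(n, d_model))        # embeddings for one sequence
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# All alignment scores at once: entry (i, j) compares token i's
# query with token j's key -- every token vs. every token.
scores = Q @ K.T                         # shape (n, n)

# Row-wise softmax: each token's weights over the whole sequence.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# Contextual representations for all n tokens in one step.
Z = weights @ V                          # shape (n, d_k)
```

The (n, n) score matrix makes the "self" in self-attention concrete: both axes index the same sequence.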
- Why did encoder–decoder models with a single context vector struggle with longer sentences?
- How do attention weights αᵢⱼ get computed in Luong-style attention?
- What changes when moving from Luong/Bahdanau attention to self-attention?
- How does self-attention produce the contextual vector for a token position?
- Why is self-attention called “self” rather than just “attention”?
Review Questions
- In the encoder–decoder setup, what bottleneck arises from using a single fixed context vector, and how does attention remove it?
- Describe the sequence of operations in self-attention (Q·K, softmax, weighted sum of V) and specify what Q, K, and V come from.
- Explain the difference between cross-attention and self-attention in terms of which tokens are compared to compute alignment scores.
Key Points
1. Self-attention is named for computing attention scores within one sequence rather than between two different sequences.
2. Single context-vector encoder–decoder models degrade for long inputs because they must compress all information into one fixed-size representation.
3. Attention improves translation quality by building a context vector at each decoding step as a weighted sum over encoder hidden states.
4. Luong-style attention computes alignment scores using similarity (described as dot products) between decoder hidden states and encoder hidden states, then normalizes with softmax.
5. Self-attention keeps the same attention pipeline but generates Query, Key, and Value projections from the same token embeddings.
6. For each token, self-attention compares its query against all keys in the sequence, then uses the resulting weights to combine values into a contextualized representation.