
Why is Self Attention called "Self"? | Self Attention Vs Luong Attention in Depth Lecture | CampusX

CampusX · 5 min read

Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Self-attention is named for computing attention scores within one sequence rather than between two different sequences.

Briefing

Self-attention gets its name because it computes attention scores within a single sequence—using the same tokens as both the “source” and the “target”—rather than aligning two different sequences like classic encoder–decoder attention. That intra-sequence setup is the key difference from earlier attention variants (such as Luong-style and Bahdanau-style), and it matters because it lets Transformers decide, for every token, which other tokens in the same sentence or context are most relevant.

The walkthrough starts by revisiting why attention exists at all. In the older sequence-to-sequence encoder–decoder approach with LSTMs, the encoder compresses an entire input sentence into one fixed-size context vector (often described as a summary of hidden states). The decoder then generates the output step by step using only that single vector. This breaks down when inputs get long: once the sentence exceeds roughly 30 words, squeezing all information into one vector degrades translation quality. Attention fixes the bottleneck by letting the decoder, at each output time step, build a context vector as a weighted combination of encoder hidden states—where the weights reflect which input tokens are useful for the current output token.

The transcript then details how those weights are computed. For each decoder step i and each encoder position j, an alignment score eᵢⱼ is calculated using a similarity measure (in Luong attention, via a dot product between decoder hidden state and encoder hidden states). Those alignment scores are normalized with a softmax to produce attention weights αᵢⱼ. The context vector for that decoder step becomes a weighted sum of encoder hidden states using these α values.
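To make this concrete, here is a minimal NumPy sketch of one decoder step of dot-product (Luong-style) attention. The sequence length, hidden size, and random vectors are illustrative stand-ins, not values from the lecture.

```python
import numpy as np

def softmax(x):
    x = x - x.max()          # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

# Hypothetical toy sizes: 4 encoder positions, hidden size 8.
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(4, 8))   # encoder hidden states h_1 .. h_4
decoder_state  = rng.normal(size=(8,))     # decoder hidden state s_i at step i

# Alignment scores e_ij via dot product between s_i and each h_j.
scores = encoder_states @ decoder_state    # shape (4,)

# Softmax turns the scores into attention weights alpha_ij.
alphas = softmax(scores)

# The context vector c_i is the weighted sum of encoder hidden states.
context = alphas @ encoder_states          # shape (8,)
print(alphas, context.shape)
```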

From there, self-attention is presented as the same core math—queries, keys, alignment scores, softmax, and weighted sums—but applied differently. Instead of having separate encoder and decoder sequences, self-attention forms three learned projections from the same token embeddings: a Query (Q), a Key (K), and a Value (V) for each token. For a given token position, its query is compared (via dot products) against all keys in the same sequence to produce similarity scores. After softmax normalization, the resulting weights are used to take a weighted sum of the corresponding values, yielding the contextualized representation for that token.
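A minimal sketch of the same pipeline over a whole sequence is below, assuming toy matrix sizes and random projection weights; the √dₖ scaling used in the full Transformer is not covered here and is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: 5 tokens, embedding dim 8, projection dim 8.
rng = np.random.default_rng(1)
X   = rng.normal(size=(5, 8))      # token embeddings of ONE sequence
W_q = rng.normal(size=(8, 8))      # learned Query projection (illustrative)
W_k = rng.normal(size=(8, 8))      # learned Key projection
W_v = rng.normal(size=(8, 8))      # learned Value projection

# Q, K, V all come from the same sequence X — this is the "self" part.
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores  = Q @ K.T                  # token-to-token similarity, shape (5, 5)
weights = softmax(scores, axis=-1) # each row sums to 1
output  = weights @ V              # contextualized representation per token
print(output.shape)                # (5, 8)
```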

Finally, the naming logic is made explicit. In Luong/Bahdanau attention, alignment scores are computed between two different sequences (e.g., English tokens vs. Hindi tokens). In self-attention, the alignment scores are computed between tokens of the same sequence. Because the “attention” is calculated intra-sequence—token-to-token within one sentence—the mechanism is called self-attention. The transcript also emphasizes that this understanding becomes foundational for later Transformer components like multi-head attention and positional encoding.

Cornell Notes

Self-attention is called “self” because it computes attention scores within a single sequence: each token attends to other tokens in the same input. The mechanism uses the same attention pipeline as earlier encoder–decoder attention—alignment scores, softmax normalization, and a weighted sum—but replaces cross-sequence alignment with intra-sequence similarity.

The transcript explains that earlier seq2seq models relied on a single fixed context vector, which fails for long inputs (quality drops for sentences beyond about 30 words). Attention fixes this by building a context vector at each decoding step using weights over encoder hidden states.

In self-attention, each token embedding is projected into Query (Q), Key (K), and Value (V). A token’s query is dotted with all keys to produce alignment scores, softmax turns them into weights, and those weights combine the values to produce the token’s contextual representation.

Why did encoder–decoder models with a single context vector struggle with longer sentences?

They compressed the entire input into one fixed-size context vector (a summary of encoder hidden states). When the input grows (the transcript cites beyond ~30 words), the single vector can’t carry all necessary information, so translation quality degrades. Attention addresses this by letting the decoder form a context vector at each step from multiple encoder states instead of relying on one compressed summary.

How do attention weights αᵢⱼ get computed in Luong-style attention?

For each decoder time step i and encoder position j, an alignment score eᵢⱼ is computed using a similarity function—described as a dot product between decoder hidden state and encoder hidden states. Those alignment scores are then passed through softmax to produce normalized weights αᵢⱼ. The context vector for step i is a weighted sum of encoder hidden states using these α weights.
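In symbols: eᵢⱼ = sᵢ · hⱼ, αᵢⱼ = exp(eᵢⱼ) / Σₖ exp(eᵢₖ), and cᵢ = Σⱼ αᵢⱼ hⱼ. Here sᵢ, hⱼ, and cᵢ are shorthand introduced for this summary (decoder hidden state at step i, encoder hidden state at position j, and the context vector for step i); the transcript names only eᵢⱼ and αᵢⱼ explicitly.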

What changes when moving from Luong/Bahdanau attention to self-attention?

The core math stays the same (alignment scores → softmax → weighted sum), but the source of tokens changes. Cross-attention aligns two different sequences (e.g., English vs. Hindi). Self-attention aligns tokens within the same sequence by using projections of the same token embeddings into Q, K, and V, then comparing queries to keys across positions in that same sequence.

How does self-attention produce the contextual vector for a token position?

For a token at position i, its Query vector Qᵢ is dotted with every Key vector Kⱼ in the same sequence to get similarity scores sᵢⱼ (alignment scores). Softmax over these scores yields weights wᵢⱼ. The output contextual vector is the weighted sum of the corresponding Value vectors Vⱼ using wᵢⱼ.
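Compactly, reusing the same symbols: sᵢⱼ = Qᵢ · Kⱼ, wᵢⱼ = exp(sᵢⱼ) / Σₖ exp(sᵢₖ), and the contextual vector for position i is Σⱼ wᵢⱼ Vⱼ.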

Why is self-attention called “self” rather than just “attention”?

Because the attention scores are computed between tokens of the same sequence (intra-sequence). Earlier attention variants compute alignment between two different sequences (inter-sequence). The transcript frames this as "inter-sequence attention" for translation pairs versus "intra-sequence" token-to-token comparison for self-attention.

Review Questions

  1. In the encoder–decoder setup, what bottleneck arises from using a single fixed context vector, and how does attention remove it?
  2. Describe the sequence of operations in self-attention (Q·K, softmax, weighted sum of V) and specify what Q, K, and V come from.
  3. Explain the difference between cross-attention and self-attention in terms of which tokens are compared to compute alignment scores.

Key Points

  1. Self-attention is named for computing attention scores within one sequence rather than between two different sequences.
  2. Single context-vector encoder–decoder models degrade for long inputs because they must compress all information into one fixed-size representation.
  3. Attention improves translation quality by building a context vector at each decoding step as a weighted sum over encoder hidden states.
  4. Luong-style attention computes alignment scores using similarity (described as dot products) between decoder hidden states and encoder hidden states, then normalizes with softmax.
  5. Self-attention keeps the same attention pipeline but generates Query, Key, and Value projections from the same token embeddings.
  6. For each token, self-attention compares its query against all keys in the sequence, then uses the resulting weights to combine values into a contextualized representation.

Highlights

The “self” in self-attention comes from intra-sequence alignment: tokens attend to other tokens in the same sentence.
Earlier seq2seq models rely on one context vector, which becomes a bottleneck as input length grows (quality drops beyond roughly 30 words).
Self-attention uses the same attention mechanics—alignment scores, softmax, weighted sums—just with Q/K/V derived from the same sequence.
Self-attention replaces cross-sequence encoder–decoder alignment with token-to-token comparisons inside one sequence.

Topics

  • Self-Attention Naming
  • Attention Mechanism
  • Luong Attention
  • Query-Key-Value
  • Intra vs Inter Sequence Attention
