Masked Self Attention | Masked Multi-head Attention in Transformer | Transformer Decoder
Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.
Briefing
Transformer decoders generate text one token at a time during inference, but masked self-attention lets them be trained in parallel. The key insight is that the decoder must behave autoregressively when producing outputs (each next token depends only on earlier tokens), yet training should avoid the slow sequential dependency that would otherwise make learning impractically expensive.
The discussion starts with a central sentence: a Transformer decoder is autoregressive at inference time and non-autoregressive at training time. “Inference” is framed as prediction: given an input sequence, the model produces the next token, then uses that newly produced token as context for the next step. “Autoregressive” is illustrated with a time-series analogy: to predict Friday’s stock value, the model conditions on its predictions for Wednesday and Thursday. In language tasks, this becomes the rule that later words depend on earlier generated words, so the model cannot generate an entire sentence in one shot without violating the natural dependency structure.
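To make that inference loop concrete, here is a minimal greedy-decoding sketch. The `model` callable and its signature are assumptions for illustration only (the transcript describes the idea, not this code); it stands in for any decoder that maps the token ids generated so far to next-token logits.

```python
# Minimal sketch of autoregressive (greedy) decoding.
# `model` is a hypothetical stand-in: given the token ids produced so far,
# it returns a list of logits over the vocabulary for the next position.

def greedy_decode(model, start_ids, max_new_tokens=20, eos_id=None):
    ids = list(start_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)  # depends only on tokens generated so far
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        ids.append(next_id)  # feed the prediction back in as context
        if next_id == eos_id:
            break
    return ids
```

Note the strictly sequential structure: each iteration must finish before the next can start, which is exactly the dependency that would make training slow if it were done the same way.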
The tension appears when training is considered. If training also forced the decoder to consume its own previous predictions, it would be slow: every training example would require stepping through token positions one at a time. The transcript argues that teacher forcing avoids this slowdown: during training, the decoder receives the ground-truth previous tokens from the dataset, so it never waits for its own earlier outputs. The computations for different time steps can therefore be parallelized, dramatically speeding up training (see the sketch below).
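As a sketch of why teacher forcing enables parallelism (the tensor names and `bos_id` here are illustrative assumptions, not from the transcript): the decoder input is the ground-truth sequence shifted right by one position, so every position's input and target are known up front and one forward pass covers the whole sequence.

```python
import torch

# Hypothetical example: teacher forcing builds all decoder inputs and
# targets from the ground truth, so one forward pass covers every position.
target = torch.tensor([5, 9, 2, 7, 3])  # ground-truth token ids
bos_id = 0
decoder_input = torch.cat([torch.tensor([bos_id]), target[:-1]])  # shift right
labels = target  # position t predicts token t from tokens < t

# decoder_input: [0, 5, 9, 2, 7] -> labels: [5, 9, 2, 7, 3]
# All five predictions can be computed in parallel because none of them
# waits on a model-generated token.
print(decoder_input.tolist(), labels.tolist())
```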
But parallelization introduces a new risk: if the model computes attention over all tokens at once, a token could “see” future tokens. That would create an unfair advantage—effectively a form of data leakage—because training would use information that won’t exist at inference time. The transcript frames this as a mismatch: parallel training would allow the current token’s representation to depend on future tokens, while inference must not.
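A sketch of the mask that blocks this leakage (PyTorch is an assumed implementation choice): a strictly upper-triangular boolean matrix marks, for each query position, every “future” position it must not see.

```python
import torch

# Causal mask for a sequence of length 4: True marks a future position
# that row i (the current token) must not attend to (column j > i).
seq_len = 4
future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
print(future)
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```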
Masked self-attention is presented as the mechanism that resolves the contradiction. The model computes attention scores from query (Q) and key (K) vectors, scales them, applies softmax, and then forms context vectors as weighted sums of value (V) vectors. Masking intervenes after the scaled score matrix is formed and before the softmax: positions corresponding to future tokens are set to negative infinity via a mask. After softmax, those positions contribute zero probability, so the current token’s context is computed only from the allowed earlier tokens. The result is “best of both worlds”: training can run non-autoregressively in the sense of parallel computation across positions, while still enforcing the autoregressive constraint needed for correct inference behavior.
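Putting the pieces together: in symbols, the masked attention is softmax((Q·K^T)/√d_k + M)·V, where M is 0 at allowed positions and −∞ at future ones. Below is a minimal single-head sketch in PyTorch (an assumed implementation choice; the transcript describes the math, not this code). The masked positions receive exactly zero weight after softmax, so every row of the output depends only on past and current tokens.

```python
import math
import torch

def masked_self_attention(Q, K, V):
    """Single-head masked (causal) self-attention.

    Q, K, V: (seq_len, d_k) tensors, assumed already projected.
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # scaled scores
    seq_len = scores.size(-1)
    future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    scores = scores.masked_fill(future, float("-inf"))  # hide future tokens
    weights = torch.softmax(scores, dim=-1)             # future weights -> 0
    return weights @ V                                  # context from past only

# Tiny usage example with random projections
torch.manual_seed(0)
Q = K = V = torch.randn(4, 8)
ctx = masked_self_attention(Q, K, V)
print(ctx.shape)  # torch.Size([4, 8])
```

All positions are processed in one matrix multiply, which is the parallelism teacher forcing buys; the mask is what keeps that parallel computation consistent with sequential inference.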
Overall, the transcript’s takeaway is that masked self-attention makes parallel training possible without letting the decoder cheat by attending to future words, preserving the causal structure required for generation while keeping training efficient.
Cornell Notes
The Transformer decoder must generate tokens causally: at inference, each next token can depend only on earlier tokens, making decoding autoregressive. During training, teacher forcing removes the need to wait for the model’s own previous predictions, enabling parallel computation and speeding up learning. However, parallel attention would let a token “peek” at future tokens, creating a mismatch with inference and risking data leakage. Masked self-attention fixes this by applying a mask to the attention-score matrix so future positions get −∞ before softmax, yielding zero attention weight. The decoder then trains in parallel while still using only past context, matching inference-time behavior.
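A quick numeric check of the −∞ trick (scores chosen arbitrarily for illustration): any score set to −∞ receives exactly zero probability after softmax, so the masked token contributes nothing to the context vector.

```python
import torch

scores = torch.tensor([2.0, 1.0, float("-inf")])  # third token is "future"
print(torch.softmax(scores, dim=-1))
# tensor([0.7311, 0.2689, 0.0000])  -> masked position gets zero weight
```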
Cue Questions
- Why is a Transformer decoder autoregressive at inference time?
- What makes training faster than inference in this setup?
- What problem arises if training attention is computed over all tokens without restrictions?
- How does masked self-attention prevent the model from using future tokens?
- Where in the self-attention pipeline does masking happen?
- Why does masking allow “parallel training” without breaking autoregressive behavior?
Review Questions
- In what exact sense is the decoder “non-autoregressive” during training, and how does teacher forcing enable that?
- Explain how masking changes the attention-score matrix and what effect that has after softmax.
- Why would computing attention over all tokens during training lead to a mismatch with inference-time generation?
Key Points
1. The decoder must follow causal generation: inference-time predictions depend only on earlier tokens, which makes decoding autoregressive.
2. Teacher forcing during training supplies ground-truth previous tokens, removing the need to wait for model-generated outputs and enabling parallel computation.
3. Unrestricted parallel attention would let a token attend to future tokens, creating a mismatch with inference and risking data leakage.
4. Masked self-attention enforces causality by setting future attention-score positions to −∞ before softmax, producing zero attention weights for those positions.
5. Self-attention computes context via Q·K^T (attention scores), scaling, softmax, then a weighted sum of V; masking intervenes between score computation and softmax.
6. The combination of teacher forcing (for speed) and masking (for correctness) yields fast training while preserving inference-time behavior.