
Masked Self Attention | Masked Multi-head Attention in Transformer | Transformer Decoder

CampusX · 5 min read

Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

The decoder must follow causal generation: inference-time predictions depend only on earlier tokens, which makes decoding autoregressive.

Briefing

Transformer decoders generate text one token at a time during inference, yet they can be trained in parallel—thanks to masked self-attention. The key insight is that the decoder must behave autoregressively when producing outputs (so each next token depends only on earlier tokens), yet training should avoid the slow sequential dependency that would otherwise make learning impractically expensive.

The discussion starts with a central claim: a Transformer decoder is autoregressive at inference time and non-autoregressive at training time. “Inference” is framed as prediction: given an input sequence, the model produces the next token, then uses that newly produced token as context for the next step. “Autoregressive” is illustrated with a time-series analogy: to predict Friday’s stock value, the model conditions on its Wednesday and Thursday predictions. In language tasks, this becomes the rule that later words depend on earlier generated words, so the model cannot generate an entire sentence in one shot without violating the natural dependency structure.

The tension appears when training is considered. If training also forced the decoder to feed on its own previous predictions, it would be slow because every training example would require stepping through many token positions sequentially. The transcript argues that teacher forcing avoids this slowdown: during training, the decoder receives the ground-truth previous tokens from the dataset, so the model does not need to wait for its own earlier outputs. That means the computations for different time steps can be parallelized, dramatically speeding training.

But parallelization introduces a new risk: if the model computes attention over all tokens at once, a token could “see” future tokens. That would create an unfair advantage—effectively a form of data leakage—because training would use information that won’t exist at inference time. The transcript frames this as a mismatch: parallel training would allow the current token’s representation to depend on future tokens, while inference must not.

Masked self-attention is presented as the mechanism that resolves the contradiction. The model first computes attention scores using query (Q) and key (K) vectors, scales them, applies softmax, and then forms context vectors by weighting value (V) vectors. Masking intervenes right after the attention-score matrix is formed: positions corresponding to future tokens are set to negative infinity (via a mask). After softmax, those positions contribute zero probability, so the current token’s context is computed only from allowed earlier tokens. The result is “best of both worlds”: training can run non-autoregressively in the sense of parallel computation across positions, while still enforcing the autoregressive constraint needed for correct inference behavior.
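As a concrete sketch, this pipeline—scores, scaling, causal mask, softmax, weighted sum of values—can be written in a few lines of NumPy. This is an illustrative single-head implementation, not code from the transcript:

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask (illustrative sketch)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # attention-score matrix
    # Causal mask: entries above the diagonal (future tokens) become -inf.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Softmax: exp(-inf) = 0, so masked positions get exactly zero weight.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

Each row of `weights` sums to 1 but has zeros strictly above the diagonal, so the context vector for position t mixes only value vectors from positions 0..t.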

Overall, the transcript’s takeaway is that masked self-attention makes parallel training possible without letting the decoder cheat by attending to future words, preserving the causal structure required for generation while keeping training efficient.

Cornell Notes

The Transformer decoder must generate tokens causally: at inference, each next token can depend only on earlier tokens, making decoding autoregressive. During training, teacher forcing removes the need to wait for the model’s own previous predictions, enabling parallel computation and speeding up learning. However, parallel attention would let a token “peek” at future tokens, creating a mismatch with inference and risking data leakage. Masked self-attention fixes this by applying a mask to the attention-score matrix so future positions get −∞ before softmax, yielding zero attention weight. The decoder then trains in parallel while still using only past context, matching inference-time behavior.

Why is a Transformer decoder autoregressive at inference time?

Inference produces tokens sequentially. After generating one token, the decoder uses that token as part of the context for the next prediction. The transcript emphasizes that later outputs depend on earlier generated outputs, so the model cannot generate the entire sequence in one pass without breaking the dependency structure.
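The sequential generation loop can be sketched as below; `predict_next` is a hypothetical stand-in for a full decoder forward pass, and greedy selection is assumed for simplicity:

```python
def generate(predict_next, start_tokens, max_new_tokens):
    """Autoregressive decoding: each new token is appended to the context
    before the next prediction, so step t sees only tokens 0..t-1."""
    tokens = list(start_tokens)
    for _ in range(max_new_tokens):
        nxt = predict_next(tokens)  # depends only on tokens generated so far
        tokens.append(nxt)
    return tokens
```

The loop makes the sequential dependency explicit: no iteration can start until the previous one has produced its token.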

What makes training faster than inference in this setup?

Training uses teacher forcing: at each time step, the decoder receives the ground-truth previous tokens from the dataset rather than the model’s own sampled predictions. Because the required inputs are already known from the training data, computations across time steps can be parallelized instead of waiting for sequential generation.
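In code, teacher forcing amounts to feeding the target sequence shifted right by one position, so every decoder input is known up front. The `BOS` start-of-sequence id below is an assumed placeholder:

```python
BOS = 0  # assumed start-of-sequence token id

def teacher_forcing_pairs(target_ids):
    """Decoder inputs are the ground-truth targets shifted right by one:
    at step t the decoder sees target token t-1, not its own prediction."""
    decoder_inputs = [BOS] + target_ids[:-1]
    return decoder_inputs, target_ids
```

Because `decoder_inputs` is fully determined by the dataset, the forward pass for all positions can run in a single batched computation.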

What problem arises if training attention is computed over all tokens without restrictions?

If attention is computed across the full sequence during training, a token at position t could attend to tokens at positions > t (future tokens). The transcript calls this “unfair” because those future tokens will not be available during inference, leading to a mismatch and effectively a form of data leakage.

How does masked self-attention prevent the model from using future tokens?

After computing attention scores from Q·K^T and scaling, the model applies a mask that sets future positions to −∞. Under softmax, exp(−∞) = 0, so those positions receive exactly zero attention weight. The context vector at position t is then computed only from the value vectors of allowed earlier tokens.
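A tiny numeric check of this step, using illustrative score values rather than anything from the transcript:

```python
import numpy as np

# One row of a scaled attention-score matrix; the last two
# positions are "future" tokens and have been masked to -inf.
scores = np.array([2.0, 1.0, -np.inf, -np.inf])

# Numerically stable softmax: exp(-inf) evaluates to exactly 0.
e = np.exp(scores - scores.max())
probs = e / e.sum()
```

The masked entries of `probs` are exactly zero (not merely small), and the remaining probability mass is renormalized over the visible positions.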

Where in the self-attention pipeline does masking happen?

Masking is applied to the attention-score matrix (the one produced after the Q and K dot product and scaling, before softmax). This ensures the softmax output assigns zero weight to masked (future) positions, which then prevents those future tokens from contributing to the weighted sum of V.

Why does masking allow “parallel training” without breaking autoregressive behavior?

Parallel training is possible because teacher forcing supplies the correct tokens for all positions during training, so the model can compute attention for many positions at once. Masking enforces causality inside attention itself, ensuring each position’s representation depends only on earlier tokens—so the parallel computation still respects the same causal constraint used at inference.
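One way to verify this claim numerically: the masked parallel computation produces, at each position i, exactly the same context vector as unmasked attention computed over just the prefix 0..i. A minimal NumPy sketch of that equivalence (illustrative, single head):

```python
import numpy as np

def attn(Q, K, V):
    """Plain scaled dot-product attention, no mask."""
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def causal_attn(Q, K, V):
    """Same computation for all positions at once, with a causal mask."""
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    s[np.triu(np.ones(s.shape, dtype=bool), k=1)] = -np.inf
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V
```

Row i of `causal_attn(Q, K, V)` matches `attn(Q[i:i+1], K[:i+1], V[:i+1])`: the parallel masked pass and the sequential prefix-by-prefix pass compute identical context vectors.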

Review Questions

  1. In what exact sense is the decoder “non-autoregressive” during training, and how does teacher forcing enable that?
  2. Explain how masking changes the attention-score matrix and what effect that has after softmax.
  3. Why would computing attention over all tokens during training lead to a mismatch with inference-time generation?

Key Points

  1. The decoder must follow causal generation: inference-time predictions depend only on earlier tokens, which makes decoding autoregressive.

  2. Teacher forcing during training supplies ground-truth previous tokens, removing the need to wait for model-generated outputs and enabling parallel computation.

  3. Unrestricted parallel attention would let a token attend to future tokens, creating a mismatch with inference and risking data leakage.

  4. Masked self-attention enforces causality by setting future attention-score positions to −∞ before softmax, producing zero attention weights for those positions.

  5. Self-attention computes context via Q·K^T (attention scores), scaling, softmax, then a weighted sum of V; masking intervenes between score computation and softmax.

  6. The combination of teacher forcing (for speed) and masking (for correctness) yields fast training while preserving inference-time behavior.

Highlights

The core tradeoff is speed vs. correctness: parallel training is fast, but without masking it would let tokens “see” future words that inference cannot access.
Masked self-attention fixes the mismatch by forcing softmax to assign zero probability to future positions using −∞ scores.
Inference must be sequential because each next token depends on earlier generated tokens; training can be parallel because teacher forcing provides the needed previous tokens.
The attention-score matrix is the control point: mask it before softmax, and the context vectors automatically respect causality.

Topics

  • Transformer Decoder
  • Masked Self Attention
  • Autoregressive Inference
  • Teacher Forcing
  • Causal Attention Masking
