Attention Mechanism in 1 video | Seq2Seq Networks | Encoder Decoder Architecture

CampusX · 5 min read

Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Vanilla Seq2Seq compresses the entire source sentence into one fixed vector, which becomes a bottleneck for longer inputs.

Briefing

Attention-based encoder–decoder models fix two core weaknesses of the classic LSTM Seq2Seq setup: they stop forcing a single, static sentence summary to carry everything, and they let the decoder dynamically “look up” the most relevant parts of the source while generating each target word. The practical result is more stable translation quality as sentence length grows, along with interpretable attention weights that reveal which source words matter for each generated output.

In the baseline encoder–decoder architecture, the encoder reads the entire input sentence step by step and compresses it into one fixed vector (a summary or representation). The decoder then generates the output sentence step by step using that same static vector. This design breaks down for longer sentences. A quick thought experiment captures the intuition: read a long sentence, close your eyes, and translate from memory. Humans don't rely on a single all-purpose memory snapshot for long text; instead, they focus on a moving window of context. In the Seq2Seq model, the encoder's fixed summary becomes overloaded when inputs exceed roughly 25 words, because the decoder must infer too much from too little.
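
To make the bottleneck concrete, here is a minimal sketch (in PyTorch, with made-up layer sizes; not the video's code) of a vanilla encoder whose final hidden state is the single summary vector handed to the decoder:

```python
import torch
import torch.nn as nn

# Vanilla Seq2Seq encoder sketch: whatever the sentence length,
# the decoder only ever sees the final hidden state.
vocab_size, embed_dim, hidden_dim = 1000, 64, 128

embedding = nn.Embedding(vocab_size, embed_dim)
encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

src = torch.randint(0, vocab_size, (1, 30))        # a 30-token source sentence
enc_outputs, (h_n, _) = encoder(embedding(src))    # enc_outputs: (1, 30, 128)

summary = h_n[-1]                                  # (1, 128): the one fixed vector
print(summary.shape)                               # torch.Size([1, 128])
```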

The second weakness appears on the decoder side. At each decoding time step, the model really needs only certain source tokens (or nearby context) to produce the next target word. For example, generating “light” should depend on the source region that contains the relevant concept, not the entire sentence. Yet the vanilla architecture feeds the same static representation at every time step, so the decoder can’t selectively emphasize the right encoder states. The transcript frames this as a mismatch: static representations are convenient, but translation decisions are inherently time-dependent.

The attention mechanism introduces a dynamic alternative. Rather than passing one fixed vector, the decoder receives a context vector c_i at each time step i. That context vector is computed as a weighted combination of encoder hidden states h_j (one per encoder time step). The weights—called attention weights or alignment scores (α)—are different for each decoder time step, effectively telling the model which encoder positions to prioritize when producing the next word.

Concretely, the transcript defines encoder hidden states h_1…h_4 and decoder states s_1…s_4, along with decoder inputs y_1…y_3. At each decoder step i, attention produces α_i1…α_i4, then forms c_i = Σ_j α_ij h_j. Those α values are computed by an alignment function that takes the encoder hidden state h_j and the decoder's previous hidden state s_{i−1} (denoted s_prev or similar in the explanation). Rather than hand-designing this alignment function, the approach uses a small feed-forward neural network to approximate it, leveraging the universal function approximation idea and training via backpropagation through time.
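
As a rough sketch of a single decoder step (PyTorch, with assumed sizes and an additive, Bahdanau-style scoring form; not the transcript's exact code), the alignment network scores each h_j against s_{i−1}, the scores are normalized into α, and c_i is the weighted sum:

```python
import torch
import torch.nn as nn

hidden_dim = 128
T_enc = 4                                   # four encoder time steps -> h_1..h_4

h = torch.randn(T_enc, hidden_dim)          # encoder hidden states h_j
s_prev = torch.randn(hidden_dim)            # decoder's previous state s_{i-1}

# Small feed-forward alignment network: score_j = v^T tanh(W [s_prev; h_j])
W = nn.Linear(2 * hidden_dim, hidden_dim)
v = nn.Linear(hidden_dim, 1, bias=False)

scores = v(torch.tanh(W(torch.cat([s_prev.expand(T_enc, -1), h], dim=1)))).squeeze(-1)
alpha = torch.softmax(scores, dim=0)        # attention weights alpha_i1..alpha_i4, sum to 1
c_i = (alpha.unsqueeze(-1) * h).sum(dim=0)  # context vector c_i = sum_j alpha_ij h_j
```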

Empirical motivation is provided via reported BLEU-score behavior: most models’ BLEU drops sharply once sentence length exceeds about 30 words, while the attention-based model stays comparatively stable. Another reported experiment plots attention weights (α) as a grid for English-to-French translation, showing that specific source words—like agreement-related tokens—receive the highest attention when generating corresponding target words. Finally, the transcript notes that the original paper used bidirectional LSTMs in the encoder, improving context by incorporating both past and future information, while leaving the attention alignment mechanism structurally the same.
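
For reference, a bidirectional encoder in this setup might look like the following sketch (assumed sizes, PyTorch); each h_j then carries both left and right context, and attention over these states is computed exactly as before:

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim = 64, 128
encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

src_embedded = torch.randn(1, 30, embed_dim)   # a 30-token embedded source sentence
h, _ = encoder(src_embedded)                   # h: (1, 30, 2 * hidden_dim)
# Each h[:, j, :] concatenates the forward and backward states for source position j.
```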

Cornell Notes

Classic LSTM Seq2Seq compresses the entire source sentence into one fixed vector, then uses that same representation at every decoding step. That static bottleneck makes translation harder for long sentences and prevents the decoder from focusing on the specific source tokens needed for each output word. Attention replaces the single summary with a time-varying context vector c_i computed as a weighted sum of encoder hidden states h_j, where weights α_ij are alignment scores. Those weights are produced by a learned neural network using the decoder’s previous state and each encoder state, then trained end-to-end with backpropagation through time. The payoff is more stable BLEU scores for longer sentences and interpretable attention maps showing which source words drive each generated target word.

Why does the vanilla encoder–decoder struggle as input sentences get longer?

The encoder produces one fixed summary vector for the entire input. When the source grows (the transcript cites beyond ~25 words), that single vector must carry too much information, so the decoder has to reconstruct details from an overloaded representation—similar to trying to translate after reading a long sentence and then relying on a single memory snapshot.

What is the decoder-side problem that attention targets?

At each decoding time step, the model ideally needs only a subset of source information (specific tokens or local context). Vanilla Seq2Seq feeds the same static encoder summary at every step, so the decoder cannot selectively emphasize the relevant encoder positions when generating words like “light” or “off.”

How is the attention context vector c_i computed at decoder time step i?

The transcript defines c_i as a weighted sum of encoder hidden states: c_i = Σ_j α_ij h_j. The α_ij values are attention weights (alignment/similarity scores) that indicate how useful each encoder state h_j is for producing the next decoder output at step i.
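
A toy numeric example (values invented for illustration) shows that c_i is simply a convex combination of the encoder states:

```python
import numpy as np

# Four encoder states h_1..h_4 (2-dimensional for readability).
h = np.array([[1.0, 0.0],    # h_1
              [0.0, 1.0],    # h_2
              [1.0, 1.0],    # h_3
              [0.5, 0.5]])   # h_4
alpha = np.array([0.7, 0.1, 0.1, 0.1])   # alpha_i1..alpha_i4, sum to 1

c_i = alpha @ h                          # sum_j alpha_ij * h_j
print(c_i)                               # [0.85 0.25] -> dominated by h_1
```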

What determines the attention weights α_ij?

In the explanation, α_ij depends on the encoder hidden state h_j and the decoder’s previous hidden state (denoted as s_prev or s_{i−1}). The alignment score is computed by a learned function—implemented as a small feed-forward neural network—so α_ij becomes a trainable function of (h_j, s_prev).

Why does using a neural network for alignment make sense here?

Instead of manually designing the alignment function, the approach uses the neural network’s capacity to approximate complex functions (framed via universal function approximation). During training, backpropagation through time updates both the LSTM parameters and the alignment network parameters, so the attention mechanism learns weights that improve translation.

What evidence is given that attention improves translation quality and interpretability?

Reportedly, the attention model's BLEU scores remain comparatively stable as sentences grow beyond roughly 30 words, whereas other models' BLEU drops quickly. Additionally, attention weights α can be plotted as a grid; for English-to-French translation, the highest attention often aligns with linguistically relevant source tokens (e.g., agreement-related words), indicating which source positions drive each target word.
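
A small sketch of how such a grid can be drawn (with an invented α matrix and example word pairs, purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical attention matrix: rows = target words, columns = source words.
alpha = np.array([[0.80, 0.10, 0.05, 0.05],
                  [0.10, 0.70, 0.10, 0.10],
                  [0.05, 0.10, 0.80, 0.05],
                  [0.05, 0.10, 0.15, 0.70]])
src_words = ["the", "agreement", "was", "signed"]   # hypothetical English source
tgt_words = ["l'", "accord", "était", "signé"]      # hypothetical French target

plt.imshow(alpha, cmap="gray_r")
plt.xticks(range(len(src_words)), src_words, rotation=45)
plt.yticks(range(len(tgt_words)), tgt_words)
plt.xlabel("source (English)")
plt.ylabel("target (French)")
plt.colorbar(label="attention weight")
plt.tight_layout()
plt.show()
```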

Review Questions

  1. In vanilla Seq2Seq, what information is fixed across all decoder time steps, and why does that become problematic for long sentences?
  2. Write the formula for c_i in terms of encoder states h_j and attention weights α_ij, and explain what α_ij represents.
  3. What inputs feed the alignment function used to compute α_ij, and how does training adjust it?

Key Points

  1. Vanilla Seq2Seq compresses the entire source sentence into one fixed vector, which becomes a bottleneck for longer inputs.
  2. Attention replaces the static summary with a time-varying context vector c_i computed from encoder hidden states.
  3. At each decoder step i, attention assigns weights α_ij to encoder states h_j, then forms c_i as a weighted sum.
  4. The alignment score α_ij is computed by a learned function (a small neural network) using the decoder's previous state and each encoder state.
  5. End-to-end training updates both the LSTM parameters and the attention/alignment network via backpropagation through time.
  6. Reported results show attention-based models maintain more stable BLEU scores for longer sentences and produce interpretable attention heatmaps.
  7. Using bidirectional LSTMs in the encoder improves context availability for attention, while the attention mechanism itself stays structurally similar.

Highlights

Attention turns translation from “one summary for everything” into “different source focus at each output step” via c_i = Σ_j α_ij h_j.
The attention weights α_ij are computed from both encoder information (h_j) and the decoder’s progress (previous decoder state), making them time-dependent.
BLEU-score stability after ~30 words is cited as evidence that attention mitigates the long-sentence bottleneck.
Attention weights can be plotted to show which source words (e.g., agreement tokens) most influence specific target words in English-to-French translation.
