
Positional Encoding in Transformers | Deep Learning | CampusX

CampusX · 5 min read

Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.

TL;DR

Self-attention builds contextual embeddings in parallel but is order-blind unless positional information is injected.

Briefing

Transformers need positional information because self-attention treats tokens as a set—great for parallel context building, but blind to word order. The core problem is that “river bank” and “bank river” can end up with the same attention inputs, even though they mean different things. Positional encoding fixes this by injecting a representation of each token’s position into the model so attention can distinguish not just *what* words are present, but *where* they occur.

The transcript starts by unpacking why self-attention is powerful: it generates context-aware embeddings dynamically, and it can compute interactions for all tokens in parallel. That parallelism is a major speed advantage over inherently sequential architectures like RNNs, especially for long documents. But the same parallel design creates the order problem: since all tokens are fed together, self-attention has no built-in mechanism to know whether “Nithish” came before “killed” or vice versa. Without an order signal, the model would struggle with real NLP tasks where syntax and semantics depend heavily on position.
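A minimal numpy sketch (illustrative, not from the video) makes the order-blindness concrete: with the simplification Q = K = V = X, permuting the input tokens merely permutes the output rows, so the attention mechanism itself carries no signal about which token came first.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, embedding dim 8

def self_attention(X):
    # Simplified self-attention with Q = K = V = X (no learned projections)
    scores = X @ X.T / np.sqrt(X.shape[1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ X

perm = [2, 0, 3, 1]                    # reorder the "sentence"
out = self_attention(X)
out_perm = self_attention(X[perm])

# Each token receives the same contextual embedding regardless of position:
print(np.allclose(out[perm], out_perm))   # True
```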

A naive fix—appending the token index as an extra number—fails for multiple reasons. First, the index grows unbounded with sequence length, which can destabilize training because neural networks often become numerically unstable with very large values (leading to issues like exploding/vanishing gradients). Second, normalization doesn’t solve the deeper issue: the same “position” would map to different values across different sequence lengths, making the encoding inconsistent across training samples. Third, using discrete indices harms learning because neural networks generally prefer smooth, continuous patterns. Finally, simple indexing captures absolute position but not relative distance well, since the model can’t infer how far apart two tokens are when the encoding is effectively a discrete lookup.
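A two-line illustration of the normalization failure, with assumed toy numbers: dividing the raw index by the sequence length maps the same absolute position to different values in different samples.

```python
def normalized_index(pos, seq_len):
    return pos / seq_len

print(normalized_index(5, 10))    # 0.5  for token 5 in a 10-token sentence
print(normalized_index(5, 100))   # 0.05 for token 5 in a 100-token document
# Same absolute position, inconsistent encoding across training samples.
```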

The transcript then motivates the standard Transformer solution: use sinusoidal positional encodings. By mapping each position through sine and cosine functions, the encoding stays bounded (values remain between -1 and 1), changes smoothly, and can represent relative offsets. A single sine wave still risks collisions because sine is periodic—different positions can produce the same value. The fix is to use multiple frequencies: pair sine and cosine at different “wavelengths,” turning each token’s position into a vector rather than a single scalar. Higher-frequency components vary quickly (helping distinguish nearby positions), while lower-frequency components vary slowly (supporting longer-range distinctions). This multi-frequency design reduces the chance that two different positions share the same encoding.
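A minimal numpy sketch of the standard sinusoidal scheme from “Attention Is All You Need” (the formula the transcript builds toward): even dimensions get sine, odd dimensions get cosine, with wavelengths growing geometrically so early dimensions vary quickly and later ones slowly.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]         # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]     # even dimension indices
    angle = pos / (10000 ** (i / d_model))    # (max_len, d_model // 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)               # even dims: sine
    pe[:, 1::2] = np.cos(angle)               # odd dims: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)                               # (50, 16), all values in [-1, 1]
```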

Finally, the transcript explains how positional encodings integrate with token embeddings in practice. For each token, a positional encoding vector is computed with the same dimensionality as the token embedding (denoted as d_model). Instead of concatenating (which would increase dimensionality and effectively double parameter/training cost), the approach adds the positional vector to the token embedding element-wise. The result is a combined representation that carries both word meaning and position information into the self-attention block.
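A toy sketch of the combination step (the random `pe` below stands in for the sinusoidal matrix from the previous snippet): element-wise addition keeps the attention input at d_model, whereas concatenation would widen it.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model = 50, 16
token_embeddings = rng.normal(size=(seq_len, d_model))
pe = rng.normal(size=(seq_len, d_model))   # stand-in for the sinusoidal matrix

x_add = token_embeddings + pe                             # shape stays (50, 16)
x_cat = np.concatenate([token_embeddings, pe], axis=-1)   # (50, 32): wider input
print(x_add.shape, x_cat.shape)
```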

The closing sections emphasize that the sinusoidal scheme isn’t just a heuristic: with the right mathematical structure, it supports linear relationships that allow the model to recover relative positional information. The overall takeaway is that positional encoding is the bridge between self-attention’s parallel set-based processing and the order-sensitive nature of language.

Cornell Notes

Self-attention computes context-aware embeddings in parallel, but it doesn’t inherently know token order. That makes “word A then word B” indistinguishable from “word B then word A” unless positional information is added. Simple index-based encodings break down due to unbounded growth, inconsistent normalization across sequence lengths, discrete jumps that are hard to learn, and weak relative-distance modeling. Sinusoidal positional encodings solve these issues by mapping each position to a bounded, smooth vector using sine and cosine functions at multiple frequencies. The positional vector has the same dimensionality as the token embedding (d_model) and is added (not concatenated) to the embedding so the Transformer can use both meaning and position during attention.

Why does self-attention need positional encoding even though it already builds contextual embeddings?

Self-attention can generate dynamic, context-aware embeddings because each token’s representation is computed using interactions with other tokens. But the computation is permutation-equivariant: reordering the input tokens merely reorders the outputs, so the mechanism has no built-in way to tell which token appeared first. The transcript illustrates this with the idea that two sentences containing the same words in different orders (e.g., “Nithish killed the lion” vs. “The lion killed Nithish”) can produce the same attention inputs if no positional signal is provided. Positional encoding injects “where” information so attention can distinguish order-dependent meanings.

What goes wrong with the simplest positional encoding idea: appending the token index to the embedding?

The transcript lists several failure modes. (1) Unbounded values: token indices grow with sequence length (e.g., a book could push indices to 100,000), and large numeric ranges can destabilize training via exploding/vanishing gradients. (2) Normalization inconsistency: dividing by sequence length forces values into [0,1], but then the same position index maps to different values across different sequence lengths, confusing the model. (3) Discreteness: indices are discrete jumps, while neural networks typically learn better from smooth, continuous changes. (4) Weak relative-distance capture: absolute indices don’t naturally provide a smooth way to infer how far apart two tokens are.

How do sine and cosine positional encodings address stability and learning dynamics?

Using sine/cosine keeps positional values bounded between -1 and 1, avoiding the unbounded-growth problem. The functions change smoothly with position, which aligns better with gradient-based learning. The transcript also notes that periodicity alone can cause collisions (different positions can share the same sine value), so the method must be extended with multiple frequencies to reduce repeated encodings.

Why use multiple frequencies (and sine+cosine pairs) instead of a single sine wave?

A single sine wave is periodic, so different positions can map to identical values, creating collisions that would make distinct positions look the same to the model. The transcript’s fix is to represent each position as a vector built from sine and cosine at several frequencies. Sine and cosine together provide complementary phase information, and using progressively different frequencies reduces the probability that two different positions produce the same full vector. Higher-frequency components distinguish nearby positions; lower-frequency components help with longer-range structure.
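A toy collision check, with frequency choices assumed for illustration: a single sinusoid whose period divides the gap between two positions cannot separate them, while the multi-frequency vector can. (This sketch lays out the sines and cosines in two blocks rather than interleaved; the collision argument is unaffected.)

```python
import numpy as np

def single_sine(pos, period=8):
    # One frequency only: repeats every `period` positions
    return np.sin(2 * np.pi * pos / period)

def multi_freq(pos, d_model=16):
    # Sines and cosines at geometrically spaced frequencies
    i = np.arange(0, d_model, 2)
    angle = pos / (10000 ** (i / d_model))
    return np.concatenate([np.sin(angle), np.cos(angle)])

print(np.isclose(single_sine(0), single_sine(8)))   # True: positions collide
print(np.allclose(multi_freq(0), multi_freq(8)))    # False: vectors differ
```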

How are positional encodings combined with token embeddings in Transformers?

For each token, the positional encoding vector is computed with the same dimensionality as the token embedding (d_model). The transcript emphasizes element-wise addition: embedding + positional_encoding. Concatenation is avoided because it would increase dimensionality (e.g., doubling to 2*d_model), which would increase parameter count and slow training. After addition, the combined vector is fed into the self-attention layers so attention has both semantic and positional signals.
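Back-of-the-envelope arithmetic on the cost claim, using a toy d_model: the first weight matrix after the input scales with input width, so concatenation roughly doubles its parameter count.

```python
d_model = 512
W_add = d_model * d_model          # input width d_model after addition
W_cat = (2 * d_model) * d_model    # input width 2*d_model after concatenation
print(W_add, W_cat)                # 262144 vs 524288: twice the parameters
```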

What does the transcript claim about relative position information?

Absolute indexing alone doesn’t give a clean way to infer relative distance. Sinusoidal encodings are presented as enabling relative-position reasoning because the mathematical structure of sine/cosine allows relationships between positions to be expressed through linear transformations. The transcript connects this to the idea that with the right linear mapping, moving by a fixed delta in position corresponds to a predictable change in the encoding space, letting the model recover relative offsets.
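This is the standard trigonometric-identity argument; a quick numerical check (variable names are illustrative): for a fixed offset k, one 2×2 rotation maps the (sin, cos) pair at any position to the pair at position + k, independent of the position itself.

```python
import numpy as np

w, k = 0.3, 5                                   # one frequency, fixed offset
R = np.array([[np.cos(w * k), np.sin(w * k)],   # rotation by angle w*k
              [-np.sin(w * k), np.cos(w * k)]])

for pos in [0, 3, 11, 42]:
    v = np.array([np.sin(w * pos), np.cos(w * pos)])
    v_shifted = np.array([np.sin(w * (pos + k)), np.cos(w * (pos + k))])
    assert np.allclose(R @ v, v_shifted)        # the same R works for every pos

print("PE(pos + k) = R(k) @ PE(pos) holds at this frequency")
```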

Review Questions

  1. What specific limitations of index-based positional encodings motivate switching to sinusoidal functions?
  2. How do multiple sine/cosine frequencies reduce collisions between different positions?
  3. Why does the transcript prefer adding positional encodings to embeddings instead of concatenating them?

Key Points

  1. Self-attention builds contextual embeddings in parallel but is order-blind unless positional information is injected.

  2. Index-based positional encoding fails due to unbounded growth, inconsistent normalization across sequence lengths, and discrete jumps that don’t support smooth learning.

  3. Sinusoidal positional encodings keep values bounded and smooth, improving numerical stability and gradient flow.

  4. Periodic sine alone can collide across positions; using sine+cosine at multiple frequencies turns position into a vector that greatly reduces repeats.

  5. Positional encoding vectors match the embedding dimensionality (d_model) and are added element-wise to token embeddings to avoid increasing model size.

  6. The multi-frequency sinusoidal design supports relative-position reasoning, not just absolute position tagging.

Highlights

Self-attention’s parallelism is a double-edged sword: it accelerates computation but removes any built-in sense of token order.
Normalization of raw indices doesn’t fix the core issue because the same position can map to different values across different sequence lengths.
Using sine/cosine makes positional values bounded and smooth, but periodicity requires multiple frequencies to avoid collisions.
Adding positional encodings to embeddings preserves dimensionality, while concatenation would inflate parameters and slow training.
Multi-frequency sinusoidal encodings are designed so the model can infer relative offsets, not only absolute indices.

Topics

Mentioned

  • Nithish
  • NLP
  • RNN
  • PDF
  • d_model