
Transformer Architecture | Part 1 Encoder Architecture | CampusX

CampusX · 6 min read

Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

A transformer encoder is a stack of six identical encoder blocks; understanding one block explains the rest because the architecture repeats.

Briefing

Transformer encoder architecture is built from a repeating pattern: each encoder block takes token embeddings (augmented with positional information), runs multi-head self-attention, then passes the result through a two-layer feed-forward network, with residual connections and layer normalization wrapping both sub-steps. The practical payoff is that the model can convert a sequence like “How are you” into context-aware representations—while keeping tensor shapes consistent (512-dimensional vectors per token) so the same block can be stacked six times.

The walkthrough starts by reframing the famous transformer diagram into something easier to reason about: a transformer is split into an encoder side and a decoder side, and each side contains multiple identical blocks. In the original “Attention Is All You Need” setup, the encoder uses six encoder blocks and the decoder uses six decoder blocks. Because the blocks are architecturally identical, understanding one encoder block unlocks the rest; the only difference across blocks is that each has its own trainable parameters that get updated during backpropagation.

Before any encoder block runs, the input sentence goes through an “input block” with three steps. First comes tokenization: the sentence is split into tokens (the example assumes word-level tokenization, producing tokens like “How”, “are”, “you”). Second is embedding: each token becomes a 512-dimensional vector via an embedding layer, since models operate on numbers rather than raw text. Third is positional encoding: because embeddings alone don’t tell the model which word comes first or last, positional vectors are generated for each token position (also 512-dimensional) and added to the token embeddings. After this, the encoder receives a sequence of 512-dimensional vectors.
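
To make the shapes concrete, here is a minimal PyTorch sketch of that input block; the toy three-word vocabulary and the learned positional table are illustrative assumptions (the original paper uses fixed sinusoidal positional vectors):

```python
import torch
import torch.nn as nn

d_model = 512                      # per-token vector size used throughout the encoder
sentence = "How are you"
tokens = sentence.split()          # word-level tokenization: ["How", "are", "you"]

# Toy vocabulary purely for illustration; a real model uses a learned tokenizer vocabulary.
vocab = {"How": 0, "are": 1, "you": 2}
token_ids = torch.tensor([[vocab[t] for t in tokens]])     # shape (1, 3)

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)
token_emb = embedding(token_ids)                            # (1, 3, 512)

# Positional vectors: one 512-d vector per position, added element-wise to the embeddings.
pos_emb = nn.Embedding(num_embeddings=3, embedding_dim=d_model)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # positions 0, 1, 2
x = token_emb + pos_emb(positions)                          # (1, 3, 512): x1, x2, x3

print(x.shape)   # torch.Size([1, 3, 512])
```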

Inside an encoder block, multi-head self-attention is applied first. The key motivation is contextualization: the same word embedding can represent different meanings depending on surrounding words (e.g., “bank” in different contexts). Self-attention lets each token’s representation change based on other tokens in the sequence, producing context-aware outputs (denoted as z1, z2, z3 for the example tokens). Multi-head attention repeats this idea across multiple attention “views,” yielding richer, more diverse contextual signals.
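
A single-head version of that computation can be sketched as follows; the scaled dot-product form and the query/key/value projections come from the original paper rather than from details spelled out in this summary, and multi-head attention repeats the same idea with independent projection sets:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 512
x = torch.randn(1, 3, d_model)   # x1, x2, x3 from the input block (batch of 1, 3 tokens)

# One attention "view": project tokens to queries, keys, and values,
# score every token against every other token, and mix the value vectors.
W_q, W_k, W_v = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))
Q, K, V = W_q(x), W_k(x), W_v(x)

scores = Q @ K.transpose(-2, -1) / math.sqrt(d_model)   # (1, 3, 3) token-to-token affinities
weights = F.softmax(scores, dim=-1)                      # each row sums to 1
z = weights @ V                                          # (1, 3, 512): context-aware z1, z2, z3

# Multi-head attention runs several such projection sets in parallel
# (e.g. 8 heads of size 64) and concatenates the results back to 512 dimensions.
print(z.shape)   # torch.Size([1, 3, 512])
```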

Next, residual connections and layer normalization stabilize the computation. The block adds the attention output back to the original input vectors (a skip path that bypasses the attention transformation) and then normalizes each token vector using layer norm (mean/standard deviation over the 512 dimensions, plus learned parameters gamma and beta). This normalization is presented as a training-stability mechanism: without it, attention outputs can drift into uncontrolled numeric ranges.
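
A minimal sketch of this Add & Norm step, using PyTorch's nn.LayerNorm, whose elementwise affine weights play the role of gamma and beta; the random tensors stand in for the real activations:

```python
import torch
import torch.nn as nn

d_model = 512
x = torch.randn(1, 3, d_model)   # sub-layer input (x1, x2, x3)
z = torch.randn(1, 3, d_model)   # multi-head attention output (z1, z2, z3)

# Add & Norm: the skip path adds the original input back, then LayerNorm
# normalizes each token vector across its 512 dimensions.
add_norm = nn.LayerNorm(d_model)
z_prime = add_norm(x + z)        # (1, 3, 512)

# With the default gamma=1, beta=0, each token vector now has mean ~0 and std ~1.
print(z_prime.mean(dim=-1), z_prime.std(dim=-1))
```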

Then the feed-forward network runs. It’s a two-layer MLP applied position-wise: the first layer expands from 512 to 2048 units with ReLU nonlinearity, and the second layer projects back from 2048 to 512 with a linear activation. The transcript emphasizes the shape flow: after the first layer, token representations become 3×2048 (for three tokens), then return to 3×512 after the second layer. The residual connection and layer normalization wrap this feed-forward step as well.
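
The same feed-forward step as a short PyTorch sketch; the module layout is illustrative, but the 512→2048→512 shape flow matches the description above:

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048

# Position-wise feed-forward network: applied to every token vector independently.
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),   # 512 -> 2048
    nn.ReLU(),                  # nonlinearity
    nn.Linear(d_ff, d_model),   # 2048 -> 512, linear (no activation)
)

z_prime = torch.randn(1, 3, d_model)   # three normalized token vectors
out = ffn(z_prime)                     # intermediate shape 3x2048, final shape 3x512
print(out.shape)                       # torch.Size([1, 3, 512])
```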

Finally, the encoder block output (still 3×512 in the example) becomes the input to the next encoder block. This repeats six times, producing the encoder’s final token representations that are then handed off to the decoder. The transcript closes by addressing common “why” questions: residual connections are linked to stable training and preserving useful features; feed-forward layers are tied to introducing nonlinearity and representational capacity; and stacking multiple encoder blocks is justified by the need for stronger representation power to capture human language patterns.
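
Putting the pieces together, a compact and deliberately simplified sketch of the six-block stack; nn.MultiheadAttention stands in for the attention described above, and dropout and other details from the original paper are omitted:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder block: multi-head self-attention + FFN, each wrapped in Add & Norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        z, _ = self.attn(x, x, x)          # self-attention: queries, keys, values all come from x
        x = self.norm1(x + z)              # Add & Norm around attention
        x = self.norm2(x + self.ffn(x))    # Add & Norm around the feed-forward network
        return x

# Six architecturally identical blocks, each with its own trainable parameters.
encoder = nn.ModuleList(EncoderBlock() for _ in range(6))

x = torch.randn(1, 3, 512)                 # embeddings + positional encodings
for block in encoder:
    x = block(x)                           # shape stays (1, 3, 512) at every block
print(x.shape)
```

Because every block maps 512-dimensional token vectors to 512-dimensional token vectors, the loop above can chain them without any reshaping.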

Cornell Notes

The encoder side of a transformer turns tokenized text into context-aware 512-dimensional vectors using a stack of six identical encoder blocks. Each encoder block applies multi-head self-attention to let every token incorporate information from other tokens in the sequence, then uses a two-layer feed-forward network (512→2048 with ReLU, then 2048→512 with linear) to add nonlinearity and richer transformations. Residual connections and layer normalization wrap both the attention and feed-forward sub-steps to stabilize training and preserve useful features. Before entering the first block, the input is tokenized, embedded into 512D vectors, and augmented with positional encodings so the model knows word order. The output remains the same shape per token, enabling straightforward stacking across blocks.

Why does the encoder need positional encoding if tokens are already embedded into vectors?

Embeddings convert each token into a 512-dimensional vector, but they don’t encode order. Without positional information, the model can’t distinguish whether “How” came before “you” or vice versa. Positional encoding generates a 512-dimensional vector for each position (e.g., position 1, 2, 3) and adds it to the corresponding token embedding. After addition, each token representation (x1, x2, x3 in the example) carries both token identity and position.
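
The summary doesn't say how those positional vectors are generated; for reference, the original paper computes them with fixed sinusoids over the 512 dimensions, where pos is the token position and i indexes dimension pairs:

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
x_t = \text{embedding}(t) + PE_{(t)}
```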

What problem does self-attention solve that plain embeddings can’t?

Embeddings alone treat a token as a fixed representation, even though meaning depends on context. The transcript uses “bank” as an example: “bank” in one sentence context differs from “bank” in another. Self-attention updates each token’s representation by attending to other tokens, so the “bank” vector changes based on surrounding words. Multi-head attention extends this by running multiple attention views in parallel, producing more diverse contextual embeddings.
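
In equation form, the scaled dot-product attention from the original paper looks like this, where Q, K, and V are learned linear projections of the token vectors (symbols not used in the summary itself):

```latex
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```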

What exactly happens around multi-head attention with residual connections and layer normalization?

After multi-head attention produces outputs z1, z2, z3 (each still 512-dimensional), the block adds these to the original inputs x1, x2, x3 via a residual/skip path. This yields updated vectors (z1′, z2′, z3′). Layer normalization then normalizes each token vector across its 512 dimensions using mean and standard deviation, plus learned gamma and beta parameters. The transcript frames this as stabilizing training because attention outputs can otherwise drift into wide numeric ranges.
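
Written out, layer normalization of a 512-dimensional token vector z′ is as follows (epsilon is a small constant for numerical stability, an implementation detail not mentioned in the summary):

```latex
\mu = \frac{1}{512}\sum_{k=1}^{512} z'_k, \qquad
\sigma^2 = \frac{1}{512}\sum_{k=1}^{512} \left(z'_k - \mu\right)^2, \qquad
\text{LN}(z')_k = \gamma_k \cdot \frac{z'_k - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_k
```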

Why does the feed-forward network expand to 2048 dimensions and then shrink back to 512?

The feed-forward network is a two-layer MLP applied position-wise. First, it projects 512→2048 and applies ReLU to introduce nonlinearity. Then it projects 2048→512 with a linear activation, returning to the same dimensionality required by the next sub-layer and the next encoder block. The transcript highlights that the shape grows to 3×2048 for three tokens, then returns to 3×512, preserving the interface between blocks while increasing representational capacity through nonlinearity.
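
As a formula, the position-wise feed-forward network from the original paper is simply (with W1 of shape 512×2048 and W2 of shape 2048×512):

```latex
\text{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2
```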

How does stacking six encoder blocks work without changing the overall tensor shape per token?

Each encoder block outputs token vectors that remain 512-dimensional. So the output of the first block (still 3×512 in the example) becomes the input to the second block, and so on. The transcript stresses that while the block architecture is identical, each block has its own parameters (weights and biases) that differ across blocks during training. After six such blocks, the encoder produces final token representations for the decoder.

What are the transcript’s main “why” explanations for residual connections and feed-forward layers?

Residual connections are linked to stable training in deep networks, helping gradients flow and preventing useful features from being overwritten when a transformation underperforms. The feed-forward network is described as the way to add nonlinearity (since the attention step is largely a weighted linear combination of value vectors) and to increase representational power. The transcript also mentions an additional perspective from a paper: feed-forward layers may function like key-value memory components that correlate with textual patterns.

Review Questions

  1. In the encoder input pipeline, what are the three operations performed before multi-head self-attention, and what role does each one play?
  2. Trace the dimensionality changes through the feed-forward network (512→2048→512). Why is the dimensionality preserved at the end of the block?
  3. Explain how residual connections and layer normalization interact with the attention output. What stability problem are they meant to address?

Key Points

  1. A transformer encoder is a stack of six identical encoder blocks; understanding one block explains the rest because the architecture repeats.
  2. Tokenization, embedding (512D), and positional encoding (added to embeddings) prepare the input so the model knows both token identity and word order.
  3. Multi-head self-attention makes token representations context-aware by letting each token incorporate information from other tokens in the sequence.
  4. Residual connections add the sub-layer input back to the sub-layer output, and layer normalization stabilizes training by keeping values in a controlled range.
  5. Each encoder block applies a two-layer feed-forward network position-wise: 512→2048 with ReLU, then 2048→512 with linear activation.
  6. The encoder block output keeps the same per-token dimensionality (512), enabling seamless stacking across multiple blocks.
  7. Each encoder block has its own trainable parameters even though the block structure is copied across the stack.

Highlights

  • The encoder’s core loop is: multi-head self-attention → residual + layer norm → feed-forward (512→2048→512) → residual + layer norm, repeated six times.
  • Positional encoding is added to token embeddings because embeddings alone don’t encode which word appears first or last.
  • Residual connections are framed as a training-stability mechanism and a safeguard that preserves original features when transformations don’t help.
  • The feed-forward network’s expansion to 2048 is mainly about adding nonlinearity and representational capacity before projecting back to 512.

Topics

  • Transformer Encoder
  • Multi-Head Self-Attention
  • Positional Encoding
  • Residual Connections
  • Feed-Forward Network
  • Layer Normalization
