Transformer Architecture | Part 1 Encoder Architecture | CampusX
Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
The transformer encoder is built from a repeating pattern: each encoder block takes token embeddings (augmented with positional information), runs multi-head self-attention, then passes the result through a two-layer feed-forward network, with residual connections and layer normalization wrapping both sub-steps. The practical payoff is that the model converts a sequence like “How are you” into context-aware representations while keeping tensor shapes consistent (one 512-dimensional vector per token), so the same block can be stacked six times.
The walkthrough starts by reframing the famous transformer diagram into something easier to reason about: a transformer is split into an encoder side and a decoder side, and each side contains multiple identical blocks. In the original “Attention Is All You Need” setup, the encoder uses six encoder blocks and the decoder uses six decoder blocks. Because the blocks are architecturally identical, understanding one encoder block unlocks the rest; the only difference across blocks is that each has its own trainable parameters that get updated during backpropagation.
Before any encoder block runs, the input sentence goes through an “input block” with three steps. First comes tokenization: the sentence is split into tokens (the example assumes word-level tokenization, producing tokens like “How”, “are”, “you”). Second is embedding: each token becomes a 512-dimensional vector via an embedding layer, since models operate on numbers rather than raw text. Third is positional encoding: because embeddings alone don’t tell the model which word comes first or last, positional vectors are generated for each token position (also 512-dimensional) and added to the token embeddings. After this, the encoder receives a sequence of 512-dimensional vectors.
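As a concrete illustration, here is a minimal PyTorch sketch of that input block. The three-word vocabulary is hypothetical, and the sinusoidal positional encoding follows the original paper; the transcript only says positional vectors are generated per position and added to the embeddings.

```python
# Minimal sketch of the input block: tokenize -> embed (512D) -> add positional encoding.
# The toy vocabulary and sinusoidal encoding are assumptions for illustration.
import math
import torch
import torch.nn as nn

d_model = 512
vocab = {"how": 0, "are": 1, "you": 2}                    # hypothetical word-level vocabulary
token_ids = torch.tensor([[vocab[w] for w in "how are you".split()]])  # shape (1, 3)

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)
x = embedding(token_ids)                                   # (1, 3, 512) token embeddings

# Sinusoidal positional encodings: same 512-dim shape, added to the embeddings.
positions = torch.arange(token_ids.size(1)).unsqueeze(1)                       # (3, 1)
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pe = torch.zeros(token_ids.size(1), d_model)
pe[:, 0::2] = torch.sin(positions * div_term)
pe[:, 1::2] = torch.cos(positions * div_term)

x = x + pe.unsqueeze(0)          # (1, 3, 512): what the first encoder block receives
print(x.shape)                   # torch.Size([1, 3, 512])
```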
Inside an encoder block, multi-head self-attention is applied first. The key motivation is contextualization: the same word embedding can represent different meanings depending on surrounding words (e.g., “bank” as a riverbank versus a financial institution). Self-attention lets each token’s representation change based on other tokens in the sequence, producing context-aware outputs (denoted as z1, z2, z3 for the example tokens). Multi-head attention repeats this idea across multiple attention “views,” yielding richer, more diverse contextual signals.
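A minimal sketch of that step using PyTorch’s nn.MultiheadAttention. The eight-head split (512 / 8 = 64 dimensions per head) is an assumption taken from the original paper; the transcript only says multiple heads are used.

```python
# Minimal self-attention sketch: queries, keys and values all come from the same sequence.
import torch
import torch.nn as nn

d_model, num_heads = 512, 8                     # 8 heads is an assumption (original paper)
x = torch.randn(1, 3, d_model)                  # stand-in for the "How are you" embeddings

self_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)
z, attn_weights = self_attn(x, x, x)            # self-attention over the three tokens

print(z.shape)             # torch.Size([1, 3, 512]) -> context-aware z1, z2, z3
print(attn_weights.shape)  # torch.Size([1, 3, 3])   -> how much each token attends to the others
```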
Next, residual connections and layer normalization stabilize the computation. The block adds the attention output back to the original input vectors (a skip path that bypasses the attention transformation) and then normalizes each token vector using layer norm (mean/standard deviation over the 512 dimensions, plus learned parameters gamma and beta). This normalization is presented as a training-stability mechanism: without it, attention outputs can drift into uncontrolled numeric ranges.
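A minimal sketch of this “add and norm” step, assuming PyTorch’s nn.LayerNorm (its learned weight and bias play the role of gamma and beta).

```python
# Residual connection + layer normalization around a sub-layer (here: attention).
import torch
import torch.nn as nn

d_model = 512
x = torch.randn(1, 3, d_model)        # sub-layer input
z = torch.randn(1, 3, d_model)        # stand-in for the attention output

layer_norm = nn.LayerNorm(d_model)    # learns gamma (weight) and beta (bias), both 512-dim
out = layer_norm(x + z)               # residual add, then normalize each token vector

print(out.shape)                      # torch.Size([1, 3, 512])
print(out.mean(-1), out.std(-1))      # per-token mean ~0, std ~1 (before gamma/beta scaling)
```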
Then the feed-forward network runs. It’s a two-layer MLP applied position-wise: the first layer expands from 512 to 2048 units with ReLU nonlinearity, and the second layer projects back from 2048 to 512 with a linear activation. The transcript emphasizes the shape flow: after the first layer, token representations become 3×2048 (for three tokens), then return to 3×512 after the second layer. The residual connection and layer normalization wrap this feed-forward step as well.
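A minimal position-wise feed-forward sketch showing the same 512→2048→512 shape flow; the PyTorch layers are an illustrative assumption.

```python
# Position-wise feed-forward network: expand 512 -> 2048 with ReLU, project back 2048 -> 512.
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),   # first layer: each token goes from 512 to 2048 units
    nn.ReLU(),
    nn.Linear(d_ff, d_model),   # second layer: back to 512, linear activation
)

x = torch.randn(1, 3, d_model)        # three tokens, 512-dim each
hidden = ffn[1](ffn[0](x))            # intermediate shape: (1, 3, 2048)
out = ffn(x)                          # final shape back to (1, 3, 512)
print(hidden.shape, out.shape)
```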
Finally, the encoder block output (still 3×512 in the example) becomes the input to the next encoder block. This repeats six times, producing the encoder’s final token representations that are then handed off to the decoder. The transcript closes by addressing common “why” questions: residual connections are linked to stable training and preserving useful features; feed-forward layers are tied to introducing nonlinearity and representational capacity; and stacking multiple encoder blocks is justified by the need for stronger representation power to capture human language patterns.
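Putting the pieces together, here is a minimal sketch of one encoder block and a six-block stack. Dropout and other training details are omitted, and the post-norm layer ordering is an assumption based on the original paper.

```python
# Minimal encoder block and six-block stack; shapes follow the transcript's example.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        z, _ = self.attn(x, x, x)          # multi-head self-attention
        x = self.norm1(x + z)              # add & norm around attention
        x = self.norm2(x + self.ffn(x))    # add & norm around feed-forward
        return x                           # same shape as the input

# Six identical blocks, each with its own trainable parameters.
encoder = nn.ModuleList([EncoderBlock() for _ in range(6)])

x = torch.randn(1, 3, 512)                 # embedded + positionally encoded "How are you"
for block in encoder:
    x = block(x)                           # shape stays (1, 3, 512) across all six blocks
print(x.shape)
```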
Cornell Notes
The encoder side of a transformer turns tokenized text into context-aware 512-dimensional vectors using a stack of six identical encoder blocks. Each encoder block applies multi-head self-attention to let every token incorporate information from other tokens in the sequence, then uses a two-layer feed-forward network (512→2048 with ReLU, then 2048→512 with linear) to add nonlinearity and richer transformations. Residual connections and layer normalization wrap both the attention and feed-forward sub-steps to stabilize training and preserve useful features. Before entering the first block, the input is tokenized, embedded into 512D vectors, and augmented with positional encodings so the model knows word order. The output remains the same shape per token, enabling straightforward stacking across blocks.
Why does the encoder need positional encoding if tokens are already embedded into vectors?
What problem does self-attention solve that plain embeddings can’t?
What exactly happens around multi-head attention with residual connections and layer normalization?
Why does the feed-forward network expand to 2048 dimensions and then shrink back to 512?
How does stacking six encoder blocks work without changing the overall tensor shape per token?
What are the transcript’s main “why” explanations for residual connections and feed-forward layers?
Review Questions
- In the encoder input pipeline, what are the three operations performed before multi-head self-attention, and what role does each one play?
- Trace the dimensionality changes through the feed-forward network (512→2048→512). Why is the dimensionality preserved at the end of the block?
- Explain how residual connections and layer normalization interact with the attention output. What stability problem are they meant to address?
Key Points
1. A transformer encoder is a stack of six identical encoder blocks; understanding one block explains the rest because the architecture repeats.
2. Tokenization, embedding (512D), and positional encoding (added to embeddings) prepare the input so the model knows both token identity and word order.
3. Multi-head self-attention makes token representations context-aware by letting each token incorporate information from other tokens in the sequence.
4. Residual connections add the sub-layer input back to the sub-layer output, and layer normalization stabilizes training by keeping values in a controlled range.
5. Each encoder block applies a two-layer feed-forward network position-wise: 512→2048 with ReLU, then 2048→512 with linear activation.
6. The encoder block output keeps the same per-token dimensionality (512), enabling seamless stacking across multiple blocks.
7. Each encoder block has its own trainable parameters even though the block structure is copied across the stack.