Understanding the Transformer Architecture of LLMs: Attention Is All You Need
Based on AI Researcher's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.
Transformers replace sequential recurrence with self-attention, enabling parallel processing across tokens and faster training.
Briefing
Transformer architecture became a turning point for language modeling because it replaces sequential processing with self-attention, enabling parallel computation while still capturing word-to-word relationships across an entire sentence. That shift matters because older sequence models such as recurrent neural networks, LSTMs, and GRUs process tokens step by step, which slows training and makes long-range dependencies harder to learn efficiently at scale. By contrast, Transformers can be trained faster on large datasets and handle complex language patterns without the same bottleneck of strict temporal recurrence.
The core idea behind the 2017 paper “Attention Is All You Need” (Ashish Vaswani and collaborators at Google Brain, Google Research, and the University of Toronto) is a self-attention mechanism that lets the model weigh how much each token should attend to every other token. The architecture is organized into two main stacks: an encoder and a decoder. In a machine translation example, the encoder reads the source sentence (e.g., English) and the decoder generates the target sentence (e.g., Spanish) one token at a time. Both stacks rely on embeddings to convert tokens into numeric vectors, but they also need positional information so the model can distinguish “Sunset over the ocean” from “Ocean over the sunset.” Positional encoding injects word-order signals into the input embeddings, allowing attention to incorporate both meaning and sequence.
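The paper's sinusoidal positional encoding can be sketched in a few lines of NumPy. This is a minimal illustration of the formulas PE(pos, 2i) = sin(pos/10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos/10000^(2i/d_model)); the function name and toy dimensions are chosen here for illustration.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need':
    even dimensions use sine, odd dimensions use cosine, with wavelengths
    forming a geometric progression from 2*pi to 10000*2*pi."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                 # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                   # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])              # even indices: sin
    pe[:, 1::2] = np.cos(angles[:, 1::2])              # odd indices: cos
    return pe

# Word order enters the model by simple addition:
# inputs = token_embeddings + positional_encoding(seq_len, d_model)
```

Because the encoding is added to (not concatenated with) the embeddings, the attention layers see a single vector per token that carries both meaning and position.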
Inside the encoder, each identical layer contains two sublayers: multi-head self-attention and a position-wise feed-forward network. Multi-head attention runs several attention “heads” in parallel, where each head can focus on different relationships—for instance, one head might connect “Sunset” with “ocean,” while another emphasizes how “over” links the two. Mathematically, attention uses query, key, and value vectors (Q, K, V) to compute attention scores, scales them by the square root of the key dimension, applies a softmax to turn the scaled scores into weights, and then combines the values accordingly. The outputs from all heads are concatenated and projected, then added back to the original representation (a residual connection) and normalized to stabilize training.
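A minimal NumPy sketch of a single attention head makes the Q/K/V mechanics concrete. In the real model, Q, K, and V come from separate learned linear projections of the token representations and several heads run in parallel before their outputs are concatenated; here, for illustration only, one head attends over a toy input used directly as Q, K, and V.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of each query to each key
    # Numerically stable softmax turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights       # weighted combination of values

# Toy self-attention: 3 tokens, dimension 4 (weights would be learned)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(X, X, X)
# Each row of w sums to 1: how much each token attends to every token
```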
The position-wise feed-forward network further transforms each token representation independently at every position using two linear layers with a ReLU activation in between. This design lets the model refine what attention learned, while keeping computation efficient because the same feed-forward structure is applied across positions.
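The feed-forward sublayer described above is just two matrix multiplications with a ReLU in between, applied identically at every position. A small sketch under toy dimensions (the paper uses d_model = 512 and an inner dimension of 2048; the weights here are random stand-ins for learned parameters):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied per position."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # first linear layer + ReLU
    return hidden @ W2 + b2                 # second linear layer back to d_model

# Toy dimensions for illustration
d_model, d_ff, seq_len = 8, 32, 5
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(seq_len, d_model))
y = position_wise_ffn(x, W1, b1, W2, b2)    # same shape as the input
```

Because the same W1, W2 pair is shared across all positions, the layer adds expressive power without any cross-token interaction—that part is attention's job.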
The decoder mirrors the encoder but adds a third sublayer: multi-head attention over the encoder’s output. During generation, the decoder uses masked self-attention, so each position attends only to previously generated tokens, and encoder-decoder attention to align the emerging translation with relevant parts of the source sentence. The final stages map decoder outputs to vocabulary-sized scores via a linear layer, convert them into probabilities with softmax, and select the most likely next token—continuing until an end-of-sequence token appears.
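The final projection-and-softmax step can be sketched as follows. This shows greedy (argmax) selection for simplicity; real systems often use beam search or sampling instead. All names and dimensions here are illustrative, and the projection weights would be learned.

```python
import numpy as np

def next_token(decoder_state, W_vocab, b_vocab):
    """Project the decoder's last state to vocabulary scores,
    softmax into probabilities, and greedily pick the argmax."""
    logits = decoder_state @ W_vocab + b_vocab     # (vocab_size,) raw scores
    probs = np.exp(logits - logits.max())          # stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs)), probs            # greedy next-token choice

# Toy setup: d_model = 4, vocabulary of 6 tokens
rng = np.random.default_rng(2)
W_vocab, b_vocab = rng.normal(size=(4, 6)), np.zeros(6)
state = rng.normal(size=4)
token_id, probs = next_token(state, W_vocab, b_vocab)
# Generation appends token_id and repeats until end-of-sequence
```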
Transformers have since become practical tools for machine translation, text summarization, and tasks like identifying names and places in text, largely because their attention-based design improves parallelization and scalability compared with recurrent approaches.
Cornell Notes
Transformers improved language modeling by using self-attention instead of step-by-step recurrence. That design enables parallel processing across tokens while still learning relationships between distant words. The architecture splits into an encoder and a decoder: the encoder reads the input sequence, and the decoder generates the output sequence token by token. Embeddings convert tokens to vectors, while positional encoding preserves word order so the model can tell “Sunset over the ocean” from “Ocean over the sunset.” Multi-head attention computes weighted interactions using query, key, and value vectors, and position-wise feed-forward networks refine each token representation before the decoder produces vocabulary probabilities via softmax.
Why did recurrent models (RNN, LSTM, GRU) struggle compared with Transformers for large-scale language tasks?
How do embeddings and positional encoding work together in a Transformer?
What does multi-head self-attention actually compute, and why use multiple heads?
What role does the position-wise feed-forward network play after attention?
How does the decoder generate translations step by step while still using attention?
How do the final layers turn decoder outputs into an actual word choice?
Review Questions
- What specific mechanism allows Transformers to capture relationships between all tokens without sequential recurrence?
- Describe the difference between encoder self-attention and decoder encoder-decoder attention in terms of what each attends to.
- Why is positional encoding necessary even though tokens are already embedded into vectors?
Key Points
1. Transformers replace sequential recurrence with self-attention, enabling parallel processing across tokens and faster training.
2. The encoder-decoder design supports tasks like machine translation by encoding the source sequence and generating the target sequence token by token.
3. Token embeddings convert words into fixed-size vectors, but positional encoding is required to preserve word order information.
4. Multi-head attention computes weighted token interactions using query, key, and value vectors, with softmax turning scores into attention weights.
5. Residual connections and normalization help stabilize learning after attention outputs are combined.
6. Position-wise feed-forward networks refine each token representation using two linear layers with ReLU in between.
7. Final vocabulary probabilities come from a linear projection followed by softmax, with generation continuing until an end-of-sequence token.