Transformers, the tech behind LLMs | Deep Learning Chapter 5
Based on 3Blue1Brown's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Transformer-based models—behind systems like ChatGPT—turn text into a stream of vectors, mix information across tokens with attention, and then convert the final hidden representation into a probability distribution over every token that could come next. The practical breakthrough is that this “predict-the-next-token” loop can be run repeatedly: feed the model a prompt, sample from its predicted distribution for the next token, append that token, and repeat. What looks like simple next-word prediction becomes coherent multi-sentence generation when the model is large enough and trained well.
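As a rough illustration of that loop (not code from the video), the sketch below assumes a hypothetical `predict_next_token_probs` function standing in for the entire transformer; everything else is just predict, sample, append, repeat.

```python
import random

def generate(prompt_tokens, predict_next_token_probs, max_new_tokens=50):
    """Predict-sample-append loop described above.

    `predict_next_token_probs` is a hypothetical stand-in for the whole model:
    given the token sequence so far, it returns a dict mapping each candidate
    next token to its probability.
    """
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = predict_next_token_probs(tokens)            # distribution over the vocabulary
        candidates = list(dist)
        weights = [dist[t] for t in candidates]
        next_token = random.choices(candidates, weights=weights)[0]
        tokens.append(next_token)                          # feed the sample back in as context
    return tokens
```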
At a high level, the process starts by splitting input into tokens. For text, tokens often correspond to words, word fragments, or common character sequences; for multimodal variants, tokens can represent small image patches or audio fragments. Each token is mapped to a vector via an embedding matrix: a table with one learned vector per vocabulary token, whose dimensions come to encode semantic relationships during training. In GPT-3, the embedding dimension is 12,288 and the vocabulary is 50,257 tokens, implying roughly 617 million parameters just for embeddings. These vectors are not merely “word meanings.” They also carry positional and contextual information once the model processes them through layers.
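To make the arithmetic concrete, here is a small NumPy sketch: the GPT-3 figures above give the embedding parameter count, and a toy-sized matrix (the real one would be several gigabytes) shows what a lookup looks like. The token ids are made up purely for illustration.

```python
import numpy as np

# GPT-3 figures cited above: vocabulary size and embedding dimension.
vocab_size, d_model = 50_257, 12_288
print(vocab_size * d_model)        # 617,558,016 ≈ 617 million embedding parameters

# Toy-sized embedding matrix to show the lookup itself.
rng = np.random.default_rng(0)
W_E = rng.normal(size=(1_000, 16)).astype(np.float32)   # 1,000-token toy vocabulary
token_ids = [464, 72, 318]                               # made-up token ids
embeddings = W_E[token_ids]                              # shape (3, 16): one vector per token
```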
The core mechanism that lets tokens influence one another is attention. The model repeatedly alternates between attention blocks and feed-forward (multi-layer perceptron) blocks. Attention is where vectors “talk” to each other: it updates token representations based on which other tokens in the current context are relevant. A key example is how the word “model” changes meaning depending on whether it appears in “machine learning model” versus “fashion model.” The feed-forward layers then transform each token representation in parallel, applying non-linear computation without direct token-to-token communication.
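The video stays at the conceptual level, but a minimal single-head version of this attention-then-MLP pattern can be sketched as follows; the specific weight matrices and the causal mask are assumptions of the sketch, not details quoted from the transcript.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention: each token vector is updated as a weighted
    sum of value vectors, with weights given by query-key similarity."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])              # relevance of each token to each other
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)             # causal mask: only look at earlier tokens
    return softmax(scores) @ V

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise MLP: applied to every token vector independently,
    with no token-to-token communication."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

# Toy usage: 4 tokens, model dimension 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
updated = self_attention(X, W_Q, W_K, W_V)   # shape (4, 8): context-aware token vectors
```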
After enough layers, the prediction is read off the final vector in the sequence—the representation of the last token in the context window. That vector is projected through an “unembedding” matrix (W_U) into a set of scores—one per vocabulary token—then normalized into probabilities using softmax. Softmax converts arbitrary real-valued scores into a valid probability distribution by exponentiating each score and dividing by the sum of all exponentiated values. Higher scores become more likely, but the distribution remains smooth enough to support sampling rather than always choosing the single most likely token.
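A small NumPy sketch of this last step, using toy dimensions in place of GPT-3’s real ones; the names `W_U` and `last_hidden` follow the description above rather than any library API.

```python
import numpy as np

def softmax(logits):
    """Turn arbitrary real-valued scores into a probability distribution."""
    e = np.exp(logits - logits.max())     # subtract max for numerical stability
    return e / e.sum()

d_model, vocab_size = 16, 1_000           # toy sizes for illustration
rng = np.random.default_rng(0)
W_U = rng.normal(size=(d_model, vocab_size)).astype(np.float32)   # unembedding matrix

last_hidden = rng.normal(size=d_model).astype(np.float32)   # final vector of the last token
logits = last_hidden @ W_U                                   # one score per vocabulary token
probs = softmax(logits)                                      # non-negative, sums to 1
```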
Generation is then a loop. Given a prompt, the model predicts a distribution for the next token, samples from it, appends the sampled token to the prompt, and predicts again. This is why a model can appear to “write” stories: it’s repeatedly sampling from its own next-token predictions. Temperature controls how sharp the distribution is: lower temperature makes the model behave closer to greedy decoding (near-deterministic), while higher temperature increases randomness by flattening probabilities. The transcript notes that some APIs cap temperature (e.g., not above 2) to prevent outputs from degrading into nonsense.
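Temperature is just a division of the scores before softmax. The toy example below (not from the video) shows how the same logits become near-deterministic at low temperature and much flatter at high temperature.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by T before softmax: T < 1 sharpens the distribution
    toward the top score, T > 1 flattens it toward uniform."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

logits = np.array([4.0, 2.0, 1.0, 0.5])
for T in (0.2, 1.0, 2.0):
    print(T, softmax_with_temperature(logits, T).round(3))
# T=0.2: nearly all probability on the top token (close to greedy decoding)
# T=2.0: much flatter distribution, so sampling becomes more random
```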
Finally, the transcript grounds these ideas in the training paradigm of deep learning: large models learn weights by backpropagation, using vast datasets rather than hand-coded rules. GPT-3 is highlighted as an example of scale—175 billion parameters—organized into many matrix multiplications. The embedding and unembedding matrices, attention, and feed-forward transformations all reduce to weighted sums and non-linearities, making the transformer architecture a scalable, trainable engine for next-token prediction and, by extension, the fluent text generation seen across modern AI tools.
Cornell Notes
Transformer models generate text by repeatedly predicting the next token from a fixed-size context window. Input text is tokenized, mapped to vectors with an embedding matrix, then processed through stacked attention blocks and feed-forward layers so tokens can update each other’s meaning. The final hidden representation is projected through an unembedding matrix into scores for every vocabulary token, then normalized with softmax into a probability distribution. Sampling from that distribution—optionally adjusted by temperature—produces the next token, which is appended to the prompt and used for the next prediction. This “next-token” loop is what turns simple probability modeling into coherent multi-sentence output.
Why does “predict the next token” become long-form text generation instead of just one-word guesses?
What role do embeddings play beyond representing words?
How does attention make token meaning depend on context?
What exactly does softmax do in the final step of token prediction?
How does temperature change the model’s behavior during sampling?
Why is the context window size (e.g., 2048 in GPT-3) important?
Review Questions
- How do embedding vectors and attention layers work together to make token meaning context-dependent?
- Explain the full pipeline from tokenization to softmax probabilities to sampling, including where temperature fits in.
- What does the context window limit imply for long conversations, and how does it affect next-token predictions?
Key Points
1. Transformer text generation starts by tokenizing input and mapping tokens to vectors using an embedding matrix.
2. Attention layers let token representations update based on which other tokens are relevant in the current context window.
3. Feed-forward layers transform token vectors in parallel, adding non-linear computation without direct token-to-token communication.
4. The final hidden state is projected through an unembedding matrix into scores for every vocabulary token, then normalized with softmax into probabilities.
5. Generation works by repeatedly sampling the next token from that distribution, appending it to the prompt, and predicting again.
6. Temperature reshapes the probability distribution: lower values make outputs more deterministic; higher values increase randomness and risk incoherence.
7. Deep learning training relies on backpropagation to learn the large set of matrix weights that make these computations effective at scale.