Transformers, the tech behind LLMs | Deep Learning Chapter 5
Based on 3Blue1Brown's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Transformer-based models—behind systems like ChatGPT—turn text into a stream of vectors, mix information across tokens with attention, and then convert the final hidden representation into a probability distribution over every token that could come next. The practical breakthrough is that this “predict-the-next-token” loop can be run repeatedly: feed the model a prompt, sample from its predicted distribution for the next token, append that token, and repeat. What looks like simple next-word prediction becomes coherent multi-sentence generation when the model is large enough and trained well.
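As a rough illustration of that loop (not code from the video), the sketch below assumes a hypothetical `predict_next_token_probs` function standing in for the entire transformer; everything else is just predict, sample, append, repeat.

```python
import random

def generate(prompt_tokens, predict_next_token_probs, max_new_tokens=50):
    """Predict-sample-append loop described above.

    `predict_next_token_probs` is a hypothetical stand-in for the whole model:
    given the token sequence so far, it returns a dict mapping each candidate
    next token to its probability.
    """
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = predict_next_token_probs(tokens)            # distribution over the vocabulary
        candidates = list(dist)
        weights = [dist[t] for t in candidates]
        next_token = random.choices(candidates, weights=weights)[0]
        tokens.append(next_token)                          # feed the sample back in as context
    return tokens
```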
At a high level, the process starts by splitting input into tokens. For text, tokens often correspond to words, word fragments, or common character sequences; for multimodal variants, tokens can represent small image patches or audio fragments. Each token is mapped to a vector via an embedding matrix: a table with one learned vector per vocabulary token, whose dimensions come to encode semantic relationships during training. In GPT-3, the embedding dimension is 12,288 and the vocabulary is 50,257 tokens, implying roughly 617 million parameters just for embeddings. These vectors are not merely “word meanings.” They also carry positional and contextual information once the model processes them through layers.
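To make the arithmetic concrete, here is a small NumPy sketch: the GPT-3 figures above give the embedding parameter count, and a toy-sized matrix (the real one would be several gigabytes) shows what a lookup looks like. The token ids are made up purely for illustration.

```python
import numpy as np

# GPT-3 figures cited above: vocabulary size and embedding dimension.
vocab_size, d_model = 50_257, 12_288
print(vocab_size * d_model)        # 617,558,016 ≈ 617 million embedding parameters

# Toy-sized embedding matrix to show the lookup itself.
rng = np.random.default_rng(0)
W_E = rng.normal(size=(1_000, 16)).astype(np.float32)   # 1,000-token toy vocabulary
token_ids = [464, 72, 318]                               # made-up token ids
embeddings = W_E[token_ids]                              # shape (3, 16): one vector per token
```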
The core mechanism that lets tokens influence one another is attention. The model repeatedly alternates between attention blocks and feed-forward (multi-layer perceptron) blocks. Attention is where vectors “talk” to each other: it updates token representations based on which other tokens in the current context are relevant. A key example is how the word “model” changes meaning depending on whether it appears in “machine learning model” versus “fashion model.” The feed-forward layers then transform each token representation in parallel, applying non-linear computation without direct token-to-token communication.
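The video stays at the conceptual level, but a minimal single-head version of this attention-then-MLP pattern can be sketched as follows; the specific weight matrices and the causal mask are assumptions of the sketch, not details quoted from the transcript.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention: each token vector is updated as a weighted
    sum of value vectors, with weights given by query-key similarity."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])              # relevance of each token to each other
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)             # causal mask: only look at earlier tokens
    return softmax(scores) @ V

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise MLP: applied to every token vector independently,
    with no token-to-token communication."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

# Toy usage: 4 tokens, model dimension 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
updated = self_attention(X, W_Q, W_K, W_V)   # shape (4, 8): context-aware token vectors
```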
After enough layers, the prediction is read off the final vector in the sequence—the representation of the last token in the context window. That vector is projected through an “unembedding” matrix (W_U) into a set of scores—one per vocabulary token—then normalized into probabilities using softmax. Softmax converts arbitrary real-valued scores into a valid probability distribution by exponentiating each score and dividing by the sum of all exponentiated values. Higher scores become more likely, but the distribution remains smooth enough to support sampling rather than always choosing the single most likely token.
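A small NumPy sketch of this last step, using toy dimensions in place of GPT-3’s real ones; the names `W_U` and `last_hidden` follow the description above rather than any library API.

```python
import numpy as np

def softmax(logits):
    """Turn arbitrary real-valued scores into a probability distribution."""
    e = np.exp(logits - logits.max())     # subtract max for numerical stability
    return e / e.sum()

d_model, vocab_size = 16, 1_000           # toy sizes for illustration
rng = np.random.default_rng(0)
W_U = rng.normal(size=(d_model, vocab_size)).astype(np.float32)   # unembedding matrix

last_hidden = rng.normal(size=d_model).astype(np.float32)   # final vector of the last token
logits = last_hidden @ W_U                                   # one score per vocabulary token
probs = softmax(logits)                                      # non-negative, sums to 1
```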
Generation is then a loop. Given a prompt, the model predicts a distribution for the next token, samples from it, appends the sampled token to the prompt, and predicts again. This is why a model can appear to “write” stories: it’s repeatedly sampling from its own next-token predictions. Temperature controls how sharp the distribution is: lower temperature makes the model behave closer to greedy decoding (near-deterministic), while higher temperature increases randomness by flattening probabilities. The transcript notes that some APIs cap temperature (e.g., not above 2) to prevent outputs from degrading into nonsense.
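Temperature is just a division of the scores before softmax. The toy example below (not from the video) shows how the same logits become near-deterministic at low temperature and much flatter at high temperature.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by T before softmax: T < 1 sharpens the distribution
    toward the top score, T > 1 flattens it toward uniform."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

logits = np.array([4.0, 2.0, 1.0, 0.5])
for T in (0.2, 1.0, 2.0):
    print(T, softmax_with_temperature(logits, T).round(3))
# T=0.2: nearly all probability on the top token (close to greedy decoding)
# T=2.0: much flatter distribution, so sampling becomes more random
```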
Finally, the transcript grounds these ideas in the training paradigm of deep learning: large models learn weights by backpropagation, using vast datasets rather than hand-coded rules. GPT-3 is highlighted as an example of scale—175 billion parameters—organized into many matrix multiplications. The embedding and unembedding matrices, attention, and feed-forward transformations all reduce to weighted sums and non-linearities, making the transformer architecture a scalable, trainable engine for next-token prediction and, by extension, the fluent text generation seen across modern AI tools.
Cornell Notes
Transformer models generate text by repeatedly predicting the next token from a fixed-size context window. Input text is tokenized, mapped to vectors with an embedding matrix, then processed through stacked attention blocks and feed-forward layers so tokens can update each other’s meaning. The final hidden representation is projected through an unembedding matrix into scores for every vocabulary token, then normalized with softmax into a probability distribution. Sampling from that distribution—optionally adjusted by temperature—produces the next token, which is appended to the prompt and used for the next prediction. This “next-token” loop is what turns simple probability modeling into coherent multi-sentence output.
Why does “predict the next token” become long-form text generation instead of just one-word guesses?
What role do embeddings play beyond representing words?
How does attention make token meaning depend on context?
What exactly does softmax do in the final step of token prediction?
How does temperature change the model’s behavior during sampling?
Why is the context window size (e.g., 2048 in GPT-3) important?
Review Questions
- How do embedding vectors and attention layers work together to make token meaning context-dependent?
- Explain the full pipeline from tokenization to softmax probabilities to sampling, including where temperature fits in.
- What does the context window limit imply for long conversations, and how does it affect next-token predictions?
Key Points
1. Transformer text generation starts by tokenizing input and mapping tokens to vectors using an embedding matrix.
2. Attention layers let token representations update based on which other tokens are relevant in the current context window.
3. Feed-forward layers transform token vectors in parallel, adding non-linear computation without direct token-to-token communication.
4. The final hidden state is projected through an unembedding matrix into scores for every vocabulary token, then normalized with softmax into probabilities.
5. Generation works by repeatedly sampling the next token from that distribution, appending it to the prompt, and predicting again.
6. Temperature reshapes the probability distribution: lower values make outputs more deterministic; higher values increase randomness and risk incoherence.
7. Deep learning training relies on backpropagation to learn the large set of matrix weights that make these computations effective at scale.