LLM Foundations (LLM Bootcamp)

The Full Stack · 5 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Modern LLMs are trained to predict the next token, using cross-entropy loss and backpropagation over Transformer parameters.

Briefing

Large language models work because they turn text into numbers, then learn—via gradient-based training—to predict the next token using a Transformer architecture built for parallel computation. That “next-token” setup matters because it unifies many learning styles under one practical training loop: inputs and outputs become vectors, a loss function measures prediction error, and backpropagation adjusts millions to billions of parameters until the model generalizes.
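
To make the setup concrete, here is a toy sketch of how raw text becomes next-token training pairs. The character-level vocabulary is purely illustrative (real LLMs use subword tokenizers), but the shift-by-one target construction is the standard one.

```python
# Toy illustration: text -> integer ids -> (input, target) pairs for
# next-token prediction. Real models tokenize into subwords, not characters.
text = "the blue sundress"
vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}  # char-level vocab
ids = [vocab[ch] for ch in text]

inputs = ids[:-1]   # the model reads tokens 0..n-1
targets = ids[1:]   # and learns to predict tokens 1..n (shift by one)
print(list(zip(inputs, targets))[:3])  # each input id paired with its "label"
```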

The foundation starts with a shift from “software 1.0” to “software 2.0.” Traditional programming encodes explicit rules and edge cases. Machine learning instead trains the behavior from data: the learned program (the lecture’s “robot”) is driven by parameters that are hard to interpret directly. Within that framing, unsupervised, supervised, and reinforcement learning have largely converged in modern practice into self-supervised learning: models learn by predicting missing or future parts of data, and reinforcement-style objectives can often be reformulated as supervised prediction problems.

Neural networks dominate because they can be expressed as matrix multiplications, and GPUs make those operations fast. Training typically uses mini-batches, computes a loss (commonly cross-entropy for classification-style token prediction), and updates parameters through backpropagation. Data is split into training, validation, and test sets: validation loss guides stopping and hyperparameter choices to reduce overfitting, while the test set stays mostly untouched to approximate real-world performance.
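
As a minimal runnable sketch of that loop, the snippet below uses a toy embedding-plus-linear model in place of a real Transformer; all shapes, hyperparameters, and data are illustrative assumptions, not the video’s code.

```python
import torch
import torch.nn.functional as F

vocab_size = 1000
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 64),   # token ids -> vectors
    torch.nn.Linear(64, vocab_size),      # vectors -> next-token logits
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Random token ids standing in for real text, split into mini-batches.
data = torch.randint(0, vocab_size, (256, 33))
batches = [(b[:, :-1], b[:, 1:]) for b in data.split(32)]
train_batches, val_batches = batches[:7], batches[7:]

for epoch in range(3):
    model.train()
    for inputs, targets in train_batches:
        logits = model(inputs)                    # (batch, seq, vocab)
        loss = F.cross_entropy(                   # cross-entropy on next tokens
            logits.reshape(-1, vocab_size), targets.reshape(-1)
        )
        optimizer.zero_grad()
        loss.backward()                           # backpropagation
        optimizer.step()

    model.eval()
    with torch.no_grad():                         # validation loss guides
        val_loss = 0.0                            # stopping and hyperparameters
        for x, y in val_batches:
            logits = model(x)
            val_loss += F.cross_entropy(
                logits.reshape(-1, vocab_size), y.reshape(-1)
            ).item()
    print(f"epoch {epoch}: val loss {val_loss / len(val_batches):.3f}")
```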

At the architecture level, the Transformer—introduced in “Attention Is All You Need” (2017)—is the core engine. Its decoder predicts the next token by repeatedly attending over previously seen tokens. Attention works by comparing token representations: each token becomes a “query,” “key,” and “value,” and the model forms weighted sums of values where weights reflect similarity between queries and keys. Multi-head attention lets the model learn several attention patterns at once (implemented efficiently as projections into multiple subspaces). During training, masking prevents the model from “looking ahead,” ensuring predictions depend only on earlier context.
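
A compact sketch of single-head causal attention under those definitions; the projection matrices and sizes are illustrative, and real implementations batch this across heads and sequences.

```python
import math
import torch

def causal_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention with a causal mask.

    x: (seq, d_model) token representations; w_q, w_k, w_v: (d_model, d_head)
    projections. A sketch of the mechanism, not an optimized implementation.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v               # queries, keys, values
    scores = q @ k.T / math.sqrt(k.size(-1))          # query-key dot products
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))  # hide future positions
    weights = torch.softmax(scores, dim=-1)           # similarity -> weights
    return weights @ v                                # weighted sum of values

seq, d_model, d_head = 5, 16, 8
x = torch.randn(seq, d_model)
out = causal_attention(x, *(torch.randn(d_model, d_head) for _ in range(3)))
# out is (5, 8); row i depends only on tokens 0..i
```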

Because attention itself is order-agnostic, positional encoding injects information about token order. Residual (skip) connections and layer normalization stabilize training and help gradients flow through many layers. A Transformer block also includes a feed-forward network that transforms token representations into more semantic features. Stacking these blocks yields models with varying depth, embedding size, and attention heads; GPT-3 is cited as having 175 billion parameters, 96 layers, and a roughly 12,000-dimensional embedding (12,288 in the published configuration).
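
For reference, here is a minimal pre-norm decoder block in PyTorch showing how the pieces fit together; the sizes are toy assumptions, not any published configuration.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One Transformer decoder block: attention and a feed-forward network,
    each wrapped in a residual (skip) connection plus layer normalization."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(             # feed-forward: widen, apply a
            nn.Linear(d_model, 4 * d_model),  # nonlinearity, project back
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, mask):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a                             # residual around attention
        return x + self.ffn(self.ln2(x))      # residual around the FFN

seq = 10
mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
x = torch.randn(2, seq, 64)                   # (batch, seq, d_model)
pos = torch.randn(seq, 64)                    # positional embeddings
out = Block()(x + pos, mask)                  # inject order, then stack blocks
```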

Modern LLMs differ mainly in training objectives and data pipelines. BERT is encoder-only and uses masked-word prediction, while T5 reframes tasks as text-to-text generation. GPT models are decoder-only and use next-token prediction with causal masking. Tokenization commonly uses byte pair encoding, balancing vocabulary size with the ability to represent rare characters.
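
A toy byte-pair-encoding sketch shows the merge idea; real tokenizers start from bytes and learn tens of thousands of merges over a large corpus, so this is only the core loop.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """BPE repeatedly merges the most frequent adjacent symbol pair."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge(tokens, pair):
    """Replace each occurrence of `pair` with a single fused symbol."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("low lower lowest")   # start from characters (or raw bytes)
for _ in range(4):                  # a few merge rounds
    tokens = merge(tokens, most_frequent_pair(tokens))
print(tokens)  # frequent fragments such as "low" fuse into single tokens
```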

Scaling laws and compute allocation shape performance. DeepMind’s “Training Compute-Optimal Large Language Models” (Chinchilla) argues that many models use too many parameters for the amount of data seen; a smaller model trained on more tokens (70B parameters vs. a much larger baseline) can outperform when compute is held fixed. Meta’s Llama is presented as an open-weight, Chinchilla-optimal family trained on large corpora (including filtered Common Crawl, GitHub, Wikipedia, and books) and released in multiple sizes.

Finally, instruction following and retrieval are treated as separate levers. Instruction tuning uses supervised fine-tuning and often reinforcement learning from human feedback to shift behavior from raw text completion toward following user commands in chat formats. Fine-tuning can improve instruction compliance but may reduce calibration and few-shot abilities (“alignment tax”). Retrieval-enhanced approaches like DeepMind’s RETRO aim to let smaller models fetch factual context from large databases, pointing toward systems that combine reasoning with external knowledge.

Cornell Notes

The core claim is that modern LLMs learn next-token prediction using Transformers, turning text into numeric vectors and training with cross-entropy loss plus backpropagation. Transformers succeed because attention lets each token selectively use information from earlier tokens, while masking enforces causality during training. Residual connections, layer normalization, positional encoding, and feed-forward layers make deep stacks trainable and effective. Scaling matters: compute-optimal training (Chinchilla) favors fewer parameters trained on more tokens, and instruction tuning shifts models from completion to instruction-following via supervised fine-tuning and reinforcement learning from human feedback. These design choices explain why models like GPT, BERT, T5, Llama, and Chinchilla behave differently and how they can be improved.

How does “software 2.0” change what a model is, compared with traditional programming?

Traditional programming encodes explicit logic: a developer writes a robot that handles all input edge cases and is verified with robust tests. In machine learning, the developer instead trains a robot from data. The resulting behavior is controlled by learned parameters (weights) rather than human-written rules, and testing is less about enumerating edge cases and more about measuring generalization on validation/test splits.

Why can modern LLM training be treated as supervised learning even when it sounds like reinforcement learning?

The transcript frames a convergence: many objectives can be reformulated so that inputs map to targets, and learning becomes “predict the next thing” with a loss. For generative tasks, the continuation of the sequence acts like a label (self-supervised). For reinforcement-style tasks, the “next move that maximizes reward” can be cast as a supervised prediction problem over states and actions, so the same numeric input/output and loss machinery applies.

What exactly is attention doing in a Transformer decoder?

Each token representation is projected into three roles: query, key, and value. For a given output position, the model computes attention weights by taking dot products between that position’s query and all keys, then uses those weights to form a weighted sum of the values. Multi-head attention learns multiple such projections in parallel, enabling different attention patterns (e.g., one head tracking syntax-like dependencies, another tracking longer-range context).
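
A sketch of those “projections into multiple subspaces”: one fused linear map produces queries, keys, and values, which are reshaped so each head attends in its own lower-dimensional subspace. All shapes here are illustrative.

```python
import math
import torch

seq, d_model, n_heads = 6, 8, 2
d_head = d_model // n_heads

x = torch.randn(seq, d_model)
w_qkv = torch.randn(d_model, 3 * d_model)    # one fused q/k/v projection
q, k, v = (x @ w_qkv).chunk(3, dim=-1)       # each (seq, d_model)

# Split each projection so every head works in a d_head-dim subspace.
q, k, v = [t.reshape(seq, n_heads, d_head).transpose(0, 1) for t in (q, k, v)]
scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)    # (heads, seq, seq)
weights = torch.softmax(scores, dim=-1)      # one attention pattern per head
out = (weights @ v).transpose(0, 1).reshape(seq, d_model)  # concat the heads
```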

Why is masking necessary during training for GPT-style models?

During training, the model predicts multiple next-token positions simultaneously, which would otherwise allow it to use future tokens. Causal masking forces attention to ignore tokens to the right of the current position. So when predicting a token like “sundress,” the model can only use earlier tokens such as “blue,” not later ground-truth words.
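
The mask itself is just a lower-triangular visibility pattern; here is the four-token case (token names are only for illustration):

```python
import torch

# True marks hidden ("future") positions: row i may attend only to j <= i.
mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)
print(mask)
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
# Predicting position 2 ("sundress" after "the blue") uses columns 0-1 only.
```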

How do BERT, T5, and GPT differ despite sharing the Transformer idea?

BERT is encoder-only and uses masked-word prediction (it can look at the full sequence because attention is not causal). T5 uses an encoder-decoder setup and treats tasks as text-to-text: the input string encodes the task, and the output is a target text string. GPT is decoder-only and uses causal next-token prediction with masking, matching the “predict the next token” generation loop used in chat-style completion.
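
A data-level illustration of the three objectives; the token strings, mask positions, and translation pair below are invented for illustration, not drawn from any real tokenizer or dataset.

```python
sentence = ["the", "blue", "sundress", "was", "on", "sale"]

# BERT-style masked-word prediction: corrupt some positions and predict
# them while conditioning on the full, bidirectional context.
bert_input = ["the", "[MASK]", "sundress", "was", "[MASK]", "sale"]
bert_targets = {1: "blue", 4: "on"}

# T5-style text-to-text: the task itself is spelled out in the input string.
t5_input = "translate English to German: the blue sundress was on sale"
t5_target = "das blaue Sommerkleid war im Angebot"

# GPT-style causal LM: predict each token from its left context only.
gpt_pairs = [(sentence[:i], sentence[i]) for i in range(1, len(sentence))]
# e.g. (['the', 'blue'], 'sundress')
```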

What does “Chinchilla” change about scaling, and why does it matter?

Chinchilla (from DeepMind’s “Training Compute-Optimal Large Language Models”) argues that many published LLMs have too many parameters relative to the number of training tokens. Under a fixed compute budget, better results come from using fewer parameters and training on more data. The transcript cites a 70B-parameter Chinchilla model beating a much larger baseline (Gopher) by training on about 1.4T tokens versus roughly 300B tokens.
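
Using the common back-of-the-envelope rule that training costs roughly 6·N·D FLOPs for N parameters and D tokens (an approximation from the scaling-laws literature, not a figure given in the transcript), the two budgets come out comparable, which is the point: Chinchilla spends similar compute on more data rather than more parameters.

```python
# Rough training cost under the C ~ 6 * N * D approximation.
def train_flops(params, tokens):
    return 6 * params * tokens

chinchilla = train_flops(70e9, 1.4e12)   # 70B parameters, ~1.4T tokens
gopher = train_flops(280e9, 300e9)       # Gopher's published 280B, ~300B tokens

print(f"Chinchilla: {chinchilla:.2e} FLOPs")  # ~5.9e23
print(f"Gopher:     {gopher:.2e} FLOPs")      # ~5.0e23
```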

Review Questions

  1. Explain how masking, positional encoding, and attention interact to ensure GPT-style models generate text left-to-right.
  2. Compare the training objectives of BERT, T5, and GPT and predict how each objective affects what the model can condition on.
  3. Under a fixed compute budget, why might increasing data tokens outperform increasing model parameters?

Key Points

  1. Modern LLMs are trained to predict the next token, using cross-entropy loss and backpropagation over Transformer parameters.

  2. Machine learning replaces explicit rule-writing with data-driven parameter learning, making generalization depend on validation/test discipline.

  3. Transformer attention builds token-to-token dependencies by using dot-product similarity between queries and keys to weight values.

  4. Causal masking is essential for GPT-style training so predictions cannot use future tokens.

  5. Residual connections, layer normalization, and positional encoding are practical mechanisms that make deep attention stacks trainable and order-aware.

  6. Scaling performance depends on how compute is allocated; Chinchilla-style training favors fewer parameters trained on more tokens.

  7. Instruction tuning (supervised fine-tuning and reinforcement learning from human feedback) shifts models toward instruction-following, but can reduce some few-shot behaviors and calibration.

Highlights

Attention turns “which earlier tokens matter” into a weighted sum computed from query-key dot products, enabling selective context use.
Causal masking prevents information leakage during training, ensuring next-token predictions rely only on already-seen text.
Chinchilla’s compute-optimal approach can outperform much larger models by training on substantially more tokens with fewer parameters.
