LLM Foundations (LLM Bootcamp)
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Modern LLMs are trained to predict the next token, using cross-entropy loss and backpropagation over Transformer parameters.
Briefing
Large language models work because they turn text into numbers, then learn—via gradient-based training—to predict the next token using a Transformer architecture built for parallel computation. That “next-token” setup matters because it unifies many learning styles under one practical training loop: inputs and outputs become vectors, a loss function measures prediction error, and backpropagation adjusts millions to billions of parameters until the model generalizes.
The foundation starts with a shift from “software 1.0” to “software 2.0.” Traditional programming encodes explicit rules and edge cases. Machine learning instead trains a system that learns new behavior from data, with that behavior driven by parameters that are hard to interpret directly. Within that framing, unsupervised, supervised, and reinforcement learning have largely converged in modern practice into self-supervised learning: models learn by predicting missing or future parts of data, and reinforcement-style objectives can often be reformulated as supervised prediction problems.
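The self-supervised setup needs no hand-labeled data: raw text supplies its own targets by pairing each prefix with the token that follows it. A minimal sketch (the function name is illustrative, not from the video):

```python
def next_token_pairs(token_ids):
    """Self-supervised labels for free: the target at each position is
    simply the next token, so raw text supplies its own supervision."""
    return [(token_ids[:i + 1], token_ids[i + 1])
            for i in range(len(token_ids) - 1)]

# A 3-token sequence yields two (context, target) training examples.
pairs = next_token_pairs([5, 9, 2])
```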
Neural networks dominate because they can be expressed as matrix multiplications, and GPUs make those operations fast. Training typically uses mini-batches, computes a loss (commonly cross-entropy for classification-style token prediction), and updates parameters through backpropagation. Data is split into training, validation, and test sets: validation loss guides stopping and hyperparameter choices to reduce overfitting, while the test set stays mostly untouched to approximate real-world performance.
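The cross-entropy loss mentioned above can be written in a few lines for a single prediction: it is the negative log probability the model assigns to the true next token after a softmax over the vocabulary. A hedged, dependency-free sketch (real training computes this over batches with tensor libraries):

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one next-token prediction: negative log of the
    softmax probability assigned to the true token. Uses the log-sum-exp
    trick for numerical stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]
```

With uniform logits over a 4-token vocabulary, the loss is log 4, the entropy of pure guessing; the loss shrinks as the model puts more score on the correct token.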
At the architecture level, the Transformer—introduced in “Attention Is All You Need” (2017)—is the core engine. Its decoder predicts the next token by repeatedly attending over previously seen tokens. Attention works by comparing token representations: each token’s representation is projected into “query,” “key,” and “value” vectors, and the model forms weighted sums of values, with the weights reflecting similarity between queries and keys. Multi-head attention lets the model learn several attention patterns at once (implemented efficiently as projections into multiple subspaces). During training, causal masking prevents the model from “looking ahead,” ensuring predictions depend only on earlier context.
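The query/key/value comparison can be sketched as scaled dot-product attention with a causal mask. A toy pure-Python version for intuition (real implementations use batched tensor operations and learned projection matrices, omitted here):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.
    Q, K, V: lists of vectors, one per token. Token i may only attend
    to tokens 0..i, matching GPT-style left-to-right training."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        # Similarity of query i with every *visible* key (j <= i only).
        scores = [sum(qk * kk for qk, kk in zip(q, K[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Weighted sum of the visible values.
        out.append([sum(w * V[j][t] for j, w in enumerate(weights))
                    for t in range(len(V[0]))])
    return out
```

Because of the mask, the first token can only attend to itself, so its output is exactly its own value vector; later tokens blend earlier values according to query-key similarity.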
Because attention itself is order-agnostic, positional encoding injects information about token order. Residual (skip) connections and layer normalization stabilize training and help gradients flow through many layers. A Transformer block also includes a feed-forward network that transforms token representations into more semantic features. Stacking these blocks yields models with varying depth, embedding size, and attention heads; GPT-3, for example, has 175 billion parameters, with 96 layers and an embedding dimension of 12,288.
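The positional-encoding idea can be illustrated with the sinusoidal scheme from the original Transformer paper: each position gets a distinctive pattern of sines and cosines at geometrically spaced wavelengths, which is added to the token embeddings. A sketch, not an exact reimplementation of any particular model:

```python
import math

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encodings from "Attention Is All You Need".
    Even dimensions use sin, odd dimensions use cos, with wavelengths in
    a geometric progression so the model can recover token order."""
    enc = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        enc.append(row)
    return enc
```

Position 0 encodes as alternating 0s and 1s (sin 0 and cos 0), and every later position gets a unique, smoothly varying fingerprint.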
Modern LLMs differ mainly in training objectives and data pipelines. BERT is encoder-only and uses masked-word prediction, while T5 reframes tasks as text-to-text generation. GPT models are decoder-only and use next-token prediction with causal masking. Tokenization commonly uses byte pair encoding, balancing vocabulary size with the ability to represent rare characters.
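One greedy step of byte pair encoding can be sketched as: count adjacent symbol pairs across the corpus, then merge the most frequent pair into a new vocabulary symbol. A toy version (real tokenizers iterate this many times over byte-level corpora and store the resulting merge ranks):

```python
from collections import Counter

def most_frequent_pair(corpus_tokens):
    """Count adjacent symbol pairs across tokenized words; BPE greedily
    merges the most frequent pair into a new vocabulary symbol."""
    pairs = Counter()
    for word in corpus_tokens:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(corpus_tokens, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = []
    for word in corpus_tokens:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged
```

Repeating this until a target vocabulary size is reached is what balances vocabulary size against the ability to fall back to rare characters.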
Scaling laws and compute allocation shape performance. DeepMind’s “Training Compute-Optimal Large Language Models” (the Chinchilla paper) argues that many models use too many parameters for the amount of data seen; when compute is held fixed, a smaller model trained on more tokens can win: the 70B-parameter Chinchilla outperformed the far larger Gopher (280B parameters). Meta’s Llama is presented as an open-weight, Chinchilla-optimal family trained on large corpora (including filtered Common Crawl, GitHub, Wikipedia, and books) and released in multiple sizes.
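The Chinchilla trade-off can be made concrete with two common rules of thumb: roughly 20 training tokens per parameter for compute-optimal training, and roughly 6 FLOPs per parameter per training token. Both are approximations of the paper’s fitted scaling laws, not exact prescriptions:

```python
def chinchilla_tokens(params):
    """Rule of thumb from the Chinchilla results: train on ~20 tokens
    per parameter for compute-optimal performance (an approximation of
    the paper's fitted scaling law, not an exact prescription)."""
    return 20 * params

def training_flops(params, tokens):
    """Standard estimate: ~6 FLOPs per parameter per training token."""
    return 6 * params * tokens
```

Plugging in 70 billion parameters gives about 1.4 trillion tokens, which matches the corpus size Chinchilla was actually trained on.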
Finally, instruction following and retrieval are treated as separate levers. Instruction tuning uses supervised fine-tuning and often reinforcement learning from human feedback to shift behavior from raw text completion toward following user commands in chat formats. Fine-tuning can improve instruction compliance but may reduce calibration and few-shot abilities (“alignment tax”). Retrieval-enhanced approaches like DeepMind’s RETRO aim to let smaller models fetch factual context from large databases, pointing toward systems that combine reasoning with external knowledge.
Cornell Notes
The core claim is that modern LLMs learn next-token prediction using Transformers, turning text into numeric vectors and training with cross-entropy loss plus backpropagation. Transformers succeed because attention lets each token selectively use information from earlier tokens, while masking enforces causality during training. Residual connections, layer normalization, positional encoding, and feed-forward layers make deep stacks trainable and effective. Scaling matters: compute-optimal training (Chinchilla) favors fewer parameters trained on more tokens, and instruction tuning shifts models from completion to instruction-following via supervised fine-tuning and reinforcement learning from human feedback. These design choices explain why models like GPT, BERT, T5, Llama, and Chinchilla behave differently and how they can be improved.
How does “software 2.0” change what a model is, compared with traditional programming?
Why can modern LLM training be treated as supervised learning even when it sounds like reinforcement learning?
What exactly is attention doing in a Transformer decoder?
Why is masking necessary during training for GPT-style models?
How do BERT, T5, and GPT differ despite sharing the Transformer idea?
What does “Chinchilla” change about scaling, and why does it matter?
Review Questions
- Explain how masking, positional encoding, and attention interact to ensure GPT-style models generate text left-to-right.
- Compare the training objectives of BERT, T5, and GPT and predict how each objective affects what the model can condition on.
- Under a fixed compute budget, why might increasing data tokens outperform increasing model parameters?
Key Points
1. Modern LLMs are trained to predict the next token, using cross-entropy loss and backpropagation over Transformer parameters.
2. Machine learning replaces explicit rule-writing with data-driven parameter learning, making generalization depend on validation/test discipline.
3. Transformer attention builds token-to-token dependencies by using dot-product similarity between queries and keys to weight values.
4. Causal masking is essential for GPT-style training so predictions cannot use future tokens.
5. Residual connections, layer normalization, and positional encoding are practical mechanisms that make deep attention stacks trainable and order-aware.
6. Scaling performance depends on how compute is allocated; Chinchilla-style training favors fewer parameters trained on more tokens.
7. Instruction tuning (supervised fine-tuning and reinforcement learning from human feedback) shifts models toward instruction-following, but can reduce some few-shot behaviors and calibration.