
Lecture 07: Foundation Models (FSDL 2022)

The Full Stack · 6 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Fine-tuning lets teams reuse large pre-trained models by updating only a small portion of weights, making adaptation faster when labeled data is limited.

Briefing

Foundation models are driving a shift in AI from task-specific systems toward general-purpose models built by scaling architecture, data, and compute—then adapting them through fine-tuning, prompting, and retrieval. The central puzzle is why performance keeps rising even as training demands explode, but the lecture’s through-line is practical: the same core Transformer machinery, scaled and paired with better data and training objectives, powers breakthroughs across language, code, and vision.

The lecture starts with fine-tuning as the bridge from “old school” machine learning to today’s foundation models. Traditional approaches rely on training from scratch, but when data is scarce, models pre-trained on massive corpora can be reused by updating only a small portion of the network—often by adding a few layers or replacing the top layers while keeping most weights frozen. In NLP, the early bottleneck was representing words: one-hot vectors don’t scale and don’t encode semantic similarity. Word embeddings fix this by mapping vocabulary items into dense vector spaces where similarity reflects meaning, learned from co-occurrence patterns.
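
As a rough sketch of this pattern in PyTorch (the ResNet-50 backbone, 10-class head, and learning rate are illustrative stand-ins, not details from the lecture):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a backbone pre-trained on a massive corpus (here: ImageNet weights).
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze every pre-trained weight so only the new head gets updated.
for param in model.parameters():
    param.requires_grad = False

# Replace the top layer with a fresh head sized for the downstream task.
num_classes = 10  # illustrative: our small labeled dataset has 10 classes
model.fc = nn.Linear(model.fc.in_features, num_classes)

# The optimizer only sees the (few) trainable parameters.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```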

From there, the Transformer becomes the key architectural leap. “Attention is all you need” popularized an encoder-decoder design, but the lecture emphasizes the decoder-only setup used by GPT-style models. Self-attention turns each token into a weighted mixture of other tokens, with learned query/key/value projections and positional encoding to restore order. Layer normalization stabilizes training by keeping activations well-behaved across layers. Scaling then matters: modern models use dozens of layers and heads, large embedding dimensions, and huge parameter counts.
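
A minimal sketch of one such block in PyTorch, using the pre-norm arrangement common in GPT-style models; the dimensions are illustrative, and real models stack dozens of these layers:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One pre-norm Transformer block: self-attention + feed-forward."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)  # keeps activations well-behaved
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor, attn_mask: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        # Q, K, and V all come from the same sequence (self-attention).
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask)
        x = x + attn_out                # residual connection
        x = x + self.ff(self.norm2(x))  # second residual around the MLP
        return x

# (Positional encodings would be added to x before the first block.)
L = 16
x = torch.randn(2, L, 512)  # (batch, sequence, embedding dim)
# PyTorch convention: True marks positions a query may NOT attend to.
causal = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
print(DecoderBlock()(x, causal).shape)  # torch.Size([2, 16, 512])
```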

The lecture maps major model families and training regimes. GPT and GPT-2 are decoder-only language models trained to predict the next word, with performance improving as parameters grow. BERT flips the script with an encoder-only masked-word objective. T5 reframes tasks as text-to-text problems using both encoder and decoder components. GPT-3’s 175B parameters are highlighted for “emergent” few-shot and zero-shot behavior, and instruct-style tuning is credited for better instruction following via reinforcement learning with human rankings. Retrieval-augmented approaches like DeepMind’s RETRO aim to reduce reliance on memorized facts by fetching relevant text from large databases at inference time.

A major theme is scaling laws and data efficiency. Chinchilla results suggest many earlier models were “undertrained” for their parameter counts: better performance comes from using more tokens and fewer parameters under a fixed compute budget. That leads to a blunt takeaway—data sets deserve as much attention as model architectures, since optimal performance depends on both.

The lecture then shifts from language to applications. Prompt engineering is treated as an engineering layer over tokenization quirks: byte-pair encoding can break character-level tasks, while “scratch pad” style step-by-step prompting can improve multi-step reasoning. It also warns about prompt injection and “possessed” outputs in real systems.

For code, the lecture highlights filtering and verification as accuracy boosters: AlphaCode generates many candidate solutions, then narrows them using learned models and execution against example tests. In general, separating generation from validation can lift results. It also describes tools like GitHub Copilot and research ideas like WebGPT, where a model consults a text-based web browser to gather evidence before answering questions.
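
A hedged sketch of the generate-then-verify split; the `solve` entry point and the (input, expected output) test format are hypothetical, and `candidates` stands in for samples drawn from a code model:

```python
def passes_tests(program: str, tests: list[tuple[str, str]]) -> bool:
    """Execute a candidate program and check it against example I/O pairs."""
    namespace: dict = {}
    try:
        exec(program, namespace)  # candidate is expected to define solve()
        return all(str(namespace["solve"](inp)) == out for inp, out in tests)
    except Exception:
        return False              # crashing or malformed candidates are dropped

def filter_candidates(candidates: list[str],
                      tests: list[tuple[str, str]]) -> list[str]:
    # Generation and validation are separate steps: sample many programs,
    # then keep only those that actually run correctly on the examples.
    return [c for c in candidates if passes_tests(c, tests)]

candidates = [
    "def solve(s):\n    return s[::-1]",  # correct: reverses the string
    "def solve(s):\n    return s",        # wrong: filtered out by the tests
]
print(filter_candidates(candidates, [("abc", "cba")]))
```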

Finally, the lecture broadens to multimodal foundation models. CLIP aligns image and text embeddings using contrastive training, enabling zero-shot classification and cross-modal search. Image generation is explained through diffusion-based pipelines: models like DALL·E 2 use a diffusion prior to map text embeddings to image embeddings, then a decoder diffusion model to produce pixels. Stable Diffusion is presented as a latent diffusion alternative with open weights and a large public training dataset, fueling rapid community experimentation.
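
The contrastive objective behind CLIP can be sketched as a symmetric cross-entropy over a batch of matched image/text pairs; this is a minimal version of the idea, not OpenAI's implementation:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """img_emb, txt_emb: (batch, dim); row i of each forms a matched pair."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature  # all pairwise similarities
    targets = torch.arange(logits.shape[0])     # i-th image matches i-th text
    # Pull matched pairs together and push mismatched pairs apart, in both
    # the image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

print(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```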

Overall, foundation models unify language, vision, and code under scalable Transformer-based systems, with performance increasingly driven by data quality, training objectives, retrieval, and post-generation verification—turning “prompting” and “fine-tuning” into the practical interfaces for real-world use.

Cornell Notes

Foundation models deliver broad capabilities by scaling Transformers with more data and compute, then adapting them via fine-tuning, prompting, and retrieval. The lecture traces how embeddings and self-attention enable language understanding, and how decoder-only Transformers power GPT-style next-token prediction. Scaling laws (especially Chinchilla) argue that many models were undertrained: optimal performance depends strongly on the amount of training data, not just parameter count. Instruction tuning (e.g., InstructGPT) improves how well models follow user requests, while retrieval-augmented methods (e.g., RETRO) reduce reliance on memorized facts. Across applications—code generation, semantic search, and image generation—accuracy often improves when generation is paired with filtering, verification, or diffusion-based conditioning.

Why did NLP move from one-hot word vectors to embeddings, and what problem do embeddings solve?

One-hot encoding represents each vocabulary item as a length-V vector with a single 1 and the rest 0s, so the full vocabulary occupies an enormous, sparse V×V space and provides no notion of semantic closeness. Embeddings replace this with a dense real-valued vector (often on the order of ~1,000 dimensions) where similar meanings end up near each other in vector space. The lecture notes early embedding training approaches that use co-occurrence statistics—learning to maximize cosine similarity between embeddings of words that appear together frequently. This also enables “vector math” demos (e.g., king − man + woman landing near queen).
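
A toy illustration of the vector-math demo, with hand-crafted two-dimensional embeddings standing in for vectors learned from co-occurrence:

```python
import numpy as np

# Toy embeddings along (royalty, gender) axes, purely for illustration;
# real embeddings are ~1,000-dimensional and learned from corpora.
emb = {
    "king":  np.array([1.0,  1.0]),   # royal, male
    "queen": np.array([1.0, -1.0]),   # royal, female
    "man":   np.array([0.0,  1.0]),   # non-royal, male
    "woman": np.array([0.0, -1.0]),   # non-royal, female
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman lands exactly on queen in this toy space.
candidate = emb["king"] - emb["man"] + emb["woman"]
print(cosine(candidate, emb["queen"]))  # 1.0
```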

How does self-attention work in a Transformer, and what role do query, key, and value play?

Self-attention produces an output vector for each position as a weighted sum of input vectors. The weights come from dot products between vectors: each token’s representation is used in three roles—query, key, and value. Queries and keys determine attention weights (dot products, passed through a softmax so each row sums to 1), while values are what get summed to form the output. The lecture emphasizes that learning the projection matrices for query/key/value is what “learns attention,” and that multi-head attention uses multiple sets of these projections in parallel (even if implemented as a single matrix multiply).
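
A from-scratch NumPy sketch of a single attention head (no masking), showing how the learned projection matrices produce the weights:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: learned (d_model, d_model) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # query-key dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ v                               # weighted sum of the values

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((5, d))                      # 5 tokens, d_model = 8
out = self_attention(x, *(rng.standard_normal((d, d)) for _ in range(3)))
print(out.shape)                                     # (5, 8)
```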

What changes when moving from an encoder-only model like BERT to a decoder-only model like GPT?

Decoder-only models use masked self-attention so a token can only attend to earlier tokens, matching a sentence-completion training objective (predict the next word). Encoder-only models like BERT instead mask random words and train the model to predict the masked tokens, enabling bidirectional context during encoding. The lecture contrasts these training setups and connects them to model behavior: GPT-style systems naturally support next-token generation, while BERT-style systems support representation learning for downstream tasks.
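
A small sketch contrasting the two masking styles; the 15% rate follows BERT's published recipe, and the -1 placeholder for [MASK] is illustrative:

```python
import torch

torch.manual_seed(0)
seq_len = 8

# GPT-style causal mask: True marks the positions token i may attend to
# (itself and everything earlier), matching next-token prediction.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# BERT-style masked language modeling instead corrupts random *input*
# tokens and trains the model to reconstruct them, with full
# bidirectional attention over the rest of the sequence.
tokens = torch.arange(seq_len)
mlm_mask = torch.rand(seq_len) < 0.15         # ~15% of positions hidden
corrupted = tokens.masked_fill(mlm_mask, -1)  # -1 stands in for [MASK]

print(causal)
print(tokens.tolist(), "->", corrupted.tolist())
```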

What does Chinchilla’s scaling-law result imply about how to allocate compute between parameters and training tokens?

Chinchilla experiments trained many models under a fixed compute budget and derived formulas for optimal model size and data size. The key claim is that many published large language models were undertrained: they used too many parameters relative to the number of training tokens. In the example given, Gopher used 280B parameters with 300B tokens, while Chinchilla used 70B parameters with 1.4T tokens and matched or exceeded Gopher’s performance. The implication: to improve under compute constraints, increasing training data can matter as much as (or more than) increasing parameter count.
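
A back-of-the-envelope check of those numbers, using the common approximation that training costs about 6 FLOPs per parameter per token (an assumption consistent with, but not quoted from, the lecture):

```python
def train_flops(params: float, tokens: float) -> float:
    # Rough rule of thumb: ~6 FLOPs per parameter per training token.
    return 6 * params * tokens

gopher     = train_flops(280e9, 300e9)    # 280B parameters, 300B tokens
chinchilla = train_flops(70e9, 1.4e12)    # 70B parameters, 1.4T tokens

print(f"Gopher:     {gopher:.2e} FLOPs")      # ~5.0e23
print(f"Chinchilla: {chinchilla:.2e} FLOPs")  # ~5.9e23: a similar budget,
# spent on ~20 tokens per parameter instead of ~1 token per parameter.
```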

Why does prompt engineering sometimes work like a “hack,” and what tokenization issue is highlighted by the word-reversal example?

The lecture frames prompting as working around how models actually process text: models operate on byte-pair encoded tokens rather than characters or words directly. In the word-reversal demo, GPT-3 fails to reverse strings correctly because the model doesn’t “see” the input at the character level the way humans do. Adding spaces between characters can change token boundaries but may still fail; more robust prompting can include algorithmic steps (e.g., numbering letters, reversing, then merging) to align with tokenization behavior. It also introduces “scratch pad” prompting and “let’s think step by step,” which can improve multi-step reasoning by making intermediate steps explicit.
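
The tokenization effect is easy to reproduce; this sketch assumes the tiktoken library, whose "gpt2" encoding is the BPE vocabulary used by GPT-2 and GPT-3:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

word = "hippopotamus"
pieces = [enc.decode([t]) for t in enc.encode(word)]
print(pieces)  # a handful of multi-letter chunks, not individual characters;
               # the model never "sees" single letters, so reversal is hard

# Spacing the characters out changes the token boundaries entirely,
# which is why that prompting trick sometimes (but not always) helps.
spaced = " ".join(word)
print([enc.decode([t]) for t in enc.encode(spaced)])
```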

How do CLIP and diffusion-based models differ in what they generate or predict?

CLIP maps images and text into a shared embedding space using contrastive training, enabling cross-modal retrieval and zero-shot classification. It does not directly generate images from text. Diffusion-based systems (like DALL·E 2) generate images: they use a diffusion process that starts from noise and learns to denoise step-by-step. DALL·E 2 uses a prior to map text embeddings to image embeddings, then a decoder diffusion model to produce pixels conditioned on those embeddings. Stable Diffusion follows a similar idea but performs diffusion in a latent space for efficiency.
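
A minimal sketch of CLIP-style zero-shot classification; the linear "encoders" here are random placeholders for the real trained image and text towers:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
image_encoder = torch.nn.Linear(2048, 512)  # placeholder for CLIP's image tower
text_encoder = torch.nn.Linear(300, 512)    # placeholder for CLIP's text tower

def zero_shot_classify(image_feat: torch.Tensor,
                       label_feats: torch.Tensor,
                       labels: list[str]) -> str:
    img = F.normalize(image_encoder(image_feat), dim=-1)
    txt = F.normalize(text_encoder(label_feats), dim=-1)
    sims = img @ txt.T                 # cosine similarity to each label prompt
    return labels[int(sims.argmax())]  # nearest text embedding wins

labels = ["a photo of a dog", "a photo of a cat"]
print(zero_shot_classify(torch.randn(1, 2048), torch.randn(2, 300), labels))
```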

Review Questions

  1. What architectural components (attention, positional encoding, normalization) are necessary for a Transformer to handle sequence order and stable training?
  2. How do masked-language modeling (BERT) and next-token prediction (GPT) change what the model learns and how it’s used?
  3. According to the lecture’s Chinchilla discussion, what trade-off between parameters and training tokens tends to yield better performance under fixed compute?

Key Points

  1. Fine-tuning lets teams reuse large pre-trained models by updating only a small portion of weights, making adaptation faster when labeled data is limited.

  2. Embeddings replace one-hot word vectors with dense vectors that capture semantic similarity, often learned from word co-occurrence and trained to make related words close in embedding space.

  3. Transformers rely on self-attention with query/key/value projections plus positional encoding and layer normalization; scaling these components drives much of modern performance.

  4. Decoder-only GPT-style models use masked self-attention and next-token prediction, while BERT-style encoder-only models use random masking and token reconstruction.

  5. Scaling laws matter, not just parameter count: Chinchilla results show that a smaller model trained on more tokens can outperform a larger model that is “undertrained” for its size.

  6. Instruction tuning (e.g., InstructGPT) improves instruction following by training on human-ranked outputs, while retrieval-augmented models (e.g., RETRO) fetch external text to reduce stale or memorized knowledge.

  7. In multimodal systems, CLIP aligns image/text embeddings for retrieval and zero-shot tasks, while diffusion models generate pixels by denoising—often using a text-to-embedding prior plus an image decoder.

Highlights

Self-attention turns each token into a weighted mixture of other tokens, with learned query/key/value projections and positional encoding restoring order.
Chinchilla’s scaling-law takeaway: many earlier large language models were undertrained—more tokens (data) can beat more parameters under fixed compute.
Prompting can exploit tokenization realities; “byte-pair encoding” means character-level instructions may fail unless prompts restructure the task to match token boundaries.
CLIP enables cross-modal search and zero-shot classification by mapping images and text into the same embedding space, without generating images.
DALL·E 2-style pipelines use a diffusion prior to connect text embeddings to image embeddings, then a diffusion decoder to produce pixels conditioned on those embeddings.

Mentioned

  • Sergey
  • Mishka
  • Josh
  • Peter Welinder
  • Simon Willison
  • Riley Goodside
  • Amjad
  • Stanford
  • GPT
  • GPT-2
  • BERT
  • T5
  • LSTM
  • API
  • RL
  • CLIP
  • VQ-GAN
  • IRC
  • JSON
  • CSV
  • VS Code
  • PR
  • GPT-NeoX
  • OPT
  • BLOOM
  • DALL·E
  • RETRO
  • FSDL