
Large Language Models explained briefly

3Blue1Brown · 5 min read

Based on 3Blue1Brown's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

Large language models generate text by repeatedly predicting the next word from a probability distribution, often sampling to keep outputs natural.

Briefing

Large language models power chatbots by learning to predict the next word in a sequence—turning that prediction into fluent, context-aware responses. At their core, these models don’t pick a single “correct” next word. Instead, they assign probabilities across all possible next words, then generate text by repeatedly sampling from that distribution. That probabilistic sampling helps outputs sound natural, and it also explains why the same prompt can produce different answers even when the underlying computation is deterministic.
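
A minimal sketch of that sampling step, using a hypothetical five-word distribution rather than real model outputs:

```python
import random

# Hypothetical next-word distribution for a prompt like "The cat sat on the"
# (illustrative probabilities, not actual model values).
next_word_probs = {
    "mat": 0.45,
    "floor": 0.20,
    "couch": 0.15,
    "roof": 0.12,
    "keyboard": 0.08,
}

# Sample from the distribution instead of always taking the most likely word;
# this randomness is why the same prompt can yield different completions.
words = list(next_word_probs)
weights = list(next_word_probs.values())
next_word = random.choices(words, weights=weights, k=1)[0]
print(next_word)
```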

Training starts with massive amounts of text, often scraped from the internet. For a model like GPT-3, the transcript notes that reading the training text nonstop would take over 2,600 years; newer, larger models train on far more. The model itself is a mathematical function with hundreds of billions of parameters (also called weights). No human manually sets these weights. They begin randomly, producing gibberish, and are then refined through training.

The basic training loop is next-word prediction. Each training example is fed in with all words except the last; the model predicts what the final word should be. An algorithm called backpropagation adjusts the parameters to make the true next word more likely while pushing down the likelihood of other words. Repeating this process across many trillions of examples improves performance not only on training data but also on previously unseen text.
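
A minimal sketch of one such training step, here in PyTorch with a toy linear model standing in for the real network (names and sizes are illustrative, not from the source):

```python
import torch
import torch.nn as nn

# Toy stand-in for a language model: maps a context vector to
# logits over a small vocabulary (real models are vastly larger).
vocab_size, context_dim = 1000, 64
model = nn.Linear(context_dim, vocab_size)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

context = torch.randn(1, context_dim)   # all words except the last, encoded
true_next_word = torch.tensor([42])     # index of the actual final word

logits = model(context)
# Cross-entropy raises the probability of the true next word and
# lowers the probability of every alternative.
loss = nn.functional.cross_entropy(logits, true_next_word)

loss.backward()    # backpropagation computes gradients for every parameter
optimizer.step()   # nudge parameters toward the correct prediction
optimizer.zero_grad()
```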

The computational scale is described as staggering: even if a machine could do one billion additions and multiplications per second, completing the operations needed for training the largest models would take well over 100 million years. That level of compute is only feasible because training uses specialized hardware—GPUs—designed to run many operations in parallel.
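
A quick back-of-envelope check of that illustration, assuming the transcript's hypothetical one-billion-operations-per-second machine:

```python
ops_per_second = 1e9                   # hypothetical machine from the transcript
seconds_per_year = 60 * 60 * 24 * 365  # ~3.15e7 seconds

years = 100e6                          # "well over 100 million years"
total_ops = ops_per_second * seconds_per_year * years
print(f"{total_ops:.2e}")              # ~3.15e24 operations implied in total
```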

A key architectural shift came with the transformer. Before 2017, many language models processed text sequentially, word by word. Transformers instead ingest the whole input at once, enabling parallel processing. They represent each word as a vector of numbers so the training process can operate on continuous values. The transformer’s signature mechanism is attention, which lets these word vectors “talk” to one another and update their meanings based on surrounding context—for instance, adjusting the representation of “bank” depending on whether the context implies a riverbank. A feed-forward neural network adds further capacity to store language patterns.
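
A minimal sketch of the attention mechanism (scaled dot-product attention, omitting the learned query/key/value projections a real transformer applies):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Each word vector is updated as a weighted mix of the others,
    with weights based on how relevant each word is to each other word."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # pairwise relevance scores
    weights = softmax(scores, axis=-1)
    return weights @ V                # context-adjusted word vectors

# Three word vectors (say, "river", "bank", "deposit"), dimension 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
# A real transformer derives Q, K, V from learned projections of X;
# here we use X directly to keep the sketch minimal.
updated = attention(X, X, X)
print(updated.shape)  # (3, 4): same shape, but each vector now reflects context
```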

After multiple iterations of attention and feed-forward processing, the model uses a final function on the last vector to output a probability distribution for the next word. The transcript emphasizes that while the framework is designed, the exact behavior that emerges—what the model predicts in particular situations—is an emergent result of how parameters are tuned during training, making it difficult to pinpoint why specific outputs occur.
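
That final function is typically a softmax, which turns the last vector's raw scores into a valid probability distribution; a minimal sketch with made-up logits:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

# Hypothetical final-layer scores (logits) for a five-word vocabulary.
logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])
probs = softmax(logits)
print(probs, probs.sum())  # nonnegative values summing to 1
```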

Finally, chatbot quality depends on more than pre-training. After auto-completing internet text, models undergo reinforcement learning with human feedback, where workers flag unhelpful or problematic predictions and their corrections reshape the parameters to better match user preferences.
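
The summary doesn't spell out the mechanics, but one common way RLHF pipelines turn human judgments into a training signal is a pairwise preference loss on a reward model; a minimal sketch under that assumption:

```python
import torch
import torch.nn as nn

# Toy reward model: scores a response embedding with a single number
# (a stand-in for the scoring network used in RLHF pipelines).
embedding_dim = 32
reward_model = nn.Linear(embedding_dim, 1)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Hypothetical embeddings of two responses to the same prompt, where
# human reviewers preferred the first over the second.
preferred = torch.randn(1, embedding_dim)
rejected = torch.randn(1, embedding_dim)

# Pairwise (Bradley-Terry) loss: push the preferred response's score
# above the rejected response's score.
loss = -nn.functional.logsigmoid(
    reward_model(preferred) - reward_model(rejected)
).mean()
loss.backward()
optimizer.step()
```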

Cornell Notes

Large language models generate text by predicting the next word from a probability distribution over all possible next words. They are trained on enormous internet-scale corpora using backpropagation to adjust hundreds of billions of parameters so the model becomes more likely to choose the correct next word. Pre-training focuses on next-word prediction, but chatbot behavior improves further through reinforcement learning with human feedback, where human reviewers flag bad outputs and corrections guide the model toward user-preferred responses. Transformers enable efficient training by processing all input tokens in parallel and using attention to update word representations based on surrounding context. The resulting behavior is emergent from parameter tuning, so it’s hard to explain why a specific prediction occurs in a given prompt.

How does a large language model turn “next-word prediction” into a chatbot response?

It repeatedly completes text. Given a prompt, the model computes probabilities for every possible next word. It then samples from that distribution (including lower-probability words) to build a response token by token. Because sampling introduces randomness, the same prompt can yield different outputs across runs even though the underlying computation is deterministic.

What role do parameters (weights) play, and how are they learned?

Parameters are continuous values that determine the model’s probability estimates for the next word. They start randomly, producing gibberish, then get refined through training. For each example, the model sees all words except the last and predicts the missing final word; backpropagation tweaks parameters to increase the likelihood of the true last word and decrease the likelihood of alternatives.

Why is training described as computationally extreme?

The transcript gives a scale illustration: even with one billion additions and multiplications per second, completing the operations needed to train the largest models would take well over 100 million years. This compute burden is made practical by GPUs, which run many operations in parallel.

What changed with transformers compared with earlier language models?

Earlier approaches often processed text sequentially, one word at a time. Transformers ingest the whole input at once, enabling parallel processing. They encode each word as a vector of numbers and use attention so word vectors can update based on context (e.g., “bank” can shift toward “riverbank” depending on surrounding words). A feed-forward neural network adds additional pattern capacity.

How does reinforcement learning with human feedback differ from pre-training?

Pre-training teaches auto-completion on large text corpora by next-word prediction. Reinforcement learning with human feedback then targets chatbot usefulness: workers flag unhelpful or problematic predictions, and those corrections adjust parameters so the model becomes more likely to produce outputs users prefer.

Why is it hard to explain why a model produces a specific output?

Even though researchers design the transformer framework and training objective, the precise behavior emerges from how hundreds of billions of parameters are tuned during training. That makes it difficult to trace a specific prediction back to a simple, human-interpretable rule.

Review Questions

  1. What is the difference between predicting a single next word and producing a probability distribution over next words, and why does that matter for chatbot output?
  2. Describe the training loop for next-word prediction, including what backpropagation optimizes.
  3. Explain how attention in a transformer changes word representations using context, and why this enables parallel processing.

Key Points

  1. Large language models generate text by repeatedly predicting the next word from a probability distribution, often sampling to keep outputs natural.

  2. Model parameters (weights) start randomly and are learned by backpropagation to make the correct next word more likely.

  3. Pre-training focuses on next-word auto-completion from large internet-scale datasets, but chatbot quality requires additional training.

  4. Reinforcement learning with human feedback uses human judgments to penalize unhelpful outputs and steer the model toward user preferences.

  5. Transformers enable parallel processing by ingesting all tokens at once and using attention to update word representations based on context.

  6. Training at the scale required for the largest models demands massive computation, made feasible by GPUs optimized for parallel operations.

  7. Emergent behavior means the exact predictions in specific prompts are difficult to interpret even when the architecture is well defined.

Highlights

  • A large language model assigns probabilities to every possible next word, then builds responses by sampling from that distribution token by token.
  • Even with extreme hypothetical hardware speed, training the largest language models would take well over 100 million years—hence the reliance on GPUs.
  • Transformers replace sequential reading with parallel processing and use attention so word meanings shift based on surrounding context (e.g., “bank” → riverbank).
  • Chatbot usefulness comes not just from pre-training, but from reinforcement learning with human feedback that incorporates human corrections.
  • The model’s behavior is emergent from parameter tuning, making it challenging to explain why specific outputs occur.

Topics

Mentioned

  • GPT-3
  • GPUs