
State of GPT

6 min read

Based on West Coast Machine Learning's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

GPT-style assistants are trained in four serial stages: pre-training, supervised fine-tuning, reward modeling, and RLHF, with pre-training consuming the vast majority of compute.

Briefing

Large language models are built through a pipeline that starts with internet-scale next-token pre-training and then progressively adds human preference signals—turning raw “document completers” into instruction-following assistants. The central takeaway is that most of the capability comes from the pre-training stage, while later steps (supervised fine-tuning and RLHF) mainly reshape behavior toward helpfulness and alignment with what people rank as better outputs.

The training recipe begins with four serial stages: pre-training, supervised fine-tuning, reward modeling, and reinforcement learning from human feedback. Pre-training is the computational heavyweight: it consumes roughly 99% of training compute and uses thousands of GPUs over months. Data is assembled from mixtures such as Common Crawl and C4, plus higher-quality sources like GitHub, Wikipedia, books, and Stack Exchange, sampled according to set proportions. Raw text is converted into token sequences via tokenization methods (e.g., byte pair encoding), then fed into a Transformer that learns by predicting the next token across extremely long contexts.
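For concreteness, here is a minimal sketch of byte pair encoding in practice, using the open-source tiktoken library and its cl100k_base encoding (an illustrative choice; the talk describes BPE generically rather than any specific tokenizer):

```python
# A minimal look at byte pair encoding in practice, using the open-source
# tiktoken library; cl100k_base is just one concrete encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Large language models predict the next token."
token_ids = enc.encode(text)          # text -> list of integer token IDs
print(token_ids)
print(enc.decode(token_ids) == text)  # round-trips back to the original text
```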

Hyperparameters and scale illustrate why parameter count alone is a weak proxy for ability. A model like Llama (about 65B parameters) can outperform a larger model (e.g., GPT-3 at 175B) when trained longer on more tokens—1.4 trillion tokens versus 300 billion in the cited comparison. The talk also emphasized that training loss curves can show sharp spikes without necessarily indicating classic overfitting; rare “bad batches” or unusually difficult sequences can produce large temporary jumps.
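A common rule of thumb (not from the talk itself) estimates training compute as roughly 6 × parameters × tokens in FLOPs; plugging in the cited numbers shows the smaller model actually absorbs more total training compute:

```python
# Rough training-compute comparison using the common C ~= 6 * N * D rule of
# thumb (N = parameters, D = training tokens). The rule is an approximation
# and is not from the talk itself.
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

llama_65b = train_flops(65e9, 1.4e12)   # ~5.5e23 FLOPs
gpt3_175b = train_flops(175e9, 300e9)   # ~3.2e23 FLOPs
print(f"Llama 65B:  {llama_65b:.2e} FLOPs")
print(f"GPT-3 175B: {gpt3_175b:.2e} FLOPs")
# Despite having ~2.7x fewer parameters, Llama sees ~4.7x more tokens,
# so its total training compute is actually higher.
```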

After pre-training, the model’s general representations enable two practical adaptation paths. One is fine-tuning: sentiment classification, for example, shifts from training a task-specific model from scratch to taking a pre-trained Transformer and adapting it with comparatively small labeled datasets. The other is prompting: around the GPT-2 era, models became effective at completing tasks from instructions and examples embedded in the prompt (few-shot prompting), without additional gradient updates.
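As a sketch of the prompting path, here is a hypothetical few-shot prompt for sentiment classification; a base model simply continues the pattern, with no gradient updates (the reviews and labels are invented for illustration):

```python
# A hypothetical few-shot prompt for sentiment classification. A base model
# completes the pattern from in-context examples alone, with no gradient
# updates; the examples here are purely illustrative.
few_shot_prompt = """\
Review: The film was a masterpiece from start to finish.
Sentiment: positive

Review: I walked out halfway through; total waste of money.
Sentiment: negative

Review: The acting was solid but the plot dragged.
Sentiment:"""

# Sent to a completion-style model, the likely next token continues the
# pattern (e.g., " negative"), which is the classification.
print(few_shot_prompt)
```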

To make assistants that reliably follow instructions, the pipeline moves beyond base models. Supervised fine-tuning trains on prompt–ideal-response pairs created by human contractors, using tens of thousands of high-quality examples and continuing language modeling on this curated data. Then reward modeling collects comparative judgments: contractors rank multiple candidate completions for the same prompt, and a reward model learns to predict those rankings. Finally, reinforcement learning uses the reward model as a fixed scorer, updating the assistant so sampled outputs receive higher predicted reward.
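The talk does not spell out the reward-modeling objective, but a common choice for learning from ranked completions is a pairwise, Bradley-Terry-style loss; a minimal PyTorch sketch, offered as illustration rather than the exact recipe:

```python
# Minimal sketch of a pairwise reward-model loss (Bradley-Terry style), a
# common choice for learning from ranked completions; the talk does not
# specify the exact objective, so treat this as illustrative.
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # r_chosen / r_rejected: scalar reward scores the model assigns to the
    # preferred and non-preferred completion of the same prompt.
    # Minimizing -log sigmoid(r_chosen - r_rejected) pushes the reward model
    # to score human-preferred completions higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy example: scores for a batch of 3 ranked pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.5, 1.1])
print(reward_ranking_loss(chosen, rejected))  # lower when chosen >> rejected
```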

Why this extra RLHF step matters: humans tend to prefer RLHF-tuned outputs over base or merely prompted models. The talk offered one intuition—judging quality via comparisons is easier than generating perfect examples—so human feedback can be leveraged more efficiently.

The second half shifted to how to use these assistants effectively. Because Transformers are “token simulators” without human-style inner monologue or self-correction, prompting often needs to supply structure: show-your-work patterns (e.g., “let’s think step by step”), self-consistency via multiple samples, and techniques like tree-of-thought that combine prompt templates with search logic. For harder tasks, tool use and retrieval augmented generation help: load relevant documents into context via embeddings and vector search (e.g., Llama Index), enforce output formats through constraint prompting (e.g., JSON), and consider fine-tuning only after prompt engineering is exhausted. The practical recommendation was to start with the most capable model (notably GPT-4) for best results, then optimize cost and latency later—while keeping expectations grounded in limitations like hallucinations, knowledge cutoffs, and susceptibility to prompt injection and jailbreak attacks.
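One pragmatic, application-level approximation of constraint prompting is to validate the model's output and retry on failure (true constraint prompting masks token probabilities during decoding; `call_model` below is a hypothetical stand-in for any LLM API):

```python
# A pragmatic way to get parseable JSON: validate the model's output and
# retry on failure. True constraint prompting clamps token probabilities
# during decoding; this retry loop is a simpler application-level
# approximation. `call_model` is a hypothetical stand-in for any LLM API.
import json

def get_json(prompt: str, call_model, max_retries: int = 3) -> dict:
    for _ in range(max_retries):
        raw = call_model(prompt + "\nRespond with valid JSON only.")
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: ask again
    raise ValueError("model never produced valid JSON")
```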

Cornell Notes

The core idea is that GPT-style assistants are built in stages: massive next-token pre-training first, then supervised fine-tuning, then RLHF using human preference comparisons. Pre-training teaches general language representations by predicting the next token over huge token counts; later stages mainly reshape behavior toward helpful, truthful, and harmless responses. Reward modeling trains a model to score candidate completions based on ranked human judgments, and reinforcement learning updates the assistant to produce outputs that score higher. This matters because it explains why prompting works for many tasks, but why instruction-following quality often improves substantially after RLHF. It also guides practical usage: spread reasoning across more tokens, use multiple samples or search, and add tools or retrieval when the task exceeds what pure text generation can reliably handle.

Why does pre-training dominate compute, and what does the model learn during that phase?

Pre-training is the computational bottleneck—about 99% of training compute—because it runs on internet-scale data with thousands of GPUs for months. The learning signal is next-token prediction: the Transformer receives tokenized text (converted to integer token IDs via methods like byte pair encoding) and is trained to assign higher probability to the next token at every position in the context. Over time, loss decreases and generated text becomes more coherent and consistent, because the model internalizes patterns of words, punctuation, and structure. The talk also stressed that power isn’t determined only by parameter count; longer training on more tokens can yield stronger models even with fewer parameters (e.g., Llama trained on 1.4T tokens vs GPT-3 on 300B tokens).
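A minimal PyTorch sketch of that objective, with illustrative shapes: the model's logits at every position are scored against the token that actually comes next.

```python
# Sketch of the next-token prediction objective. Shapes are illustrative;
# any Transformer language model produces logits like these.
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 16, 50257
logits = torch.randn(batch, seq_len, vocab)          # model output
tokens = torch.randint(0, vocab, (batch, seq_len))   # input token IDs

# Predict token t+1 from positions up to t: shift targets left by one.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),  # predictions for positions 0..L-2
    tokens[:, 1:].reshape(-1),          # targets are the next tokens
)
print(loss)  # training drives this average negative log-likelihood down
```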

How do supervised fine-tuning and RLHF differ from base-model prompting?

Base models are trained to complete documents; they can be coaxed into tasks via prompting (few-shot examples), but behavior is less reliable. Supervised fine-tuning (SFT) uses curated prompt–ideal-response pairs from human contractors, continuing language modeling on high-quality instruction-following data. RLHF then adds preference optimization: contractors generate multiple completions for the same prompt, rank them, and a reward model learns to predict those rankings. Reinforcement learning uses that reward model to update the assistant so sampled outputs are more likely to be preferred by humans. The result is typically better instruction adherence and overall preference in human comparisons.
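As a toy illustration of the final step (production pipelines use PPO with a KL penalty against the SFT model; this simplified REINFORCE-style loss only shows the core idea of reinforcing high-reward samples):

```python
# Highly simplified policy-gradient step for RLHF (REINFORCE-style). In
# practice PPO with a KL penalty is used; this toy version only shows the
# core idea: raise the log-probability of sampled completions in
# proportion to their reward-model score.
import torch

def rlhf_step(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # logprobs: summed log-probability the policy assigned to each sampled
    #           completion; rewards: the reward model's score for each.
    baseline = rewards.mean()                  # variance-reducing baseline
    return -((rewards - baseline) * logprobs).mean()

logprobs = torch.tensor([-42.0, -37.5, -51.2], requires_grad=True)
rewards = torch.tensor([0.9, 0.2, -0.4])
loss = rlhf_step(logprobs, rewards)
loss.backward()  # gradients push high-reward samples toward higher probability
```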

What causes spikes in training loss curves, and why might they not imply overfitting?

The discussion suggested that sharp spikes can come from unusually bad batches or rare difficult sequences rather than systematic overfitting. Because training aggregates many batches over enormous token counts (e.g., 1.4T tokens), it’s plausible to encounter occasional extreme-loss events—like sampling a very unusual text segment that the model struggles to predict. Such spikes can look dramatic on a plot even though the underlying per-batch losses are simply fluctuating widely around a steadily decreasing trend.
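A toy simulation (all numbers invented purely for illustration) shows how rare outlier batches produce dramatic spikes on an otherwise smoothly decreasing curve:

```python
# Toy simulation: a smoothly decreasing loss with rare "bad batch" outliers
# produces spikes that look alarming but do not reflect overfitting.
import random

random.seed(0)
losses = []
for step in range(1, 10_001):
    loss = 4.0 / (step ** 0.25)           # smooth downward trend
    if random.random() < 0.001:           # ~1 in 1000 batches is pathological
        loss += random.uniform(2.0, 5.0)  # transient spike, then recovery
    losses.append(loss)

print(max(losses), losses[-1])  # a dramatic peak, yet the trend still falls
```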

Why does prompting often need “show your work” or structured reasoning?

Transformers generate tokens without human-style inner monologue, sanity checks, or the ability to revise earlier steps mid-generation. They also can’t “spend more compute per token” in the way a human might; instead, reasoning must be distributed across more tokens. Prompt strategies like “let’s think step by step” encourage the model to allocate intermediate reasoning tokens, which can improve performance on tasks requiring multi-step calculation or logic. For additional robustness, self-consistency samples multiple solutions and selects the best (e.g., majority vote or scoring).
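A minimal sketch of self-consistency, where `sample_answer` is a hypothetical stand-in for one temperature-above-zero model call that returns the extracted final answer:

```python
# Sketch of self-consistency: sample several chain-of-thought completions
# and take a majority vote over the final answers. `sample_answer` is a
# hypothetical stand-in for one stochastic model call.
from collections import Counter

def self_consistent_answer(prompt: str, sample_answer, n_samples: int = 10) -> str:
    cot_prompt = prompt + "\nLet's think step by step."
    answers = [sample_answer(cot_prompt) for _ in range(n_samples)]
    # The most frequent final answer wins: diverse reasoning paths that
    # converge on the same answer are evidence it is correct.
    return Counter(answers).most_common(1)[0][0]
```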

How do retrieval and constraint prompting improve reliability?

Retrieval augmented generation (RAG) helps when the model needs facts from specific documents: relevant text chunks are embedded, stored in a vector store, and fetched at query time to be inserted into the prompt context. This reduces reliance on the model’s internal memory and knowledge cutoff. Constraint prompting improves output format reliability by clamping token probabilities so the model must follow a template—such as forcing JSON output—then only filling in the allowed blanks. Together, these approaches reduce hallucination risk and make outputs easier to parse and use in applications.
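A minimal retrieval sketch (here `embed` is a hypothetical embedding function; production systems use a vector store such as those Llama Index wraps, but the core lookup is nearest-neighbor search over chunk embeddings):

```python
# Minimal RAG retrieval: cosine similarity over document-chunk embeddings.
# `embed` is a hypothetical embedding function returning a 1-D vector.
import numpy as np

def top_k_chunks(query: str, chunks: list[str], embed, k: int = 3) -> list[str]:
    q = embed(query)
    mat = np.stack([embed(c) for c in chunks])  # (n_chunks, dim)
    sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

# The retrieved chunks are then pasted into the prompt ahead of the question,
# so the model answers from supplied text instead of parametric memory.
```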

Review Questions

  1. What are the four serial training stages for GPT-style assistants, and what new data signal does each stage add?
  2. Why is parameter count not a reliable predictor of model capability in the talk’s comparisons?
  3. Give two prompting or system-level techniques that compensate for the model’s lack of self-correction during generation, and explain how each helps.

Key Points

  1. GPT-style assistants are trained in four serial stages: pre-training, supervised fine-tuning, reward modeling, and RLHF, with pre-training consuming the vast majority of compute.

  2. Pre-training learns general language representations via next-token prediction over tokenized internet-scale data; longer training on more tokens can outweigh higher parameter counts.

  3. Base models are document completers; prompting can induce task behavior, but supervised fine-tuning and RLHF are what reliably shift outputs toward instruction-following and human preferences.

  4. RLHF works by training a reward model from human rankings of multiple candidate completions, then using that reward model to reinforce higher-quality generations.

  5. Effective prompting often requires spreading reasoning across more tokens (e.g., “let’s think step by step”), sampling multiple candidates, and selecting or voting on better outputs.

  6. For tasks needing specific facts or strict output structure, retrieval augmented generation and constraint prompting (e.g., JSON templates) improve reliability.

  7. Because LLMs can hallucinate, lack knowledge beyond their cutoff, and are vulnerable to attacks, they’re best used in low-stakes workflows with human oversight and tool-assisted guardrails.

Highlights

Pre-training is the compute-heavy foundation—about 99% of training effort—while later stages mainly reshape behavior using curated instruction data and human preference signals.
A model can be more capable with fewer parameters if it’s trained longer on more tokens (Llama vs GPT-3 comparison).
RLHF turns human preference rankings into a reward model, then uses reinforcement learning to bias generations toward outputs humans prefer.
“Let’s think step by step” and self-consistency work because they compensate for the model’s lack of internal self-correction by forcing more structured intermediate reasoning and/or multiple samples.
Retrieval augmented generation and constraint prompting address two common failure modes: missing facts and unparseable outputs.

Topics

Mentioned

  • Andrej Karpathy
  • GPT
  • RLHF
  • SFT
  • PPO
  • JSON
  • RAG
  • Elo
  • API
  • GPU
  • FLOPs
  • C4
  • IQ