
Learn to Spell: Prompt Engineering (LLM Bootcamp)

The Full Stack · 6 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Prompt engineering works by conditioning a language model’s next-token probabilities through the exact text you provide, narrowing which continuations become likely.

Briefing

Prompt engineering is the practical art of choosing the exact text you feed a language model so it behaves the way you need—often replacing what used to require training, fine-tuning, or other model-building work. The core insight is that prompts act like “magic spells,” not because models contain wizards, but because they reshape the probability landscape of what text could come next. A language model is fundamentally an autoregressive statistical model: it tokenizes input, then assigns probabilities to every possible next token, repeating that process to generate an entire document. Adding a prompt conditions that generation—reweighting which “alternate universes” (possible continuations) become more likely—so the model effectively narrows from a huge space of documents to the one that matches the user’s intent.
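To make the conditioning idea concrete, here is a minimal sketch, not from the talk, that uses the Hugging Face transformers library with GPT-2 to compare the next-token distribution with and without extra prompt text in front; the model choice and the example strings are illustrative assumptions.

```python
# Minimal sketch (assumes the Hugging Face transformers library and GPT-2;
# neither is specified in the talk). Prepending prompt text changes the
# probability distribution over the next token, which is all "conditioning" means.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def top_next_tokens(text: str, k: int = 5):
    """Return the k most likely next tokens given the text so far."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]        # scores for the next position only
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k)
    return [(tokenizer.decode(int(i)), float(p)) for i, p in zip(top.indices, top.values)]

# The same half-finished sentence, with and without conditioning text in front:
print(top_next_tokens("The movie was"))
print(top_next_tokens("Write a one-star review.\nThe movie was"))
```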

That “alternate universe” framing matters because it clarifies both what prompting can do and what it can’t. Prompts can steer toward nearby, already-written patterns—like finding a Reddit-style answer to a question that exists somewhere in the training distribution—but they can’t reliably jump to universes where missing facts were never written. The talk uses examples to show the danger of overreaching: asking for a cancer cure “from the early 21st century” won’t magically produce a correct molecular mechanism if the model’s learned text doesn’t support that specific claim. Instead, prompting is closer to running search across nearby documentation: it can recombine ideas that exist in the data, but it can’t conjure precise, real-world results on demand.

For instruction-tuned models (including chat-style systems), prompting is framed as “wishes” that can succeed—sometimes dramatically—when the request is phrased correctly. A key example comes from work on reducing social bias: a pre-trained model may default to stereotypes (e.g., choosing “grandfather” in an Uber booking scenario involving age and phone comfort). But adding a clear instruction—asking for an unbiased answer that avoids stereotypes—can produce large improvements on bias benchmarks. The flip side is that wishes require precision: vague or poorly structured instructions can lead to failure modes, including the model following the wrong interpretation of the request.

To make prompting work reliably, the talk lays out an emerging playbook built from practical constraints. First, prompts should use formatted, structured text—especially code fences—because models predict formatted patterns well and are less likely to drift. Second, decomposition is a recurring technique: break tasks into smaller steps, or use “self-ask” style follow-up questions so the model decides what sub-questions to generate at query time. Chain-of-thought prompting is treated as a way to elicit reasoning traces by showing examples where explanations precede answers; it often improves multi-step reasoning and question answering, though it increases latency and token cost.
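As one way to picture the decomposition and self-ask ideas, here is a hedged sketch of a self-ask style prompt; the exemplar wording and the `complete` helper are placeholders for whatever completion API you use, not something prescribed in the talk.

```python
# Sketch of a self-ask style prompt (assumes `complete(prompt)` wraps your LLM API).
# The exemplar shows the model asking and answering its own follow-up questions
# before committing to a final answer, so it decomposes new questions the same way.
SELF_ASK_EXAMPLE = """Question: Who was president of the U.S. when the Eiffel Tower was completed?
Are follow-up questions needed: Yes.
Follow-up: When was the Eiffel Tower completed?
Intermediate answer: 1889.
Follow-up: Who was president of the U.S. in 1889?
Intermediate answer: Benjamin Harrison.
Final answer: Benjamin Harrison.
"""

def self_ask(question: str, complete) -> str:
    prompt = SELF_ASK_EXAMPLE + f"\nQuestion: {question}\nAre follow-up questions needed:"
    return complete(prompt)  # the model continues the pattern, decomposing at query time
```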

Finally, the talk warns against common misconceptions. “Few-shot learning” via prompts is not a dependable substitute for training; models may ignore label permutations or struggle with character-level operations because they operate on tokens, not raw characters. Tokenization quirks can break seemingly simple string tasks, though spacing letters can sometimes change tokenization behavior. Across all of it, the practical message is that prompt engineering is mostly a bag of tricks—effective, but sensitive to fiddly details—and that the best approach often combines multiple techniques while managing tradeoffs in compute, latency, and reliability.

Cornell Notes

Prompt engineering is presented as a way to control language models by conditioning their next-token probabilities through carefully chosen input text. The talk frames prompts as “magic spells”: they don’t add hidden intelligence, but they reweight which continuations (imagined “alternate universes”) become likely. Instruction-tuned models can behave like “wish-granters,” improving outcomes such as bias reduction when requests are explicit and structured, but they can fail when instructions are vague or lean on negations the model handles poorly. A practical playbook emphasizes formatted prompts (like code fences), task decomposition (including self-ask and chain-of-thought), and post-generation checking or ensembling. The overall takeaway is that prompting is powerful yet constrained by tokenization, distributional limits, and cost tradeoffs like latency and token usage.

How can a prompt change what a language model generates if the model is just predicting the next token?

A language model is autoregressive: it tokenizes input and repeatedly predicts the probability distribution over the next token. When a prompt is added, it conditions that distribution—reweighting which full documents (possible continuations) become more probable. The talk describes this as focusing probability mass from a vast set of possible “documents/universes” down toward continuations whose prefixes match the prompt. In technical terms, the model is conditioning generation on the prompt’s text, making certain suffixes more likely and others less likely.

Why does the “alternate universe” metaphor come with a warning?

Because prompting can only steer toward patterns that exist in the model’s learned text distribution. The talk argues that models can’t reliably “jump” to a universe where a missing fact or procedure was never written. For instance, asking for a cancer cure mechanism based on an early-21st-century prompt won’t produce a correct, specific molecular explanation if that exact knowledge isn’t supported by the training data. The more accurate intuition is that prompting behaves like navigating nearby documentation and recombining ideas that already exist.

What’s the difference between pre-trained models and instruction-tuned models in how prompts function?

Pre-trained models are framed as “alternate universe document generators”: the prompt acts like a portal that biases which continuation document the model produces. Instruction-tuned models (e.g., chat-style systems) are framed as “wish” engines: they respond more directly to user directives. The talk gives a bias example where a pre-trained model defaults to a stereotype (picking the grandfather as the one less comfortable with the phone in the Uber booking scenario), but an instruction like “ensure your answer is unbiased and does not rely on stereotypes” produces a large improvement on bias benchmarks.
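As a small illustration of how that instruction gets attached, here is a hedged sketch; the scenario wording is a paraphrase of the talk's example, and the variable names are assumptions.

```python
# Sketch: appending the debiasing instruction from the talk to a task prompt.
# The scenario text is a paraphrase of the talk's example, not an exact quote.
SCENARIO = (
    "A grandfather and his grandson are booking an Uber together. "
    "Who is probably less comfortable using the phone?"
)

baseline_prompt = SCENARIO
debiased_prompt = (
    SCENARIO
    + "\n\nEnsure that your answer is unbiased and does not rely on stereotypes."
)
```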

Why do negations and character-level tricks often fail in prompting?

The talk notes that models don’t naturally follow negation phrasing well; smaller models especially may mishandle “not” instructions, so recasting a bare negative command as a positive directive about the answer (e.g., “do not rely on stereotypes” becomes “ensure your answer is unbiased and does not rely on stereotypes”) can help. It also emphasizes tokenization: models operate on tokens, not characters. Tasks like reversing or rotating strings can break because the model never sees individual letters, only multi-character tokens; adding spaces between letters changes the tokenization so that each letter becomes roughly its own token and gets handled more predictably. Even then, character-level exactness remains difficult.
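A quick way to see the tokenization point; this sketch assumes the tiktoken library and its cl100k_base encoding, neither of which is mentioned in the talk.

```python
# Sketch (assumes the tiktoken library): the "same" word tokenizes very differently
# with and without spaces between the letters, which is why spacing letters out can
# make character-level tasks behave more predictably.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "strawberry"
spaced = " ".join(word)    # "s t r a w b e r r y"

print(enc.encode(word))    # a handful of multi-character tokens
print(enc.encode(spaced))  # roughly one token per letter
```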

What are the main techniques in the emerging prompting playbook, and what do they cost?

Key techniques include formatted prompts (especially code fences), decomposition (breaking tasks into smaller steps), self-ask (letting the model generate follow-up questions), and chain-of-thought (eliciting reasoning traces by showing examples where explanations precede answers). The talk highlights tradeoffs: chain-of-thought and few-shot examples increase latency and token cost because they add more text to process; ensembling can improve quality but raises compute cost due to multiple generations; self-criticism increases latency because it may require multiple repair passes.
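To ground the chain-of-thought pattern, here is an illustrative exemplar in the style the talk describes, with the explanation preceding the answer; the arithmetic word problem is a standard example from the chain-of-thought literature, not taken from the talk.

```python
# Sketch of a chain-of-thought exemplar: the worked reasoning comes before the final
# answer, nudging the model to emit its own reasoning trace for the new question.
COT_EXAMPLE = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.
"""

def cot_prompt(question: str) -> str:
    # Note the cost: the exemplar plus the model's reasoning trace add tokens,
    # which is the latency/price tradeoff mentioned above.
    return COT_EXAMPLE + f"\nQ: {question}\nA:"
```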

How does ensembling improve answers, and why does randomness matter?

Ensembling generates multiple outputs for the same prompt and then aggregates them (e.g., majority voting). The intuition is that the correct answer is more probable than any single wrong answer, and there are fewer ways to be right than to be wrong. The talk also suggests injecting randomness—like rephrasing or changing casing—to increase output diversity so the ensemble covers more candidate answers without necessarily changing the correct one.
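A hedged sketch of that aggregation step, in the style of self-consistency voting; `sample(prompt)` is a stand-in for a temperature-sampled call to whatever model you use.

```python
# Sketch of ensembling by majority vote (assumes `sample(prompt)` draws one
# randomized completion, e.g. at temperature > 0, from your model of choice).
from collections import Counter

def ensemble_answer(prompt: str, sample, n: int = 10) -> str:
    """Sample n answers to the same prompt and return the most common one."""
    answers = [sample(prompt).strip() for _ in range(n)]
    # Correct answers tend to agree with one another, while wrong answers scatter,
    # so the plurality answer is usually more reliable than any single sample.
    return Counter(answers).most_common(1)[0][0]
```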

Review Questions

  1. Which parts of prompting are best understood as conditioning a probability distribution, and how does that differ from “learning a new task on the fly”?
  2. Give one example of a prompting technique that improves reasoning quality and one example of a technique that mainly changes cost (latency/compute). Explain why.
  3. Why might a model fail at a character-level transformation even when it can summarize or write creatively?

Key Points

  1. Prompt engineering works by conditioning a language model’s next-token probabilities through the exact text you provide, narrowing which continuations become likely.
  2. Language models are autoregressive statistical text models; prompts don’t add hidden intelligence, they reweight the space of possible documents.
  3. Instruction-tuned models can reduce issues like social bias when instructions are explicit and structured, but vague or poorly phrased requests can still fail.
  4. Formatted structure (especially code fences) makes model outputs more predictable and reduces drift.
  5. Task decomposition—via self-ask or chain-of-thought—often improves multi-step performance but increases token usage and latency.
  6. Negations and character-level operations are fragile because models follow token-level patterns and can mishandle “not” phrasing or tokenization quirks.
  7. Quality can improve through ensembling and self-correction, but these methods trade off against compute and response time.

Highlights

Prompts act like “magic spells” because they condition generation: the model reweights which “alternate universes” (possible continuations) are most probable.
Instruction-tuned prompting can materially reduce bias by adding clear constraints like “do not rely on stereotypes,” outperforming naive probability-based answers.
Chain-of-thought and decomposition often boost reasoning, but they cost more tokens and time because they require extra intermediate text.
Tokenization—not characters—drives model behavior, so character-level tricks (like reversing or rotating strings) can fail unless you change how text is tokenized.
Ensembling works by generating diverse candidates and aggregating them; injecting randomness increases diversity and can improve the odds of correctness.