Livecoding: Getting Started with LLMs, by Jeremy Howard

TL;DR

MAP@3 rewards ranking the correct option within the top three choices, so systems should optimize for top-k placement rather than single-choice accuracy.

Briefing Cornell Notes

Briefing

The core takeaway is that strong performance on an LLM multiple-choice science benchmark comes less from clever prompting and more from disciplined exploratory data analysis (EDA)—including learning how the evaluation metric can be “gamed” by surface patterns—and from building baselines that mirror how the model will be scored. The walkthrough uses a Kaggle science exam competition dataset generated from Wikipedia passages and GPT-3.5, then demonstrates how to inspect the data, test GPT-3.5 as a baseline, and iterate toward better scoring strategies.

The competition itself is framed around two constraints: it tests whether smaller models can answer science questions under limited compute, and it measures success using mean average precision at 3 (MAP@3)—meaning the correct option must appear somewhere in the model’s top three ranked answers, not necessarily as the single best choice. That scoring rule shifts the optimization target away from “always be correct” toward “rank the right answer highly enough.” The dataset generation process matters just as much: Wikipedia passages are used to create questions, GPT-3.5 generates the multiple-choice items and selects the correct option, and the “correct” answer may reflect GPT-3.5’s interpretation rather than Wikipedia’s factual correctness. This distinction forces a strategy that treats the benchmark as a model-mimicry problem, not a pure knowledge test.

Before writing any modeling code, the session emphasizes doing EDA by hand: manually answering sample questions to understand what kinds of information are required. The presenter finds that many questions hinge on highly specific details rather than general science knowledge. More importantly, repeated attempts reveal that simple heuristics—like choosing answers whose text overlaps with key terms shared across options, or selecting the longest option—can sometimes land the correct answer within the top three. Those observations suggest a practical “cheat surface” for models: they may learn shallow textual patterns that correlate with the correct option, even without deep understanding.

The notebook then establishes baselines by running GPT-3.5 turbo across the dataset. Because API calls are slow and costly, the approach uses concurrency (thread pool execution) and includes a rough cost estimate based on prompt length in tokens/words. The baseline scoring lands around 78 (with partial credit logic tied to whether the correct option appears in first, second, or third rank), which is described as “decent” for a naive setup.

From there, the session tests a more structured prompting approach: a two-step conversational flow where the model first reasons through options and then outputs a ranked set of A–E choices. On a resistivity example where the naive prompt fails, the more deliberate “think then choose” interaction corrects the answer. The walkthrough also warns that chain-of-thought style explanations can be unreliable—models can produce plausible rationalizations that don’t reflect the true reasoning.

Finally, the session explores data generation as an extension of EDA: generating plausible incorrect alternatives using GPT-3.5 so the training set better matches the benchmark’s multiple-choice style. The attempt highlights real-world failure modes—duplicate or malformed alternatives, inconsistent formatting, and the need for validation checks—reinforcing the broader message: LLMs accelerate data work, but the resulting data still requires careful quality control.

Overall, the session argues for a repeatable workflow: inspect the dataset deeply (especially how evaluation rewards ranking), run a baseline that matches the scoring format, mine hard errors to understand whether mistakes come from the model or the dataset, and only then invest in prompt engineering, function calling, or synthetic data generation.

Cornell Notes

The walkthrough targets a Kaggle science multiple-choice benchmark where success depends on ranking: mean average precision at 3 (MAP@3) rewards having the correct option within the model’s top three choices. Because the dataset is generated from Wikipedia passages and GPT-3.5, the “correct” answer can reflect GPT-3.5’s interpretation rather than Wikipedia’s factual truth, so strategy must mimic the benchmark’s data-generating process. The session demonstrates EDA by manually answering questions, uncovering that surface heuristics (like answer-length or shared phrasing) can sometimes correlate with correctness. It then builds baselines using GPT-3.5 turbo, estimates API cost, and improves one failure case by prompting the model to evaluate options before selecting a ranked A–E output. Finally, it experiments with generating plausible incorrect alternatives for synthetic training data, exposing formatting and duplication pitfalls that require validation.

Why does MAP@3 change how you should approach “getting the right answer” in this benchmark?

MAP@3 means the evaluation cares whether the correct option appears in the model’s top three ranked answers, not whether it is the single best choice. That shifts optimization toward ranking quality: a model can score well even if it doesn’t always pick the correct letter first, as long as the correct option lands in positions 1–3. The notebook operationalizes this by scoring whether the correct answer is in the first, second, or third slot and giving partial credit accordingly.

What makes this competition different from a pure “science knowledge” test?

The dataset is created by taking Wikipedia snippets, then using GPT-3.5 to write multiple-choice questions and select the correct option based on the passage it was given. That implies the benchmark’s “correctness” can be tied to GPT-3.5’s interpretation of the passage, and Wikipedia itself may contain errors. The session stresses that maximizing benchmark performance is partly about mimicking the dataset’s data-generating process, not simply learning the most accurate science facts.

What does manual EDA reveal about how models might succeed on these questions?

By answering sample questions directly, the presenter finds that many items require very specific, sometimes niche details rather than general science trivia. The attempts also uncover that shallow text patterns can correlate with correct options—e.g., selecting answers that share distinctive terms across multiple options, or choosing the longest answer when length correlates with correctness. Those heuristics can let a model score without truly understanding underlying science.

How does the notebook build a baseline that matches the evaluation format?

It runs GPT-3.5 turbo over the dataset and then parses the model’s output into a ranked list of A–E choices. A key baseline is “naive prompting” (question + options) and then scoring using the competition’s top-three ranking logic. The notebook also notes practical constraints: API latency and cost, so it uses concurrency and estimates token/word counts to keep the run affordable.

Why can “chain-of-thought” style prompting help, and why is it risky to trust the explanations?

A more structured prompt asks the model to evaluate options and then output a ranked answer, which can fix specific failures (like confusing resistivity with resistance). But the session warns that chain-of-thought explanations are not always faithful: models can generate plausible rationalizations for an incorrect choice. So the explanations may look convincing while the underlying decision process is wrong.

What goes wrong when generating synthetic multiple-choice alternatives with an LLM?

The synthetic generation attempt can produce duplicates, malformed formatting (e.g., extra lines), or inconsistent counts of alternatives (sometimes fewer than expected). The notebook shows that parsing assumptions (like splitting on newlines) can break when output formatting varies. It also highlights that generating “hard” incorrect options is nontrivial: many alternatives can be implausible and therefore too easy, reducing the usefulness of the synthetic data.

Review Questions

How does MAP@3 scoring change the target behavior of a multiple-choice LLM system compared with accuracy?
What role does the dataset’s GPT-3.5 + Wikipedia generation process play in determining what “correct” means for model training and evaluation?
When improving prompts, what evidence in the session suggests that failures may come from dataset labeling/creation rather than model reasoning?

Key Points

1
MAP@3 rewards ranking the correct option within the top three choices, so systems should optimize for top-k placement rather than single-choice accuracy.
2
The benchmark’s “correct answers” are produced via GPT-3.5 over Wikipedia passages, so correctness may reflect GPT-3.5’s interpretation rather than purely factual Wikipedia content.
3
Manual EDA can uncover exploitable correlations (e.g., answer-length or shared phrasing) that let models score without deep understanding.
4
A baseline that matches the evaluation format (ranked A–E output) is essential before investing in prompt engineering or fine-tuning.
5
API-based baselines require cost and latency planning; rough token/word estimates plus concurrency help keep runs feasible.
6
Structured prompting (evaluate options, then rank) can fix specific errors, but chain-of-thought rationales can be unreliable as evidence of true reasoning.
7
Synthetic data generation for multiple-choice alternatives needs validation for formatting, duplication, and plausibility to avoid teaching the model the wrong patterns.

Highlights

MAP@3 reframes the task: the goal is to place the correct option somewhere in the top three, not necessarily to pick it first.

Because GPT-3.5 helped generate the dataset from Wikipedia, “correct” can be benchmark-specific rather than strictly factual.

Simple heuristics discovered during EDA—like choosing the longest option—can sometimes land the correct answer within the top three.

A two-step prompting flow (evaluate options, then output ranked A–E) corrected a resistivity failure case.

Synthetic alternative generation is fragile: duplicates and formatting drift can break parsing and reduce training value.

Topics

LLM EDA
MAP@3 Evaluation
GPT-3.5 Baseline
Prompt Engineering
Synthetic Multiple-Choice Data

Mentioned

Jeremy Howard
LLMs
MAP@3
EDA
API
VRAM
TPU
A100
H100
GPT-3.5
GPT-4
Q-LoRA
LoRA