Phi-2, Imagen-2, Optimus-Gen-2: Small New Models to Change the World?

AI Explained · 5 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Phi-2 is a 2.7B-parameter model positioned as small enough for local smartphone use while still outperforming similarly sized models and some much larger ones.

Briefing

Small models are suddenly getting big enough to matter: Microsoft’s Phi-2 (2.7B parameters) is positioned as a smartphone-sized model that can outperform similarly sized competitors and even far larger models, while also benefiting from training choices that may reduce toxicity. Phi-2 was trained in 14 days on fewer than 100 A100 GPUs, using a synthetic-data-heavy recipe built on earlier Phi work. Training used 1.4 trillion tokens, described as roughly five times the data used for Phi-1.5 Web, and the model is open-sourced with a download link.

The Phi-2 story matters because it reframes what “scaling” means in 2024. More parameters and more compute can help, but the bigger claim is about wasted effort: researchers behind the Phi line argue that throwing “kitchen sink” amounts of compute at low-quality or ineffective data can undercut results. Phi-1 and Phi-1.5 were built around permissively licensed code and textbook-quality filtering, then expanded with synthetic exercises and Q&A generated by earlier models. Phi-1.5 reportedly trained on synthetic data, reserving additional filtered web data for a separate Phi-1.5 Web model, and it added synthetic reasoning-style tasks (common sense, logic, science, and theory of mind). Phi-2 continues that approach, with reported toxicity scores dropping across the board before any reinforcement learning from human feedback.

Benchmark comparisons are part of the pitch, but the transcript also pushes back on how much trust to place in benchmark numbers. Phi-2 is compared against models like Gemini Nano and Mistral variants, and the discussion flags a recurring issue: benchmark contamination and evaluation design can make results misleading. The transcript notes prior evidence of contamination checks and points to a Phi-1.5 paper discussion that hinted at the possibility of reaching ChatGPT-like capability at smaller scales. Still, there’s a practical caution: performance may be sensitive to prompt wording and length, with models sometimes ignoring or misreading parts of longer prompts.
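To make the contamination worry concrete, the kind of check alluded to above can be sketched as a simple n-gram overlap scan between benchmark questions and training text. The function below is a generic illustration of the idea, not the specific procedure used by the Phi authors or any benchmark maintainer; the 13-token window is an assumed default.

```python
# Crude n-gram overlap scan between benchmark questions and training text.
# A generic sketch of the idea, not the exact procedure used by any paper.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_questions: list[str],
                       training_docs: list[str],
                       n: int = 13) -> float:
    """Fraction of benchmark questions sharing at least one n-gram with the training data."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for q in benchmark_questions if ngrams(q, n) & train_grams)
    return flagged / max(len(benchmark_questions), 1)

# A high flagged fraction would suggest the benchmark partly measures memorization,
# which is exactly why small accuracy gaps between models can be misleading.
```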

The broader theme extends beyond Phi-2. Microsoft’s prompting system is cited as reaching 90.1% on the MMLU benchmark, and the transcript argues that MMLU itself is flawed—especially when used for fine-grained comparisons to two decimal places. A detailed critique follows: human grading found missing answer context, option formatting problems, incorrect source material, and even dev-set leakage that can cause models to learn the wrong answers as if they were correct. Examples span business ethics, chemistry, virology, economics, and philosophy, including cases where option order was mixed up, answers weren’t even valid options, or questions were ambiguous enough to produce multiple plausible answers.
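Some of the flaws listed above, such as gold answers that are not valid options or duplicate and blank options, can be caught mechanically, while others (missing context, incorrect source material) require human grading. Below is a minimal sketch of those mechanical checks, assuming a simple question/options/answer item format rather than MMLU’s actual file layout.

```python
# Sketch of mechanical sanity checks for multiple-choice benchmark items;
# the item schema here is assumed, not MMLU's actual format.
def item_problems(item: dict) -> list[str]:
    """item: {'question': str, 'options': list[str], 'answer': int (gold option index)}"""
    problems = []
    if not item["question"].strip():
        problems.append("empty question")
    if len(item["options"]) != len({o.strip().lower() for o in item["options"]}):
        problems.append("duplicate options")
    if not (0 <= item["answer"] < len(item["options"])):
        problems.append("gold answer is not a valid option index")
    if any(not o.strip() for o in item["options"]):
        problems.append("blank option")
    return problems

# Hypothetical broken item for illustration:
bad = {"question": "Which statement best reflects the argument?",
       "options": ["A claim", "A claim", ""], "answer": 3}
print(item_problems(bad))
# ['duplicate options', 'gold answer is not a valid option index', 'blank option']
```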

Finally, the transcript ties the “small model” momentum to other releases: Google’s Imagen 2 is described as available via API with diffusion-based image generation, watermarking, and strong photo realism; and Tesla’s Optimus Gen 2 is mentioned as a robotics milestone. Taken together, the message is that 2024 may be shaped less by ever-larger models alone and more by better data pipelines, smarter prompting, and more honest evaluation—because the difference between progress and noise can hinge on what’s in the dataset and how the test is built.

Cornell Notes

Phi-2 (2.7B parameters) is presented as a “small” model that can run locally on a smartphone and still compete with larger systems. It was trained in 14 days on fewer than 100 A100 GPUs using 1.4 trillion tokens, with heavy reliance on synthetic, filtered, textbook-quality data built from earlier Phi methods. Reported toxicity scores drop across the board before any reinforcement learning from human feedback, suggesting synthetic data may be cleaner. The transcript also warns that benchmark results—especially MMLU—can be distorted by contamination, missing context, incorrect source answers, and dev-set issues, making tiny accuracy differences potentially misleading. The practical takeaway: model performance may be strong, but evaluation design and prompt sensitivity can decide whether those gains are real.

What makes Phi-2 “small” yet potentially impactful, according to the transcript?

Phi-2 is a 2.7 billion parameter model described as small enough to fit locally on a smartphone. Despite that size, it’s reported to outperform other models of comparable scale (including ones trained with Mamba and Google’s Gemini Nano) and even models 25× larger. The transcript frames the impact as coming from both architecture scale (more parameters and compute) and training strategy (synthetic, filtered, textbook-quality data).

How did the Phi line build its training data pipeline before Phi-2?

The transcript traces Phi-1 and Phi-1.5: researchers retrieved permissively licensed open code (roughly ten times more than was ultimately used), extracted the Python code, filtered out duplicates, and used GPT-4 as a filter to label textbook-quality code (judging things like comments and good syntax). A small classifier then finished the labeling to keep costs down. GPT-3.5 generated diverse synthetic textbook-quality data and synthetic Q&A. For Phi-1.5, synthetic exercises expanded into common sense reasoning, logic, science, and theory of mind, and training relied on synthetic data while additional filtered web data was used for a separate Phi-1.5 Web model.
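A minimal sketch of that two-stage labeling idea, where an expensive model labels a small sample and a cheap classifier extends the labels to the rest of the corpus, is shown below. The `llm_label` stub, the TF-IDF/logistic-regression classifier, and the thresholds are illustrative assumptions, not details from the Phi papers.

```python
# Two-stage "textbook quality" filter sketch: expensive LLM labels a sample,
# a cheap classifier scales the labels to the full corpus. Illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def llm_label(snippet: str) -> int:
    """Hypothetical call to an expensive model (e.g. GPT-4) returning
    1 for 'textbook quality' code (clear comments, good syntax), else 0."""
    raise NotImplementedError("plug in your LLM client here")

def build_quality_filter(corpus: list[str], sample_size: int = 1000):
    # Stage 1: label only a small, affordable sample with the expensive model.
    sample = corpus[:sample_size]
    labels = [llm_label(snippet) for snippet in sample]

    # Stage 2: train a cheap classifier to extend those labels to everything else.
    clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
                        LogisticRegression(max_iter=1000))
    clf.fit(sample, labels)
    return clf

def filter_corpus(corpus: list[str], clf, threshold: float = 0.5) -> list[str]:
    # Keep only snippets the classifier scores as likely textbook quality.
    probs = clf.predict_proba(corpus)[:, 1]
    return [s for s, p in zip(corpus, probs) if p >= threshold]
```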

What training scale did Phi-2 reportedly use, and why does it matter?

Phi-2 is said to have been trained in 14 days on fewer than 100 A100 GPUs, using 1.4 trillion tokens, described as about five times the data used for Phi-1.5 Web. The transcript links this to the idea that more compute enables more passes over the data (more epochs, likened to rereading). It also notes that the model is open-sourced with a download link, and that synthetic-data training may reduce toxicity before any human-feedback reinforcement.
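As a rough sanity check on those numbers, the common 6 × parameters × tokens FLOPs rule of thumb can be used to see whether 1.4 trillion tokens on a bit under 100 A100s in about two weeks is plausible. The GPU count and utilization below are assumptions for illustration, not figures from the transcript.

```python
# Back-of-the-envelope training estimate using the ~6 * params * tokens FLOPs
# rule of thumb; GPU count and utilization are assumptions for illustration.
params = 2.7e9             # Phi-2 parameters
tokens = 1.4e12            # reported training tokens
gpus = 96                  # "fewer than 100" A100s (assumed)
a100_bf16_flops = 312e12   # peak BF16 throughput of one A100, FLOP/s
utilization = 0.6          # assumed (optimistic) model FLOPs utilization

total_flops = 6 * params * tokens                    # ~2.3e22 FLOPs
cluster_flops = gpus * a100_bf16_flops * utilization
days = total_flops / cluster_flops / 86_400

print(f"total compute: {total_flops:.2e} FLOPs")
print(f"estimated wall-clock: {days:.1f} days")      # ~15 days under these assumptions,
                                                     # in line with the reported two weeks
```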

Why does the transcript argue that MMLU benchmark numbers can be unreliable?

It highlights multiple failure modes: missing answer context (e.g., business ethics questions missing vital statements), incorrect or low-quality source material, option-order mixups, formatting ambiguity, grammatical ambiguity, and cases where the “correct” answer isn’t even among the options. It also claims dev-set contamination: if a question in the dev set has an incorrect answer, models can learn the wrong mapping as ground truth. Because models are now judged at very high accuracy levels, even 1–3% error can matter when comparisons are made to two decimal places.

What practical caution does the transcript give about using model results?

It warns that Phi-family models may be sensitive to prompt variations. As prompt length increases, models can forget, ignore, or misinterpret parts of the prompt, which means benchmark success may not transfer cleanly to every real-world prompting style. The transcript also implies that prompt and evaluation design can change outcomes enough to affect conclusions about capability.
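One practical response to that warning is to score the same questions under several prompt templates and compare the spread. The sketch below assumes a hypothetical `ask_model` client and made-up templates; a large accuracy spread across templates would suggest that benchmark numbers may not transfer to other prompting styles.

```python
# Prompt-sensitivity check sketch: score identical questions under several
# templates and compare accuracy. `ask_model` is a hypothetical stand-in
# for whatever inference client you actually use.
TEMPLATES = [
    "Answer with a single letter.\n{q}",
    "You are a careful exam taker. Read the question twice, then answer.\n{q}",
    "{q}\n\nThink step by step, then give only the final letter.",
]

def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def accuracy_by_template(items: list[dict]) -> dict[str, float]:
    """items: [{'q': question text, 'gold': correct letter}, ...]"""
    results = {}
    for template in TEMPLATES:
        correct = 0
        for item in items:
            answer = ask_model(template.format(q=item["q"])).strip().upper()[:1]
            correct += answer == item["gold"]
        results[template] = correct / max(len(items), 1)
    return results
```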

Review Questions

  1. What specific data-generation and filtering steps were used in the Phi-1 / Phi-1.5 pipeline before reaching Phi-2?
  2. List at least three distinct ways the transcript claims MMLU questions can be flawed (e.g., missing context, incorrect sources, option-order issues).
  3. Why might a model’s performance on a benchmark not translate directly to real tasks, according to the prompt-sensitivity warning?

Key Points

  1. Phi-2 is a 2.7B-parameter model positioned as small enough for local smartphone use while still outperforming similarly sized models and some much larger ones.
  2. Phi training emphasizes synthetic, filtered, textbook-quality data built through GPT-4 labeling and GPT-3.5 generation, aiming to improve data quality rather than just scale compute.
  3. Phi-2 reportedly trained in 14 days on under 100 A100 GPUs using 1.4 trillion tokens, with reported toxicity reductions before any reinforcement learning from human feedback.
  4. Benchmark results—especially on MMLU—can be distorted by contamination, missing context, incorrect source answers, formatting/ambiguity issues, and dev-set problems that teach models wrong answers.
  5. Model performance may depend heavily on prompt wording and length, with longer prompts increasing the risk of ignoring or misreading instructions.
  6. The transcript links the “small model” momentum to broader releases like Imagen 2 (API availability, watermarking, diffusion-based generation) and Optimus Gen 2 (robotics progress).

Highlights

Phi-2 is framed as a smartphone-sized 2.7B model that can outperform comparable models and even some 25× larger systems.
A central critique targets MMLU: missing context, incorrect sources, option-order mixups, and dev-set leakage can all undermine fine-grained accuracy comparisons.
Synthetic-data training is presented as potentially cleaner, with toxicity scores dropping across the board before human-feedback reinforcement.
Prompt sensitivity is flagged as a practical risk: performance can shift when prompts change in wording or length.

Topics

Mentioned

  • Satya Nadella
  • Ronen Eldan
  • Sébastien Bubeck
  • MMLU
  • A100
  • H100
  • epochs
  • API
  • GPT
  • LLM
  • AGI