Phi 2: Small Language Model Better Than 7B LLMs? | Google Colab Tutorial with Python
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Briefing
Microsoft’s Phi-2 (2.7B parameters) is positioned as a test of whether “small” language models can match the useful behavior of much larger 7B–13B systems without the same inference burden. The practical stakes are clear: smaller models can run faster and require less GPU memory, which matters for real deployments and for teams fine-tuning models on custom tasks. But Phi-2 also comes with a major limitation: it’s intended for research use only and isn’t licensed for commercial applications, even though it can be downloaded from Hugging Face.
Phi-2 is an incremental step from Microsoft’s earlier Phi-1.5 (1.3B parameters), roughly doubling the parameter count while relying on training-data strategy to close the performance gap. Microsoft’s key research claim is that scaling laws aren’t only about model size; data quality and targeted synthetic data can drive strong results. The training mix reportedly includes synthetic datasets designed to teach common-sense reasoning and general knowledge, plus a large volume of coding-related questions. Synthetic content was generated using GPT-4 and GPT-3.5, and the dataset was further augmented with carefully selected web data filtered for educational value and content quality, again with GPT-3.5/GPT-4-style filtering. Microsoft also describes “embedding” knowledge from the smaller Phi-1.5 into Phi-2, though the transcript doesn’t detail the mechanism.
On benchmarks, Phi-2’s headline strength is coding performance. Across standard evaluation suites such as SQuAD v2 and MMLU, the transcript notes modest overall gains (roughly 10–15% on some measures), with larger improvements on coding (often cited as 10–20%). In direct comparisons, Phi-2 is described as beating Mistral and Llama 2 models in the 7B class on common-sense reasoning and language understanding, while sometimes trailing 70B-scale systems. The model is also described as performing strongly on math tasks, with coding emerging as the standout.
The tutorial then shifts from benchmark claims to hands-on reality in a Google Colab notebook running on a T4 GPU. Phi-2 is loaded via Hugging Face with Transformers and PyTorch, using flash attention for speed. The model size is still non-trivial (about 5.5GB of storage in the setup described), and the generation settings favor reproducibility (low temperature, a cap on generated tokens) and responsiveness (streaming output), as sketched below.
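As a rough illustration of that setup, here is a minimal loading sketch using Hugging Face `transformers`. The checkpoint ID `microsoft/phi-2` is the published model; the half-precision dtype and `device_map` are assumptions chosen to fit a T4’s 16GB, and the flash-attention flag follows the transcript’s mention (note that FlashAttention-2 officially targets Ampere-or-newer GPUs, so it may need to be dropped on a T4).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Published research checkpoint on the Hugging Face Hub.
MODEL_NAME = "microsoft/phi-2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,                # assumption: half precision to fit a T4's 16GB
    device_map="auto",                        # needs the accelerate package; places weights on the GPU
    attn_implementation="flash_attention_2",  # per the transcript; needs the flash-attn package
    trust_remote_code=True,                   # the checkpoint originally shipped custom model code
)
```

At half precision, 2.7B parameters work out to roughly 5.4GB of weights, which lines up with the ~5.5GB storage figure mentioned above.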
In practice, Phi-2 produces fluent, well-formatted answers on some general prompts (including a comparison of ChatGPT vs. open-source models) and returns correct results on simple arithmetic like “3 + 8 − 2.” However, several coding and structured-reasoning attempts go poorly: it generates incorrect or nonsensical code, keeps “spilling” extra text after completing a task, and fails a reading-comprehension extraction task where, per the transcript, the values it extracts don’t match the source table. Even when it appears to build a dictionary correctly for one of the models in the table, it still outputs inconsistent numbers afterward.
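For reference, a generation call in the spirit the transcript describes (low temperature, capped tokens, streaming output) might look like the sketch below, reusing `model` and `tokenizer` from the loading snippet. The `Instruct:`/`Output:` prompt format comes from the Phi-2 model card; the exact sampling values here are assumptions, not necessarily the tutorial’s settings.

```python
from transformers import TextStreamer

# Instruct-style prompt format from the Phi-2 model card.
prompt = "Instruct: What is 3 + 8 - 2?\nOutput:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
streamer = TextStreamer(tokenizer, skip_prompt=True)  # print tokens as they are generated

model.generate(
    **inputs,
    max_new_tokens=256,                   # cap the response length
    do_sample=True,                       # enable sampling so temperature takes effect
    temperature=0.1,                      # low temperature for near-deterministic output
    pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad-token warning
    streamer=streamer,
)
```

Streaming also makes the “spilling” behavior easy to see: the model may keep emitting text past the useful answer until the token cap or an end-of-sequence token stops it.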
Overall, Phi-2 looks like a promising starting point—especially for fast experimentation and potential fine-tuning on permitted data—but the transcript’s live tests suggest benchmark performance doesn’t automatically translate to reliable correctness under casual prompting. The takeaway is less “small beats big everywhere” and more “small can be strong, but evaluation and prompting discipline still matter.”
Cornell Notes
Phi-2 is Microsoft’s 2.7B-parameter language model built to test whether small models can reach performance near larger 7B–13B systems. Its training strategy emphasizes data quality: synthetic datasets (generated with GPT-4 and GPT-3.5) for common-sense reasoning and general knowledge, heavy coding-related training, and web data filtered for educational value. In benchmark summaries, Phi-2’s biggest gains appear in coding, with smaller but meaningful improvements on broader suites like SQuAD v2 and MMLU. A hands-on Colab run shows Phi-2 can answer clearly and handle simple math, but it can still fail on coding tasks and structured extraction, highlighting that benchmark results may not fully match real-world prompting behavior. The model is research-only and not licensed for commercial use.
Why does Phi-2 matter if larger models already exist?
What training-data approach is credited for Phi-2’s performance?
How does Phi-2 compare to larger open models in the reported benchmarks?
What does the Hugging Face setup emphasize for running Phi-2?
Where did Phi-2 struggle in the hands-on tests?
What’s the practical conclusion about Phi-2 from these experiments?
Review Questions
- What specific data-generation and filtering steps are described as central to Phi-2’s training?
- Which evaluation area is repeatedly described as Phi-2’s strongest (and what evidence is given)?
- In the Colab tests, what kinds of tasks most often produced incorrect or inconsistent outputs?
Key Points
1. Phi-2 (2.7B parameters) is designed to reduce inference and deployment costs compared with 7B–13B models while aiming for comparable usefulness.
2. Phi-2’s training emphasizes data quality: synthetic datasets (from GPT-4 and GPT-3.5) plus web data filtered for educational value.
3. Coding performance is the most consistently highlighted strength in the benchmark summaries.
4. Phi-2 is research-only and not licensed for commercial applications, even though it’s accessible via Hugging Face.
5. Hands-on prompting can yield fluent answers and correct simple math, but coding and structured extraction can still fail or become inconsistent.
6. Running Phi-2 in Colab uses Transformers/PyTorch with flash attention and instruct-style prompt formatting, and the model still requires several gigabytes of storage.