
Phi 2: Small Language Model Better Than 7B LLMs? | Google Colab Tutorial with Python

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Phi-2 (2.7B parameters) is designed to reduce inference and deployment costs compared with 7B–13B models while aiming for comparable usefulness.

Briefing

Microsoft’s Phi-2 (2.7B parameters) is positioned as a test of whether “small” language models can match the useful behavior of much larger 7B–13B systems—without the same inference burden. The practical stakes are clear: smaller models can run faster and require less GPU memory, which matters for real deployments and for teams fine-tuning models on custom tasks. But Phi-2 also comes with a major limitation: it’s intended for research use only and isn’t available for commercial applications, even if it can be downloaded from Hugging Face.

Phi-2 is an incremental step from Microsoft’s earlier Phi-1.5 (1.3B parameters), doubling parameter count while relying on training-data strategy to close the performance gap. Microsoft’s key research claim is that scaling laws aren’t only about model size; data quality and targeted synthetic data can drive strong results. The training mix reportedly includes synthetic datasets designed to teach common-sense reasoning and general knowledge, plus a large volume of coding-related questions. Synthetic content was generated using GPT-4 and GPT-3.5, and the dataset was further augmented with carefully selected web data filtered for educational value and content quality—again with help from GPT-3.5/GPT-4-style filtering. Microsoft also describes “embedding” knowledge from the smaller Phi-1.5 into Phi-2, though the transcript doesn’t detail the mechanism.

On benchmarks, Phi-2’s headline strength is coding performance. Across standard evaluation suites such as SQuAD v2 and MMLU, the transcript notes modest overall gains (roughly 10–15% on some measures), but larger improvements on coding (often cited as 10–20%). In direct comparisons, Phi-2 is described as beating Mistral and Llama 2–class 7B models on common-sense reasoning and language understanding, while sometimes trailing larger 70B systems. The model is also described as performing strongly on math tasks, with coding emerging as the standout.

The tutorial then shifts from benchmark claims to hands-on reality in a Google Colab notebook using a T4 GPU. Phi-2 is loaded via Hugging Face with Transformers and PyTorch, using flash attention for speed. The model size is still non-trivial—about 5.5GB of storage in the setup described—and generation settings emphasize reproducibility (low temperature, capped tokens, and streaming output).
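The ~5.5GB figure is consistent with simple arithmetic: 2.7B parameters stored as 16-bit floats take roughly 5.4GB before tokenizer and config files. A quick sanity check:

```python
# Back-of-the-envelope check of the reported ~5.5GB storage footprint:
# 2.7B parameters at 2 bytes each (float16) comes out to about 5.4GB,
# before tokenizer files and other repository contents.
params = 2.7e9
bytes_per_param = 2  # float16 weights
size_gb = params * bytes_per_param / 1e9
print(round(size_gb, 1))  # → 5.4
```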

In practice, Phi-2 produces fluent, well-formatted answers on some general prompts (including a comparison of ChatGPT vs open-source models) and can return correct results on simple arithmetic like “3 + 8 − 2.” However, several coding and structured-reasoning attempts go poorly: it generates code that is incorrect or nonsensical, continues “spilling” extra text after completing tasks, and fails at a reading-comprehension extraction task where the transcript claims the model’s extracted values don’t match the table. Even when the model seems to build a dictionary correctly for one model, it still outputs inconsistent numbers afterward.

Overall, Phi-2 looks like a promising starting point—especially for fast experimentation and potential fine-tuning on permitted data—but the transcript’s live tests suggest benchmark performance doesn’t automatically translate to reliable correctness under casual prompting. The takeaway is less “small beats big everywhere” and more “small can be strong, but evaluation and prompting discipline still matter.”

Cornell Notes

Phi-2 is Microsoft’s 2.7B-parameter language model built to test whether small models can reach performance near larger 7B–13B systems. Its training strategy emphasizes data quality: synthetic datasets (generated with GPT-4 and GPT-3.5) for common-sense reasoning and general knowledge, plus heavy coding-related training, and filtered web data for educational value. In benchmark summaries, Phi-2’s biggest gains appear in coding, with smaller but meaningful improvements on broader suites like SQuAD v2 and MMLU. A hands-on Colab run shows Phi-2 can answer clearly and handle simple math, but it can still fail on coding tasks and structured extraction, highlighting that benchmark results may not fully match real-world prompting behavior. The model is research-only and not licensed for commercial use.

Why does Phi-2 matter if larger models already exist?

Inference cost is the core driver. Smaller models like Phi-2 can run faster and require less GPU memory than 7B–13B systems, making them more practical for deployment and for fine-tuning on custom tasks. The transcript frames this as a key constraint of large language models: heavy GPU and memory needs for production inference.

What training-data approach is credited for Phi-2’s performance?

The transcript attributes much of Phi-2’s strength to data quality and synthetic augmentation. It describes training on synthetic datasets aimed at common-sense reasoning and general knowledge, including science, daily activities, and theory-of-mind. It also notes a high volume of coding questions and synthetic generation using GPT-4 and GPT-3.5, plus additional web data filtered for educational value and content quality (with GPT-3.5/GPT-4-style filtering).

How does Phi-2 compare to larger open models in the reported benchmarks?

The transcript reports that Phi-2 performs strongly on coding tasks—often cited as 10–20% improvements in some coding-related evaluations. It also claims Phi-2 beats Mistral and Llama 2–class 7B models on common-sense reasoning and language understanding, while being close to 70B results in some areas. It still trails the largest models in certain math/benchmark comparisons.

What does the Hugging Face setup emphasize for running Phi-2?

The notebook uses Transformers and PyTorch, with flash attention to speed up inference. The model is loaded using Hugging Face’s instruct-style prompt formatting (system prompt + user prompt) and generation settings such as low temperature, a max token cap (1,024 max new tokens as stated), and end-of-sequence handling. The transcript also notes the model’s storage footprint is about 5.5GB despite it being only 2.7B parameters.
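As a rough sketch of that setup (the exact template string is an assumption based on Phi-2’s documented "Instruct:/Output:" prompt style, and the values mirror the settings described in the transcript), the prompt construction and generation settings might look like:

```python
# Hypothetical sketch of the instruct-style prompt format and generation
# settings described above; the template is an assumption, not the
# notebook's exact code.
def build_prompt(system_prompt: str, user_prompt: str) -> str:
    # Combine system and user prompts into one instruct-style string.
    return f"Instruct: {system_prompt}\n{user_prompt}\nOutput:"

# Settings mirroring the transcript: low temperature for reproducibility,
# a cap on newly generated tokens, and sampling enabled.
generation_kwargs = {
    "max_new_tokens": 1024,
    "temperature": 0.1,
    "do_sample": True,
}

prompt = build_prompt("You are a helpful assistant.", "What is 3 + 8 - 2?")
print(prompt)
```

These kwargs would then be passed to the model's `generate` call alongside the tokenized prompt.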

Where did Phi-2 struggle in the hands-on tests?

Several coding and structured tasks failed. It produced incorrect code for list-splitting with randomness, returned nonsensical code for fetching Tesla stock prices, and didn’t perform the expected sentiment analysis or rewriting for a tweet prompt. In a reading-comprehension extraction task from a markdown table, the transcript claims Phi-2 output incorrect values (e.g., mismatched comprehension scores), even when it appeared to parse the table into a dictionary.
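For context, the extraction target here is simple enough to solve deterministically, which is why the model's mismatched values stand out. A hypothetical sketch (table contents and scores invented for illustration) of parsing such a markdown table into a dictionary, roughly the structure the model was asked to reproduce:

```python
# Hypothetical sketch: parse a small markdown table into a dict so model
# outputs can be checked against ground truth. Table values are invented
# for illustration, not taken from the transcript.
def parse_markdown_table(md: str) -> dict:
    lines = [line.strip() for line in md.strip().splitlines()]
    header = [c.strip() for c in lines[0].strip("|").split("|")]
    rows = {}
    for line in lines[2:]:  # skip the |---| separator row
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Key each row by its first column (the model name).
        rows[cells[0]] = dict(zip(header[1:], cells[1:]))
    return rows

table = """
| Model   | Comprehension |
|---------|---------------|
| Phi-2   | 0.75          |
| Phi-1.5 | 0.62          |
"""
print(parse_markdown_table(table)["Phi-2"]["Comprehension"])  # → 0.75
```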

What’s the practical conclusion about Phi-2 from these experiments?

Phi-2 can be a strong starting point—especially for quick experimentation and potentially for fine-tuning on allowed data—but it isn’t reliably correct across all coding and extraction scenarios under casual prompting. The transcript suggests that benchmark claims may overstate real-world robustness unless prompting and evaluation are handled carefully.

Review Questions

  1. What specific data-generation and filtering steps are described as central to Phi-2’s training?
  2. Which evaluation area is repeatedly described as Phi-2’s strongest (and what evidence is given)?
  3. In the Colab tests, what kinds of tasks most often produced incorrect or inconsistent outputs?

Key Points

  1. Phi-2 (2.7B parameters) is designed to reduce inference and deployment costs compared with 7B–13B models while aiming for comparable usefulness.
  2. Phi-2’s training emphasizes data quality: synthetic datasets (from GPT-4 and GPT-3.5) plus filtered web data for educational value.
  3. Coding performance is the most consistently highlighted strength in the benchmark summaries.
  4. Phi-2 is research-only and not licensed for commercial applications, even if it’s accessible via Hugging Face.
  5. Hands-on prompting can yield fluent answers and correct simple math, but coding and structured extraction can still fail or become inconsistent.
  6. Running Phi-2 in Colab uses Transformers/PyTorch with flash attention and instruct-style prompt formatting, and the model still requires several gigabytes of storage.

Highlights

Phi-2’s biggest reported gains cluster around coding benchmarks, even though it remains far smaller than 7B–13B competitors.
Microsoft attributes performance to synthetic-data strategy—GPT-4/GPT-3.5-generated training data plus filtered web content—rather than parameter count alone.
In live Colab tests, Phi-2 sometimes answers well but can generate incorrect code and inconsistent extracted values from tables.
Phi-2 is explicitly positioned as research-only, limiting real-world commercial use despite being openly downloadable.

Topics

Mentioned

  • LLM
  • GPU
  • CPU
  • MMLU
  • SQuAD
  • T4
  • GPT
  • GPT-4
  • GPT-3.5
  • PyTorch
  • ALPACA
  • LoRA
  • RLHF
  • SFT