Microsoft Loves SLMs (Small Language Models) - Phi-2 / Orca 2
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Microsoft is pushing open-source small language models (SLMs) into practical, pay-as-you-go deployment—an approach that could make high-quality generative AI cheaper and easier to iterate on than running large frontier models on expensive GPU infrastructure.
A key thread is Microsoft’s “model as a service” preview in Azure AI Studio (transcribed as “Asher AI Studio” in the video). The service is positioned as a way to offer open models through a catalog with hosted inference APIs and hosted fine-tuning. Instead of paying for always-on high-end hardware, pricing is described as input/output-token based, similar to how API usage is billed for larger models, which makes it more attractive for experimentation and short dev/test cycles. The transcript also notes compatibility with common LLM app tooling such as LangChain, signaling that developers can build applications on top of these hosted models without building their own serving stack.
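The cost argument can be sketched with a back-of-the-envelope comparison. All prices and traffic numbers below are hypothetical placeholders for illustration, not Azure's actual rates.

```python
# Back-of-the-envelope comparison of token-billed pay-as-you-go inference
# versus an always-on GPU instance during an experimentation phase.
# All prices and traffic figures are hypothetical placeholders.

def token_cost(n_requests, in_tokens, out_tokens,
               price_in_per_1k=0.0005, price_out_per_1k=0.0015):
    """Cost of serving n_requests through a token-billed inference API."""
    per_request = (in_tokens / 1000) * price_in_per_1k \
                + (out_tokens / 1000) * price_out_per_1k
    return n_requests * per_request

def gpu_cost(hours, price_per_hour=2.50):
    """Cost of keeping a dedicated GPU instance running for `hours`."""
    return hours * price_per_hour

# A month of light dev/test traffic: 5,000 requests, ~500 tokens in / 200 out.
api = token_cost(5_000, in_tokens=500, out_tokens=200)
gpu = gpu_cost(hours=30 * 24)  # instance left running for 30 days

print(f"token-billed API: ${api:.2f}")   # $2.75
print(f"always-on GPU:    ${gpu:.2f}")   # $1800.00
```

Under these made-up numbers the token-billed path is orders of magnitude cheaper for sparse experimental traffic, which is the scenario the transcript highlights; the comparison flips once sustained high-volume traffic keeps a dedicated instance busy.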
Within Azure AI Studio’s model catalog, the transcript highlights Meta’s Llama 2 family as an early offering. The catalog view includes model documentation such as architecture details, training data and evaluation results, and downloadable artifacts. There’s also an interactive “try it” option (example given: “what is 2 + 2”), plus a deploy option—though the pay-as-you-go behavior is not yet available in the preview environment the narrator is using.
Alongside Llama 2, the transcript spotlights Microsoft Research’s Phi-2. Phi-2 is described as a 2.7B-parameter model with a research license, implying limits on commercial use. Training reportedly took seven days, and the dataset is characterized as a mix of NLP synthetic data created by GPT-3.5 and filtered web data from Falcon, with GPT-4 used for assessment. The small size and research-license constraint are framed as part of why developers should watch for the upcoming pay-as-you-go service before investing in experiments.
The other major research anchor is Microsoft Research’s Orca 2 (from the paper “Orca 2: Teaching Small Language Models How to Reason,” released November 20). Orca 2 is described as a 13B-parameter model designed to improve reasoning by imitating step-by-step reasoning traces from more capable LLMs. Rather than expecting a smaller model to answer complex tasks in one shot, the approach teaches it to break tasks into structured steps, using methods such as recall-then-generate and direct answer, and trains it to choose among solution strategies depending on the task. Reported comparisons against Llama 2 suggest Orca 2 can outperform it in some areas, including a noted advantage over a 7B Llama 2 variant.
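The strategy-selection idea described above can be illustrated with a small sketch: each training example pairs a task with a reasoning strategy chosen for that task type, and each strategy carries its own system instruction. The strategy names, routing table, and instruction text below are illustrative stand-ins, not Orca 2's actual prompts.

```python
# Illustrative sketch of Orca 2-style strategy selection: different task
# types get different reasoning instructions, so a small model learns to
# pick a solution strategy instead of answering everything one-shot.
# Strategy names and instructions are stand-ins, not Orca 2's real prompts.

STRATEGIES = {
    "step_by_step": "Break the problem into steps and solve each one in order.",
    "recall_then_generate": "First recall the relevant facts, then compose the answer.",
    "direct_answer": "Answer directly and concisely without showing reasoning.",
}

def pick_strategy(task_type: str) -> str:
    """Map a task type to a reasoning strategy (hypothetical routing)."""
    routing = {
        "math": "step_by_step",
        "open_domain_qa": "recall_then_generate",
        "classification": "direct_answer",
    }
    return routing.get(task_type, "step_by_step")

def build_training_example(task_type: str, question: str) -> dict:
    """Assemble a (system, user) pair like those used to teach strategies."""
    strategy = pick_strategy(task_type)
    return {
        "system": STRATEGIES[strategy],
        "user": question,
        "strategy": strategy,
    }

example = build_training_example(
    "math", "A train travels 120 km in 2 hours. What is its speed?"
)
print(example["strategy"])  # step_by_step
```

The paper's further twist, as the transcript summarizes it, is that the model is trained on traces produced under these strategy-specific instructions so that it internalizes when each strategy applies, rather than being told explicitly at inference time.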
The transcript closes with a forward-looking plan: the creator intends to test these open models through the model-as-a-service workflow, explore fine-tuning on specialized, high-precision datasets, and run additional experiments using OpenAI’s GPT-3.5 fine-tuning. The overall message is that Microsoft’s bet on SLMs, paired with token-based hosted inference and fine-tuning, could shift development toward smaller, task-specific models that are faster and cheaper to iterate on while still delivering strong performance when trained with the right data and reasoning scaffolding.
Cornell Notes
Microsoft is rolling out a “model as a service” preview in Azure AI Studio that brings open-source small language models (SLMs) into token-based hosted inference and hosted fine-tuning. The catalog approach is meant to lower the cost and friction of experimenting compared with running large frontier models on expensive GPUs. Early examples include Meta’s Llama 2 family and Microsoft Research’s Phi-2 (2.7B parameters, research license), with Phi-2 trained on a mix of GPT-3.5 synthetic data and filtered web data assessed by GPT-4. Microsoft also highlights Orca 2 (13B), trained to reason by imitating step-by-step traces from stronger LLMs and teaching multiple reasoning strategies. Together, the push suggests SLMs plus hosted fine-tuning could make specialized, high-quality AI systems easier to build and iterate.
Cue Questions
What does “model as a service” change for developers working with open-source LLMs?
Why is Phi-2 positioned as a notable SLM, and what constraints come with it?
How does Orca 2 aim to improve reasoning in smaller models?
What does the transcript say about Orca 2’s performance relative to Llama 2?
What workflow does the transcript suggest for experimenting with SLMs going forward?
Review Questions
- How does token-based pay-as-you-go inference address the cost barrier of hosting large models for experimentation?
- What training signals and data sources are described for Phi-2, and how do they differ from Orca 2’s reasoning-focused approach?
- What specific reasoning behaviors does Orca 2 learn, and why might step-by-step imitation help smaller models on complex tasks?
Key Points
1. Azure AI Studio’s “model as a service” preview is positioned to provide pay-as-you-go, token-based inference APIs plus hosted fine-tuning for open models.
2. Token-based pricing is framed as a way to avoid expensive, always-on GPU hosting during experimentation.
3. Meta’s Llama 2 family is highlighted in the model catalog, with documentation, artifacts, and an interactive “try it” experience.
4. Microsoft Research’s Phi-2 (2.7B parameters) is described as research-licensed and trained using GPT-3.5 synthetic data plus filtered Falcon web data assessed by GPT-4.
5. Microsoft Research’s Orca 2 (13B) targets reasoning by imitating step-by-step traces from stronger LLMs and teaching multiple solution strategies.
6. The transcript’s forward plan centers on fine-tuning SLMs on small, task-specific datasets and evaluating results systematically.