
Microsoft Loves SLMs (Small Language Models) - Phi-2 / Orca 2

All About AI · 5 min read

Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Azure AI Studio’s “model as a service” preview is positioned to provide pay-as-you-go, token-based inference APIs plus hosted fine-tuning for open models.

Briefing

Microsoft is pushing open-source small language models (SLMs) into practical, pay-as-you-go deployment—an approach that could make high-quality generative AI cheaper and easier to iterate on than running large frontier models on expensive GPU infrastructure.

A key thread is Microsoft’s “model as a service” preview in Azure AI Studio (misheard as “Asher AI Studio” in the transcript). The service is positioned as a way to offer open models through a catalog with hosted inference APIs and hosted fine-tuning. Instead of paying for always-on high-end hardware, pricing is described as input/output-token based, similar to how API usage is billed for larger models, making it more attractive for experimentation and dev-test cycles. The transcript also notes compatibility with common LLM app tooling such as LangChain, signaling that developers can build applications on top of these hosted models without building their own serving stack.
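The cost argument above is easy to make concrete with a back-of-the-envelope comparison. All of the prices in this sketch are invented placeholders, not published Azure rates; the point is only the shape of the calculation.

```python
# Hypothetical cost comparison: pay-as-you-go token billing vs. an
# always-on GPU instance. All rates below are illustrative assumptions,
# not published Azure prices.

PRICE_PER_1K_INPUT = 0.0005   # assumed $ per 1K input tokens
PRICE_PER_1K_OUTPUT = 0.0015  # assumed $ per 1K output tokens
GPU_HOURLY_RATE = 3.50        # assumed $ per hour for a dedicated GPU VM

def token_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one pay-as-you-go call billed per input/output token."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + \
           (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

def hosted_cost(hours_running: float) -> float:
    """Cost of keeping a dedicated GPU instance up, whether used or not."""
    return hours_running * GPU_HOURLY_RATE

# A week of light experimentation: 200 test calls, ~1K tokens each way.
pay_as_you_go = 200 * token_cost(1000, 1000)
always_on = hosted_cost(24 * 7)  # the GPU bills even through idle dev-test cycles

print(f"pay-as-you-go: ${pay_as_you_go:.2f}")
print(f"always-on GPU: ${always_on:.2f}")
```

Under these assumed rates, a week of sporadic testing costs well under a dollar on token billing, while the idle GPU instance costs hundreds, which is the trade-off the transcript is gesturing at.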

Within Azure AI Studio’s model catalog, the transcript highlights Meta’s Llama 2 family as an early offering. The catalog view includes model documentation such as architecture details, training data and evaluation results, and downloadable artifacts. There’s also an interactive “try it” option (example given: “what is 2 + 2”), plus a deploy option—though the pay-as-you-go behavior is not yet available in the preview environment the narrator is using.

Alongside Llama 2, the transcript spotlights Microsoft Research’s Phi-2. Phi-2 is described as a 2.7B-parameter model with a research license, implying limits on commercial use. Training reportedly took seven days, and the dataset is characterized as a mix of NLP synthetic data created by GPT-3.5 and filtered web data from Falcon, with GPT-4 used for assessment. The small size and research-license constraint are framed as part of why developers should watch for the upcoming pay-as-you-go service before investing in experiments.

The other major research anchor is Microsoft Research’s Orca 2 (from the paper “Orca 2: Teaching Small Language Models How to Reason,” released November 20). Orca 2 is described as a 13B-parameter model designed to improve reasoning by imitating step-by-step reasoning traces from more capable LLMs. Rather than expecting a smaller model to answer complex tasks in one shot, the approach teaches it to break tasks into structured steps, using methods such as recall-then-generate and direct answer, plus training it to choose among solution strategies depending on the task. Reported comparisons against Llama 2 suggest Orca 2 can outperform it in some areas, including a noted advantage over a 7B Llama 2 variant.
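The strategy-selection idea can be sketched as prompt construction: each task is paired with a system instruction that elicits a particular solution strategy. The instruction wording and the toy selection heuristic below are illustrative stand-ins, not the paper’s actual training setup, where strategy choice is learned rather than rule-based.

```python
# Sketch of strategy-conditioned prompting in the spirit of Orca 2:
# different system instructions elicit different solution strategies.
# Instruction text and the selection heuristic are illustrative only.

STRATEGIES = {
    "step_by_step": "Think through the problem step by step, then state the answer.",
    "recall_generate": "First recall the relevant facts, then generate the answer.",
    "direct_answer": "Answer directly and concisely, without showing your work.",
}

def build_prompt(task: str, strategy: str) -> str:
    """Pair a task with the system instruction for the chosen strategy."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    return f"System: {STRATEGIES[strategy]}\nUser: {task}"

def pick_strategy(task: str) -> str:
    """Toy heuristic standing in for learned strategy selection:
    multi-step or math-like tasks get step-by-step, the rest answer directly."""
    if any(tok in task.lower() for tok in ("why", "how", "prove", "+", "steps")):
        return "step_by_step"
    return "direct_answer"

task = "What is 2 + 2?"
print(build_prompt(task, pick_strategy(task)))
```

In the real system the model internalizes this choice during training; the point of the sketch is only that the same task can be framed under different reasoning regimes.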

The transcript closes with a forward-looking plan: the creator intends to test these open models through the model-as-a-service workflow, explore fine-tuning for specialized, high-precision datasets, and run additional experiments using OpenAI’s GPT-3.5 fine-tuning. The overall message is that Microsoft’s bet on SLMs, paired with token-based hosted inference and fine-tuning, could shift development toward smaller, task-specific models that are faster and cheaper to iterate on while still delivering strong performance when trained with the right data and reasoning scaffolding.

Cornell Notes

Microsoft is rolling out a “model as a service” preview in Azure AI Studio that brings open-source small language models (SLMs) into token-based hosted inference and hosted fine-tuning. The catalog approach is meant to lower the cost and friction of experimenting compared with running large frontier models on expensive GPUs. Early examples include Meta’s Llama 2 family and Microsoft Research’s Phi-2 (2.7B parameters, research license), with Phi-2 trained on a mix of GPT-3.5 synthetic data and filtered web data assessed by GPT-4. Microsoft also highlights Orca 2 (13B), trained to reason by imitating step-by-step traces from stronger LLMs and teaching multiple reasoning strategies. Together, the push suggests SLMs plus hosted fine-tuning could make specialized, high-quality AI systems easier to build and iterate on.

What does “model as a service” change for developers working with open-source LLMs?

It shifts deployment from self-hosting on high-end GPUs to using hosted inference APIs and hosted fine-tuning inside Azure AI Studio. The transcript emphasizes token-based (input/output) pricing, described as pay-as-you-go, so experimentation can be cheaper than paying for always-on infrastructure during dev-test cycles. It also mentions support for building LLM apps with LangChain.

Why is Phi-2 positioned as a notable SLM, and what constraints come with it?

Phi-2 is described as a 2.7B-parameter model from Microsoft Research, trained over seven days. Its dataset is characterized as NLP synthetic data created by GPT-3.5 plus filtered web data from Falcon, with GPT-4 used to assess the data. The transcript also flags a research license, implying it’s not intended for commercial use.

How does Orca 2 aim to improve reasoning in smaller models?

Orca 2 is trained to imitate step-by-step reasoning traces from more capable LLMs, teaching the smaller model to decompose tasks into structured steps. The transcript lists reasoning-oriented training behaviors such as recall-then-generate, extract-then-generate, and direct-answer methods, plus instruction for choosing different solution strategies depending on the task.

What does the transcript say about Orca 2’s performance relative to Llama 2?

It references comparisons in the Orca 2 paper where Orca 2 (shown in blue/black) is plotted against Llama 2 variants, including a 7B Llama 2 line. The transcript claims Orca 2 outperforms Llama 2 in some areas, though it notes the results are “all over the place,” implying mixed but meaningful gains.

What workflow does the transcript suggest for experimenting with SLMs going forward?

The plan is to test open models through the Azure AI Studio model-as-a-service workflow, then fine-tune them on small, highly specialized datasets. The creator also mentions exploring fine-tuning with GPT-3.5 via an OpenAI interface, indicating a parallel track between open SLM fine-tuning and closed-model fine-tuning.
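The first step in either fine-tuning track is assembling the small, specialized dataset itself. The sketch below writes prompt/response pairs as JSONL in the chat-message shape OpenAI’s fine-tuning API expects; the records and the ticket-classification task are invented placeholders.

```python
import json

# Assemble a tiny, specialized fine-tuning dataset as JSONL.
# The messages schema follows OpenAI's chat fine-tuning format;
# the records themselves are invented placeholders.
examples = [
    {"messages": [
        {"role": "system", "content": "You are a precise domain assistant."},
        {"role": "user", "content": "Classify ticket: 'login page times out'"},
        {"role": "assistant", "content": "category: authentication/availability"},
    ]},
    {"messages": [
        {"role": "system", "content": "You are a precise domain assistant."},
        {"role": "user", "content": "Classify ticket: 'invoice total is wrong'"},
        {"role": "assistant", "content": "category: billing"},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Sanity check: every line parses and keeps the system/user/assistant order.
with open("train.jsonl") as f:
    records = [json.loads(line) for line in f]
roles = [m["role"] for m in records[0]["messages"]]
print(roles)
```

A hosted fine-tuning service for open SLMs would likely accept a similar pairs file, which is what makes the two tracks easy to run in parallel.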

Review Questions

  1. How does token-based pay-as-you-go inference address the cost barrier of hosting large models for experimentation?
  2. What training signals and data sources are described for Phi-2, and how do they differ from Orca 2’s reasoning-focused approach?
  3. What specific reasoning behaviors does Orca 2 learn, and why might step-by-step imitation help smaller models on complex tasks?

Key Points

  1. Azure AI Studio’s “model as a service” preview is positioned to provide pay-as-you-go, token-based inference APIs plus hosted fine-tuning for open models.
  2. Token-based pricing is framed as a way to avoid expensive, always-on GPU hosting during experimentation.
  3. Meta’s Llama 2 family is highlighted in the model catalog, with documentation, artifacts, and an interactive “try it” experience.
  4. Microsoft Research’s Phi-2 (2.7B parameters) is described as research-licensed and trained using GPT-3.5 synthetic data plus filtered Falcon web data assessed by GPT-4.
  5. Microsoft Research’s Orca 2 (13B) targets reasoning by imitating step-by-step traces from stronger LLMs and teaching multiple solution strategies.
  6. The transcript’s forward plan centers on fine-tuning SLMs on small, task-specific datasets and evaluating results systematically.

Highlights

Microsoft’s model-as-a-service preview aims to make open-source SLMs usable via hosted, token-priced inference and fine-tuning rather than self-hosting on expensive GPUs.
Phi-2’s training mix—GPT-3.5 synthetic data plus filtered Falcon web data assessed by GPT-4—pairs small model size with curated supervision.
Orca 2 trains smaller models to reason by imitating step-by-step traces and learning to choose among different solution strategies.
The model catalog in Azure AI Studio includes model documentation, artifacts, and an interactive testing interface before deployment.
