Unlock Open Multimodality with Phi-4
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Microsoft’s Phi-4 family just got more practical for local, multimodal work: the Phi-4 3.8B “mini instruct” lineup now includes function calling and a dedicated Phi-4 multimodal model that can process images and audio. The release matters because it brings “agent-ready” capabilities—choosing tools via function calling—plus on-device multimodal understanding, without requiring the largest Phi-4 models or cloud inference.
Weights for the Phi-4 14B model had been discussed since December, but official downloads only arrived in January. That pattern of delayed weight availability has now been resolved for the smaller, more deployable 3.8B "mini instruct" models. Hugging Face hosts multiple variants, including instruct-tuned models, ONNX versions, and a GGUF option. Notably, base (non-instruct) weights aren't provided for the Phi-4 3.8B line, which limits the full fine-tuning workflows that were easier with earlier releases.
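Since the GGUF release is what most local stacks consume, here is a minimal sketch of loading one with llama-cpp-python (the same format LM Studio and Ollama use under the hood). The model_path is a placeholder for whichever quantized file you download from Hugging Face.

```python
# Minimal sketch: run a Phi-4 mini instruct GGUF quant locally with
# llama-cpp-python. The file name is a placeholder -- point it at the
# quant you actually downloaded.
from llama_cpp import Llama

llm = Llama(model_path="phi-4-mini-instruct-Q4_K_M.gguf", n_ctx=4096)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "In one sentence, what is Phi-4 mini?"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```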
For the Phi-4 mini instruct (3.8B), one standout addition is function calling. That enables local tool-use patterns: small agents that don't need heavy reasoning but do need decision points about which tools to invoke (see the sketch below). The transcript also emphasizes deployment reality: many users run models on devices through stacks like LM Studio and Ollama, often using quantized formats such as GGUF and ONNX. Microsoft's ONNX Runtime support is framed as a direct response to that trend, aimed at serving models across environments ranging from Raspberry Pi boards to mobile devices.
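As a sketch of that decision-point pattern, the snippet below advertises one tool in the system turn and lets the model decide whether to call it. The exact tool-schema format Phi-4 mini expects is defined in Microsoft's model card, so treat this prompt layout, and the get_weather tool, as illustrative assumptions.

```python
# Sketch: local tool selection with Phi-4 mini instruct (3.8B). Assumes
# the Hugging Face id "microsoft/Phi-4-mini-instruct"; the JSON-in-system-
# prompt tool format here is illustrative -- check the model card for the
# exact function-calling convention.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

tools = [{
    "name": "get_weather",  # hypothetical tool for the example
    "description": "Return current weather for a city.",
    "parameters": {"city": {"type": "string"}},
}]

messages = [
    {"role": "system", "content": f"You can call these tools: {json.dumps(tools)}"},
    {"role": "user", "content": "What's the weather in Paris right now?"},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=128)

# The reply should be a tool-call payload (e.g. JSON naming get_weather),
# which the host application parses and executes before replying.
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```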
On multimodality, the Phi-4 multimodal instruct model uses the same 3.8B backbone but extends it with both vision and audio encoders. The vision side converts images into tokens via a SigLIP encoder (described as likely SigLIP-1 rather than the newer SigLIP-2). The model can handle image inputs up to 1344×1344 and is trained to generate interleaved image-and-text token sequences. For audio, the audio encoder is larger than the vision encoder, and the system uses audio-specific LoRA adapters to route audio tokens into the Phi-4 backbone. Training details cited include pre-training on 2 million hours of speech-text pairs, followed by 100 million curated speech/audio supervised fine-tuning samples.
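To make that pipeline concrete, here is a toy sketch of the idea, not the released implementation: separate encoders project image and audio features into the decoder's embedding space, and the combined token sequence flows through one backbone. All dimensions, names, and layer counts are illustrative.

```python
# Toy illustration of modality routing into a shared decoder. NOT Phi-4's
# actual code: dimensions and architecture are made up for clarity.
import torch
import torch.nn as nn

class ToyMultimodalLM(nn.Module):
    def __init__(self, hidden=512, vocab=32_000):
        super().__init__()
        # Project encoder features into the language model's embedding space.
        self.vision_proj = nn.Linear(1152, hidden)  # SigLIP-style image features
        self.audio_proj = nn.Linear(1024, hidden)   # audio-encoder features
        layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(hidden, vocab)

    def forward(self, text_emb, image_feats=None, audio_feats=None):
        # Build one token sequence: projected image/audio tokens, then text.
        # (In Phi-4, modality-specific LoRA adapters also specialize the
        # backbone; that detail is omitted from this toy.)
        parts = []
        if image_feats is not None:
            parts.append(self.vision_proj(image_feats))
        if audio_feats is not None:
            parts.append(self.audio_proj(audio_feats))
        parts.append(text_emb)
        seq = torch.cat(parts, dim=1)
        return self.lm_head(self.backbone(seq))

# Shapes only, to show the flow: 16 image tokens + 8 audio tokens + 4 text tokens.
model = ToyMultimodalLM()
logits = model(torch.randn(1, 4, 512), torch.randn(1, 16, 1152), torch.randn(1, 8, 1024))
print(logits.shape)  # torch.Size([1, 28, 32000])
```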
In practical tests using the Transformers library, the model runs with an AutoProcessor that handles both images and audio. Image results are described as strong for recognition and OCR: the model identifies a bee on a flower, answers color and content questions, and transcribes text from a screenshot of a blog post with high apparent accuracy. Limitations show up in counting (planes in an image are miscounted), and bounding boxes are inconsistently placed even when the model returns many of them.
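The snippet below is a minimal sketch of that image path, assuming the Hugging Face id microsoft/Phi-4-multimodal-instruct and the <|image_1|> placeholder convention from its model card; the image URL is a stand-in for any test image, so verify both against the current card.

```python
# Sketch of the Transformers image path. Prompt tags follow the Phi-4
# multimodal model card's <|image_1|> convention; the URL is a placeholder.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

image = Image.open(requests.get("https://example.com/bee.jpg", stream=True).raw)
prompt = "<|user|><|image_1|>What insect is on the flower, and what color is it?<|end|><|assistant|>"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
# Strip the prompt tokens before decoding the answer.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0])
```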
Audio performance is presented as a key differentiator. The model transcribes an MP3 interview clip into text and is described as accurate enough to feel competitive with, or even surpass, Whisper in some cases, while also enabling translation of the transcribed text into French. The transcript closes by positioning the Phi-4 multimodal model as a strong open-weights option for local multimodal tasks and agent workflows, especially when paired with function calling and tool-use patterns.
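A similar sketch covers the audio path: transcribe a clip, then ask for a French translation in the same turn. The interview_clip.mp3 file name is hypothetical, and the audios argument's (waveform, sample rate) convention is my reading of the model card, so confirm it before relying on this.

```python
# Sketch of the audio path with the same checkpoint as above. The file
# name is hypothetical; the `audios` kwarg format is assumed from the
# model card's (waveform, sample_rate) convention -- verify it.
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

audio, sr = sf.read("interview_clip.mp3")  # hypothetical local file
prompt = ("<|user|><|audio_1|>Transcribe this audio, "
          "then translate the transcript into French.<|end|><|assistant|>")

inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=500)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0])
```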
Cornell Notes
Phi-4’s 3.8B “mini instruct” line adds two capabilities that make local deployment more useful: function calling and true multimodality. The function-calling update targets small, tool-using agents where decisions about which tools to run matter more than deep reasoning. The Phi-4 multimodal instruct model pairs the 3.8B backbone with a vision encoder (SigLIP) and an audio encoder, then uses LoRA adapters so image and audio tokens can flow into the language model. Reported training includes 2 million hours of speech-pair pre-training plus 100 million curated supervised speech/audio samples. In tests, the model performs well on OCR and audio transcription/translation, though counting and bounding-box accuracy are less reliable.
- What changed in Phi-4 3.8B "mini instruct" that makes it more agent-friendly?
- Why do the released formats matter for running Phi-4 locally?
- How does Phi-4 multimodal handle images and audio differently from a text-only model?
- What training scale details are cited for the multimodal model?
- What strengths and weaknesses show up in the reported image tests?
- How does the model perform on audio tasks beyond transcription?
Review Questions
- What role does function calling play in enabling tool-using agents with Phi-4 mini instruct?
- Describe the pipeline for turning images and audio into tokens that the Phi-4 backbone can generate from.
- Which multimodal tasks in the transcript appear strongest (and which appear weakest), and what examples were used to demonstrate each?
Key Points
1. Microsoft released official Phi-4 3.8B mini instruct weights after earlier delays, including multiple variants on Hugging Face.
2. Phi-4 3.8B mini instruct now supports function calling, enabling local tool-use agents with decision points.
3. The release emphasizes on-device deployment, with ONNX Runtime support and quantized formats like GGUF for use in tools such as LM Studio and Ollama.
4. Phi-4 multimodal instruct extends the 3.8B backbone with both a vision encoder (SigLIP) and an audio encoder, using LoRA adapters to integrate audio tokens.
5. The multimodal model supports image inputs up to 1344×1344 and is trained to generate interleaved image-and-text token sequences.
6. Reported audio capabilities include transcription and translation (e.g., into French), with accuracy described as strong in some cases.
7. In tests, OCR and recognition are strong, while counting and bounding-box placement are less reliable.