Unlock Open Multimodality with Phi-4
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Microsoft’s Phi-4 family just got more practical for local, multimodal work: the Phi-4 3.8B “mini instruct” lineup now includes function calling and a dedicated Phi-4 multimodal model that can process images and audio. The release matters because it brings “agent-ready” capabilities—choosing tools via function calling—plus on-device multimodal understanding, without requiring the largest Phi-4 models or cloud inference.
Weights for the Phi-4 14B model had been discussed since December, but official downloads only arrived in January. That pattern of delayed weight availability has now been resolved for the smaller, more deployable 3.8B "mini instruct" models. Hugging Face hosts multiple variants, including instruct-tuned models, ONNX versions, and a GGUF option. Notably, base (non-instruct) weights aren't provided for the Phi-4 3.8B line, which limits the full fine-tuning workflows that were easier with earlier releases.
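Since the GGUF release is what most local stacks consume, here is a minimal sketch of loading one with llama-cpp-python (the same format LM Studio and Ollama use under the hood). The model_path is a placeholder for whichever quantized file you download from Hugging Face.

```python
# Minimal sketch: run a Phi-4 mini instruct GGUF quant locally with
# llama-cpp-python. The file name is a placeholder -- point it at the
# quant you actually downloaded.
from llama_cpp import Llama

llm = Llama(model_path="phi-4-mini-instruct-Q4_K_M.gguf", n_ctx=4096)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "In one sentence, what is Phi-4 mini?"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```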
For the Phi-4 mini instruct (3.8B), one standout addition is function calling. That enables local tool-use patterns: small agents that don't need heavy reasoning but do need decision points about which tools to invoke (see the sketch below). The transcript also emphasizes deployment reality: many users run models on devices through stacks like LM Studio and Ollama, often using quantized formats such as GGUF and ONNX. Microsoft's ONNX Runtime support is framed as a direct response to that trend, aimed at serving models across environments ranging from Raspberry Pi boards to mobile devices.
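As a sketch of that decision-point pattern, the snippet below advertises one tool in the system turn and lets the model decide whether to call it. The exact tool-schema format Phi-4 mini expects is defined in Microsoft's model card, so treat this prompt layout, and the get_weather tool, as illustrative assumptions.

```python
# Sketch: local tool selection with Phi-4 mini instruct (3.8B). Assumes
# the Hugging Face id "microsoft/Phi-4-mini-instruct"; the JSON-in-system-
# prompt tool format here is illustrative -- check the model card for the
# exact function-calling convention.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

tools = [{
    "name": "get_weather",  # hypothetical tool for the example
    "description": "Return current weather for a city.",
    "parameters": {"city": {"type": "string"}},
}]

messages = [
    {"role": "system", "content": f"You can call these tools: {json.dumps(tools)}"},
    {"role": "user", "content": "What's the weather in Paris right now?"},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=128)

# The reply should be a tool-call payload (e.g. JSON naming get_weather),
# which the host application parses and executes before replying.
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```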
On multimodality, the Phi-4 multimodal instruct model uses the same 3.8B backbone but extends it with both vision and audio encoders. The vision side converts images into tokens via a SigLIP encoder (described as likely SigLIP-1 rather than the newer SigLIP-2). The model can handle image inputs up to 1344×1344 and is trained to generate interleaved image-and-text token sequences. For audio, the audio encoder is larger than the vision encoder, and the system uses audio-specific LoRA adapters to route audio tokens into the Phi-4 backbone. Training details cited include pre-training on 2 million hours of speech-text pairs, followed by 100 million curated speech/audio supervised fine-tuning samples.
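To make that pipeline concrete, here is a toy sketch of the idea, not the released implementation: separate encoders project image and audio features into the decoder's embedding space, and the combined token sequence flows through one backbone. All dimensions, names, and layer counts are illustrative.

```python
# Toy illustration of modality routing into a shared decoder. NOT Phi-4's
# actual code: dimensions and architecture are made up for clarity.
import torch
import torch.nn as nn

class ToyMultimodalLM(nn.Module):
    def __init__(self, hidden=512, vocab=32_000):
        super().__init__()
        # Project encoder features into the language model's embedding space.
        self.vision_proj = nn.Linear(1152, hidden)  # SigLIP-style image features
        self.audio_proj = nn.Linear(1024, hidden)   # audio-encoder features
        layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(hidden, vocab)

    def forward(self, text_emb, image_feats=None, audio_feats=None):
        # Build one token sequence: projected image/audio tokens, then text.
        # (In Phi-4, modality-specific LoRA adapters also specialize the
        # backbone; that detail is omitted from this toy.)
        parts = []
        if image_feats is not None:
            parts.append(self.vision_proj(image_feats))
        if audio_feats is not None:
            parts.append(self.audio_proj(audio_feats))
        parts.append(text_emb)
        seq = torch.cat(parts, dim=1)
        return self.lm_head(self.backbone(seq))

# Shapes only, to show the flow: 16 image tokens + 8 audio tokens + 4 text tokens.
model = ToyMultimodalLM()
logits = model(torch.randn(1, 4, 512), torch.randn(1, 16, 1152), torch.randn(1, 8, 1024))
print(logits.shape)  # torch.Size([1, 28, 32000])
```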
In practical tests using the Transformers library, the model runs with an AutoProcessor that handles both images and audio. Image results are described as strong for recognition and OCR: the model identifies a bee on a flower, answers color and content questions, and transcribes text from a screenshot of a blog post with high apparent accuracy. Limitations show up in counting (planes in an image are miscounted), and bounding boxes are inconsistently placed even when the model returns many of them.
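The snippet below is a minimal sketch of that image path, assuming the Hugging Face id microsoft/Phi-4-multimodal-instruct and the <|image_1|> placeholder convention from its model card; the image URL is a stand-in for any test image, so verify both against the current card.

```python
# Sketch of the Transformers image path. Prompt tags follow the Phi-4
# multimodal model card's <|image_1|> convention; the URL is a placeholder.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

image = Image.open(requests.get("https://example.com/bee.jpg", stream=True).raw)
prompt = "<|user|><|image_1|>What insect is on the flower, and what color is it?<|end|><|assistant|>"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
# Strip the prompt tokens before decoding the answer.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0])
```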
Audio performance is presented as a key differentiator. The model transcribes an MP3 interview clip into text and is described as accurate enough to feel competitive with, or even surpass, Whisper in some cases, while also enabling translation of the transcribed text into French. The transcript closes by positioning the Phi-4 multimodal model as a strong open-weights option for local multimodal tasks and agent workflows, especially when paired with function calling and tool-use patterns.
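A similar sketch covers the audio path: transcribe a clip, then ask for a French translation in the same turn. The interview_clip.mp3 file name is hypothetical, and the audios argument's (waveform, sample rate) convention is my reading of the model card, so confirm it before relying on this.

```python
# Sketch of the audio path with the same checkpoint as above. The file
# name is hypothetical; the `audios` kwarg format is assumed from the
# model card's (waveform, sample_rate) convention -- verify it.
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

audio, sr = sf.read("interview_clip.mp3")  # hypothetical local file
prompt = ("<|user|><|audio_1|>Transcribe this audio, "
          "then translate the transcript into French.<|end|><|assistant|>")

inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=500)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0])
```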
Cornell Notes
Phi-4’s 3.8B “mini instruct” line adds two capabilities that make local deployment more useful: function calling and true multimodality. The function-calling update targets small, tool-using agents where decisions about which tools to run matter more than deep reasoning. The Phi-4 multimodal instruct model pairs the 3.8B backbone with a vision encoder (SigLIP) and an audio encoder, then uses LoRA adapters so image and audio tokens can flow into the language model. Reported training includes 2 million hours of speech-pair pre-training plus 100 million curated supervised speech/audio samples. In tests, the model performs well on OCR and audio transcription/translation, though counting and bounding-box accuracy are less reliable.
- What changed in Phi-4 3.8B "mini instruct" that makes it more agent-friendly?
- Why do the released formats matter for running Phi-4 locally?
- How does Phi-4 multimodal handle images and audio differently from a text-only model?
- What training scale details are cited for the multimodal model?
- What strengths and weaknesses show up in the reported image tests?
- How does the model perform on audio tasks beyond transcription?
Review Questions
- What role does function calling play in enabling tool-using agents with Phi-4 mini instruct?
- Describe the pipeline for turning images and audio into tokens that the Phi-4 backbone can generate from.
- Which multimodal tasks in the transcript appear strongest (and which appear weakest), and what examples were used to demonstrate each?
Key Points
1. Microsoft released official Phi-4 3.8B mini instruct weights after earlier delays, including multiple variants on Hugging Face.
2. Phi-4 3.8B mini instruct now supports function calling, enabling local tool-use agents with decision points.
3. The release emphasizes on-device deployment, with ONNX Runtime support and quantized formats like GGUF for use in tools such as LM Studio and Ollama.
4. Phi-4 multimodal instruct extends the 3.8B backbone with both a vision encoder (SigLIP) and an audio encoder, using LoRA adapters to integrate audio tokens.
5. The multimodal model supports image inputs up to 1344×1344 and is trained to generate interleaved image-and-text token sequences.
6. Reported audio capabilities include transcription and translation (e.g., into French), with accuracy described as strong in some cases.
7. In tests, OCR and recognition are strong, while counting and bounding-box placement are less reliable.