SmolLM3 - A Small Reasoner with Tool Use.
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
SmolLM3 is a 3B Hugging Face model released as base, instruct, and ONNX variants, targeting local deployment with reasoning and tool use.
Briefing
Hugging Face has released SmolLM3, a 3B-parameter language model aimed at "small" local deployment without giving up reasoning and tool use. The standout feature isn't just its size: it's the combination of a dual-mode reasoning system (reasoning on/off) with agentic capabilities like function calling, plus an unusually transparent training "blueprint" that lays out the full recipe from pretraining through long-context and post-training.
SmolLM3 comes in multiple variants: a base model, an instruct model, and an ONNX export. Community tooling is already moving quickly, with GGUF builds appearing for use in apps such as LM Studio and Ollama. In performance terms, the model is positioned between Qwen3-1.7B and Qwen3-4B, and for its 3B class it's described as beating older 3B-class baselines such as Qwen2.5-3B and Llama 3.2 3B. Hugging Face claims training on 11 trillion tokens and frames the result as state-of-the-art among 3B models while also competing with 4B models.
A key capability is long-context handling, with claims reaching up to 128K context length (and the blueprint mentioning up to 256K). Just as important for practical use is the model's "dual think" setup: prompts can toggle whether the model emits extended reasoning traces or skips them entirely. In testing, reasoning-on prompts often produce longer, more structured plans (for example, generating a detailed lemonade-stand breakdown), while reasoning-off prompts tend to yield simpler outputs.
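The toggle described above is driven by system-prompt flags. A minimal sketch, assuming the `/think` and `/no_think` flags described in the model card (verify against the exact version you deploy; the helper name and prompt text here are illustrative):

```python
def build_messages(user_prompt: str, reasoning: bool = True) -> list[dict]:
    """Build a chat message list that toggles SmolLM3's extended reasoning.

    The /think and /no_think system-prompt flags follow the convention
    described in the model card; this helper itself is a sketch, not part
    of any official API.
    """
    flag = "/think" if reasoning else "/no_think"
    return [
        {"role": "system", "content": flag},
        {"role": "user", "content": user_prompt},
    ]

# Reasoning on: the model emits an extended "think" trace before answering.
on = build_messages("Plan a lemonade stand budget.", reasoning=True)
# Reasoning off: the model answers directly, without the trace.
off = build_messages("Plan a lemonade stand budget.", reasoning=False)
```

The resulting message list is what you would pass to `tokenizer.apply_chat_template(...)` before generation.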
The most consequential part of the release for builders is the training transparency. Alongside the weights, Hugging Face provides a blueprint detailing each training step: distributed training setup, model architecture choices, long-context training, and post-training recipes. The architecture is described as a Llama 3-style design, including grouped-query attention (GQA), while incorporating other techniques such as NoPE (dropping rotary position embeddings in a subset of layers) and training-stability ideas like removing weight decay from embedding layers (inspired by OLMo-style work). The compute budget is also spelled out: 384 H100 GPUs over 24 days, estimated at roughly 220,000+ GPU hours, suggesting a training cost in the "few hundred thousand dollar" range rather than the multi-million-dollar budgets typical of larger proprietary runs.
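The cost framing follows from simple arithmetic on the cited GPU count and duration; the per-hour rate below is an illustrative cloud price, not a figure from the release:

```python
# Back-of-envelope check of the cited compute budget.
gpus = 384                      # H100s, per the blueprint
days = 24                       # training duration, per the blueprint
gpu_hours = gpus * days * 24    # = 221,184 GPU-hours, matching the ~220K+ claim

# Assumed rate of ~$2 per H100-hour (illustrative; actual rates vary widely).
est_cost_usd = gpu_hours * 2.0  # ~ $440K: the "few hundred thousand" range
```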
On the data and alignment side, the blueprint points to a three-phase pretraining approach with shifting data mixes: web-heavy early, then increasing code and math emphasis in later phases. For reasoning, the training appears to lean heavily on synthetic reasoning traces derived from DeepSeek-R1 and Qwen3 rather than extensive reinforcement learning with verifiable rewards (RLVR). Alignment uses a variant of DPO, and the recipe includes model merging across checkpoints to form a stronger combined model.
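As a reference point for the alignment step, here is the standard DPO objective on scalar per-sequence log-probabilities (the release describes a variant of DPO, so treat this as the baseline formula, not the exact recipe; the function and inputs are illustrative):

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Schematic DPO objective: -log sigmoid(beta * margin).

    Inputs are log-probs of the chosen/rejected responses under the policy
    (pi_*) and a frozen reference model (ref_*). Scalars for clarity; real
    training averages this over batches of sequences.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # Loss shrinks when the policy prefers the chosen response more
    # strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy favors the chosen answer relative to the reference -> smaller loss.
low = dpo_loss(-10.0, -14.0, -12.0, -12.0)
# Policy favors the rejected answer -> larger loss.
high = dpo_loss(-14.0, -10.0, -12.0, -12.0)
```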
In code and agent tests, SmolLM3 integrates with common tooling stacks (Transformers out of the box, plus support for frameworks like SGLang and vLLM). Function calling works: when given a tool schema, the model emits structured tool calls (e.g., weather for Copenhagen, or web-search queries about open-weights release rumors). When multiple tools are provided, behavior is mixed: sometimes it correctly declines tool use, but other times it may call tools even when not strictly necessary, depending on prompt/tool descriptions.
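The tool-calling flow above can be sketched end to end. The schema follows the OpenAI-style function format that Transformers chat templates commonly accept; the tool name, fields, and model output string are illustrative, not taken from the release:

```python
import json

# Hypothetical tool schema passed to the chat template alongside the prompt.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# A well-behaved model answers with a structured call, e.g.:
raw_model_output = '{"name": "get_weather", "arguments": {"city": "Copenhagen"}}'
call = json.loads(raw_model_output)

# The host application parses the call and dispatches to a real function.
dispatch = {"get_weather": lambda city: f"weather({city})"}
result = dispatch[call["name"]](**call["arguments"])
```

The mixed multi-tool behavior noted above shows up at the dispatch step: with several schemas registered, the model may emit a call like this even when the prompt did not require one.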
Overall, SmolLM3 positions itself as a credible local "small reasoner" for workflows that need both planning and tool use, while the blueprint lowers the barrier for researchers and developers who want to replicate or modify the training recipe without relying on proprietary black boxes.
Cornell Notes
SmolLM3 is a 3B Hugging Face model released with base, instruct, and ONNX variants, designed for local deployment while supporting reasoning and tool use. Its standout practical feature is a dual-mode "think" system that can be toggled via prompts to produce long reasoning traces or skip them for faster, cleaner answers. Hugging Face also publishes a detailed training blueprint, including architecture choices (grouped-query attention, NoPE), long-context training, and post-training recipes, plus compute estimates using 384 H100s over 24 days. In tests, the model performs well for its size on reasoning benchmarks like GSM8K and can generate structured function calls (e.g., weather and web search) when provided tool schemas. The release matters because it combines usable agentic behavior with unusually transparent training documentation.
What makes SmolLM3 more than a standard 3B chat model?
How does the training transparency change what developers can do with the model?
What does the blueprint suggest about how reasoning capability was built?
How strong is the long-context claim, and what does it imply for real use?
How does function calling behave when multiple tools are available?
Why is the compute budget cited as significant for the community?
Review Questions
- How does the prompt-based "think on/off" mechanism change the structure and usefulness of outputs in SmolLM3 tests?
- What specific architectural and training-stability choices (e.g., grouped-query attention, NoPE, embedding weight-decay removal) are highlighted in the blueprint, and why might they matter?
- When tool schemas are provided, what signals in the prompt or tool descriptions might cause SmolLM3 to call a tool even when the user expects no tool use?
Key Points
1. SmolLM3 is a 3B Hugging Face model released as base, instruct, and ONNX variants, targeting local deployment with reasoning and tool use.
2. A dual-mode reasoning system lets users toggle whether the model emits extended "think" traces or returns answers without them.
3. Hugging Face's published blueprint details the training recipe end-to-end, including architecture choices, long-context training, and post-training alignment steps.
4. The compute budget cited (384 H100 GPUs for 24 days) implies training costs far lower than many large proprietary efforts, making replication more realistic.
5. Pretraining is described as three-phase with shifting data mixes toward more code and math later, while reasoning training leans on synthetic traces from DeepSeek-R1 and Qwen3.
6. Function calling works with tool schemas, but multi-tool scenarios can produce mixed behavior when the prompt implies no tool should be used.
7. The release combines open weights with agentic tooling compatibility (Transformers out of the box, plus additional framework support mentioned in the transcript).