SmolLM3 - A Small Reasoner with Tool Use.
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
SmolLM3 is a 3B Hugging Face model released as base, instruct, and ONNX variants, targeting local deployment with reasoning and tool use.
Briefing
Hugging Face has released SmolLM3, a 3B-parameter language model aimed at "small" local deployment without giving up reasoning and tool use. The standout feature isn't just its size: it's the combination of a dual-mode reasoning system (reasoning on/off) with agentic capabilities like function calling, plus an unusually transparent training "blueprint" that lays out the full recipe from pretraining through long-context and post-training.
SmolLM3 comes in multiple variants: a base model, an instruct model, and an ONNX export. Community tooling is already moving quickly, with GGUF builds appearing for use in apps such as LM Studio and Ollama. In performance terms, the model is positioned between Qwen3-1.7B and Qwen3-4B, and for its 3B class it's described as beating older 3B-class baselines such as Qwen2.5-3B and Llama 3.2 3B. Hugging Face claims training on 11 trillion tokens and frames the result as state-of-the-art among 3B models while also competing with 4B models.
A key capability is long-context handling, with claims reaching up to 128K context length (and the blueprint mentioning up to 256K). Just as important for practical use is the model's "dual think" setup: prompts can toggle whether the model emits extended reasoning traces or skips them entirely. In testing, reasoning-on prompts often produce longer, more structured plans (for example, generating a detailed lemonade-stand breakdown), while reasoning-off prompts tend to yield simpler outputs.
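The toggle described above is driven by system-prompt flags. A minimal sketch, assuming the `/think` and `/no_think` flags described in the model card (verify against the exact version you deploy; the helper name and prompt text here are illustrative):

```python
def build_messages(user_prompt: str, reasoning: bool = True) -> list[dict]:
    """Build a chat message list that toggles SmolLM3's extended reasoning.

    The /think and /no_think system-prompt flags follow the convention
    described in the model card; this helper itself is a sketch, not part
    of any official API.
    """
    flag = "/think" if reasoning else "/no_think"
    return [
        {"role": "system", "content": flag},
        {"role": "user", "content": user_prompt},
    ]

# Reasoning on: the model emits an extended "think" trace before answering.
on = build_messages("Plan a lemonade stand budget.", reasoning=True)
# Reasoning off: the model answers directly, without the trace.
off = build_messages("Plan a lemonade stand budget.", reasoning=False)
```

The resulting message list is what you would pass to `tokenizer.apply_chat_template(...)` before generation.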
The most consequential part of the release for builders is the training transparency. Alongside the weights, Hugging Face provides a blueprint detailing each training step: distributed training setup, model architecture choices, long-context training, and post-training recipes. The architecture is described as a Llama 3-style design, including grouped-query attention (GQA), while incorporating other techniques such as NoPE (dropping rotary position embeddings in a subset of layers) and training-stability ideas like removing weight decay from embedding layers (inspired by OLMo-style work). The compute budget is also spelled out: 384 H100 GPUs over 24 days, estimated at roughly 220,000+ GPU hours, suggesting a training cost in the "few hundred thousand dollar" range rather than the multi-million-dollar budgets typical of larger proprietary runs.
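The cost framing follows from simple arithmetic on the cited GPU count and duration; the per-hour rate below is an illustrative cloud price, not a figure from the release:

```python
# Back-of-envelope check of the cited compute budget.
gpus = 384                      # H100s, per the blueprint
days = 24                       # training duration, per the blueprint
gpu_hours = gpus * days * 24    # = 221,184 GPU-hours, matching the ~220K+ claim

# Assumed rate of ~$2 per H100-hour (illustrative; actual rates vary widely).
est_cost_usd = gpu_hours * 2.0  # ~ $440K: the "few hundred thousand" range
```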
On the data and alignment side, the blueprint points to a three-phase pretraining approach with shifting data mixes: web-heavy early, then increasing code and math emphasis in later phases. For reasoning, the training appears to lean heavily on synthetic reasoning traces derived from DeepSeek-R1 and Qwen3 rather than extensive reinforcement learning with verifiable rewards (RLVR). Alignment uses a variant of DPO, and the recipe includes model merging across checkpoints to form a stronger combined model.
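As a reference point for the alignment step, here is the standard DPO objective on scalar per-sequence log-probabilities (the release describes a variant of DPO, so treat this as the baseline formula, not the exact recipe; the function and inputs are illustrative):

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Schematic DPO objective: -log sigmoid(beta * margin).

    Inputs are log-probs of the chosen/rejected responses under the policy
    (pi_*) and a frozen reference model (ref_*). Scalars for clarity; real
    training averages this over batches of sequences.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # Loss shrinks when the policy prefers the chosen response more
    # strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy favors the chosen answer relative to the reference -> smaller loss.
low = dpo_loss(-10.0, -14.0, -12.0, -12.0)
# Policy favors the rejected answer -> larger loss.
high = dpo_loss(-14.0, -10.0, -12.0, -12.0)
```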
In code and agent tests, SmolLM3 integrates with common tooling stacks (Transformers out of the box, plus support for frameworks like SGLang and vLLM). Function calling works: when given a tool schema, the model emits structured tool calls (e.g., weather for Copenhagen, or web-search queries about open-weights release rumors). When multiple tools are provided, behavior is mixed: sometimes it correctly declines tool use, but other times it may call tools even when not strictly necessary, depending on prompt/tool descriptions.
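The tool-calling flow above can be sketched end to end. The schema follows the OpenAI-style function format that Transformers chat templates commonly accept; the tool name, fields, and model output string are illustrative, not taken from the release:

```python
import json

# Hypothetical tool schema passed to the chat template alongside the prompt.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# A well-behaved model answers with a structured call, e.g.:
raw_model_output = '{"name": "get_weather", "arguments": {"city": "Copenhagen"}}'
call = json.loads(raw_model_output)

# The host application parses the call and dispatches to a real function.
dispatch = {"get_weather": lambda city: f"weather({city})"}
result = dispatch[call["name"]](**call["arguments"])
```

The mixed multi-tool behavior noted above shows up at the dispatch step: with several schemas registered, the model may emit a call like this even when the prompt did not require one.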
Overall, SmolLM3 positions itself as a credible local "small reasoner" for workflows that need both planning and tool use, while the blueprint lowers the barrier for researchers and developers who want to replicate or modify the training recipe without relying on proprietary black boxes.
Cornell Notes
SmolLM3 is a 3B Hugging Face model released with base, instruct, and ONNX variants, designed for local deployment while supporting reasoning and tool use. Its standout practical feature is a dual-mode "think" system that can be toggled via prompts to produce long reasoning traces or skip them for faster, cleaner answers. Hugging Face also publishes a detailed training blueprint, including architecture choices (grouped-query attention, NoPE), long-context training, and post-training recipes, plus compute estimates using 384 H100s over 24 days. In tests, the model performs well for its size on reasoning benchmarks like GSM8K and can generate structured function calls (e.g., weather and web search) when provided tool schemas. The release matters because it combines usable agentic behavior with unusually transparent training documentation.
What makes SmolLM3 more than a standard 3B chat model?
How does the training transparency change what developers can do with the model?
What does the blueprint suggest about how reasoning capability was built?
How strong is the long-context claim, and what does it imply for real use?
How does function calling behave when multiple tools are available?
Why is the compute budget cited as significant for the community?
Review Questions
- How does the prompt-based "think on/off" mechanism change the structure and usefulness of outputs in SmolLM3 tests?
- What specific architectural and training-stability choices (e.g., grouped-query attention, NoPE, embedding weight-decay removal) are highlighted in the blueprint, and why might they matter?
- When tool schemas are provided, what signals in the prompt or tool descriptions might cause SmolLM3 to call a tool even when the user expects no tool use?
Key Points
1. SmolLM3 is a 3B Hugging Face model released as base, instruct, and ONNX variants, targeting local deployment with reasoning and tool use.
2. A dual-mode reasoning system lets users toggle whether the model emits extended "think" traces or returns answers without them.
3. Hugging Face's published blueprint details the training recipe end-to-end, including architecture choices, long-context training, and post-training alignment steps.
4. The compute budget cited (384 H100 GPUs for 24 days) implies training costs far lower than many large proprietary efforts, making replication more realistic.
5. Pretraining is described as three-phase with shifting data mixes toward more code and math later, while reasoning training leans on synthetic traces from DeepSeek-R1 and Qwen3.
6. Function calling works with tool schemas, but multi-tool scenarios can produce mixed behavior when the prompt implies no tool should be used.
7. The release combines open weights with agentic tooling compatibility (Transformers out of the box, plus additional framework support mentioned in the transcript).