smolagents - HuggingFace's NEW Agent Framework
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Hugging Face’s new “smolagents” framework pushes agent building toward “code agents”: instead of forcing an LLM to emit JSON-style plans, it can write and run Python in a sandbox to decide what to do next. The practical payoff is a simpler path from prompt to action—often just “model + tools”—while still keeping guardrails through restricted, sandboxed imports and tool access.
The framework’s first big differentiator is how it defines “agency.” It positions agents on a spectrum: from tightly constrained tool-calling loops (safer, more predictable) to higher-agency setups that let models take multi-step actions. smolagents leans into the middle ground by supporting both tool-calling agents and code agents, with code agents designed to let the model “think in code” and execute it step-by-step. This direction draws on prior research showing benefits from letting models execute code (e.g., via Python) and feed back structured results, rather than relying purely on text or JSON.
On the model side, smolagents is built around Hugging Face Hub models, with the out-of-the-box default being Qwen2.5-Coder-32B-Instruct. Access depends on your Hugging Face account tier, but the framework also supports proprietary models through LiteLLM, enabling OpenAI- and Anthropic-style backends. In practice, the setup is minimal: import a code agent, provide a model wrapper, and register tools. A built-in example computes the cube root of 27 by running Python directly, with no external search tool needed.
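A minimal sketch of that setup, using class names from early smolagents releases (they may differ in later versions):

```python
# pip install smolagents
from smolagents import CodeAgent, HfApiModel

# Default Hub model wrapper; at the time of the video this resolved to
# Qwen2.5-Coder-32B-Instruct
model = HfApiModel()

# No external tools registered: the agent can still write and run Python
# inside its sandbox
agent = CodeAgent(tools=[], model=model)

# Answered by generating and executing Python, e.g. 27 ** (1 / 3)
agent.run("What is the cube root of 27?")
```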
Where the framework becomes most revealing is in real tasks that require external data and reasoning. For a route-time question (“drive from Melbourne to Sydney”), the agent switches to web search, extracts distance/time candidates from results, and produces an estimated range. For a more complex finance scenario (“buy Bitcoin with $11,000 to reach $1 million”), the agent attempts to fetch historical prices and write code to compute the answer, but it fails repeatedly due to sandbox restrictions—certain Python libraries (like requests) and even JSON handling are not authorized by default. The agent then falls back to less reliable strategies (web queries and printed outputs) and hits a maximum iteration limit, consuming substantial token budgets along the way.
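A hedged sketch of the web-search variant (the question wording is illustrative, and depending on your install the search tool may require an extra dependency such as the duckduckgo-search package):

```python
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

# A search tool lets the code agent pull external data, then post-process
# the results in generated Python
agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=HfApiModel())

agent.run("How long does it take to drive from Melbourne to Sydney?")
```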
The transcript also highlights how to tune the sandbox: authorized imports can be expanded (e.g., adding requests, bs4, or math), and system prompts can be overridden to change agent behavior. Even with tuning, the agent may still struggle with errors when assembling tables or performing multi-step data work, suggesting that reliability depends heavily on allowed libraries, iteration limits, and prompt/model fit.
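A sketch of those levers, assuming the `additional_authorized_imports` and `max_steps` parameters from early smolagents releases (the system prompt override mechanism varies by version, so it is omitted here):

```python
from smolagents import CodeAgent, HfApiModel

agent = CodeAgent(
    tools=[],
    model=HfApiModel(),
    # Widen the sandbox: permit imports beyond the default safe list
    additional_authorized_imports=["requests", "bs4", "math"],
    # Cap the agent's reasoning/execution loop to bound token spend
    max_steps=10,
)
```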
Finally, smolagents supports traditional tool-calling patterns (React-like loops) and custom tools defined by developers, including specifying input/output schemas and pushing tools to the Hugging Face Hub for reuse. The overall message is that smolagents makes agent experimentation faster and more flexible, but code-agent reliability still hinges on sandbox permissions and error-handling—especially for data-heavy tasks.
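As a sketch of the custom-tool pattern under the `@tool` decorator (the tool name and body here are hypothetical; smolagents infers the input/output schema from the type hints and docstring):

```python
from smolagents import tool

@tool
def get_btc_price(date: str) -> float:
    """Returns the Bitcoin closing price in USD for a given date.

    Args:
        date: Date in YYYY-MM-DD format.
    """
    # Placeholder body; a real tool would query a price API here
    return 0.0

# The decorated function becomes a Tool object that can be passed to an
# agent's tools list or shared on the Hub, e.g.:
# get_btc_price.push_to_hub("your-username/get-btc-price")
```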
Cornell Notes
smolagents is a Hugging Face library for building agents, with a standout focus on “code agents” that can write and execute Python in a sandbox. It supports both code agents and tool-calling agents, letting developers choose how much agency to grant. The framework is designed to be easy to start with, often just a model plus tools, using Hugging Face Hub models by default (Qwen2.5-Coder-32B-Instruct) or proprietary models via LiteLLM. In demonstrations, simple math works well, while data-heavy tasks can fail when the sandbox blocks needed libraries (e.g., requests) or when iteration limits are reached. Custom tools, authorized imports, and system prompt overrides are key levers for improving outcomes.
What does “code agent” mean in smolagents, and how is it different from JSON-style agents?
How does smolagents handle model choice—Hugging Face Hub vs proprietary models?
Why did the Bitcoin investment example fail, and what does that reveal about sandboxing?
What knobs can improve code-agent performance in smolagents?
How do tool-calling agents and custom tools fit alongside code agents?
What role does memory play when an agent makes repeated mistakes?
Review Questions
- In smolagents, what are the practical consequences of restricting authorized imports in the sandbox?
- Compare the failure modes of the cube-root task versus the Bitcoin investment task—what changed in the agent’s required capabilities?
- How would you design a custom tool (inputs/outputs) to reduce reliance on code execution for a data-heavy workflow?
Key Points
1. smolagents introduces “code agents” that can write and execute Python in a sandbox, aiming to make agent reasoning more direct than JSON-only planning.
2. The framework supports a spectrum of agency, including both code agents and tool-calling agents, so developers can choose between safety and flexibility.
3. Default model usage centers on Hugging Face Hub models (including Qwen2.5-Coder-32B-Instruct), while proprietary models are supported via LiteLLM (e.g., GPT-4o); see the sketch after this list.
4. Sandbox restrictions on authorized imports (such as blocking requests) can cause data-fetching and table-building tasks to fail, even when the model attempts multiple strategies.
5. Authorized imports, system prompt overrides, and model selection are key levers for improving reliability and reducing repeated errors.
6. Custom tools can be defined with explicit input/output schemas and shared via the Hugging Face Hub for reuse.
7. Token usage can spike during multi-step code-agent runs, especially when the agent hits max iterations after repeated failures.
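A hedged sketch of the proprietary-model path via LiteLLM (model IDs follow LiteLLM's naming conventions, and an API key is assumed to be set in the environment):

```python
from smolagents import CodeAgent, LiteLLMModel

# LiteLLMModel routes requests through LiteLLM, so OpenAI- or
# Anthropic-style backends can stand in for Hub-hosted models
# (requires OPENAI_API_KEY in the environment for this model ID)
model = LiteLLMModel(model_id="gpt-4o")

agent = CodeAgent(tools=[], model=model)
agent.run("What is the cube root of 27?")
```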