MiroThinker 1.5 - The 30B That Outperforms 1T Models
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
MiroThinker 1.5 is positioned as a practical shift in agent design: instead of relying on a single, information-heavy model, it’s built to repeatedly call tools—up to 400 tool calls—so it can research, verify, and generate multi-step outputs. The core claim is that this tool-centric “agentic” capability lets a comparatively smaller model compete with far larger systems on tasks that demand sustained tool use, not just raw language fluency. That matters because many real workflows—web research, code execution, report writing, and even podcast-style or slide-deck generation—depend on retrieving new information and transforming it through multiple steps.
The lineup centers on two open models under an MIT license: MiroThinker 235B (22B active parameters) and a smaller 30B variant (3B active parameters). Both are derived from the Qwen mixture-of-experts base models and then improved for long-horizon tool calling. They come with a 256,000-token context window and are intended to support up to 400 tool calls, though the transcript notes that reaching 400 steps cleanly, without loops or repetition, remains a broader challenge across agent systems.
Benchmark comparisons are used to argue that MiroThinker’s tool-use focus narrows the gap with much larger models. On some evaluations, such as Humanity’s Last Exam, the larger MiroThinker model is described as close to Gemini 3 Pro and nearer to GPT-5 at its high reasoning setting and GLM 4.7. In browser-oriented benchmarks, MiroThinker is described as state-of-the-art. The emphasis, however, is less on climbing a single leaderboard and more on overall competitiveness against other tool-using families, including DeepSeek V3.2, MiniMax models, GLM, and Kimi K2 Thinking.
Under the hood, the agent setup relies on a suite of tool interfaces: code execution in sandboxes, file management, information retrieval, web search, and page fetching. Long multi-step runs require context management—deciding what to keep, what to truncate, and how to retain recency—so the model can keep track of what matters as tool results accumulate. The transcript highlights that this kind of context retention and compaction is increasingly being “baked into” agent systems, citing similar directions from Anthropic (Claude models and Claude Code) and OpenAI, with early examples also appearing in Google products.
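The video doesn’t show MiroThinker’s compaction internals, but the basic idea is easy to sketch. Below is a minimal, hedged illustration of recency-based retention: keep the system prompt, then walk backwards through the history keeping the newest messages that fit a token budget. The chars-per-token heuristic and the budget value are assumptions, not the model’s actual policy.

```python
# Minimal sketch of recency-based context truncation for a chat-style
# message list. The token counter is a rough chars/4 heuristic and the
# budget is illustrative; real systems use an actual tokenizer and often
# summarize (compact) older turns instead of simply dropping them.

def truncate_history(messages, budget_tokens=200_000):
    def count(msg):
        return len(msg.get("content") or "") // 4  # crude token estimate

    system, rest = messages[0], messages[1:]
    kept, used = [], count(system)
    for msg in reversed(rest):            # newest-first, so recency wins
        cost = count(msg)
        if used + cost > budget_tokens:
            break                         # older messages fall off here
        kept.append(msg)
        used += cost
    return [system] + kept[::-1]          # restore chronological order
```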
A hands-on walkthrough shows how the 30B model can be run locally, but with real hardware costs: full-precision use reportedly needs an NVIDIA A100 with 80GB of VRAM, and the model weight download takes several minutes. Instead of using Hugging Face Transformers directly, the setup runs a vLLM server so the model can be accessed via an OpenAI-style API, leveraging vLLM’s built-in function-calling support. Tool calls are executed through a custom agent loop built from scratch (not LangChain), with explicit handling for tool invocation, tool results, and iteration limits.
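As a hedged sketch of that setup (not the video’s exact code): vLLM’s OpenAI-compatible server exposes the model with tool calling enabled, and a from-scratch loop alternates between model turns and tool execution under an iteration cap. The model id, tool schema, and server flags below are assumptions based on vLLM’s documented interface and Hugging Face naming conventions.

```python
import json
from openai import OpenAI

# Launch the server first (shell command shown as a comment; the model id
# and parser choice are assumptions, so check the model card for specifics):
#   vllm serve miromind-ai/MiroThinker-v1.5-30B \
#       --enable-auto-tool-choice --tool-call-parser hermes

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def calculate(expression: str) -> str:
    """Toy calculator tool; a restricted eval is fine for a local sketch."""
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS = [{
    "type": "function",
    "function": {
        "name": "calculate",
        "description": "Evaluate an arithmetic expression and return the result.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

def run_agent(question: str, model: str, max_iterations: int = 20) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_iterations):
        resp = client.chat.completions.create(
            model=model, messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        if not msg.tool_calls:        # no tool request means a final answer
            return msg.content
        messages.append(msg)          # keep the assistant turn in context
        for call in msg.tool_calls:   # execute each requested tool
            args = json.loads(call.function.arguments)
            result = (calculate(**args) if call.function.name == "calculate"
                      else f"Unknown tool: {call.function.name}")
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": result})
    return "Stopped: hit the iteration budget before a final answer."
```

The fallback return at the bottom corresponds to the iteration-limit failure mode described next.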
In practice, the agent can solve math via a calculate tool and perform web research by searching and fetching pages, but the transcript flags two recurring failure modes: hitting the maximum iteration budget before producing a final answer, and producing answers influenced by irrelevant retrieved content. A time-zone example (Singapore time) illustrates another friction point: correct results may require many tool steps, especially when the runtime isn’t in the target region. The takeaway is that MiroThinker 1.5 looks useful for local, non-real-time workflows that tolerate multi-step latency, while smaller models still struggle with efficiency and step quality.
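The web research side follows the same schema pattern. The sketch below shows a hypothetical fetch_page tool (the name and schema are assumptions, not MiroThinker’s actual tool API); truncating the fetched text is one simple guard against a single noisy page flooding the context and steering the answer toward irrelevant content.

```python
import urllib.request

def fetch_page(url: str, max_chars: int = 4_000) -> str:
    """Fetch a page and truncate it so one result can't flood the context."""
    with urllib.request.urlopen(url, timeout=15) as resp:
        return resp.read().decode("utf-8", errors="replace")[:max_chars]

# Schema in the same OpenAI function-calling format as the calculate tool;
# this would be appended to the TOOLS list and dispatched by name.
FETCH_TOOL = {
    "type": "function",
    "function": {
        "name": "fetch_page",
        "description": "Fetch the raw contents of a URL, truncated.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}
```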
Finally, the transcript raises a forward-looking question: if quantized versions (MLX 8-bit or 4-bit) can preserve reasoning quality, MiroThinker-style tool agents could become more accessible for local deployment via tools like llama.cpp or LM Studio. An online demo is offered for experimentation, with the expectation that it may use the larger model, given its longer “thinking” behavior.
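If quantized weights do materialize (an assumption; check Hugging Face for community GGUF or MLX conversions), the agent loop above would not need to change: llama.cpp’s llama-server speaks the same OpenAI-compatible API, so only the endpoint moves.

```python
# Hypothetical quantized deployment; the GGUF filename is illustrative:
#   llama-server -m MiroThinker-v1.5-30B-Q4_K_M.gguf --port 8080
from openai import OpenAI

# Identical client code as with vLLM; only the base_url changes.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
```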
Cornell Notes
MiroThinker 1.5 is an MIT-licensed, open tool-using agent model built for long chains of tool calls—up to 400—so it can research, fetch information, run code, and generate multi-step outputs. It comes in two sizes: MiroThinker 235B (22B active parameters) and a smaller 30B variant (3B active parameters), both based on Qwen mixture-of-experts models and improved for tool calling. The models support a 256,000-token context window and rely on context retention/compaction to manage what to keep across many tool results. Benchmarks are presented as showing competitiveness with much larger systems, especially on tasks that reward sustained tool use. A local run demo shows the practical tradeoffs: correct answers are possible, but iteration limits and irrelevant retrieval can prevent completion or reduce answer quality.
What makes MiroThinker 1.5 different from “bigger model, more knowledge” approaches?
How do the two MiroThinker models compare in size and active parameters?
Why does context management matter when an agent can call tools hundreds of times?
What are the main failure modes observed during multi-step tasks?
How does the local setup work, and what hardware constraints are mentioned?
What does the Singapore time example reveal about tool-agent efficiency?
Review Questions
- What does “up to 400 tool calls” imply for an agent’s workflow, and why is context retention essential to make that feasible?
- Compare MiroThinker 235B and MiroThinker 30B in terms of active parameters and practical deployment constraints mentioned in the walkthrough.
- Identify two specific reasons a multi-step web research task might fail to produce a final answer even when the agent is capable of finding relevant pages.
Key Points
1. MiroThinker 1.5 is designed for long-horizon, tool-centric agent behavior, aiming to support up to 400 tool calls rather than relying solely on model-internal knowledge.
2. The model family includes MiroThinker 235B (22B active parameters) and MiroThinker 30B (3B active parameters), both based on Qwen mixture-of-experts and improved for tool calling.
3. A 256,000-token context window is paired with recency-based context retention/truncation so the agent can keep working across many tool results.
4. Benchmark results are presented as showing competitiveness with larger tool-using models, with state-of-the-art claims in browser-related evaluations.
5. Local experimentation is feasible but hardware-heavy at full precision; the walkthrough cites an NVIDIA A100 with 80GB of VRAM and multi-minute weight downloads.
6. Agent runs can fail due to iteration limits or irrelevant retrieval influencing outputs, so custom evaluation of step quality is important.
7. Quantization (MLX 8-bit/4-bit) is raised as a potential path to make tool agents more locally deployable without losing too much reasoning quality.