Ollama Launch + Claude Code + GLM Flash
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Ollama’s “Ollama launch” adds a one-command way to run Claude Code–style coding agents locally using Anthropic API compatibility.
Briefing
Ollama has introduced “Ollama launch,” a one-command way to run Anthropic API–compatible coding assistants locally—making it possible to use Claude Code–style workflows with models like Z.ai’s GLM 4.7 Flash. The setup is straightforward: update Ollama, pull the desired model, raise the model’s context length, then launch Claude Code from the terminal. In practice, the workflow works, including tool use via MCP, but performance is the tradeoff.
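Before pointing Claude Code at the local server, it can help to sanity-check the Anthropic-compatible endpoint directly. The following is a minimal sketch, assuming Ollama is listening on localhost:11434 and the model was pulled under a tag like glm-4.7-flash; both the base URL and the tag are assumptions, not details from the video.

```python
# Minimal sketch: talk to a local Ollama server through the Anthropic Python SDK,
# i.e. the same compatibility layer a Claude Code-style client would use.
# Assumed: base URL http://localhost:11434 and model tag "glm-4.7-flash".
from anthropic import Anthropic

client = Anthropic(
    base_url="http://localhost:11434",  # local Ollama server (assumed address)
    api_key="ollama",                   # placeholder; a local server ignores it
)

response = client.messages.create(
    model="glm-4.7-flash",              # whatever tag `ollama pull` gave you
    max_tokens=512,
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
)
print(response.content[0].text)
```

If this round-trips successfully, the Claude Code–style tooling has the same endpoint to work against.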
The test centers on GLM 4.7 Flash, a smaller GLM 4.7 variant described as a mixture-of-experts model with 30B total parameters and about 3B active per token—positioned as similar in scale to the MoE models in the Qwen 3 family. With Ollama launch, the model can be run on a Mac, and the user confirms that GLM 4.7 Flash loads correctly and can drive a “Claude Code” style experience (including plan mode and agent-like task execution).
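Why a roughly 30B-parameter model fits on a consumer Mac at all comes down to quantization. A back-of-envelope estimate, assuming a 4-bit quant (typical for Ollama builds, but not confirmed in the video):

```python
# Rough memory estimate for a ~30B-parameter model.
# Assumption: ~4-bit quantized weights (~0.5 bytes per parameter); the KV cache
# and runtime add more on top, and exact figures vary by build.
total_params = 30e9
bytes_per_param = 0.5
weights_gb = total_params * bytes_per_param / 1e9
print(f"weights alone: ~{weights_gb:.0f} GB")  # ~15 GB, leaving some headroom on a 32 GB Mac
```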
A key requirement is context length. Ollama defaults to a 4,096-token context window, but the user found that Claude Code–style tooling and memory behavior break down without a larger window. The fix is to raise the model’s context length to 64K in Ollama’s settings. Without that change, the assistant “churns” and fails to retain enough context to use tools properly or save outputs reliably.
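The same change can also be requested per call through Ollama’s API instead of the settings UI. A sketch using the ollama Python client; the model tag is an assumption:

```python
# Sketch: ask for a 64K context window on a single request via the ollama
# Python client's options, rather than changing the model-level setting.
import ollama

reply = ollama.chat(
    model="glm-4.7-flash",  # assumed local tag
    messages=[{"role": "user", "content": "Summarise this repository's README."}],
    options={"num_ctx": 65536},  # raise the context window from the 4,096 default
)
print(reply["message"]["content"])
```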
Even with the 64K setting, speed becomes the limiting factor. Running the model locally slows both the initial “prefill” and subsequent token decoding, and the user reports that results take noticeably longer than cloud-based Claude Code setups using larger Anthropic models. After about 90 minutes of testing on a Mac Mini Pro with 32GB RAM, the system could perform many Claude Code–supported tasks and detect MCP tools, but it sometimes produced incorrect tool arguments—behavior the user did not see with Opus 4.5. The likely causes are the quantized model variant and the very large context window.
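To see where the time goes locally, time-to-first-token is a reasonable proxy for prefill and the streaming rate after that approximates decode speed. A rough measurement sketch; the model tag and prompt are placeholders, and streamed chunks only approximate tokens:

```python
# Rough latency breakdown for a local model: time to first streamed chunk
# approximates prefill, chunk rate afterwards approximates decode speed.
# Assumptions: model tag "glm-4.7-flash"; chunks map roughly one-to-one to tokens.
import time
import ollama

start = time.perf_counter()
first_token_at = None
chunks = 0

for chunk in ollama.chat(
    model="glm-4.7-flash",
    messages=[{"role": "user", "content": "Refactor a 200-line Python script into modules."}],
    options={"num_ctx": 65536},
    stream=True,
):
    if first_token_at is None and chunk["message"]["content"]:
        first_token_at = time.perf_counter()  # end of prefill, roughly
    chunks += 1

end = time.perf_counter()
if first_token_at is not None:
    print(f"prefill (time to first token): {first_token_at - start:.1f}s")
    print(f"decode: {chunks / (end - first_token_at):.1f} chunks/s")
```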
Overall, the local approach is functional but not yet “prime time” as an everyday Claude Code replacement on modest hardware. The user suggests it’s best suited to people with powerful machines (ideally with a capable GPU) and to those willing to accept slower coding-agent cycles. Still, the promise is clear: running smaller GLM-, Gemma-, and Qwen-class models via the Anthropic-compatible API could enable cheaper, local “small agents” that run efficiently, and future coding models may make this setup practical for more users.
The takeaway is pragmatic: Ollama launch makes local Claude Code–style coding assistants easy to start, but hardware and latency determine whether it’s a viable alternative to paid Claude Code plans. The user encourages others to try it and share results, especially any setups that improve speed or tool accuracy.
Cornell Notes
Ollama’s new “Ollama launch” feature enables Claude Code–style coding agents to run with Anthropic API compatibility using local models such as Z.ai’s GLM 4.7 Flash. The setup is simple—update Ollama, download the model, then launch Claude Code from the terminal—but it requires changing the context length from the default 4,096 tokens to 64K for reliable tool use and memory. In testing on a Mac Mini Pro (32GB RAM), the system worked with MCP tools but was much slower due to local prefill and decoding. Tool calls sometimes used wrong arguments, likely tied to quantization and the large context window, so it’s not yet a full replacement for cloud Claude Code on typical hardware.
What does “Ollama launch” change for running Claude Code–style tools locally?
Why does context length matter so much for this setup?
How does local GLM 4.7 Flash performance compare to cloud-based Claude Code?
What kinds of tool-use issues showed up during testing?
What hardware constraints does the tester imply for making this practical?
Review Questions
- What specific configuration change (including the token value) is required to make Claude Code–style tool use work with Ollama launch?
- What performance bottlenecks appear when running GLM 4.7 Flash locally, and which parts of the generation process are affected?
- Why might tool calls be less reliable with a quantized local model compared with Opus 4.5?
Key Points
1. Ollama’s “Ollama launch” adds a one-command way to run Claude Code–style coding agents locally using Anthropic API compatibility.
2. GLM 4.7 Flash is positioned as a smaller GLM 4.7 variant (30B parameters with 3B active parameters) that can run on a Mac.
3. Reliable Claude Code–style behavior requires increasing Ollama’s context length from 4,096 to 64K for the model.
4. Local inference slows both prefill and decoding, making results significantly slower than cloud Claude Code setups.
5. On a Mac Mini Pro with 32GB RAM, MCP tool detection works, but tool arguments can sometimes be wrong.
6. The tester sees the approach as promising for small local agents, but not yet a full replacement for paid Claude Code on typical hardware.