Ollama Launch + Claude Code + GLM Flash
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Ollama’s “Ollama launch” adds a one-command way to run Claude Code–style coding agents locally using Anthropic API compatibility.
Briefing
Ollama has introduced “Ollama launch,” a one-command way to run Anthropic API–compatible coding assistants locally—making it possible to use Claude Code–style workflows with models like Z.ai’s GLM 4.7 Flash. The setup is straightforward: update Ollama, pull the desired model, raise the model’s context length, then launch Claude Code from the terminal. In practice, the workflow works, including tool use via MCP, but performance is the tradeoff.
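Before pointing Claude Code at the local server, it can help to sanity-check the Anthropic-compatible endpoint directly. The following is a minimal sketch, assuming Ollama is listening on localhost:11434 and the model was pulled under a tag like glm-4.7-flash; both the base URL and the tag are assumptions, not details from the video.

```python
# Minimal sketch: talk to a local Ollama server through the Anthropic Python SDK,
# i.e. the same compatibility layer a Claude Code-style client would use.
# Assumed: base URL http://localhost:11434 and model tag "glm-4.7-flash".
from anthropic import Anthropic

client = Anthropic(
    base_url="http://localhost:11434",  # local Ollama server (assumed address)
    api_key="ollama",                   # placeholder; a local server ignores it
)

response = client.messages.create(
    model="glm-4.7-flash",              # whatever tag `ollama pull` gave you
    max_tokens=512,
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
)
print(response.content[0].text)
```

If this round-trips successfully, the Claude Code–style tooling has the same endpoint to work against.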
The test centers on GLM 4.7 Flash, a smaller GLM 4.7 variant described as a mixture-of-experts model with 30B total parameters and about 3B active per token—positioned as similar in scale to the MoE models in the Qwen 3 family. With Ollama launch, the model can be run on a Mac, and the user confirms that GLM 4.7 Flash loads correctly and can drive a “Claude Code” style experience (including plan mode and agent-like task execution).
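Why a roughly 30B-parameter model fits on a consumer Mac at all comes down to quantization. A back-of-envelope estimate, assuming a 4-bit quant (typical for Ollama builds, but not confirmed in the video):

```python
# Rough memory estimate for a ~30B-parameter model.
# Assumption: ~4-bit quantized weights (~0.5 bytes per parameter); the KV cache
# and runtime add more on top, and exact figures vary by build.
total_params = 30e9
bytes_per_param = 0.5
weights_gb = total_params * bytes_per_param / 1e9
print(f"weights alone: ~{weights_gb:.0f} GB")  # ~15 GB, leaving some headroom on a 32 GB Mac
```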
A key requirement is context length. Ollama defaults to a 4,096-token context window, but the user found that Claude Code–style tooling and memory behavior break down without a larger window. The fix is to raise the model’s context length to 64K in Ollama’s settings. Without that change, the assistant “churns” and fails to retain enough context to use tools properly or save outputs reliably.
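The same change can also be requested per call through Ollama’s API instead of the settings UI. A sketch using the ollama Python client; the model tag is an assumption:

```python
# Sketch: ask for a 64K context window on a single request via the ollama
# Python client's options, rather than changing the model-level setting.
import ollama

reply = ollama.chat(
    model="glm-4.7-flash",  # assumed local tag
    messages=[{"role": "user", "content": "Summarise this repository's README."}],
    options={"num_ctx": 65536},  # raise the context window from the 4,096 default
)
print(reply["message"]["content"])
```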
Even with the 64K setting, speed becomes the limiting factor. Running the model locally slows both the initial “prefill” and subsequent token decoding, and the user reports that results take noticeably longer than cloud-based Claude Code setups using larger Anthropic models. After about 90 minutes of testing on a Mac Mini Pro with 32GB RAM, the system could perform many Claude Code–supported tasks and detect MCP tools, but it sometimes produced incorrect tool arguments—behavior the user did not see with Opus 4.5. The likely causes are the quantized model variant and the very large context window.
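To see where the time goes locally, time-to-first-token is a reasonable proxy for prefill and the streaming rate after that approximates decode speed. A rough measurement sketch; the model tag and prompt are placeholders, and streamed chunks only approximate tokens:

```python
# Rough latency breakdown for a local model: time to first streamed chunk
# approximates prefill, chunk rate afterwards approximates decode speed.
# Assumptions: model tag "glm-4.7-flash"; chunks map roughly one-to-one to tokens.
import time
import ollama

start = time.perf_counter()
first_token_at = None
chunks = 0

for chunk in ollama.chat(
    model="glm-4.7-flash",
    messages=[{"role": "user", "content": "Refactor a 200-line Python script into modules."}],
    options={"num_ctx": 65536},
    stream=True,
):
    if first_token_at is None and chunk["message"]["content"]:
        first_token_at = time.perf_counter()  # end of prefill, roughly
    chunks += 1

end = time.perf_counter()
if first_token_at is not None:
    print(f"prefill (time to first token): {first_token_at - start:.1f}s")
    print(f"decode: {chunks / (end - first_token_at):.1f} chunks/s")
```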
Overall, the local approach is functional but not yet “prime time” as an everyday Claude Code replacement on modest hardware. The user suggests it’s best suited to people with powerful machines (ideally with a capable GPU) and to those willing to accept slower coding-agent cycles. Still, the promise is clear: running smaller GLM-, Gemma-, and Qwen-class models via the Anthropic-compatible API could enable cheaper, local “small agents” that run efficiently, and future coding models may make this setup practical for more users.
The takeaway is pragmatic: Ollama launch makes local Claude Code–style coding assistants easy to start, but hardware and latency determine whether it’s a viable alternative to paid Claude Code plans. The user encourages others to try it and share results, especially any setups that improve speed or tool accuracy.
Cornell Notes
Ollama’s new “Ollama launch” feature enables Claude Code–style coding agents to run with Anthropic API compatibility using local models such as Z.ai’s GLM 4.7 Flash. The setup is simple—update Ollama, download the model, then launch Claude Code from the terminal—but it requires changing the context length from the default 4,096 tokens to 64K for reliable tool use and memory. In testing on a Mac Mini Pro (32GB RAM), the system worked with MCP tools but was much slower due to local prefill and decoding. Tool calls sometimes used wrong arguments, likely tied to quantization and the large context window, so it’s not yet a full replacement for cloud Claude Code on typical hardware.
What does “Ollama launch” change for running Claude Code–style tools locally?
Why does context length matter so much for this setup?
How does local GLM 4.7 Flash performance compare to cloud-based Claude Code?
What kinds of tool-use issues showed up during testing?
What hardware constraints does the tester imply for making this practical?
Review Questions
- What specific configuration change (including the token value) is required to make Claude Code–style tool use work with Ollama launch?
- What performance bottlenecks appear when running GLM 4.7 Flash locally, and which parts of the generation process are affected?
- Why might tool calls be less reliable with a quantized local model compared with Opus 4.5?
Key Points
1. Ollama’s “Ollama launch” adds a one-command way to run Claude Code–style coding agents locally using Anthropic API compatibility.
2. GLM 4.7 Flash is positioned as a smaller GLM 4.7 variant (30B parameters with 3B active parameters) that can run on a Mac.
3. Reliable Claude Code–style behavior requires increasing Ollama’s context length from 4,096 to 64K for the model.
4. Local inference slows both prefill and decoding, making results significantly slower than cloud Claude Code setups.
5. On a Mac Mini Pro with 32GB RAM, MCP tool detection works, but tool arguments can sometimes be wrong.
6. The tester sees the approach as promising for small local agents, but not yet a full replacement for paid Claude Code on typical hardware.