
100% Free Claude Code | Run Claude Code with Local LLM with Ollama and Qwen 3.5

Venelin Valkov · 4 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Claude Code can run against a local model by launching it with Ollama as the inference provider instead of Anthropic's quota-based hosted API.

Briefing

Running Claude Code locally with an Ollama-backed Qwen model can deliver practical coding assistance, especially when the task is narrowly scoped to specific files, without relying on Anthropic's hosted, quota-based API. The setup is straightforward: install Claude Code, then launch it so that inference requests go to a local Ollama instance, optionally selecting a model via a command-line flag. In the test, a quantized Qwen 3.5 35B mixture-of-experts model served as the local engine, and it successfully handled repository navigation and targeted code analysis.

The most telling results came from how the model performed on different kinds of requests. When asked to understand the repository as a whole, the model struggled to produce a true project-level overview. Instead, it fixated on a single uncommitted file, producing a well-formatted explanation of that file but not the broader “what is my project about?” answer. That mismatch matters because it highlights a limitation of smaller local models: they may excel at local context and formatting, yet fail to synthesize a whole codebase.

When the prompts were directed at a specific file, such as a Python module in an agents directory, the behavior improved sharply. The model read the file, evaluated most of it, and even compared it to a related "trader gate" file. It also surfaced a potentially important issue: a "max steps" or iteration-limit behavior that sounded like a bug. The user then tested whether the model could correct the problem by switching into auto-accept edit mode (toggled with Shift+Tab). After the edit attempt, the script's behavior became more explicit: instead of silently returning response content when the agent hit an iteration limit, the agent now reported the limit clearly. The changes were described as small but meaningful, adding five lines and removing two, suggesting the model could make surgical fixes rather than broad rewrites.

Overall, the local Claude Code + Ollama + Qwen 3.5 workflow looked viable on consumer hardware, at least for tasks that can be grounded in specific files and concrete edits. The presenter's takeaway was cautious but optimistic: with a 35B quantized mixture-of-experts model, results were "pretty good" for targeted analysis and patching, while repository-wide understanding may require a larger model (for example, a 120B-class one) to be reliable. The core message is that local inference can replace Anthropic's hosted API in many day-to-day Claude Code workflows, provided expectations match the model's context and synthesis limits.

Cornell Notes

Local Claude Code can run without Anthropic's quota-based API by routing inference through Ollama and a Qwen 3.5 model. In testing with a quantized Qwen 3.5 35B mixture-of-experts model, targeted file analysis worked well: the model read a specific agent file, compared it to a related file, and identified a likely max-steps/iteration-limit bug. Switching into auto-accept edit mode enabled a small patch that made the iteration-limit behavior explicit instead of silently returning content. Repository-wide understanding was weaker, with the model fixating on a single uncommitted file rather than summarizing the whole project. The workflow is practical for local coding help, especially when prompts are grounded in specific files and edits.

How does the local setup replace quota-based Claude Code usage?

Claude Code is launched in a mode that uses Ollama as the inference provider. The workflow starts by installing the unmodified Claude Code CLI, then launching it through Ollama's launch command and optionally passing a model flag. In the test, Ollama served a quantized Qwen 3.5 35B mixture-of-experts model, and Claude Code ran against that local model while still behaving like a normal Claude Code instance.
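
This summary does not reproduce the video's exact launch command, so the following is only a minimal sketch of the general pattern, assuming the local Ollama server (default port 11434) can serve an endpoint Claude Code understands. The model tag `qwen3.5` is illustrative, not a confirmed Ollama model name; `ANTHROPIC_BASE_URL` and `--model` are Claude Code's documented way to redirect inference and pick a model.

```bash
# Minimal sketch, not the video's exact commands. Assumes the local Ollama
# server exposes an endpoint Claude Code can talk to; the model tag below
# is illustrative only.
ollama pull qwen3.5                                # fetch the local model
export ANTHROPIC_BASE_URL=http://localhost:11434   # redirect Claude Code's API calls locally
claude --model qwen3.5                             # launch Claude Code against the local model
```

If the local endpoint speaks a different protocol than Claude Code expects, a translation layer would have to sit between the two; presumably the Ollama launch config mentioned in the video handles that wiring.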

Why did the model perform worse on “What is my project about?” than on file-level tasks?

When asked for a repository-level overview, the model focused on a single uncommitted file rather than synthesizing across the codebase. The output was still high-quality for that one file—well formatted and explanatory—but it didn’t answer the broader project question. In contrast, prompts that named a specific file (like a module in an agents directory) gave the model a clear target context, leading to faster responses and more complete evaluation of the file.

What bug-like behavior did the model identify, and how was it changed?

The model flagged a "max steps" or iteration-limit issue in the agent script. Initially, it described behavior where the script could generate two calls after five iterations and then silently return response content without indicating that the agent had hit the iteration limit. After switching into auto-accept edit mode (Shift+Tab), the behavior became explicit: the agent now reported that the iteration limit was reached. The patch was described as adding five lines and removing two.
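
The video's actual code is not reproduced in this summary, so the sketch below only illustrates the shape of the change described: an agent loop that previously fell through its step cap and returned content silently, patched to report the limit explicitly. All names (`run_agent`, `MAX_STEPS`, the `llm` callable) are hypothetical.

```python
# Hypothetical sketch of the kind of patch described above; names and
# structure are illustrative, not taken from the video's repository.
MAX_STEPS = 5  # the "max steps" cap the model flagged


def run_agent(task: str, llm) -> dict:
    """Run a simple agent loop, calling `llm` until it reports completion."""
    messages = [{"role": "user", "content": task}]
    for _ in range(MAX_STEPS):
        # llm: callable returning {"content": str, "done": bool}
        response = llm(messages)
        messages.append({"role": "assistant", "content": response["content"]})
        if response["done"]:
            return {"content": response["content"], "hit_limit": False}
    # Pre-patch behavior: fall through and silently return the last content.
    # Post-patch behavior (sketched here): surface the limit explicitly.
    return {
        "content": messages[-1]["content"],
        "hit_limit": True,
        "note": f"Agent stopped after reaching the {MAX_STEPS}-step iteration limit",
    }
```

A five-line addition and two-line removal, as described in the video, matches the scale of swapping a silent fall-through return for the explicit limit reporting sketched here.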

What evidence suggested the model could make surgical edits rather than rewriting everything?

The fix involved a small, localized change—five lines added and two removed—targeting the iteration-limit handling logic. The edit also altered user-visible behavior (silent return vs explicit limit reporting), indicating the model wasn’t just summarizing code but modifying it in a controlled way.

What scaling expectation was raised for better results?

The repository overview task was described as too hard for the 35B model in that configuration. The expectation was that a larger model—around 120B parameters or more—could produce better project-level understanding, while the 35B mixture-of-experts model was still “pretty good” for targeted analysis and edits.

Review Questions

  1. When Claude Code was asked to summarize the repository, what failure mode occurred, and how did it differ from the file-specific tasks?
  2. What change did auto-accept edit mode make to the agent’s max-steps/iteration-limit behavior?
  3. Based on the test results, what kinds of coding tasks are most likely to work well with a local 35B Qwen model?

Key Points

  1. Claude Code can be run locally by launching it with Ollama as the inference provider rather than using quota-based hosted inference.

  2. A quantized Qwen 3.5 35B mixture-of-experts model can run Claude Code on consumer hardware for practical coding workflows.

  3. Repository-wide understanding may fail or become narrow when the model fixates on a single file instead of synthesizing across the project.

  4. Targeted prompts that point to specific files improve speed and completeness, enabling useful analysis and comparisons across modules.

  5. Auto-accept edit mode (Shift+Tab) can drive small, concrete code changes, including bug-fix style edits.

  6. In testing, the model improved iteration-limit handling by making max-steps behavior explicit instead of silently returning content.

  7. For more reliable project-level summaries, a larger model (e.g., a ~120B-class one) may be needed.

Highlights

Local Claude Code + Ollama worked with a quantized Qwen 3.5 35B mixture-of-experts model, enabling coding assistance without a hosted API.
The model produced strong, well-formatted results for specific files but struggled to generate a true repository-level overview.
A max-steps/iteration-limit issue was patched so the agent now reports hitting the iteration limit instead of silently returning content.
Small edits (five lines added, two removed) were enough to change observable agent behavior.
