100% Free Claude Code | Run Claude Code with a Local LLM via Ollama and Qwen 3.5
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Claude Code can be run locally by launching it with Ollama as the inference provider rather than using quota-based hosted inference.
Briefing
Running Claude Code locally with an Ollama-backed Qwen model can deliver practical coding assistance, especially when the task is narrowly scoped to specific files, without relying on Anthropic’s hosted, quota-based API. The setup is straightforward: install Claude Code, then launch it with inference pointed at a local Ollama instance, optionally selecting a model via a command-line flag. In testing, a quantized Qwen 3.5 “35 billion parameter” mixture-of-experts model served as the local engine, and it successfully handled repository navigation and targeted code analysis.
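A minimal sketch of what that launch can look like, assuming a recent Ollama build that exposes an Anthropic-compatible API on its default port (older builds need a translation proxy instead). ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, and the --model flag are real Claude Code settings; the qwen3.5:35b model tag is a placeholder to check against your local ollama list:

```bash
# Pull the local model first. The tag is a placeholder -- verify the
# exact name with `ollama list` or the Ollama model library.
ollama pull qwen3.5:35b

# Route Claude Code's inference to the local Ollama server instead of
# Anthropic's hosted API. Ollama ignores the token value, but Claude
# Code expects one to be set.
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"

# Launch Claude Code, selecting the local model via the --model flag.
claude --model qwen3.5:35b
```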
The most telling results came from how the model performed on different kinds of requests. When asked to understand the repository as a whole, the model struggled to produce a true project-level overview. Instead, it fixated on a single uncommitted file, producing a well-formatted explanation of that file but not the broader “what is my project about?” answer. That mismatch matters because it highlights a limitation of smaller local models: they may excel at local context and formatting, yet fail to synthesize a whole codebase.
When the prompts were directed at a specific file, such as a Python module in an agents directory, the behavior improved sharply. The model read the file, evaluated most of it, and even compared it to a related “trader gate” file. It also surfaced a potentially important issue: a “max steps” iteration-limit behavior that sounded like a bug. The presenter then tested whether the model could correct the problem by switching into auto-accept edit mode (toggled with Shift+Tab). After the edit attempt, the script’s behavior became more explicit: instead of silently returning response content when the agent hit its iteration limit, it now reported the limit clearly. The change was small but meaningful (five lines added, two removed), suggesting the model could make surgical fixes rather than broad rewrites.
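The repository’s actual source isn’t shown in these notes, so the following Python sketch only illustrates the shape of the fix described above; run_agent, step_fn, and AgentStepLimitReached are hypothetical names. The old path silently returned accumulated content when the loop exhausted its step budget; the patched path reports the limit explicitly:

```python
class AgentStepLimitReached(RuntimeError):
    """Raised when the agent exhausts its step budget (hypothetical name)."""


def run_agent(step_fn, max_steps: int = 10):
    """Drive a simple agent loop with an explicit iteration limit.

    `step_fn` returns a (done, content) tuple; everything here is an
    illustrative stand-in for the agent code discussed in the video.
    """
    content = None
    for _ in range(max_steps):
        done, content = step_fn()
        if done:
            return content

    # Old (suspected-bug) behavior: fall through and silently return
    # whatever content the last step produced, as if the run finished.
    #     return content

    # Patched behavior: report the limit explicitly instead.
    raise AgentStepLimitReached(
        f"agent hit max_steps={max_steps} without a final answer"
    )


# Example: a step function that never finishes, to show the new signal.
steps = iter(range(100))
try:
    run_agent(lambda: (False, f"partial output {next(steps)}"), max_steps=3)
except AgentStepLimitReached as err:
    print(err)  # agent hit max_steps=3 without a final answer
```

Surfacing the cap as an explicit signal rather than a silent return matches the small diff described above: a handful of lines added, a couple removed, and no broader rewrite.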
Overall, the local Claude Code + Ollama + Qwen 3.5 workflow looked viable for consumer hardware, at least for tasks that can be grounded in specific files and concrete edits. The presenter’s takeaway was cautious but optimistic: with a 35B quantized mixture-of-experts model, results were “pretty good” for targeted analysis and patching, while repository-wide understanding may require a larger model (for example, 120B-class) to be more reliable. The core message is that local inference can replace hosted Claude Code in many day-to-day coding workflows—provided expectations match the model’s context and synthesis limits.
Cornell Notes
Local Claude Code can run without Anthropic’s quota-based API by routing inference through Ollama and a Qwen 3.5 model. In testing with a quantized Qwen 3.5 35B mixture-of-experts model, targeted file analysis worked well: the model read a specific agent file, compared it to a related file, and identified a likely max-steps/iteration-limit bug. Switching into auto-accept edit mode enabled a small patch that made the iteration-limit behavior explicit instead of silently returning content. Repository-wide understanding was weaker, with the model fixating on a single uncommitted file rather than summarizing the whole project. The workflow is practical for local coding help, especially when prompts are grounded in specific files and edits.
How does the local setup replace quota-based Claude Code usage?
Why did the model perform worse on “What is my project about?” than on file-level tasks?
What bug-like behavior did the model identify, and how was it changed?
What evidence suggested the model could make surgical edits rather than rewriting everything?
What scaling expectation was raised for better results?
Review Questions
- When Claude Code was asked to summarize the repository, what failure mode occurred, and how did it differ from the file-specific tasks?
- What change did auto-accept edit mode make to the agent’s max-steps/iteration-limit behavior?
- Based on the test results, what kinds of coding tasks are most likely to work well with a local 35B Qwen model?
Key Points
1. Claude Code can be run locally by launching it with Ollama as the inference provider rather than using quota-based hosted inference.
2. A quantized Qwen 3.5 35B mixture-of-experts model can run Claude Code on consumer hardware for practical coding workflows.
3. Repository-wide understanding may fail or become narrow when the model fixates on a single file instead of synthesizing across the project.
4. Targeted prompts that point to specific files improve speed and completeness, enabling useful analysis and comparisons across modules.
5. Auto-accept edit mode (Shift+Tab) can drive small, concrete code changes, including bug-fix style edits.
6. In testing, the model improved iteration-limit handling by making max-steps behavior explicit instead of silently returning content.
7. For more reliable project-level summaries, a larger model (e.g., ~120B-class) may be needed.