Run Claude Code Locally on Apple Silicon Using LM Studio and LiteLLM | Tech Edge AI
Based on Tech Edge AI-ML's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.
Briefing
Running Claude Code locally on Apple Silicon hinges on one practical fix: Claude Code expects Anthropic’s messages API, while most local LLM servers expose an OpenAI-compatible interface. The workaround is a lightweight protocol-translation layer built with LiteLLM, which bridges Claude Code’s Anthropic-style requests to a locally hosted model served by LM Studio. The result is an agentic coding workflow that behaves like the cloud-hosted service, with multifile edits, test runs, shell commands, refactors, and debugging, while keeping all inference on the Mac for zero usage fees and stronger privacy.
The setup starts with LM Studio, where the user downloads and loads the “Qwen3 Coder 30B” model and enables the local server. LM Studio then exposes an OpenAI-compatible chat completions endpoint, which LiteLLM will call. Because Apple Silicon performance depends on Apple-optimized model formats, the transcript emphasizes choosing MLX-optimized builds; LM Studio can serve MLX models directly, so the LM Studio plus LiteLLM path keeps M1, M2, and M3 hardware effectively utilized.
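As a quick sanity check before adding LiteLLM, LM Studio’s local server can be queried directly. The sketch below assumes LM Studio’s default server port of 1234; the response also reveals the exact model ID string that the LiteLLM config will need.

```bash
# List the models LM Studio is serving (1234 is LM Studio's default server port;
# adjust if you chose a different port in the server settings).
curl http://localhost:1234/v1/models
```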
Next comes LiteLLM configuration. A clean Python virtual environment is created using Python 3.10 or newer, then LiteLLM is installed with proxy support. A config.yaml file maps Claude Code’s model aliases to the actual LM Studio model ID and drops Anthropic-specific parameters that would otherwise cause errors. With that mapping in place, LiteLLM runs as a local proxy server on a specified port.
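A minimal sketch of that configuration, assuming LM Studio reports the model ID qwen3-coder-30b on its default port 1234; the alias claude-sonnet-4 is just an example of a name Claude Code might request, and both values should be replaced with whatever your setup actually uses:

```yaml
# config.yaml: map the name Claude Code asks for onto the LM Studio model
model_list:
  - model_name: claude-sonnet-4            # alias Claude Code will request (example)
    litellm_params:
      model: openai/qwen3-coder-30b        # "openai/" tells LiteLLM to use the OpenAI-compatible protocol
      api_base: http://localhost:1234/v1   # LM Studio's local server
      api_key: lm-studio                   # LM Studio ignores the key, but the field must be non-empty

litellm_settings:
  drop_params: true                        # drop parameters the local backend doesn't accept
```

The environment setup and proxy launch then look roughly like this, with 4000 as an example port:

```bash
python3 -m venv litellm-env          # clean virtual environment
source litellm-env/bin/activate
pip install 'litellm[proxy]'         # LiteLLM with proxy-server support
litellm --config config.yaml --port 4000
```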
Before installing Claude Code, the proxy connection is verified with a curl test request. A successful response from the “Qwen3 Coder 30B” model confirms that the local stack, LM Studio plus LiteLLM, is functioning end to end.
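That verification might look like the following, assuming the proxy runs on port 4000 and uses the claude-sonnet-4 alias from the example config. Recent LiteLLM proxy versions expose an Anthropic-style /v1/messages route, the same path Claude Code uses; if yours does not, the OpenAI-style /v1/chat/completions path gives an equivalent connectivity check.

```bash
curl http://localhost:4000/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: sk-local-test" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
        "model": "claude-sonnet-4",
        "max_tokens": 128,
        "messages": [
          {"role": "user", "content": "Reply with a one-line Python hello world."}
        ]
      }'
```

A JSON reply generated by the local model confirms the full chain: curl to LiteLLM (protocol translation) to LM Studio (inference).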
Claude Code is then installed globally via npm (or the platform-specific commands mentioned for Mac/Linux and Windows). Environment variables are set so Claude Code routes requests to the local LiteLLM proxy instead of any cloud endpoint. When Claude Code launches, prompts work as if it were connected to Anthropic’s cloud, but all processing happens on the Mac.
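A sketch of that routing, using the npm package name Anthropic publishes for Claude Code and the ANTHROPIC_* environment variables it reads; the port and model alias again assume the example LiteLLM setup above:

```bash
# Install Claude Code globally (the video also mentions platform-specific installers)
npm install -g @anthropic-ai/claude-code

# Route requests to the local LiteLLM proxy instead of Anthropic's cloud
export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_AUTH_TOKEN="sk-local-anything"   # not validated unless LiteLLM is configured with a master key
export ANTHROPIC_MODEL="claude-sonnet-4"          # must match a model_name alias in config.yaml

# Launch; prompts now run entirely against the local model
claude
```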
The transcript also flags a key tradeoff: local models can be slower than hosted cloud models because they rely entirely on local compute and memory. Still, with enough RAM for a 30B-parameter model and Apple Silicon’s efficiency, performance can be “excellent.” For faster iteration, it suggests using smaller variants, specifically noting that “Qwen3 Coder 3B” is strong for refactoring, test generation, and large repository changes.
Overall, the workflow lowers the barrier to agentic coding on consumer hardware by combining LM Studio, LiteLLM, and an open-source coding model into a fully offline, cost-free development environment tailored to Apple Silicon.
Cornell Notes
The core challenge is compatibility: Claude Code expects Anthropic’s messages API, while local LLM servers typically provide an OpenAI-compatible API. The solution is to run LM Studio locally with an open-source coding model (such as “Qwen3 Coder 30B”) and place LiteLLM in between as a proxy/translation layer. LiteLLM maps Claude Code model names to the LM Studio model ID and removes Anthropic-specific parameters that would break requests. After verifying the proxy with a curl test, Claude Code is installed and configured to send all requests to the local proxy. The payoff is an offline, zero-usage-fee agentic coding setup that keeps code and prompts on the Mac.
Why does Claude Code need a translation layer when running with local models?
What role does LM Studio play in the local Claude Code workflow?
What must be configured in LiteLLM to prevent API mismatches?
How can you confirm the local stack works before installing Claude Code?
What changes when Claude Code is pointed at the local proxy instead of the cloud?
What performance tradeoffs should be expected with local models?
Review Questions
- What specific API incompatibility exists between Claude Code and typical local LLM runtimes, and how does LiteLLM resolve it?
- Which files and environment variables must be set so Claude Code routes requests to the local proxy rather than any cloud endpoint?
- How do model size choices (e.g., “Qwen3 Coder 30B” vs. “Qwen3 Coder 3B”) affect speed and task suitability in this setup?
Key Points
1. Claude Code expects Anthropic’s messages API, so local setups need a compatibility bridge to talk to OpenAI-compatible local endpoints.
2. LM Studio should host the chosen open-source model (e.g., “Qwen3 Coder 30B”) and expose an OpenAI-compatible chat completions endpoint.
3. LiteLLM runs as a local proxy/translation layer and requires a config.yaml that maps Claude Code model aliases to the LM Studio model ID.
4. LiteLLM configuration must drop Anthropic-specific parameters to avoid request errors when forwarding to LM Studio.
5. Verify the LM Studio ↔ LiteLLM connection with a curl test before installing or launching Claude Code.
6. Point Claude Code to the local LiteLLM proxy using environment variables so all inference stays on the Mac.
7. Local models may be slower than cloud models, so choose model size (30B vs. 3B) based on the speed vs. capability tradeoff.