Run Claude Code Locally on Apple Silicon Using LM Studio and LiteLLM | Tech Edge AI

Tech Edge AI-ML · 4 min read

Based on Tech Edge AI-ML's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Claude Code expects Anthropic’s messages API, so local setups need a compatibility bridge to talk to OpenAI-compatible local endpoints.

Briefing

Running Claude Code locally on Apple Silicon hinges on one practical fix: Claude Code expects Anthropic’s messages API, while most local LLM servers expose an OpenAI-compatible interface. The workaround is a lightweight protocol-translation layer using LiteLLM, which bridges Claude Code’s Anthropic-style requests to a locally hosted model served by LM Studio. The result is an agentic coding workflow that behaves like the cloud—multifile edits, test runs, shell commands, refactors, and debugging—while keeping all inference on the Mac for zero usage fees and stronger privacy.

The setup starts with LM Studio, where the user downloads and loads the “Qwen3 Coder 30B” model and enables the local server. LM Studio then exposes an OpenAI-compatible chat completions endpoint, which LiteLLM will call. Because Apple Silicon performance depends on Apple-optimized model formats, the transcript recommends MLX-optimized builds of the model; the LM Studio plus LiteLLM path keeps the hardware effectively utilized on M1, M2, and M3 chips.
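Once the server is enabled, a quick check confirms the model is loaded and reachable. This assumes LM Studio’s default server address (`http://localhost:1234/v1`); adjust if you changed the port:

```shell
# List the models served by LM Studio's local server.
# http://localhost:1234/v1 is LM Studio's default address (assumed here);
# the loaded model's ID should appear in the response.
curl http://localhost:1234/v1/models
```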

Next comes LiteLLM configuration. A clean Python virtual environment is created using Python 3.10 or newer, then LiteLLM is installed with proxy support. A config.yaml file maps Claude Code’s model aliases to the actual LM Studio model ID and drops Anthropic-specific parameters that would otherwise cause errors. With that mapping in place, LiteLLM runs as a local proxy server on a specified port.
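A minimal sketch of these steps follows. The model alias, LM Studio model ID, and port 4000 are assumptions; substitute the model ID shown in LM Studio’s server tab for your download:

```shell
# Create an isolated environment and install LiteLLM with proxy support
python3 -m venv claude-local && source claude-local/bin/activate
pip install 'litellm[proxy]'

# Minimal config.yaml sketch. The alias and model ID are illustrative;
# drop_params tells LiteLLM to silently drop parameters the local
# OpenAI-compatible endpoint doesn't support.
cat > config.yaml <<'EOF'
model_list:
  - model_name: claude-sonnet-4-20250514   # alias Claude Code will request (assumed)
    litellm_params:
      model: openai/qwen3-coder-30b        # LM Studio model ID (assumed)
      api_base: http://localhost:1234/v1   # LM Studio's default endpoint
      api_key: lm-studio                   # LM Studio ignores the key value
litellm_settings:
  drop_params: true
EOF

# Start the proxy on a chosen port
litellm --config config.yaml --port 4000
```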

Before installing Claude Code, the proxy connection is verified with a curl test message. A successful response from the “Qwen3 Coder 30B” model confirms that the local stack—LM Studio plus LiteLLM—is functioning end-to-end.
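The verification can look like the following, assuming the proxy runs on port 4000 and the alias from the config.yaml sketch above:

```shell
# Send a simple message through the LiteLLM proxy; a JSON completion
# in the response confirms LM Studio and LiteLLM are wired up correctly.
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "claude-sonnet-4-20250514",
        "messages": [{"role": "user", "content": "Reply with OK"}]
      }'
```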

Claude Code is then installed globally via npm (or the platform-specific commands mentioned for Mac/Linux and Windows). Environment variables are set so Claude Code routes requests to the local LiteLLM proxy instead of any cloud endpoint. When Claude Code launches, prompts should work as if connected to Anthropic Cloud, but all processing happens on the Mac.
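A sketch of the install and routing step, assuming the npm path on macOS and the proxy port used above (the token value is a placeholder, since the local server performs no authentication):

```shell
# Install Claude Code globally via npm
npm install -g @anthropic-ai/claude-code

# Route Claude Code's requests to the local LiteLLM proxy instead of
# Anthropic's cloud endpoint, then launch it.
export ANTHROPIC_BASE_URL="http://localhost:4000"
export ANTHROPIC_AUTH_TOKEN="local-key"   # placeholder; not checked locally
claude
```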

The transcript also flags a key tradeoff: local models can be slower than hosted cloud models because they rely entirely on local compute and memory. Still, with enough RAM for a 30B-parameter model and Apple Silicon’s efficiency, performance can be “excellent.” The 30B model is positioned as strong for refactoring, test generation, and large repository changes, while the smaller “Qwen3 Coder 3B” variant is suggested for faster iteration.

Overall, the workflow lowers the barrier to agentic coding on consumer hardware by combining LM Studio, LiteLLM, and an open-source coding model into a fully offline, cost-free development environment tailored to Apple Silicon.

Cornell Notes

The core challenge is compatibility: Claude Code expects Anthropic’s messages API, while local LLM servers typically provide an OpenAI-compatible API. The solution is to run LM Studio locally with an open-source coding model (such as “Qwen3 Coder 30B”) and place LiteLLM in between as a proxy/translation layer. LiteLLM maps Claude Code model names to the LM Studio model ID and removes Anthropic-specific parameters that would break requests. After verifying the proxy with a curl test, Claude Code is installed and configured to send all requests to the local proxy. The payoff is an offline, zero-usage-fee agentic coding setup that keeps code and prompts on the Mac.

Why does Claude Code need a translation layer when running with local models?

Claude Code is built around Anthropic’s messages API, but LM Studio’s local server exposes an OpenAI-compatible chat completions endpoint. LiteLLM acts as a bridge: it receives Claude Code-style requests and translates them into the OpenAI-compatible format that LM Studio can serve, letting Claude Code communicate with a local model seamlessly.
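The two wire formats differ mainly in endpoint path and payload shape. As a rough sketch (field names from the public Anthropic and OpenAI API references; details simplified):

```shell
# Anthropic messages API shape -- what Claude Code emits:
#   POST /v1/messages
#   {"model": "...", "max_tokens": 1024, "system": "...",
#    "messages": [{"role": "user", "content": "..."}]}
#
# OpenAI chat completions shape -- what LM Studio serves:
#   POST /v1/chat/completions
#   {"model": "...",
#    "messages": [{"role": "system", "content": "..."},
#                 {"role": "user", "content": "..."}]}
#
# LiteLLM rewrites the former into the latter: the system prompt moves
# into the messages array, and unsupported fields are remapped or dropped.
echo "LiteLLM translates /v1/messages -> /v1/chat/completions"
```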

What role does LM Studio play in the local Claude Code workflow?

LM Studio hosts the model and provides the API endpoint. The transcript instructs downloading and loading “Qwen3 Coder 30B,” enabling the local server, and noting the endpoint address. That endpoint is what LiteLLM uses to send chat-completions requests to the model running on the Mac.

What must be configured in LiteLLM to prevent API mismatches?

A config.yaml file maps Claude Code model aliases to the actual LM Studio model ID. It also drops Anthropic-specific parameters so LiteLLM doesn’t forward fields that the local OpenAI-compatible interface can’t handle. This mapping and parameter cleanup are described as crucial for avoiding errors and ensuring smooth communication.

How can you confirm the local stack works before installing Claude Code?

After starting the LiteLLM proxy server, the transcript recommends a curl test that sends a simple message to the proxy. If the response comes back successfully from the “Qwen3 Coder 30B” model, it confirms that LM Studio and LiteLLM are connected correctly and that Claude Code will have a working local target.

What changes when Claude Code is pointed at the local proxy instead of the cloud?

Claude Code behaves like its cloud-connected mode—reading and modifying multifile code bases, running tests and shell commands, refactoring, implementing features, and debugging—but requests are routed locally. The transcript emphasizes that no data needs to be sent to the cloud, which improves privacy and eliminates usage fees.

What performance tradeoffs should be expected with local models?

Local inference can be slower than hosted cloud models because it depends on the Mac’s compute and memory. The transcript notes that performance is generally strong on Apple Silicon if there’s enough RAM for the chosen model size (a 30B-parameter model in this setup) and suggests “Qwen3 Coder 3B” for faster responses.

Review Questions

  1. What specific API incompatibility exists between Claude Code and typical local LLM runtimes, and how does LiteLLM resolve it?
  2. Which files and environment variables must be set so Claude Code routes requests to the local proxy rather than any cloud endpoint?
  3. How do model size choices (e.g., “Qwen3 Coder 30B” vs “Qwen3 Coder 3B”) affect speed and task suitability in this setup?

Key Points

  1. Claude Code expects Anthropic’s messages API, so local setups need a compatibility bridge to talk to OpenAI-compatible local endpoints.

  2. LM Studio should host the chosen open-source model (e.g., “Qwen3 Coder 30B”) and expose an OpenAI-compatible chat completions endpoint.

  3. LiteLLM runs as a local proxy/translation layer and requires a config.yaml that maps Claude Code model aliases to the LM Studio model ID.

  4. LiteLLM configuration must drop Anthropic-specific parameters to avoid request errors when forwarding to LM Studio.

  5. Verify the LM Studio↔LiteLLM connection with a curl test before installing or launching Claude Code.

  6. Point Claude Code to the local LiteLLM proxy using environment variables so all inference stays on the Mac.

  7. Local models may be slower than cloud models, so choose model size (30B vs 3B) based on the speed vs capability tradeoff.

Highlights

The key unlock is protocol translation: LiteLLM bridges Claude Code’s Anthropic-style requests to LM Studio’s OpenAI-compatible API.
Once Claude Code is configured to use the local proxy, multifile coding tasks run entirely on the Mac with no cloud data transfer.
The setup’s reliability depends on config.yaml mapping and removing Anthropic-specific parameters that would otherwise break local calls.
Model choice matters: “Qwen3 Coder 3B” is positioned for faster responses, while “Qwen3 Coder 30B” targets heavier repository work.

Topics

  • Local Agentic Coding
  • Apple Silicon
  • LM Studio
  • LiteLLM Proxy
  • Claude Code Offline Setup
