
Run LLMs Locally With Docker Model Runner

Krish Naik · 5 min read

Based on Krish Naik's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Docker Desktop 4.40+ is required to use Docker Model Runner.

Briefing

Running open-source LLMs locally is now straightforward with Docker Model Runner, provided Docker Desktop is up to date and a few settings are enabled. The workflow matters because it lets developers test multiple models on their own machine, without cloud calls, while still integrating with familiar tooling. The setup works on both Mac and Windows, with Docker Desktop version 4.40+ called out as the minimum requirement.

The process starts by installing Docker Desktop and then checking the version in Docker Desktop settings. A key requirement is enabling Docker Model Runner under “Features in development.” To make the local model accessible from code, “Enable host-side TCP support” must also be turned on, followed by applying the change and restarting Docker Desktop. For users who prefer the command line, the same capability can be enabled via a Docker Desktop CLI command that activates the Model Runner and sets the TCP port (the transcript references port 12434).
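
For reference, the CLI route looks roughly like this, based on Docker’s documented command; exact flags may differ by Docker Desktop version:

    # Enable Docker Model Runner from the terminal instead of the settings UI.
    # The --tcp flag exposes the local endpoint on the given host port.
    docker desktop enable model-runner --tcp 12434

    # Turn the feature off again if needed.
    docker desktop disable model-runner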

Once enabled, the setup can be verified with Docker Model Runner commands. “docker model status” confirms the runner is running. “docker model help” lists the operational commands: inspect (detailed model info), list (available models), logs (runtime logs), pull (download models), push (upload to Docker Hub), rm (remove downloaded models), run (start a model), plus tag/version utilities. This command set is the backbone of local model management.
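
A quick sanity check from the terminal might look like this (output abbreviated; exact wording varies by version):

    # Confirm the runner is up.
    docker model status

    # List the management subcommands summarized above.
    docker model help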

Model availability is handled through Docker Hub. Searching for Llama-family models there reveals multiple open-source options, including variants like Llama 3.3 and Llama 3.1. The transcript then focuses on a smaller model, ai/smollm2 (SmolLM2), described as a compact, speed-oriented language model built for efficient local use. Pulling it uses “docker model pull ai/smollm2,” and the transcript notes the model size is roughly a few hundred MB (about 256.35 MB for the referenced variant). Running it uses “docker model run ai/smollm2,” which launches an interactive chat session where prompts return responses like a local chatbot.
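
The pull-and-run loop from the transcript, written out with the ai/smollm2 tag as published on Docker Hub:

    # Download the compact SmolLM2 model from Docker Hub (~256 MB).
    docker model pull ai/smollm2

    # Start an interactive chat session with the model.
    docker model run ai/smollm2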

Beyond chat mode, the runner can also take a prompt directly on the command line, returning a single response without maintaining a conversational session. A major practical advantage is compatibility with the OpenAI API library: the local model is exposed through a localhost base URL on the enabled TCP port. In the example, Python code imports the OpenAI client, points base_url at localhost:12434, and calls the chat completions endpoint with the local model name (ai/smollm2). The transcript also demonstrates streaming responses line by line by setting stream=True, and mentions function/tool calling support via the same OpenAI-style interface.
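
For illustration, here is a minimal sketch of that integration. The /engines/v1 path prefix follows Docker’s documented host endpoint and may differ from the exact URL shown in the video; the api_key value is a placeholder the client requires even though no real key is used locally:

    from openai import OpenAI

    # Point the standard OpenAI client at the local Model Runner endpoint.
    client = OpenAI(
        base_url="http://localhost:12434/engines/v1",
        api_key="docker",  # placeholder; no real key needed locally
    )

    response = client.chat.completions.create(
        model="ai/smollm2",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain Docker Model Runner in one sentence."},
        ],
    )
    print(response.choices[0].message.content)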

Overall, the takeaway is a repeatable local development loop: update Docker Desktop, enable Docker Model Runner with TCP access, pull and run an open-source model from Docker Hub, and then integrate it into applications using OpenAI-compatible client code—on Mac or Windows—without changing the developer workflow.

Cornell Notes

Docker Model Runner lets developers run open-source LLMs locally using Docker Desktop on both Mac and Windows. The minimum Docker Desktop version mentioned is 4.40, and setup requires enabling “Docker Model Runner” plus “host-side TCP support” so applications can reach the model over a localhost port (referenced as 12434). After enabling, commands like docker model status and docker model help confirm the runner is working and list the available operations (pull, run, list, inspect, logs, rm, etc.). Models are pulled from Docker Hub (example: ai/smollm2), then started with docker model run ai/smollm2 for interactive chat or direct prompt execution. The local server is OpenAI-library compatible, enabling Python code to call chat completions using the local base_url and model name, including streaming and tool-calling patterns.

What prerequisites and settings are required before Docker Model Runner can serve LLMs locally?

Docker Desktop must be installed and meet the minimum version requirement of 4.40 (beta availability is referenced for 4.40+ as of April 4, 2025). In Docker Desktop settings, “Features in development” must be opened and “Enable Docker Model Runner” turned on. To let code connect to the model, “Enable host-side TCP support” must also be enabled, followed by Apply and a Docker Desktop restart.

How can someone verify that Docker Model Runner is actually running?

After enabling the feature, the transcript uses the command docker model status. When it returns a status indicating the model runner is running, the local service is ready. It also uses docker model help to confirm the available command set for managing and running models.

Which Docker Model Runner commands are most useful for working with local LLMs?

The transcript highlights the docker model help listing: inspect (detailed info for one model), list (available models), logs (runtime logs), pull (download a model), push (upload to Docker Hub), rm (remove downloaded models), and run (start a model). It also mentions tag and version-related options for model management. Typical invocations are sketched below.
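
For concreteness, these subcommands might be used as follows, with the model name taken from the transcript’s example:

    # Show metadata for a downloaded model.
    docker model inspect ai/smollm2

    # List models available locally.
    docker model list

    # Tail runtime logs from the runner.
    docker model logs

    # Remove a downloaded model to free disk space.
    docker model rm ai/smollm2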

How does the workflow connect Docker Hub models to local execution?

Models are discovered on Docker Hub (including Llama-family variants). To use a model locally, the user pulls it with docker model pull <namespace/model>, then starts it with docker model run <namespace/model>. The transcript demonstrates this with ai/smollm2, pulling it first and then running it to create a local chat session.
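
Beyond the interactive session, a prompt can also be passed directly as an argument for a one-shot response (same ai/smollm2 model assumed):

    # One-shot prompt: returns a single completion, no chat session.
    docker model run ai/smollm2 "What is Docker Model Runner?"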

How does OpenAI compatibility work with a locally running model?

With host-side TCP support enabled, the model is reachable via a localhost base URL on the TCP port (referenced as 12434). The transcript shows Python code using from openai import OpenAI, setting base_url to localhost:12434, and calling client.chat.completions.create with model set to ai/smollm2. The endpoint path shown includes /v1 (in the /v1/chat/completions style).

How can streaming and tool/function calling be used with the local setup?

Streaming is enabled by setting stream=True in the OpenAI-style call, causing output to appear incrementally rather than all at once. The transcript also mentions tool calling/function calling support through the same OpenAI library approach, with code provided in the video description for that pattern.
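
A streaming variant of the earlier sketch, under the same /engines/v1 endpoint assumption; tool calling would use the standard OpenAI tools parameter against the same client:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:12434/engines/v1", api_key="docker")

    # stream=True yields chunks as they are generated instead of one final message.
    stream = client.chat.completions.create(
        model="ai/smollm2",
        messages=[{"role": "user", "content": "Write a haiku about containers."}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
    print()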

Review Questions

  1. What exact Docker Desktop features must be enabled to allow local applications to connect to Docker Model Runner, and why is TCP support necessary?
  2. How would you pull and run a new LLM from Docker Hub using the Docker Model Runner command set?
  3. In the OpenAI-compatible Python example, which parameters determine the local server address and the model name?

Key Points

  1. Docker Desktop 4.40+ is required to use Docker Model Runner.

  2. Enable “Docker Model Runner” and “host-side TCP support” in Docker Desktop, then restart to expose the local model over TCP.

  3. Use docker model status to confirm the runner is running, and docker model help to access commands like pull, run, list, inspect, logs, and rm.

  4. Pull models from Docker Hub with docker model pull <namespace/model> and start them locally with docker model run <namespace/model>.

  5. The transcript’s example model ai/smollm2 runs as an interactive chatbot and can also be prompted directly during runtime.

  6. Local models are OpenAI-library compatible via a localhost base_url using the enabled TCP port (referenced as 12434), enabling chat completions from Python.

  7. Streaming (stream=True) and tool/function calling patterns work through the same OpenAI-style client interface.

Highlights

Docker Model Runner turns Docker Desktop into a local LLM service that can be accessed from code over a localhost TCP port (12434 referenced).
A single model name from Docker Hub (example: ai/smollm2) becomes the model identifier used in OpenAI-compatible Python calls.
Streaming responses can be enabled with stream=True, producing token-by-token style output from a locally hosted model.

Topics

  • Docker Model Runner
  • Local LLMs
  • Docker Desktop Setup
  • Docker Hub Models
  • OpenAI Compatibility