Ollama - Loading Custom Models
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Ollama can run fine-tuned models that aren't already listed in its library: download the right quantized weights (GGUF) and create a small Ollama Modelfile that points to those files. The workflow matters because many popular models on Hugging Face are released as fine-tunes, and Ollama won't make them available automatically unless the model is packaged in the format it expects.
The process starts with choosing a target model. In the walkthrough, the custom model is "Jackalope," a 7B fine-tune of Mistral 7B. The key step is selecting the quantized GGUF weights from the model's files: specifically, the GGUF artifacts published by TheBloke, who converts original weights into the GGUF format. GGUF comes from the llama.cpp ecosystem, which Ollama uses under the hood to run quantized models efficiently.
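For orientation, TheBloke's GGUF repositories typically publish one file per quantization level. The names and rough sizes below are illustrative of a 7B model's options, not a verbatim listing of the repository:

```text
jackalope-7b.Q4_K_M.gguf   # ~4.4 GB, smaller with some quality loss
jackalope-7b.Q5_K_M.gguf   # ~5.1 GB, good quality/size balance
jackalope-7b.Q6_K.gguf     # ~5.9 GB, near full quality, "reasonably big"
```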
Next comes picking a quantization level. The transcript notes multiple GGUF options that trade off quality against size; the example chooses Q6_K as a "reasonably big" quantization. After selecting the file, the weights are downloaded and placed into the user's local model folder so they can be referenced later.
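One way to fetch the chosen file is the Hugging Face CLI. The repo and file names here follow TheBloke's usual naming convention and should be confirmed on the model page:

```bash
# Download just the Q6_K file into the current directory
# (repo/file names assumed from TheBloke's convention; verify on Hugging Face)
huggingface-cli download TheBloke/Jackalope-7B-GGUF jackalope-7b.Q6_K.gguf --local-dir .
```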
With the weights in place, the user creates an Ollama Modelfile (saved as a plain text file). Its FROM instruction points directly at the downloaded weights, rather than naming a library model like Llama 2. The Modelfile also includes a prompt template with a system-prompt slot, allowing the system instructions to be filled in at runtime. Once saved, running the build against that Modelfile generates the model: Ollama creates the model layers, writes the weights, and ends with a success message.
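A minimal Modelfile sketch along these lines (the ChatML-style template is an assumption based on the model's OpenOrca lineage; substitute whatever prompt format the model card specifies):

```text
# Modelfile: FROM points at local GGUF weights instead of a library model name
FROM ./jackalope-7b.Q6_K.gguf

# Template filled at runtime; {{ .System }} and {{ .Prompt }} are Ollama's
# built-in slots. The ChatML markers below are assumed, not confirmed.
TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER stop "<|im_end|>"
SYSTEM """You are a helpful assistant."""
```

Building from it is then a single command:

```bash
ollama create jackalope -f Modelfile
```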
After creation, the new model appears in Ollama's model list. From there, it can be used like any other model: running it by name (e.g., "ollama run jackalope") and chatting normally. The transcript also highlights that the command set and usage behavior remain consistent with existing Ollama models.
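In practice that looks like:

```bash
ollama list            # the new model now appears alongside pulled models
ollama run jackalope   # chat with it exactly like a built-in model
```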
Finally, the walkthrough sets practical expectations: not every model will work, but fine-tunes of Llama 2, Mistral 7B, and Falcon are expected to function with this approach. The overall takeaway is that custom model support in Ollama is less about waiting for official listings and more about correctly selecting GGUF quantized weights and wiring them into an Ollama Modelfile.
Cornell Notes
Custom models can be added to Ollama by downloading quantized GGUF weights (the format Ollama consumes via llama.cpp), placing them in a local models folder, and creating an Ollama Modelfile that points to those weights. The example uses "Jackalope," a 7B fine-tune of Mistral 7B, and selects a GGUF quantization option (Q6_K) from TheBloke's converted files. The Modelfile includes a system-prompt template so runtime prompts can be customized. After running the build, the new model appears in Ollama's model list and can be used with the usual "ollama run <name>" workflow. This method should work for many Llama 2, Mistral 7B, and Falcon fine-tunes found on Hugging Face.
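Condensed into commands, the whole workflow sketched above (repo and file names assumed as before) is roughly:

```bash
huggingface-cli download TheBloke/Jackalope-7B-GGUF jackalope-7b.Q6_K.gguf --local-dir .
ollama create jackalope -f Modelfile   # the Modelfile's FROM points at the .gguf
ollama run jackalope
```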
- Why does the GGUF file matter when adding a custom model to Ollama?
- How does someone choose between different GGUF quantization options like Q6_K?
- What does the Ollama Modelfile need to do for a custom model?
- What happens after the Modelfile is created and run?
- Once created, how is the custom model used compared with built-in models?
- Which kinds of fine-tuned models are most likely to work with this approach?
Review Questions
- What two requirements must a Hugging Face model meet to be added to Ollama using this method?
- In the Modelfile, what replaces a standard model name like Llama 2, and why?
- How does the quantization choice (e.g., Q6_K) affect the custom model setup?
Key Points
1. Download the quantized GGUF weights for the target fine-tuned model, since Ollama uses llama.cpp-compatible formats.
2. Select a GGUF quantization level (such as Q6_K) based on the quality vs. resource trade-off you want.
3. Place the downloaded GGUF file into a local models folder so the Modelfile can reference it.
4. Create an Ollama Modelfile whose FROM instruction points to the downloaded weights rather than a built-in model name.
5. Include a system-prompt template in the Modelfile so system instructions can be customized at runtime.
6. Build the model with "ollama create"; confirm success and then use the new model from Ollama's model list (a verification sketch follows this list).
7. Expect compatibility to be strongest for fine-tunes of Llama 2, Mistral 7B, and Falcon models that provide appropriate GGUF files.
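To verify what was built (the sketch referenced in point 6), Ollama can print back the Modelfile it stored for the new model:

```bash
# Inspect the stored Modelfile for the newly created model
ollama show jackalope --modelfile
```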