Ollama - Loading Custom Models
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Ollama can run fine-tuned models that aren't already listed in its library: download the right quantized weights (GGUF) and create a small Ollama Modelfile that points to those files. The workflow matters because many popular models on Hugging Face are released as fine-tunes, and Ollama won't make them available automatically unless the model is packaged in the format it expects.
The process starts with choosing a target model. In the walkthrough, the custom model is "Jackalope," a 7B fine-tune of Mistral 7B. The key step is selecting the quantized GGUF weights from the model's files: specifically, the GGUF artifacts published by TheBloke, who converts original weights into the GGUF format. GGUF comes from the llama.cpp ecosystem, which Ollama uses under the hood to run quantized models efficiently.
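For orientation, TheBloke's GGUF repositories typically publish one file per quantization level. The names and rough sizes below are illustrative of a 7B model's options, not a verbatim listing of the repository:

```text
jackalope-7b.Q4_K_M.gguf   # ~4.4 GB, smaller with some quality loss
jackalope-7b.Q5_K_M.gguf   # ~5.1 GB, good quality/size balance
jackalope-7b.Q6_K.gguf     # ~5.9 GB, near full quality, "reasonably big"
```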
Next comes picking a quantization level. The transcript notes multiple GGUF options that trade off quality against size; the example chooses Q6_K as a "reasonably big" quantization. After selecting the file, the weights are downloaded and placed into the user's local model folder so they can be referenced later.
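One way to fetch the chosen file is the Hugging Face CLI. The repo and file names here follow TheBloke's usual naming convention and should be confirmed on the model page:

```bash
# Download just the Q6_K file into the current directory
# (repo/file names assumed from TheBloke's convention; verify on Hugging Face)
huggingface-cli download TheBloke/Jackalope-7B-GGUF jackalope-7b.Q6_K.gguf --local-dir .
```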
With the weights in place, the user creates an Ollama Modelfile (saved as a plain text file). Its FROM instruction points directly at the downloaded weights, rather than naming a library model like Llama 2. The Modelfile also includes a prompt template with a system-prompt slot, allowing the system instructions to be filled in at runtime. Once saved, running the build against that Modelfile generates the model: Ollama creates the model layers, writes the weights, and ends with a success message.
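A minimal Modelfile sketch along these lines (the ChatML-style template is an assumption based on the model's OpenOrca lineage; substitute whatever prompt format the model card specifies):

```text
# Modelfile: FROM points at local GGUF weights instead of a library model name
FROM ./jackalope-7b.Q6_K.gguf

# Template filled at runtime; {{ .System }} and {{ .Prompt }} are Ollama's
# built-in slots. The ChatML markers below are assumed, not confirmed.
TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER stop "<|im_end|>"
SYSTEM """You are a helpful assistant."""
```

Building from it is then a single command:

```bash
ollama create jackalope -f Modelfile
```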
After creation, the new model appears in Ollama's model list. From there, it can be used like any other model: running it by name (e.g., "ollama run jackalope") and chatting normally. The transcript also highlights that the command set and usage behavior remain consistent with existing Ollama models.
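In practice that looks like:

```bash
ollama list            # the new model now appears alongside pulled models
ollama run jackalope   # chat with it exactly like a built-in model
```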
Finally, the walkthrough sets practical expectations: not every model will work, but fine-tunes of Llama 2, Mistral 7B, and Falcon are expected to function with this approach. The overall takeaway is that custom model support in Ollama is less about waiting for official listings and more about correctly selecting GGUF quantized weights and wiring them into an Ollama Modelfile.
Cornell Notes
Custom models can be added to Ollama by downloading quantized GGUF weights (the format Ollama consumes via llama.cpp), placing them in a local models folder, and creating an Ollama Modelfile that points to those weights. The example uses "Jackalope," a 7B fine-tune of Mistral 7B, and selects a GGUF quantization option (Q6_K) from TheBloke's converted files. The Modelfile includes a system-prompt template so runtime prompts can be customized. After running the build, the new model appears in Ollama's model list and can be used with the usual "ollama run <name>" workflow. This method should work for many Llama 2, Mistral 7B, and Falcon fine-tunes found on Hugging Face.
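Condensed into commands, the whole workflow sketched above (repo and file names assumed as before) is roughly:

```bash
huggingface-cli download TheBloke/Jackalope-7B-GGUF jackalope-7b.Q6_K.gguf --local-dir .
ollama create jackalope -f Modelfile   # the Modelfile's FROM points at the .gguf
ollama run jackalope
```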
- Why does the GGUF file matter when adding a custom model to Ollama?
- How does someone choose between different GGUF quantization options like Q6_K?
- What does the Ollama Modelfile need to do for a custom model?
- What happens after the Modelfile is created and run?
- Once created, how is the custom model used compared with built-in models?
- Which kinds of fine-tuned models are most likely to work with this approach?
Review Questions
- What two requirements must a Hugging Face model meet to be added to Ollama using this method?
- In the Modelfile, what replaces a standard model name like Llama 2, and why?
- How does the quantization choice (e.g., Q6_K) affect the custom model setup?
Key Points
1. Download the quantized GGUF weights for the target fine-tuned model, since Ollama uses llama.cpp-compatible formats.
2. Select a GGUF quantization level (such as Q6_K) based on the quality vs. resource trade-off you want.
3. Place the downloaded GGUF file into a local models folder so the Modelfile can reference it.
4. Create an Ollama Modelfile whose FROM instruction points to the downloaded weights rather than a built-in model name.
5. Include a system-prompt template in the Modelfile so system instructions can be customized at runtime.
6. Build the model with "ollama create"; confirm success and then use the new model from Ollama's model list (a verification sketch follows this list).
7. Expect compatibility to be strongest for fine-tunes of Llama 2, Mistral 7B, and Falcon models that provide appropriate GGUF files.
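To verify what was built (the sketch referenced in point 6), Ollama can print back the Modelfile it stored for the new model:

```bash
# Inspect the stored Modelfile for the newly created model
ollama show jackalope --modelfile
```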