Ollama - Local Models on your machine
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.
Ollama provides a simple command-line workflow to run local LLMs by downloading model weights and serving them through a local API.
Briefing
Ollama is a user-friendly way to run large language models locally on a Mac or Linux machine by downloading them and serving them through a local API—no complex setup required. The practical payoff is speed and accessibility: instead of wrestling with model files and tooling, non-experts can pull down models, try prompts, and compare behaviors on their own hardware. Windows support is expected soon, which would broaden the audience beyond Mac and Linux users.
Beyond LLaMA-2, Ollama supports a wider catalog of models, including uncensored LLaMA variants, CodeLLaMA, Falcon, and newer additions such as Mistral. It also includes open-source fine-tunes like Vicuna and WizardCoder (including Wizard uncensored). That variety matters because it lets users test different instruction styles, coding abilities, and policy behaviors without changing their workflow—download once, then run and chat.
Getting started is straightforward: download and install Ollama from its website, then use the command line to interact with it. Once running, Ollama serves the selected model locally, so prompts can be sent to the model through commands rather than through a separate server setup. The workflow centers on a small set of terminal commands: `ollama list` to see what’s installed, `ollama run <model>` to download (if needed) and start chatting, and `ollama pull <model>` when downloading is separated from running.
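As a minimal sketch of that workflow (the `llama2` tag below is illustrative; substitute whatever model name appears in `ollama list` or the Ollama library):

```bash
# See which models are already installed locally
ollama list

# Download a model's weights without starting a chat session
ollama pull llama2

# Download the weights if needed, then start an interactive chat
ollama run llama2
```

The only practical difference between `pull` and `run` is whether an interactive session starts: `run` fetches the weights first when they are not already present.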
A key moment in the setup is the first model download. Running LLaMA-2 instruct triggers a manifest download followed by a multi-gigabyte model download (about 3.8 GB in the example). After the weights finish downloading, the interactive prompt becomes responsive quickly. Ollama also provides operational details: users can query available commands and enable verbose output to see performance metrics such as tokens-per-second.
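A hedged sketch of checking those operational details; the `--verbose` flag and the in-session slash commands below reflect current Ollama releases, and the exact statistics printed vary by version:

```bash
# Start a chat that prints per-response statistics (e.g. eval rate in tokens/s)
ollama run llama2 --verbose

# Inside the interactive session:
#   /?    lists the available commands
#   /bye  exits the chat
```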
The transcript also highlights how Ollama handles model behavior differences. LLaMA-2 instruct is described as censored, so the demonstration switches to an uncensored chat model. The uncensored model is stored in GGML format and uses four-bit quantization, which reduces resource demands.
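To inspect how a given model is stored (quantization level, parameters, prompt template), the CLI's `show` command prints that metadata. The model tag here is an assumption, and the output fields differ across Ollama versions:

```bash
# Inspect a model's metadata, including its quantization level
ollama show llama2-uncensored

# Print the underlying model file (base model, parameters, system prompt)
ollama show llama2-uncensored --modelfile
```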
Finally, Ollama supports customization through model “files.” A user can create a new model definition (e.g., “Hogwarts”) that sets hyperparameters like temperature and, crucially, a system prompt that forces a persona and topic boundaries. After saving the model file and creating the model, running `ollama run Hogwarts` produces responses “in character” (Professor Albus Dumbledore) and constrained to Hogwarts and wizardry. Model management is also built in: models can be removed without necessarily deleting shared underlying weights if other models still reference them.
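A sketch of that customization flow, following the pattern described in the video; the base model tag, file name, and system prompt wording are assumptions:

```bash
# Write a model file that layers a persona and hyperparameters on a base model
cat > Modelfile <<'EOF'
FROM llama2
PARAMETER temperature 1
SYSTEM """
You are Professor Albus Dumbledore, headmaster of Hogwarts.
Only answer questions about Hogwarts, its students, and the wizarding world.
"""
EOF

# Register the custom model under its own name, then chat with it
ollama create hogwarts -f Modelfile
ollama run hogwarts

# Remove it later; shared base weights remain if other models still reference them
ollama rm hogwarts
```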
Overall, Ollama positions local LLM experimentation as a practical, repeatable workflow—one that can later connect to tools like LangChain for local development and testing, while keeping the barrier to entry low for everyday users.
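Because the models sit behind a local HTTP API, external tools (LangChain included) can reuse the same running instance. As a hedged illustration, the endpoint below is Ollama's documented default (port 11434); the source only mentions "a local API", so treat the exact path and fields as assumptions:

```bash
# Send a single prompt to the locally served model over HTTP
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```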
Cornell Notes
Ollama makes local large language model use practical by letting users download and run models on Mac and Linux through a command-line workflow that serves a local API. It supports more than LLaMA-2, including uncensored LLaMA variants, CodeLLaMA, Falcon, Mistral, and fine-tunes such as Vicuna and WizardCoder. The first run downloads large model weights (e.g., LLaMA-2 instruct at about 3.8 GB), and verbose mode can show performance like tokens per second. Ollama also supports custom model definitions via model files, where users set hyperparameters and a system prompt to create personas and topic constraints. Models can be removed while preserving shared weights when other models still depend on them.
What problem does Ollama solve for local LLM users, and why does it matter?
How does the basic workflow work in practice (download vs run vs list)?
How can users check performance or operational details while chatting?
Why does model choice affect behavior, and what formats are involved?
How does Ollama enable custom personas or constraints without changing code?
What happens when a model is removed—are weights deleted or reused?
Review Questions
- What command(s) would you use to download a model without starting a chat session, and how does that differ from running a model directly?
- How does creating a custom model file change the model’s behavior compared with using a stock model like LLaMA-2 instruct?
- What does verbose mode reveal during generation, and why might that be useful when choosing between models?
Key Points
1. Ollama provides a simple command-line workflow to run local LLMs by downloading model weights and serving them through a local API.
2. Mac OS and Linux are supported now, with Windows support described as coming soon.
3. Model availability goes beyond LLaMA-2, including uncensored variants, CodeLLaMA, Falcon, Mistral, Vicuna, and WizardCoder.
4. First-time model use may require downloading multi-gigabyte weights (about 3.8 GB for LLaMA-2 instruct in the example).
5. Verbose mode can display generation performance such as tokens per second, helping users compare models.
6. Custom model files let users set hyperparameters and a system prompt to create personas and constrain topics (e.g., "Hogwarts").
7. Removing a model may not delete shared weights if other installed models still depend on them.