
Ollama - Local Models on your machine

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Ollama provides a simple command-line workflow to run local LLMs by downloading model weights and serving them through a local API.

Briefing

Ollama is a user-friendly way to run large language models locally on a Mac or Linux machine by downloading them and serving them through a local API—no complex setup required. The practical payoff is speed and accessibility: instead of wrestling with model files and tooling, non-experts can pull down models, try prompts, and compare behaviors on their own hardware. Windows support is expected soon, which would broaden the audience beyond Mac and Linux users.
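As a rough sketch of what that local API looks like once Ollama is running (the endpoint and default port follow Ollama's published REST API; the model name is just an example), a generation request can be sent with curl:

```shell
# Ask the locally served model for a completion via Ollama's REST API.
# Assumes the Ollama server (or desktop app) is running on the default port 11434.
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```

Tools like LangChain talk to this same endpoint, which is what makes Ollama easy to slot into local development later.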

Beyond LLaMA-2, Ollama supports a wider catalog of models, including uncensored LLaMA variants, CodeLLaMA, Falcon, and newer additions such as Mistral. It also includes open-source fine-tunes like Vicuna and WizardCoder (including Wizard uncensored). That variety matters because it lets users test different instruction styles, coding abilities, and policy behaviors without changing their workflow—download once, then run and chat.

Getting started is straightforward: download and install Ollama from its website, then use the command line to interact with it. Once running, Ollama serves the selected model locally, so prompts can be sent to the model through commands rather than through a separate server setup. The workflow centers on a small set of terminal commands: `ollama list` to see what’s installed, `ollama run <model>` to download (if needed) and start chatting, and `ollama pull <model>` when downloading is separated from running.
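Assuming a standard install, those three commands look like this in a terminal (the `llama2` tag is illustrative; any model name from the Ollama library works the same way):

```shell
# See which models are already installed locally.
ollama list

# Download weights without starting a chat session.
ollama pull llama2

# Start an interactive chat; downloads the model first if it's missing.
ollama run llama2
```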

A key moment in the setup is the first model download. Running LLaMA-2 instruct triggers a manifest download followed by a multi-gigabyte model download (about 3.8 GB in the example). After the weights finish downloading, the interactive prompt becomes responsive quickly. Ollama also provides operational details: users can query available commands and enable verbose output to see performance metrics such as tokens-per-second.

The transcript also highlights how Ollama handles model behavior differences. LLaMA-2 instruct is described as censored, so the demonstration switches to an uncensored chat model. The uncensored model is stored in GGML format with four-bit quantization, which lowers the memory needed to run it locally.

Finally, Ollama supports customization through “model files.” A user can create a new model definition (e.g., “Hogwarts”) that sets hyperparameters like temperature and, crucially, a system prompt that forces a persona and topic boundaries. After saving the model file and creating the model, running `ollama run Hogwarts` produces responses “in character” (Professor Albus Dumbledore) and constrained to Hogwarts and wizardry. Model management is also built in: models can be removed without necessarily deleting shared underlying weights if other models still reference them.
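A minimal sketch of such a model file, using Ollama's Modelfile syntax (`FROM` / `PARAMETER` / `SYSTEM`); the persona text here paraphrases the example from the video rather than quoting it:

```shell
# Write a model file that layers a persona on top of the base llama2 weights.
cat > Modelfile.hogwarts <<'EOF'
FROM llama2
PARAMETER temperature 1
SYSTEM """
You are Professor Albus Dumbledore. Answer only questions about
Hogwarts and the world of wizardry, staying in character at all times.
"""
EOF

# Register the new model under its own name, then chat with it.
ollama create Hogwarts -f Modelfile.hogwarts
ollama run Hogwarts
```

Because the new model reuses the base weights via `FROM`, creating it is fast and costs little extra disk space.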

Overall, Ollama positions local LLM experimentation as a practical, repeatable workflow—one that can later connect to tools like LangChain for local development and testing, while keeping the barrier to entry low for everyday users.

Cornell Notes

Ollama makes local large language model use practical by letting users download and run models on Mac and Linux through a command-line workflow that serves a local API. It supports more than LLaMA-2, including uncensored LLaMA variants, CodeLLaMA, Falcon, Mistral, and fine-tunes such as Vicuna and WizardCoder. The first run downloads large model weights (e.g., LLaMA-2 instruct at about 3.8 GB), and verbose mode can show performance like tokens per second. Ollama also supports custom model definitions via model files, where users set hyperparameters and a system prompt to create personas and topic constraints. Models can be removed while preserving shared weights when other models still depend on them.

What problem does Ollama solve for local LLM users, and why does it matter?

Ollama reduces the friction of running large language models locally. Instead of manually handling model files and complex setup, users install Ollama, then use simple commands to list models and run them. The tool downloads model weights automatically when needed and serves the model locally, making experimentation fast—especially for people who aren’t comfortable with technical model-management workflows. Windows support is expected soon, which would further expand access.

How does the basic workflow work in practice (download vs run vs list)?

The workflow is centered on terminal commands: `ollama list` shows installed models; `ollama run <model>` starts a model and triggers a download if the model isn’t installed; `ollama pull <model>` downloads weights separately. In the example, running LLaMA-2 instruct first pulls a manifest and then downloads roughly a 3.8 GB model before interactive chatting becomes available.

How can users check performance or operational details while chatting?

Ollama supports introspection commands and verbose output. The transcript demonstrates using a help-style query (via `/` commands) to see available commands, then enabling verbose mode to display tokens-per-second during generation. This turns local experimentation into something measurable, not just qualitative.
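Inside the interactive session, the slash commands look roughly like this (command names match Ollama's interactive help, though exact names may vary by version):

```shell
ollama run llama2
# Inside the interactive session:
#   /?            list the available slash commands
#   /set verbose  show timing stats (e.g., tokens per second) after each reply
#   /bye          exit the session
```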

Why does model choice affect behavior, and what formats are involved?

Model behavior can differ significantly, including whether responses are censored. The transcript notes that LLaMA-2 instruct is censored, so it switches to an uncensored chat model. It also points out that the uncensored model is stored in GGML format and uses a four-bit quantized approach, which is relevant to the RAM requirements for running models locally.

How does Ollama enable custom personas or constraints without changing code?

Users create a model file that defines a new model name (e.g., “Hogwarts”) and sets parameters like temperature plus a system prompt. The example system prompt sets the assistant as Professor Dumbledore and restricts responses to Hogwarts and wizardry. After saving and creating the model, running `ollama run Hogwarts` produces in-character answers and topic-limited guidance.

What happens when a model is removed—are weights deleted or reused?

Removing a model can preserve shared weights. In the example, deleting the “Mario” model leaves LLaMA-2 weights intact because other installed models still reference them. Only when all dependent models are removed would the underlying weights be deleted.
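The removal step described above can be sketched as follows (the “Mario” name comes from the example; `ollama rm` is the removal command):

```shell
# Remove a derived model; shared base weights survive as long as
# another installed model (here, llama2 itself) still references them.
ollama rm Mario

# Confirm that the base model is still listed.
ollama list
```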

Review Questions

  1. What command(s) would you use to download a model without starting a chat session, and how does that differ from running a model directly?
  2. How does creating a custom model file change the model’s behavior compared with using a stock model like LLaMA-2 instruct?
  3. What does verbose mode reveal during generation, and why might that be useful when choosing between models?

Key Points

  1. Ollama provides a simple command-line workflow to run local LLMs by downloading model weights and serving them through a local API.

  2. Mac OS and Linux are supported now, with Windows support described as coming soon.

  3. Model availability goes beyond LLaMA-2, including uncensored variants, CodeLLaMA, Falcon, Mistral, Vicuna, and WizardCoder.

  4. First-time model use may require downloading multi-gigabyte weights (about 3.8 GB for LLaMA-2 instruct in the example).

  5. Verbose mode can display generation performance such as tokens per second, helping users compare models.

  6. Custom model files let users set hyperparameters and a system prompt to create personas and constrain topics (e.g., “Hogwarts”).

  7. Removing a model may not delete shared weights if other installed models still depend on them.

Highlights

Ollama turns local LLM experimentation into a download-and-run workflow, with a local API behind the scenes.
Running LLaMA-2 instruct first pulls a manifest and then downloads large weights (about 3.8 GB) before interactive chatting starts.
Custom system prompts in model files can make the assistant adopt a specific persona and stay within defined subject boundaries.
Verbose output provides tokens-per-second, giving immediate feedback on local generation speed.
Model removal can preserve shared weights when multiple models reference the same underlying parameters.