
Ollama.ai: A Developer's Quick Start Guide!

AI Arcade · 5 min read

Based on AI Arcade's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Cloud LLM APIs often can't meet the latency demands of real-time UX or the data-handling rules of regulated industries that restrict sending sensitive data to third parties. Ollama sidesteps both constraints by downloading models to the local machine and running inference on-device, accessible through a CLI or a localhost REST API.

Briefing

LLM development is shifting from “cloud-only” APIs to a developer-friendly local workflow in which models are downloaded to a machine, run on-device, and are accessed through either a command-line interface or a local REST API. That shift matters because it cuts latency and avoids sending sensitive data to third-party cloud infrastructure, constraints that often block healthcare, finance, and real-time media use cases.

For years, developers interacted with large language models by installing language-specific packages (Python packages, npm packages) that wrapped cloud API calls and returned JSON responses. But the cloud model approach runs into practical limits: responses can be slow enough to break interactive experiences, and many industries can’t legally or operationally transmit sensitive information to cloud-hosted systems. Real-time scenarios—like live captioning for video calls or automatic captioning for streaming—also don’t fit the “wait several seconds for an API response” pattern. The alternative is running inference on the client, which pushes developers toward browser-based solutions (WebML via TensorFlow.js or Hugging Face Transformers.js) or toward desktop/local execution.

Ollama is presented as the desktop/local execution path: a way to fetch large language models onto a client machine (including consumer GPUs) and run them locally. Instead of loading a model inside a browser and being constrained by web app lifecycle and cache behavior, Ollama keeps models on the device and supports use cases that need local integration—such as plugins for tools like Zoom-style live captioning or audio enhancement workflows where exporting, uploading, and re-aligning tracks would be too cumbersome. The promise is local inference that can be embedded into existing desktop environments.

Setup begins with downloading the Ollama app, then using model “tags” to pull specific variants. Llama 2 is highlighted as a popular starting point, with a default 7B chat-tuned model that’s about 3.8 GB. The transcript emphasizes that model size and hardware requirements scale quickly: a 7B variant may require around 8 GB of RAM, while larger 70B variants can demand dramatically more memory (the example cited is 64 GB RAM) and can be very large on disk (the 70B example is about 138 GB). Variants matter too: “chat” tags correspond to chat fine-tuning, while “text” tags pull the base model without chat fine-tuning.
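
The same tagged pulls can also be triggered programmatically. Below is a minimal sketch using Ollama’s documented `/api/pull` endpoint (the video itself shows the CLI route, e.g. `ollama pull llama2`); the specific tag string and the `requests` dependency are assumptions, not something demonstrated in the transcript.

```python
# Minimal sketch: pulling a tagged Llama 2 variant through the local Ollama server.
# Assumes the Ollama app is installed and listening on its default port (11434);
# the CLI equivalent would be `ollama pull llama2:7b-chat`.
import requests

OLLAMA_URL = "http://localhost:11434"

def pull_model(tag: str) -> None:
    """Ask the local Ollama server to download the model variant named by `tag`."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/pull",
        json={"name": tag, "stream": False},  # stream=False: wait for the final status object
        timeout=None,                         # multi-gigabyte downloads can take a while
    )
    resp.raise_for_status()
    print(tag, "->", resp.json().get("status"))

if __name__ == "__main__":
    pull_model("llama2:7b-chat")  # the ~3.8 GB chat-tuned default cited in the video
```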

Other models broaden the local toolkit. Mistral is described as gaining traction for outperforming the larger Llama 2 13B baseline while staying smaller (a 7B default around 4.1 GB). LLaVA is positioned as a multimodal, open-source alternative to GPT-4-class vision systems, capable of answering questions about images and text without generating images itself. A coding-focused model, Code Llama, is also mentioned, with official support noted for macOS and Linux.

A key workflow demo shows three interaction modes: (1) CLI-based chatting, (2) URL summarization by prompting a locally running model to summarize a long webpage, and (3) multimodal Q&A over local image files. The transcript also demonstrates REST API access via Ollama’s localhost port (11434), using POST requests that specify the model name, a prompt, and a `stream` flag, returning either token-by-token output or a single JSON response when streaming is disabled. The result is a practical blueprint for building applications that rely on local LLM inference while still supporting typical developer patterns like API calls and JSON formatting.
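
As a concrete illustration of that REST mode, a non-streaming call can be as small as the snippet below. This is a sketch, assuming the `requests` package and an already-pulled `llama2` model; the prompt text is made up.

```python
# Minimal sketch: one-shot (non-streaming) generation against the local Ollama API.
import requests

payload = {
    "model": "llama2",   # name/tag of a model that has already been pulled
    "prompt": "Summarize why local LLM inference matters for latency-sensitive apps.",
    "stream": False,     # False -> one JSON object instead of a token-by-token stream
}

resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["response"])   # the full generated text arrives in a single field
```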

Cornell Notes

Ollama shifts large language model development from cloud-only API calls to local execution by downloading models onto a user’s machine and running inference locally. This approach targets latency-sensitive and privacy-sensitive scenarios where sending data to cloud infrastructure is slow or legally constrained. Developers can interact with models through a CLI, run tasks like URL summarization, and query multimodal models using local image files. Ollama also exposes a local REST API (localhost:11434), enabling POST requests that return either streamed tokens or a single JSON response. Model selection matters: tags define chat vs text behavior, and hardware needs scale sharply from smaller 7B models to much larger 70B variants.

Why do developers move away from cloud LLM API calls in certain applications?

Cloud APIs can be too slow for interactive experiences (waiting several seconds breaks real-time UX). More importantly, healthcare and financial workflows may restrict sending sensitive patient or customer information to cloud-hosted models. Real-time media tasks—like live captioning for video calls or streaming—also require on-device inference rather than frame-by-frame cloud requests.

How does Ollama change the deployment model for LLMs?

Ollama downloads selected LLM variants to the client machine and runs them locally. Instead of loading models inside a browser (WebML) and being constrained by page lifecycle and cache behavior, Ollama supports desktop/local workflows and can integrate with local applications. The transcript emphasizes that models can run on consumer GPUs, with larger models requiring more RAM and disk space.

What do model “tags” mean in Ollama, and why do they matter?

Tags select specific model variants and behavior. The transcript uses Llama 2 as an example: the default pull is the 7B chat-tuned model (~3.8 GB), while “text” tags provide a non-chat fine-tuned variant. Hardware requirements scale with size: a 7B variant is cited as needing around 8 GB RAM, while a 70B variant is cited as needing about 64 GB RAM and is much larger on disk (~138 GB).
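
One way to check which tagged variants are actually on disk (and how much space each one takes) is the local tags endpoint. A small sketch, assuming a default install and the `requests` package; `/api/tags` is part of Ollama’s API but is not shown in the video.

```python
# Minimal sketch: listing locally downloaded model variants and their on-disk size.
import requests

resp = requests.get("http://localhost:11434/api/tags", timeout=10)
resp.raise_for_status()

for model in resp.json().get("models", []):
    size_gb = model.get("size", 0) / 1e9          # size is reported in bytes
    print(f"{model['name']:<24} {size_gb:6.1f} GB")
```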

How can developers interact with Ollama locally besides the CLI?

Ollama runs a local web API exposed on port 11434. Developers can send POST requests to endpoints like `/api/generate`, specifying the model name in the request body. A `stream` flag controls output: `stream: true` returns token-by-token streaming, while `stream: false` returns the full response in one JSON object.
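
Handling the streamed form takes only slightly more work, because each line of the response body is a standalone JSON object. A hedged sketch, assuming the `requests` package and a pulled `llama2` model; the prompt is illustrative.

```python
# Minimal sketch: consuming the token-by-token stream from /api/generate.
# Each streamed line is one JSON object with a "response" fragment and a "done" flag.
import json
import requests

payload = {"model": "llama2", "prompt": "Explain model tags in one paragraph.", "stream": True}

with requests.post("http://localhost:11434/api/generate",
                   json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)  # emit tokens as they arrive
        if chunk.get("done"):
            print()   # the model signalled completion
            break
```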

What kinds of tasks does the transcript demonstrate with local models?

It demonstrates CLI chatting (including stopping an instance with Ctrl+D), summarizing a long URL by prompting the model to summarize content on-device, and multimodal Q&A using local images. For multimodal models (LLaVA), the transcript shows the model describing objects and scene context, and it also tests a chart image where the model struggles with reading fine details.
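
The image Q&A flow also works over the local API: images are sent as base64 strings alongside the prompt. A sketch, assuming a pulled LLaVA model and a hypothetical local file `photo.png`.

```python
# Minimal sketch: asking a multimodal model (LLaVA) about a local image file.
import base64
import requests

with open("photo.png", "rb") as f:                  # hypothetical local image path
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "llava",
    "prompt": "What objects are in this picture, and what is the overall scene?",
    "images": [image_b64],                          # base64-encoded images for multimodal models
    "stream": False,
}

resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["response"])
```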

What hardware and OS constraints appear in the transcript?

Model size drives hardware needs: smaller 7B models are feasible on consumer setups, while 70B variants require substantial RAM (example cited: 64 GB). For model availability, the transcript notes that Code Llama was not yet officially available for Windows at the time, with official support for macOS and Linux and mention of workarounds for Windows.

Review Questions

  1. How do privacy and latency constraints change the architecture choice between cloud LLM APIs and local inference?
  2. Explain how Ollama’s model tags affect both behavior (chat vs text) and resource requirements (RAM/disk).
  3. When using Ollama’s REST API, what does the `stream` parameter change about the response format?

Key Points

  1. Cloud LLM APIs often fall short for real-time UX and for regulated industries that restrict sending sensitive data to third parties.
  2. Ollama enables local LLM inference by downloading model variants to the client machine and running them on-device, targeting consumer GPUs.
  3. Model tags determine which variant is pulled (e.g., chat vs text fine-tuning) and directly affect hardware requirements and disk footprint.
  4. Ollama supports multiple developer workflows: CLI interaction, local URL summarization, and multimodal Q&A over local image files.
  5. Ollama exposes a localhost REST API on port 11434, letting developers integrate LLM inference into applications using POST requests.
  6. The `stream` flag in the API controls whether responses arrive token-by-token or as a single JSON object for easier downstream parsing.
  7. Model choice is a trade-off between capability and feasibility; smaller models like 7B variants are far more manageable than 70B models.

Highlights

Ollama’s core value is local inference: models download to the machine, run there, and can be accessed via CLI or a localhost REST API—reducing latency and avoiding cloud data transfer.
Model tags aren’t cosmetic; they select chat vs text fine-tuning and can change resource needs dramatically (example cited: 7B vs 70B).
A single local API call can return either streamed tokens or a complete JSON response depending on the `stream` parameter.