Ollama meets LangChain
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Running Ollama models locally turns LangChain into an on-device workflow: Python code can call a local LLaMA-2 instance through an API, generate text, and even perform retrieval-augmented generation (RAG) over web content—without writing custom parsing logic.
The setup starts with a local development environment (VS Code and a Python virtual environment for installing LangChain). Ollama models are already available locally, including a LLaMA-2 model and other previously configured models. From there, the simplest step is loading an Ollama-backed LLM inside LangChain. Using LangChain’s pre-made Ollama LLM integration, the code instantiates an LLM object with the chosen model name (LLaMA-2) and optionally attaches a streaming callback for token-by-token output. When the Python script runs, it triggers Ollama via its local API, meaning the model inference happens on the machine rather than through a hosted service.
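As a concrete illustration, a minimal sketch of that first step might look like the following. It assumes the langchain and langchain-community packages are installed and that a local Ollama server already has the llama2 model pulled; exact import paths and the invoke call vary across LangChain versions, so treat the names as illustrative rather than the video's exact code.

```python
# Minimal sketch: load an Ollama-backed LLM in LangChain with an optional
# streaming callback. Assumes a locally running Ollama server with "llama2" pulled.
from langchain_community.llms import Ollama
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# The wrapper talks to Ollama's local API (http://localhost:11434 by default),
# so inference runs on this machine rather than through a hosted service.
llm = Ollama(
    model="llama2",
    callbacks=[StreamingStdOutCallbackHandler()],  # print tokens as they are generated
)

print(llm.invoke("Tell me about LangChain in one paragraph."))
```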
Next comes a basic “chain” built around a prompt template. A prompt asks for “five interesting facts about” a topic, and the chain runs the LLM with configurable generation parameters such as temperature and max tokens. The workflow can stream output or run silently; in the example, streaming is disabled and verbose logging is turned off, so the program waits for completion and then prints the final response. The result is straightforward: the local LLaMA-2 model returns the requested facts, and the output can be redirected to a file for automation.
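A sketch of that chain could look like this. It uses LangChain's pipe-style composition rather than whatever chain class the video uses, and `num_predict` is the Ollama integration's max-token setting; the parameter values and the topic are placeholders.

```python
# Sketch of the "five interesting facts" chain with configurable generation
# parameters; no streaming callback, so the script waits for the full answer.
from langchain_community.llms import Ollama
from langchain.prompts import PromptTemplate

llm = Ollama(
    model="llama2",
    temperature=0.7,   # randomness of the generation
    num_predict=256,   # Ollama's cap on generated tokens
)

prompt = PromptTemplate.from_template("Give me five interesting facts about {topic}.")

chain = prompt | llm                        # prompt feeds straight into the local LLM
print(chain.invoke({"topic": "the moon"}))  # prints the completed response at the end
```

Because the script simply prints the final completion, redirecting its standard output to a file is enough to capture the result for automation.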
The most advanced example demonstrates RAG over web content. The script accepts a command-line argument (the URL to index), a web page loader fetches the page, and a recursive text splitter breaks the content into smaller chunks. Those chunks are embedded and stored in a local Chroma vector database. For embeddings, the workflow uses Ollama-compatible embedding options exposed through LangChain (the video also mentions quantized embedding approaches).
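The indexing half of that pipeline could look roughly like the sketch below. `OllamaEmbeddings` stands in for whichever embedding backend (possibly a quantized one) the video actually uses, and the chunk sizes are placeholders.

```python
# Sketch of the indexing step: take a URL from the command line, fetch the page,
# split it into chunks, embed them, and store them in a local Chroma database.
# Assumes langchain-community, chromadb, and beautifulsoup4 are installed.
import sys

from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

url = sys.argv[1]  # e.g. a news site passed as the first argument

# Fetch the page and split it into overlapping chunks for retrieval.
docs = WebBaseLoader(url).load()
splits = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)

# Embed each chunk locally via Ollama and index the vectors in Chroma.
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=OllamaEmbeddings(model="llama2"),
)
```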
After indexing, a retrieval QA chain is assembled: it combines the local LLM (LLaMA-2), the Chroma vector store, and a prompt pulled from LangChain Hub. The chain then answers a question of the form "what are the latest headlines on" the given site by retrieving relevant chunks from the vector database and generating an answer grounded in that retrieved text. Passing “TechCrunch” as the URL argument yields a list of current headlines extracted from the page. Notably, the number of headlines can vary between runs (sometimes five, sometimes ten) even though no fixed count is enforced, suggesting the model’s generation behavior influences the output length.
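Wiring the retrieval QA chain on top of that index could then look like the following sketch, which reuses `url` and `vectorstore` from the previous snippet. The Hub prompt name shown is a commonly used LLaMA RAG prompt and may differ from the one in the video; `hub.pull` additionally requires the langchainhub package.

```python
# Sketch of the retrieval QA step, continuing from the indexing snippet above
# (reuses `url` and `vectorstore`).
from langchain import hub
from langchain.chains import RetrievalQA
from langchain_community.llms import Ollama

llm = Ollama(model="llama2")

# Pull a ready-made RAG prompt from LangChain Hub.
rag_prompt = hub.pull("rlm/rag-prompt-llama")

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),      # fetches relevant chunks from Chroma
    chain_type_kwargs={"prompt": rag_prompt},
)

# The answer is grounded in the retrieved chunks, not in a live scrape/parse step.
result = qa_chain.invoke({"query": f"What are the latest headlines on {url}?"})
print(result["result"])
```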
Overall, the workflow illustrates how local LLM inference plus LangChain’s chaining and RAG components can automate information gathering—potentially via scheduled jobs—while keeping data processing and model calls on the user’s own machine. The same pattern can be extended toward fuller local RAG systems for documents.
Cornell Notes
Ollama running on a local machine can power LangChain workflows using an API connection. The transcript shows three escalating examples: loading a local LLaMA-2 model through LangChain’s Ollama LLM wrapper, running a simple prompt template chain to generate text (e.g., five facts about the moon), and building a RAG pipeline over a web page. For RAG, a URL is fetched, split into chunks, embedded, and stored in a local Chroma vector database. A retrieval QA chain then answers questions using the retrieved chunks plus a prompt from LangChain Hub, producing outputs like the latest TechCrunch headlines without custom scraping/parsing code.
- How does LangChain call a locally running Ollama model from Python?
- What does the “simple chain” example add beyond loading the model?
- What are the main steps in the RAG workflow over a web page?
- How does the retrieval QA chain produce “latest headlines” without manual parsing?
- Why might the number of headlines vary between runs?
Review Questions
- What components are required to turn a local LLM into a RAG system over a URL (loader, splitter, embeddings, vector store, retrieval chain)?
- In the examples, where do streaming and verbosity settings affect the user experience, and what do they change in the program’s output?
- How does the retrieval QA chain combine the vector store (Chroma) with the LLM to answer a question grounded in web content?
Key Points
1. Ollama provides a local API that LangChain can call, enabling on-device inference with models like LLaMA-2.
2. LangChain’s Ollama LLM wrapper can be instantiated with a chosen model name and optional streaming callbacks for incremental token output.
3. Prompt templates plus simple chains let local LLMs generate structured responses (e.g., five facts) with controllable parameters like temperature and max tokens.
4. A RAG pipeline over web content can be built by loading a URL, splitting text into chunks, embedding those chunks, and storing them in a local Chroma vector database.
5. A retrieval QA chain uses the vector store to fetch relevant chunks and then generates answers using a prompt from LangChain Hub.
6. Automating information gathering becomes feasible by running the URL-based RAG script on a schedule and saving outputs locally.
7. Model output length (such as the number of headlines) can vary even without explicit constraints, reflecting generation behavior.