Ollama meets LangChain
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Running Ollama models locally turns LangChain into an on-device workflow: Python code can call a local LLaMA-2 instance through an API, generate text, and even perform retrieval-augmented generation (RAG) over web content—without writing custom parsing logic.
The setup starts with a local development environment (VS Code and a Python virtual environment for installing LangChain). Ollama models are already available locally, including a LLaMA-2 model and other previously configured models. From there, the simplest step is loading an Ollama-backed LLM inside LangChain. Using LangChain’s pre-made Ollama LLM integration, the code instantiates an LLM object with the chosen model name (LLaMA-2) and optionally attaches a streaming callback for token-by-token output. When the Python script runs, it triggers Ollama via its local API, meaning the model inference happens on the machine rather than through a hosted service.
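As a concrete illustration, a minimal sketch of that first step might look like the following. It assumes the langchain and langchain-community packages are installed and that a local Ollama server already has the llama2 model pulled; exact import paths and the invoke call vary across LangChain versions, so treat the names as illustrative rather than the video's exact code.

```python
# Minimal sketch: load an Ollama-backed LLM in LangChain with an optional
# streaming callback. Assumes a locally running Ollama server with "llama2" pulled.
from langchain_community.llms import Ollama
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# The wrapper talks to Ollama's local API (http://localhost:11434 by default),
# so inference runs on this machine rather than through a hosted service.
llm = Ollama(
    model="llama2",
    callbacks=[StreamingStdOutCallbackHandler()],  # print tokens as they are generated
)

print(llm.invoke("Tell me about LangChain in one paragraph."))
```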
Next comes a basic “chain” built around a prompt template. A prompt asks for “five interesting facts about” a topic, and the chain runs the LLM with configurable generation parameters such as temperature and max tokens. The workflow can stream output or run silently; in the example, streaming is disabled and verbose logging is turned off, so the program waits for completion and then prints the final response. The result is straightforward: the local LLaMA-2 model returns the requested facts, and the output can be redirected to a file for automation.
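A sketch of that chain could look like this. It uses LangChain's pipe-style composition rather than whatever chain class the video uses, and `num_predict` is the Ollama integration's max-token setting; the parameter values and the topic are placeholders.

```python
# Sketch of the "five interesting facts" chain with configurable generation
# parameters; no streaming callback, so the script waits for the full answer.
from langchain_community.llms import Ollama
from langchain.prompts import PromptTemplate

llm = Ollama(
    model="llama2",
    temperature=0.7,   # randomness of the generation
    num_predict=256,   # Ollama's cap on generated tokens
)

prompt = PromptTemplate.from_template("Give me five interesting facts about {topic}.")

chain = prompt | llm                        # prompt feeds straight into the local LLM
print(chain.invoke({"topic": "the moon"}))  # prints the completed response at the end
```

Because the script simply prints the final completion, redirecting its standard output to a file is enough to capture the result for automation.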
The most advanced example demonstrates RAG over web content. The script accepts a command-line argument (the URL to index), a web page loader fetches the page, and a recursive text splitter breaks the content into smaller chunks. Those chunks are embedded and stored in a local Chroma vector database. For embeddings, the workflow uses Ollama-compatible embedding options exposed through LangChain (the video also mentions quantized embedding approaches).
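The indexing half of that pipeline could look roughly like the sketch below. `OllamaEmbeddings` stands in for whichever embedding backend (possibly a quantized one) the video actually uses, and the chunk sizes are placeholders.

```python
# Sketch of the indexing step: take a URL from the command line, fetch the page,
# split it into chunks, embed them, and store them in a local Chroma database.
# Assumes langchain-community, chromadb, and beautifulsoup4 are installed.
import sys

from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

url = sys.argv[1]  # e.g. a news site passed as the first argument

# Fetch the page and split it into overlapping chunks for retrieval.
docs = WebBaseLoader(url).load()
splits = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)

# Embed each chunk locally via Ollama and index the vectors in Chroma.
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=OllamaEmbeddings(model="llama2"),
)
```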
After indexing, a retrieval QA chain is assembled: it combines the local LLM (LLaMA-2), the Chroma vector store, and a prompt pulled from LangChain Hub. The chain then answers a question of the form "what are the latest headlines on" the given site by retrieving relevant chunks from the vector database and generating an answer grounded in that retrieved text. Passing “TechCrunch” as the URL argument yields a list of current headlines extracted from the page. Notably, the number of headlines can vary between runs (sometimes five, sometimes ten) even though no fixed count is enforced, suggesting the model’s generation behavior influences the output length.
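Wiring the retrieval QA chain on top of that index could then look like the following sketch, which reuses `url` and `vectorstore` from the previous snippet. The Hub prompt name shown is a commonly used LLaMA RAG prompt and may differ from the one in the video; `hub.pull` additionally requires the langchainhub package.

```python
# Sketch of the retrieval QA step, continuing from the indexing snippet above
# (reuses `url` and `vectorstore`).
from langchain import hub
from langchain.chains import RetrievalQA
from langchain_community.llms import Ollama

llm = Ollama(model="llama2")

# Pull a ready-made RAG prompt from LangChain Hub.
rag_prompt = hub.pull("rlm/rag-prompt-llama")

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),      # fetches relevant chunks from Chroma
    chain_type_kwargs={"prompt": rag_prompt},
)

# The answer is grounded in the retrieved chunks, not in a live scrape/parse step.
result = qa_chain.invoke({"query": f"What are the latest headlines on {url}?"})
print(result["result"])
```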
Overall, the workflow illustrates how local LLM inference plus LangChain’s chaining and RAG components can automate information gathering—potentially via scheduled jobs—while keeping data processing and model calls on the user’s own machine. The same pattern can be extended toward fuller local RAG systems for documents.
Cornell Notes
Ollama running on a local machine can power LangChain workflows using an API connection. The transcript shows three escalating examples: loading a local LLaMA-2 model through LangChain’s Ollama LLM wrapper, running a simple prompt template chain to generate text (e.g., five facts about the moon), and building a RAG pipeline over a web page. For RAG, a URL is fetched, split into chunks, embedded, and stored in a local Chroma vector database. A retrieval QA chain then answers questions using the retrieved chunks plus a prompt from LangChain Hub, producing outputs like the latest TechCrunch headlines without custom scraping/parsing code.
- How does LangChain call a locally running Ollama model from Python?
- What does the “simple chain” example add beyond loading the model?
- What are the main steps in the RAG workflow over a web page?
- How does the retrieval QA chain produce “latest headlines” without manual parsing?
- Why might the number of headlines vary between runs?
Review Questions
- What components are required to turn a local LLM into a RAG system over a URL (loader, splitter, embeddings, vector store, retrieval chain)?
- In the examples, where do streaming and verbosity settings affect the user experience, and what do they change in the program’s output?
- How does the retrieval QA chain combine the vector store (Chroma) with the LLM to answer a question grounded in web content?
Key Points
1. Ollama provides a local API that LangChain can call, enabling on-device inference with models like LLaMA-2.
2. LangChain’s Ollama LLM wrapper can be instantiated with a chosen model name and optional streaming callbacks for incremental token output.
3. Prompt templates plus simple chains let local LLMs generate structured responses (e.g., five facts) with controllable parameters like temperature and max tokens.
4. A RAG pipeline over web content can be built by loading a URL, splitting text into chunks, embedding those chunks, and storing them in a local Chroma vector database.
5. A retrieval QA chain uses the vector store to fetch relevant chunks and then generates answers using a prompt from LangChain Hub.
6. Automating information gathering becomes feasible by running the URL-based RAG script on a schedule and saving outputs locally.
7. Model output length (such as the number of headlines) can vary even without explicit constraints, reflecting generation behavior.