Gemma 2 - Local RAG with Ollama and LangChain
Based on Sam Witteveen's YouTube video. If you find this content useful, support the original creator by watching, liking, and subscribing.
Briefing
Running a fully local RAG pipeline with Gemma 2 is practical—and the fastest path starts with a clean indexing step, local embeddings, and a persistent vector database. The setup uses Ollama to host Gemma 2 (tested with both the 9B and 27B variants) and Nomic embeddings to embed documents locally, then stores vectors in ChromaDB on disk. The key payoff: once the index is built, queries can run without any internet calls, and swapping models or experimenting with chunking strategies becomes straightforward.
The workflow begins by addressing a common misconception: many “local RAG” demos still rely on cloud embedding services. Here, transcripts from Alex Hormozi’s YouTube channel are treated as raw documents, split into chunks, embedded with Nomic embeddings, and written into a persistent ChromaDB directory. The indexing logic is separated into its own script so it can be rerun only when the dataset or chunking method changes. For chunking, the author tests multiple approaches—specifically the semantic chunker and the recursive character text splitter—because retrieval quality often hinges on chunk boundaries. To speed iteration, the process can start with a small subset of files (e.g., 10–20) to confirm the pipeline works before scaling to the full transcript set.
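A minimal indexing sketch along these lines might look as follows. The directory names (`transcripts/`, `chroma_db/`), the chunk sizes, and the 20-file subset are illustrative assumptions, not the video's exact values:

```python
# index.py -- build the persistent ChromaDB index once; rerun only when the data or chunking changes.
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_chroma import Chroma

# Load the transcript files; start with a small subset to validate the pipeline.
loader = DirectoryLoader("transcripts/", glob="*.txt", loader_cls=TextLoader)
docs = loader.load()[:20]  # drop the slice once the pipeline works end to end

# Baseline chunking with the recursive character splitter; the semantic chunker
# (langchain_experimental.text_splitter.SemanticChunker) is the alternative to compare against.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Nomic embeddings served locally through Ollama, persisted to a Chroma directory on disk.
embeddings = OllamaEmbeddings(model="nomic-embed-text")
Chroma.from_documents(chunks, embeddings, persist_directory="chroma_db/")
```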
After indexing, the retrieval-and-generation portion is built as a LangChain chain. The retriever is configured for similarity search with k=5 (returning five relevant chunks per question). While MMR isn’t enabled in the baseline, it’s flagged as a likely improvement for reducing redundancy among retrieved passages. The LLM is Gemma 2 running locally via Ollama, with max tokens set to 512 and temperature set to 0 for more deterministic answers. A keep-alive setting (three hours in the example) keeps the model loaded so repeated calls don’t constantly reload weights.
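Sketched as code, the query-side setup could look like this; the model tag, parameter names, and directory path follow the langchain_ollama / langchain_chroma conventions and mirror the settings described above, but the script itself is an illustration rather than the video's exact code:

```python
# query.py -- load the existing index and configure retrieval plus Gemma 2 generation.
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_chroma import Chroma

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma(persist_directory="chroma_db/", embedding_function=embeddings)

# Baseline: plain similarity search returning 5 chunks per question.
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 5})
# Possible follow-up: search_type="mmr" to reduce redundancy among the retrieved chunks.

# Gemma 2 via Ollama: deterministic output, 512-token cap, model kept loaded for 3 hours.
llm = ChatOllama(
    model="gemma2",      # or "gemma2:27b" for the larger variant
    temperature=0,
    num_predict=512,     # Ollama's max-new-tokens setting
    keep_alive="3h",
)
```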
A prompt template then combines retrieved context with the user’s question. The chain uses a runnable pass-through so the question is fed both into the retriever and into the prompt, ensuring the model sees both the query and the relevant retrieved text. Streaming is enabled so answers appear incrementally. Example questions—such as “What is the rule of 100?” and protein-related queries—produce grounded responses that reflect the retrieved transcript content.
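A sketch of that chain, reusing `retriever` and `llm` from the previous snippet; the prompt wording is an assumption, not the video's exact template:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Join the k retrieved chunks into one context string for the prompt.
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Stream the answer so tokens appear incrementally.
for token in chain.stream("What is the rule of 100?"):
    print(token, end="", flush=True)
```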
To make debugging less guesswork, a separate debugging script adds progress indicators for embeddings and prints the prompts and inputs sent into the LLM. This helps verify whether the retriever is returning the right context and whether the prompt formatting matches the expected input type (string vs dictionary), avoiding frustrating runtime errors.
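One simple way to get that visibility is to splice a print step into the chain. This is a sketch of the idea rather than the video's actual debug script, and it reuses `retriever`, `format_docs`, `prompt`, and `llm` from the snippets above:

```python
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

def show_prompt(prompt_value):
    # prompt_value is a prompt value object here, not a plain string; printing it makes
    # string-vs-dictionary formatting mistakes obvious before they reach the model.
    print("----- PROMPT SENT TO LLM -----")
    print(prompt_value.to_string())
    print("------------------------------")
    return prompt_value

debug_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | RunnableLambda(show_prompt)   # inspect the filled-in template
    | llm
    | StrOutputParser()
)
```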
From there, the next upgrades are clear: fine-tune prompts for desired output style (bullet-point mini-reports vs longer narratives), experiment with semantic chunker thresholds, and consider retrieval enhancements like a multi-query retriever that generates several rewritten queries to improve recall. The overall message is that a fully local RAG stack—Ollama + Nomic embeddings + ChromaDB + LangChain + Gemma 2—can be built quickly, then iterated methodically by improving chunking, retrieval, and prompting.
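For the multi-query idea, LangChain's MultiQueryRetriever could be dropped in as shown in this sketch; again `retriever`, `llm`, `prompt`, and `format_docs` come from the snippets above, and this is an optional upgrade rather than part of the baseline build:

```python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# The LLM rewrites the question several ways; results from all rewrites are merged to improve recall.
multi_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=llm)

# Drop-in replacement for the plain similarity retriever in the baseline chain.
chain_mq = (
    {"context": multi_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
```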
Cornell Notes
A fully local RAG system can be built by combining Ollama-hosted Gemma 2, locally run Nomic embeddings, and a persistent ChromaDB vector store. The pipeline separates indexing from querying: transcripts are chunked (testing semantic chunker vs recursive character splitting), embedded, and stored once, then reused for later questions. At query time, a similarity retriever (k=5) fetches relevant chunks, a prompt template injects them alongside the question, and the Gemma 2 answer is streamed back. Debugging is simplified with progress bars and functions that print the exact prompts/context sent to the LLM. The approach matters because it avoids cloud embeddings and makes it easy to swap models (e.g., Llama-3) and compare indexing strategies on the same local dataset.
Why does separating indexing from querying make local RAG easier to iterate?
What chunking strategies are tested, and why does chunking matter?
How does the retrieval step work in the baseline system?
What settings shape Gemma 2’s behavior during local generation?
How does the chain ensure the question reaches both retrieval and prompting?
What debugging tools are added to make failures easier to diagnose?
Review Questions
- When would you rebuild the ChromaDB index in this workflow, and what changes would trigger a rebuild?
- How would enabling MMR (instead of plain similarity search) likely affect the retrieved chunks and downstream answers?
- What are the main reasons to tune the prompt template after the RAG pipeline is working end-to-end?
Key Points
1. Use a persistent local vector store (ChromaDB) so embeddings and chunking run once, not on every query.
2. Keep indexing logic in a separate script to enable fast iteration across chunking methods like semantic chunker vs recursive character splitting.
3. Run both embeddings and the LLM locally via Ollama to avoid cloud embedding dependencies often found in “local RAG” demos.
4. Start with a small document subset to validate retrieval and prompting before scaling to the full transcript dataset.
5. Configure retrieval with similarity search (k=5) as a baseline, then consider MMR to reduce redundancy.
6. Set deterministic generation parameters (temperature=0) and use keep-alive to avoid repeated model reloads during development.
7. Add debugging that prints retrieved context and the exact prompt sent to the LLM to quickly diagnose retrieval or formatting issues.