Gemma 2 - Local RAG with Ollama and LangChain
Based on Sam Witteveen's YouTube video. If you find this content useful, support the original creator by watching, liking, and subscribing.
Briefing
Running a fully local RAG pipeline with Gemma 2 is practical—and the fastest path starts with a clean indexing step, local embeddings, and a persistent vector database. The setup uses Ollama to host Gemma 2 (tested with both the 9B and 27B variants) and Nomic embeddings to embed documents locally, then stores vectors in ChromaDB on disk. The key payoff: once the index is built, queries can run without any internet calls, and swapping models or experimenting with chunking strategies becomes straightforward.
The workflow begins by addressing a common misconception: many “local RAG” demos still rely on cloud embedding services. Here, transcripts from Alex Hormozi’s YouTube channel are treated as raw documents, split into chunks, embedded with Nomic embeddings, and written into a persistent ChromaDB directory. The indexing logic is separated into its own script so it can be rerun only when the dataset or chunking method changes. For chunking, the author tests multiple approaches—specifically the semantic chunker and the recursive character text splitter—because retrieval quality often hinges on chunk boundaries. To speed iteration, the process can start with a small subset of files (e.g., 10–20) to confirm the pipeline works before scaling to the full transcript set.
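A minimal indexing sketch along these lines might look as follows. The directory names (`transcripts/`, `chroma_db/`), the chunk sizes, and the 20-file subset are illustrative assumptions, not the video's exact values:

```python
# index.py -- build the persistent ChromaDB index once; rerun only when the data or chunking changes.
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_chroma import Chroma

# Load the transcript files; start with a small subset to validate the pipeline.
loader = DirectoryLoader("transcripts/", glob="*.txt", loader_cls=TextLoader)
docs = loader.load()[:20]  # drop the slice once the pipeline works end to end

# Baseline chunking with the recursive character splitter; the semantic chunker
# (langchain_experimental.text_splitter.SemanticChunker) is the alternative to compare against.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Nomic embeddings served locally through Ollama, persisted to a Chroma directory on disk.
embeddings = OllamaEmbeddings(model="nomic-embed-text")
Chroma.from_documents(chunks, embeddings, persist_directory="chroma_db/")
```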
After indexing, the retrieval-and-generation portion is built as a LangChain chain. The retriever is configured for similarity search with k=5 (returning five relevant chunks per question). While MMR isn’t enabled in the baseline, it’s flagged as a likely improvement for reducing redundancy among retrieved passages. The LLM is Gemma 2 running locally via Ollama, with max tokens set to 512 and temperature set to 0 for more deterministic answers. A keep-alive setting (three hours in the example) keeps the model loaded so repeated calls don’t constantly reload weights.
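Sketched as code, the query-side setup could look like this; the model tag, parameter names, and directory path follow the langchain_ollama / langchain_chroma conventions and mirror the settings described above, but the script itself is an illustration rather than the video's exact code:

```python
# query.py -- load the existing index and configure retrieval plus Gemma 2 generation.
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_chroma import Chroma

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma(persist_directory="chroma_db/", embedding_function=embeddings)

# Baseline: plain similarity search returning 5 chunks per question.
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 5})
# Possible follow-up: search_type="mmr" to reduce redundancy among the retrieved chunks.

# Gemma 2 via Ollama: deterministic output, 512-token cap, model kept loaded for 3 hours.
llm = ChatOllama(
    model="gemma2",      # or "gemma2:27b" for the larger variant
    temperature=0,
    num_predict=512,     # Ollama's max-new-tokens setting
    keep_alive="3h",
)
```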
A prompt template then combines retrieved context with the user’s question. The chain uses a runnable pass-through so the question is fed both into the retriever and into the prompt, ensuring the model sees both the query and the relevant retrieved text. Streaming is enabled so answers appear incrementally. Example questions—such as “What is the rule of 100?” and protein-related queries—produce grounded responses that reflect the retrieved transcript content.
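A sketch of that chain, reusing `retriever` and `llm` from the previous snippet; the prompt wording is an assumption, not the video's exact template:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Join the k retrieved chunks into one context string for the prompt.
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Stream the answer so tokens appear incrementally.
for token in chain.stream("What is the rule of 100?"):
    print(token, end="", flush=True)
```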
To make debugging less guesswork, a separate debugging script adds progress indicators for embeddings and prints the prompts and inputs sent into the LLM. This helps verify whether the retriever is returning the right context and whether the prompt formatting matches the expected input type (string vs dictionary), avoiding frustrating runtime errors.
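One simple way to get that visibility is to splice a print step into the chain. This is a sketch of the idea rather than the video's actual debug script, and it reuses `retriever`, `format_docs`, `prompt`, and `llm` from the snippets above:

```python
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

def show_prompt(prompt_value):
    # prompt_value is a prompt value object here, not a plain string; printing it makes
    # string-vs-dictionary formatting mistakes obvious before they reach the model.
    print("----- PROMPT SENT TO LLM -----")
    print(prompt_value.to_string())
    print("------------------------------")
    return prompt_value

debug_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | RunnableLambda(show_prompt)   # inspect the filled-in template
    | llm
    | StrOutputParser()
)
```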
From there, the next upgrades are clear: fine-tune prompts for desired output style (bullet-point mini-reports vs longer narratives), experiment with semantic chunker thresholds, and consider retrieval enhancements like a multi-query retriever that generates several rewritten queries to improve recall. The overall message is that a fully local RAG stack—Ollama + Nomic embeddings + ChromaDB + LangChain + Gemma 2—can be built quickly, then iterated methodically by improving chunking, retrieval, and prompting.
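For the multi-query idea, LangChain's MultiQueryRetriever could be dropped in as shown in this sketch; again `retriever`, `llm`, `prompt`, and `format_docs` come from the snippets above, and this is an optional upgrade rather than part of the baseline build:

```python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# The LLM rewrites the question several ways; results from all rewrites are merged to improve recall.
multi_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=llm)

# Drop-in replacement for the plain similarity retriever in the baseline chain.
chain_mq = (
    {"context": multi_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
```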
Cornell Notes
A fully local RAG system can be built by combining Ollama-hosted Gemma 2, locally run Nomic embeddings, and a persistent ChromaDB vector store. The pipeline separates indexing from querying: transcripts are chunked (testing semantic chunker vs recursive character splitting), embedded, and stored once, then reused for later questions. At query time, a similarity retriever (k=5) fetches relevant chunks, a prompt template injects them alongside the question, and the Gemma 2 answer is streamed back. Debugging is simplified with progress bars and functions that print the exact prompts/context sent to the LLM. The approach matters because it avoids cloud embeddings and makes it easy to swap models (e.g., Llama-3) and compare indexing strategies on the same local dataset.
Why does separating indexing from querying make local RAG easier to iterate?
What chunking strategies are tested, and why does chunking matter?
How does the retrieval step work in the baseline system?
What settings shape Gemma 2’s behavior during local generation?
How does the chain ensure the question reaches both retrieval and prompting?
What debugging tools are added to make failures easier to diagnose?
Review Questions
- When would you rebuild the ChromaDB index in this workflow, and what changes would trigger a rebuild?
- How would enabling MMR (instead of plain similarity search) likely affect the retrieved chunks and downstream answers?
- What are the main reasons to tune the prompt template after the RAG pipeline is working end-to-end?
Key Points
1. Use a persistent local vector store (ChromaDB) so embeddings and chunking run once, not on every query.
2. Keep indexing logic in a separate script to enable fast iteration across chunking methods like semantic chunker vs recursive character splitting.
3. Run both embeddings and the LLM locally via Ollama to avoid cloud embedding dependencies often found in “local RAG” demos.
4. Start with a small document subset to validate retrieval and prompting before scaling to the full transcript dataset.
5. Configure retrieval with similarity search (k=5) as a baseline, then consider MMR to reduce redundancy.
6. Set deterministic generation parameters (temperature=0) and use keep-alive to avoid repeated model reloads during development.
7. Add debugging that prints retrieved context and the exact prompt sent to the LLM to quickly diagnose retrieval or formatting issues.