Local RAG with Llama 3.1 for PDFs | Private Chat with Your Documents using LangChain & Streamlit
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A fully local “chat with your PDFs” system can be built using open models and self-hosted infrastructure, with responses grounded in retrieved document passages and accompanied by ordered source snippets. The core payoff is privacy and control: PDFs are ingested, chunked, embedded, and stored locally, then every question triggers retrieval, reranking, optional LLM-based filtering, and finally an answer generated strictly from the most relevant context.
The pipeline starts when a user uploads a PDF. Text extraction converts the PDF into plain text, then chunking breaks that text into smaller segments to stay within model context limits and improve relevance. Embeddings turn each chunk into a vector representation, which is stored in a local vector database (Qdrant via its local client). This creates a private knowledge base that can be queried without sending document content to a third party.
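As a concrete illustration (a minimal sketch, not the video's exact code), the ingestion step might look like this in LangChain, assuming the pypdfium2 loader, FastEmbed embeddings, and Qdrant's embedded on-disk mode; the file name, chunk sizes, and collection name are illustrative:

```python
# Minimal ingestion sketch: extract, chunk, embed, store locally.
from langchain_community.document_loaders import PyPDFium2Loader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import FastEmbedEmbeddings
from langchain_qdrant import QdrantVectorStore

# 1. Extract plain text from the uploaded PDF (one Document per page).
docs = PyPDFium2Loader("manual.pdf").load()

# 2. Split into overlapping chunks that fit the model's context window.
splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=128)
chunks = splitter.split_documents(docs)

# 3. Embed each chunk locally and persist the vectors to an embedded
#    (serverless) Qdrant database on disk.
embeddings = FastEmbedEmbeddings()  # defaults to a small local BGE model
store = QdrantVectorStore.from_documents(
    chunks,
    embedding=embeddings,
    path="./qdrant-data",          # local on-disk storage, no server needed
    collection_name="pdf-chunks",
)
```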
When a question arrives, a retriever pulls the top five most similar chunks from the vector store using similarity search. To improve ordering and accuracy, the system can add two additional stages. First, a reranker reorders the retrieved chunks so the most relevant passage is placed first—important because the downstream prompt is instructed to treat the first context as the most relevant. Second, an optional “chain filter” uses an LLM check to keep only passages that actually answer the question. In the demo, this filtering can collapse multiple candidates down to a single best chunk, reducing noise and helping the model avoid irrelevant or conflicting information.
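Continuing the sketch above, LangChain expresses this as a compression retriever: FlashrankRerank reorders the top-five results and LLMChainFilter applies the LLM-based keep/drop check. The sample question is illustrative, and `store` comes from the ingestion sketch:

```python
# Retrieval sketch: top-5 similarity search, reranking, LLM-based filtering.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import (
    DocumentCompressorPipeline,
    LLMChainFilter,
)
from langchain_community.document_compressors import FlashrankRerank
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.1", temperature=0)

# Base retriever: the five chunks most similar to the question.
base_retriever = store.as_retriever(search_kwargs={"k": 5})

# Stage 1: rerank so the most relevant chunk ends up first.
# Stage 2 (optional): keep only chunks the LLM judges as actually answering.
retriever = ContextualCompressionRetriever(
    base_compressor=DocumentCompressorPipeline(
        transformers=[FlashrankRerank(), LLMChainFilter.from_llm(llm)]
    ),
    base_retriever=base_retriever,
)

relevant_docs = retriever.invoke("How many seats does the car have?")
```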
Answer generation happens through a LangChain question-answering chain that feeds the user’s question plus the selected context into a local LLM. The system prompt instructs the model to use the provided contextual information, return concise answers when the answer is present, and format the output in Markdown. Context is assembled with the most relevant sources first, and the chain also supports chat history so follow-up questions can reference earlier turns.
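A minimal sketch of such a chain, reusing `llm` and `retriever` from above; the system prompt here paraphrases the instructions described in the video rather than quoting them verbatim:

```python
# QA-chain sketch: answer from retrieved context, most relevant first,
# with chat history support and Markdown output.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

SYSTEM_PROMPT = """Use the provided contextual information to answer the question.
The contexts are ordered by relevance, so treat the first one as the most relevant.
If the context contains the answer, reply concisely. Format the answer in Markdown.

Context:
{context}"""

prompt = ChatPromptTemplate.from_messages([
    ("system", SYSTEM_PROMPT),
    MessagesPlaceholder("chat_history"),
    ("human", "{question}"),
])
chain = prompt | llm | StrOutputParser()

def answer(question: str, chat_history: list) -> str:
    docs = retriever.invoke(question)  # already reranked and filtered
    context = "\n\n".join(doc.page_content for doc in docs)  # best source first
    return chain.invoke(
        {"context": context, "question": question, "chat_history": chat_history}
    )
```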
The implementation ties these components together in a Streamlit interface. The app caches the built retrieval/QA resources so models and indexes aren’t rebuilt on every UI refresh. It streams the model’s response token-by-token and simultaneously captures retrieval events so the UI can display the exact source chunks used—rendered as expandable items alongside the final answer.
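A hedged sketch of that UI wiring; `build_retriever` and `build_qa_chain` are hypothetical wrappers around the earlier sketches, not names from the project:

```python
# Streamlit sketch: cache heavy resources, stream tokens, show sources.
import streamlit as st

# Hypothetical module wrapping the ingestion/retrieval/QA sketches above.
from pipeline import build_qa_chain, build_retriever

@st.cache_resource
def load_resources():
    # Built once; Streamlit reruns the script on every interaction,
    # but cached resources (models, indexes) are reused.
    return build_qa_chain(), build_retriever()

chain, retriever = load_resources()

if question := st.chat_input("Ask about your PDF"):
    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    with st.chat_message("assistant"):
        # write_stream renders tokens as they arrive, returning the full text.
        st.write_stream(
            chain.stream(
                {"context": context, "question": question, "chat_history": []}
            )
        )
    # Render the exact retrieved chunks as expandable citations.
    for i, doc in enumerate(docs, start=1):
        with st.expander(f"Source {i}"):
            st.markdown(doc.page_content)
```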
For local inference, the setup serves Llama 3.1 (an open model hosted locally), with Gemma 2 available as an alternative model in the retrieval/QA stack; embeddings are produced with FastEmbed and reranking is handled by FlashRank. PDF extraction uses pypdfium2 (a Python binding to Google's PDFium engine), a choice justified through benchmark comparisons that weigh extraction speed against extraction quality.
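One plausible wiring of those choices (the model names and defaults below are assumptions based on common setups, not confirmed from the video):

```python
# Model lineup sketch: everything runs locally by default.
from langchain_community.document_compressors import FlashrankRerank
from langchain_community.embeddings import FastEmbedEmbeddings
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.1", temperature=0)  # served locally by Ollama
embeddings = FastEmbedEmbeddings()                 # local FastEmbed vectors
reranker = FlashrankRerank()                       # local FlashRank reranking

# Optional hosted alternative (requires an API key); model id illustrative.
# from langchain_groq import ChatGroq
# llm = ChatGroq(model="gemma2-9b-it")
```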
A practical test uses a technical PDF for a 2025 Porsche model, with questions about engine specs, acceleration, and the number of seats. With reranking and chain filtering enabled, the system returns answers that match the expected facts, and the UI shows the specific retrieved passages that supported each response. Deployment is demonstrated via Streamlit Community Cloud, with configuration steps that avoid running heavy local-only services remotely and instead rely on exported requirements and Streamlit secrets for any optional remote API keys.
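For deployment, one plausible pattern (an assumption, since Community Cloud cannot run a local Ollama server; the Groq model id and secret name are illustrative) is to export a requirements file and key any remote model through Streamlit secrets:

```python
# Deployment sketch: fall back to a remote API when no local server exists.
# requirements.txt can be exported from the local environment, e.g.:
#   pip freeze > requirements.txt
# The key below lives in .streamlit/secrets.toml or the Cloud dashboard:
#   GROQ_API_KEY = "..."
import streamlit as st

if "GROQ_API_KEY" in st.secrets:
    from langchain_groq import ChatGroq  # hosted inference, e.g. Gemma 2
    llm = ChatGroq(model="gemma2-9b-it", api_key=st.secrets["GROQ_API_KEY"])
else:
    from langchain_ollama import ChatOllama  # local development only
    llm = ChatOllama(model="llama3.1")
```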
Cornell Notes
The system builds a private “RAG chat” app for PDFs using only self-hosted components. PDFs are extracted to text, chunked, embedded, and stored in a local Qdrant vector database. Each user question triggers similarity retrieval (top 5), optional reranking to reorder passages by relevance, and an optional LLM-based chain filter that keeps only context that truly answers the question. The final answer is generated by a local LLM using a prompt that prioritizes the first (most relevant) context and includes chat history. Streamlit provides a UI that streams answers and displays the exact retrieved source chunks used.
What does the ingestion pipeline do, and why are chunking and embeddings essential?
How does the system decide which passages to use for a question?
Why do reranking and chain filtering matter for answer quality?
What role does the prompt play in grounding answers in retrieved context?
How does the Streamlit UI support transparency and usability?
What local-vs-remote considerations appear in deployment?
Review Questions
- Explain the full RAG flow from PDF upload to final answer, naming the roles of chunking, embeddings, retrieval, reranking, and chain filtering.
- If chain filtering is turned off but reranking remains on, what kinds of errors might increase and why?
- How does the prompt’s instruction about context ordering interact with the reranker’s output?
Key Points
1. Ingestion converts PDFs to plain text, splits them into chunks, embeds the chunks, and stores the vectors locally in Qdrant for private retrieval.
2. Every question triggers similarity search over the vector store, typically returning the top five candidate chunks.
3. Reranking can reorder retrieved chunks so the most relevant passage is first, aligning with the prompt’s context-priority rule.
4. An optional LLM-based chain filter can remove passages that don’t actually answer the question, reducing noise before generation.
5. The QA chain generates answers using the user question plus retrieved context, with Markdown formatting and concise-response guidance.
6. Streamlit streams responses and displays the exact retrieved source chunks as expandable citations for transparency.
7. Deployment to Streamlit Community Cloud requires configuration to avoid relying on local-only model services remotely.