Local RAG with Llama 3.1 for PDFs | Private Chat with Your Documents using LangChain & Streamlit
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A fully local “chat with your PDFs” system can be built using open models and self-hosted infrastructure, with responses grounded in retrieved document passages and accompanied by ordered source snippets. The core payoff is privacy and control: PDFs are ingested, chunked, embedded, and stored locally, then every question triggers retrieval, reranking, optional LLM-based filtering, and finally an answer generated strictly from the most relevant context.
The pipeline starts when a user uploads a PDF. Text extraction converts the PDF into plain text, then chunking breaks that text into smaller segments to stay within model context limits and improve relevance. Embeddings turn each chunk into a vector representation, which is stored in a local vector database (Qdrant via its local client). This creates a private knowledge base that can be queried without sending document content to a third party.
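As a concrete illustration (a minimal sketch, not the video's exact code), the ingestion step might look like this in LangChain, assuming the pypdfium2 loader, FastEmbed embeddings, and Qdrant's embedded on-disk mode; the file name, chunk sizes, and collection name are illustrative:

```python
# Minimal ingestion sketch: extract, chunk, embed, store locally.
from langchain_community.document_loaders import PyPDFium2Loader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import FastEmbedEmbeddings
from langchain_qdrant import QdrantVectorStore

# 1. Extract plain text from the uploaded PDF (one Document per page).
docs = PyPDFium2Loader("manual.pdf").load()

# 2. Split into overlapping chunks that fit the model's context window.
splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=128)
chunks = splitter.split_documents(docs)

# 3. Embed each chunk locally and persist the vectors to an embedded
#    (serverless) Qdrant database on disk.
embeddings = FastEmbedEmbeddings()  # defaults to a small local BGE model
store = QdrantVectorStore.from_documents(
    chunks,
    embedding=embeddings,
    path="./qdrant-data",          # local on-disk storage, no server needed
    collection_name="pdf-chunks",
)
```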
When a question arrives, a retriever pulls the top five most similar chunks from the vector store using similarity search. To improve ordering and accuracy, the system can add two additional stages. First, a reranker reorders the retrieved chunks so the most relevant passage is placed first—important because the downstream prompt is instructed to treat the first context as the most relevant. Second, an optional “chain filter” uses an LLM check to keep only passages that actually answer the question. In the demo, this filtering can collapse multiple candidates down to a single best chunk, reducing noise and helping the model avoid irrelevant or conflicting information.
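Continuing the sketch above, LangChain expresses this as a compression retriever: FlashrankRerank reorders the top-five results and LLMChainFilter applies the LLM-based keep/drop check. The sample question is illustrative, and `store` comes from the ingestion sketch:

```python
# Retrieval sketch: top-5 similarity search, reranking, LLM-based filtering.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import (
    DocumentCompressorPipeline,
    LLMChainFilter,
)
from langchain_community.document_compressors import FlashrankRerank
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.1", temperature=0)

# Base retriever: the five chunks most similar to the question.
base_retriever = store.as_retriever(search_kwargs={"k": 5})

# Stage 1: rerank so the most relevant chunk ends up first.
# Stage 2 (optional): keep only chunks the LLM judges as actually answering.
retriever = ContextualCompressionRetriever(
    base_compressor=DocumentCompressorPipeline(
        transformers=[FlashrankRerank(), LLMChainFilter.from_llm(llm)]
    ),
    base_retriever=base_retriever,
)

relevant_docs = retriever.invoke("How many seats does the car have?")
```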
Answer generation happens through a LangChain question-answering chain that feeds the user’s question plus the selected context into a local LLM. The system prompt instructs the model to use the provided contextual information, return concise answers when the answer is present, and format the output in Markdown. Context is assembled with the most relevant sources first, and the chain also supports chat history so follow-up questions can reference earlier turns.
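A minimal sketch of such a chain, reusing `llm` and `retriever` from above; the system prompt here paraphrases the instructions described in the video rather than quoting them verbatim:

```python
# QA-chain sketch: answer from retrieved context, most relevant first,
# with chat history support and Markdown output.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

SYSTEM_PROMPT = """Use the provided contextual information to answer the question.
The contexts are ordered by relevance, so treat the first one as the most relevant.
If the context contains the answer, reply concisely. Format the answer in Markdown.

Context:
{context}"""

prompt = ChatPromptTemplate.from_messages([
    ("system", SYSTEM_PROMPT),
    MessagesPlaceholder("chat_history"),
    ("human", "{question}"),
])
chain = prompt | llm | StrOutputParser()

def answer(question: str, chat_history: list) -> str:
    docs = retriever.invoke(question)  # already reranked and filtered
    context = "\n\n".join(doc.page_content for doc in docs)  # best source first
    return chain.invoke(
        {"context": context, "question": question, "chat_history": chat_history}
    )
```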
The implementation ties these components together in a Streamlit interface. The app caches the built retrieval/QA resources so models and indexes aren’t rebuilt on every UI refresh. It streams the model’s response token-by-token and simultaneously captures retrieval events so the UI can display the exact source chunks used—rendered as expandable items alongside the final answer.
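A hedged sketch of that UI wiring; `build_retriever` and `build_qa_chain` are hypothetical wrappers around the earlier sketches, not names from the project:

```python
# Streamlit sketch: cache heavy resources, stream tokens, show sources.
import streamlit as st

# Hypothetical module wrapping the ingestion/retrieval/QA sketches above.
from pipeline import build_qa_chain, build_retriever

@st.cache_resource
def load_resources():
    # Built once; Streamlit reruns the script on every interaction,
    # but cached resources (models, indexes) are reused.
    return build_qa_chain(), build_retriever()

chain, retriever = load_resources()

if question := st.chat_input("Ask about your PDF"):
    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    with st.chat_message("assistant"):
        # write_stream renders tokens as they arrive, returning the full text.
        st.write_stream(
            chain.stream(
                {"context": context, "question": question, "chat_history": []}
            )
        )
    # Render the exact retrieved chunks as expandable citations.
    for i, doc in enumerate(docs, start=1):
        with st.expander(f"Source {i}"):
            st.markdown(doc.page_content)
```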
For local inference, the setup serves Llama 3.1 (an open model hosted locally), with Gemma 2 available as an alternative model in the retrieval/QA stack; embeddings are produced with FastEmbed and reranking is handled by FlashRank. PDF extraction uses pypdfium2 (a Python binding to Google's PDFium engine), a choice justified through benchmark comparisons that weigh extraction speed against extraction quality.
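One plausible wiring of those choices (the model names and defaults below are assumptions based on common setups, not confirmed from the video):

```python
# Model lineup sketch: everything runs locally by default.
from langchain_community.document_compressors import FlashrankRerank
from langchain_community.embeddings import FastEmbedEmbeddings
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.1", temperature=0)  # served locally by Ollama
embeddings = FastEmbedEmbeddings()                 # local FastEmbed vectors
reranker = FlashrankRerank()                       # local FlashRank reranking

# Optional hosted alternative (requires an API key); model id illustrative.
# from langchain_groq import ChatGroq
# llm = ChatGroq(model="gemma2-9b-it")
```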
A practical test uses a technical PDF for a 2025 Porsche model, with questions about engine specs, acceleration, and the number of seats. With reranking and chain filtering enabled, the system returns answers that match the expected facts, and the UI shows the specific retrieved passages that supported each response. Deployment is demonstrated via Streamlit Community Cloud, with configuration steps that avoid running heavy local-only services remotely and instead rely on exported requirements and Streamlit secrets for any optional remote API keys.
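For deployment, one plausible pattern (an assumption, since Community Cloud cannot run a local Ollama server; the Groq model id and secret name are illustrative) is to export a requirements file and key any remote model through Streamlit secrets:

```python
# Deployment sketch: fall back to a remote API when no local server exists.
# requirements.txt can be exported from the local environment, e.g.:
#   pip freeze > requirements.txt
# The key below lives in .streamlit/secrets.toml or the Cloud dashboard:
#   GROQ_API_KEY = "..."
import streamlit as st

if "GROQ_API_KEY" in st.secrets:
    from langchain_groq import ChatGroq  # hosted inference, e.g. Gemma 2
    llm = ChatGroq(model="gemma2-9b-it", api_key=st.secrets["GROQ_API_KEY"])
else:
    from langchain_ollama import ChatOllama  # local development only
    llm = ChatOllama(model="llama3.1")
```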
Cornell Notes
The system builds a private “RAG chat” app for PDFs using only self-hosted components. PDFs are extracted to text, chunked, embedded, and stored in a local Qdrant vector database. Each user question triggers similarity retrieval (top 5), optional reranking to reorder passages by relevance, and an optional LLM-based chain filter that keeps only context that truly answers the question. The final answer is generated by a local LLM using a prompt that prioritizes the first (most relevant) context and includes chat history. Streamlit provides a UI that streams answers and displays the exact retrieved source chunks used.
What does the ingestion pipeline do, and why are chunking and embeddings essential?
How does the system decide which passages to use for a question?
Why do reranking and chain filtering matter for answer quality?
What role does the prompt play in grounding answers in retrieved context?
How does the Streamlit UI support transparency and usability?
What local-vs-remote considerations appear in deployment?
Review Questions
- Explain the full RAG flow from PDF upload to final answer, naming the roles of chunking, embeddings, retrieval, reranking, and chain filtering.
- If chain filtering is turned off but reranking remains on, what kinds of errors might increase and why?
- How does the prompt’s instruction about context ordering interact with the reranker’s output?
Key Points
1. Ingestion converts PDFs to plain text, splits them into chunks, embeds the chunks, and stores the vectors locally in Qdrant for private retrieval.
2. Every question triggers similarity search over the vector store, typically returning the top five candidate chunks.
3. Reranking can reorder retrieved chunks so the most relevant passage is first, aligning with the prompt’s context-priority rule.
4. An optional LLM-based chain filter can remove passages that don’t actually answer the question, reducing noise before generation.
5. The QA chain generates answers using the user question plus retrieved context, with Markdown formatting and concise-response guidance.
6. Streamlit streams responses and displays the exact retrieved source chunks as expandable citations for transparency.
7. Deployment to Streamlit Community Cloud requires configuration to avoid relying on local-only model services remotely.