100% Local RAG with DeepSeek-R1, Ollama and LangChain - Build Document AI for Your Private Files
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A practical way to make local RAG work reliably on long documents is to retrieve the right text chunks—then feed only those chunks (plus chat history) into DeepSeek-R1. The build described here targets a common failure mode: when documents span multiple pages, answers can drift because the model gets overwhelmed by irrelevant context. The fix is a full local pipeline that ingests files, splits them into manageable chunks, retrieves the most relevant chunks for each question using a hybrid search strategy, reranks the candidates, and streams grounded answers through a Streamlit chat UI.
The system is organized into two main components: file ingestion and a chatbot. Ingestion takes uploaded files (PDF, Markdown, or text), converts them into a list of documents, and then splits each document into chunks using a recursive character text splitter configured for 2,048-character chunks with overlap. To improve retrieval quality, each chunk is “contextualized” by generating a short 2–3 sentence summary of what that chunk represents within the broader document. This contextualization is produced locally using a lightweight LLM (Llama 3.2 3B) and is prepended to the chunk text so the retriever and downstream model see more informative units.
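A minimal sketch of that ingestion step, assuming the langchain-core, langchain-text-splitters, and langchain-ollama packages; the 128-character overlap, the prompt wording, the document truncation, and the helper names are illustrative rather than taken from the original build:

```python
from langchain_core.documents import Document
from langchain_ollama import ChatOllama
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 2,048-character chunks as described above; the overlap size here is an assumption.
splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=128)

# Lightweight local model used only to write the short per-chunk context.
context_llm = ChatOllama(model="llama3.2:3b", temperature=0)

def contextualize(chunk: Document, full_text: str) -> Document:
    """Prepend a 2-3 sentence summary of how this chunk fits the whole document."""
    prompt = (
        "Here is a document:\n"
        f"{full_text[:8000]}\n\n"
        "Here is a chunk from that document:\n"
        f"{chunk.page_content}\n\n"
        "In 2-3 sentences, describe what this chunk covers within the document."
    )
    summary = context_llm.invoke(prompt).content
    return Document(
        page_content=f"{summary}\n\n{chunk.page_content}",
        metadata=chunk.metadata,
    )

def ingest(doc: Document) -> list[Document]:
    """Split one loaded file into contextualized, overlapping chunks."""
    chunks = splitter.split_documents([doc])
    return [contextualize(chunk, doc.page_content) for chunk in chunks]
```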
Retrieval uses a hybrid approach combining semantic search and keyword search. Semantic retrieval is built from embeddings generated with FastEmbed’s “BAAI/bge-small-en-v1.5” model, then queried to return the top candidates. In parallel, a BM25 retriever estimates relevance using term frequency and keyword overlap. An Ensemble Retriever merges both signals with weights set to 60% semantic and 40% BM25. Because hybrid retrieval can still surface near-misses, a reranking step is applied: Flashrank reorders the top results using a local reranker model (ms-marco-MiniLM-L-6-v2), and the pipeline keeps only the top three chunks as context for the answer.
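One way to wire that hybrid retrieval and reranking with LangChain components, assuming langchain, langchain-community, faiss-cpu, rank-bm25, and flashrank are installed; the FAISS vector store, the candidate counts (k=5), and the function name are assumptions, and the reranker model name follows the text (swap in another Flashrank-supported checkpoint if it is unavailable locally):

```python
from langchain.retrievers import ContextualCompressionRetriever, EnsembleRetriever
from langchain.retrievers.document_compressors import FlashrankRerank
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS

def build_retriever(chunks):
    # Semantic side: FastEmbed embeddings indexed in a local vector store.
    embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-small-en-v1.5")
    semantic = FAISS.from_documents(chunks, embeddings).as_retriever(
        search_kwargs={"k": 5}
    )

    # Keyword side: BM25 scoring over the same chunks.
    keyword = BM25Retriever.from_documents(chunks)
    keyword.k = 5

    # Merge both signals, weighted 60% semantic / 40% BM25.
    hybrid = EnsembleRetriever(retrievers=[semantic, keyword], weights=[0.6, 0.4])

    # Rerank the merged candidates locally and keep only the top three chunks.
    reranker = FlashrankRerank(model="ms-marco-MiniLM-L-6-v2", top_n=3)
    return ContextualCompressionRetriever(
        base_compressor=reranker, base_retriever=hybrid
    )
```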
The chatbot layer wraps this retrieval in a state graph (via LangGraph) that manages question, chat history, retrieved documents, and the final answer. For each user query, the graph retrieves relevant chunks, formats them into a prompt (including file names and chunk content), and sends everything—along with recent conversation history—to DeepSeek-R1 running locally. The response is streamed back to the UI, including intermediate “sources” events so users can see which chunks were used. The prompts instruct the model to be helpful, answer based on provided excerpts, and admit when it doesn’t know.
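A rough sketch of such a LangGraph state graph, assuming the langgraph and langchain-ollama packages; the node bodies, prompt wording, and the "deepseek-r1" Ollama tag are illustrative, and `retriever` would come from the hybrid-retrieval sketch above:

```python
from typing import TypedDict

from langchain_core.documents import Document
from langchain_core.messages import BaseMessage
from langchain_ollama import ChatOllama
from langgraph.graph import END, START, StateGraph

llm = ChatOllama(model="deepseek-r1")    # DeepSeek-R1 served locally by Ollama
# retriever = build_retriever(chunks)    # assumed from the sketch above

class ChatState(TypedDict):
    question: str
    chat_history: list[BaseMessage]
    documents: list[Document]
    answer: str

def retrieve(state: ChatState) -> dict:
    return {"documents": retriever.invoke(state["question"])}

def generate(state: ChatState) -> dict:
    # Format retrieved chunks with their source file names into the prompt.
    excerpts = "\n\n".join(
        f"File: {doc.metadata.get('source', 'unknown')}\n{doc.page_content}"
        for doc in state["documents"]
    )
    messages = [
        ("system",
         "You are a helpful assistant. Answer only from the provided excerpts "
         "and say you don't know if they are insufficient.\n\n" + excerpts),
        *state["chat_history"],
        ("human", state["question"]),
    ]
    return {"answer": llm.invoke(messages).content}

builder = StateGraph(ChatState)
builder.add_node("retrieve", retrieve)
builder.add_node("generate", generate)
builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)
graph = builder.compile()
```

In practice the compiled graph's streaming interface would be used instead of a single `invoke`, which is how the UI can surface intermediate events such as the retrieved sources before the final answer arrives.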
A Streamlit app ties it together: users upload multiple files, the app builds and caches the chatbot, and then supports multi-turn Q&A over private documents. A test on a long Anthropic blog post demonstrates that chunking plus hybrid retrieval helps the model stay on-topic, while also returning sources for transparency. The author closes by suggesting tuning chunk size and other preprocessing choices to further improve answer quality.
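A rough outline of the Streamlit wiring; `ingest_files` and `build_graph` are hypothetical helpers standing in for the ingestion and graph-construction steps sketched above:

```python
import streamlit as st

st.title("Chat with your private documents")

files = st.file_uploader(
    "Upload PDF, Markdown, or text files", accept_multiple_files=True
)

if files:
    @st.cache_resource(show_spinner="Indexing documents...")
    def get_chatbot(file_names: tuple[str, ...]):
        # Cached per set of file names so Streamlit re-runs don't re-ingest and re-index.
        docs = ingest_files(files)   # hypothetical: load, chunk, contextualize
        return build_graph(docs)     # hypothetical: retriever + LangGraph chatbot

    chatbot = get_chatbot(tuple(sorted(f.name for f in files)))

    if question := st.chat_input("Ask a question about your files"):
        st.chat_message("user").write(question)
        with st.chat_message("assistant"):
            result = chatbot.invoke({"question": question, "chat_history": []})
            st.write(result["answer"])
```

A fuller version would keep the running chat history in st.session_state, stream the answer token by token, and render the retrieved source chunks alongside each reply.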
Cornell Notes
The build shows how to run a local RAG system that stays accurate on long documents by retrieving the right chunks instead of stuffing the model with everything. Files are ingested, split into overlapping 2,048-character chunks, and each chunk is contextualized with a short 2–3 sentence “what this chunk is about” summary generated locally using Llama 3.2 3B. Retrieval combines semantic embeddings (FastEmbed BAAI/bge-small-en-v1.5) and BM25 keyword search in an Ensemble Retriever (60% semantic, 40% BM25), then reranks candidates with Flashrank using ms-marco-MiniLM-L-6-v2 and keeps the top three chunks. DeepSeek-R1 then answers using only those chunks plus chat history, with streaming responses and displayed sources in a Streamlit UI.
- Why does long-document Q&A fail in basic local RAG, and what’s the remedy used here?
- What does “contextualized chunks” mean, and how is it implemented?
- How does the retrieval pipeline combine semantic search and keyword search?
- What role does reranking play, and which models are used?
- How does the chatbot maintain multi-turn context while staying grounded in retrieved text?
- What does the Streamlit UI provide beyond plain chat?
Review Questions
- If chunking is the main fix for long-document RAG, what additional step here helps retrieval beyond plain chunking (and why)?
- How do the weights (60% semantic, 40% BM25) and the reranker’s top-3 cutoff affect the quality and focus of the context passed to DeepSeek-R1?
- What information is included in the DeepSeek-R1 prompt besides retrieved chunks, and how does that support multi-turn conversations?
Key Points
1. Chunk long documents into overlapping 2,048-character segments to prevent irrelevant text from dominating answers.
2. Generate short 2–3 sentence “chunk context” using Llama 3.2 3B and prepend it to each chunk before indexing.
3. Use hybrid retrieval: semantic embeddings plus BM25 keyword scoring, combined via an Ensemble Retriever (60% semantic, 40% BM25).
4. Rerank retrieved candidates with Flashrank using ms-marco-MiniLM-L-6-v2 and keep only the top three chunks as model context.
5. Feed DeepSeek-R1 only the reranked chunks plus chat history, and stream responses back to the UI.
6. Build the chatbot workflow with LangGraph state so retrieval, prompt construction, and answer generation stay consistent across turns.
7. Expose retrieved “sources” in the Streamlit interface so users can verify which excerpts grounded each answer.