
100% Local RAG with DeepSeek-R1, Ollama and LangChain - Build Document AI for Your Private Files

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

Chunk long documents into overlapping 2,048-character segments, then retrieve and rerank only the most relevant chunks, so irrelevant text doesn't dominate answers.

Briefing

A practical way to make local RAG work reliably on long documents is to retrieve the right text chunks—then feed only those chunks (plus chat history) into DeepSeek-R1. The build described here targets a common failure mode: when documents span multiple pages, answers can drift because the model gets overwhelmed by irrelevant context. The fix is a full local pipeline that ingests files, splits them into manageable chunks, retrieves the most relevant chunks for each question using a hybrid search strategy, reranks the candidates, and streams grounded answers through a Streamlit chat UI.

The system is organized into two main components: file ingestion and a chatbot. Ingestion takes uploaded files (PDF, Markdown, or text), converts them into a list of documents, and then splits each document into chunks using a recursive character text splitter configured for 2,048-character chunks with overlap. To improve retrieval quality, each chunk is “contextualized” by generating a short 2–3 sentence summary of what that chunk represents within the broader document. This contextualization is produced locally using a lightweight LLM (Llama 3.2 3B) and is prepended to the chunk text so the retriever and downstream model see more informative units.
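The splitting step can be sketched in plain Python. This is a minimal stand-in for LangChain's RecursiveCharacterTextSplitter (which additionally tries to break at separators such as paragraph boundaries); the function name and the overlap of 128 characters are illustrative assumptions, since the video specifies the 2,048-character chunk size but not the exact overlap:

```python
def chunk_text(text: str, chunk_size: int = 2048, overlap: int = 128) -> list[str]:
    """Split text into fixed-size chunks, where each chunk repeats the
    last `overlap` characters of the previous chunk so that sentences
    cut at a boundary still appear whole in one of the two chunks."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

The overlap trades a little index size for robustness: a fact straddling a chunk boundary is still retrievable from at least one chunk.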

Retrieval uses a hybrid approach combining semantic search and keyword search. Semantic retrieval is built from embeddings generated with FastEmbed’s “BAAI/bge-small-en-v1.5” model, then queried to return the top candidates. In parallel, a BM25 retriever estimates relevance using term frequency and keyword overlap. An Ensemble Retriever merges both signals with weights set to 60% semantic and 40% BM25. Because hybrid retrieval can still surface near-misses, a reranking step is applied: Flashrank reorders the top results using a local reranker model (ms-marco-MiniLM-L-6-v2), and the pipeline keeps only the top three chunks as context for the answer.
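The merge step can be illustrated with weighted reciprocal rank fusion, the scheme LangChain's Ensemble Retriever uses to combine ranked lists. The 60/40 weights come from the video; the smoothing constant `c = 60` is the common RRF default and is an assumption here:

```python
def fuse(semantic: list[str], keyword: list[str],
         weights: tuple[float, float] = (0.6, 0.4), c: int = 60) -> list[str]:
    """Weighted reciprocal rank fusion: each document's score is the
    weighted sum of 1 / (c + rank) over every ranked list it appears in,
    so documents ranked highly by both retrievers rise to the top."""
    scores: dict[str, float] = {}
    for ranked, weight in zip((semantic, keyword), weights):
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Rank-based fusion sidesteps the problem that embedding similarities and BM25 scores live on incompatible scales.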

The chatbot layer wraps this retrieval in a state graph (via LangGraph) that manages question, chat history, retrieved documents, and the final answer. For each user query, the graph retrieves relevant chunks, formats them into a prompt (including file names and chunk content), and sends everything—along with recent conversation history—to DeepSeek-R1 running locally. The response is streamed back to the UI, including intermediate “sources” events so users can see which chunks were used. The prompts instruct the model to be helpful, answer based on provided excerpts, and admit when it doesn’t know.

A Streamlit app ties it together: users upload multiple files, the app builds and caches the chatbot, and then supports multi-turn Q&A over private documents. A test on a long Anthropic blog post demonstrates that chunking plus hybrid retrieval helps the model stay on-topic, while also returning sources for transparency. The author closes by suggesting tuning chunk size and other preprocessing choices to further improve answer quality.

Cornell Notes

The build shows how to run a local RAG system that stays accurate on long documents by retrieving the right chunks instead of stuffing the model with everything. Files are ingested, split into overlapping 2,048-character chunks, and each chunk is contextualized with a short 2–3 sentence “what this chunk is about” summary generated locally using Llama 3.2 3B. Retrieval combines semantic embeddings (FastEmbed BAAI/bge-small-en-v1.5) and BM25 keyword search in an Ensemble Retriever (60% semantic, 40% BM25), then reranks candidates with Flashrank using ms-marco-MiniLM-L-6-v2 and keeps the top three chunks. DeepSeek-R1 then answers using only those chunks plus chat history, with streaming responses and displayed sources in a Streamlit UI.

Why does long-document Q&A fail in basic local RAG, and what’s the remedy used here?

Long files (multi-page Markdown/PDF) can cause the model to mix irrelevant sections, because too much text competes for attention. The remedy is chunking plus retrieval: documents are split into overlapping chunks, the system retrieves only the most relevant chunks for each query, and the model answers using those chunks plus recent chat history.

What does “contextualized chunks” mean, and how is it implemented?

Each chunk is augmented with a brief summary describing what that chunk represents in the context of the full document. Implementation-wise, the pipeline uses a lightweight local LLM (Llama 3.2 3B) with a prompt that asks for 2–3 sentences of relevant context, then prepends that generated context to the chunk text before indexing and retrieval.

How does the retrieval pipeline combine semantic search and keyword search?

Semantic search uses embeddings from FastEmbed’s BAAI/bge-small-en-v1.5 model to retrieve top candidates. Keyword search uses BM25 to score chunks based on term overlap. An Ensemble Retriever merges both with weights of 60% semantic and 40% BM25, producing a combined candidate set for later reranking.

What role does reranking play, and which models are used?

Reranking improves precision by ordering the retrieved candidates so the most relevant chunks are selected for the final prompt. Flashrank reranks the ensemble results using ms-marco-MiniLM-L-6-v2 and, by default in this setup, returns only three top chunks to feed into DeepSeek-R1.

How does the chatbot maintain multi-turn context while staying grounded in retrieved text?

A LangGraph state graph stores the user question, chat history messages, retrieved documents (chunks), and the final answer. For each turn, it retrieves new chunks for the current query, formats them into the prompt along with the system instructions, and includes the recent conversation history so follow-up questions remain coherent while answers stay tied to the retrieved excerpts.

What does the Streamlit UI provide beyond plain chat?

The UI supports uploading multiple private files (PDF, Markdown, text), builds the chatbot once and caches it, and streams responses. It also surfaces “sources” events—showing which retrieved chunks were used—so users can trace answers back to specific document excerpts.

Review Questions

  1. If chunking is the main fix for long-document RAG, what additional step here helps retrieval beyond plain chunking (and why)?
  2. How do the weights (60% semantic, 40% BM25) and the reranker’s top-3 cutoff affect the quality and focus of the context passed to DeepSeek-R1?
  3. What information is included in the DeepSeek-R1 prompt besides retrieved chunks, and how does that support multi-turn conversations?

Key Points

  1. Chunk long documents into overlapping 2,048-character segments to prevent irrelevant text from dominating answers.
  2. Generate a short 2–3 sentence “chunk context” using Llama 3.2 3B and prepend it to each chunk before indexing.
  3. Use hybrid retrieval: semantic embeddings plus BM25 keyword scoring, combined via an Ensemble Retriever (60% semantic, 40% BM25).
  4. Rerank retrieved candidates with Flashrank using ms-marco-MiniLM-L-6-v2 and keep only the top three chunks as model context.
  5. Feed DeepSeek-R1 only the reranked chunks plus chat history, and stream responses back to the UI.
  6. Build the chatbot workflow with LangGraph state so retrieval, prompt construction, and answer generation stay consistent across turns.
  7. Expose retrieved “sources” in the Streamlit interface so users can verify which excerpts grounded each answer.

Highlights

The pipeline’s accuracy on long files comes from retrieving a few relevant chunks—not from increasing model context length.
Contextualized chunks are created by asking Llama 3.2 3B to summarize what each chunk represents within the full document, then prepending that summary to the chunk text.
Hybrid retrieval (60% semantic / 40% BM25) plus reranking (top 3 via Flashrank) is the core mechanism that keeps answers focused.
DeepSeek-R1 runs locally and answers using retrieved excerpts and chat history, with streamed output and visible sources in Streamlit.

Mentioned

  • RAG
  • Ollama
  • BM25
  • UI
  • PDF
  • LLM