
100% Local CAG with Qwen3, Ollama and LangChain - AI Chatbot for Your Private Documents

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

CAG can replace RAG for private-document chat when the entire knowledge base fits in the model’s context window and stays mostly static.

Briefing

Cache-augmented generation (CAG) is presented as a simpler alternative to retrieval-augmented generation (RAG) for private-document chat: instead of searching a database at question time, the system preloads your documents into the model’s context window and reuses that cached prefix during the conversation. The payoff is speed and cost efficiency—prompt caching can make subsequent turns cheaper—while keeping everything local when paired with open models and local inference.

The core idea traces back to the paper “Don’t Do RAG: When Cache-Augmented Generation Is All You Need for Knowledge Tasks.” The transcript argues the approach is increasingly practical because modern large language models accept very large context windows, citing examples such as Gemini 2.5, OpenAI o3, and Qwen3 with 100K up to 1M tokens. In that setup, a knowledge base (PDFs, text, or URLs) is converted into content that can be inserted up front, then cached by the model provider or local runtime. The speaker draws parallels to how commercial APIs do prompt caching: a prefix of tokens is cached and reused across later messages.

A key comparison is made between CAG and RAG. RAG depends heavily on retriever quality (chunking, indexing, and whether the right passages are retrieved). CAG depends on the model’s ability to handle long contexts and “search effectively” within them. That’s why the transcript emphasizes evaluation beyond context length alone, pointing to a benchmark for long-context comprehension (FictionBench) that tests whether models can answer questions that reference details appearing early in long documents. The results mentioned highlight that some models handle context better than others, and that even open models like Qwen (the one used in the build) may lag behind top closed models such as Gemini 2.5 Pro or OpenAI o3.

To decide when CAG is appropriate, the transcript offers heuristics: first, check whether the document content fits within the model’s context window with room for chat history. If the knowledge is mostly static—like a fixed FAQ plus a few documents—CAG can avoid the overhead of retrieval. If the knowledge changes dynamically per user (e.g., purchase history), RAG-style retrieval may be more suitable because the injected context must vary.
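To make the first heuristic concrete, here is a minimal fit check (a sketch, not from the video) that uses the rough four-characters-per-token approximation to test whether the converted documents plus a chat-history budget fit a given context window:

```python
# Rough CAG fit check: do the converted documents, plus a budget for chat
# history, fit inside the model's context window?

def estimate_tokens(text: str) -> int:
    # Crude approximation (~4 characters per token); a real tokenizer is more accurate.
    return len(text) // 4

def fits_in_context(documents: list[str], context_window: int, history_budget: int = 4_096) -> bool:
    doc_tokens = sum(estimate_tokens(doc) for doc in documents)
    return doc_tokens + history_budget <= context_window

# Example: a small FAQ plus a few converted PDFs against a 32K-token window.
documents = ["# FAQ\n...", "# Annual report\n..."]
print(fits_in_context(documents, context_window=32_768))
```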

The implementation is then built as a fully local chatbot using Qwen3 via Ollama and LangChain, with a Streamlit UI. Documents can be uploaded as PDF, text, or Markdown, or added by URL. Docling is used to convert PDFs and web content into Markdown, and the app stores each converted “knowledge source” in memory as a structured object (ID, name/type, and content). The chatbot constructs a prompt template that injects the cached document context plus the user question, then streams the model output. For Qwen3 specifically, the transcript notes the model’s special “thinking” tokens, and the app renders that intermediate reasoning in an expandable UI section.
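The video walks through its own code; the sketch below only illustrates the same flow under stated assumptions: Docling converts a hypothetical PDF to Markdown, a LangChain prompt template puts that Markdown at the front of the prompt so the runtime can reuse the prefix, and the answer is streamed from Qwen3 through Ollama (requires the docling and langchain-ollama packages).

```python
from docling.document_converter import DocumentConverter
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama

# Convert a PDF (or URL) into Markdown that can be injected into the prompt.
converter = DocumentConverter()
markdown = converter.convert("apple-financial-statement.pdf").document.export_to_markdown()

# Keep the document content at the start of the prompt so Ollama can cache
# and reuse that prefix across conversation turns.
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the following documents:\n\n{context}"),
    ("human", "{question}"),
])

llm = ChatOllama(model="qwen3", num_ctx=32_768)  # context size is illustrative; tune per machine
chain = prompt | llm

for chunk in chain.stream({"context": markdown, "question": "What were total net sales?"}):
    print(chunk.content, end="", flush=True)
```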

A small test demonstrates how Ollama’s internal caching behaves by enabling debug logging and observing cache slot usage across turns. In the final demo, the system answers questions about an Apple financial statement PDF and cites numbers from tables, with responses appearing fast after the initial cached context is established. The result is a local, end-to-end CAG workflow that trades retrieval complexity for long-context caching—when the documents are static and the model can reliably reason over large inputs.

Cornell Notes

Cache-augmented generation (CAG) is presented as a way to chat with private documents without retrieval at question time. The system preloads document content into the model’s context window, then relies on prompt caching so later turns reuse the cached prefix. CAG is most appropriate when documents are mostly static and fit within the model’s context window; it can be less reliable when knowledge must change dynamically per user or when the model struggles with long-context comprehension. The build uses Qwen3 through Ollama, LangChain prompt templating, Docling for converting PDFs/URLs to Markdown, and a Streamlit interface that streams answers and can display Qwen3 “thinking” chunks. The demo shows fast follow-up responses and table-based numeric answers from an uploaded Apple financial statement PDF.

What makes CAG different from RAG for document chat?

CAG precomputes a knowledge cache by inserting the document content into the model’s context up front, then reuses that cached context across turns. RAG instead retrieves relevant chunks from an external store at question time, so performance depends on retriever quality and chunking. In the transcript’s framing, CAG shifts the burden from retrieval to the model’s ability to work over long contexts and to the runtime’s prompt caching behavior.
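A simplified contrast of the two prompt shapes (a sketch, not the video’s code): CAG front-loads the entire knowledge base as a stable prefix, while RAG injects only whatever a retriever picked for the current question.

```python
# CAG: the whole knowledge base goes in up front and stays identical every turn,
# which is what makes the prefix cacheable.
def build_cag_prompt(documents: list[str], question: str) -> str:
    context = "\n\n".join(documents)
    return f"Documents:\n{context}\n\nQuestion: {question}"

# RAG: only retrieved chunks are injected, so the context varies per question
# and quality hinges on chunking and retrieval rather than long-context reasoning.
def build_rag_prompt(retrieved_chunks: list[str], question: str) -> str:
    context = "\n\n".join(retrieved_chunks)
    return f"Relevant passages:\n{context}\n\nQuestion: {question}"
```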

When should someone choose CAG over RAG?

The heuristics given are: (1) the document content must fit within the model’s context window (e.g., a small FAQ plus a few PDFs can work without retrieval), (2) the knowledge base should remain relatively static so the cache doesn’t need frequent rebuilding, and (3) the model should be strong enough to use the full context effectively. If user-specific data changes each turn (like dynamic purchase information), the transcript suggests RAG-style retrieval is safer.

Why isn’t “context window size” alone enough to predict success?

The transcript points to a long-context comprehension benchmark (FictionBench) that tests whether a model can answer questions about details introduced early in long text and referenced later. The takeaway is that models can have large context windows yet still fail to retrieve the needed information internally. It also notes that benchmark coverage may stop before the largest advertised context lengths (e.g., Gemini 2.5 Pro’s 1M-token window versus a benchmark ending around 120K).

How does the local chatbot implementation build the knowledge cache?

Docling converts uploaded PDFs and web content into Markdown, and the app stores each converted document as a “knowledge source” object containing an ID, name/type (PDF/URL/document), and the converted content. When the user asks a question, the app formats a prompt template that injects the selected knowledge source content plus the chat history into the model input, enabling the cached prefix behavior in Ollama.
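A minimal sketch of such an in-memory knowledge source and the context-formatting step; the field and function names here are assumptions rather than the video’s exact code.

```python
from dataclasses import dataclass
from uuid import uuid4

@dataclass
class KnowledgeSource:
    """One converted document held in memory and injected into the prompt."""
    id: str
    name: str
    kind: str      # "pdf", "url", or "document"
    content: str   # Markdown produced by Docling

def make_source(name: str, kind: str, content: str) -> KnowledgeSource:
    return KnowledgeSource(id=str(uuid4()), name=name, kind=kind, content=content)

def format_context(sources: list[KnowledgeSource]) -> str:
    # Concatenate the selected sources into the {context} slot of the prompt template.
    return "\n\n".join(f"## {source.name}\n{source.content}" for source in sources)
```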

What role do Qwen3 “thinking” tokens play in the UI?

The transcript describes chunk types for Qwen3 outputs, including start-of-thinking, thinking, and end-of-thinking markers. The app streams chunks and uses these markers to decide whether to render the intermediate reasoning in an expander or append it to the final answer text. This is why the UI can show an expandable “thinking” section while still streaming the response.
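A simplified version of that routing logic, assuming the reasoning arrives wrapped in Qwen3’s <think>...</think> tags within the streamed text (the video’s actual chunk markers may differ):

```python
def split_thinking(chunks) -> tuple[str, str]:
    """Separate streamed text into (thinking, answer), assuming the model
    wraps its reasoning in <think>...</think> tags."""
    thinking, answer = [], []
    in_thinking = False
    for text in chunks:
        if "<think>" in text:
            in_thinking = True
            text = text.replace("<think>", "")
        if "</think>" in text:
            before, _, after = text.partition("</think>")
            thinking.append(before)
            answer.append(after)
            in_thinking = False
            continue
        (thinking if in_thinking else answer).append(text)
    return "".join(thinking), "".join(answer)

# In the Streamlit UI the thinking text would go into an st.expander,
# while the answer text streams into the normal chat message.
```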

How is caching verified during development?

A minimal test enables Ollama debug logging and sets caching-related variables to true. By inspecting logs, the transcript shows cache slot usage changing across turns (e.g., cache slot counts increasing as the conversation proceeds), indicating that a large portion of the prompt context is being reused rather than recomputed from scratch.
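The video reads the server logs directly; as an indirect check, a sketch like the one below times two turns that share the same long document prefix, where a noticeably faster second turn suggests the prefix was served from cache rather than re-processed (assumes a local Ollama server with a Qwen3 model pulled and the langchain-ollama package).

```python
import time
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama

llm = ChatOllama(model="qwen3", num_ctx=32_768)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only these documents:\n\n{context}"),
    ("human", "{question}"),
])
chain = prompt | llm

long_context = open("knowledge.md").read()  # hypothetical converted knowledge base

for question in ["What was total revenue?", "And net income?"]:
    start = time.perf_counter()
    chain.invoke({"context": long_context, "question": question})
    print(f"{question!r} answered in {time.perf_counter() - start:.1f}s")
# A much faster second answer indicates the shared document prefix was reused.
```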

Review Questions

  1. What conditions make CAG a better fit than RAG, and what failure mode does CAG risk if the model can’t use long context effectively?
  2. How does the system convert PDFs and URLs into a form suitable for prompt injection, and where is that content stored in the app?
  3. During streaming, how does the app distinguish between Qwen3 “thinking” chunks and final content, and how does that affect what the user sees?

Key Points

  1. CAG can replace RAG for private-document chat when the entire knowledge base fits in the model’s context window and stays mostly static.

  2. Prompt caching is the main efficiency lever: once the document content is cached as a prefix, later turns can reuse it at lower cost and faster latency.

  3. CAG performance depends more on long-context comprehension and internal “search” than on retriever quality, unlike RAG, which depends on chunk retrieval.

  4. Long-context benchmarks (e.g., FictionBench) show that context window size alone doesn’t guarantee the model can answer questions about early details.

  5. The local build uses Qwen3 via Ollama, Docling for converting PDFs/URLs to Markdown, LangChain for prompt templating, and Streamlit for a chat UI.

  6. The app supports adding and removing multiple knowledge sources (PDF, text, Markdown, and URL) and streams responses while optionally displaying Qwen3 “thinking” chunks.

Highlights

CAG’s core trade: it avoids retrieval overhead by front-loading documents into the model context, then relies on prompt caching to make follow-up questions cheaper and faster.
A practical decision rule is whether your knowledge is static and fits within the context window; dynamic, user-specific facts push you toward RAG.
The build demonstrates caching behavior by enabling Ollama debug logs and observing cache slot usage across turns.
The Streamlit UI renders Qwen3 “thinking” using explicit thinking start/end chunk markers while streaming the final answer.
Docling-based conversion lets the system treat PDFs and URLs as Markdown knowledge sources that can be injected into prompts.

Topics

  • Cache Augmented Generation
  • Prompt Caching
  • Long-Context Comprehension
  • Local Document Chatbot
  • Ollama Qwen3
