
End To End RAG Agent With DeepSeek-R1 And Ollama

Krish Naik · 4 min read

Based on Krish Naik's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

The RAG pipeline is fully local: PDFs are uploaded, embedded, indexed, retrieved, and answered without relying on cloud vector databases.

Briefing

An end-to-end Retrieval-Augmented Generation (RAG) app is built to answer questions from locally uploaded PDFs using DeepSeek-R1 running through Ollama, with embeddings generated on the same machine via Ollama’s embedding model. The core payoff is practical: once a PDF is uploaded, the system chunks the text, converts chunks into vectors stored in a local in-memory vector database, retrieves the most relevant chunks using cosine similarity, and feeds that context into an LLM to produce concise, factual answers, without sending document data to any cloud service.

The workflow starts with a Streamlit interface and a small set of functions that mirror the RAG pipeline. A PDF upload triggers a save step that writes the file into a local “document store” folder. The app then loads the PDF using PDFPlumber, splits the extracted text into chunks with a recursive character text splitter (chunk size and overlap are configured, with a “start index” enabled), and embeds each chunk using Ollama embeddings. Those embeddings are stored in an in-memory vector store (explicitly avoiding third-party or cloud vector databases), enabling fast similarity search during the chat session.
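
A minimal sketch of that ingestion path, assuming the langchain-community PDF loader, the langchain-ollama integration, and LangChain’s in-memory vector store; the folder name, chunk sizes, and model tag are illustrative, not taken from the video:

```python
import os

from langchain_community.document_loaders import PDFPlumberLoader
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

STORAGE_PATH = "document_store/pdfs/"  # local "document store" folder (illustrative)

embeddings = OllamaEmbeddings(model="deepseek-r1:1.5b")  # model tag is an assumption
vector_store = InMemoryVectorStore(embeddings)           # local, in-memory index

def save_uploaded_pdf(uploaded_file) -> str:
    """Persist a Streamlit upload into the local document store folder."""
    os.makedirs(STORAGE_PATH, exist_ok=True)
    path = os.path.join(STORAGE_PATH, uploaded_file.name)
    with open(path, "wb") as f:
        f.write(uploaded_file.getbuffer())
    return path

def index_pdf(path: str) -> None:
    """Load the PDF, split it into overlapping chunks, embed, and index them."""
    docs = PDFPlumberLoader(path).load()
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,       # illustrative values, not from the video
        chunk_overlap=200,
        add_start_index=True,  # the "start index" option mentioned above
    )
    vector_store.add_documents(splitter.split_documents(docs))
```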

For retrieval, a dedicated function performs similarity search over the in-memory vector store and returns the most relevant document chunks. Those chunks are concatenated into a context string that becomes the grounding material for generation. Prompting is handled through a chat prompt template that instructs the model to act as an expert research assistant: use the provided context to answer the user’s query, and if the answer isn’t present, respond with “I don’t know.” The response is constrained to be concise—maximum two or three sentences—aiming to keep answers tight and document-grounded.
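
In code, the retrieval step might look like the sketch below; the function names are illustrative, and `vector_store` is the InMemoryVectorStore built in the ingestion sketch above:

```python
def find_related_documents(query: str, k: int = 4):
    """Similarity search over the in-memory store; LangChain's
    InMemoryVectorStore scores matches by cosine similarity."""
    return vector_store.similarity_search(query, k=k)

def build_context(documents) -> str:
    """Concatenate the retrieved chunks into one grounding string."""
    return "\n\n".join(doc.page_content for doc in documents)
```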

On the generation side, the app uses LangChain components to chain together the prompt and the local LLM. The LLM is configured via Ollama’s LangChain integration (Ollama LLM), pointing to the DeepSeek-R1 1.5B model. The transcript emphasizes that DeepSeek-R1 is installed locally through Ollama (via an “ollama run” command), and that the entire pipeline runs on the user’s machine: data stays local, and the vector store is created and queried locally.
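
A sketch of that generation side, assuming langchain-ollama’s OllamaLLM class and LCEL piping; the model must already be available locally (e.g. via `ollama run deepseek-r1:1.5b`), and the prompt wording paraphrases the constraints described above rather than quoting the video:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import OllamaLLM

# Assumes the model was fetched beforehand with: ollama run deepseek-r1:1.5b
model = OllamaLLM(model="deepseek-r1:1.5b")

PROMPT_TEMPLATE = """You are an expert research assistant. Use the provided context
to answer the query. If the answer is not in the context, say "I don't know".
Answer in at most three sentences and stay factual.

Query: {user_query}
Context: {document_context}
Answer:"""

prompt = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)

def generate_answer(user_query: str, documents) -> str:
    """Ground the local LLM on the retrieved chunks and return its answer."""
    context = "\n\n".join(doc.page_content for doc in documents)
    chain = prompt | model  # LCEL: the formatted prompt flows into the local model
    return chain.invoke({"user_query": user_query, "document_context": context})
```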

After launching the Streamlit app, the user uploads a syllabus PDF (“Ultimate Data Science and Gen Bootcamp” batch). The system automatically builds embeddings and the vector index. Follow-up questions like “What are the prerequisites?” and “Can you summarize the entire curriculum?” are answered by retrieving relevant syllabus sections and generating responses grounded in that retrieved context. The result is described as highly accurate for document-specific Q&A, with the UI optionally supporting chat history and multiple questions against the same uploaded document set.
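
A minimal Streamlit wiring of those pieces might look like the following; it reuses the helper functions from the earlier sketches, and the chat widgets are an assumption about the UI rather than a transcription of it:

```python
import streamlit as st

st.title("Local RAG with DeepSeek-R1 and Ollama")

uploaded = st.file_uploader("Upload a PDF", type="pdf")
if uploaded:
    # save_uploaded_pdf / index_pdf / find_related_documents / generate_answer
    # come from the sketches above.
    index_pdf(save_uploaded_pdf(uploaded))
    st.success("Document indexed. Ask a question below.")

    question = st.chat_input("Ask about the uploaded document...")
    if question:
        with st.chat_message("user"):
            st.write(question)
        answer = generate_answer(question, find_related_documents(question))
        with st.chat_message("assistant"):
            st.write(answer)
```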

Cornell Notes

The app delivers a local, end-to-end RAG system for answering questions about user-uploaded PDFs. When a PDF is uploaded, it is saved to a local folder, loaded with PDFPlumber, split into chunks using a recursive character text splitter, and embedded with Ollama embeddings. The embeddings are stored in an in-memory vector store, enabling cosine-similarity retrieval of the most relevant chunks for each question. Those retrieved chunks are concatenated into context and passed to a locally running DeepSeek-R1 1.5B model via Ollama, guided by a prompt template that demands context-based, concise answers (or “I don’t know” if context is missing). This matters because it keeps document data local while still producing grounded answers.

What exactly happens after a user uploads a PDF in this RAG app?

The upload triggers a sequence: the file is saved into a local “document store” folder, then PDFPlumber loads the PDF text, and a recursive character text splitter breaks the text into chunks (with configured chunk size and overlap, and start index enabled). Those chunks are embedded using Ollama embeddings and indexed into an in-memory vector store. After indexing, the system is ready to retrieve relevant chunks for any user query.

How does the app retrieve the right parts of the document for a question?

A retrieval function runs similarity search over the in-memory vector store using cosine similarity. It compares the user query embedding against stored chunk embeddings and returns the most relevant document chunks. Those chunks become the “context” used for generation.
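
For intuition about what that similarity search computes: cosine similarity is just a length-normalized dot product between the query embedding and each chunk embedding. A toy NumPy illustration (not code from the video):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (a . b) / (||a|| * ||b||); 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_chunks(query_vec: np.ndarray, chunk_vecs: list, k: int = 4):
    """Return indices of the k chunks whose embeddings best match the query."""
    scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
    return np.argsort(scores)[::-1][:k]
```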

What controls the style and factuality of the answers?

A chat prompt template instructs the model to act as an expert research assistant, use only the provided context to answer, and respond with “I don’t know” if the answer isn’t in the context. It also limits responses to a maximum of two or three sentences, pushing the output toward concise, document-grounded answers.

Which local models and components power the pipeline?

Embeddings come from Ollama embeddings running locally. The language model is configured through Ollama’s LangChain integration (Ollama LLM) and points to DeepSeek-R1 1.5B. The app also uses LangChain prompt templating and chaining to connect retrieved context with the LLM call.

Why use an in-memory vector store here?

The design explicitly avoids third-party or cloud vector databases to keep everything local and simple. Storing vectors in an in-memory vector store means embeddings are created and queried on the same machine, supporting a fully local RAG workflow.

Review Questions

  1. How does chunking (chunk size, overlap, and start index) influence retrieval quality in this RAG setup?
  2. What prompt constraints are used to prevent the model from answering without supporting context?
  3. If the system returns “I don’t know,” what part of the pipeline is most likely failing to provide relevant context?

Key Points

  1. The RAG pipeline is fully local: PDFs are uploaded, embedded, indexed, retrieved, and answered without relying on cloud vector databases.
  2. PDFPlumber extracts text, and a recursive character text splitter converts it into chunks suitable for embedding and retrieval.
  3. Ollama embeddings generate vectors for each chunk, which are stored in an in-memory vector store for cosine-similarity search.
  4. A chat prompt template forces answers to rely on retrieved context and to return “I don’t know” when the context doesn’t contain the answer.
  5. DeepSeek-R1 1.5B runs locally via Ollama and is integrated through LangChain’s Ollama LLM for grounded generation.
  6. Streamlit provides the end-to-end user flow: upload PDF → build embeddings/vector index → ask questions → display answers.

Highlights

Uploading a PDF automatically triggers chunking, embedding, and indexing into a local in-memory vector store before any question is answered.
Retrieval uses cosine similarity over Ollama-generated embeddings to select the most relevant document chunks as context.
The prompt template enforces context-grounded responses and caps answers at two or three sentences.
DeepSeek-R1 1.5B is run locally through Ollama, keeping document data on the user’s machine.

Mentioned

  • RAG
  • LLM
  • S3
  • Ollama