Build 100% Local AI Agent to Chat with Your Files | Private AI Knowledge Base with MCP & RAG

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Run a local MCP tool server that exposes file operations (list, read, summarize, search) so the agent can ground answers in on-device document data.

Briefing

A fully local “private knowledge base” agent can chat with a user’s own files by combining a custom MCP tool server with retrieval-augmented generation (RAG) and a Streamlit chat UI. The core idea is straightforward: keep documents on the machine, convert PDFs to text, expose file operations (list, read, summarize, and semantic search) as MCP tools, then force the language model to answer using those tools with source attribution. The result is an assistant that can navigate a local directory, pull relevant chunks, and respond with citations—without sending document content to a hosted database.

The setup starts with a Streamlit app where users upload documents into a local data directory. PDFs are converted into Markdown using a PDF-to-Markdown pipeline (via the Docling library and pypdfium). For embeddings, the system uses FastEmbed (the BGE small embedding model) and builds a lightweight semantic search over document chunks without a vector database. Instead of storing embeddings in a dedicated database, it computes chunk embeddings on the fly, measures similarity against the query, filters by a threshold (similarity > 0.1), and returns the most relevant chunks along with their document paths and relevance scores.
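
As a rough illustration of that retrieval step, here is a minimal sketch assuming FastEmbed's BGE small model and a simple cosine-similarity scorer; the helper names and the `Chunk` container are illustrative, not the project's exact code:

```python
from dataclasses import dataclass

import numpy as np
from fastembed import TextEmbedding

# BGE small is the embedding model named in the video; loaded once at startup.
embedder = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")

@dataclass
class Chunk:
    document_path: str
    text: str
    score: float

def search_chunks(query: str, documents: dict[str, list[str]], threshold: float = 0.1) -> list[Chunk]:
    """Embed chunks on the fly, score them against the query, keep matches above the threshold."""
    query_vec = next(iter(embedder.embed([query])))
    results: list[Chunk] = []
    for path, chunks in documents.items():
        for text, vec in zip(chunks, embedder.embed(chunks)):
            # Cosine similarity between the query and this chunk.
            score = float(np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
            if score > threshold:
                results.append(Chunk(document_path=path, text=text, score=score))
    return sorted(results, key=lambda c: c.score, reverse=True)
```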

A custom MCP server exposes the tools the agent can call. The toolset includes: list files (returns metadata such as path, modified/created time, word and character counts, size, and extension), read file contents (returns full text, with the caveat that very large files can be problematic), summarize file (uses a smaller Qwen 3 model to produce a concise summary of three sentences or less), and a search tool that performs semantic chunking and retrieval. Chunking uses a semantic chunker that splits text at sentence boundaries based on internal metrics, aiming for coherent chunks when working with Markdown-like text. The search tool returns chunk objects containing the document path, chunk text, and a relevance score.
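
A tool server along these lines could be written with the MCP Python SDK's FastMCP helper. This is a hedged sketch, not the video's code: the server name, data directory, and the two tools shown are assumptions, and the summarize and search tools would follow the same decorator pattern.

```python
from pathlib import Path

from mcp.server.fastmcp import FastMCP

DATA_DIR = Path("data")  # assumed location of the local knowledge base
mcp = FastMCP("knowledge-base")

@mcp.tool()
def list_files() -> list[dict]:
    """List every Markdown document with path, size, timestamps, and word/character counts."""
    files = []
    for path in DATA_DIR.rglob("*.md"):
        text = path.read_text(encoding="utf-8")
        stat = path.stat()
        files.append({
            "path": str(path),
            "extension": path.suffix,
            "size_bytes": stat.st_size,
            "modified": stat.st_mtime,
            "created": stat.st_ctime,
            "words": len(text.split()),
            "characters": len(text),
        })
    return files

@mcp.tool()
def read_file(path: str) -> str:
    """Return the full text of a document (large files may exceed the model's context)."""
    return Path(path).read_text(encoding="utf-8")

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```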

On the agent side, LangGraph orchestrates tool calling. The system prompt positions the model as a personal knowledge manager that operates entirely locally and must search retrieved information before answering. It also demands a structured response: a direct answer, a list of sources/connections (including document paths), and suggested next steps or follow-up questions. The model configuration uses Qwen 3 models: an 8 billion parameter model for the agent and a 1.7 billion parameter model for file summarization. The agent supports a configurable context window (noted up to 128K tokens for Qwen 3 models).
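
A minimal agent sketch under stated assumptions: LangGraph's prebuilt ReAct agent, a Qwen 3 8B model served locally through Ollama, and a paraphrased system prompt (the `prompt` keyword is named `state_modifier` in older LangGraph releases, and the MCP tool loading is stubbed out so the snippet stays self-contained):

```python
from langchain_ollama import ChatOllama
from langgraph.prebuilt import create_react_agent

SYSTEM_PROMPT = (
    "You are a personal knowledge manager running entirely on the user's machine. "
    "Always search the knowledge base tools before answering. "
    "Answer with: a direct answer, sources with document paths, and suggested next steps."
)

# In the real app these tools come from the MCP server (e.g. via an MCP client adapter);
# stubbed out here so the sketch runs on its own.
mcp_tools: list = []

model = ChatOllama(model="qwen3:8b", num_ctx=32_768)  # context size is configurable
agent = create_react_agent(model, tools=mcp_tools, prompt=SYSTEM_PROMPT)

result = agent.invoke({"messages": [("user", "Summarize my 'about me' document.")]})
print(result["messages"][-1].content)
```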

The Streamlit UI streams the agent’s output token-by-token, including “thinking” blocks and explicit tool-call events. Users can refresh the knowledge base, browse a recursive file tree in the sidebar, upload new PDFs (which get converted to Markdown and added to the directory), and then ask questions. In testing, the agent first calls list files, then uses summarize file for targeted requests (e.g., summarizing an “about me” document), and can answer questions about code files by reading and summarizing their contents with cited paths.
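
A rough sketch of such a chat loop in Streamlit, assuming the agent is wrapped in a hypothetical `run_agent_stream` helper that yields typed events for tool calls and content tokens; the event shape is illustrative, not the video's exact structure:

```python
import streamlit as st

def run_agent_stream(prompt: str):
    """Placeholder for the real agent stream; the actual app would wrap agent.stream(...)."""
    yield {"type": "tool_call", "name": "list_files", "args": {}}
    for token in ["Here ", "is ", "a ", "grounded ", "answer."]:
        yield {"type": "content", "token": token}

st.title("Private Knowledge Base")

if prompt := st.chat_input("Ask about your documents"):
    st.chat_message("user").markdown(prompt)
    with st.chat_message("assistant"):
        answer = st.empty()
        text = ""
        for event in run_agent_stream(prompt):
            if event["type"] == "tool_call":
                # Surface the tool name and arguments as they arrive.
                st.caption(f"Calling {event['name']} with {event['args']}")
            elif event["type"] == "content":
                text += event["token"]
                answer.markdown(text)  # re-render the growing answer as markdown
```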

Overall, the build demonstrates a practical pattern for local AI assistants: MCP turns file operations into callable tools, RAG supplies relevant context via chunk similarity, and LangGraph enforces tool-first behavior and source-grounded answers, all wrapped in a UI that makes the workflow usable for everyday document chat.

Cornell Notes

The project builds a fully local AI agent that can chat with a user’s own documents by combining a custom MCP tool server, RAG-style retrieval, and a Streamlit interface. A local MCP server exposes tools to list files, read full contents, summarize files (three sentences or less), and perform semantic search over chunked document text. Retrieval is implemented without a vector database: embeddings are computed with FastEmbed, similarity is calculated against query embeddings, and chunks above a threshold (similarity > 0.1) are returned with document paths and relevance scores. LangGraph orchestrates tool calling so the agent searches retrieved information before answering and provides source attribution. This matters because it keeps document content on-device while still enabling grounded Q&A over a personal file collection.

How does the system keep document chat “private” while still letting the model use tools?

Privacy comes from running everything locally: documents live in a local data directory, and a custom MCP server exposes file operations as tools. The agent uses MCP tool calling to list files, read contents, summarize documents, and search chunks—so the model’s responses are grounded in tool outputs generated on the user’s machine rather than relying on external hosted document stores.

What does “RAG” look like here if there’s no vector database?

Retrieval is done in-process. The system chunks documents (using a semantic chunker that splits at sentence boundaries), embeds each chunk with FastEmbed (the BGE small embedding model), embeds the user query, then computes the similarity between query and chunk embeddings. Chunks with similarity above 0.1 are selected, and the agent receives a list of chunk objects containing document path, chunk text, and a relevance score.
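
The video relies on a dedicated semantic chunker; as a simplified stand-in, a sentence-boundary chunker might look like the following sketch (purely illustrative, not the project's chunking logic):

```python
import re

def sentence_chunks(text: str, max_chars: int = 800) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```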

Why use two Qwen 3 model sizes in the same workflow?

The agent primarily runs on a larger Qwen 3 model (8B) for reasoning and tool orchestration, while file summarization uses a smaller Qwen 3 model (1.7B). This keeps summarization cheaper and faster while still producing concise outputs. The summarization prompt is constrained to “three sentences or less” and includes a “no think” instruction to avoid verbose reasoning in the summary output.
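
A hedged sketch of that summarization call, assuming the 1.7B model is served through Ollama and that Qwen 3's "/no_think" soft switch is what suppresses the reasoning trace; the prompt wording is paraphrased, not the video's exact prompt:

```python
from langchain_ollama import ChatOllama

summarizer = ChatOllama(model="qwen3:1.7b", temperature=0)

def summarize_file(text: str) -> str:
    """Produce a short summary with the smaller model, with thinking disabled."""
    prompt = (
        "/no_think\n"
        "Summarize the following document in three sentences or less:\n\n"
        f"{text}"
    )
    return summarizer.invoke(prompt).content
```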

What tools does the MCP server provide, and how do they shape the agent’s behavior?

The MCP server provides: (1) list files with metadata (path, timestamps, word/character counts, size, extension), (2) read file contents (full text, which can be problematic for large files), (3) summarize file (3 sentences or less), and (4) search (semantic chunking + similarity scoring returning chunk objects). The agent is nudged to call list files early, then use summarize or search to answer questions with explicit source attribution.

How does the UI make tool use visible during chat?

Streamlit streams the agent’s output and distinguishes chunk types such as thinking start/end, content, and tool calls. When the model triggers a tool call, the UI renders the tool name and arguments (e.g., the file path or search parameters) as they arrive, while the assistant’s final answer is streamed as markdown.

What happens when a user uploads a PDF?

The app converts the uploaded PDF into Markdown using a PDF-to-Markdown pipeline (via the Docling library and pypdfium). It then saves the converted .md file into the data directory (replacing the PDF extension) so the agent can index and chat with the text version.
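
A minimal sketch of that conversion step, assuming Docling's `DocumentConverter` API; the function name and data directory are illustrative:

```python
from pathlib import Path

from docling.document_converter import DocumentConverter

def convert_pdf_to_markdown(pdf_path: Path, data_dir: Path = Path("data")) -> Path:
    """Convert an uploaded PDF to Markdown and save it into the local data directory."""
    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    markdown = result.document.export_to_markdown()
    out_path = data_dir / pdf_path.with_suffix(".md").name
    out_path.write_text(markdown, encoding="utf-8")
    return out_path
```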

Review Questions

  1. What are the four MCP tools exposed by the server, and what specific data does each tool return?
  2. Explain how semantic search works in this project without a vector database, including the role of chunking and the similarity threshold.
  3. How does Langgraph enforce tool-first behavior and source attribution in the agent’s responses?

Key Points

  1. Run a local MCP tool server that exposes file operations (list, read, summarize, search) so the agent can ground answers in on-device document data.
  2. Convert PDFs to Markdown on upload so the retrieval pipeline can chunk and search clean text.
  3. Implement RAG without a vector database by embedding chunks with FastEmbed, computing similarity to the query, and filtering by a threshold (similarity > 0.1).
  4. Use semantic chunking at sentence boundaries to improve retrieval quality for Markdown-like documents.
  5. Orchestrate tool calling with LangGraph and a system prompt that requires searching retrieved information before answering.
  6. Stream responses in Streamlit while rendering tool-call events and “thinking” blocks to make the agent’s actions transparent.
  7. Use a larger Qwen 3 model for the agent and a smaller Qwen 3 model for file summarization to balance quality and speed.

Highlights

The assistant keeps documents on-device by routing all file access through a local MCP server and tool calling.
Semantic search is implemented without a vector database: embeddings are computed and compared in-process, then filtered by similarity > 0.1.
The agent’s response format is designed for grounded answers: direct answer first, then sources/connections tied to document paths.
PDF uploads are converted to Markdown automatically, turning raw documents into searchable knowledge base entries.
LangGraph enables a clean separation between model reasoning and tool execution, while Streamlit streams both tool calls and final text.

Topics

Mentioned

  • MCP
  • RAG
  • UI
  • HTTP