Build 100% Local AI Agent to Chat with Your Files | Private AI Knowledge Base with MCP & RAG
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A fully local “private knowledge base” agent can chat with a user’s own files by combining a custom MCP tool server with retrieval-augmented generation (RAG) and a Streamlit chat UI. The core idea is straightforward: keep documents on the machine, convert PDFs to text, expose file operations (list, read, summarize, and semantic search) as MCP tools, then force the language model to answer using those tools with source attribution. The result is an assistant that can navigate a local directory, pull relevant chunks, and respond with citations—without sending document content to a hosted database.
The setup starts with a Streamlit app where users upload documents into a local data directory. PDFs are converted into Markdown using a PDF-to-Markdown pipeline (the Docling library with a pypdfium backend). For embeddings, the system uses FastEmbed with the BGE-small embedding model and builds a lightweight semantic search over document chunks without a vector database. Instead of storing embeddings in a dedicated database, it computes chunk embeddings on the fly, measures similarity against the query, filters by a threshold (similarity > 0.1), and returns the most relevant chunks along with their document paths and relevance scores.
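The threshold-and-rank retrieval step can be sketched in plain Python. This is a minimal sketch: the real project computes embeddings with FastEmbed's BGE-small model, while here the embedding vectors are assumed to be precomputed inputs.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search_chunks(query_embedding, chunks, threshold=0.1, top_k=5):
    """Score every chunk against the query, drop anything at or below
    the similarity threshold, and return the best matches with their
    document path and relevance score."""
    scored = [
        {"path": c["path"], "text": c["text"],
         "score": cosine_similarity(query_embedding, c["embedding"])}
        for c in chunks
    ]
    hits = [c for c in scored if c["score"] > threshold]
    hits.sort(key=lambda c: c["score"], reverse=True)
    return hits[:top_k]
```

Because scores are recomputed for each query, nothing needs to be persisted between questions; the trade-off is a linear scan over all chunks, which is acceptable for a personal document collection.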
A custom MCP server exposes the tools the agent can call. The toolset includes: list files (returns metadata such as path, modified/created time, word and character counts, size, and extension), read file contents (returns full text, with a warning for large files), summarize file (uses a smaller Qwen3 model to produce a concise summary of three sentences or fewer), and a search tool that performs semantic chunking and retrieval. Chunking uses a semantic chunker that splits text at sentence boundaries based on internal metrics, aiming for coherent chunks when working with Markdown-like text. The search tool returns chunk objects containing the document path, chunk text, and a relevance score.
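The logic behind the list-files tool can be sketched independently of the MCP wiring. This assumes documents live as text or Markdown files under a single data directory; in the actual server, a function like this would be registered as an MCP tool.

```python
from datetime import datetime
from pathlib import Path

def list_files(data_dir: str) -> list[dict]:
    """Collect the metadata the list-files tool returns for each
    document: path, timestamps, word/character counts, size, and
    extension."""
    entries = []
    for path in sorted(Path(data_dir).rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        stat = path.stat()
        entries.append({
            "path": str(path),
            "modified": datetime.fromtimestamp(stat.st_mtime).isoformat(),
            # st_ctime is inode-change time on Linux, so "created" is
            # approximate on that platform
            "created": datetime.fromtimestamp(stat.st_ctime).isoformat(),
            "words": len(text.split()),
            "characters": len(text),
            "size_bytes": stat.st_size,
            "extension": path.suffix,
        })
    return entries
```

Returning compact metadata rather than file contents keeps the tool's output small, so the agent can survey the whole directory before deciding which files to read or summarize.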
On the agent side, LangGraph orchestrates tool calling. The system prompt positions the model as a personal knowledge manager that operates entirely locally and must search retrieved information before answering. It also demands a structured response: a direct answer, a list of sources/connections (including document paths), and suggested next steps or follow-up questions. The model configuration uses Qwen3 models: an 8-billion-parameter model for the agent and a 1.7-billion-parameter model for file summarization. The agent supports a configurable context window (noted up to 128K tokens for Qwen3 models).
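The tool-first, source-attributed behavior described above comes from the system prompt. A paraphrased sketch (not the video's exact wording) might look like:

```python
# Illustrative system prompt enforcing tool-first behavior and a
# structured, source-attributed response. Paraphrased for illustration.
SYSTEM_PROMPT = """You are a personal knowledge manager. You run entirely
locally and must never answer from memory alone: always call the search
tool (or the other file tools) first, and ground your answer in what
they return.

Structure every response as:
1. A direct answer to the question.
2. Sources/connections: the document paths the answer is based on.
3. Suggested next steps or follow-up questions.
"""
```

Putting the output structure in the prompt, rather than post-processing, lets the same instruction apply whether the model answered from a search result, a summary, or a full file read.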
The Streamlit UI streams the agent’s output token-by-token, including “thinking” blocks and explicit tool-call events. Users can refresh the knowledge base, browse a recursive file tree in the sidebar, upload new PDFs (which get converted to Markdown and added to the directory), and then ask questions. In testing, the agent first calls list files, then uses summarize file for targeted requests (e.g., summarizing an “about me” document), and can answer questions about code files by reading and summarizing their contents with cited paths.
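Rendering the stream so users can tell thinking, tool calls, and answer text apart reduces to dispatching on event type. The event shape below is an assumption for illustration; LangGraph's actual streaming API differs.

```python
def render_stream(events):
    """Fold a stream of agent events into display fragments, the way
    the UI distinguishes thinking blocks, tool-call events, and plain
    answer tokens. Event dicts here are a hypothetical shape."""
    fragments = []
    for event in events:
        kind = event["type"]
        if kind == "thinking":
            fragments.append(f"[thinking] {event['text']}")
        elif kind == "tool_call":
            fragments.append(f"[tool: {event['name']}]")
        elif kind == "token":
            fragments.append(event["text"])
    return fragments
```

In the Streamlit app, each fragment would be written to the chat container as it arrives, so the user sees tool calls interleaved with the growing answer.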
Overall, the build demonstrates a practical pattern for local AI assistants: MCP turns file operations into callable tools, RAG supplies relevant context via chunk similarity, and LangGraph enforces tool-first behavior and source-grounded answers—wrapped in a UI that makes the workflow usable for everyday document chat.
Cornell Notes
The project builds a fully local AI agent that can chat with a user’s own documents by combining a custom MCP tool server, RAG-style retrieval, and a Streamlit interface. A local MCP server exposes tools to list files, read full contents, summarize files (three sentences or fewer), and perform semantic search over chunked document text. Retrieval is implemented without a vector database: embeddings are computed with FastEmbed, similarity is calculated against query embeddings, and chunks above a threshold (similarity > 0.1) are returned with document paths and relevance scores. LangGraph orchestrates tool calling so the agent searches retrieved information before answering and provides source attribution. This matters because it keeps document content on-device while still enabling grounded Q&A over a personal file collection.
How does the system keep document chat “private” while still letting the model use tools?
What does “RAG” look like here if there’s no vector database?
Why use two Qwen3 model sizes in the same workflow?
What tools does the MCP server provide, and how do they shape the agent’s behavior?
How does the UI make tool use visible during chat?
What happens when a user uploads a PDF?
Review Questions
- What are the four MCP tools exposed by the server, and what specific data does each tool return?
- Explain how semantic search works in this project without a vector database, including the role of chunking and the similarity threshold.
- How does LangGraph enforce tool-first behavior and source attribution in the agent’s responses?
Key Points
1. Run a local MCP tool server that exposes file operations (list, read, summarize, search) so the agent can ground answers in on-device document data.
2. Convert PDFs to Markdown on upload so the retrieval pipeline can chunk and search clean text.
3. Implement RAG without a vector database by embedding chunks with FastEmbed, computing similarity to the query, and filtering by a threshold (similarity > 0.1).
4. Use semantic chunking at sentence boundaries to improve retrieval quality for Markdown-like documents.
5. Orchestrate tool calling with LangGraph and a system prompt that requires searching retrieved information before answering.
6. Stream responses in Streamlit while rendering tool-call events and “thinking” blocks to make the agent’s actions transparent.
7. Use a larger Qwen3 model for the agent and a smaller one for file summarization to balance quality and speed.
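The sentence-boundary chunking in the key points above can be sketched as a greedy packer. The project's semantic chunker also uses internal coherence metrics; this simplification only enforces a character budget while never splitting mid-sentence.

```python
import re

def chunk_by_sentences(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack sentences into chunks, never splitting a sentence
    across chunks. A simplified stand-in for the semantic chunker: it
    respects sentence boundaries but uses a size cap rather than
    embedding-based coherence to decide where chunks end."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip() if current else sentence
    if current:
        chunks.append(current)
    return chunks
```

Keeping sentences intact means each retrieved chunk reads as coherent prose, which helps both the similarity scoring and the quality of the quoted context the agent cites.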