
How to Chat With Your Data in Private Without Internet (Using MPT-30B Open-Source LLM)

Chat with data · 5 min read

Based on Chat with data's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.

TL;DR

A private document chatbot can be built by running embeddings, vector retrieval, and LLM generation entirely on local hardware rather than sending content to third-party APIs.

Briefing

Local document chat is possible without sending sensitive text to third parties: the approach swaps closed APIs for an all-open pipeline built around the MPT-30B open-source LLM. The core payoff is privacy: embeddings, retrieval, and generation can all run on a user’s own machine, reducing data-leak risk and avoiding reliance on external services.

The workflow starts the same way as typical “chat with your PDFs” systems: documents are split into chunks, converted into embeddings (numeric vectors), and stored in a local vector database. When a user asks a question, the question is embedded, a similarity search pulls the most relevant chunks from the local store, and those retrieved passages are fed into the language model alongside the prompt to produce an answer. In the demo, the system identifies an earnings-call transcript (Q1 2024) and answers questions with cited source chunks, showing that retrieval grounding works even when the model runs locally.
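
As a rough illustration of that flow, here is a minimal Python sketch assuming the sentence-transformers and chromadb packages; the embedding model, file names, and naive fixed-size chunking are placeholders rather than the repo’s actual code:

    # Minimal sketch of the local retrieval flow, assuming the
    # sentence-transformers and chromadb packages. Model name, file
    # names, and the naive chunking below are illustrative only.
    from sentence_transformers import SentenceTransformer
    import chromadb

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small local embedding model
    client = chromadb.PersistentClient(path="db")       # on-disk local vector store
    collection = client.get_or_create_collection("docs")

    # 1. Chunk the document and store its embeddings locally.
    text = open("earnings_call.txt").read()             # placeholder source document
    chunks = [text[i:i + 500] for i in range(0, len(text), 500)]
    collection.add(
        ids=[str(i) for i in range(len(chunks))],
        documents=chunks,
        embeddings=embedder.encode(chunks).tolist(),
    )

    # 2. Embed the question and pull back the most similar chunks.
    question = "What was revenue in Q1 2024?"
    hits = collection.query(
        query_embeddings=embedder.encode([question]).tolist(),
        n_results=4,  # the "target source chunks" knob
    )
    context = "\n\n".join(hits["documents"][0])

    # 3. The retrieved context plus the question form the prompt that
    #    would be sent to the local LLM (MPT-30B in the transcript).
    prompt = f"Answer using the context.\n\nContext:\n{context}\n\nQuestion: {question}"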

A major theme is why closed-model setups are a poor fit for private or regulated use. The transcript lists five recurring problems: data leakage concerns when sensitive IP or customer information is sent to an API; limited customization because closed models and pricing are controlled by vendors; the need for constant internet (or high connectivity costs) in regions with limited access; server overload and outages that can block usage; and shifting quality or usage limits after policy changes, plus vendor lock-in.

To address those issues, the pipeline replaces each closed component with an open alternative. Sentence Transformers turn text into embeddings. Chroma DB serves as the local vector store. Most importantly, the language model is swapped from a hosted API to an on-device model: MosaicML’s MPT-30B, distributed under a commercially usable license, is run via ggml tooling (with a downloadable ggml build hosted on Hugging Face). Once the model is downloaded, the system can answer questions without internet access and without exporting document content.
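
For the generation step, one common way to run a ggml build of MPT from Python is the ctransformers library. The sketch below assumes that route and a placeholder local file path; the transcript’s repo may wire this up differently:

    # Hedged sketch: running a ggml build of MPT locally with the
    # ctransformers library (one common route for ggml models in
    # Python). The file path is a placeholder; the transcript's repo
    # downloads the ~19GB model via its own script.
    from ctransformers import AutoModelForCausalLM

    llm = AutoModelForCausalLM.from_pretrained(
        "models/mpt-30b.ggml.bin",  # assumed path to the downloaded model
        model_type="mpt",
    )
    # Generation is fully on-device: no internet needed after download.
    print(llm("Summarize the key risks in this report:", max_new_tokens=128))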

The transcript also gets practical, walking through a repo-based setup with scripts for ingestion and chatting. Users clone the project, ingest files from a “source documents” folder (PDF, TXT, CSV, DOC, etc.), and generate a local DB folder containing embeddings. A separate script downloads the model into a local models directory (noted as about 19GB). Environment variables configure the persist directory, model path, embedding model name, and retrieval parameters such as how many chunks to pull back (“target source chunks”).
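
Read in code, that configuration might look roughly like the following; the variable names are assumptions modeled on similar local document-QA projects, not verified against the actual repo:

    # Hedged sketch of the environment-driven configuration described
    # above; variable names are assumptions modeled on similar local
    # document-QA projects, not verified against the actual repo.
    import os

    persist_directory = os.environ.get("PERSIST_DIRECTORY", "db")
    model_path = os.environ.get("MODEL_PATH", "models/mpt-30b.ggml.bin")  # ~19GB file
    embeddings_model_name = os.environ.get("EMBEDDINGS_MODEL_NAME", "all-MiniLM-L6-v2")
    target_source_chunks = int(os.environ.get("TARGET_SOURCE_CHUNKS", "4"))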

Performance is the trade-off. Answers can take minutes when retrieval and context injection are involved, while simpler chat-only queries may return in tens of seconds. The transcript emphasizes that retrieval settings strongly affect latency and that scope management matters—asking for too many chunks slows things down. Hardware guidance is explicit: at least 32GB RAM is recommended for MPT-30B, with Docker suggested as an easier route.

Finally, licensing is flagged: the “chat” version of the MPT model is non-commercial, while the base and instruct variants can be used commercially. The overall message is a move toward democratized, private document Q&A—at the cost of local compute and some speed limitations—while keeping the architecture modular so components can be swapped as needs evolve.

Cornell Notes

The system described enables private “chat with your documents” without internet by running embeddings, retrieval, and generation locally. Documents are chunked, embedded with Sentence Transformers, stored in a local Chroma DB, and retrieved via similarity search when a question is asked; the retrieved chunks are then passed to an on-device MPT-30B model for grounded answers. This avoids common closed-API drawbacks such as data leakage risk, lack of customization, internet dependency, outages/overload, and vendor lock-in. The trade-off is speed: retrieval-augmented answers can take several minutes, and MPT-30B needs substantial hardware (at least 32GB RAM is recommended). Licensing matters too: the MPT chat variant is non-commercial, while base/instruct variants are commercially usable.

How does the local “chat with documents” pipeline keep answers grounded in the source text?

It uses retrieval-augmented generation. Documents are split into chunks, converted to embeddings (numeric vectors), and stored in a local vector database (Chroma DB). When a user asks a question, the question is embedded and a similarity search retrieves the most relevant stored chunks. Those retrieved passages are appended to the prompt as context before the language model generates the response, so answers can cite specific source chunks.
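
A tiny sketch of the similarity idea, assuming the sentence-transformers package (the model name and sentences are illustrative):

    # Questions and chunks that mean similar things get nearby vectors,
    # so cosine similarity ranks the relevant chunk first.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    question = model.encode("What was revenue in Q1 2024?", convert_to_tensor=True)
    chunks = model.encode(
        ["Q1 2024 revenue grew 12% year over year.",
         "The CEO discussed hiring plans for next year."],
        convert_to_tensor=True,
    )
    # The first chunk scores higher, so it is the one retrieved as context.
    print(util.cos_sim(question, chunks))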

Why are closed hosted LLM setups considered risky or limiting for private data use?

The transcript highlights five issues: (1) data leakage risk when sensitive IP or customer information is sent to third-party APIs; (2) limited customization because vendors control model behavior and pricing; (3) internet connection requirements, which can be costly or unavailable; (4) server overload/outages that prevent consistent access; and (5) lack of control over quality changes, request limits, and vendor lock-in after policy updates.

What open-source components replace the closed parts of the typical architecture?

The embeddings step can use Sentence Transformers (open source). The vector store can be Chroma DB (local, open source). The language model is replaced with an on-device open model: MosaicML’s MPT-30B, run locally via ggml tooling (with a ggml build provided through Hugging Face). Together these allow embeddings, retrieval, and generation to happen without exporting document content.
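
The one-time model download from Hugging Face could look like the sketch below; the repo and file names are assumptions, and after this step everything runs offline:

    # Hedged sketch: one-time download of a ggml build from Hugging
    # Face. Repo and file names are assumptions, not confirmed by the
    # transcript; after the download, no internet access is needed.
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(
        repo_id="TheBloke/mpt-30B-GGML",  # assumed repo hosting a ggml build
        filename="mpt-30b.ggmlv0.bin",    # assumed ~19GB model file
        local_dir="models",
    )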

What practical steps are used to ingest documents and enable local Q&A?

Users clone the repo, place files in a “source documents” folder (PDF/TXT/CSV/DOC), and run an ingestion command (e.g., make ingest or python ingest.py). This creates a local DB folder containing embeddings. Then the MPT-30B model is downloaded locally (noted as ~19GB) and environment variables configure persist directory, model path, embedding model name, and retrieval parameters like target source chunks. Finally, question-answer scripts (e.g., question_answer_docs.py) run retrieval and generation using LangChain streaming/retrieval QA chains.
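
A hedged sketch of that LangChain wiring, using the classic LangChain import layout; the model path, chain type, and parameter values are assumptions:

    # Hedged sketch of the LangChain retrieval-QA wiring, using the
    # classic (pre-0.1) LangChain import layout; model path, chain
    # type, and parameter values are assumptions.
    from langchain.llms import CTransformers
    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores import Chroma
    from langchain.chains import RetrievalQA

    llm = CTransformers(model="models/mpt-30b.ggml.bin", model_type="mpt")
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    db = Chroma(persist_directory="db", embedding_function=embeddings)

    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # stuff retrieved chunks straight into the prompt
        retriever=db.as_retriever(search_kwargs={"k": 4}),  # target source chunks
        return_source_documents=True,  # lets answers cite their source chunks
    )
    result = qa({"query": "What was revenue in Q1 2024?"})
    print(result["result"])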

What hardware and configuration choices affect speed and feasibility?

MPT-30B is compute-heavy: the transcript recommends at least 32GB of RAM (and suggests the smaller MPT-7B variant if hardware is limited). Latency grows as retrieval pulls more chunks (“target source chunks”), because the model must process more context. Complex questions and larger retrieval scopes can push response times to several minutes, while simpler chat-only queries may return in tens of seconds.
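
A back-of-envelope illustration of that scaling (token counts are assumed for illustration): every extra retrieved chunk adds tokens the model must process before it can emit the first word of its answer:

    # Why more chunks mean slower answers: the model must process
    # every retrieved token before generating. Counts are assumptions.
    CHUNK_TOKENS = 500   # assumed tokens per retrieved chunk
    OVERHEAD = 100       # question plus instructions

    for k in (2, 4, 8):  # candidate "target source chunks" settings
        prompt_tokens = k * CHUNK_TOKENS + OVERHEAD
        print(f"k={k}: ~{prompt_tokens} prompt tokens before the first output token")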

What licensing constraint is called out for the MPT model variants?

The transcript warns that the “chat” version of the MPT model is non-commercial, so commercial use isn’t allowed under that variant. It says the base and instruct variants can be used commercially, so users need to choose the appropriate model variant for their intended use case.

Review Questions

  1. What components in the described system run locally, and how does that change the privacy profile compared with API-based chat?
  2. How do “target source chunks” and question complexity influence latency in retrieval-augmented local chat?
  3. What licensing distinction is made between the MPT chat variant and the base/instruct variants, and why does it matter?

Key Points

  1. A private document chatbot can be built by running embeddings, vector retrieval, and LLM generation entirely on local hardware rather than sending content to third-party APIs.
  2. The retrieval-augmented approach chunks documents, embeds them, stores them in a local vector database, and injects the most similar chunks into the prompt for grounded answers.
  3. Closed hosted LLM setups are flagged for privacy risk, limited customization, internet dependency, outages/overload, and vendor lock-in.
  4. The open-source replacement stack uses Sentence Transformers for embeddings, Chroma DB for local storage, and MosaicML’s MPT-30B run via ggml for on-device generation.
  5. MPT-30B requires substantial resources (at least 32GB RAM recommended) and can be slow when retrieval pulls many chunks; tuning retrieval scope is critical.
  6. Licensing matters: the MPT chat variant is non-commercial, while base/instruct variants are described as commercially usable.

Highlights

  • The system demonstrates grounded answers by retrieving relevant document chunks from a local vector store and feeding them into the on-device MPT-30B prompt.
  • Privacy is achieved not just by “not sharing prompts,” but by keeping embeddings and retrieval local, so document text never needs to leave the machine.
  • Latency is the main trade-off: retrieval-augmented queries can take minutes, and the number of retrieved chunks strongly affects speed.
  • Hardware guidance is concrete: MPT-30B is recommended with at least 32GB RAM, and smaller MPT variants may be needed otherwise.
  • Licensing is explicitly called out: the MPT chat variant is non-commercial, while base/instruct variants can be used commercially.

Topics

Mentioned

  • LLM
  • API
  • IP
  • DB
  • PDF
  • TXT
  • CSV
  • DOC
  • ggml