Customer Support Chatbot using Custom Knowledge Base with LangChain and Private LLM
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A practical blueprint for building a customer-support chatbot from a custom knowledge base hinges on one design choice: retrieve the most relevant FAQ snippets with embeddings, then answer using a private/open LLM—while streaming tokens so users don’t wait for a full response. The workflow starts by turning a curated set of Q&A pairs (pulled from Skyscanner’s help center FAQ) into plain-text documents, then indexing those documents in a vector database (Chroma) using an open embedding model. At runtime, the system performs similarity search to fetch the best-matching passages and feeds them as context into a Hugging Face text-generation pipeline wrapped by LangChain, producing responses grounded in the knowledge base rather than free-form guessing.
The build begins with dataset preparation: about 12 Skyscanner help-center questions and their answers are written into separate text files using a consistent “Question: … Answer: …” format. Those files are then loaded into LangChain documents, split with a character-based text splitter to respect model context limits (the transcript notes a 2048-token constraint), and embedded into Chroma. This retrieval layer is the core of the chatbot’s reliability—when a user asks something like how to search for flights on a specific date, the similarity search surfaces the relevant FAQ content, and the LLM answers using that retrieved context.
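A minimal sketch of that indexing step follows. The directory name, file contents, and the all-MiniLM-L6-v2 embedding model are illustrative assumptions; the summary only specifies plain-text "Question: … Answer: …" files, a character-based splitter, and an open embedding model feeding Chroma.

```python
from pathlib import Path

from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

# Each FAQ entry lives in its own file, written in the
# "Question: ... Answer: ..." format described above.
Path("skyscanner").mkdir(exist_ok=True)
(Path("skyscanner") / "search-flights.txt").write_text(
    "Question: How do I search for flights on a specific date?\n"
    "Answer: Enter your origin, destination, and travel dates in the search form..."
)

# Load every text file as a LangChain document.
documents = DirectoryLoader("skyscanner", glob="**/*.txt", loader_cls=TextLoader).load()

# Character-based splitting keeps each chunk well under the
# model's 2048-token context limit.
texts = CharacterTextSplitter(chunk_size=1024, chunk_overlap=0).split_documents(documents)

# Embed with an open model and index in Chroma for similarity search.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = Chroma.from_documents(texts, embeddings)
```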
For the language model, the implementation uses a quantized Hugging Face model, Nous-Hermes-13B (loaded via AutoGPTQ with CUDA acceleration). The transcript emphasizes that the model was selected based on leaderboard performance (including LMSYS's Chatbot Arena comparisons) and that its license permits commercial use, with generation tuned for reproducibility (temperature set to 0 in the pipeline). A manual inference demo shows the model can answer a simple question about beginner programming languages, with the output recommending Python as more suitable due to its readability.
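A sketch of the model-loading step, under the assumption that the quantized checkpoint is TheBloke/Nous-Hermes-13B-GPTQ (the exact repo name is inferred, not stated in the summary), and noting that AutoGPTQ loading options vary by checkpoint:

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer, pipeline

# Assumed repo; some GPTQ checkpoints also require a model_basename argument.
MODEL_NAME = "TheBloke/Nous-Hermes-13B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    MODEL_NAME,
    use_safetensors=True,
    device="cuda:0",  # CUDA acceleration for single-GPU inference
)

# temperature=0 keeps generation deterministic, matching the
# reproducibility setting mentioned above.
generate = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    temperature=0,
)

# Manual inference check, analogous to the beginner-language demo.
print(generate("Which programming language is better for beginners, Python or C++?"))
```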
LangChain integration is handled through a Hugging Face pipeline and a streaming setup. A TextStreamer, configured to skip the prompt and special tokens, streams generated tokens while keeping prompt text and markers like end-of-sequence tokens out of the user-facing output. The model is wrapped into a LangChain HuggingFacePipeline, then connected to a retrieval-based QA chain.
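A sketch of the streaming and chain wiring, reusing model, tokenizer, and db from the previous sketches; skip_prompt and skip_special_tokens are the standard TextStreamer options that perform the trimming described above:

```python
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
from transformers import TextStreamer, pipeline

# skip_prompt drops the echoed prompt; skip_special_tokens strips markers
# like </s> from what the user sees.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    temperature=0,
    streamer=streamer,  # tokens print to stdout as they are generated
)

llm = HuggingFacePipeline(pipeline=pipe)

# Retrieval-based QA: Chroma similarity search supplies the context.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 2}),
)
qa_chain.run("How do I search for flights on a specific date?")
```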
A key lesson appears when conversational memory is added. Using a conversational retrieval chain can trigger an unwanted behavior: follow-up questions get rephrased into standalone queries, which the builder doesn’t want. To avoid that, the solution switches to a QA-style chain with buffer memory that preserves the user’s exact wording and still maintains chat history. The final system is packaged into a custom Python class (a “chatbot” wrapper) that constructs the prompt template, builds the Chroma index from the knowledge base, and exposes a simple call interface for interactive use.
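A sketch of that final wrapper, assuming a load_qa_chain-style stuff chain with ConversationBufferMemory; the prompt wording and class details are illustrative, not the video's exact code:

```python
from langchain.chains.question_answering import load_qa_chain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate

# Illustrative prompt template: retrieved context and raw chat history are
# placed around the user's question exactly as typed.
TEMPLATE = """You are a customer support assistant. Answer using only the context below.
If the context does not contain the answer, say so and suggest contacting the
airline or travel agent.

{context}

{chat_history}
Question: {question}
Answer:"""


class Chatbot:
    """Builds the prompt, memory, and QA chain behind a simple call interface."""

    def __init__(self, llm, db):
        self.db = db
        prompt = PromptTemplate(
            input_variables=["context", "chat_history", "question"],
            template=TEMPLATE,
        )
        # Buffer memory stores turns verbatim; unlike ConversationalRetrievalChain,
        # a plain QA chain never rewrites follow-ups into standalone queries.
        memory = ConversationBufferMemory(memory_key="chat_history", input_key="question")
        self.chain = load_qa_chain(llm, chain_type="stuff", prompt=prompt, memory=memory)

    def __call__(self, question: str) -> str:
        # Retrieve the best-matching FAQ chunks for this turn.
        docs = self.db.similarity_search(question)
        return self.chain.run(input_documents=docs, question=question)
```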
In a live demo loop, the chatbot answers questions about flight search and booking issues; when the knowledge base can’t help (e.g., changing an email address tied to an airline booking), it directs users to contact the airline or travel agent. The overall result is a single-GPU, retrieval-augmented support bot that can be extended with new text files and deployed behind an API for production use, with streamed responses for better user experience.
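The interactive demo loop then reduces to a few lines (sketch; the exit keywords are an assumption):

```python
bot = Chatbot(llm, db)

while True:
    question = input("You: ")
    if question.strip().lower() in {"exit", "quit"}:
        break
    bot(question)  # the answer streams to stdout via the TextStreamer
```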
Cornell Notes
The system builds a customer-support chatbot by combining (1) a custom FAQ knowledge base, (2) embedding-based retrieval, and (3) a private/open LLM for generation. Skyscanner help-center Q&A pairs are converted into text files, split to fit context limits, embedded, and indexed in Chroma. At question time, similarity search retrieves the most relevant passages, which are inserted into a prompt template and passed to a Hugging Face text-generation pipeline wrapped by LangChain. Streaming is enabled via a TextStreamer so answers appear token-by-token. A practical pitfall is avoided by using a QA chain rather than a conversational retrieval chain, preventing automatic rephrasing of follow-up questions.
- How does the chatbot ensure answers come from the custom knowledge base instead of generic LLM behavior?
- Why is the choice of chain type (conversational vs. QA) important for follow-up questions?
- What role do embeddings and Chroma play in the pipeline?
- How is streaming implemented so users see responses immediately?
- What generation settings are used to make outputs more consistent?
- How does the final chatbot interface work for interactive use?
Review Questions
- What specific mechanism prevents the model from answering outside the knowledge base, and where does that mechanism plug into the prompt?
- What behavior changes when moving from a conversational retrieval chain to a QA chain with buffer memory, and why does that matter for follow-up questions?
- How is the TextStreamer configured so that output is streamed yet user-friendly, with prompt text and special tokens removed?
Key Points
1. Convert FAQ Q&A pairs into a structured text format and store them as individual documents for indexing.
2. Split knowledge-base text into chunks that fit model context limits before embedding.
3. Use embedding-based similarity search in Chroma to retrieve the most relevant FAQ passages for each user question.
4. Wrap a quantized Hugging Face text-generation model in a LangChain HuggingFacePipeline and feed retrieved context into a prompt template.
5. Enable token streaming with TextStreamer (skipping prompt text and special tokens) so responses appear incrementally and cleanly.
6. Avoid conversational retrieval chains if automatic follow-up question rephrasing is undesirable; prefer a QA chain with buffer memory.
7. Package the retrieval + generation logic into a reusable chatbot class to simplify interactive use and future API deployment.