
Customer Support Chatbot using Custom Knowledge Base with LangChain and Private LLM

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Convert FAQ Q&A pairs into a structured text format and store them as individual documents for indexing.

Briefing

A practical blueprint for building a customer-support chatbot from a custom knowledge base hinges on one design choice: retrieve the most relevant FAQ snippets with embeddings, then answer using a private/open LLM—while streaming tokens so users don’t wait for a full response. The workflow starts by turning a curated set of Q&A pairs (pulled from Skyscanner’s help center FAQ) into plain-text documents, then indexing those documents in a vector database (Chroma) using an open embedding model. At runtime, the system performs similarity search to fetch the best-matching passages and feeds them as context into a Hugging Face text-generation pipeline wrapped by LangChain, producing responses grounded in the knowledge base rather than free-form guessing.

The build begins with dataset preparation: about 12 Skyscanner help-center questions and their answers are written into separate text files using a consistent “Question: … Answer: …” format. Those files are then loaded into LangChain documents, split with a character-based text splitter to respect model context limits (the transcript notes a 2048-token constraint), and embedded into Chroma. This retrieval layer is the core of the chatbot’s reliability—when a user asks something like how to search for flights on a specific date, the similarity search surfaces the relevant FAQ content, and the LLM answers using that retrieved context.
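
Under the classic LangChain API (current when the video was made), the indexing step could look like the sketch below. The directory name, glob pattern, embedding model, and chunk size are illustrative assumptions, not the video's exact values.

```python
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

# Load the FAQ files ("Question: ... Answer: ..." format), one document each.
loader = DirectoryLoader("skyscanner/", glob="*.txt", loader_cls=TextLoader)
documents = loader.load()

# Split so every chunk fits comfortably within the 2048-token context limit.
splitter = CharacterTextSplitter(chunk_size=1024, chunk_overlap=0)
texts = splitter.split_documents(documents)

# Embed with an open model and index in Chroma for similarity search.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
db = Chroma.from_documents(texts, embeddings)
```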

For the language model, the implementation uses a quantized Hugging Face model, Nous Hermes 13B, loaded via AutoGPTQ with CUDA acceleration. The transcript emphasizes that the model was selected for its leaderboard performance (including references to the Chatbot Arena and LMSYS comparisons) and that it is intended for commercial use, with generation tuned for reproducibility (temperature set to 0 in the pipeline). A manual inference demo shows the model answering a simple question about beginner programming languages, recommending Python for its readability.
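
A minimal sketch of the single-GPU loading step follows, assuming a GPTQ-quantized Nous Hermes 13B checkpoint; the exact repo id is an assumption:

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_name = "TheBloke/Nous-Hermes-13B-GPTQ"  # assumed quantized checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

# Load the quantized weights directly onto the GPU.
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    use_safetensors=True,
    device="cuda:0",
)
```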

LangChain integration is handled through a Hugging Face pipeline with a streaming setup. A TextStreamer, configured to skip the prompt and strip special tokens, streams generated tokens while keeping prompt text out of the user-facing output. The model is wrapped in LangChain's HuggingFacePipeline and connected to a retrieval-based QA chain.
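
Assembled under the same assumptions (reusing `model` and `tokenizer` from the loading sketch above), the streaming pipeline and LangChain wrapper might look like this; skip_prompt and skip_special_tokens are the TextStreamer options that do the trimming:

```python
from langchain.llms import HuggingFacePipeline
from transformers import TextStreamer, pipeline

# Stream tokens to stdout, hiding the prompt and special tokens.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=2048,  # the model's context limit noted in the transcript
    temperature=0,    # deterministic output for reproducibility
    streamer=streamer,
)

llm = HuggingFacePipeline(pipeline=pipe)  # LangChain wrapper around the pipeline
```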

A key lesson appears when conversational memory is added. Using a conversational retrieval chain can trigger an unwanted behavior: follow-up questions get rephrased into standalone queries, which the builder doesn’t want. To avoid that, the solution switches to a QA-style chain with buffer memory that preserves the user’s exact wording and still maintains chat history. The final system is packaged into a custom Python class (a “chatbot” wrapper) that constructs the prompt template, builds the Chroma index from the knowledge base, and exposes a simple call interface for interactive use.
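
One way to wire this up, sketched under the classic LangChain API (the template wording and variable names are illustrative, not transcript quotes):

```python
from langchain.chains.question_answering import load_qa_chain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate

# The grounding instruction ("I don't know") keeps answers inside the KB.
template = """Use the following context to answer the question at the end.
If the answer is not contained in the context, just say "I don't know".

Context: {context}

{chat_history}
Question: {question}
Answer:"""

prompt = PromptTemplate(
    input_variables=["context", "chat_history", "question"],
    template=template,
)

# Buffer memory stores raw history; the user's wording is never rephrased.
memory = ConversationBufferMemory(memory_key="chat_history", input_key="question")

chain = load_qa_chain(llm, chain_type="stuff", prompt=prompt, memory=memory)
```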

In a live demo loop, the chatbot answers questions about flight search and booking issues; when the knowledge base can’t help (e.g., changing an email address tied to an airline booking), it directs users to contact the airline or travel agent. The overall result is a single-GPU, retrieval-augmented support bot that can be extended with new text files and deployed behind an API for production use, with streamed responses for better user experience.

Cornell Notes

The system builds a customer-support chatbot by combining (1) a custom FAQ knowledge base, (2) embedding-based retrieval, and (3) a private/open LLM for generation. Skyscanner help-center Q&A pairs are converted into text files, split to fit context limits, embedded, and indexed in Chroma. At question time, similarity search retrieves the most relevant passages, which are inserted into a prompt template and passed to a Hugging Face text-generation pipeline wrapped by LangChain. Streaming is enabled via a TextStreamer so answers appear token-by-token. A practical pitfall is avoided by using a QA chain rather than a conversational retrieval chain, preventing automatic rephrasing of follow-up questions.

How does the chatbot ensure answers come from the custom knowledge base instead of generic LLM behavior?

It uses retrieval-augmented generation. The FAQ text files are embedded with an open embedding model and stored in Chroma. For each user question, the system runs similarity search against the vector index, selects the top matching documents, and injects those retrieved passages into the prompt as “context.” The LLM then generates an answer constrained by that context, and the prompt instructs it to reply with “I don’t know” when the answer isn’t present.
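
In code terms, each turn is retrieve-then-generate; a short sketch reusing the `db` and `chain` pieces above (the question text is illustrative):

```python
# End-to-end: retrieve context, then answer constrained by it.
question = "How do I book a flight I found on Skyscanner?"  # illustrative
docs = db.similarity_search(question)
answer = chain.run(input_documents=docs, question=question)
```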

Why is the choice of chain type (Conversational vs QA) important for follow-up questions?

Conversational retrieval chains can rephrase follow-up questions into standalone queries using chat history. In the transcript, that rephrasing is considered undesirable. Switching to a QA chain with buffer memory keeps the user’s follow-up wording intact while still maintaining chat history, so the model doesn’t rewrite the question before retrieval and answering.

What role do embeddings and Chroma play in the pipeline?

Embeddings convert each FAQ chunk into a vector representation so semantic similarity can be measured. Chroma stores those vectors and supports similarity search at runtime. The retrieved chunks (including their source metadata) become the context fed into the LLM, which is why questions like “how do I search for flights on Skyscanner” return relevant FAQ content.
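
Retrieval itself is a single call against the index; a sketch (the query text and k are illustrative):

```python
# Top-k semantic search over the Chroma index built earlier.
docs = db.similarity_search("How do I search for flights on Skyscanner?", k=2)

for doc in docs:
    # Each hit carries its source file in the metadata.
    print(doc.metadata["source"], "->", doc.page_content[:80])
```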

How is streaming implemented so users see responses immediately?

A Hugging Face TextStreamer streams generated tokens as the model produces them. Configured with skip_prompt and skip_special_tokens, it keeps the prompt and special tokens (like beginning/end-of-sequence) out of what the user sees. The streamer is attached to the text-generation pipeline, and LangChain's pipeline wrapper passes generation output through that streaming mechanism.

What generation settings are used to make outputs more consistent?

The transcript sets temperature to 0 in the Hugging Face text-generation pipeline for reproducibility. It also uses max length constraints (noted as up to 2048 tokens for the model) and standard token IDs (beginning-of-sequence, end-of-sequence, padding) via the generation configuration.

How does the final chatbot interface work for interactive use?

A custom Python class (“chatbot”) wraps the prompt template, the retrieval/indexing setup, and the QA chain. The class exposes a simple call method (e.g., chatbot(user_input)) that performs similarity search, runs the chain with retrieved documents and the user question, and returns the generated response. An interactive loop reads user input until the user types a termination phrase like “bye.”
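
A hedged sketch of that wrapper and loop, reusing the `chain` and `db` objects from the earlier sketches (class and variable names are illustrative):

```python
class Chatbot:
    """Ties the Chroma index and the QA chain behind one callable interface."""

    def __init__(self, chain, db):
        self.chain = chain
        self.db = db

    def __call__(self, user_input: str) -> str:
        # Retrieve the most relevant FAQ chunks, then answer from them.
        docs = self.db.similarity_search(user_input)
        return self.chain.run(input_documents=docs, question=user_input)


chatbot = Chatbot(chain, db)

while True:
    user_input = input("You: ")
    if user_input.strip().lower() == "bye":
        break
    print("Bot:", chatbot(user_input))
```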

Review Questions

  1. What specific mechanism prevents the model from answering outside the knowledge base, and where does that mechanism plug into the prompt?
  2. What behavior changes when moving from a conversational retrieval chain to a QA chain with buffer memory, and why does that matter for follow-up questions?
  3. How is the TextStreamer configured to skip the prompt and special tokens, producing streamed, user-friendly output?

Key Points

  1. Convert FAQ Q&A pairs into a structured text format and store them as individual documents for indexing.

  2. Split knowledge-base text into chunks that fit model context limits before embedding.

  3. Use embedding-based similarity search in Chroma to retrieve the most relevant FAQ passages for each user question.

  4. Wrap a quantized Hugging Face text-generation model in a LangChain HuggingFacePipeline and feed retrieved context into a prompt template.

  5. Enable token streaming with a TextStreamer (configured to skip the prompt and special tokens) so responses appear incrementally.

  6. Avoid conversational retrieval chains if automatic follow-up question rephrasing is undesirable; prefer a QA chain with buffer memory.

  7. Package the retrieval + generation logic into a reusable chatbot class to simplify interactive use and future API deployment.

Highlights

The chatbot’s reliability comes from retrieval: similarity search pulls the right FAQ chunks from Chroma, and the LLM answers using that context.
Conversational retrieval can silently rephrase follow-up questions; switching to a QA chain preserves the user’s wording while keeping chat history.
Streaming isn’t an afterthought: the TextStreamer is wired into the Hugging Face pipeline, skipping the prompt and special tokens so clean output appears as it’s generated.
The single-GPU setup relies on CUDA loading and quantized model execution via AutoGPTQ.
