
8-Building Gen AI Powered App Using Langchain And Huggingface And Mistral

Krish Naik · 5 min read

Based on Krish Naik's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Load PDFs from a directory, then split them into overlapping chunks (example: chunk_size=1000, chunk_overlap=200) before embedding.

Briefing

A practical end-to-end recipe for building an open-source RAG (retrieval-augmented generation) Q&A app comes together by chaining LangChain document processing, Hugging Face embeddings, and a Hugging Face-hosted Mistral model. The workflow turns a folder of PDFs into searchable vector representations, retrieves the most relevant chunks for a user question, and then prompts Mistral to answer strictly from that retrieved context, so responses are grounded in the uploaded documents rather than generic internet knowledge.

The process starts with ingestion and chunking. PDFs stored in a local directory are loaded using LangChain’s PDF directory loader, then split into manageable text chunks with a recursive character text splitter. The example uses a chunk size of 1,000 characters with an overlap of 200, producing a set of final document chunks (the transcript reports roughly 316 chunks across four PDFs). This chunking step is crucial because embeddings and similarity search operate on smaller passages, not entire documents.

Next comes embeddings and vector search. The embedding model comes from Hugging Face, accessed through LangChain’s HuggingFaceEmbeddings wrapper and backed by a sentence-transformers model. The embedding pipeline runs on CPU (the transcript sets the device to CPU), and the output is verified by checking the vector shape (reported as 384 dimensions). Those vectors are then stored in a FAISS vector database. To keep runtime manageable during testing, the transcript builds the vector store from a subset of chunks (e.g., 120 records), then notes that scaling up increases embedding time (tens of seconds in the example).

With FAISS in place, the system performs similarity search. A sample question—“what is the health insurance coverage”—is embedded and matched against the vector store to retrieve the most relevant chunks. The transcript shows both a direct similarity_search call and a retriever abstraction configured with search_type="similarity" and k=3, returning the top three most relevant passages. Those retrieved passages are then used as the context for generation.

For the language model, the app uses Hugging Face Hub through LangChain’s HuggingFaceHub wrapper, pulling an open-source Mistral model by repo ID (Mistral 7B is referenced) and configuring generation parameters like temperature=0.1 and max_length=500. A Hugging Face access token is required to download the model from the Hub, and the transcript demonstrates setting it via an environment variable (with a caution not to hardcode tokens).

Finally, the RAG layer is assembled using LangChain’s RetrievalQA chain. A prompt template instructs the model to answer using only the provided context (“provide the answers only based on the context”). The chain type is set to “stuff,” the retriever supplies the context chunks, and the system returns answers along with source documents. When tested, the model produces grounded responses about health insurance coverage and can answer follow-up questions like differences in uninsured rates by state, with the retrieved context driving the output.

Cornell Notes

The transcript builds a grounded Q&A system using RAG: PDFs are loaded, split into chunks, embedded with Hugging Face sentence-transformer embeddings, and indexed in a FAISS vector store. A retriever then performs similarity search to fetch the top-k relevant chunks (k=3) for a user question. A Hugging Face-hosted Mistral model (via LangChain’s HuggingFaceHub) generates an answer using a prompt template that restricts responses to the retrieved context. The final step wires everything together with LangChain’s RetrievalQA chain (chain type “stuff”) and can return source documents, making answers traceable to the PDFs. This matters because it replaces generic LLM responses with document-grounded answers.

How do the PDFs become something the model can search?

PDFs are loaded from a directory using LangChain’s PDF directory loader, then split into overlapping chunks using RecursiveCharacterTextSplitter. The example uses chunk_size=1000 and chunk_overlap=200, producing a few hundred chunks (about 316 across four PDFs). These chunks become the units embedded and retrieved later.

What embedding setup is used, and how is it validated?

Embeddings come from LangChain’s HuggingFaceEmbeddings wrapper, backed by a sentence-transformers model. The transcript sets the embedding device to CPU and checks that the embedding vector has a consistent dimensionality, reported as shape (384,). It also demonstrates embedding a chunk’s page_content and converting the result to a NumPy array to inspect its shape.

Why use FAISS, and how does similarity search work here?

FAISS stores the embedding vectors so the system can quickly find the most similar chunks to a query. The transcript builds a FAISS vector store from the embedded document chunks, then runs vector_store.similarity_search(query) to retrieve relevant passages. It also shows a retriever abstraction with search_type="similarity" and k=3 to return the top three matching chunks.

How does the system ensure answers come from the PDFs rather than general knowledge?

A prompt template instructs the model to use only the provided context: “provide the answers only based on the context.” The RetrievalQA chain feeds the retrieved chunks into the prompt as context, so Mistral generates responses grounded in those passages. The chain can also return source documents for traceability.

How is Mistral connected to LangChain in this setup?

Mistral is accessed through Hugging Face Hub using LangChain’s HuggingFaceHub wrapper. The transcript references using Mistral 7B via its Hugging Face repo ID, sets temperature=0.1, and max_length=500, and requires a Hugging Face access token (set as an environment variable) to download the model from the Hub.

What does the RetrievalQA chain do, and what parameters matter most?

RetrievalQA combines retrieval and generation. Key parameters include llm (the HuggingFaceHub model), chain_type="stuff" (how context is combined), retriever (the FAISS-backed similarity retriever), return_source_documents=True (for citations), and prompt (the context-restricted instruction template). The chain then answers a user query using the retrieved context.

Review Questions

  1. If chunk_size and chunk_overlap change, how might it affect retrieval quality and answer accuracy in a FAISS+RAG pipeline?
  2. What is the difference between calling vector_store.similarity_search directly and using a retriever with k=3 inside RetrievalQA?
  3. How does the prompt template wording (“only based on the context”) influence the groundedness of Mistral’s answers?

Key Points

  1. Load PDFs from a directory, then split them into overlapping chunks (example: chunk_size=1000, chunk_overlap=200) before embedding.
  2. Generate embeddings using Hugging Face sentence-transformer-based embeddings and verify vector dimensionality (example: 384).
  3. Index embedded chunks in FAISS to enable fast similarity search over document passages.
  4. Retrieve the top-k relevant chunks for a question using either similarity_search or a retriever configured with k=3.
  5. Connect an open-source Mistral model from Hugging Face Hub via LangChain’s HuggingFaceHub, using temperature=0.1 and max_length=500.
  6. Use a prompt template that forces answers to rely only on retrieved context, then assemble everything with LangChain’s RetrievalQA chain (chain type “stuff”).
  7. Enable return_source_documents to make outputs auditable against the PDF-derived context.

Highlights

The pipeline grounds LLM answers by retrieving the most relevant PDF chunks with FAISS and injecting them into a context-restricted prompt for Mistral.
Embeddings are validated by inspecting the vector shape (reported as 384), confirming the embedding model is producing consistent dimensions.
A retriever with search_type="similarity" and k=3 supplies the context for generation, making follow-up questions like uninsured-rate differences by state work off the same document base.
