8 - Building a Gen AI-Powered App Using LangChain, Hugging Face, and Mistral
Based on Krish Naik's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A practical end-to-end recipe for building an open-source RAG (retrieval-augmented generation) Q&A app chains together LangChain document processing, Hugging Face embeddings, and a Hugging Face-hosted Mistral model. The workflow turns a folder of PDFs into searchable vector representations, retrieves the most relevant chunks for a user question, and then prompts Mistral to answer strictly from that retrieved context, so responses are grounded in the uploaded documents rather than in generic internet knowledge.
The process starts with ingestion and chunking. PDFs stored in a local directory are loaded using LangChain’s PDF directory loader, then split into manageable text chunks with a recursive character text splitter. The example uses a chunk size of 1,000 characters with an overlap of 200, producing a set of final document chunks (the transcript reports roughly 316 chunks across four PDFs). This chunking step is crucial because embeddings and similarity search operate on smaller passages, not entire documents.
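A minimal sketch of this step in Python, assuming LangChain's classic imports; the ./pdfs directory name is illustrative, not from the transcript:

```python
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load every PDF in the directory (the path is an assumption for illustration)
loader = PyPDFDirectoryLoader("./pdfs")
documents = loader.load()

# Split into overlapping chunks, matching the transcript's parameters
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)
print(len(chunks))  # the transcript reports roughly 316 chunks for four PDFs
```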
Next comes embeddings and vector search. The embedding model comes from Hugging Face, accessed through LangChain’s Hugging Face embeddings wrapper and backed by sentence-transformers. The embedding pipeline runs on CPU (the transcript sets the device to CPU), and the output is verified by checking the vector shape (reported as 384 dimensions). Those vectors are then stored in a FAISS vector database. To keep runtime manageable during testing, the transcript builds the vector store from a subset of chunks (e.g., 120 records), then notes that scaling up increases embedding time (tens of seconds in the example).
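A sketch of the embedding and indexing step, reusing chunks from the previous sketch. The summary does not name the embedding model, so sentence-transformers/all-MiniLM-L6-v2 is assumed here because it is a widely used sentence-transformers model that produces the reported 384-dimensional vectors:

```python
import numpy as np
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# CPU-based Hugging Face embeddings; the model name is an assumption
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
)

# Sanity-check the vector dimensionality (384 for this model)
vector = embeddings.embed_query("test sentence")
print(np.array(vector).shape)  # (384,)

# Index a subset of chunks to keep embedding time manageable during testing
vector_store = FAISS.from_documents(chunks[:120], embeddings)
```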
With FAISS in place, the system performs similarity search. A sample question—“what is the health insurance coverage”—is embedded and matched against the vector store to retrieve the most relevant chunks. The transcript shows both a direct similarity_search call and a retriever abstraction configured with search_type="similarity" and k=3, returning the top three most relevant passages. Those retrieved passages are then used as the context for generation.
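Both retrieval styles, sketched assuming the vector_store built above:

```python
query = "what is the health insurance coverage"

# Direct similarity search against the FAISS index
relevant_docs = vector_store.similarity_search(query)
print(relevant_docs[0].page_content)

# Retriever abstraction with the settings from the transcript
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 3})
top_chunks = retriever.get_relevant_documents(query)
print(len(top_chunks))  # 3
```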
For the language model, the app uses Hugging Face Hub through LangChain’s HuggingFaceHub wrapper, pulling an open-source Mistral model by repo ID (Mistral 7B is referenced) and configuring generation parameters like temperature=0.1 and max_length=500. A Hugging Face access token is required to download the model from the Hub, and the transcript demonstrates setting it via an environment variable (with a caution not to hardcode tokens).
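A sketch of the model hookup; mistralai/Mistral-7B-v0.1 is an assumed repo ID (the summary only says Mistral 7B is referenced), and the token is read from the environment rather than hardcoded:

```python
import os
from langchain.llms import HuggingFaceHub

# Expect the token in the environment (e.g., `export HUGGINGFACEHUB_API_TOKEN=...`);
# never hardcode access tokens in source files
assert "HUGGINGFACEHUB_API_TOKEN" in os.environ

llm = HuggingFaceHub(
    repo_id="mistralai/Mistral-7B-v0.1",  # assumed repo ID for Mistral 7B
    model_kwargs={"temperature": 0.1, "max_length": 500},
)
```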
Finally, the RAG layer is assembled using LangChain’s RetrievalQA chain. A prompt template instructs the model to answer using only the provided context (“provide the answers only based on the context”). The chain type is set to “stuff,” the retriever supplies the context chunks, and the system returns answers along with source documents. When tested, the model produces grounded responses about health insurance coverage and can answer follow-up questions like differences in uninsured rates by state, with the retrieved context driving the output.
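A sketch of the final assembly, assuming the llm and retriever objects from the previous sketches; the template wording paraphrases the transcript's instruction:

```python
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Use the following piece of context to answer the question. "
        "Provide the answers only based on the context.\n\n"
        "Context: {context}\n"
        "Question: {question}\n\n"
        "Answer:"
    ),
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",            # stuff all retrieved chunks into one prompt
    retriever=retriever,
    return_source_documents=True,  # makes answers traceable to the PDFs
    chain_type_kwargs={"prompt": prompt},
)

result = qa_chain({"query": "what is the health insurance coverage"})
print(result["result"])
for doc in result["source_documents"]:
    print(doc.metadata)
```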
Cornell Notes
The transcript builds a grounded Q&A system using RAG: PDFs are loaded, split into chunks, embedded with Hugging Face sentence-transformer embeddings, and indexed in a FAISS vector store. A retriever then performs similarity search to fetch the top-k relevant chunks (k=3) for a user question. A Hugging Face-hosted Mistral model (via LangChain’s HuggingFaceHub) generates an answer using a prompt template that restricts responses to the retrieved context. The final step wires everything together with LangChain’s RetrievalQA chain (chain type “stuff”) and can return source documents, making answers traceable to the PDFs. This matters because it replaces generic LLM responses with document-grounded answers.
- How do the PDFs become something the model can search?
- What embedding setup is used, and how is it validated?
- Why use FAISS, and how does similarity search work here?
- How does the system ensure answers come from the PDFs rather than general knowledge?
- How is Mistral connected to LangChain in this setup?
- What does the RetrievalQA chain do, and what parameters matter most?
Review Questions
- If chunk_size and chunk_overlap change, how might that affect retrieval quality and answer accuracy in a FAISS+RAG pipeline?
- What is the difference between calling vector_store.similarity_search directly and using a retriever with k=3 inside RetrievalQA?
- How does the prompt template wording (“only based on the context”) influence the groundedness of Mistral’s answers?
Key Points
1. Load PDFs from a directory, then split them into overlapping chunks (example: chunk_size=1000, chunk_overlap=200) before embedding.
2. Generate embeddings using Hugging Face sentence-transformer-based embeddings and verify vector dimensionality (example: 384).
3. Index embedded chunks in FAISS to enable fast similarity search over document passages.
4. Retrieve the top-k relevant chunks for a question using either similarity_search or a retriever configured with k=3.
5. Connect an open-source Mistral model from Hugging Face Hub via LangChain’s HuggingFaceHub, using temperature=0.1 and max_length=500.
6. Use a prompt template that forces answers to rely only on retrieved context, then assemble everything with LangChain’s RetrievalQA chain (chain type “stuff”).
7. Enable return_source_documents to make outputs auditable against the PDF-derived context.