End To End Document Q&A RAG App With Gemma And Groq API
Based on Krish Naik's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
An end-to-end document Q&A chatbot is built by pairing Google’s open embedding models with Groq’s fast inference for the LLM layer, then wiring both into a Streamlit app that answers questions using text retrieved from local PDFs. The core payoff is speed: Groq’s LPU-based inference generates answers quickly (the demo reports roughly 10 seconds end to end for embedding plus answering), while the retrieval pipeline keeps responses grounded in the document content rather than drifting into generic answers.
The workflow starts with model selection and infrastructure choices. The chatbot uses Groq Cloud’s API to run Google’s Gemma family, specifically “gemma-7b-it,” while embeddings come from Google Generative AI embeddings (configured with a Google API key). The transcript also outlines Gemma variants, including CodeGemma for coding help, PaliGemma for open vision-language tasks, and RecurrentGemma, but the implementation focuses on the standard Gemma 7B model.
Groq Cloud is positioned as the solution to a common LLM bottleneck: inference latency. Groq’s platform uses an LPU (Language Processing Unit), described as faster than GPUs for LLM workloads because it targets the two limits of compute density and memory bandwidth. In the demo, Groq’s generation speed is illustrated with a token rate figure (shown as roughly 788 tokens per second for the selected Gemma configuration), and pricing is compared across several model sizes and context lengths. The practical takeaway is that the app can avoid building and hosting its own inference stack by calling Groq’s hosted API.
On the application side, the build is done from scratch. A Python 3.12 environment is created and dependencies are installed from a requirements file, then environment variables are set for both Groq (GROQ_API_KEY) and Google (GOOGLE_API_KEY). The Streamlit UI is set up, and the LLM is instantiated via LangChain’s ChatGroq wrapper.
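The credential setup can be sketched as follows. The key names GROQ_API_KEY and GOOGLE_API_KEY follow the convention described above; the commented ChatGroq call shows where the Groq key is consumed. This sketch reads the environment directly, whereas the actual app first loads a .env file via python-dotenv.

```python
import os

def load_api_keys() -> dict:
    """Read the two credentials the app needs and fail fast if one is missing."""
    keys = {
        "GROQ_API_KEY": os.getenv("GROQ_API_KEY"),      # consumed by the Groq chat wrapper
        "GOOGLE_API_KEY": os.getenv("GOOGLE_API_KEY"),  # consumed by the Google embeddings
    }
    missing = [name for name, value in keys.items() if not value]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
    return keys

# With the keys in place, the LLM layer is a single LangChain call, e.g.:
# from langchain_groq import ChatGroq
# llm = ChatGroq(groq_api_key=keys["GROQ_API_KEY"], model_name="gemma-7b-it")
```

Failing fast on a missing key keeps the error at startup rather than surfacing later as an opaque API authentication failure.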
For retrieval-augmented generation (RAG), PDFs are loaded from a local folder (“us_census”), split into chunks with a recursive character text splitter (chunk size 1000, overlap 200), embedded into vectors with Google’s embedding model (“models/embedding-001”), and stored in a FAISS vector database. A retriever interface is then created from the FAISS index, and a “stuff” document chain combines the retrieved context with a prompt template instructing the model to answer only from the provided context.
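The chunking step can be illustrated with a minimal sketch of fixed-size splitting with overlap (chunk size 1000, overlap 200). This is a simplified stand-in for LangChain’s RecursiveCharacterTextSplitter, which additionally prefers to break on paragraph and sentence boundaries before falling back to raw character offsets:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    where consecutive chunks share `overlap` characters of context."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap  # advance 800 characters per chunk by default
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the remaining text is already covered by this chunk
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in the neighboring chunk, which is why retrieval quality degrades less at boundaries than with disjoint splits.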
Finally, the app exposes two user actions: a button to generate the vector store (embedding step) and a text input for questions. When a question is submitted, the retrieval chain fetches relevant chunks and feeds them to Gemma 7B for a grounded response. The demo queries about health insurance coverage and differences in uninsured rates by state (2022), returning answers attributed to the retrieved PDF context, along with an optional display of the supporting page content.
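The question flow can be sketched end to end with a toy retriever standing in for the FAISS similarity search (naive word-overlap scoring here, purely for illustration) and the final LLM call left as a placeholder. The prompt template mirrors the “answer only from the provided context” constraint the app uses:

```python
def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank chunks by shared-word count with the question, return top k.
    The real app replaces this with FAISS similarity search over embeddings."""
    q_words = set(question.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return scored[:k]

PROMPT_TEMPLATE = (
    "Answer the question based only on the provided context.\n"
    "<context>\n{context}\n</context>\n"
    "Question: {question}"
)

def build_prompt(question: str, chunks: list[str]) -> str:
    """The 'stuff' strategy: concatenate all retrieved chunks into one prompt,
    which is then sent to the LLM (Gemma 7B via Groq in the actual app)."""
    context = "\n\n".join(retrieve(question, chunks))
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

The “stuff” chain is the simplest document-combination strategy: it works well here because chunk size times the retrieved-chunk count stays comfortably inside the model’s context window.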
Cornell Notes
The project builds a Streamlit document Q&A chatbot using RAG. It uses Groq Cloud’s API to run “gemma-7b-it” for fast text generation, while Google Generative AI embeddings convert PDF text chunks into vectors. Local PDFs are loaded, split into overlapping chunks (chunk size 1000, overlap 200), embedded with “models/embedding-001,” and stored in a FAISS vector database. A retriever pulls the most relevant chunks for each user question, and a LangChain “stuff” document chain combines that context with a prompt that demands answers based only on the provided material. This setup keeps responses grounded in the documents and demonstrates low latency thanks to Groq’s LPU inference approach.
Why does the build use Groq Cloud (LPU inference) instead of running the LLM locally?
What role do embeddings and FAISS play in the chatbot’s accuracy?
How are PDFs transformed into retrievable context?
What prompt constraint keeps answers grounded in the retrieved text?
What happens when the user clicks “documents embedding” and then asks a question?
Which API keys are required and where are they used?
Review Questions
- How do chunk size and chunk overlap (1000 and 200) affect retrieval quality in a PDF-based RAG system?
- Describe the end-to-end flow from PDF upload to final answer generation in this Streamlit app.
- What is the purpose of a retriever and a “stuff” document chain in LangChain, and how do they interact with the prompt template?
Key Points
1. Use Groq Cloud’s API to run “gemma-7b-it” for low-latency LLM responses, avoiding local inference infrastructure.
2. Convert PDF text into embeddings with Google Generative AI embeddings (“models/embedding-001”) to enable semantic similarity search.
3. Chunk documents with a recursive character splitter (chunk size 1000, overlap 200) to preserve context across boundaries.
4. Store embeddings in a FAISS vector database and use it as the retriever to fetch the most relevant chunks per question.
5. Combine retrieved context with a prompt that requires answering based only on provided context to reduce hallucinations.
6. Build the UI in Streamlit with a button to generate the vector store and an input box to run the retrieval + generation chain.
7. Keep Groq and Google credentials in environment variables (.env) and load them at runtime for secure API access.
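The credential key point in practice: a .env file in the project root (key names assumed to match the conventions the LangChain wrappers read), loaded at startup by python-dotenv so credentials stay out of the source code:

```
GROQ_API_KEY=your_groq_api_key_here
GOOGLE_API_KEY=your_google_api_key_here
```

Remember to add .env to .gitignore so the keys are never committed.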