
End To End Document Q&A RAG App With Gemma And Groq API

Krish Naik · 5 min read

Based on Krish Naik's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Use Groq Cloud’s API to run Gemma 7B It (“gemma-7b-it”) for low-latency LLM responses, avoiding local inference infrastructure.

Briefing

An end-to-end document Q&A chatbot is built by pairing Google’s open embedding models with Groq’s fast inference for the LLM layer, then wiring both into a Streamlit app that answers questions using text retrieved from local PDFs. The core payoff is speed: Groq’s LPU-based inference generates answers quickly (the demo reports roughly 10 seconds end-to-end for embedding plus answering), while the retrieval pipeline keeps responses grounded in the document content rather than drifting into generic answers.

The workflow starts with model selection and infrastructure choices. The chatbot uses Groq Cloud’s API to run Google’s Gemma family, specifically the instruction-tuned Gemma 7B (“gemma-7b-it”), while embeddings come from Google Generative AI embeddings (configured with a Google API key). The transcript also outlines other Gemma variants, such as Code Gemma for coding help, PaliGemma for open vision-language tasks, and Recurrent Gemma, but the implementation focuses on the standard Gemma 7B model.

Groq Cloud is positioned as the solution to a common LLM bottleneck: inference latency. Groq’s platform uses an LPU (Language Processing Unit), described as faster than GPUs for LLM workloads because it is designed around the two limits that slow LLM inference, compute density and memory bandwidth. In the demo, Groq’s generation speed is illustrated with a token-rate figure (roughly 788 tokens per second for a selected Gemma configuration), and pricing is compared across several model sizes and context lengths. The practical takeaway is that the app can avoid building and hosting its own inference stack by calling Groq’s hosted API.

On the application side, the build is done from scratch. A Python 3.12 environment is created with a requirements file, then environment variables are set for both Groq (the Groq API key) and Google (the Google API key). The Streamlit UI is set up, and the LLM is instantiated via LangChain’s Groq chat wrapper.
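A minimal sketch of that setup, assuming the standard `python-dotenv` and `langchain_groq` packages and the environment variable names `GROQ_API_KEY` and `GOOGLE_API_KEY`:

```python
import os
from dotenv import load_dotenv
from langchain_groq import ChatGroq

load_dotenv()  # reads GROQ_API_KEY and GOOGLE_API_KEY from the local .env file

groq_api_key = os.getenv("GROQ_API_KEY")
os.environ["GOOGLE_API_KEY"] = os.getenv("GOOGLE_API_KEY")  # picked up later by the Google embeddings client

# LangChain's Groq chat wrapper, pointed at the instruction-tuned Gemma 7B model
llm = ChatGroq(groq_api_key=groq_api_key, model_name="gemma-7b-it")
```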

For retrieval-augmented generation (RAG), PDFs are loaded from a local folder (“us_census”), split into chunks using a recursive character text splitter (chunk size 1000, overlap 200), embedded into vectors using Google’s embedding model (“models/embedding-001”), and stored in a FAISS vector database. A retriever interface is then created from the FAISS index, and a “stuff” document chain combines retrieved context with a prompt template instructing the model to answer only based on the provided context.
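A sketch of that ingestion pipeline, assuming the LangChain community loaders and the folder and model names as described:

```python
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import FAISS

loader = PyPDFDirectoryLoader("./us_census")              # local folder of census PDFs
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
final_documents = splitter.split_documents(docs)           # overlapping text chunks

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vectors = FAISS.from_documents(final_documents, embeddings)  # in-memory FAISS index
```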

Finally, the app exposes two user actions: a button to generate the vector store (embedding step) and a text input for questions. When a question is submitted, the retrieval chain fetches relevant chunks and feeds them to Gamma 7B for a grounded response. The demo queries about health insurance coverage and differences in uninsured rates by state (2022), returning answers attributed to the retrieved PDF context, along with an optional display of the supporting page content.
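Continuing the sketches above, the UI layer reduces to two Streamlit widgets (widget labels here are illustrative, and `final_documents` / `embeddings` come from the ingestion sketch):

```python
import streamlit as st

st.title("Gemma Document Q&A")

if st.button("Documents Embedding"):
    # Run the load -> split -> embed -> FAISS pipeline and cache the index in session state
    st.session_state.vectors = FAISS.from_documents(final_documents, embeddings)
    st.write("Vector store DB is ready")

question = st.text_input("Enter your question from the documents")
```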

Cornell Notes

The project builds a Streamlit document Q&A chatbot using RAG. It uses Groq Cloud’s API to run Gemma 7B It (“gemma-7b-it”) for fast text generation, while Google Generative AI embeddings convert PDF text chunks into vectors. Local PDFs are loaded, split into overlapping chunks (chunk size 1000, overlap 200), embedded with “models/embedding-001,” and stored in a FAISS vector database. A retriever pulls the most relevant chunks for each user question, and a LangChain “stuff” document chain combines that context with a prompt that demands answers based only on the provided material. This setup keeps responses grounded in the documents and demonstrates low latency thanks to Groq’s LPU inference approach.

Why does the build use Groq Cloud (LPU inference) instead of running the LLM locally?

Groq Cloud is used to reduce inference latency for LLM responses. The transcript attributes the speed to Groq’s LPU (Language Processing Unit), designed to overcome LLM bottlenecks tied to compute density and memory bandwidth. In practice, the app calls Groq’s hosted API rather than provisioning a local inference server, and the demo reports fast generation (including an end-to-end time of around 10 seconds for the shown workflow).

What role do embeddings and FAISS play in the chatbot’s accuracy?

Embeddings turn text chunks from PDFs into vectors so similarity search can find relevant passages. The transcript uses Google Generative AI embeddings (configured with “models/embedding-001”) to embed chunked text, then stores those vectors in FAISS. When a user asks a question, the FAISS-backed retriever selects the most similar chunks, which become the context fed to the LLM, reducing hallucinations and keeping answers tied to the documents.
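Continuing the ingestion sketch, the FAISS index can be queried directly or wrapped as a retriever (the query string is just an example):

```python
# Find the chunks most similar to a question; these become the LLM's context.
query = "What is the uninsured rate by state in 2022?"     # example question
relevant_chunks = vectors.similarity_search(query, k=4)    # top-4 nearest chunks

for chunk in relevant_chunks:
    print(chunk.page_content[:120])                         # preview of each hit

retriever = vectors.as_retriever()   # retriever interface used by the retrieval chain
```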

How are PDFs transformed into retrievable context?

PDFs from a local folder (“us_census”) are loaded with a PDF directory loader, then split using a recursive character text splitter. The chunking settings are chunk_size=1000 and chunk_overlap=200, meaning adjacent chunks share roughly 200 characters so context is preserved across boundaries. The resulting “final documents” are embedded and indexed in FAISS.
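A tiny, self-contained illustration of the overlap behavior, using smaller numbers than the app’s 1000/200 so the shared text is easy to see:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Synthetic long text standing in for a PDF page
long_report_text = " ".join(f"Sentence {i} of the census report." for i in range(40))

# 100-character chunks with roughly 20 characters of overlap between neighbours
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
chunks = splitter.split_text(long_report_text)

# Text near the end of one chunk reappears at the start of the next,
# so sentences cut at a boundary still appear whole in one of the chunks.
print(chunks[0][-30:])
print(chunks[1][:30])
```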

What prompt constraint keeps answers grounded in the retrieved text?

A chat prompt template instructs the model: “answer question based on the provided context only” and to “provide the most accurate response based on the question context.” This prompt is paired with LangChain’s create_stuff_documents_chain, which injects the retrieved document chunks as the context variable used by the LLM.
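A sketch of that chain wiring, reusing `llm` and `retriever` from the earlier sketches and assuming the `{context}` / `{input}` variable names that LangChain’s retrieval-chain helpers expect:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

prompt = ChatPromptTemplate.from_template(
    """
    Answer the questions based on the provided context only.
    Please provide the most accurate response based on the question.
    <context>
    {context}
    </context>
    Question: {input}
    """
)

# "Stuff" every retrieved chunk into the {context} slot and make a single LLM call
document_chain = create_stuff_documents_chain(llm, prompt)
retrieval_chain = create_retrieval_chain(retriever, document_chain)
```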

What happens when the user clicks “documents embedding” and then asks a question?

Clicking the embedding button triggers the vectorization pipeline: load PDFs → split into chunks → embed chunks → build the FAISS vector store. After the vector store is ready, submitting a question runs the retrieval chain: the retriever pulls relevant chunks from FAISS, the document chain combines them with the prompt, and the Groq-hosted Gemma 7B It model generates the final answer. The demo also shows the retrieved context/page content being displayed.
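Putting the pieces together, a sketch of the question path in the Streamlit app (the timing call and expander label are illustrative; `question` and `retrieval_chain` come from the sketches above):

```python
import time

if question:
    start = time.process_time()
    response = retrieval_chain.invoke({"input": question})   # retrieve + generate
    st.write(f"Response time: {time.process_time() - start:.2f}s")

    st.write(response["answer"])                              # grounded answer from Gemma 7B

    # Optionally show the retrieved chunks that grounded the answer
    with st.expander("Document similarity search"):
        for doc in response["context"]:
            st.write(doc.page_content)
            st.write("--------------------------------")
```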

Which API keys are required and where are they used?

Two environment variables are set. One is the Groq API key (used to call Groq’s chat model endpoint for Gemma 7B It). The other is the Google API key (used to access Google Generative AI embeddings for converting text chunks into vectors). Both keys are loaded from a .env file via environment variable access in Python.

Review Questions

  1. How do chunk size and chunk overlap (1000 and 200) affect retrieval quality in a PDF-based RAG system?
  2. Describe the end-to-end flow from PDF upload to final answer generation in this Streamlit app.
  3. What is the purpose of a retriever and a “stuff” document chain in LangChain, and how do they interact with the prompt template?

Key Points

  1. Use Groq Cloud’s API to run Gemma 7B It (“gemma-7b-it”) for low-latency LLM responses, avoiding local inference infrastructure.
  2. Convert PDF text into embeddings with Google Generative AI embeddings (“models/embedding-001”) to enable semantic similarity search.
  3. Chunk documents with a recursive character splitter (chunk size 1000, overlap 200) to preserve context across boundaries.
  4. Store embeddings in a FAISS vector database and use it as the retriever to fetch the most relevant chunks per question.
  5. Combine retrieved context with a prompt that requires answering based only on provided context to reduce hallucinations.
  6. Build the UI in Streamlit with a button to generate the vector store and an input box to run the retrieval + generation chain.
  7. Keep Groq and Google credentials in environment variables (.env) and load them at runtime for secure API access.

Highlights

Groq’s LPU-based inference is used to speed up LLM generation, with the demo emphasizing fast token generation and quick end-to-end response time.
RAG is implemented by embedding chunked PDFs into FAISS and retrieving the most similar passages before calling Gemma 7B It.
The prompt template explicitly restricts answers to the retrieved context, helping the chatbot stay grounded in the source documents.
