7-End To End Advanced RAG Project using Open Source LLM Models And Groq Inferencing engine

Krish Naik · 5 min read

Based on Krish Naik's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Scrape a target website with LangChain’s web loader, then chunk it using RecursiveCharacterTextSplitter before embedding.

Briefing

The core takeaway is an end-to-end RAG (retrieval-augmented generation) app built with open-source LLMs, where web content is scraped, chunked, embedded, indexed into a vector store, and then queried through a LangChain retrieval pipeline—served as a Streamlit web app. The project’s practical twist is using Groq’s inference engine (via Groq’s API and LangChain’s ChatGroq integration) to speed up LLM responses while keeping the rest of the stack open-source.

Implementation starts with setting up dependencies and API access. The workflow uses Streamlit for the interface, BeautifulSoup for scraping, and LangChain components for loading web documents, splitting text, embedding, and retrieval. Groq’s “self-serve” API access is used to generate an API key, which is then stored in environment variables (the code reads it via os.environ). Groq’s inference engine is described as using an LPU (Language Processing Unit) designed to reduce LLM bottlenecks tied to compute density and memory bandwidth—positioned as faster than GPU-based inference for compute-heavy workloads.
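
To make that wiring concrete, here is a minimal sketch of the Groq setup, assuming the key is exported as GROQ_API_KEY and that the Groq model id for Gemma 7B is "gemma-7b-it" (both naming details are assumptions, not quotes from the transcript):

```python
import os

from langchain_groq import ChatGroq

# Read the Groq API key from environment variables, as the transcript describes.
groq_api_key = os.environ["GROQ_API_KEY"]

# ChatGroq is LangChain's client for Groq's inference engine.
# "gemma-7b-it" is an assumed model id for the Gemma 7B model.
llm = ChatGroq(groq_api_key=groq_api_key, model_name="gemma-7b-it")
```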

On the RAG side, the app uses LangChain’s web loader to fetch content from a target website (the transcript references LangChain documentation pages as the example source). The loaded documents are then chunked with a RecursiveCharacterTextSplitter using a large chunk size (the described code mentions values from about 1,000 to 10,000) and an overlap of about 200 characters to preserve context across chunk boundaries. For embeddings, the project uses an open-source embedding model from LangChain Community (Ollama embeddings, which the transcript’s captions render as “AMA embedding”), avoiding OpenAI embeddings entirely.
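
A sketch of that loading-and-splitting step, assuming a LangChain documentation URL as the example source and a 1,000-character chunk size (the exact values in the video may differ):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import OllamaEmbeddings

# Scrape the target site; this URL is illustrative, not necessarily
# the one used in the video.
loader = WebBaseLoader("https://docs.smith.langchain.com/")
docs = loader.load()

# Overlapping chunks preserve context across chunk boundaries.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Open-source embeddings served by a locally running Ollama instance.
embeddings = OllamaEmbeddings()
```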

Those chunks are converted into vectors and stored in a vector database interface provided by LangChain Community (the transcript uses FAISS). The app caches key objects in Streamlit session state—embeddings, loaded documents, split text, and the vector index—so repeated queries don’t redo the entire indexing step.
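
Building the index is a single call; a minimal sketch, reusing the chunks and embeddings names from the sketch above (assumed names, and FAISS requires the faiss-cpu package):

```python
from langchain_community.vectorstores import FAISS

# Embed each chunk and store the vectors in a FAISS index.
vectors = FAISS.from_documents(chunks, embeddings)

# Expose the index as a retriever for the chain assembled next.
retriever = vectors.as_retriever()
```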

For generation, the pipeline builds a prompt template that instructs the model to answer using only the retrieved context. A document-combining chain (create_stuff_documents_chain) and a retrieval chain (create_retrieval_chain) are assembled so that user questions trigger a similarity search over the FAISS index, which then feeds the most relevant context into the Groq-hosted open model.
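
A minimal sketch of that chain assembly, reusing the llm and retriever objects from the earlier sketches; the prompt wording here is an assumption, not a quote from the video:

```python
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# The prompt pins the model to the retrieved context.
prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the provided context.

<context>
{context}
</context>

Question: {input}"""
)

# "Stuff" every retrieved document into the prompt, then put the
# retriever in front so each question runs a similarity search first.
document_chain = create_stuff_documents_chain(llm, prompt)
retrieval_chain = create_retrieval_chain(retriever, document_chain)
```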

Model choice becomes a key performance and quality lever. The transcript notes that Gemma 7B (rendered “Gamma” in the auto-captions) has a relatively limited context length (8K is mentioned), which can reduce summarization quality when the retrieved context is long. To address this, the workflow suggests trying other Groq models with larger context windows (the transcript mentions Mistral as an alternative) to improve accuracy and responsiveness.

Finally, the app measures response time and uses a Streamlit expander to display retrieved context, making it easier to debug retrieval quality. The first run is slower because indexing and embedding happen on page load; subsequent queries are faster due to cached session state. The result is a working template for an open-source RAG system with Groq acceleration, plus clear guidance on where speed and answer quality can break—indexing cost and context-length limits.
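
The timing-and-debugging pattern could look like the following sketch, carrying over the retrieval_chain name from earlier (an assumption):

```python
import time

import streamlit as st

question = st.text_input("Ask a question about the indexed site")

if question:
    # Time the retrieval-plus-generation round trip.
    start = time.process_time()
    response = retrieval_chain.invoke({"input": question})
    st.write(f"Response time: {time.process_time() - start:.2f}s")
    st.write(response["answer"])

    # Surface the retrieved chunks for debugging retrieval quality.
    with st.expander("Document similarity search"):
        for doc in response["context"]:
            st.write(doc.page_content)
            st.write("---")
```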

Cornell Notes

The project builds an end-to-end RAG application that scrapes a website, splits the text into chunks, embeds those chunks with an open-source embedding model, and indexes them in FAISS. When a user asks a question in a Streamlit interface, LangChain retrieves the most relevant chunks from the vector store and feeds them into a prompt template for a Groq-hosted model via ChatGroq. Groq’s inference engine (using an LPU) is used to improve inference speed, while the rest of the pipeline stays open-source. The transcript highlights two practical issues: the first run is slow due to indexing/embedding, and answer quality can suffer if the chosen model’s context length is too small (Gemma 7B is cited with an 8K window).

How does the app turn a website into something a model can answer from?

It loads the website content using LangChain’s web-based document loader, then chunks the text with a RecursiveCharacterTextSplitter (chunk size on the order of ~1,000 to ~10,000 and overlap around 200). Those chunks are embedded using an open-source embedding model (Ollama embeddings, rendered “AMA embedding” in the transcript) and stored as vectors in a FAISS vector store. This vector index becomes the retrieval layer for later questions.

What exactly happens when a user submits a question?

The retrieval chain performs a similarity search over the FAISS index to find relevant chunks. A document-combining chain (create_stuff_documents_chain) then injects the retrieved context into a prompt template that instructs the model to answer using the provided context only. The final response is generated by ChatGroq using the Groq API key read from environment variables.
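
In code, that whole round trip is a single call; a sketch, reusing the retrieval_chain built earlier (the question string is illustrative):

```python
# One call runs similarity search, stuffs the hits into the prompt,
# and generates the answer with the Groq-hosted model.
response = retrieval_chain.invoke(
    {"input": "How do I use a text splitter in LangChain?"}
)
print(response["answer"])   # the grounded answer
print(response["context"])  # the retrieved source chunks
```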

Why does the first run feel slower than later queries?

Indexing and embedding happen during page load. The transcript notes that the initial run must load, split, embed, and index the website content to create the vector store, which takes noticeable time. After that, Streamlit session state caches the embeddings, documents, and FAISS index, so subsequent questions reuse the existing vectors instead of rebuilding them.
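
The caching pattern is a simple guard on st.session_state; a sketch under the same naming assumptions as the earlier snippets (Streamlit reruns the whole script on every interaction, so the guard is what makes later runs fast):

```python
import streamlit as st
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS

# Build the expensive artifacts only once per session.
if "vectors" not in st.session_state:
    st.session_state.embeddings = OllamaEmbeddings()
    st.session_state.docs = WebBaseLoader(
        "https://docs.smith.langchain.com/"  # illustrative source URL
    ).load()
    st.session_state.chunks = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=200
    ).split_documents(st.session_state.docs)
    st.session_state.vectors = FAISS.from_documents(
        st.session_state.chunks, st.session_state.embeddings
    )
```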

How does model context length affect summarization quality in this setup?

The transcript points out that Gemma 7B has a limited context length (8K is mentioned). If the retrieved context plus the prompt exceeds what the model can handle effectively, summarization quality can degrade. The suggested fix is to try another Groq model with a larger context window (the transcript mentions Mistral) to improve accuracy when more context is needed.

What role does Streamlit session state play in the architecture?

Session state stores intermediate artifacts—embeddings, loaded documents, chunked text, and the FAISS vector store—so the app doesn’t redo expensive steps on every interaction. The transcript also uses Streamlit UI elements like an expander to show retrieved context, which helps verify whether retrieval is returning the right passages.

How is Groq integrated into the LangChain RAG pipeline?

Groq is used only for inference: the app creates a Groq API key via Groq’s self-serve playground, stores it in environment variables, and passes it into LangChain’s ChatGroq. The model name is set to a Groq-hosted open model (Gemma 7B is used first in the transcript), and the RAG logic remains LangChain-driven.

Review Questions

  1. Where in the pipeline does the system spend most time on the first run, and what caching mechanism reduces that cost later?
  2. How do the prompt instructions and retrieval chain work together to keep answers grounded in retrieved context?
  3. What symptoms would suggest that the chosen Groq model’s context length is too small, and how would you test a fix?

Key Points

  1. Scrape a target website with LangChain’s web loader, then chunk it using RecursiveCharacterTextSplitter before embedding.

  2. Use open-source embeddings (LangChain Community’s Ollama embeddings, rendered “AMA embedding” in the transcript) to avoid OpenAI embeddings.

  3. Index embedded chunks in FAISS and cache the vector store in Streamlit session state to prevent repeated indexing.

  4. Build a retrieval chain that injects retrieved context into a prompt template instructing the model to answer using only that context.

  5. Integrate Groq for fast inference through LangChain’s ChatGroq, using a Groq API key stored in environment variables.

  6. Model choice matters: a limited context length (e.g., Gemma 7B’s 8K window) can reduce summarization quality; try models with larger context windows such as Mistral.

  7. Expect a slower first page load due to embedding/indexing; subsequent queries should be faster because cached vectors are reused.

Highlights

The RAG pipeline is fully end-to-end: web loading → chunking → open-source embeddings → FAISS indexing → retrieval-augmented answering via ChatGroq.
Groq’s speed-up is attributed to its LPU-based inference approach aimed at reducing LLM bottlenecks tied to compute density and memory bandwidth.
Answer quality can hinge on context length; Gemma 7B’s 8K window is flagged as a reason summarization may not match expectations.
Streamlit session state is used to cache embeddings and the vector store, making repeated queries much faster than the initial indexing run.
