
LangChain & Supabase Tutorial: How to Build a ChatGPT Chatbot For Your Website

Chat with data · 5 min read

Based on Chat with data's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.

TL;DR

Scrape selected website pages, convert them to raw text, and split the text into smaller chunks before embedding.

Briefing

A practical blueprint for turning a website into a ChatGPT-style chatbot hinges on one move: retrieve the most relevant chunks of your site’s text using vector similarity, then feed that retrieved context—along with the ongoing chat history—into a language model to generate answers grounded in your content. Instead of a one-off Q&A, the system keeps a running conversation state so follow-up questions can reference earlier exchanges, mirroring how ChatGPT maintains context.

At a high level, the workflow starts by scraping selected pages from a site (the tutorial intentionally avoids deep crawling). Each page is converted from HTML into raw text, then split into smaller chunks. Those chunks are embedded into vectors—numerical representations in high-dimensional space—so a computer can match questions to semantically similar passages. The resulting vectors and their associated metadata are stored in a vector database.
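The chunking step can be sketched without any framework. A minimal fixed-size splitter with overlap captures the idea (the tutorial presumably uses LangChain's text splitters, which additionally prefer natural boundaries like paragraphs; the sizes below are illustrative, not the tutorial's settings):

```typescript
// Minimal fixed-size chunker with overlap. Overlap keeps context that
// straddles a chunk boundary retrievable from either side.
function splitText(text: string, chunkSize = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached
    start += chunkSize - overlap; // step forward, keeping `overlap` chars
  }
  return chunks;
}
```

Each resulting chunk is what gets embedded and stored, together with its metadata.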

When a user asks something like “How do I become more productive?”, the system doesn’t immediately send the question to the model. It first uses the chat history to create a “standalone question.” That step matters because follow-ups often rely on earlier context; by rewriting the user’s latest message into an independent query, the retrieval step can search the vector store accurately. The standalone question is embedded, compared against stored vectors using similarity search, and the top matching chunks are pulled back as context. The language model then receives both the standalone question and the retrieved text, producing an answer that can be tailored to the website’s content.

For the demo, the chatbot is built around Thomas Frank’s productivity guides (notably a large Notion-focused set). Rather than crawling everything automatically, the tutorial manually selects URLs and inspects the page structure using browser developer tools. It looks for consistent HTML elements—such as a title tag (e.g., an H1 with a stable class) and a content container (e.g., a div holding the article body)—and avoids pages where content loads via JavaScript. For static pages, it uses Cheerio to parse HTML after a GET request, extract the relevant sections, clean them, and package them into LangChain “Document” objects containing page content plus metadata.

The vector storage layer uses Supabase with the pgvector extension. The setup includes enabling the extension, creating a table for documents, and defining a similarity function that compares query vectors to stored vectors using cosine similarity (with an index for speed). After scraping and embedding, the system can be tested with a simplified demo query that retrieves similar documents and returns a model response.

Finally, the tutorial demonstrates the end-to-end chatbot experience in a small website UI: users ask questions about Notion topics, request clarifications ("What do you mean by…?"), and get answers that reflect the underlying guide content. The result is a basic but complete pattern for building a website-specific assistant: scrape → chunk → embed → store in pgvector → retrieve with chat-history-aware standalone questions → generate responses with grounded context.

Cornell Notes

The core idea is to build a website chatbot by grounding answers in the site’s own text. Pages are scraped, converted to raw text, split into chunks, embedded into vectors, and stored in a vector database. When users ask questions, the system rewrites the latest message into a standalone question using the full chat history, then retrieves the most relevant chunks via vector similarity search. Those retrieved chunks become context for the language model, which generates the final response. This approach supports multi-turn conversations because follow-ups are handled through chat-history-aware question rewriting and retrieval.

Why split scraped website text into chunks before embedding?

Embedding entire pages often dilutes relevance and increases cost. Chunking turns long documents into smaller passages that can be matched to specific parts of a user’s question. The tutorial describes splitting raw page text into chunks, embedding each chunk, and storing the chunk content plus metadata so retrieval can return the most relevant segments rather than an entire page.

How does chat history improve retrieval for follow-up questions?

Follow-ups like “What do you mean by that?” depend on earlier context. The tutorial uses a step that asks the language model to create a standalone question from the chat history plus the latest user message. That standalone question is then embedded and used to query the vector store, making similarity search more accurate than embedding the follow-up verbatim.

What does the system send to the language model to generate an answer?

It sends the standalone question along with the retrieved relevant document chunks as context. The retrieved text is pulled from the vector store using similarity search, then combined with the question in the prompt. The model’s output is therefore constrained by the website-derived context rather than being purely generic.

How does Supabase with pgvector fit into the architecture?

Supabase serves as the vector storage layer. pgvector extends Postgres (on which Supabase is built) to store embedding vectors and run similarity search. The tutorial describes enabling the pgvector extension, creating a table for documents, and defining a function that compares query vectors to stored vectors with cosine similarity (plus an index to speed up matching). Retrieved chunks come from this similarity search.

Why does the tutorial prefer static pages and Cheerio over JavaScript-heavy pages?

Cheerio works by parsing HTML returned from a GET request. If a page’s content is loaded dynamically via JavaScript, the HTML fetched initially may not contain the needed text. The tutorial notes that for JavaScript-loaded pages, tools like Puppeteer could be used, but static pages make extraction simpler and more reliable for this baseline setup.

What metadata is stored alongside embeddings, and why?

Each chunk is stored with metadata that points back to where it came from (for example, references to the source content). The tutorial describes storing chunk content plus metadata in the vector database. This metadata can support traceability and references in responses, even if the retrieval primarily uses vector similarity.

Review Questions

  1. What problem does the “standalone question” step solve in a multi-turn chatbot, and how does it affect vector retrieval?
  2. Walk through the data pipeline from scraping to answering: which components handle scraping, chunking, embedding, storage, retrieval, and generation?
  3. How does pgvector's similarity search (cosine similarity) determine which chunks are returned as context?

Key Points

  1. Scrape selected website pages, convert them to raw text, and split the text into smaller chunks before embedding.

  2. Embed each chunk into high-dimensional vectors and store vectors plus chunk metadata in a vector database.

  3. Use chat history to rewrite the latest user message into a standalone question so follow-ups retrieve the right content.

  4. Retrieve the most similar chunks from the vector store using vector similarity search, then pass those chunks as context to the language model.

  5. Ground answers by combining the standalone question with retrieved website text rather than sending the question alone.

  6. Implement vector storage with Supabase plus the pgvector extension, including cosine-similarity-based matching and indexing for speed.

  7. For reliable extraction in this baseline, target pages with consistent static HTML structure and parse them with Cheerio.

Highlights

The chatbot’s retrieval step depends on rewriting follow-up messages into a standalone question using prior chat history, improving semantic search accuracy.
Answers are generated from retrieved website chunks: the system embeds the standalone question, pulls the closest vectors from pgvector, and feeds that text into the language model.
Supabase plus pgvector provides both embedding storage and similarity search via cosine similarity, making the retrieval pipeline straightforward.
The tutorial’s extraction approach relies on consistent HTML elements (like stable title and content containers) and avoids JavaScript-rendered pages for simplicity.
