LangChain & Supabase Tutorial: How to Build a ChatGPT Chatbot For Your Website
Based on the Chat with data video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.
Briefing
A practical blueprint for turning a website into a ChatGPT-style chatbot hinges on one move: retrieve the most relevant chunks of your site’s text using vector similarity, then feed that retrieved context—along with the ongoing chat history—into a language model to generate answers grounded in your content. Instead of a one-off Q&A, the system keeps a running conversation state so follow-up questions can reference earlier exchanges, mirroring how ChatGPT maintains context.
At a high level, the workflow starts by scraping selected pages from a site (the tutorial intentionally avoids deep crawling). Each page is converted from HTML into raw text, then split into smaller chunks. Those chunks are embedded into vectors—numerical representations in high-dimensional space—so a computer can match questions to semantically similar passages. The resulting vectors and their associated metadata are stored in a vector database.
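The chunking step described above can be sketched as a simple fixed-size splitter with overlap. This is illustrative only: the tutorial uses LangChain's text splitter, and the sizes below are assumptions, not the tutorial's exact settings.

```typescript
// Split raw page text into overlapping chunks before embedding.
// chunkSize and overlap are illustrative defaults, not the tutorial's values.
function splitIntoChunks(text: string, chunkSize = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    start += chunkSize - overlap; // step back by `overlap` so adjacent chunks share context
  }
  return chunks;
}
```

The overlap matters because a sentence cut in half at a chunk boundary would otherwise be hard to retrieve; sharing a margin between neighboring chunks keeps each one self-contained enough to embed meaningfully.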
When a user asks something like “How do I become more productive?”, the system doesn’t immediately send the question to the model. It first uses the chat history to create a “standalone question.” That step matters because follow-ups often rely on earlier context; by rewriting the user’s latest message into an independent query, the retrieval step can search the vector store accurately. The standalone question is embedded, compared against stored vectors using similarity search, and the top matching chunks are pulled back as context. The language model then receives both the standalone question and the retrieved text, producing an answer that can be tailored to the website’s content.
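The question-rewriting step can be sketched as a prompt builder that folds the chat history and the latest message into a single rewrite request. The wording below is an assumption for illustration; the tutorial relies on LangChain's condense-question prompt to the same effect.

```typescript
interface ChatTurn {
  role: "user" | "assistant";
  content: string;
}

// Build the prompt that asks the model to rewrite a follow-up
// into a standalone question. Prompt text is illustrative.
function buildCondensePrompt(history: ChatTurn[], question: string): string {
  const transcript = history
    .map((t) => `${t.role === "user" ? "Human" : "Assistant"}: ${t.content}`)
    .join("\n");
  return [
    "Given the following conversation and a follow-up question,",
    "rephrase the follow-up question to be a standalone question.",
    "",
    `Chat history:\n${transcript}`,
    `Follow-up question: ${question}`,
    "Standalone question:",
  ].join("\n");
}
```

The model's answer to this prompt, not the user's raw message, is what gets embedded and sent to the similarity search.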
For the demo, the chatbot is built around Thomas Frank’s productivity guides (notably a large Notion-focused set). Rather than crawling everything automatically, the tutorial manually selects URLs and inspects the page structure using browser developer tools. It looks for consistent HTML elements—such as a title tag (e.g., an H1 with a stable class) and a content container (e.g., a div holding the article body)—and avoids pages where content loads via JavaScript. For static pages, it uses Cheerio to parse HTML after a GET request, extract the relevant sections, clean them, and package them into LangChain “Document” objects containing page content plus metadata.
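The shape of the scraped output can be sketched as follows. In the tutorial the extraction itself is done with Cheerio and the result goes into LangChain's `Document` class; the interface and the `title`/`source` metadata fields below are assumptions about what is stored, shown here without the library.

```typescript
// A LangChain-style "Document": cleaned page text plus metadata.
interface Doc {
  pageContent: string;
  metadata: { source: string; title?: string };
}

// Package an extracted title and article body into a Document,
// collapsing the whitespace left over from HTML extraction.
function toDocument(url: string, title: string, body: string): Doc {
  const cleaned = body.replace(/\s+/g, " ").trim();
  return { pageContent: cleaned, metadata: { source: url, title } };
}
```

Keeping the source URL in metadata is what later lets an answer be traced back to the page it came from.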
The vector storage layer uses Supabase with the PG Vector extension. The setup includes enabling the vector extension, creating a table for documents, and defining a similarity function that compares query vectors to stored vectors using cosine similarity (with indexing for speed). After scraping and embedding, the system can be tested with a simplified “demo query” that retrieves similar documents and returns a model response.
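The Supabase setup described above roughly follows the standard pgvector/LangChain SQL template. The fragment below is a hedged sketch: the table and function names, the `jsonb` metadata column, and the 1536-dimension vector (OpenAI embeddings) are assumptions that may differ from the tutorial's exact schema.

```sql
-- Enable the pgvector extension.
create extension if not exists vector;

-- One row per embedded chunk: text, metadata, and its embedding.
create table documents (
  id bigserial primary key,
  content text,
  metadata jsonb,
  embedding vector(1536)  -- 1536 assumes OpenAI embeddings
);

-- Return the chunks most similar to a query embedding.
-- <=> is pgvector's cosine-distance operator, so similarity = 1 - distance.
create function match_documents (
  query_embedding vector(1536),
  match_count int
) returns table (id bigint, content text, metadata jsonb, similarity float)
language sql stable as $$
  select id, content, metadata,
         1 - (embedding <=> query_embedding) as similarity
  from documents
  order by embedding <=> query_embedding
  limit match_count;
$$;

-- Approximate index for faster similarity search on larger tables.
create index on documents using ivfflat (embedding vector_cosine_ops);
```

With this in place, the retrieval step reduces to embedding the standalone question and calling `match_documents` with it.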
Finally, the tutorial demonstrates the end-to-end chatbot experience in a small website UI: users ask questions about Notion topics, request clarifications (“What do you mean by…?”), and get answers that reflect the underlying guide content. The result is a basic but complete pattern for building a website-specific assistant: scrape → chunk → embed → store in PG Vector → retrieve with chat-history-aware standalone questions → generate responses with grounded context.
Cornell Notes
The core idea is to build a website chatbot by grounding answers in the site’s own text. Pages are scraped, converted to raw text, split into chunks, embedded into vectors, and stored in a vector database. When users ask questions, the system rewrites the latest message into a standalone question using the full chat history, then retrieves the most relevant chunks via vector similarity search. Those retrieved chunks become context for the language model, which generates the final response. This approach supports multi-turn conversations because follow-ups are handled through chat-history-aware question rewriting and retrieval.
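The "vector similarity" these notes keep referring to is, concretely, cosine similarity between embedding vectors. A minimal sketch of the underlying math (PG Vector computes this in-database; its `<=>` operator returns the cosine *distance*, i.e. one minus this value):

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// 1 means identical direction, 0 means orthogonal (unrelated).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Ranking stored chunks by this score against the question's embedding is what makes "semantically similar passages" retrievable even when they share no keywords with the question.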
Why split scraped website text into chunks before embedding?
How does chat history improve retrieval for follow-up questions?
What does the system send to the language model to generate an answer?
How does Supabase with PG Vector fit into the architecture?
Why does the tutorial prefer static pages and Cheerio over JavaScript-heavy pages?
What metadata is stored alongside embeddings, and why?
Review Questions
- What problem does the “standalone question” step solve in a multi-turn chatbot, and how does it affect vector retrieval?
- Walk through the data pipeline from scraping to answering: which components handle scraping, chunking, embedding, storage, retrieval, and generation?
- How does PG Vector’s similarity search (cosine similarity) determine which chunks are returned as context?
Key Points
1. Scrape selected website pages, convert them to raw text, and split the text into smaller chunks before embedding.
2. Embed each chunk into high-dimensional vectors and store vectors plus chunk metadata in a vector database.
3. Use chat history to rewrite the latest user message into a standalone question so follow-ups retrieve the right content.
4. Retrieve the most similar chunks from the vector store using vector similarity search, then pass those chunks as context to the language model.
5. Ground answers by combining the standalone question with retrieved website text rather than sending the question alone.
6. Implement vector storage with Supabase plus the PG Vector extension, including cosine-similarity-based matching and indexing for speed.
7. For reliable extraction in this baseline, target pages with consistent static HTML structure and parse them with Cheerio.