Advanced RAG 05 - HyDE - Hypothetical Document Embeddings
Based on Sam Witteveen's video on YouTube. If you find this useful, support the original creator by watching, liking, and subscribing.
HyDE improves dense retrieval by embedding an LLM-generated hypothetical answer rather than the original query.
Briefing
HyDE (Hypothetical Document Embeddings) improves retrieval in RAG by using a large language model to draft a “hypothetical answer,” embedding that generated text, and then running similarity search against document chunks using the embedding of the answer—not the embedding of the original query. The practical payoff is straightforward: when user questions are vague, missing key nouns, or otherwise hard to match semantically, the LLM can supply the missing topical anchors (entities and concepts) so the vector search lands on the right material more reliably.
The core mechanism works like this. A user query goes into an LLM, which produces a short passage that would answer the question. That passage is never meant for the user; it exists to create a better embedding target. Instead of comparing “query embedding → chunk embeddings,” HyDE compares “hypothetical answer embedding → chunk embeddings.” Even when the hypothetical answer is imperfect, it often still contains the right terms and relationships—enough for embedding similarity to retrieve relevant chunks.
A concrete example centers on a query like “What is McDonald’s best items?” The query itself doesn’t explicitly mention food, burgers, or specific menu items. Directly embedding that query and searching a vector store can underperform because the query embedding lacks the very nouns the relevant chunks contain. HyDE changes the workflow: the LLM generates an answer mentioning fast food and likely bestsellers such as the Big Mac and other menu items, so the resulting embedding is aligned with the vocabulary actually present in the documents, increasing the chance that chunks about those items are retrieved.
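To make the mechanism concrete, here is a minimal sketch, assuming an OpenAI chat model for the draft and a local BGE model via sentence-transformers for embeddings; the model names and toy chunks are illustrative placeholders, not taken from the video:

```python
# Minimal HyDE sketch. Assumptions: OPENAI_API_KEY is set in the environment,
# and the model names and toy chunks below are illustrative placeholders.
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")  # local BGE model

chunks = [
    "The Big Mac remains McDonald's flagship burger worldwide.",
    "McDonald's fries are among its best-selling menu items.",
    "The company was founded in 1940 in San Bernardino, California.",
]
# Pre-embed the chunks; normalized vectors make cosine similarity a dot product.
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def hyde_search(query: str, k: int = 2) -> list[str]:
    # 1. Draft a hypothetical answer; it is never shown to the user.
    hypo = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user",
                   "content": f"Write a short passage answering: {query}"}],
    ).choices[0].message.content
    # 2. Embed the hypothetical answer instead of the raw query.
    hypo_vec = embedder.encode([hypo], normalize_embeddings=True)[0]
    # 3. Rank chunks by cosine similarity to the answer embedding.
    scores = chunk_vecs @ hypo_vec
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

print(hyde_search("What is McDonald's best items?"))
```

Even if the drafted passage gets facts wrong, it will usually still contain terms like “Big Mac” or “menu item,” which is all the similarity search needs.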
HyDE also supports generating multiple hypothetical answers. Rather than relying on a single draft, the system can produce several candidate passages, embed each one, and combine them—such as by averaging embeddings—to create a more robust retrieval signal. This helps when the LLM’s first guess is incomplete or when the question’s intent is underspecified; the combined representation tends to preserve key concepts that recur across generations.
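A sketch of that multi-draft variant, under the same assumptions as above (placeholder model name; the number of drafts and the temperature are illustrative choices):

```python
# Multi-draft HyDE sketch: sample several hypothetical answers and average
# their embeddings. Model name, n, and temperature are assumptions.
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

def hyde_multi_vector(query: str, n: int = 4) -> np.ndarray:
    drafts = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",   # placeholder model name
            temperature=0.9,       # higher temperature -> more diverse drafts
            messages=[{"role": "user",
                       "content": f"Write a short passage answering: {query}"}],
        )
        drafts.append(resp.choices[0].message.content)
    vecs = embedder.encode(drafts, normalize_embeddings=True)
    mean = vecs.mean(axis=0)             # concepts shared across drafts dominate
    return mean / np.linalg.norm(mean)   # re-normalize for cosine search
```

Averaging damps the idiosyncratic details of any single draft while reinforcing the entities and concepts that recur across most of them.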
Prompting matters. The transcript highlights that customizing the LLM prompt can steer the hypothetical text toward the retrieval goal. For instance, a prompt can instruct the model to recommend a single food item (useful when the question expects one best-selling product), producing a shorter, more targeted hypothetical document. That targeted output then yields an embedding that better matches the intended document sections.
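As one possible shape for such a targeted prompt (the wording below is an assumption, not quoted from the video):

```python
# Prompt-customization sketch: steer the hypothetical text toward a single
# recommended item. The prompt wording here is an assumption.
from openai import OpenAI

client = OpenAI()

TARGETED_PROMPT = (
    "You are answering a customer question. Recommend exactly one food item "
    "and briefly explain why it is the best seller.\n\nQuestion: {question}"
)

def targeted_hypothetical(query: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user",
                   "content": TARGETED_PROMPT.format(question=query)}],
    )
    return resp.choices[0].message.content
```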
The implementation uses OpenAI for the LLM and BGE embeddings for vectorization, though the approach is model-agnostic: any embedding system can be swapped in, including local or quantized models. The transcript includes an example showing that HyDE can still work even when the hypothetical answer is wrong in its details, as long as it mentions the right concepts. It also flags a key limitation: if the topic is entirely unfamiliar to the LLM, the hypothetical text may hallucinate unrelated content and harm retrieval, so HyDE should be used cautiously in those cases.
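Because the retrieval step only needs a text-to-vector function, swapping backends is a one-function change. A sketch with two interchangeable embedders (the model names are common defaults, chosen here as assumptions):

```python
# Swappable embedding backends: a hosted OpenAI model and a local BGE model.
# Model names are common defaults, used here as assumptions.
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()
local_model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def embed_openai(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def embed_bge(texts: list[str]) -> np.ndarray:
    return local_model.encode(texts, normalize_embeddings=True)

# Either function can back the hyde_search sketch above; nothing else changes.
```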
Overall, HyDE is presented as a simple but powerful upgrade to dense retrieval inside RAG—especially for short, noun-light, or ambiguous queries—by converting “hard-to-search questions” into “search-friendly hypothetical answers” before embedding and retrieval.
Cornell Notes
HyDE (Hypothetical Document Embeddings) boosts RAG retrieval by embedding a hypothetical answer generated by an LLM, then searching document chunks using that answer embedding. This helps when user queries are vague or omit key nouns—because the LLM can supply entities and topical terms that make vector similarity search more effective. The method can generate one hypothetical passage or multiple passages; multiple embeddings can be combined (e.g., averaged) to strengthen the retrieval signal. HyDE works best when the LLM has enough knowledge to produce a plausible, concept-aligned answer; if the topic is too unfamiliar, hallucinated content can mislead retrieval. Prompt customization is central: it can steer the hypothetical text toward the exact form and level of specificity needed for the downstream search.
- How does HyDE change the standard dense retrieval workflow in RAG?
- Why can HyDE outperform “query embedding → chunk embedding” similarity search?
- What role do multiple hypothetical generations play?
- How does prompt design affect HyDE’s retrieval quality?
- When should HyDE be avoided or used cautiously?
Review Questions
- In HyDE, what gets embedded for retrieval—the original query or the LLM-generated hypothetical passage—and why does that matter for noun-light questions?
- How would you adapt HyDE prompting if the user question expects a single specific item rather than a broad list?
- What failure mode occurs when the LLM lacks knowledge about the topic, and how would that show up in retrieval results?
Key Points
1. HyDE improves dense retrieval by embedding an LLM-generated hypothetical answer rather than the original query.
2. The hypothetical passage is used only to create an embedding target; it is not meant to be shown to the user.
3. HyDE is especially useful for short or vague queries that omit key nouns and entities needed for effective vector similarity search.
4. Generating multiple hypothetical answers and combining their embeddings (e.g., averaging) can make retrieval more robust.
5. Prompt customization steers what concepts appear in the hypothetical text, which directly affects what chunks are retrieved.
6. HyDE can fail when the LLM hallucinates unrelated content due to insufficient knowledge of the topic.
7. The approach is flexible: OpenAI can be used for the LLM and BGE embeddings for vectorization, but other embedding systems (including local/quantized) can be substituted.