Hypothetical Document Embeddings: generate a hypothetical answer to the query, embed it, then use that embedding for retrieval instead of the query itself.
Queries and documents are linguistically very different. A user asks: "What causes inflation?" A relevant document says: "Inflation results from excess money supply relative to goods production, typically triggered by loose monetary policy or supply shocks."
Embedding a short question like this produces a very different vector from embedding a detailed explanatory paragraph — even if they're semantically related. Dense retrieval struggles with this gap because the two texts differ in length, style, and vocabulary density.
HyDE's insight: instead of embedding the short query, use an LLM to generate a hypothetical long-form answer, then embed that. A verbose answer has a vector distribution much closer to actual documents in the corpus.
The algorithm in three steps:

1. **Generate**: prompt an LLM to write a short hypothetical passage that answers the query.
2. **Embed**: encode that passage with the same embedding model used for the corpus.
3. **Retrieve**: rank corpus documents by similarity to the hypothetical embedding; the original query vector is never used.

The factual correctness of the hypothetical document doesn't matter — only its embedding does. A wrong answer that's on-topic produces a useful embedding.
```python
import anthropic
from sentence_transformers import SentenceTransformer
import numpy as np

client = anthropic.Anthropic()
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# BGE models expect this prefix on queries, but not on passages
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def generate_hypothetical_doc(query: str) -> str:
    """Ask the LLM to write a document that answers the query."""
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": (
                "Write a concise paragraph (2-3 sentences) that directly answers "
                "the following question. Focus on being on-topic; accuracy is secondary.\n\n"
                f"Question: {query}"
            )
        }]
    )
    return response.content[0].text

def hyde_retrieve(query: str, corpus: list[str], k: int = 5) -> list[str]:
    corpus_embs = model.encode(corpus, normalize_embeddings=True)

    # Standard dense retrieval, kept for comparison
    q_emb = model.encode(QUERY_PREFIX + query, normalize_embeddings=True)
    print("baseline top hit:", corpus[int(np.argmax(corpus_embs @ q_emb))])

    # HyDE: generate a hypothetical doc and embed that instead.
    # The hypothetical doc is passage-like, so it gets no query prefix.
    hyp_doc = generate_hypothetical_doc(query)
    hyp_emb = model.encode(hyp_doc, normalize_embeddings=True)

    # Retrieve using the hypothetical embedding
    hyde_scores = corpus_embs @ hyp_emb
    return [corpus[i] for i in np.argsort(hyde_scores)[::-1][:k]]

# Test
corpus = [
    "Inflation is caused by excess money supply and supply shortages.",
    "The Federal Reserve controls interest rates to manage inflation.",
    "Python is a high-level interpreted programming language.",
    "Supply chain disruptions in 2021-2022 contributed to inflation.",
]
results = hyde_retrieve("What causes prices to go up?", corpus)
for r in results[:2]:
    print(r)
```
LangChain ships this pattern as `HypotheticalDocumentEmbedder`, which wraps a base embedding model: `embed_query()` generates a hypothetical document and embeds it, while `embed_documents()` passes straight through to the base embeddings. That means a single vector store works — corpus documents are embedded normally, and only queries go through HyDE.

```python
from langchain.chains import HypotheticalDocumentEmbedder
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.schema import Document

# Base embeddings (used for corpus documents and hypothetical docs alike)
base_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# HyDE wraps the base embeddings
llm = OpenAI(temperature=0)  # or another LangChain LLM wrapper
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=llm,
    base_embeddings=base_embeddings,
    prompt_key="web_search",  # built-in prompt for answer-style passages
)

# One vector store: documents embedded normally, queries via HyDE
corpus = [
    Document(page_content="Inflation results from money supply exceeding goods."),
    Document(page_content="The Fed raises rates to cool inflation."),
]
vectorstore = Chroma.from_documents(corpus, hyde_embeddings)
results = vectorstore.similarity_search("What causes prices to rise?", k=2)
```
HyDE helps when:

- queries are short, underspecified, or colloquial while corpus documents are long and formal
- there's a vocabulary or style gap between how users ask and how documents are written

HyDE hurts when:

- queries are already phrased in document-like language, so there's no gap to bridge
- queries target rare, specific facts the LLM is likely to hallucinate
- the extra LLM call per query blows the latency or cost budget
LLM hallucinations can harm retrieval. If the hypothetical document contains a confident wrong answer (e.g., wrong year, wrong person name), the embedding may pull toward the wrong documents. Use a lower temperature (0.0–0.3) and a factual prompt to minimise this.
Cost doubles. Every query now requires an LLM call for hypothesis generation plus the embedding call. For high-QPS systems, the latency and cost of the extra LLM call may outweigh the retrieval improvement.
Cache hypothetical docs. For repeated or similar queries, cache the (query → hypothetical_doc) mapping. This eliminates redundant LLM calls for common questions.
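A minimal sketch of such a cache, using `functools.lru_cache` with light query normalization; the `_generate` stub and its call counter below stand in for the real LLM call and exist only to demonstrate that repeated queries never reach the model:

```python
from functools import lru_cache

# Stub standing in for the real LLM call (e.g. generate_hypothetical_doc);
# the counter shows how many queries actually reach the model.
llm_calls = 0

def _generate(normalized_query: str) -> str:
    global llm_calls
    llm_calls += 1
    return f"A passage answering: {normalized_query}"

@lru_cache(maxsize=10_000)
def _cached_generate(normalized_query: str) -> str:
    return _generate(normalized_query)

def cached_hypothetical_doc(query: str) -> str:
    # Normalize before caching so trivial variants share one cache entry
    return _cached_generate(" ".join(query.lower().split()))

cached_hypothetical_doc("What causes inflation?")
cached_hypothetical_doc("  what causes INFLATION?  ")
print(llm_calls)  # 1: the second call is a cache hit
```

For production use, an external store (e.g. Redis) keyed the same way allows the cache to be shared across workers and survive restarts.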
Multi-hypothesis variant. Generate 3–5 hypothetical documents and average their embeddings before retrieval. This reduces the variance of any single hallucinated response, often improving recall further.
HyDE's effectiveness is highly query-dependent. Queries that are short, underspecified, or phrased differently from how documents are written benefit most from HyDE because the hypothetical document bridges the vocabulary and style gap. Queries that are already phrased in document-like language, or queries for rare facts where the generative model might hallucinate plausible-sounding but incorrect content, may perform worse with HyDE than with direct retrieval. Empirical evaluation on a labeled query-document dataset is necessary before deploying HyDE, as the improvement is not universal.
Multi-HyDE generates 3–5 hypothetical documents per query and averages their embeddings before retrieval, at the cost of 3–5x more generation calls per query. The average dampens the influence of any single hallucinated document, and the resulting ensemble embedding covers more of the relevant document space than any single hypothesis, improving recall for ambiguous queries that could be answered by several kinds of documents.
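The averaging step itself is a few lines of NumPy. This sketch assumes the k hypothetical documents have already been encoded into unit-normalized rows of `hyp_embs`:

```python
import numpy as np

def ensemble_query_embedding(hyp_embs: np.ndarray) -> np.ndarray:
    """Average several unit-normalized hypothesis embeddings (k x d)
    into one query vector, renormalized so cosine scores stay comparable."""
    mean = hyp_embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

# Toy example: 3 hypothetical-document embeddings of dimension 4
rng = np.random.default_rng(0)
hyp_embs = rng.normal(size=(3, 4))
hyp_embs /= np.linalg.norm(hyp_embs, axis=1, keepdims=True)

q_vec = ensemble_query_embedding(hyp_embs)
print(q_vec.shape)  # (4,) — same dimensionality as a single embedding
```

Renormalizing after the mean matters: the average of unit vectors is shorter than unit length, and downstream dot-product scoring assumes normalized inputs.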
| Approach | Latency | Quality | Best for |
|---|---|---|---|
| Direct query embedding | ~5ms | Baseline | Document-like queries |
| HyDE (single) | +500ms | Better for short queries | Underspecified, short queries |
| Multi-HyDE (3–5 docs) | +1500ms | Most robust | Ambiguous, diverse queries |
| Query expansion + HyDE | +600ms | High | Keyword + semantic gap |
A reference implementation of the same three steps using the OpenAI SDK for generation:

```python
from openai import OpenAI
from sentence_transformers import SentenceTransformer
import numpy as np

client = OpenAI()
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")

def hyde_retrieve(query: str, corpus_embeddings: np.ndarray,
                  corpus_texts: list, top_k: int = 5) -> list:
    # Step 1: generate the hypothetical document
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"Write a short passage that directly answers: {query}"}],
        max_tokens=200,
    )
    hyp_doc = resp.choices[0].message.content

    # Step 2: embed the hypothetical document (not the query)
    hyp_embedding = embedder.encode([hyp_doc], normalize_embeddings=True)

    # Step 3: retrieve by similarity to the hypothetical document
    scores = corpus_embeddings @ hyp_embedding.T
    top_idx = np.argsort(scores.flatten())[::-1][:top_k]
    return [corpus_texts[i] for i in top_idx]
```
HyDE is most effective for queries in domains where short queries are very different in style from the documents that answer them — for instance, a user asking "headache after coffee" where relevant documents discuss "caffeine withdrawal syndrome." The generative step transforms the colloquial query into a passage resembling the style and vocabulary of documents in the corpus, enabling the embedding model to find relevant documents based on passage-to-passage similarity rather than query-to-passage similarity. For queries that already use the vocabulary of the document corpus, HyDE provides little benefit over direct query embedding because the vocabulary gap it addresses does not exist.
HyDE can be combined with reranking in a two-stage pipeline: HyDE retrieves a wide candidate set (say, 50–100 documents) and a cross-encoder reranker selects the final top-k. On query distributions with high style variation, this tends to outperform direct embedding retrieval with the same reranking step. The HyDE stage improves first-stage recall, raising the probability that the true relevant document appears in the candidate set, while the reranker adds precision by scoring each candidate directly against the original query.
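As a sketch of that pipeline shape — the toy `overlap_score` here stands in for a real cross-encoder (e.g. `sentence_transformers.CrossEncoder`), and all names and the toy data are illustrative:

```python
import numpy as np

def hyde_then_rerank(query: str, hyp_emb: np.ndarray,
                     corpus_embs: np.ndarray, corpus_texts: list[str],
                     score_fn, candidates: int = 50, top_k: int = 5) -> list[str]:
    # Stage 1 (recall): wide candidate set by similarity to the hypothetical doc
    sims = corpus_embs @ hyp_emb
    cand_idx = np.argsort(sims)[::-1][:candidates]
    # Stage 2 (precision): rerank candidates against the ORIGINAL query
    scores = np.array([score_fn(query, corpus_texts[i]) for i in cand_idx])
    order = np.argsort(scores)[::-1][:top_k]
    return [corpus_texts[cand_idx[i]] for i in order]

# Toy demo: word-overlap scorer standing in for a cross-encoder
def overlap_score(query: str, passage: str) -> float:
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q)

texts = ["inflation money supply", "interest rates inflation", "python language"]
embs = np.eye(3)                  # pretend embeddings: each doc on its own axis
hyp = np.array([0.9, 0.8, 0.1])   # hypothetical doc close to the first two
print(hyde_then_rerank("what drives inflation and money supply", hyp, embs,
                       texts, overlap_score, candidates=3, top_k=1))
# ['inflation money supply']
```

Note that stage 2 sees the original query, not the hypothetical document — the reranker's job is precisely to correct for any drift the hypothesis introduced.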
HyDE's latency overhead from the generation call can be partially mitigated by using a smaller, faster model for hypothesis generation than for final answer generation. A 70B model used for final answer generation can be paired with a 7B model for HyDE hypothesis generation, since hypothesis quality requirements are lower than final answer quality — the hypothesis only needs to be plausible enough to retrieve the right documents, not necessarily factually correct. Using gpt-4o-mini for hypothesis generation and gpt-4o for final answer synthesis reduces the latency overhead of HyDE by 3–5x compared to using the same model for both steps.
HyDE's sensitivity to hypothesis quality creates an interesting calibration challenge. If the generative model produces hypothetical documents that are stylistically similar to the corpus but factually incorrect, the retrieved passages will be relevant to the topic but may not directly answer the query. Factual correctness of the hypothesis is less important than topical and stylistic similarity to the corpus — the embedding similarity step naturally selects for retrieved documents that match the topic of the hypothesis regardless of whether the hypothesis's specific claims are accurate. This means that even using a smaller, less accurate model for hypothesis generation retains most of HyDE's retrieval benefit.