Hypothetical Document Embeddings: generate a hypothetical answer to the query, embed it, then use that embedding for retrieval instead of the query itself.
Queries and documents are linguistically very different. A user asks: "What causes inflation?" A relevant document says: "Inflation results from excess money supply relative to goods production, typically triggered by loose monetary policy or supply shocks."
Embedding a short question like this produces a very different vector from embedding a detailed explanatory paragraph — even if they're semantically related. Dense retrieval struggles with this gap because the two texts differ in length, style, and vocabulary density.
HyDE's insight: instead of embedding the short query, use an LLM to generate a hypothetical long-form answer, then embed that. A verbose answer has a vector distribution much closer to actual documents in the corpus.
The algorithm in three steps:

1. **Generate**: prompt an LLM to write a short hypothetical passage that answers the query.
2. **Embed**: encode that passage with the same embedding model used for the corpus.
3. **Retrieve**: rank corpus documents by similarity to the hypothetical embedding; the original query vector is never used.

The factual correctness of the hypothetical document doesn't matter — only its embedding does. A wrong answer that's on-topic produces a useful embedding.
```python
import anthropic
from sentence_transformers import SentenceTransformer
import numpy as np

client = anthropic.Anthropic()
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# BGE models expect this prefix on queries, but not on passages
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def generate_hypothetical_doc(query: str) -> str:
    """Ask the LLM to write a document that answers the query."""
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": (
                "Write a concise paragraph (2-3 sentences) that directly answers "
                "the following question. Focus on being on-topic; accuracy is secondary.\n\n"
                f"Question: {query}"
            )
        }]
    )
    return response.content[0].text

def hyde_retrieve(query: str, corpus: list[str], k: int = 5) -> list[str]:
    corpus_embs = model.encode(corpus, normalize_embeddings=True)

    # Standard dense retrieval, kept for comparison
    q_emb = model.encode(QUERY_PREFIX + query, normalize_embeddings=True)
    print("baseline top hit:", corpus[int(np.argmax(corpus_embs @ q_emb))])

    # HyDE: generate a hypothetical doc and embed that instead.
    # The hypothetical doc is passage-like, so it gets no query prefix.
    hyp_doc = generate_hypothetical_doc(query)
    hyp_emb = model.encode(hyp_doc, normalize_embeddings=True)

    # Retrieve using the hypothetical embedding
    hyde_scores = corpus_embs @ hyp_emb
    return [corpus[i] for i in np.argsort(hyde_scores)[::-1][:k]]

# Test
corpus = [
    "Inflation is caused by excess money supply and supply shortages.",
    "The Federal Reserve controls interest rates to manage inflation.",
    "Python is a high-level interpreted programming language.",
    "Supply chain disruptions in 2021-2022 contributed to inflation.",
]
results = hyde_retrieve("What causes prices to go up?", corpus)
for r in results[:2]:
    print(r)
```
LangChain ships this pattern as `HypotheticalDocumentEmbedder`, which wraps a base embedding model: `embed_query()` generates a hypothetical document and embeds it, while `embed_documents()` passes straight through to the base embeddings. That means a single vector store works — corpus documents are embedded normally, and only queries go through HyDE.

```python
from langchain.chains import HypotheticalDocumentEmbedder
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain.schema import Document

# Base embeddings (used for corpus documents and hypothetical docs alike)
base_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# HyDE wraps the base embeddings
llm = OpenAI(temperature=0)  # or another LangChain LLM wrapper
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=llm,
    base_embeddings=base_embeddings,
    prompt_key="web_search",  # built-in prompt for answer-style passages
)

# One vector store: documents embedded normally, queries via HyDE
corpus = [
    Document(page_content="Inflation results from money supply exceeding goods."),
    Document(page_content="The Fed raises rates to cool inflation."),
]
vectorstore = Chroma.from_documents(corpus, hyde_embeddings)
results = vectorstore.similarity_search("What causes prices to rise?", k=2)
```
HyDE helps when:

- queries are short, underspecified, or colloquial while corpus documents are long and formal
- there's a vocabulary or style gap between how users ask and how documents are written

HyDE hurts when:

- queries are already phrased in document-like language, so there's no gap to bridge
- queries target rare, specific facts the LLM is likely to hallucinate
- the extra LLM call per query blows the latency or cost budget
LLM hallucinations can harm retrieval. If the hypothetical document contains a confident wrong answer (e.g., wrong year, wrong person name), the embedding may pull toward the wrong documents. Use a lower temperature (0.0–0.3) and a factual prompt to minimise this.
Cost doubles. Every query now requires an LLM call for hypothesis generation plus the embedding call. For high-QPS systems, the latency and cost of the extra LLM call may outweigh the retrieval improvement.
Cache hypothetical docs. For repeated or similar queries, cache the (query → hypothetical_doc) mapping. This eliminates redundant LLM calls for common questions.
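A minimal sketch of such a cache, using `functools.lru_cache` with light query normalization; the `_generate` stub and its call counter below stand in for the real LLM call and exist only to demonstrate that repeated queries never reach the model:

```python
from functools import lru_cache

# Stub standing in for the real LLM call (e.g. generate_hypothetical_doc);
# the counter shows how many queries actually reach the model.
llm_calls = 0

def _generate(normalized_query: str) -> str:
    global llm_calls
    llm_calls += 1
    return f"A passage answering: {normalized_query}"

@lru_cache(maxsize=10_000)
def _cached_generate(normalized_query: str) -> str:
    return _generate(normalized_query)

def cached_hypothetical_doc(query: str) -> str:
    # Normalize before caching so trivial variants share one cache entry
    return _cached_generate(" ".join(query.lower().split()))

cached_hypothetical_doc("What causes inflation?")
cached_hypothetical_doc("  what causes INFLATION?  ")
print(llm_calls)  # 1: the second call is a cache hit
```

For production use, an external store (e.g. Redis) keyed the same way allows the cache to be shared across workers and survive restarts.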
Multi-hypothesis variant. Generate 3–5 hypothetical documents and average their embeddings before retrieval. This reduces the variance of any single hallucinated response, often improving recall further.
HyDE's effectiveness is highly query-dependent. Queries that are short, underspecified, or phrased differently from how documents are written benefit most from HyDE because the hypothetical document bridges the vocabulary and style gap. Queries that are already phrased in document-like language, or queries for rare facts where the generative model might hallucinate plausible-sounding but incorrect content, may perform worse with HyDE than with direct retrieval. Empirical evaluation on a labeled query-document dataset is necessary before deploying HyDE, as the improvement is not universal.
Multi-HyDE generates 3–5 hypothetical documents per query and averages their embeddings before retrieval, at the cost of 3–5x more generation calls per query. The average dampens the influence of any single hallucinated document, and the resulting ensemble embedding covers more of the relevant document space than any single hypothesis, improving recall for ambiguous queries that could be answered by several kinds of documents.
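The averaging step itself is a few lines of NumPy. This sketch assumes the k hypothetical documents have already been encoded into unit-normalized rows of `hyp_embs`:

```python
import numpy as np

def ensemble_query_embedding(hyp_embs: np.ndarray) -> np.ndarray:
    """Average several unit-normalized hypothesis embeddings (k x d)
    into one query vector, renormalized so cosine scores stay comparable."""
    mean = hyp_embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

# Toy example: 3 hypothetical-document embeddings of dimension 4
rng = np.random.default_rng(0)
hyp_embs = rng.normal(size=(3, 4))
hyp_embs /= np.linalg.norm(hyp_embs, axis=1, keepdims=True)

q_vec = ensemble_query_embedding(hyp_embs)
print(q_vec.shape)  # (4,) — same dimensionality as a single embedding
```

Renormalizing after the mean matters: the average of unit vectors is shorter than unit length, and downstream dot-product scoring assumes normalized inputs.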
| Approach | Latency | Quality | Best for |
|---|---|---|---|
| Direct query embedding | ~5ms | Baseline | Document-like queries |
| HyDE (single) | +500ms | Better for short queries | Underspecified, short queries |
| Multi-HyDE (3–5 docs) | +1500ms | Most robust | Ambiguous, diverse queries |
| Query expansion + HyDE | +600ms | High | Keyword + semantic gap |
A reference implementation of the same three steps using the OpenAI SDK for generation:

```python
from openai import OpenAI
from sentence_transformers import SentenceTransformer
import numpy as np

client = OpenAI()
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")

def hyde_retrieve(query: str, corpus_embeddings: np.ndarray,
                  corpus_texts: list, top_k: int = 5) -> list:
    # Step 1: generate the hypothetical document
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"Write a short passage that directly answers: {query}"}],
        max_tokens=200,
    )
    hyp_doc = resp.choices[0].message.content

    # Step 2: embed the hypothetical document (not the query)
    hyp_embedding = embedder.encode([hyp_doc], normalize_embeddings=True)

    # Step 3: retrieve by similarity to the hypothetical document
    scores = corpus_embeddings @ hyp_embedding.T
    top_idx = np.argsort(scores.flatten())[::-1][:top_k]
    return [corpus_texts[i] for i in top_idx]
```
HyDE is most effective for queries in domains where short queries are very different in style from the documents that answer them — for instance, a user asking "headache after coffee" where relevant documents discuss "caffeine withdrawal syndrome." The generative step transforms the colloquial query into a passage resembling the style and vocabulary of documents in the corpus, enabling the embedding model to find relevant documents based on passage-to-passage similarity rather than query-to-passage similarity. For queries that already use the vocabulary of the document corpus, HyDE provides little benefit over direct query embedding because the vocabulary gap it addresses does not exist.
HyDE can be combined with reranking in a two-stage pipeline: HyDE retrieves a wide candidate set (say, 50–100 documents) and a cross-encoder reranker selects the final top-k. On query distributions with high style variation, this tends to outperform direct embedding retrieval with the same reranking step. The HyDE stage improves first-stage recall, raising the probability that the true relevant document appears in the candidate set, while the reranker adds precision by scoring each candidate directly against the original query.
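As a sketch of that pipeline shape — the toy `overlap_score` here stands in for a real cross-encoder (e.g. `sentence_transformers.CrossEncoder`), and all names and the toy data are illustrative:

```python
import numpy as np

def hyde_then_rerank(query: str, hyp_emb: np.ndarray,
                     corpus_embs: np.ndarray, corpus_texts: list[str],
                     score_fn, candidates: int = 50, top_k: int = 5) -> list[str]:
    # Stage 1 (recall): wide candidate set by similarity to the hypothetical doc
    sims = corpus_embs @ hyp_emb
    cand_idx = np.argsort(sims)[::-1][:candidates]
    # Stage 2 (precision): rerank candidates against the ORIGINAL query
    scores = np.array([score_fn(query, corpus_texts[i]) for i in cand_idx])
    order = np.argsort(scores)[::-1][:top_k]
    return [corpus_texts[cand_idx[i]] for i in order]

# Toy demo: word-overlap scorer standing in for a cross-encoder
def overlap_score(query: str, passage: str) -> float:
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q)

texts = ["inflation money supply", "interest rates inflation", "python language"]
embs = np.eye(3)                  # pretend embeddings: each doc on its own axis
hyp = np.array([0.9, 0.8, 0.1])   # hypothetical doc close to the first two
print(hyde_then_rerank("what drives inflation and money supply", hyp, embs,
                       texts, overlap_score, candidates=3, top_k=1))
# ['inflation money supply']
```

Note that stage 2 sees the original query, not the hypothetical document — the reranker's job is precisely to correct for any drift the hypothesis introduced.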
HyDE's latency overhead from the generation call can be partially mitigated by using a smaller, faster model for hypothesis generation than for final answer generation. A 70B model used for final answer generation can be paired with a 7B model for HyDE hypothesis generation, since hypothesis quality requirements are lower than final answer quality — the hypothesis only needs to be plausible enough to retrieve the right documents, not necessarily factually correct. Using gpt-4o-mini for hypothesis generation and gpt-4o for final answer synthesis reduces the latency overhead of HyDE by 3–5x compared to using the same model for both steps.
HyDE's sensitivity to hypothesis quality creates an interesting calibration challenge. If the generative model produces hypothetical documents that are stylistically similar to the corpus but factually incorrect, the retrieved passages will be relevant to the topic but may not directly answer the query. Factual correctness of the hypothesis is less important than topical and stylistic similarity to the corpus — the embedding similarity step naturally selects for retrieved documents that match the topic of the hypothesis regardless of whether the hypothesis's specific claims are accurate. This means that even using a smaller, less accurate model for hypothesis generation retains most of HyDE's retrieval benefit.