A retrieval technique that generates multiple paraphrased versions of the user query using an LLM, retrieves documents for each, and merges results to improve recall.
You ask "How do I cancel my subscription?" and the most relevant document says "To terminate your account, visit Account Settings." Standard dense retrieval embeds your query as a single vector and looks for nearby documents — but "cancel subscription" and "terminate account" may not be close enough in embedding space to score highly.
This is the vocabulary mismatch problem: your phrasing differs from the document's phrasing, and a single query embedding can only cover one "view" of the semantic space.
Multi-query retrieval solves this by generating 3–5 paraphrased versions of the original query, retrieving documents for each, and merging the results. Each paraphrase covers a different corner of the semantic space, dramatically improving recall.
The key insight: documents that consistently appear across multiple differently-worded retrievals are very likely to be genuinely relevant — multi-query acts as a confidence boost.
```python
import anthropic
import numpy as np
from sentence_transformers import SentenceTransformer

client = anthropic.Anthropic()
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def generate_query_variants(query: str, n: int = 4) -> list[str]:
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=300,
        messages=[{"role": "user", "content": f'''Generate {n} different phrasings of the following question.
Each phrasing should approach the topic from a different angle.
Return ONLY the {n} questions, one per line, no numbering or explanation.

Original question: {query}'''}],
    )
    variants = [q.strip() for q in response.content[0].text.strip().split("\n") if q.strip()]
    return [query] + variants[:n]  # include original + n variants

def multi_query_retrieve(query: str, corpus: list[dict], k: int = 5) -> list[dict]:
    # Generate query variants
    variants = generate_query_variants(query, n=4)
    print(f"Query variants: {variants}")

    # Embed corpus once (offline in practice); BGE passages need no prefix
    texts = [d["text"] for d in corpus]
    corpus_embs = model.encode(texts, normalize_embeddings=True)

    # Retrieve for each variant, tracking where each document appears
    prefix = "Represent this sentence for searching relevant passages: "  # BGE query prefix
    doc_scores: dict[str, list[float]] = {}
    for variant in variants:
        q_emb = model.encode(prefix + variant, normalize_embeddings=True)
        scores = corpus_embs @ q_emb
        top_k_idx = np.argsort(scores)[::-1][:k]
        for rank, idx in enumerate(top_k_idx):
            doc_id = corpus[idx]["id"]
            doc_scores.setdefault(doc_id, []).append(1 / (60 + rank + 1))  # RRF contribution

    # Merge by summed RRF score
    merged = sorted(doc_scores.items(), key=lambda x: sum(x[1]), reverse=True)
    id_to_doc = {d["id"]: d for d in corpus}
    return [id_to_doc[doc_id] for doc_id, _ in merged[:k]]

# Test
corpus = [
    {"id": "1", "text": "To cancel your subscription, go to Account Settings > Billing > Cancel Plan."},
    {"id": "2", "text": "Terminate your account by visiting the subscription management portal."},
    {"id": "3", "text": "Free shipping applies to orders over fifty dollars."},
    {"id": "4", "text": "Your subscription renews automatically on the billing date unless cancelled."},
]
results = multi_query_retrieve("How do I stop my subscription?", corpus)
for r in results[:3]:
    print(r["text"][:60])
```
```python
import logging

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_anthropic import ChatAnthropic
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Set up logging to see the generated queries
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

# Build the base vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
docs = [
    Document(page_content="To cancel your subscription, go to Account Settings."),
    Document(page_content="Terminate your account via the subscription portal."),
    Document(page_content="Your plan auto-renews unless you cancel beforehand."),
    Document(page_content="Free shipping on all orders above $50."),
]
vectorstore = Chroma.from_documents(docs, embeddings)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Wrap with MultiQueryRetriever — uses the LLM to generate variants
llm = ChatAnthropic(model="claude-3-5-haiku-20241022")
multi_retriever = MultiQueryRetriever.from_llm(
    retriever=base_retriever,
    llm=llm,
    include_original=True,  # always include the original query
)

# Retrieve — generates ~3 variants automatically, merges results
results = multi_retriever.invoke("How do I stop paying for the service?")
for doc in results:
    print(doc.page_content)
# Logs show the generated queries; results include docs from all variants
```
The quality of multi-query retrieval depends on generating genuinely diverse variants. Default LLM paraphrasing works, but targeted prompting strategies produce better coverage:
```python
DIVERSITY_PROMPT = '''Generate 4 alternative search queries for the original question below.
Use these strategies:
1. Rephrase with synonyms (e.g. "cancel" -> "terminate", "subscription" -> "plan")
2. Ask from a different perspective (e.g. "how to..." -> "what are the steps to...")
3. Use more technical or formal language
4. Use simpler, everyday language

Original: {query}

Return only the 4 alternative queries, one per line.'''
```
You can also generate queries that target specific document types:
```python
TARGETED_PROMPT = '''Given the question, generate:
1. A version for finding step-by-step instructions
2. A version for finding policy documents
3. A version for finding FAQ answers
4. A version with technical terminology

Original: {query}
'''
```
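Whichever prompt you use, parse the response defensively: models sometimes number their output even when told not to. A small helper like this (hypothetical, not part of any library) strips bullets and numbering before the variants are used:

```python
import re

def parse_variants(llm_text: str, n: int) -> list[str]:
    # Strip optional leading numbering or bullets the LLM may add despite instructions
    lines = [re.sub(r"^\s*(?:\d+[.)]|-)\s*", "", ln).strip() for ln in llm_text.splitlines()]
    return [ln for ln in lines if ln][:n]
```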
| Approach | Recall | LLM calls | Latency |
|---|---|---|---|
| Single query (baseline) | ~70% | 0 | Fastest |
| 2 queries (original + 1) | ~82% | 1 | +200ms |
| 5 queries (original + 4) | ~91% | 1 | +300ms |
| 10 queries | ~95% | 1 | +500ms |
Note that all query variants can be generated in a single LLM call. The extra latency comes from running the retriever N times, not from N LLM calls. Use `asyncio.gather` to parallelise retrieval across variants:
```python
import asyncio

async def retrieve_variant(variant: str) -> list:
    # Your async retrieval function
    return await async_vector_search(variant, k=10)

async def multi_query_async(variants: list[str]) -> list:
    # Run all variant retrievals concurrently
    results = await asyncio.gather(*[retrieve_variant(v) for v in variants])
    # Merge results with RRF
    return merge_with_rrf(results)
```
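The `merge_with_rrf` helper above is left undefined; a minimal sketch, assuming each per-variant result is a ranked list of document IDs, could be:

```python
def merge_with_rrf(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score each doc by the sum of 1/(k + rank) across lists."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```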
Generated queries can be off-target. LLMs occasionally produce variants that drift from the original intent. Use a low temperature (0.0–0.3) for query generation and always include the original query in the variant set.
Deduplication is essential. Multiple variants often retrieve the same documents. Without deduplication, you'll pass the same chunk to the LLM multiple times, wasting context tokens. Use document IDs or content hashing to deduplicate.
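A minimal hash-based pass, assuming each chunk is a dict with a `text` field, might look like:

```python
import hashlib

def dedupe_chunks(chunks: list[dict]) -> list[dict]:
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        # Normalise whitespace so trivially different copies hash identically
        key = hashlib.sha256(" ".join(chunk["text"].split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique
```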
Combining with HyDE is powerful. Multi-query generates diverse question phrasings; HyDE generates document-style hypothetical answers. Combining both (generate 3 HyDE documents + 3 query variants = 6 retrieval passes) can push recall above 95% on most benchmarks — at the cost of 7 LLM calls per query.
Don't use multi-query for simple keyword lookups. If the user types an exact product code or error message, generating paraphrases introduces noise. Apply multi-query selectively for natural-language questions, not for exact-match lookups.
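A cheap regex gate can route exact-looking queries straight to keyword search; the patterns below are illustrative only and should be tuned to your domain:

```python
import re

def looks_like_exact_lookup(query: str) -> bool:
    # Product codes, error codes, stack-trace tokens: skip paraphrasing for these
    patterns = [
        r"\b[A-Z]{2,}-\d+\b",        # e.g. SKU-1234, ERR-500
        r"\b0x[0-9a-fA-F]+\b",       # hex error codes
        r"Exception\b|Traceback\b",  # pasted error messages
    ]
    return any(re.search(p, query) for p in patterns)
```

Queries that trip the gate go to exact-match search; everything else proceeds to multi-query expansion.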
| Strategy | Query Variants | Recall Gain | Cost Multiplier |
|---|---|---|---|
| Perspective expansion | 3–5 rephrasings | +20–35% | 3–5× |
| Hypothetical document | 1 HyDE variant | +15–25% | 2× |
| Sub-question decomposition | 2–4 sub-questions | +30–50% on multi-hop | 3–5× |
| Step-back abstraction | 1 generalised query | +10–20% | 2× |
In production, gate multi-query retrieval behind a complexity classifier. Simple factual queries ("What is the capital of France?") gain nothing from query expansion and waste embedding + retrieval budget. Classify queries by complexity using a fast Haiku call or a lightweight fine-tuned classifier, then apply multi-query only to complex, multi-hop, or ambiguous questions. This reduces multi-query overhead by 40–60% in typical production traffic while preserving recall gains where they matter.
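As a stand-in for the LLM classifier, a crude heuristic gate illustrates the routing idea (the thresholds and marker words here are invented for illustration, not tuned values):

```python
def needs_expansion(query: str) -> bool:
    """Heuristic stand-in for an LLM complexity classifier (illustrative thresholds)."""
    words = query.split()
    multi_hop_markers = ("and", "compare", "versus", "vs", "difference", "why", "how")
    # Short "what/who/when is" questions are treated as simple factual lookups
    is_short_factual = len(words) <= 6 and query.lower().startswith(("what is", "who is", "when"))
    has_marker = any(w.lower().strip("?,.") in multi_hop_markers for w in words)
    return not is_short_factual and (has_marker or len(words) > 8)
```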
Deduplication of retrieved chunks is critical: running 5 query variants typically returns 3–4× more chunks than a single query, with significant overlap. Use a cosine similarity threshold (0.92+) or exact chunk-ID deduplication before passing context to the synthesis LLM to avoid wasting context window on near-duplicate passages.
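A greedy similarity filter over L2-normalised chunk embeddings, sketched here with the 0.92 threshold suggested above:

```python
import numpy as np

def dedupe_by_similarity(embs: np.ndarray, threshold: float = 0.92) -> list[int]:
    """Keep chunks (in ranked order) whose cosine similarity to every
    already-kept chunk is below the threshold. Assumes rows are L2-normalised."""
    kept: list[int] = []
    for i in range(len(embs)):
        if all(float(embs[i] @ embs[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```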
For evaluation of multi-query retrieval systems, measure both recall improvement and answer faithfulness. Higher recall sometimes introduces noisy passages that confuse the synthesis model and reduce answer quality. Run end-to-end evaluation: compare final answer accuracy with and without query expansion on your domain-specific test set. The goal is not maximum recall but maximum answer quality — and sometimes a targeted single query outperforms an expanded set if your retrieval corpus is noisy or your topic requires precision over recall.
Implement a result diversity check after merging multi-query results. If the top-5 retrieved chunks all come from the same source document, your retrieval set is over-concentrated. Use maximal marginal relevance (MMR) re-ranking to enforce source diversity while maintaining relevance; LangChain supports it natively via `search_type="mmr"` (with a `lambda_mult` relevance/diversity trade-off), and LlamaIndex via its MMR query mode.
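MMR itself is a short greedy loop: each step selects the candidate maximising lambda * relevance minus (1 - lambda) * similarity-to-already-selected. A generic sketch over normalised embeddings (not either library's implementation):

```python
import numpy as np

def mmr(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 5, lam: float = 0.6) -> list[int]:
    """Maximal marginal relevance over L2-normalised embeddings.
    lam=1.0 is pure relevance; lower values trade relevance for diversity."""
    relevance = doc_embs @ query_emb
    selected: list[int] = []
    candidates = list(range(len(doc_embs)))
    while candidates and len(selected) < k:
        def mmr_score(i: int) -> float:
            # Penalise similarity to the closest already-selected document
            redundancy = max((float(doc_embs[i] @ doc_embs[j]) for j in selected), default=0.0)
            return lam * float(relevance[i]) - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a low lambda, a duplicate of an already-selected chunk loses to a less relevant but novel one.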