A retrieval technique that generates multiple paraphrased versions of the user query using an LLM, retrieves documents for each, and merges results to improve recall.
You ask "How do I cancel my subscription?" and the most relevant document says "To terminate your account, visit Account Settings." Standard dense retrieval embeds your query as a single vector and looks for nearby documents — but "cancel subscription" and "terminate account" may not be close enough in embedding space to score highly.
This is the vocabulary mismatch problem: your phrasing differs from the document's phrasing, and a single query embedding can only cover one "view" of the semantic space.
Multi-query retrieval solves this by generating 3–5 paraphrased versions of the original query, retrieving documents for each, and merging the results. Each paraphrase covers a different corner of the semantic space, dramatically improving recall.
The key insight: documents that consistently appear across multiple differently-worded retrievals are very likely to be genuinely relevant — multi-query acts as a confidence boost.
```python
import anthropic
import numpy as np
from sentence_transformers import SentenceTransformer

client = anthropic.Anthropic()
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def generate_query_variants(query: str, n: int = 4) -> list[str]:
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=300,
        messages=[{"role": "user", "content": f'''Generate {n} different phrasings of the following question.
Each phrasing should approach the topic from a different angle.
Return ONLY the {n} questions, one per line, no numbering or explanation.

Original question: {query}'''}],
    )
    variants = [q.strip() for q in response.content[0].text.strip().split("\n") if q.strip()]
    return [query] + variants[:n]  # include original + n variants

def multi_query_retrieve(query: str, corpus: list[dict], k: int = 5) -> list[dict]:
    # Generate query variants
    variants = generate_query_variants(query, n=4)
    print(f"Query variants: {variants}")

    # Embed corpus once (offline in practice); BGE passages need no prefix
    texts = [d["text"] for d in corpus]
    corpus_embs = model.encode(texts, normalize_embeddings=True)

    # Retrieve for each variant, tracking where each document appears
    prefix = "Represent this sentence for searching relevant passages: "  # BGE query prefix
    doc_scores: dict[str, list[float]] = {}
    for variant in variants:
        q_emb = model.encode(prefix + variant, normalize_embeddings=True)
        scores = corpus_embs @ q_emb
        top_k_idx = np.argsort(scores)[::-1][:k]
        for rank, idx in enumerate(top_k_idx):
            doc_id = corpus[idx]["id"]
            doc_scores.setdefault(doc_id, []).append(1 / (60 + rank + 1))  # RRF contribution

    # Merge by summed RRF score
    merged = sorted(doc_scores.items(), key=lambda x: sum(x[1]), reverse=True)
    id_to_doc = {d["id"]: d for d in corpus}
    return [id_to_doc[doc_id] for doc_id, _ in merged[:k]]

# Test
corpus = [
    {"id": "1", "text": "To cancel your subscription, go to Account Settings > Billing > Cancel Plan."},
    {"id": "2", "text": "Terminate your account by visiting the subscription management portal."},
    {"id": "3", "text": "Free shipping applies to orders over fifty dollars."},
    {"id": "4", "text": "Your subscription renews automatically on the billing date unless cancelled."},
]
results = multi_query_retrieve("How do I stop my subscription?", corpus)
for r in results[:3]:
    print(r["text"][:60])
```
```python
import logging

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_anthropic import ChatAnthropic
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Set up logging to see the generated queries
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

# Build the base vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
docs = [
    Document(page_content="To cancel your subscription, go to Account Settings."),
    Document(page_content="Terminate your account via the subscription portal."),
    Document(page_content="Your plan auto-renews unless you cancel beforehand."),
    Document(page_content="Free shipping on all orders above $50."),
]
vectorstore = Chroma.from_documents(docs, embeddings)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Wrap with MultiQueryRetriever — uses the LLM to generate variants
llm = ChatAnthropic(model="claude-3-5-haiku-20241022")
multi_retriever = MultiQueryRetriever.from_llm(
    retriever=base_retriever,
    llm=llm,
    include_original=True,  # always include the original query
)

# Retrieve — generates ~3 variants automatically, merges results
results = multi_retriever.invoke("How do I stop paying for the service?")
for doc in results:
    print(doc.page_content)
# Logs show the generated queries; results include docs from all variants
```
The quality of multi-query retrieval depends on generating genuinely diverse variants. Default LLM paraphrasing works, but targeted prompting strategies produce better coverage:
```python
DIVERSITY_PROMPT = '''Generate 4 alternative search queries for the original question below.
Use these strategies:
1. Rephrase with synonyms (e.g. "cancel" -> "terminate", "subscription" -> "plan")
2. Ask from a different perspective (e.g. "how to..." -> "what are the steps to...")
3. Use more technical or formal language
4. Use simpler, everyday language

Original: {query}

Return only the 4 alternative queries, one per line.'''
```
You can also generate queries that target specific document types:
```python
TARGETED_PROMPT = '''Given the question, generate:
1. A version for finding step-by-step instructions
2. A version for finding policy documents
3. A version for finding FAQ answers
4. A version with technical terminology

Original: {query}
'''
```
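Whichever prompt you use, parse the response defensively: models sometimes number their output even when told not to. A small helper like this (hypothetical, not part of any library) strips bullets and numbering before the variants are used:

```python
import re

def parse_variants(llm_text: str, n: int) -> list[str]:
    # Strip optional leading numbering or bullets the LLM may add despite instructions
    lines = [re.sub(r"^\s*(?:\d+[.)]|-)\s*", "", ln).strip() for ln in llm_text.splitlines()]
    return [ln for ln in lines if ln][:n]
```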
| Approach | Recall | LLM calls | Latency |
|---|---|---|---|
| Single query (baseline) | ~70% | 0 | Fastest |
| 2 queries (original + 1) | ~82% | 1 | +200ms |
| 5 queries (original + 4) | ~91% | 1 | +300ms |
| 10 queries | ~95% | 1 | +500ms |
Note that all query variants can be generated in a single LLM call. The extra latency comes from running the retriever N times, not from N LLM calls. Use `asyncio.gather` to parallelise retrieval across variants:
```python
import asyncio

async def retrieve_variant(variant: str) -> list:
    # Your async retrieval function
    return await async_vector_search(variant, k=10)

async def multi_query_async(variants: list[str]) -> list:
    # Run all variant retrievals concurrently
    results = await asyncio.gather(*[retrieve_variant(v) for v in variants])
    # Merge results with RRF
    return merge_with_rrf(results)
```
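The `merge_with_rrf` helper above is left undefined; a minimal sketch, assuming each per-variant result is a ranked list of document IDs, could be:

```python
def merge_with_rrf(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score each doc by the sum of 1/(k + rank) across lists."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```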
Generated queries can be off-target. LLMs occasionally produce variants that drift from the original intent. Use a low temperature (0.0–0.3) for query generation and always include the original query in the variant set.
Deduplication is essential. Multiple variants often retrieve the same documents. Without deduplication, you'll pass the same chunk to the LLM multiple times, wasting context tokens. Use document IDs or content hashing to deduplicate.
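A minimal hash-based pass, assuming each chunk is a dict with a `text` field, might look like:

```python
import hashlib

def dedupe_chunks(chunks: list[dict]) -> list[dict]:
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        # Normalise whitespace so trivially different copies hash identically
        key = hashlib.sha256(" ".join(chunk["text"].split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique
```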
Combining with HyDE is powerful. Multi-query generates diverse question phrasings; HyDE generates document-style hypothetical answers. Combining both (generate 3 HyDE documents + 3 query variants = 6 retrieval passes) can push recall above 95% on most benchmarks — at the cost of 7 LLM calls per query.
Don't use multi-query for simple keyword lookups. If the user types an exact product code or error message, generating paraphrases introduces noise. Apply multi-query selectively for natural-language questions, not for exact-match lookups.
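A cheap regex gate can route exact-looking queries straight to keyword search; the patterns below are illustrative only and should be tuned to your domain:

```python
import re

def looks_like_exact_lookup(query: str) -> bool:
    # Product codes, error codes, stack-trace tokens: skip paraphrasing for these
    patterns = [
        r"\b[A-Z]{2,}-\d+\b",        # e.g. SKU-1234, ERR-500
        r"\b0x[0-9a-fA-F]+\b",       # hex error codes
        r"Exception\b|Traceback\b",  # pasted error messages
    ]
    return any(re.search(p, query) for p in patterns)
```

Queries that trip the gate go to exact-match search; everything else proceeds to multi-query expansion.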
| Strategy | Query Variants | Recall Gain | Cost Multiplier |
|---|---|---|---|
| Perspective expansion | 3–5 rephrasings | +20–35% | 3–5× |
| Hypothetical document | 1 HyDE variant | +15–25% | 2× |
| Sub-question decomposition | 2–4 sub-questions | +30–50% on multi-hop | 3–5× |
| Step-back abstraction | 1 generalised query | +10–20% | 2× |
In production, gate multi-query retrieval behind a complexity classifier. Simple factual queries ("What is the capital of France?") gain nothing from query expansion and waste embedding + retrieval budget. Classify queries by complexity using a fast Haiku call or a lightweight fine-tuned classifier, then apply multi-query only to complex, multi-hop, or ambiguous questions. This reduces multi-query overhead by 40–60% in typical production traffic while preserving recall gains where they matter.
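As a stand-in for the LLM classifier, a crude heuristic gate illustrates the routing idea (the thresholds and marker words here are invented for illustration, not tuned values):

```python
def needs_expansion(query: str) -> bool:
    """Heuristic stand-in for an LLM complexity classifier (illustrative thresholds)."""
    words = query.split()
    multi_hop_markers = ("and", "compare", "versus", "vs", "difference", "why", "how")
    # Short "what/who/when is" questions are treated as simple factual lookups
    is_short_factual = len(words) <= 6 and query.lower().startswith(("what is", "who is", "when"))
    has_marker = any(w.lower().strip("?,.") in multi_hop_markers for w in words)
    return not is_short_factual and (has_marker or len(words) > 8)
```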
Deduplication of retrieved chunks is critical: running 5 query variants typically returns 3–4× more chunks than a single query, with significant overlap. Use a cosine similarity threshold (0.92+) or exact chunk-ID deduplication before passing context to the synthesis LLM to avoid wasting context window on near-duplicate passages.
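A greedy similarity filter over L2-normalised chunk embeddings, sketched here with the 0.92 threshold suggested above:

```python
import numpy as np

def dedupe_by_similarity(embs: np.ndarray, threshold: float = 0.92) -> list[int]:
    """Keep chunks (in ranked order) whose cosine similarity to every
    already-kept chunk is below the threshold. Assumes rows are L2-normalised."""
    kept: list[int] = []
    for i in range(len(embs)):
        if all(float(embs[i] @ embs[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```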
For evaluation of multi-query retrieval systems, measure both recall improvement and answer faithfulness. Higher recall sometimes introduces noisy passages that confuse the synthesis model and reduce answer quality. Run end-to-end evaluation: compare final answer accuracy with and without query expansion on your domain-specific test set. The goal is not maximum recall but maximum answer quality — and sometimes a targeted single query outperforms an expanded set if your retrieval corpus is noisy or your topic requires precision over recall.
Implement a result diversity check after merging multi-query results. If the top-5 retrieved chunks all come from the same source document, your retrieval set is over-concentrated. Use maximal marginal relevance (MMR) re-ranking to enforce source diversity while maintaining relevance; LangChain supports it natively via `search_type="mmr"` (with a `lambda_mult` relevance/diversity trade-off), and LlamaIndex via its MMR query mode.
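MMR itself is a short greedy loop: each step selects the candidate maximising lambda * relevance minus (1 - lambda) * similarity-to-already-selected. A generic sketch over normalised embeddings (not either library's implementation):

```python
import numpy as np

def mmr(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 5, lam: float = 0.6) -> list[int]:
    """Maximal marginal relevance over L2-normalised embeddings.
    lam=1.0 is pure relevance; lower values trade relevance for diversity."""
    relevance = doc_embs @ query_emb
    selected: list[int] = []
    candidates = list(range(len(doc_embs)))
    while candidates and len(selected) < k:
        def mmr_score(i: int) -> float:
            # Penalise similarity to the closest already-selected document
            redundancy = max((float(doc_embs[i] @ doc_embs[j]) for j in selected), default=0.0)
            return lam * float(relevance[i]) - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a low lambda, a duplicate of an already-selected chunk loses to a less relevant but novel one.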