Combining sparse (BM25) and dense (vector) retrieval to get the precision of keyword matching with the recall of semantic search.
BM25 finds "Python asyncio bug CVE-2024-1234" precisely but misses "async Python issue related to event loops". Dense retrieval finds "event loop problems in Python async code" but misses the exact CVE number. Each system has a blind spot that the other covers.
Hybrid search runs both retrievers independently, then merges their ranked result lists. Empirically, hybrid search improves recall by 15–30% over either method alone, with no quality degradation — making it the default recommendation for production RAG systems.
Reciprocal Rank Fusion (RRF) — rank-based, no score normalisation needed:
RRF_score(doc) = Σ_i 1 / (k + rank_i(doc))    (summed over retrievers i)
where k=60 is a smoothing constant that prevents top-ranked documents from dominating. Simple, robust, works even when BM25 and dense scores are on different scales.
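A quick worked example with hypothetical document IDs makes the formula concrete:

```python
k = 60
bm25_rank = {"a": 1, "b": 2, "c": 3}   # hypothetical BM25 ranking
dense_rank = {"c": 1, "a": 2}          # hypothetical dense ranking ("b" not retrieved)

# Each retriever contributes 1/(k + rank) for every document it returns
scores = {}
for ranking in (bm25_rank, dense_rank):
    for doc, rank in ranking.items():
        scores[doc] = scores.get(doc, 0.0) + 1 / (k + rank)

fused = sorted(scores, key=scores.get, reverse=True)
print(fused)  # → ['a', 'c', 'b']
```

"a" (ranks 1 and 2) narrowly beats "c" (ranks 3 and 1), while "b", found by only one retriever, drops to last.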
Normalised linear combination — score-based, requires compatible scales:
hybrid_score = α × normalised_dense + (1-α) × normalised_sparse
α=0.5 balances equally. Tune α based on your query distribution (higher α for semantic queries, lower for keyword queries).
```python
# Requires: pip install rank_bm25 sentence-transformers nltk
# plus a one-time nltk.download("punkt") for word_tokenize
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np
from nltk.tokenize import word_tokenize

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """
    rankings: list of ranked document ID lists (one per retriever)
    Returns: merged list of document IDs by RRF score (best first)
    """
    scores: dict[str, float] = {}
    for ranked_list in rankings:
        for rank, doc_id in enumerate(ranked_list, 1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_retrieve(query: str, docs: list[dict], k: int = 5) -> list[dict]:
    texts = [d["text"] for d in docs]

    # BM25 ranking
    tokenised = [word_tokenize(t.lower()) for t in texts]
    bm25 = BM25Okapi(tokenised)
    bm25_scores = bm25.get_scores(word_tokenize(query.lower()))
    bm25_ranking = [docs[i]["id"] for i in np.argsort(bm25_scores)[::-1]]

    # Dense ranking (in production, load the model once at module level,
    # not on every call)
    model = SentenceTransformer("BAAI/bge-large-en-v1.5")
    q_emb = model.encode(
        "Represent this sentence for searching relevant passages: " + query,
        normalize_embeddings=True,
    )
    doc_embs = model.encode(texts, normalize_embeddings=True)
    dense_scores = doc_embs @ q_emb
    dense_ranking = [docs[i]["id"] for i in np.argsort(dense_scores)[::-1]]

    # RRF fusion
    fused_ids = reciprocal_rank_fusion([bm25_ranking, dense_ranking], k=60)[:k]
    id_to_doc = {d["id"]: d for d in docs}
    return [id_to_doc[doc_id] for doc_id in fused_ids if doc_id in id_to_doc]
```
```python
import numpy as np

def min_max_normalise(scores: np.ndarray) -> np.ndarray:
    min_s, max_s = scores.min(), scores.max()
    if max_s == min_s:
        return np.zeros_like(scores)
    return (scores - min_s) / (max_s - min_s)

def weighted_hybrid_search(query, docs, alpha=0.5):
    """alpha: weight for dense (0 = BM25 only, 1 = dense only)"""
    # ... compute bm25_scores and dense_scores as above ...
    bm25_norm = min_max_normalise(bm25_scores)
    dense_norm = min_max_normalise(dense_scores)
    hybrid_scores = alpha * dense_norm + (1 - alpha) * bm25_norm
    top_k = np.argsort(hybrid_scores)[::-1][:5]
    return [docs[i] for i in top_k]
```
Min-max normalisation maps both score distributions to [0,1] before combining. The risk: a single outlier can compress all other scores. Robust alternatives use percentile normalisation or z-score.
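A sketch of the z-score alternative, which is less vulnerable to a single outlier because the mean and standard deviation shift only proportionally:

```python
import numpy as np

def z_score_normalise(scores: np.ndarray) -> np.ndarray:
    """Centre scores on their mean and scale by standard deviation."""
    std = scores.std()
    if std == 0:
        return np.zeros_like(scores, dtype=float)
    return (scores - scores.mean()) / std

# With the 100.0 outlier, min-max would squeeze the first three scores
# into [0, 0.02]; z-scores keep them clearly distinguishable.
raw = np.array([1.0, 2.0, 3.0, 100.0])
z = z_score_normalise(raw)
print(z.round(2))
```

Note that z-scores are not bounded to [0, 1], so if you combine them with α-weighting, both retrievers' scores must go through the same normaliser.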
```python
from qdrant_client import QdrantClient
from qdrant_client.models import SparseVector, Prefetch, FusionQuery, Fusion

qdrant = QdrantClient("localhost", port=6333)

# Assumes a collection with named "dense" and "sparse" vector spaces, and that
# dense_query_vector, sparse_query_indices, and sparse_query_values have
# already been computed for the query.
results = qdrant.query_points(
    collection_name="hybrid_docs",
    prefetch=[
        Prefetch(query=dense_query_vector, using="dense", limit=20),
        Prefetch(
            query=SparseVector(
                indices=sparse_query_indices,
                values=sparse_query_values,
            ),
            using="sparse",
            limit=20,
        ),
    ],
    query=FusionQuery(fusion=Fusion.RRF),
    limit=5,
    with_payload=True,
)
```
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.schema import Document

docs = [
    Document(page_content="30-day return policy with receipt."),
    Document(page_content="Free shipping over $50."),
    Document(page_content="Support hours: Mon-Fri 9-5."),
]

# Build BM25 retriever
bm25_retriever = BM25Retriever.from_documents(docs, k=3)

# Build dense retriever
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(docs, embeddings)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Combine with EnsembleRetriever (uses weighted RRF internally)
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.5, 0.5],  # equal weight
)

results = hybrid_retriever.invoke("return policy refund")
for r in results:
    print(r.page_content)
```
The RRF k parameter is less sensitive than its prominence suggests. The standard k=60 works well in most cases; if you fuse very long retrieval lists (top-1000), increasing k to 120 gives smoother rank blending. Don't overthink it — k=60 is a sensible default.
BM25 needs the same tokenisation at index and query time. If you index with stemming but query without it, recall drops. Keep tokenisation logic in one place and use it for both.
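A minimal sketch of the "one tokeniser, both sides" rule, using a crude hypothetical lowercase-and-strip-plurals stemmer:

```python
import re

def tokenise(text: str) -> list[str]:
    """Single tokenisation function shared by indexing and querying.

    Hypothetical pipeline: lowercase, split on non-alphanumerics,
    strip a trailing plural "s" as a crude stem.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

# The SAME function tokenises documents at index time and queries at search time
corpus_tokens = [tokenise(doc) for doc in ["Returns accepted within 30 days."]]
query_tokens = tokenise("return days")
print(query_tokens)  # → ['return', 'day']
```

Because both sides stem "Returns"/"return" and "days"/"day" identically, the query terms hit the indexed terms; tokenising the query differently would silently lose those matches.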
Deduplication after fusion. Both retrievers may return the same document. Deduplicate before returning to the LLM — identical passages waste context window space.
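A minimal sketch, assuming each fused result carries a stable `id` field (deduplicating by a content hash instead would also catch identical passages stored under different IDs):

```python
def deduplicate(docs: list[dict]) -> list[dict]:
    """Drop repeated documents after fusion, keeping the first (best-ranked) copy."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        if doc["id"] not in seen:
            seen.add(doc["id"])
            unique.append(doc)
    return unique

fused = [{"id": "a"}, {"id": "b"}, {"id": "a"}]
print(deduplicate(fused))  # keeps the first "a", drops the duplicate
```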
Tune alpha on held-out data. The optimal alpha varies by domain. Evaluate on a labelled set: keyword-heavy domains (legal, medical) often want higher BM25 weight; general semantic queries benefit from more dense weight.
Choosing between retrieval strategies depends on query characteristics, data distribution, and latency requirements. Sparse methods excel at exact keyword matching and work well for queries containing rare or domain-specific terms. Dense methods capture semantic similarity and handle synonym resolution, paraphrase, and concept-level queries. Hybrid search combines both, generally dominating on heterogeneous query distributions at the cost of running two retrieval paths and a fusion step.
| Strategy | Strengths | Weaknesses | Best for |
|---|---|---|---|
| BM25 (sparse) | Fast, no GPU required, exact term match | No semantic understanding | Keyword-heavy, precise queries |
| Dense (bi-encoder) | Semantic similarity, handles paraphrase | GPU required, misses rare terms | Conversational, conceptual queries |
| Hybrid (RRF) | Best of both, robust across query types | Higher latency, two indexes | Mixed production workloads |
| Reranking + hybrid | Highest relevance quality | Latency overhead of cross-encoder | High-stakes retrieval |
Benchmarks on BEIR consistently show hybrid search outperforming either BM25 or dense retrieval alone across the majority of datasets. The RRF fusion approach requires no score calibration and is robust to the score distribution differences between sparse and dense systems, making it the default choice for new hybrid deployments.
Weight tuning for hybrid search fusion is an important calibration step that is often skipped in initial deployments. The default equal weighting of BM25 and dense scores after RRF normalization is a reasonable starting point but rarely optimal for a specific domain. Offline evaluation using a labeled query-document relevance dataset — computing nDCG@10 across a sweep of alpha values from 0.0 (pure BM25) to 1.0 (pure dense) — identifies the weight that maximizes retrieval quality for the target distribution. Many teams find that domain-specific corpora with precise terminology benefit from alpha values in the 0.3–0.4 range (more BM25 weight), while conversational or semantically diverse corpora benefit from higher alpha values (more dense weight).
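A sketch of such a sweep, assuming per-query normalised score arrays and graded relevance labels are already available:

```python
import numpy as np

def ndcg_at_k(relevances: np.ndarray, k: int = 10) -> float:
    """nDCG@k for one query, given graded relevance in ranked order."""
    rel = relevances[:k]
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    dcg = float((rel * discounts).sum())
    ideal = np.sort(relevances)[::-1][:k]
    idcg = float((ideal * discounts[: len(ideal)]).sum())
    return dcg / idcg if idcg > 0 else 0.0

def sweep_alpha(dense, sparse, relevance, alphas=np.linspace(0, 1, 11)):
    """Return the alpha maximising mean nDCG@10 over labelled queries.

    dense/sparse: per-query normalised score arrays (aligned by document);
    relevance: per-query graded relevance labels for the same documents.
    """
    best_alpha, best_ndcg = 0.0, -1.0
    for alpha in alphas:
        total = 0.0
        for d, s, rel in zip(dense, sparse, relevance):
            order = np.argsort(alpha * d + (1 - alpha) * s)[::-1]
            total += ndcg_at_k(rel[order])
        mean = total / len(relevance)
        if mean > best_ndcg:
            best_alpha, best_ndcg = alpha, mean
    return best_alpha, best_ndcg

# Toy labelled set: dense scores align with relevance, sparse are inverted
dense = [np.array([0.9, 0.5, 0.1])]
sparse = [np.array([0.1, 0.5, 0.9])]
relevance = [np.array([3.0, 1.0, 0.0])]
best_alpha, best_ndcg = sweep_alpha(dense, sparse, relevance)
print(best_alpha, best_ndcg)
```

On real data you would average over hundreds of labelled queries; the toy set simply shows the sweep mechanics.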
Index synchronization between the sparse and dense indexes is a common operational challenge in production hybrid search systems. When documents are added or deleted, both the inverted index (for BM25) and the vector index (for dense retrieval) must be updated consistently. Partial updates — where a document appears in one index but not the other — produce retrieval results where documents can receive scores from only one component, distorting the fusion output. Atomic update pipelines that write to both indexes within the same transaction boundary, or eventual-consistency monitoring that detects and repairs index divergence, are necessary for maintaining reliable hybrid search in systems with continuous document ingestion.
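A divergence check can be as simple as comparing the ID sets of the two indexes (a sketch, assuming both indexes can enumerate their document IDs):

```python
def index_divergence(sparse_ids: set[str], dense_ids: set[str]) -> dict:
    """Report documents present in one index but missing from the other."""
    return {
        "missing_from_dense": sparse_ids - dense_ids,
        "missing_from_sparse": dense_ids - sparse_ids,
    }

# "a" was indexed for BM25 but never embedded; "d" was embedded but
# never added to the inverted index
report = index_divergence({"a", "b", "c"}, {"b", "c", "d"})
print(report)
```

Run as a periodic job, a non-empty report triggers re-indexing of the missing documents.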
Hybrid search query preprocessing should handle the different requirements of sparse and dense components separately. BM25 benefits from stemming, stop-word removal, and query expansion for rare terms. Dense retrieval benefits from query reformulation that expands abbreviations and adds semantic context. Running separate preprocessing pipelines for each component — rather than feeding the same cleaned query to both — typically improves retrieval quality compared to a unified preprocessing approach that must balance the conflicting requirements of sparse and dense matching.
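A sketch of split pipelines, with a hypothetical abbreviation table and stop-word list:

```python
import re

ABBREVIATIONS = {"k8s": "kubernetes", "db": "database"}  # hypothetical expansion table
STOP_WORDS = {"the", "a", "is", "in", "my"}

def preprocess_sparse(query: str) -> list[str]:
    """BM25 side: lowercase, tokenise, strip stop words."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def preprocess_dense(query: str) -> str:
    """Dense side: expand abbreviations, keep natural phrasing for the encoder."""
    words = [ABBREVIATIONS.get(w.lower(), w) for w in query.split()]
    return " ".join(words)

query = "the k8s db is failing"
print(preprocess_sparse(query))  # keyword tokens for the inverted index
print(preprocess_dense(query))   # expanded sentence for the embedding model
```

The sparse pipeline keeps "k8s" verbatim (rare terms are where BM25 shines), while the dense pipeline expands it so the encoder sees the concept it was trained on.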
Hybrid search latency profiling commonly reveals that the BM25 component is faster than the dense retrieval component by a factor of 5–10x, making parallel execution the standard deployment pattern. Issuing both retrieval requests concurrently and fusing the results when both complete reduces end-to-end retrieval latency to approximately the latency of the slower (dense) component rather than their sum. Asynchronous implementations using asyncio or thread pools for the parallel retrievals add minimal coordination overhead and should be the default approach for any production hybrid search service where retrieval latency is a user-facing concern.
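A minimal asyncio sketch with stand-in retrievers (the sleeps are placeholders for real index calls):

```python
import asyncio
import time

async def bm25_search(query: str) -> list[str]:
    await asyncio.sleep(0.01)   # stand-in for a fast inverted-index lookup
    return ["doc_a", "doc_b"]

async def dense_search(query: str) -> list[str]:
    await asyncio.sleep(0.05)   # stand-in for slower embedding + ANN search
    return ["doc_b", "doc_c"]

async def hybrid(query: str) -> list[list[str]]:
    # Issue both retrievals concurrently; end-to-end latency is roughly
    # the slower component, not the sum of both
    return await asyncio.gather(bm25_search(query), dense_search(query))

start = time.perf_counter()
rankings = asyncio.run(hybrid("return policy"))
elapsed = time.perf_counter() - start
print(rankings, round(elapsed, 3))
```

The two ranked lists returned here are exactly the input shape expected by an RRF fusion step.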
Sparse index maintenance for hybrid search systems requires keeping BM25 term statistics (document frequency, inverse document frequency) up to date as the corpus grows. Most production BM25 implementations recompute IDF statistics periodically — daily or weekly — rather than updating them on every document insertion, accepting slight staleness in favor of operational simplicity. The impact of stale IDF statistics on retrieval quality is typically minor unless the corpus composition changes dramatically, such as adding a large new document category that introduces many new high-frequency terms.