Retrieval

Hybrid Search

Combining sparse (BM25) and dense (vector) retrieval to get the precision of keyword matching with the recall of semantic search.

Best of both: sparse + dense
RRF fusion: the standard method
+15–30% recall improvement

SECTION 01

Why neither BM25 nor dense alone is enough

BM25 finds "Python asyncio bug CVE-2024-1234" precisely but misses "async Python issue related to event loops". Dense retrieval finds "event loop problems in Python async code" but misses the exact CVE number. Each system has a blind spot that the other covers.

Hybrid search runs both retrievers independently, then merges their ranked result lists. Empirically, hybrid search improves recall by 15–30% over either method alone, typically without degrading precision, making it the default recommendation for production RAG systems.

SECTION 02

Fusion strategies

Reciprocal Rank Fusion (RRF) — rank-based, no score normalisation needed:

RRF_score(doc) = Σ_i 1 / (k + rank_i(doc))

where the sum runs over the retrievers, rank_i(doc) is the document's rank in retriever i's list, and k=60 is a smoothing constant that prevents top-ranked documents from dominating. Simple, robust, and it works even when BM25 and dense scores are on different scales.
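Worked through on a toy example (the document IDs are made up), the formula rewards documents that appear in both rankings:

```python
# RRF over two hypothetical rankings, k = 60
k = 60
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_c", "doc_d"]

scores: dict[str, float] = {}
for ranking in (bm25_ranking, dense_ranking):
    for rank, doc_id in enumerate(ranking, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)

# doc_b (ranks 2 and 1) and doc_c (ranks 3 and 2) both beat doc_a,
# which is ranked first by BM25 but absent from the dense list.
merged = sorted(scores, key=scores.get, reverse=True)
print(merged)   # ['doc_b', 'doc_c', 'doc_a', 'doc_d']
```

Appearing in both lists at modest ranks outweighs a single first-place finish, which is exactly the consensus behaviour fusion is meant to capture.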

Normalised linear combination — score-based, requires compatible scales:

hybrid_score = α × normalised_dense + (1-α) × normalised_sparse

α=0.5 balances equally. Tune α based on your query distribution (higher α for semantic queries, lower for keyword queries).

SECTION 03

Reciprocal Rank Fusion implementation

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np
from nltk.tokenize import word_tokenize

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    '''
    rankings: list of ranked document ID lists (one per retriever)
    Returns: merged list of document IDs by RRF score (best first)
    '''
    scores: dict[str, float] = {}
    for ranked_list in rankings:
        for rank, doc_id in enumerate(ranked_list, 1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_retrieve(query: str, docs: list[dict], k: int = 5) -> list[dict]:
    texts = [d["text"] for d in docs]

    # BM25 ranking (word_tokenize needs the NLTK "punkt" data;
    # run nltk.download("punkt") once beforehand)
    tokenised = [word_tokenize(t.lower()) for t in texts]
    bm25 = BM25Okapi(tokenised)
    bm25_scores = bm25.get_scores(word_tokenize(query.lower()))
    bm25_ranking = [docs[i]["id"] for i in np.argsort(bm25_scores)[::-1]]

    # Dense ranking (in production, load the model once at startup,
    # not on every call)
    model = SentenceTransformer("BAAI/bge-large-en-v1.5")
    q_emb = model.encode(
        "Represent this sentence for searching relevant passages: " + query,
        normalize_embeddings=True
    )
    doc_embs = model.encode(texts, normalize_embeddings=True)
    dense_scores = doc_embs @ q_emb
    dense_ranking = [docs[i]["id"] for i in np.argsort(dense_scores)[::-1]]

    # RRF fusion
    fused_ids = reciprocal_rank_fusion([bm25_ranking, dense_ranking], k=60)[:k]
    id_to_doc = {d["id"]: d for d in docs}
    return [id_to_doc[doc_id] for doc_id in fused_ids if doc_id in id_to_doc]

SECTION 04

Score normalisation fusion

import numpy as np

def min_max_normalise(scores: np.ndarray) -> np.ndarray:
    min_s, max_s = scores.min(), scores.max()
    if max_s == min_s:
        return np.zeros_like(scores)
    return (scores - min_s) / (max_s - min_s)

def weighted_hybrid_search(query, docs, alpha=0.5):
    '''alpha: weight for dense (0=BM25 only, 1=dense only)'''
    # ... compute bm25_scores and dense_scores as above ...
    bm25_norm   = min_max_normalise(bm25_scores)
    dense_norm  = min_max_normalise(dense_scores)
    hybrid_scores = alpha * dense_norm + (1 - alpha) * bm25_norm
    top_k = np.argsort(hybrid_scores)[::-1][:5]
    return [docs[i] for i in top_k]

Min-max normalisation maps both score distributions to [0,1] before combining. The risk: a single outlier can compress all other scores. Robust alternatives use percentile normalisation or z-score.
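A minimal sketch of the percentile (rank-based) alternative; the function name is illustrative:

```python
import numpy as np

def rank_normalise(scores: np.ndarray) -> np.ndarray:
    """Map scores to evenly spaced percentile ranks in [0, 1].
    An outlier cannot compress the rest: only order matters.
    (Ties receive arbitrary but adjacent ranks.)"""
    if len(scores) == 1:
        return np.ones(1)
    ranks = scores.argsort().argsort()      # rank of each score
    return ranks / (len(scores) - 1)

# A huge outlier (100.0) would flatten min-max output, but ranks survive:
raw = np.array([0.2, 0.3, 0.25, 100.0])
rank_normalise(raw)   # approximately [0.0, 0.667, 0.333, 1.0]
```

The trade-off is that rank normalisation discards score magnitude entirely, so a confident dense match and a marginal one at the same rank look identical.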

SECTION 05

Hybrid search with Qdrant

from qdrant_client import QdrantClient
from qdrant_client.models import (
    SparseVector, Prefetch, FusionQuery, Fusion
)

qdrant = QdrantClient("localhost", port=6333)

# Assumes a collection with named "dense" and "sparse" vector spaces;
# dense_query_vector, sparse_query_indices and sparse_query_values
# are computed upstream by your dense and sparse encoders
results = qdrant.query_points(
    collection_name="hybrid_docs",
    prefetch=[
        Prefetch(query=dense_query_vector, using="dense", limit=20),
        Prefetch(query=SparseVector(
            indices=sparse_query_indices,
            values=sparse_query_values
        ), using="sparse", limit=20),
    ],
    query=FusionQuery(fusion=Fusion.RRF),
    limit=5,
    with_payload=True
)

SECTION 06

Hybrid search with LangChain

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

docs = [
    Document(page_content="30-day return policy with receipt."),
    Document(page_content="Free shipping over $50."),
    Document(page_content="Support hours: Mon-Fri 9-5."),
]

# Build BM25 retriever
bm25_retriever = BM25Retriever.from_documents(docs, k=3)

# Build dense retriever
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(docs, embeddings)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Combine with EnsembleRetriever (uses RRF internally)
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.5, 0.5]   # equal weight
)

results = hybrid_retriever.invoke("return policy refund")
for r in results:
    print(r.page_content)

SECTION 07

Gotchas

RRF's k parameter rarely needs tuning. The standard k=60 works well in most cases; with very long retrieval lists (top-1000), raising k towards 120 gives smoother rank blending. Don't overthink it: k=60 is a sensible default.
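The effect of k is easy to see from the gap between adjacent rank contributions:

```python
# How much more a rank-1 result contributes than a rank-2 result,
# for several values of the RRF smoothing constant k.
gaps = {}
for k in (10, 60, 120):
    gaps[k] = 1 / (k + 1) - 1 / (k + 2)
    print(f"k={k}: rank-1 vs rank-2 contribution gap = {gaps[k]:.5f}")
```

Larger k shrinks the gap, so differences between adjacent ranks matter less and the blend across retrievers becomes smoother.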

BM25 needs the same tokenisation at index and query time. If you index with stemming but query without it, recall drops. Keep tokenisation logic in one place and use it for both.
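A minimal sketch of the single-tokeniser pattern (the regex tokeniser is illustrative; real pipelines may add stemming or stop-word removal, applied identically on both sides):

```python
import re

def tokenise(text: str) -> list[str]:
    """The ONE tokeniser used at both index and query time:
    lowercase + alphanumeric word split. Any change here applies
    to indexing and querying simultaneously."""
    return re.findall(r"[a-z0-9]+", text.lower())

# index time:  tokenised_corpus = [tokenise(t) for t in texts]
# query time:  bm25.get_scores(tokenise(query))
```

Keeping this in a shared module (rather than copy-pasting the logic) is what actually prevents index/query drift.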

Deduplication after fusion. Both retrievers may return the same document. Deduplicate before returning to the LLM — identical passages waste context window space.
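A sketch of order-preserving deduplication by normalised text (assuming docs carry "id" and "text" keys, as in the earlier examples):

```python
def deduplicate(docs: list[dict]) -> list[dict]:
    """Drop passages with identical normalised text, keeping the first
    (best-ranked) occurrence. RRF already merges by doc ID; this catches
    the same passage stored under different IDs, e.g. overlapping chunks."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        key = " ".join(doc["text"].split()).lower()   # collapse whitespace
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

fused = [
    {"id": 1, "text": "30-day return policy."},
    {"id": 2, "text": "30-day  return policy."},   # same passage, new ID
    {"id": 3, "text": "Free shipping over $50."},
]
unique = deduplicate(fused)
```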

Tune alpha on held-out data. The optimal alpha varies by domain. Evaluate on a labelled set: keyword-heavy domains (legal, medical) often want higher BM25 weight; general semantic queries benefit from more dense weight.

Retrieval strategy comparison

Choosing between retrieval strategies depends on query characteristics, data distribution, and latency requirements. Sparse methods excel at exact keyword matching and work well for queries containing rare or domain-specific terms. Dense methods capture semantic similarity and handle synonym resolution, paraphrase, and concept-level queries. Hybrid search combines both, generally dominating on heterogeneous query distributions at the cost of running two retrieval paths and a fusion step.

Strategy           | Strengths                                | Weaknesses                         | Best for
BM25 (sparse)      | Fast, no GPU required, exact term match  | No semantic understanding          | Keyword-heavy, precise queries
Dense (bi-encoder) | Semantic similarity, handles paraphrase  | GPU required, misses rare terms    | Conversational, conceptual queries
Hybrid (RRF)       | Best of both, robust across query types  | Higher latency, two indexes        | Mixed production workloads
Reranking + hybrid | Highest relevance quality                | Latency overhead of cross-encoder  | High-stakes retrieval

Benchmarks on BEIR consistently show hybrid search outperforming either BM25 or dense retrieval alone across the majority of datasets. The RRF fusion approach requires no score calibration and is robust to the score distribution differences between sparse and dense systems, making it the default choice for new hybrid deployments.

Weight tuning for hybrid search fusion is an important calibration step that is often skipped in initial deployments. Equal weighting of BM25 and dense scores is a reasonable starting point but rarely optimal for a specific domain. Offline evaluation using a labelled query-document relevance dataset, computing nDCG@10 across a sweep of alpha values from 0.0 (pure sparse) to 1.0 (pure dense), identifies the weight that maximises retrieval quality for the target distribution. Many teams find that domain-specific corpora with precise terminology benefit from alpha values in the 0.3–0.4 range (more BM25 weight), while conversational or semantically diverse corpora benefit from higher alpha values.
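A self-contained sketch of such a sweep on synthetic scores and binary relevance labels (the helper names are illustrative; alpha is the dense weight, matching the earlier formula):

```python
import numpy as np

def ndcg_at_k(ranked_rels: list[int], k: int = 10) -> float:
    """nDCG@k for relevance labels listed in ranked order."""
    rels = np.asarray(ranked_rels[:k], dtype=float)
    discounts = 1 / np.log2(np.arange(2, len(rels) + 2))
    dcg = float((rels * discounts).sum())
    ideal = np.sort(np.asarray(ranked_rels, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts[:len(ideal)]).sum())
    return dcg / idcg if idcg > 0 else 0.0

def sweep_alpha(dense_scores, sparse_scores, rels,
                alphas=np.linspace(0.0, 1.0, 11)):
    """Return the alpha (dense weight) maximising nDCG@10 for one query.
    dense_scores/sparse_scores: normalised per-document score arrays;
    rels: relevance label per document."""
    best_alpha, best_ndcg = 0.0, -1.0
    for alpha in alphas:
        hybrid = alpha * dense_scores + (1 - alpha) * sparse_scores
        order = np.argsort(hybrid)[::-1]
        score = ndcg_at_k([rels[i] for i in order])
        if score > best_ndcg:
            best_alpha, best_ndcg = float(alpha), score
    return best_alpha, best_ndcg
```

In practice you would average nDCG over all labelled queries at each alpha before picking the winner, rather than tuning per query.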

Index synchronization between the sparse and dense indexes is a common operational challenge in production hybrid search systems. When documents are added or deleted, both the inverted index (for BM25) and the vector index (for dense retrieval) must be updated consistently. Partial updates — where a document appears in one index but not the other — produce retrieval results where documents can receive scores from only one component, distorting the fusion output. Atomic update pipelines that write to both indexes within the same transaction boundary, or eventual-consistency monitoring that detects and repairs index divergence, are necessary for maintaining reliable hybrid search in systems with continuous document ingestion.
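One way to sketch the compensating-write pattern (the index clients and their upsert/delete methods are hypothetical stand-ins for your real BM25 store and vector store):

```python
class DualIndexWriter:
    """Best-effort atomic upsert across a sparse and a dense index.
    `sparse` and `dense` are assumed to expose upsert()/delete() keyed
    by document ID."""

    def __init__(self, sparse, dense):
        self.sparse = sparse
        self.dense = dense

    def upsert(self, doc_id: str, text: str, embedding: list[float]) -> None:
        self.sparse.upsert(doc_id, text)
        try:
            self.dense.upsert(doc_id, embedding)
        except Exception:
            # Compensate: undo the sparse write so the document never
            # exists in only one index.
            self.sparse.delete(doc_id)
            raise
```

This gives consistency without distributed transactions; a periodic reconciliation job comparing doc-ID sets across both indexes catches any divergence the compensation path misses.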

Hybrid search query preprocessing should handle the different requirements of sparse and dense components separately. BM25 benefits from stemming, stop-word removal, and query expansion for rare terms. Dense retrieval benefits from query reformulation that expands abbreviations and adds semantic context. Running separate preprocessing pipelines for each component — rather than feeding the same cleaned query to both — typically improves retrieval quality compared to a unified preprocessing approach that must balance the conflicting requirements of sparse and dense matching.
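A sketch of the split-pipeline idea; the stop-word list and abbreviation map are toy examples:

```python
import re

ABBREVIATIONS = {"k8s": "kubernetes", "db": "database"}   # illustrative
STOP_WORDS = {"the", "a", "an", "is", "of", "in"}          # illustrative

def preprocess_sparse(query: str) -> str:
    """BM25 side: lowercase, strip punctuation, drop stop words."""
    tokens = re.findall(r"[a-z0-9-]+", query.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

def preprocess_dense(query: str) -> str:
    """Dense side: expand abbreviations but keep natural phrasing,
    so the embedding model sees fluent text."""
    return " ".join(ABBREVIATIONS.get(w.lower(), w) for w in query.split())

q = "What is the k8s db failover procedure?"
sparse_q = preprocess_sparse(q)   # aggressive cleaning for term matching
dense_q = preprocess_dense(q)     # fluent text for the encoder
```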

Hybrid search latency profiling commonly reveals that the BM25 component is faster than the dense retrieval component by a factor of 5–10x, making parallel execution the standard deployment pattern. Issuing both retrieval requests concurrently and fusing the results when both complete reduces end-to-end retrieval latency to approximately the latency of the slower (dense) component rather than their sum. Asynchronous implementations using asyncio or thread pools for the parallel retrievals add minimal coordination overhead and should be the default approach for any production hybrid search service where retrieval latency is a user-facing concern.
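A minimal asyncio sketch; the two search coroutines are stand-ins (wrap real blocking clients with asyncio.to_thread):

```python
import asyncio

async def hybrid_retrieve_parallel(query: str) -> list[str]:
    """Run both retrievers concurrently, so end-to-end latency is
    roughly that of the slower (dense) leg, not the sum of both."""

    async def bm25_search(q: str) -> list[str]:
        await asyncio.sleep(0.01)    # stands in for a fast sparse lookup
        return ["doc_a", "doc_b"]

    async def dense_search(q: str) -> list[str]:
        await asyncio.sleep(0.05)    # stands in for a slower ANN query
        return ["doc_b", "doc_c"]

    sparse_ranked, dense_ranked = await asyncio.gather(
        bm25_search(query), dense_search(query)
    )
    # Fuse with RRF (k=60), as in the implementation above.
    scores: dict[str, float] = {}
    for ranking in (sparse_ranked, dense_ranked):
        for rank, doc_id in enumerate(ranking, 1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (60 + rank)
    return sorted(scores, key=scores.get, reverse=True)

results = asyncio.run(hybrid_retrieve_parallel("return policy"))
```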

Sparse index maintenance for hybrid search systems requires keeping BM25 term statistics (document frequency, inverse document frequency) up to date as the corpus grows. Most production BM25 implementations recompute IDF statistics periodically — daily or weekly — rather than updating them on every document insertion, accepting slight staleness in favor of operational simplicity. The impact of stale IDF statistics on retrieval quality is typically minor unless the corpus composition changes dramatically, such as adding a large new document category that introduces many new high-frequency terms.
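A sketch of the periodic-rebuild pattern; PeriodicBM25Index and build_fn are illustrative names (with rank_bm25, IDF is computed in the BM25Okapi constructor, so refreshing statistics means rebuilding):

```python
import time

class PeriodicBM25Index:
    """Rebuild the BM25 index (and thus its IDF statistics) at most once
    per `rebuild_interval` seconds, serving the stale index in between.
    `build_fn` is whatever constructs your BM25 object from a tokenised
    corpus, e.g. lambda corpus: BM25Okapi(corpus)."""

    def __init__(self, build_fn, corpus, rebuild_interval=86_400):
        self.build_fn = build_fn
        self.corpus = list(corpus)
        self.rebuild_interval = rebuild_interval
        self.index = build_fn(self.corpus)
        self.last_built = time.monotonic()
        self.dirty = False

    def add(self, doc_tokens):
        # Buffer new documents; statistics stay stale until next rebuild.
        self.corpus.append(doc_tokens)
        self.dirty = True

    def get(self):
        if self.dirty and time.monotonic() - self.last_built >= self.rebuild_interval:
            self.index = self.build_fn(self.corpus)
            self.last_built = time.monotonic()
            self.dirty = False
        return self.index
```

Note that this simple sketch also defers new-document visibility to the next rebuild; production systems usually append documents to the index immediately and defer only the IDF refresh.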