Retrieval

BM25

Best Match 25: a probabilistic ranking function based on term frequency and inverse document frequency, the gold standard for keyword-based document retrieval.

No training required
Keyword-exact matching
Baseline for all retrievers

Table of Contents

SECTION 01

Why BM25 still matters

Dense retrieval gets all the excitement, but BM25 is the cockroach of information retrieval: it has been around since 1994 and still outperforms dense methods on keyword-heavy queries. If someone searches "Python 3.11 asyncio bug CVE-2023-1234", BM25 finds the exact document in milliseconds. Dense retrieval might find "a blog post about async programming" instead.

For production RAG systems, BM25 is your first baseline and your hybrid-search companion. Build it first; add dense retrieval on top.

SECTION 02

The BM25 formula

BM25 scores a document D for query Q by summing over query terms:

score(D, Q) = Σ IDF(qᵢ) × f(qᵢ, D) × (k₁ + 1)
                          ─────────────────────────────────────────
                          f(qᵢ, D) + k₁ × (1 - b + b × |D| / avgdl)

where f(qᵢ, D) is the frequency of query term qᵢ in D, |D| is the length of D in words, and avgdl is the average document length in the corpus.

No training, no embeddings, no GPU: just counting and division.
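To make the formula concrete, here is a minimal from-scratch scorer in plain Python (defaults k1=1.5, b=0.75; the toy corpus and query are illustrative):

```python
import math

def bm25_scores(corpus_tokens, query_tokens, k1=1.5, b=0.75):
    """Score every document for the query using the BM25 formula above."""
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N
    # df: number of documents containing each query term
    df = {t: sum(1 for d in corpus_tokens if t in d) for t in set(query_tokens)}
    scores = []
    for doc in corpus_tokens:
        s = 0.0
        for t in query_tokens:
            f = doc.count(t)  # term frequency f(q_i, D)
            if f == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

docs = [["refund", "policy", "receipt"],
        ["shipping", "rates"],
        ["refund", "refund", "window"]]
print(bm25_scores(docs, ["refund", "receipt"]))
```

The first document matches both query terms, so it outscores the third (one repeated term) and the second (no match, score 0).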

SECTION 03

BM25 in Python with rank_bm25

pip install rank-bm25 nltk

from rank_bm25 import BM25Okapi
import nltk
nltk.download("punkt_tab", quiet=True)
from nltk.tokenize import word_tokenize

# Documents
corpus = [
    "Refunds are accepted within 30 days of purchase with original receipt.",
    "Free shipping is available on all orders above fifty dollars.",
    "Customer support is available Monday through Friday, 9 AM to 5 PM.",
    "We accept Visa, Mastercard, American Express, and PayPal.",
]

# Tokenise (lowercase + basic tokenisation)
tokenised = [word_tokenize(doc.lower()) for doc in corpus]

# Build BM25 index
bm25 = BM25Okapi(tokenised)

# Query
query = "return refund receipt"
tokenised_query = word_tokenize(query.lower())
scores = bm25.get_scores(tokenised_query)

# Get top-k results
import numpy as np
top_k_idx = np.argsort(scores)[::-1][:3]
for i in top_k_idx:
    print(f"Score {scores[i]:.3f}: {corpus[i]}")
SECTION 04

BM25 in Elasticsearch/OpenSearch

Elasticsearch uses BM25 as its default relevance scorer. A simple search query leverages it automatically:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a document
es.index(index="docs", id="1", document={
    "content": "Refunds accepted within 30 days with original receipt.",
    "category": "policy"
})
es.indices.refresh(index="docs")  # newly indexed docs only become searchable after a refresh

# BM25 search (default relevance scorer)
results = es.search(
    index="docs",
    query={
        "match": {
            "content": "return refund receipt"   # scored with BM25
        }
    },
)
for hit in results["hits"]["hits"]:
    print(f"Score {hit['_score']:.3f}: {hit['_source']['content']}")

Elasticsearch's BM25 is production-grade: it handles multi-field search, language analysers (stemming, stopwords), and horizontal scaling across shards.

SECTION 05

When BM25 beats dense retrieval

BM25 consistently outperforms dense retrieval when:

- The query contains exact identifiers: error codes, CVE IDs, version strings, function names.
- Query terms are rare or out-of-vocabulary for the embedding model: new product names, internal jargon, long-tail technical terms.
- Users expect exact keyword matches rather than paraphrases or semantically related text.

SECTION 06

BM25 as a RAG fallback

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util
import nltk
nltk.download("punkt_tab", quiet=True)
from nltk.tokenize import word_tokenize

class HybridRetriever:
    def __init__(self, docs):
        self.docs = docs
        tokenised = [word_tokenize(d.lower()) for d in docs]
        self.bm25 = BM25Okapi(tokenised)
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        self.doc_embs = self.model.encode(docs, convert_to_tensor=True)

    def retrieve(self, query: str, k: int = 5) -> list[str]:
        # BM25
        bm25_scores = self.bm25.get_scores(word_tokenize(query.lower()))
        # Dense
        q_emb = self.model.encode(query, convert_to_tensor=True)
        dense_scores = util.cos_sim(q_emb, self.doc_embs)[0].cpu().numpy()
        # Max-normalise each signal and sum (simple score fusion, not RRF)
        import numpy as np
        combined = (dense_scores / (dense_scores.max() + 1e-9) +
                    bm25_scores / (bm25_scores.max() + 1e-9))
        top_k = np.argsort(combined)[::-1][:k]
        return [self.docs[i] for i in top_k]
SECTION 07

Gotchas

BM25 is case- and form-sensitive. Lowercase and stem your corpus and query consistently: "Refund" and "refunds" are different tokens unless you apply stemming (NLTK's PorterStemmer or a spaCy pipeline).
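A minimal stemming pass with NLTK's PorterStemmer (whitespace split for brevity; in practice, run the stemmer over the same tokeniser output your index uses, for both corpus and query):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess(text: str) -> list[str]:
    # Lowercase, split, stem: apply the identical pipeline to corpus and query
    return [stemmer.stem(tok) for tok in text.lower().split()]

print(preprocess("Refunds refund REFUNDED"))  # all three collapse to one token
```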

rank_bm25 builds in RAM. The in-memory BM25Okapi index must be rebuilt on every process restart. For persistence, use Elasticsearch/OpenSearch, or pickle the BM25 object (though the pickle may be large for big corpora).

Short documents inflate TF scores. The b parameter controls how strongly length is normalised: with the default b=0.75, a 10-word document with one match can score comparably to a 1000-word document with five matches. Tune b if your corpus has highly variable document lengths.

BM25 parameter tuning and variants

BM25's two parameters k1 and b control term saturation and document length normalization respectively. The default values (k1=1.5, b=0.75) work well for general web-document retrieval but may be suboptimal for specific domains. Short-query, long-document corpora (like searching a technical documentation site) often benefit from lower b values (0.3โ€“0.5) that reduce the penalty for longer documents. High-frequency-term corpora benefit from lower k1 values (1.0โ€“1.2) that more aggressively cap the contribution of repeated terms.

Parameter    Default   Effect of increasing                  When to adjust
k1           1.5       Less saturation of term frequency     Lower for repetitive docs
b            0.75      Stronger length normalization         Lower for long-doc corpora
δ (BM25+)    1.0       Higher floor per matched term         Use BM25+ for rare-term queries
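The per-term contribution from the formula in Section 02 makes both knobs visible. A quick sketch with illustrative numbers (pure Python; note that rank_bm25's BM25Okapi also accepts k1 and b as keyword arguments, so tuned values plug straight into the earlier example):

```python
def term_weight(tf: float, doc_len: int, avgdl: float, k1: float, b: float) -> float:
    # TF component of BM25 (IDF omitted): saturates as tf grows
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avgdl))

# k1 controls saturation: with lower k1, the 10th occurrence adds almost nothing
for k1 in (1.2, 1.5):
    print(k1, [round(term_weight(tf, 100, 100, k1, 0.75), 2) for tf in (1, 2, 5, 10)])

# b controls the length penalty: at b=0, a 1000-word doc scores like a 100-word one
for b in (0.0, 0.75):
    print(b, round(term_weight(1, 1000, 100, 1.5, b), 3))
```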

BM25+ (BM25 with a lower bound on term contribution) fixes a length-normalization pathology: in standard BM25, a term that appears in a very long document can contribute almost nothing, so a long document that matches a query term can score barely above one that doesn't match at all. BM25+ adds a constant δ so every matched term contributes at least δ × IDF. For long-tail or rare-term queries common in technical domains, BM25+ consistently outperforms standard BM25.
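The lower bound is a one-line change to the TF component. A pure-Python sketch (rank_bm25 also ships a BM25Plus class with a delta parameter, used the same way as BM25Okapi):

```python
def tf_component(tf, doc_len, avgdl, k1=1.5, b=0.75, delta=0.0):
    # BM25+: delta acts as a floor on every matched term's contribution
    return delta + tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avgdl))

# One match in a 5000-word doc (avgdl=100): plain BM25 is near zero,
# BM25+ with delta=1 keeps a floor for the matched term
print(round(tf_component(1, 5000, 100), 3))
print(round(tf_component(1, 5000, 100, delta=1.0), 3))
```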

Advanced relevance calibration

While BM25's default parameters work across most domains, production systems often require careful tuning for specific corpus characteristics. The relationship between k1 and document length reveals why: when documents vary wildly in length (from 100 to 10,000 words), the b parameter becomes crucial for maintaining retrieval fairness. Academic research shows that b=0.5 to b=0.6 typically outperforms the default 0.75 for long-document corpora (legal contracts, scientific papers), while b=0.9+ works better for short-form content (tweets, product descriptions). Understanding these trade-offs allows teams to implement adaptive BM25 variants that adjust parameters per-domain, significantly improving end-to-end retrieval quality without changing the core algorithm.

Combining BM25 with semantic signals

Modern retrieval systems recognize that lexical matching and semantic understanding are complementary. Rather than replacing BM25 with dense embeddings, winning systems perform parallel searches and intelligently fuse results using learned ranking (LambdaMART, XGBoost) or reciprocal rank fusion (RRF). This hybrid approach captures BM25's strength on rare terms and exact matches while leveraging embedding models' understanding of synonymy and paraphrase. When fusing scores, normalization is critical: raw BM25 and cosine similarity scores have different ranges and distributions, so techniques like min-max normalization or z-score standardization ensure neither signal dominates. Companies running production RAG systems consistently report that hybrid retrieval with proper fusion outperforms any single-signal approach, achieving 10-20% improvements in NDCG scores on benchmark datasets.
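Reciprocal rank fusion is the simplest of these fusion schemes: each retriever contributes 1/(k + rank) per document, which sidesteps score normalization entirely because only ranks matter. A pure-Python sketch (k=60 is the conventional constant; the document IDs are illustrative):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each list contributes 1/(k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort document IDs by fused score, best first
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7"]      # BM25 ranking
dense_top = ["d1", "d9", "d3"]     # dense ranking
print(rrf([bm25_top, dense_top]))  # d1 and d3 rise: ranked well by both signals
```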

Debugging poor BM25 performance

BM25 can appear broken when corpus preprocessing is inconsistent. Common pitfalls include: (1) tokenizing the query differently from corpus documents, (2) applying stemming to documents but not queries (or vice versa), (3) failing to handle domain-specific punctuation (URLs, code snippets, technical notation), and (4) indexing HTML tags or metadata alongside actual content. When BM25 returns low-scoring results for queries you expect to match, first verify that query terms appear verbatim in the corpus using exact string matching. Then systematically apply preprocessing changes (lowercasing, stemming, stopword removal) to see which transformation maximizes the match. For technical corpora, consider preserving certain tokens (Python function names, error codes) that stemming would incorrectly merge. Tools like Elasticsearch with interactive tokenizer analysis and rank_bm25 with debugging output help identify preprocessing mismatches quickly.
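The first diagnostic step above can be automated: check which query tokens appear verbatim in the corpus after preprocessing. A small sketch (the default preprocess is a stand-in for whatever pipeline your index actually uses):

```python
def diagnose(query: str, corpus: list[str], preprocess=lambda s: s.lower().split()):
    """Report, per query token, how many documents contain it verbatim."""
    docs = [set(preprocess(d)) for d in corpus]
    return {tok: sum(1 for d in docs if tok in d) for tok in preprocess(query)}

corpus = ["Refunds accepted within 30 days.", "Free shipping on all orders."]
print(diagnose("refund shipping", corpus))
# 'refund' scores 0 here ('Refunds' != 'refund' without stemming):
# a zero count flags exactly the kind of preprocessing mismatch described above
```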
