Best Match 25 (BM25): a probabilistic ranking function based on term frequency and inverse document frequency, and the gold standard for keyword-based document retrieval.
Dense retrieval gets all the excitement, but BM25 is the cockroach of information retrieval: it has been around since 1994 and still outperforms dense methods on keyword-heavy queries. If someone searches "Python 3.11 asyncio bug CVE-2023-1234", BM25 finds the exact document in milliseconds. Dense retrieval might find "a blog post about async programming" instead.
For production RAG systems, BM25 is your first baseline and your hybrid-search companion. Build it first; add dense retrieval on top.
BM25 scores a document D for query Q by summing over query terms:
score(D, Q) = Σᵢ IDF(qᵢ) · f(qᵢ, D) · (k₁ + 1) / ( f(qᵢ, D) + k₁ · (1 − b + b · |D| / avgdl) )

where f(qᵢ, D) is the frequency of term qᵢ in D, |D| is the document's length in tokens, avgdl is the average document length in the corpus, and k₁ and b are free parameters (tuning them is covered below).
No training, no embeddings, no GPU: just counting and division.
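The formula can be implemented directly. A from-scratch sketch, using the common "+1 inside the log" IDF variant so scores stay non-negative (rank_bm25's BM25Okapi, used below, computes IDF slightly differently):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenised document against a tokenised query,
    term by term, exactly as in the formula above."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue  # term appears nowhere in the corpus: no contribution
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus_tokens = [
    "refunds are accepted within 30 days".split(),
    "free shipping on orders above fifty dollars".split(),
    "customer support monday through friday".split(),
]
print(bm25_score(["refunds", "days"], corpus_tokens[0], corpus_tokens))
```

In practice you'd use a library rather than this sketch; it exists to make the moving parts (TF, IDF, length normalisation) visible.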
pip install rank-bm25 nltk

from rank_bm25 import BM25Okapi
import numpy as np
import nltk

nltk.download("punkt_tab", quiet=True)
from nltk.tokenize import word_tokenize

# Documents
corpus = [
    "Refunds are accepted within 30 days of purchase with original receipt.",
    "Free shipping is available on all orders above fifty dollars.",
    "Customer support is available Monday through Friday, 9 AM to 5 PM.",
    "We accept Visa, Mastercard, American Express, and PayPal.",
]

# Tokenise (lowercase + basic tokenisation)
tokenised = [word_tokenize(doc.lower()) for doc in corpus]

# Build BM25 index
bm25 = BM25Okapi(tokenised)

# Query
query = "return refund receipt"
tokenised_query = word_tokenize(query.lower())
scores = bm25.get_scores(tokenised_query)

# Get top-k results
top_k_idx = np.argsort(scores)[::-1][:3]
for i in top_k_idx:
    print(f"Score {scores[i]:.3f}: {corpus[i]}")
Elasticsearch uses BM25 as its default relevance scorer. A simple search query leverages it automatically:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a document
es.index(index="docs", id="1", document={
    "content": "Refunds accepted within 30 days with original receipt.",
    "category": "policy",
})

# BM25 search (default relevance scorer); the 8.x client takes
# the query directly instead of the deprecated body= parameter
results = es.search(
    index="docs",
    query={
        "match": {
            "content": "return refund receipt"  # BM25 scoring
        }
    },
)

for hit in results["hits"]["hits"]:
    print(f"Score {hit['_score']:.3f}: {hit['_source']['content']}")
Elasticsearch's BM25 is production-grade: it handles multi-field search, language analysers (stemming, stopwords), and horizontal scaling across shards.
BM25 consistently outperforms dense retrieval when:

- the query contains exact identifiers: error codes, CVE numbers, version strings, function names
- rare or long-tail terms carry most of the signal
- users type precise keywords rather than natural-language questions

Rather than picking one, combine both signals. A minimal hybrid retriever:
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util
from nltk.tokenize import word_tokenize
import numpy as np
import nltk

nltk.download("punkt_tab", quiet=True)

class HybridRetriever:
    def __init__(self, docs):
        self.docs = docs
        tokenised = [word_tokenize(d.lower()) for d in docs]
        self.bm25 = BM25Okapi(tokenised)
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        self.doc_embs = self.model.encode(docs, convert_to_tensor=True)

    def retrieve(self, query: str, k: int = 5) -> list[str]:
        # BM25 (lexical) scores
        bm25_scores = self.bm25.get_scores(word_tokenize(query.lower()))
        # Dense (semantic) scores
        q_emb = self.model.encode(query, convert_to_tensor=True)
        dense_scores = util.cos_sim(q_emb, self.doc_embs)[0].cpu().numpy()
        # Max-normalise each signal to [0, 1], then sum
        # (score fusion; see below for rank-based RRF as an alternative)
        combined = (dense_scores / (dense_scores.max() + 1e-9) +
                    bm25_scores / (bm25_scores.max() + 1e-9))
        top_k = np.argsort(combined)[::-1][:k]
        return [self.docs[i] for i in top_k]
BM25 is case- and form-sensitive. Lowercase and stem your corpus and query consistently. "Refund" and "refunds" are different tokens unless you apply stemming (use NLTK's PorterStemmer or a spaCy pipeline).
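The simplest guarantee of consistency is a single preprocessing function applied to both sides. A sketch with NLTK's PorterStemmer (whitespace split used for brevity; substitute word_tokenize for real corpora):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess(text: str) -> list[str]:
    # ONE function for both corpus and query: lowercase, split, stem.
    # Whitespace split keeps the sketch dependency-light; use a real
    # tokeniser (e.g. nltk's word_tokenize) in production.
    return [stemmer.stem(tok) for tok in text.lower().split()]

corpus = ["Refunds are accepted within 30 days"]
tokenised = [preprocess(doc) for doc in corpus]
query_tokens = preprocess("refund")  # now matches the stemmed "Refunds"
print(query_tokens, tokenised[0])
```

Because the same function produced both token lists, "Refund", "refund", and "refunds" all collapse to the same index token.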
rank_bm25 builds its index in RAM. The in-memory BM25Okapi index is rebuilt on every process restart. For persistence, use Elasticsearch/OpenSearch, or pickle the BM25 object (though the pickle may be large for big corpora).
Short documents inflate TF scores. The length-normalisation factor (1 − b + b·|D|/avgdl) drops below 1 for documents shorter than average, boosting their per-term contribution: with the defaults, a 10-word document with one match can score comparably to a 1000-word document with five matches. Tune b if your corpus has very variable document lengths.
BM25's two parameters k1 and b control term saturation and document length normalization respectively. The default values (k1=1.5, b=0.75) work well for general web-document retrieval but may be suboptimal for specific domains. Short-query, long-document corpora (like searching a technical documentation site) often benefit from lower b values (0.3โ0.5) that reduce the penalty for longer documents. High-frequency-term corpora benefit from lower k1 values (1.0โ1.2) that more aggressively cap the contribution of repeated terms.
| Parameter | Default | Effect of increasing | When to adjust |
|---|---|---|---|
| k1 | 1.5 | Less saturation of term frequency | Lower for repetitive docs |
| b | 0.75 | Stronger length normalization | Lower for long-doc corpora |
| δ (delta) | 1.0 (BM25+ only) | Higher score floor for matched terms | Use BM25+ for rare-term, long-tail queries |
BM25+ (BM25 with a lower bound on term contribution) adds a constant δ to each matched term's normalised term-frequency component. Standard BM25's length normalisation can over-penalise long documents: a long document that does contain a query term can end up scoring barely above one that doesn't contain it at all. BM25+ guarantees a minimum positive contribution for every matched term, regardless of document length. For long-tail or rare-term queries common in technical domains, BM25+ consistently outperforms standard BM25.
While BM25's default parameters work across most domains, production systems often require careful tuning for specific corpus characteristics. The relationship between b and document length reveals why: when documents vary wildly in length (from 100 to 10,000 words), the b parameter becomes crucial for maintaining retrieval fairness. Academic research shows that b=0.5 to b=0.6 typically outperforms the default 0.75 for long-document corpora (legal contracts, scientific papers), while b=0.9+ works better for short-form content (tweets, product descriptions). Understanding these trade-offs lets teams implement adaptive BM25 variants that adjust parameters per domain, significantly improving end-to-end retrieval quality without changing the core algorithm.
Modern retrieval systems recognize that lexical matching and semantic understanding are complementary. Rather than replacing BM25 with dense embeddings, winning systems perform parallel searches and intelligently fuse results using learned ranking (LambdaMART, XGBoost) or reciprocal rank fusion (RRF). This hybrid approach captures BM25's strength on rare terms and exact matches while leveraging embedding models' understanding of synonymy and paraphrase. When fusing scores, normalization is critical: raw BM25 and cosine similarity scores have different ranges and distributions, so techniques like min-max normalization or z-score standardization ensure neither signal dominates. Companies running production RAG systems consistently report that hybrid retrieval with proper fusion outperforms any single-signal approach, achieving 10-20% improvements in NDCG scores on benchmark datasets.
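Reciprocal rank fusion sidesteps the normalisation problem entirely by combining ranks rather than raw scores. A minimal sketch (rrf_fuse and the two hardcoded rankings are illustrative; k=60 is the constant commonly used in the RRF literature):

```python
def rrf_fuse(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Reciprocal rank fusion: a doc's fused score is the sum of
    1 / (k + rank) over every ranked list it appears in."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = [0, 2, 1]   # doc ids, best first (hypothetical BM25 output)
dense_ranking = [0, 1, 2]  # hypothetical dense-retrieval output
print(rrf_fuse([bm25_ranking, dense_ranking]))
```

Because RRF only consumes positions, the incompatible ranges of BM25 and cosine-similarity scores never interact; the trade-off is that it discards score magnitudes, so a marginal #1 and a dominant #1 count the same.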
BM25 can appear broken when corpus preprocessing is inconsistent. Common pitfalls include: (1) tokenizing the query differently from corpus documents, (2) applying stemming to documents but not queries (or vice versa), (3) failing to handle domain-specific punctuation (URLs, code snippets, technical notation), and (4) indexing HTML tags or metadata alongside actual content. When BM25 returns low-scoring results for queries you expect to match, first verify that query terms appear verbatim in the corpus using exact string matching. Then systematically apply preprocessing changes (lowercasing, stemming, stopword removal) to see which transformation maximizes the match. For technical corpora, consider preserving certain tokens (Python function names, error codes) that stemming would incorrectly merge. Tools like Elasticsearch's _analyze API (which shows exactly how a string is tokenised) and direct inspection of the tokenised corpus you feed rank_bm25 help identify preprocessing mismatches quickly.
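A small diagnostic along these lines: for each raw query term, report whether it matches the corpus verbatim, only after preprocessing, or not at all. diagnose and the naive lowercase-and-strip-periods preprocessor are hypothetical helpers for illustration:

```python
def diagnose(query: str, corpus: list[str], preprocess) -> dict[str, str]:
    """Classify each raw query term as 'verbatim', 'after preprocessing',
    or 'no match' against the corpus."""
    vocab = {tok for doc in corpus for tok in preprocess(doc)}
    report = {}
    for term in query.split():
        if any(term in doc for doc in corpus):
            report[term] = "verbatim"
        elif preprocess(term) and preprocess(term)[0] in vocab:
            report[term] = "after preprocessing"
        else:
            report[term] = "no match"
    return report

corpus = ["Refunds are accepted within 30 days with original receipt."]
lowercase = lambda text: text.lower().replace(".", "").split()
print(diagnose("Refunds refund receipt", corpus, lowercase))
```

Here "refund" comes back as "no match" even though "Refunds" is in the corpus, which is exactly the signal that this pipeline needs stemming.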