RETRIEVAL SYSTEMS

Retrieval Technology

Dense vs sparse vs hybrid — FAISS, pgvector, HNSW, BM25, and how to choose

Dense + sparse = hybrid: the production default
HNSW: the dominant ANN algorithm
ANN vs exact: the speed-vs-recall tradeoff
Contents
  1. The retrieval problem
  2. Sparse: BM25
  3. Dense: vector search
  4. HNSW algorithm
  5. Hybrid search
  6. Vector databases
  7. Scaling & production
01 — Problem Definition

The Retrieval Problem

You have a corpus of N documents (N = 10K to 100M). A user submits a query. You need to return the top-k most relevant documents in milliseconds.

The naive approach: scan all N documents, score each one, sort, return top-k. This is O(N×d) — too slow for any corpus larger than a few thousand documents.

Two fundamental approaches exist: sparse retrieval (keyword/BM25) and dense retrieval (vector/semantic). Each has strengths and blind spots. Modern production systems use both.

The ANN problem: Exact nearest neighbor search in high-dimensional spaces is prohibitively expensive. Approximate Nearest Neighbor (ANN) indices trade a small recall loss for 100–1000× speedup. Most production systems accept 95–99% recall in exchange for real-time latency.
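That recall loss is easy to quantify: compare what the ANN index returns against an exact scan. A minimal sketch (`recall_at_k` is an illustrative helper, not a library function):

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k neighbors that the ANN index returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# ANN returned 9 of the 10 true nearest neighbors -> recall@10 = 0.9
print(recall_at_k([1, 2, 3, 4, 5, 6, 7, 8, 9, 42], list(range(10))))
```

In practice you compute the exact top-k for a few hundred held-out queries once, then track this metric as you tune index parameters.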

💡 Key insight: Retrieval is a two-axis problem: recall (did we find the right documents?) and latency (how fast?). ANN indices sacrifice a tiny bit of recall for dramatic latency gains.
02 — Keyword Search

Sparse Retrieval: BM25

BM25 (Best Match 25) is a probabilistic term-frequency model that scores documents by how well query terms match. It's been the standard for keyword search since the late 1990s.

How it works: Each query term is scored based on (1) how often it appears in the document (TF), (2) how rare the term is across the corpus (IDF), (3) saturation so common terms don't dominate, and (4) document-length normalization so longer documents don't win automatically.
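The four ingredients combine into one score per query term. A minimal sketch of the scoring function, assuming the common Okapi defaults k1=1.5 and b=0.75 (`bm25_score` is an illustrative helper, not the rank_bm25 API):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query (illustrative sketch)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N        # average document length
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)   # (2) rarity: document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)                       # (1) frequency in this doc
        # (3) saturation via k1, (4) length normalization via b
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

Note how a term absent from the document contributes zero, and how a rarer term (higher IDF) contributes more than a common one at the same term frequency.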

Strengths: Exact keyword match, handles rare/technical terms, no embedding required, fast on CPU, results are interpretable (you can explain why document A ranked above B).

Weaknesses: Vocabulary mismatch (synonym problem — "fast" and "quick" are different tokens), no semantic understanding, requires tokenization, fails on out-of-vocabulary terms.

Example: BM25 with rank_bm25

```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi

corpus = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn fox leaps above the idle hound",
    "Python is a popular programming language",
]
tokenized = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "quick fox jumps"
scores = bm25.get_scores(query.split())
# scores: [2.87, 1.23, 0.0]
# Doc 0 wins (exact terms); Doc 1 misses "quick"/"jumps"
```
⚠️ When to use BM25: Queries with specific product names, error codes, medical terminology, or any case where exact string matching matters. Dense retrieval often fails to find these.
03 — Vector Search

Dense Retrieval: Vector Search

Embed the query and all documents with a bi-encoder (embedding model). Find the documents whose embeddings are closest to the query embedding. This captures semantic similarity.

Example: Query "fast brown dog" matches "quick auburn fox" semantically. BM25 fails here (no exact term overlap). Dense retrieval succeeds because "fast" and "quick" live near each other in the vector space, as do "brown" and "auburn", and "dog" and "fox".

The challenge: Vector spaces have high dimensionality (768, 1024, or more). Finding the nearest neighbor exactly requires scanning all vectors — O(N×d). ANN indices solve this by organizing vectors into hierarchical structures.
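The O(N×d) exact scan is simple enough to write down, which is exactly why it serves as the ground-truth baseline for measuring ANN recall. A sketch using NumPy (names are illustrative):

```python
import numpy as np

def exact_topk(query, vectors, k=10):
    """Exact nearest neighbors by cosine similarity: O(N*d), no index."""
    q = query / np.linalg.norm(query)
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = V @ q                     # one dot product per document
    top = np.argsort(-sims)[:k]      # sort all N scores
    return top, sims[top]
```

At 1M vectors of 1024 float32 dimensions, this touches roughly 4 GB of data per query, which is why ANN indices exist.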

FAISS (Facebook AI Similarity Search): The standard open-source library for ANN search. It supports multiple index types (Flat, IVF, HNSW, PQ) and combinations thereof.

Example: FAISS IndexHNSWFlat

```python
import faiss
import numpy as np

d = 1024   # embedding dimension
M = 32     # HNSW connections per node

index = faiss.IndexHNSWFlat(d, M)
index.hnsw.efConstruction = 200   # build-time quality

# Add 1M vectors
vectors = np.random.randn(1_000_000, d).astype('float32')
faiss.normalize_L2(vectors)
index.add(vectors)

# Query (top-10)
index.hnsw.efSearch = 64   # query-time quality
query = np.random.randn(1, d).astype('float32')
faiss.normalize_L2(query)
distances, indices = index.search(query, k=10)
```

FAISS index types:

| Index | Speed | Recall | Memory | When to use |
|---|---|---|---|---|
| Flat (exact) | Slow | 100% | Full | <100K vectors or ground truth |
| IVF + Flat | Medium | 95–99% | Full | 100K–10M, needs recall |
| HNSW | Fast | 95–99% | Full + graph | Latency-sensitive, fits in RAM |
| IVF + PQ | Fast | 85–95% | Compressed | >10M vectors, memory constrained |
04 — Best Algorithm

HNSW: The Dominant ANN Algorithm

HNSW (Hierarchical Navigable Small World) is a graph-based ANN algorithm that's become the industry standard. It builds a multi-layer graph where each node connects to up to M neighbors.

How it works: Start at the top layer (which has very few nodes). Greedily navigate downward, always moving toward the query. When you reach the bottom layer, you have a list of nearby candidates. The ef parameter controls the search quality: higher ef = more candidates considered = better recall but slower.
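The bottom-layer walk can be sketched as a best-first search with a bounded result heap, which is where the ef parameter lives. A toy single-layer version (real HNSW adds the layer hierarchy and neighbor-selection heuristics; all names here are illustrative):

```python
import heapq

def ef_search(graph, vectors, entry, query, ef=8):
    """Best-first search over one graph layer, keeping the ef closest nodes."""
    dist = lambda i: sum((a - b) ** 2 for a, b in zip(vectors[i], query))
    visited = {entry}
    frontier = [(dist(entry), entry)]    # min-heap: closest candidate first
    results = [(-dist(entry), entry)]    # max-heap: worst kept node on top
    while frontier:
        d, node = heapq.heappop(frontier)
        if d > -results[0][0]:           # frontier can no longer improve results
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(nb)
            if len(results) < ef or d_nb < -results[0][0]:
                heapq.heappush(frontier, (d_nb, nb))
                heapq.heappush(results, (-d_nb, nb))
                if len(results) > ef:
                    heapq.heappop(results)   # drop the current worst
    return sorted((-d, n) for d, n in results)   # (distance, node), closest first
```

Raising ef widens the result heap, so the search abandons fewer paths: more candidates considered, better recall, more distance computations.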

Parameters:

| Use case | M | efConstruction | efSearch | Recall |
|---|---|---|---|---|
| High speed, moderate recall | 16 | 100 | 32 | ~95% |
| Balanced | 32 | 200 | 64 | ~98% |
| High recall, moderate speed | 64 | 400 | 128 | ~99.5% |
| Maximum recall | 128 | 800 | 256 | ~99.9% |
Industry default: HNSW is supported natively in pgvector (v0.5+), Qdrant, Weaviate, Chroma, and most vector databases. Use M=32, efConstruction=200, efSearch=64 as your starting point.
05 — Combined

Hybrid Search: Dense + Sparse

Run BM25 and dense retrieval separately, then merge the ranked lists. This covers both vocabulary mismatch (BM25 handles exact terms) and semantic drift (dense handles paraphrases).

Why hybrid works: BM25 is a specialist — it excels at finding exact keywords. Dense retrieval is a generalist — it understands meaning but can miss specific terms. By combining both, you get the best of both worlds.

Merging the lists: You can't just average scores — BM25 scores (0 to ~20) and cosine similarity (0 to 1) are on different scales. The standard solution is Reciprocal Rank Fusion (RRF).

Reciprocal Rank Fusion (RRF)

For each document, compute: RRF_score = sum(1/(k + rank_i)) across all retrieval systems, where k=60 is a constant. No score normalization needed — the formula handles different scales automatically.

```python
def reciprocal_rank_fusion(rankings: list[list], k: int = 60) -> list:
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_results = ["doc_A", "doc_B", "doc_C", "doc_D"]   # from vector search
sparse_results = ["doc_C", "doc_E", "doc_A", "doc_F"]  # from BM25

fused = reciprocal_rank_fusion([dense_results, sparse_results])
# Result: doc_A (both), doc_C (both), ...
```

When hybrid outperforms single methods:

| Query type | Dense | BM25 | Hybrid |
|---|---|---|---|
| Semantic paraphrase ("fast runner" vs "quick athlete") | ✓ | ✗ | ✓ |
| Exact product code ("SKU-12345") | ✗ | ✓ | ✓ |
| Domain acronym ("RLHF", "RAG") | ✗ | ✓ | ✓ |
| Multi-concept question | Partial | ✗ | ✓ |
Safe default for production: Hybrid search with RRF is the production default. BM25 is cheap to run in parallel with dense search, and the fusion step is trivial. You get coverage for both semantic drift and vocabulary mismatch.
06 — Storage

Vector Databases

You can use FAISS directly in Python, but production systems need durability, concurrent access, and filtering. Enter vector databases.

| DB | Index support | Filtering | Cloud | Open-source |
|---|---|---|---|---|
| pgvector | HNSW, IVFFlat | Full SQL | Via Postgres | Yes |
| Qdrant | HNSW | Rich payload filters | Yes | Yes |
| Weaviate | HNSW | GraphQL + property | Yes | Yes |
| Chroma | HNSW | Metadata filter | Self-host | Yes |
| Pinecone | Proprietary | Metadata | Yes (managed) | No |
| Milvus | HNSW, IVF, DiskANN | Rich | Yes | Yes |
| Redis | HNSW, Flat | RediSearch filter | Yes | Yes |

pgvector is the pragmatic choice: If you're already on Postgres, start here. You get SQL transactions, query planner, backups, and all the tooling you already know. HNSW and IVF indices handle most workloads. Move to a dedicated vector DB only when you need >10M vectors or <5ms P99 latency.

⚠️ Startup advice: Don't pick a vector database based on features — they all do similar things. Pick based on what you already use. If you're on Postgres, use pgvector. If you need cloud-native, use Qdrant or Pinecone. The switching cost is low early.
07 — Operations

Scaling and Production Patterns

Moving beyond a single machine to production introduces new challenges:

Sharding: Partition vectors across nodes by ID range or hash. Each shard handles its own ANN search. A scatter-gather layer combines results from all shards. This scales linearly with the number of machines.
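The gather step is just a k-way merge of per-shard results. A sketch, assuming each shard returns (score, doc_id) pairs with higher score = better (names are illustrative):

```python
import heapq

def gather(shard_results, k):
    """Merge each shard's local top-k into a global top-k by score."""
    all_hits = (hit for shard in shard_results for hit in shard)
    return heapq.nlargest(k, all_hits)

shard_a = [(0.92, "doc_3"), (0.80, "doc_7")]
shard_b = [(0.95, "doc_9"), (0.60, "doc_1")]
print(gather([shard_a, shard_b], k=2))   # [(0.95, 'doc_9'), (0.92, 'doc_3')]
```

Each shard must return a full local top-k, because in the worst case every global winner lives on one shard.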

Disk-based indices: DiskANN (Microsoft) keeps its graph index (Vamana, a close relative of HNSW) on SSD and loads neighbors on demand. This enables 1B+ vectors without proportional RAM. Latency is somewhat higher (~10–20ms instead of 1–5ms), but you get massive scale.

Incremental indexing: Most ANN structures degrade as vectors are added and deleted, so periodic rebuilds are needed to keep recall high. Workaround: maintain two indices. The main index holds production vectors; a hot buffer holds new vectors. Every night, merge them into a fresh main index. Queries hit both and see combined results.
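The main-plus-buffer pattern can be sketched with a brute-force scan standing in for the ANN index (illustrative class; a real deployment would wrap FAISS or pgvector here):

```python
import numpy as np

class DualIndex:
    """Main index + hot buffer; queries see both, a nightly job merges them."""
    def __init__(self, dim):
        self.main = np.empty((0, dim), dtype=np.float32)   # rebuilt nightly
        self.buffer = []                                   # vectors since last merge

    def add(self, vec):
        # new vectors land in the buffer, visible immediately
        self.buffer.append(np.asarray(vec, dtype=np.float32))

    def merge(self):
        # the nightly job: rebuild main with buffered vectors folded in
        if self.buffer:
            self.main = np.vstack([self.main] + self.buffer)
            self.buffer = []

    def search(self, query, k):
        # query both main and buffer, return combined top-k
        pool = self.main if not self.buffer else np.vstack([self.main] + self.buffer)
        sims = pool @ np.asarray(query, dtype=np.float32)
        top = np.argsort(-sims)[:k]
        return top, sims[top]
```

The key property: a freshly added vector is searchable before any merge runs, at the cost of a small brute-force scan over the buffer.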

Filtering: "Find semantically similar documents WHERE category='finance' AND date>2024-01-01". Three approaches: (1) pre-filter then search (fast but recall drops), (2) post-filter (exact but slow — you might need to search top-1000 to get top-10 after filtering), (3) filtered search inside the index (best; engines such as Qdrant and Weaviate support this natively).
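The pre-filter vs post-filter tradeoff can be sketched over a brute-force scorer (illustrative helpers; `predicate` stands for any metadata test, and with a real ANN index the pre-filter variant is where recall suffers):

```python
def pre_filter_search(docs, score, predicate, k):
    """Filter first, then rank only survivors (fast; with an ANN index
    the filtered set breaks graph connectivity, so recall can drop)."""
    allowed = [d for d in docs if predicate(d)]
    return sorted(allowed, key=score, reverse=True)[:k]

def post_filter_search(docs, score, predicate, k, overfetch=100):
    """Rank everything, over-fetch, then filter (exact but wasteful)."""
    top = sorted(docs, key=score, reverse=True)[:k * overfetch]
    return [d for d in top if predicate(d)][:k]
```

With a selective filter (say 1% of documents match), post-filtering needs an overfetch of roughly 100x to reliably fill the top-k, which is the hidden cost approach (2) pays.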

Tools and libraries:

- FAISS (vector search): open-source library from Meta for ANN search; supports IndexFlat, IVF, HNSW, PQ
- pgvector (vector DB): PostgreSQL extension with HNSW and IVF, full SQL support
- Qdrant (vector DB): cloud-native vector database with HNSW, rich filtering, managed service
- Weaviate (vector DB): GraphQL API, HNSW index, strong ecosystem for ML pipelines
- Chroma (vector DB): lightweight, HNSW-backed, good for small-to-medium RAG systems
- Pinecone (vector DB): fully managed, serverless, best for teams that want zero ops
- Milvus (vector DB): enterprise-grade, supports DiskANN, handles 1B+ vectors
- rank_bm25 (sparse search): Python library for BM25 ranking; lightweight, no server needed
💡 Architecture pattern: (1) Dense retrieval via ANN (fast), (2) BM25 in parallel (cheap), (3) RRF fusion, (4) optional cross-encoder reranking. This gives you semantic + keyword coverage, sub-100ms latency, and 98%+ recall.