Dense vs sparse vs hybrid — FAISS, pgvector, HNSW, BM25, and how to choose
You have a corpus of N documents (N = 10K to 100M). A user submits a query. You need to return the top-k most relevant documents in milliseconds.
The naive approach: scan all N documents, score each one, sort, return the top-k. This is O(N×d) per query (d being the size of each document representation), too slow for any corpus larger than a few thousand documents.
Two fundamental approaches exist: sparse retrieval (keyword/BM25) and dense retrieval (vector/semantic). Each has strengths and blind spots. Modern production systems use both.
The ANN problem: Exact nearest neighbor search in high-dimensional spaces is prohibitively expensive. Approximate Nearest Neighbor (ANN) indices trade a small recall loss for 100–1000× speedup. Most production systems accept 95–99% recall in exchange for real-time latency.
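That recall figure is measurable: run the same queries through an exact (Flat) search and through the ANN index, then compare the result sets. A minimal sketch of the metric (the function name is mine):

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k results that the ANN index also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# e.g. the ANN index returned 9 of the 10 true nearest neighbors:
print(recall_at_k([1, 2, 3, 4, 5, 6, 7, 8, 9, 99], list(range(10))))  # → 0.9
```

Averaging this over a few hundred held-out queries is the standard way to tune an index's speed/recall knobs.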
BM25 (Best Match 25) is a probabilistic term-frequency ranking function that scores documents by how well query terms match. It has been the standard for keyword search since the mid-1990s (the Okapi system).
How it works: Each query term is scored based on (1) how often it appears in the document (TF), (2) how rare the term is across the corpus (IDF), (3) saturation so common terms don't dominate, and (4) document-length normalization so longer documents don't win automatically.
Strengths: Exact keyword match, handles rare/technical terms, no embedding required, fast on CPU, results are interpretable (you can explain why document A ranked above B).
Weaknesses: Vocabulary mismatch (synonym problem — "fast" and "quick" are different tokens), no semantic understanding, requires tokenization, fails on out-of-vocabulary terms.
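The four ingredients above combine into the Okapi BM25 scoring formula. A stdlib-only sketch over pre-tokenized documents (k1=1.5 and b=0.75 are common defaults; this is an illustration, not a production scorer):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    df = Counter()                                 # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))   # term rarity
            # saturating TF with document-length normalization:
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["fast", "brown", "dog"], ["quick", "auburn", "fox"], ["slow", "grey", "cat"]]
print(bm25_scores(["fast", "dog"], docs))  # only doc 0 scores > 0: no term overlap elsewhere
```

Note the interpretability: each document's score decomposes into per-term contributions you can print and inspect.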
Embed the query and all documents with a bi-encoder (embedding model). Find the documents whose embeddings are closest to the query embedding. This captures semantic similarity.
Example: Query "fast brown dog" matches "quick auburn fox" semantically. BM25 fails here (no exact term overlap). Dense retrieval succeeds because "fast" and "quick" live near each other in the vector space, as do "brown" and "auburn", and "dog" and "fox".
The challenge: Vector spaces have high dimensionality (768, 1024, or more). Finding the nearest neighbor exactly requires scanning all vectors — O(N×d). ANN indices solve this by organizing vectors into hierarchical structures.
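For reference, the exact O(N×d) baseline that ANN indices approximate is one matrix-vector product plus a sort; a numpy sketch:

```python
import numpy as np

def exact_topk(query, vectors, k=3):
    """Brute-force O(N*d) nearest-neighbor search by cosine similarity."""
    q = query / np.linalg.norm(query)
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = V @ q                    # one dot product per document: the O(N*d) scan
    topk = np.argsort(-sims)[:k]    # full sort: the cost ANN indices avoid
    return topk, sims[topk]

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 768)).astype(np.float32)
ids, sims = exact_topk(docs[42], docs, k=3)
# the query vector itself comes back first, with similarity ~1.0
```

This is also how you generate the ground-truth results for measuring an ANN index's recall.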
FAISS (Facebook AI Similarity Search): The standard open-source library for ANN search. It supports multiple index types (Flat, IVF, HNSW, PQ) and combinations thereof.
FAISS index types:
| Index | Speed | Recall | Memory | When to use |
|---|---|---|---|---|
| Flat (exact) | Slow | 100% | Full | <100K vectors, or as a ground-truth baseline for measuring recall |
| IVF + Flat | Medium | 95–99% | Full | 100K–10M, needs recall |
| HNSW | Fast | 95–99% | Full + graph | Latency-sensitive, fits in RAM |
| IVF + PQ | Fast | 85–95% | Compressed | >10M vectors, memory constrained |
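To make the IVF rows concrete: IVF partitions vectors into nlist inverted lists around k-means centroids, and a query probes only the nprobe closest lists. A toy numpy sketch of the same idea (FAISS's IVF+Flat does this with optimized kernels; function names here are mine):

```python
import numpy as np

def build_ivf(vectors, nlist=16, iters=5, seed=0):
    """Toy IVF index: k-means centroids plus inverted lists of vector ids."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), nlist, replace=False)].copy()
    for _ in range(iters):  # a few Lloyd iterations suffice for a sketch
        assign = np.argmin(((vectors[:, None] - centroids) ** 2).sum(-1), axis=1)
        for c in range(nlist):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    assign = np.argmin(((vectors[:, None] - centroids) ** 2).sum(-1), axis=1)
    lists = {c: np.flatnonzero(assign == c) for c in range(nlist)}
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k=5, nprobe=4):
    """Probe only the nprobe closest lists instead of scanning all N vectors."""
    probe = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
    cand = np.concatenate([lists[c] for c in probe])
    dists = ((vectors[cand] - query) ** 2).sum(-1)
    return cand[np.argsort(dists)[:k]]
```

The recall loss comes from exactly one place: a true neighbor can live in a list you didn't probe. Raising nprobe trades speed back for recall.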
HNSW (Hierarchical Navigable Small World) is a graph-based ANN algorithm that's become the industry standard. It builds a multi-layer graph where each node keeps links to at most M neighbors.
How it works: Start at the top layer (which has very few nodes). Greedily navigate downward, always moving toward the query. When you reach the bottom layer, you have a list of nearby candidates. The ef parameter (efConstruction at build time, efSearch at query time) controls search quality: higher ef = more candidates kept = better recall but slower.
Parameters:
| Use case | M | efConstruction | efSearch | Recall |
|---|---|---|---|---|
| High speed, moderate recall | 16 | 100 | 32 | ~95% |
| Balanced | 32 | 200 | 64 | ~98% |
| High recall, moderate speed | 64 | 400 | 128 | ~99.5% |
| Maximum recall | 128 | 800 | 256 | ~99.9% |
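Within a single layer, the greedy descent described above reduces to a best-first search over a neighbor graph with a bounded candidate set. A simplified one-layer sketch (real HNSW adds the layer hierarchy and neighbor-selection heuristics; the function name is mine):

```python
import heapq
import numpy as np

def greedy_search(query, vectors, neighbors, entry, ef=32, k=5):
    """Best-first search on a neighbor graph: the core of one HNSW layer.

    neighbors[i] lists the graph edges of node i; ef bounds the candidate
    set, trading recall for speed exactly like efSearch in HNSW.
    """
    dist = lambda i: float(np.sum((vectors[i] - query) ** 2))
    visited = {entry}
    candidates = [(dist(entry), entry)]     # min-heap: closest unexplored first
    best = [(-dist(entry), entry)]          # max-heap (negated) of the ef best so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0] and len(best) >= ef:
            break                           # nothing left can improve the top-ef
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)     # evict the current worst
    return [n for _, n in sorted((-d, n) for d, n in best)][:k]
```

The hierarchy's job is only to find a good `entry` point cheaply; almost all the recall/latency trade-off lives in this bottom-layer loop.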
Run BM25 and dense retrieval separately, then merge the ranked lists. This covers both vocabulary mismatch (BM25 handles exact terms) and semantic drift (dense handles paraphrases).
Why hybrid works: BM25 is a specialist — it excels at finding exact keywords. Dense retrieval is a generalist — it understands meaning but can miss specific terms. By combining both, you get the best of both worlds.
Merging the lists: You can't just average scores — BM25 scores (unbounded, often single to double digits in practice) and cosine similarity (roughly 0 to 1 for typical embeddings) are on different scales. The standard solution is Reciprocal Rank Fusion (RRF).
For each document, compute: RRF_score = sum(1/(k + rank_i)) across all retrieval systems, where k=60 is a constant. No score normalization needed — the formula handles different scales automatically.
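RRF is only a few lines. A minimal sketch merging two ranked id lists with k=60:

```python
def rrf_merge(rankings, k=60):
    """Fuse ranked lists of doc ids with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7"]    # ranked output of BM25
dense_top = ["d1", "d9", "d3"]   # ranked output of dense retrieval
print(rrf_merge([bm25_top, dense_top]))  # → ['d1', 'd3', 'd9', 'd7']
```

Note that only ranks enter the formula, never raw scores, which is why no normalization step is needed; d1 wins because it appears high in both lists.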
When hybrid outperforms single methods:
| Query type | Dense | BM25 | Hybrid |
|---|---|---|---|
| Semantic paraphrase ("fast runner" vs "quick athlete") | ✓ | ✗ | ✓ |
| Exact product code ("SKU-12345") | ✗ | ✓ | ✓ |
| Domain acronym ("RLHF", "RAG") | ✗ | ✓ | ✓ |
| Multi-concept question | ✓ | Partial | ✓ |
You can use FAISS directly in Python, but production systems need durability, concurrent access, and filtering. Enter vector databases.
| DB | Index support | Filtering | Cloud | Open-source |
|---|---|---|---|---|
| pgvector | HNSW, IVFFlat | Full SQL | Via Postgres | Yes |
| Qdrant | HNSW | Rich payload filters | Yes | Yes |
| Weaviate | HNSW | GraphQL + property | Yes | Yes |
| Chroma | HNSW | Metadata filter | Self-host | Yes |
| Pinecone | Proprietary | Metadata | Yes (managed) | No |
| Milvus | HNSW, IVF, DiskANN | Rich | Yes | Yes |
| Redis | HNSW, Flat | RediSearch filter | Yes | Yes |
pgvector is the pragmatic choice: If you're already on Postgres, start here. You get SQL transactions, query planner, backups, and all the tooling you already know. HNSW and IVF indices handle most workloads. Move to a dedicated vector DB only when you need >10M vectors or <5ms P99 latency.
Moving beyond a single machine to production introduces new challenges:
Sharding: Partition vectors across nodes by ID range or hash. Each shard runs its own ANN search, and a scatter-gather layer merges the per-shard results. Throughput scales roughly linearly with the number of machines; per-query latency is bounded by the slowest shard.
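The scatter-gather layer itself is simple; a sketch with a hypothetical per-shard search interface (the `Shard` class and its `.search` signature are stand-ins, not any particular library's API):

```python
import heapq

class Shard:
    """Stand-in for one partition's ANN index (hypothetical interface)."""
    def __init__(self, hits):
        self.hits = hits                       # [(score, doc_id), ...]
    def search(self, query, k):
        return heapq.nlargest(k, self.hits)    # local top-k for this shard

def scatter_gather(query, shards, k):
    """Fan the query out to every shard, then merge the partial top-k lists."""
    partials = [s.search(query, k) for s in shards]             # scatter
    return heapq.nlargest(k, (h for p in partials for h in p))  # gather
```

The key invariant: each shard must return a full k results, because in the worst case the entire global top-k lives on one shard.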
Disk-based indices: DiskANN (Microsoft) keeps its proximity graph (Vamana, an HNSW-like structure) on SSD and loads neighbors on demand. This enables 1B+ vectors without proportional RAM. Latency is somewhat higher (~10–20ms instead of 1–5ms), but you get massive scale.
Incremental indexing: Many ANN structures degrade under sustained inserts and deletes and eventually need a rebuild. Workaround: maintain two indices. The main index holds production vectors; a hot buffer absorbs new vectors. Periodically (e.g. nightly), merge them into a fresh main index. Queries hit both and see combined results.
Filtering: "Find semantically similar documents WHERE category='finance' AND date>2024-01-01". Three approaches: (1) pre-filter then search (fast, but recall drops), (2) post-filter (exact but slow; you might need to fetch the top-1000 to keep top-10 after filtering), (3) filtered HNSW, which applies the predicate during graph traversal (best; supported natively by engines such as Qdrant, Weaviate, and Milvus).
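Approach (2) is the easiest to bolt onto any index; a sketch with a hypothetical `index_search` callable, over-fetching to compensate for hits the filter discards:

```python
def post_filter_search(query, index_search, allowed, k=10, overfetch=10):
    """Post-filtering: over-fetch from the ANN index, then apply the predicate.

    index_search(query, n) -> [(score, doc_id), ...] is a stand-in for any
    ANN index's search call. Over-fetching k*overfetch candidates compensates
    for hits the filter removes, but a very selective filter can still
    leave fewer than k results, which is the weakness noted above.
    """
    candidates = index_search(query, k * overfetch)
    return [(s, d) for s, d in candidates if allowed(d)][:k]
```

A common refinement is to retry with a larger over-fetch factor (or fall back to pre-filtering) when fewer than k results survive.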
Tools and libraries: