RETRIEVAL SYSTEMS

Retrieval Technology

Dense vs sparse vs hybrid — FAISS, pgvector, HNSW, BM25, and how to choose

Dense + sparse = hybrid: the production default
HNSW: the dominant ANN algorithm
ANN vs exact: the speed-vs-recall tradeoff
Contents
  1. The retrieval problem
  2. Sparse: BM25
  3. Dense: vector search
  4. HNSW algorithm
  5. Hybrid search
  6. Vector databases
  7. Scaling & production
01 — Problem Definition

The Retrieval Problem

You have a corpus of N documents (N = 10K to 100M). A user submits a query. You need to return the top-k most relevant documents in milliseconds.

The naive approach: scan all N documents, score each one, sort, return top-k. This is O(N×d) — too slow for any corpus larger than a few thousand documents.

Two fundamental approaches exist: sparse retrieval (keyword/BM25) and dense retrieval (vector/semantic). Each has strengths and blind spots. Modern production systems use both.

The ANN problem: Exact nearest neighbor search in high-dimensional spaces is prohibitively expensive. Approximate Nearest Neighbor (ANN) indices trade a small recall loss for 100–1000× speedup. Most production systems accept 95–99% recall in exchange for real-time latency.
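That recall loss is easy to quantify: compare what the ANN index returns against an exact scan. A minimal sketch (`recall_at_k` is an illustrative helper, not a library function):

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k neighbors that the ANN index returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# ANN returned 9 of the 10 true nearest neighbors -> recall@10 = 0.9
print(recall_at_k([1, 2, 3, 4, 5, 6, 7, 8, 9, 42], list(range(10))))
```

In practice you compute the exact top-k for a few hundred held-out queries once, then track this metric as you tune index parameters.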

💡 Key insight: Retrieval is a two-axis problem: recall (did we find the right documents?) and latency (how fast?). ANN indices sacrifice a tiny bit of recall for dramatic latency gains.
02 — Keyword Search

Sparse Retrieval: BM25

BM25 (Best Match 25) is a probabilistic term-frequency model that scores documents by how well query terms match. It's been the standard for keyword search since the late 1990s.

How it works: Each query term is scored based on (1) how often it appears in the document (TF), (2) how rare the term is across the corpus (IDF), (3) saturation so common terms don't dominate, and (4) document-length normalization so longer documents don't win automatically.
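The four ingredients combine into one score per query term. A minimal sketch of the scoring function, assuming the common Okapi defaults k1=1.5 and b=0.75 (`bm25_score` is an illustrative helper, not the rank_bm25 API):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query (illustrative sketch)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N        # average document length
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)   # (2) rarity: document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)                       # (1) frequency in this doc
        # (3) saturation via k1, (4) length normalization via b
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

Note how a term absent from the document contributes zero, and how a rarer term (higher IDF) contributes more than a common one at the same term frequency.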

Strengths: Exact keyword match, handles rare/technical terms, no embedding required, fast on CPU, results are interpretable (you can explain why document A ranked above B).

Weaknesses: Vocabulary mismatch (synonym problem — "fast" and "quick" are different tokens), no semantic understanding, requires tokenization, fails on out-of-vocabulary terms.

Example: BM25 with rank_bm25

```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi

corpus = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn fox leaps above the idle hound",
    "Python is a popular programming language",
]
tokenized = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "quick fox jumps"
scores = bm25.get_scores(query.split())
# scores: [2.87, 1.23, 0.0]
# Doc 0 wins (exact terms); Doc 1 misses "quick"/"jumps"
```
⚠️ When to use BM25: Queries with specific product names, error codes, medical terminology, or any case where exact string matching matters. Dense retrieval often fails to find these.
03 — Vector Search

Dense Retrieval: Vector Search

Embed the query and all documents with a bi-encoder (embedding model). Find the documents whose embeddings are closest to the query embedding. This captures semantic similarity.

Example: Query "fast brown dog" matches "quick auburn fox" semantically. BM25 fails here (no exact term overlap). Dense retrieval succeeds because "fast" and "quick" live near each other in the vector space, as do "brown" and "auburn", and "dog" and "fox".

The challenge: Vector spaces have high dimensionality (768, 1024, or more). Finding the nearest neighbor exactly requires scanning all vectors — O(N×d). ANN indices solve this by organizing vectors into hierarchical structures.
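The O(N×d) exact scan is simple enough to write down, which is exactly why it serves as the ground-truth baseline for measuring ANN recall. A sketch using NumPy (names are illustrative):

```python
import numpy as np

def exact_topk(query, vectors, k=10):
    """Exact nearest neighbors by cosine similarity: O(N*d), no index."""
    q = query / np.linalg.norm(query)
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = V @ q                     # one dot product per document
    top = np.argsort(-sims)[:k]      # sort all N scores
    return top, sims[top]
```

At 1M vectors of 1024 float32 dimensions, this touches roughly 4 GB of data per query, which is why ANN indices exist.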

FAISS (Facebook AI Similarity Search): The standard open-source library for ANN search. It supports multiple index types (Flat, IVF, HNSW, PQ) and combinations thereof.

Example: FAISS IndexHNSWFlat

```python
import faiss
import numpy as np

d = 1024   # embedding dimension
M = 32     # HNSW connections per node

index = faiss.IndexHNSWFlat(d, M)
index.hnsw.efConstruction = 200   # build-time quality

# Add 1M vectors
vectors = np.random.randn(1_000_000, d).astype('float32')
faiss.normalize_L2(vectors)
index.add(vectors)

# Query (top-10)
index.hnsw.efSearch = 64   # query-time quality
query = np.random.randn(1, d).astype('float32')
faiss.normalize_L2(query)
distances, indices = index.search(query, k=10)
```

FAISS index types:

| Index | Speed | Recall | Memory | When to use |
|---|---|---|---|---|
| Flat (exact) | Slow | 100% | Full | <100K vectors or ground truth |
| IVF + Flat | Medium | 95–99% | Full | 100K–10M, needs recall |
| HNSW | Fast | 95–99% | Full + graph | Latency-sensitive, fits in RAM |
| IVF + PQ | Fast | 85–95% | Compressed | >10M vectors, memory constrained |
04 — Best Algorithm

HNSW: The Dominant ANN Algorithm

HNSW (Hierarchical Navigable Small World) is a graph-based ANN algorithm that's become the industry standard. It builds a multi-layer graph where each node connects to up to M neighbors.

How it works: Start at the top layer (which has very few nodes). Greedily navigate downward, always moving toward the query. When you reach the bottom layer, you have a list of nearby candidates. The ef parameter controls the search quality: higher ef = more candidates considered = better recall but slower.
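The bottom-layer walk can be sketched as a best-first search with a bounded result heap, which is where the ef parameter lives. A toy single-layer version (real HNSW adds the layer hierarchy and neighbor-selection heuristics; all names here are illustrative):

```python
import heapq

def ef_search(graph, vectors, entry, query, ef=8):
    """Best-first search over one graph layer, keeping the ef closest nodes."""
    dist = lambda i: sum((a - b) ** 2 for a, b in zip(vectors[i], query))
    visited = {entry}
    frontier = [(dist(entry), entry)]    # min-heap: closest candidate first
    results = [(-dist(entry), entry)]    # max-heap: worst kept node on top
    while frontier:
        d, node = heapq.heappop(frontier)
        if d > -results[0][0]:           # frontier can no longer improve results
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(nb)
            if len(results) < ef or d_nb < -results[0][0]:
                heapq.heappush(frontier, (d_nb, nb))
                heapq.heappush(results, (-d_nb, nb))
                if len(results) > ef:
                    heapq.heappop(results)   # drop the current worst
    return sorted((-d, n) for d, n in results)   # (distance, node), closest first
```

Raising ef widens the result heap, so the search abandons fewer paths: more candidates considered, better recall, more distance computations.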

Parameters:

| Use case | M | efConstruction | efSearch | Recall |
|---|---|---|---|---|
| High speed, moderate recall | 16 | 100 | 32 | ~95% |
| Balanced | 32 | 200 | 64 | ~98% |
| High recall, moderate speed | 64 | 400 | 128 | ~99.5% |
| Maximum recall | 128 | 800 | 256 | ~99.9% |
Industry default: HNSW is supported natively in pgvector (v0.5+), Qdrant, Weaviate, Chroma, and most vector databases. Use M=32, efConstruction=200, efSearch=64 as your starting point.
05 — Combined

Hybrid Search: Dense + Sparse

Run BM25 and dense retrieval separately, then merge the ranked lists. This covers both vocabulary mismatch (BM25 handles exact terms) and semantic drift (dense handles paraphrases).

Why hybrid works: BM25 is a specialist — it excels at finding exact keywords. Dense retrieval is a generalist — it understands meaning but can miss specific terms. By combining both, you get the best of both worlds.

Merging the lists: You can't just average scores — BM25 scores (0 to ~20) and cosine similarity (0 to 1) are on different scales. The standard solution is Reciprocal Rank Fusion (RRF).

Reciprocal Rank Fusion (RRF)

For each document, compute: RRF_score = sum(1/(k + rank_i)) across all retrieval systems, where k=60 is a constant. No score normalization needed — the formula handles different scales automatically.

```python
def reciprocal_rank_fusion(rankings: list[list], k: int = 60) -> list:
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_results = ["doc_A", "doc_B", "doc_C", "doc_D"]   # from vector search
sparse_results = ["doc_C", "doc_E", "doc_A", "doc_F"]  # from BM25

fused = reciprocal_rank_fusion([dense_results, sparse_results])
# Result: doc_A (both), doc_C (both), ...
```

When hybrid outperforms single methods:

| Query type | Dense | BM25 | Hybrid |
|---|---|---|---|
| Semantic paraphrase ("fast runner" vs "quick athlete") | ✓ | ✗ | ✓ |
| Exact product code ("SKU-12345") | ✗ | ✓ | ✓ |
| Domain acronym ("RLHF", "RAG") | ✗ | ✓ | ✓ |
| Multi-concept question | Partial | ✗ | ✓ |
Safe default for production: Hybrid search with RRF is the production default. BM25 is cheap to run in parallel with dense search, and the fusion step is trivial. You get coverage for both semantic drift and vocabulary mismatch.
06 — Storage

Vector Databases

You can use FAISS directly in Python, but production systems need durability, concurrent access, and filtering. Enter vector databases.

| DB | Index support | Filtering | Cloud | Open-source |
|---|---|---|---|---|
| pgvector | HNSW, IVFFlat | Full SQL | Via Postgres | Yes |
| Qdrant | HNSW | Rich payload filters | Yes | Yes |
| Weaviate | HNSW | GraphQL + property | Yes | Yes |
| Chroma | HNSW | Metadata filter | Self-host | Yes |
| Pinecone | Proprietary | Metadata | Yes (managed) | No |
| Milvus | HNSW, IVF, DiskANN | Rich | Yes | Yes |
| Redis | HNSW, Flat | RediSearch filter | Yes | Yes |

pgvector is the pragmatic choice: If you're already on Postgres, start here. You get SQL transactions, query planner, backups, and all the tooling you already know. HNSW and IVF indices handle most workloads. Move to a dedicated vector DB only when you need >10M vectors or <5ms P99 latency.

⚠️ Startup advice: Don't pick a vector database based on features — they all do similar things. Pick based on what you already use. If you're on Postgres, use pgvector. If you need cloud-native, use Qdrant or Pinecone. The switching cost is low early.
07 — Operations

Scaling and Production Patterns

Moving beyond a single machine to production introduces new challenges:

Sharding: Partition vectors across nodes by ID range or hash. Each shard handles its own ANN search. A scatter-gather layer combines results from all shards. This scales linearly with the number of machines.
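The gather step is just a k-way merge of per-shard results. A sketch, assuming each shard returns (score, doc_id) pairs with higher score = better (names are illustrative):

```python
import heapq

def gather(shard_results, k):
    """Merge each shard's local top-k into a global top-k by score."""
    all_hits = (hit for shard in shard_results for hit in shard)
    return heapq.nlargest(k, all_hits)

shard_a = [(0.92, "doc_3"), (0.80, "doc_7")]
shard_b = [(0.95, "doc_9"), (0.60, "doc_1")]
print(gather([shard_a, shard_b], k=2))   # [(0.95, 'doc_9'), (0.92, 'doc_3')]
```

Each shard must return a full local top-k, because in the worst case every global winner lives on one shard.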

Disk-based indices: DiskANN (Microsoft) keeps its graph index (Vamana, a close relative of HNSW) on SSD and loads neighbors on demand. This enables 1B+ vectors without proportional RAM. Latency is somewhat higher (~10–20ms instead of 1–5ms), but you get massive scale.

Incremental indexing: Most ANN structures degrade as vectors are added and deleted, so periodic rebuilds are needed to keep recall high. Workaround: maintain two indices. The main index holds production vectors; a hot buffer holds new vectors. Every night, merge them into a fresh main index. Queries hit both and see combined results.
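The main-plus-buffer pattern can be sketched with a brute-force scan standing in for the ANN index (illustrative class; a real deployment would wrap FAISS or pgvector here):

```python
import numpy as np

class DualIndex:
    """Main index + hot buffer; queries see both, a nightly job merges them."""
    def __init__(self, dim):
        self.main = np.empty((0, dim), dtype=np.float32)   # rebuilt nightly
        self.buffer = []                                   # vectors since last merge

    def add(self, vec):
        # new vectors land in the buffer, visible immediately
        self.buffer.append(np.asarray(vec, dtype=np.float32))

    def merge(self):
        # the nightly job: rebuild main with buffered vectors folded in
        if self.buffer:
            self.main = np.vstack([self.main] + self.buffer)
            self.buffer = []

    def search(self, query, k):
        # query both main and buffer, return combined top-k
        pool = self.main if not self.buffer else np.vstack([self.main] + self.buffer)
        sims = pool @ np.asarray(query, dtype=np.float32)
        top = np.argsort(-sims)[:k]
        return top, sims[top]
```

The key property: a freshly added vector is searchable before any merge runs, at the cost of a small brute-force scan over the buffer.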

Filtering: "Find semantically similar documents WHERE category='finance' AND date>2024-01-01". Three approaches: (1) pre-filter then search (fast but recall drops), (2) post-filter (exact but slow — you might need to search top-1000 to get top-10 after filtering), (3) filtered search inside the index (best; engines such as Qdrant and Weaviate support this natively).
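The pre-filter vs post-filter tradeoff can be sketched over a brute-force scorer (illustrative helpers; `predicate` stands for any metadata test, and with a real ANN index the pre-filter variant is where recall suffers):

```python
def pre_filter_search(docs, score, predicate, k):
    """Filter first, then rank only survivors (fast; with an ANN index
    the filtered set breaks graph connectivity, so recall can drop)."""
    allowed = [d for d in docs if predicate(d)]
    return sorted(allowed, key=score, reverse=True)[:k]

def post_filter_search(docs, score, predicate, k, overfetch=100):
    """Rank everything, over-fetch, then filter (exact but wasteful)."""
    top = sorted(docs, key=score, reverse=True)[:k * overfetch]
    return [d for d in top if predicate(d)][:k]
```

With a selective filter (say 1% of documents match), post-filtering needs an overfetch of roughly 100x to reliably fill the top-k, which is the hidden cost approach (2) pays.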

Tools and libraries:

- FAISS (vector search): open-source library from Meta for ANN search; supports IndexFlat, IVF, HNSW, PQ
- pgvector (vector DB): PostgreSQL extension with HNSW and IVF, full SQL support
- Qdrant (vector DB): cloud-native vector database with HNSW, rich filtering, managed service
- Weaviate (vector DB): GraphQL API, HNSW index, strong ecosystem for ML pipelines
- Chroma (vector DB): lightweight, HNSW-backed, good for small-to-medium RAG systems
- Pinecone (vector DB): fully managed, serverless, best for teams that want zero ops
- Milvus (vector DB): enterprise-grade, supports DiskANN, handles 1B+ vectors
- rank_bm25 (sparse search): Python library for BM25 ranking; lightweight, no server needed
💡 Architecture pattern: (1) Dense retrieval via ANN (fast), (2) BM25 in parallel (cheap), (3) RRF fusion, (4) optional cross-encoder reranking. This gives you semantic + keyword coverage, sub-100ms latency, and 98%+ recall.