Retrieval

Dense Retrieval

Semantic retrieval using dense vector representations, where queries and documents are mapped to a shared embedding space and nearest-neighbour search finds relevant content.

SECTION 01

Sparse vs dense retrieval

Traditional search (BM25, TF-IDF) is sparse: each document is a bag of words, and retrieval requires exact or partial term overlap. Search for "automobile" and you won't find a document that only uses "car".

Dense retrieval maps text into a continuous vector space where semantically related content lands near each other. "Automobile" and "car" would be close neighbours. This enables synonym handling, paraphrase matching, and cross-lingual retrieval that sparse methods cannot do.

The flip side: dense retrieval sometimes misses obvious keyword matches that sparse retrieval catches trivially. The best production systems combine both — see Hybrid Search.
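The synonym failure mode is easy to demonstrate with a toy bag-of-words scorer (a minimal sketch standing in for BM25/TF-IDF, not a real implementation):

```python
def term_overlap_score(query, doc):
    """Toy sparse scorer: counts shared terms (a stand-in for BM25/TF-IDF)."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms)

docs = [
    "the car would not start this morning",
    "automobile insurance rates rose in 2023",
]
print(term_overlap_score("automobile repair", docs[0]))  # 0: synonym missed
print(term_overlap_score("automobile repair", docs[1]))  # 1: exact term match
```

Any purely term-based scorer gives the "car" document a score of zero for the "automobile" query, which is exactly the gap dense retrieval closes.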

SECTION 02

Bi-encoder architecture

Dense retrieval uses a bi-encoder: the same (or separate) encoder is run independently on the query and on each document. This is crucial for efficiency — documents can be encoded offline and indexed. At query time, only the query is encoded (milliseconds), then nearest-neighbour search runs against the pre-indexed document vectors.

Contrast with a cross-encoder (used for reranking): it processes the query and document together in one forward pass, giving better accuracy but requiring a separate pass for every query-document pair — too slow for initial retrieval over large corpora.
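The cost asymmetry can be made concrete by counting encoder forward passes; corpus and query counts below are assumed for illustration:

```python
# Back-of-envelope: encoder forward passes needed for each architecture.
N_DOCS = 1_000_000     # assumed corpus size
N_QUERIES = 100        # assumed query volume

# Bi-encoder: corpus encoded once offline, plus one pass per query.
bi_passes = N_DOCS + N_QUERIES

# Cross-encoder as first-stage retriever: one pass per (query, doc) pair.
cross_passes = N_QUERIES * N_DOCS

print(bi_passes)       # 1_000_100
print(cross_passes)    # 100_000_000
```

The bi-encoder's document encoding cost is paid once and amortised over every future query; the cross-encoder pays it again on each query.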

SECTION 03

End-to-end dense retrieval pipeline

from sentence_transformers import SentenceTransformer
import numpy as np, faiss

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# --- Indexing phase (run once, offline) ---
documents = [
    "The return window is 30 days from the date of purchase.",
    "Free shipping applies to orders over $50.",
    "We support Visa, Mastercard, and PayPal.",
    "Customer support is available 24/7 via chat.",
]
doc_embeddings = model.encode(documents, normalize_embeddings=True)   # (N, 1024)

# Build FAISS index for fast ANN search
d = doc_embeddings.shape[1]
index = faiss.IndexFlatIP(d)   # Inner Product (= cosine for normalised vectors)
index.add(doc_embeddings.astype(np.float32))
faiss.write_index(index, "docs.index")

# --- Retrieval phase (per query, online) ---
index = faiss.read_index("docs.index")
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "
query = QUERY_PREFIX + "How long do I have to return something?"
q_emb = model.encode([query], normalize_embeddings=True).astype(np.float32)

scores, indices = index.search(q_emb, 3)   # top-3 neighbours (k is positional in older FAISS versions)
for score, idx in zip(scores[0], indices[0]):
    print(f"Score {score:.3f}: {documents[idx]}")
SECTION 04

ANN search strategies

Exact nearest-neighbour search (brute force) is O(N×d) per query — too slow for large indices. Approximate Nearest Neighbour (ANN) trades a small accuracy loss for dramatic speed gains:

Method         Library                  Build time  Query speed   Recall
Flat (exact)   FAISS                    Instant     Slow (O(N))   100%
IVF            FAISS                    Minutes     Fast          ~95%
HNSW           FAISS, Qdrant, Weaviate  Slow        Very fast     ~98%
ScaNN          Google ScaNN             Medium      Very fast     ~98%
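For reference, the exact (Flat) baseline is a single matrix-vector product over all N document vectors. An illustrative numpy sketch with a toy corpus (sizes are assumptions, not the 1024-dim index above):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 64                      # toy corpus for illustration
docs = rng.normal(size=(N, d)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # L2-normalise rows

# A query that is a slightly perturbed copy of document 42.
q = docs[42] + 0.01 * rng.normal(size=d).astype(np.float32)
q /= np.linalg.norm(q)

scores = docs @ q                      # O(N*d): one dot product per document
top3 = np.argsort(-scores)[:3]
print(top3[0])                         # document 42 ranks first
```

This is exactly what IndexFlatIP does internally; ANN indices exist to avoid touching all N rows per query.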

FAISS HNSW example:

d = 1024
index = faiss.IndexHNSWFlat(d, 32)   # 32 connections per node
index.hnsw.efConstruction = 200
index.add(doc_embeddings)
index.hnsw.efSearch = 64   # higher = better recall, slower query
SECTION 05

Chunking strategies for dense retrieval

Chunk size is often the biggest lever in retrieval quality — more than model choice:

Strategy             Chunk size               Pros                           Cons
Fixed token windows  256–512 tokens           Simple, fast                   Can split mid-sentence
Sentence-level       1–3 sentences            Semantic units                 May be too short for context
Sliding window       512 tokens, 128 overlap  No context loss at boundaries  50% more chunks
Recursive paragraph  Variable                 Respects document structure    More complex to implement

Start with a 256-token sliding window (50% overlap, i.e. 128 tokens). If retrieval finds the right section but misses context, increase chunk size. If it finds the wrong section, decrease it.
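A minimal sliding-window chunker over pre-tokenised text (a sketch; a real pipeline would tokenise with the embedding model's own tokeniser rather than split on words):

```python
def sliding_window_chunks(tokens, size=256, overlap=128):
    """Split a token list into overlapping windows (stride = size - overlap)."""
    stride = size - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), stride):
        chunks.append(tokens[start:start + size])
    return chunks

words = ["tok"] * 600                       # stand-in for real tokeniser output
chunks = sliding_window_chunks(words, size=256, overlap=128)
print([len(c) for c in chunks])             # [256, 256, 256, 216]
```

Every token appears in at least one chunk, and tokens near chunk boundaries appear in two, so a sentence split by one window boundary is intact in the neighbouring window.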

SECTION 06

Evaluating retrieval quality

def evaluate_retrieval(queries, ground_truth_ids, retriever_fn, k=5):
    '''
    queries: list of query strings
    ground_truth_ids: list of lists of relevant doc IDs
    retriever_fn: function(query, k) -> list of doc IDs
    '''
    hits_at_k = 0
    mrr_sum = 0.0
    for query, relevant_ids in zip(queries, ground_truth_ids):
        retrieved = retriever_fn(query, k)
        # Hit@k
        if any(r in relevant_ids for r in retrieved):
            hits_at_k += 1
        # MRR
        for rank, doc_id in enumerate(retrieved, 1):
            if doc_id in relevant_ids:
                mrr_sum += 1 / rank
                break
    n = len(queries)
    return {"Hit@k": hits_at_k / n, "MRR": mrr_sum / n}

metrics = evaluate_retrieval(test_queries, test_labels, my_retriever, k=5)
print(metrics)
SECTION 07

Gotchas

Embedding model and chunk size are coupled. Many models have a 512-token limit. Chunks longer than the model's context are truncated — you lose the tail of every long chunk. Always check model.max_seq_length.

Document updates require re-embedding. When a document changes, its chunks must be re-embedded, and raw FAISS indices have no efficient in-place update — deleting or replacing individual vectors generally means rebuilding the index. For frequently updated corpora, use a vector DB with upsert support (Qdrant, Pinecone) rather than raw FAISS.

Normalise before inner product. If you're using inner product (dot product) similarity, embeddings must be L2-normalised. Otherwise high-magnitude vectors dominate regardless of semantic content. Most models support normalize_embeddings=True.
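A two-dimensional illustration of why magnitude must be removed before using inner product as a similarity:

```python
import numpy as np

a = np.array([1.0, 0.0])          # short vector, nearly parallel to the query
b = np.array([10.0, 10.0])        # long vector pointing elsewhere
q = np.array([1.0, 0.1])          # query

def l2norm(v):
    return v / np.linalg.norm(v)

print(a @ q, b @ q)                                   # raw dot: b wins on magnitude
print(l2norm(a) @ l2norm(q), l2norm(b) @ l2norm(q))   # cosine: a wins on direction
```

Without normalisation the longer vector b dominates despite pointing 45 degrees away from the query; after normalisation inner product equals cosine similarity and direction decides.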

SECTION 08

Embedding model selection

The embedding model is the most consequential architectural decision in a dense retrieval system. Model quality varies significantly across domains, and the MTEB leaderboard provides task-specific benchmark scores that help predict in-domain performance. General-purpose sentence transformers like all-MiniLM-L6-v2 are fast but underperform specialized models on technical or scientific text. Larger models like E5-large or GTE-large provide better retrieval quality at the cost of 3–5x higher inference latency and larger embedding storage.

Domain-specific fine-tuning of a base embedding model using contrastive learning on in-domain query-document pairs consistently improves retrieval quality over out-of-the-box models. Even small fine-tuning datasets of 1,000–5,000 positive pairs, generated using an LLM to create synthetic queries for existing documents, produce meaningful improvements on nDCG@10 and recall@k metrics. The fine-tuned model captures domain vocabulary and query phrasing patterns that generic training data under-represents.

Cross-encoder rerankers and bi-encoder retrievers serve different roles and are most effective in combination. The bi-encoder retrieves a candidate set (typically top-50 to top-200) at low latency using approximate nearest neighbor search; the cross-encoder reranks the candidates with higher accuracy but at O(k) latency. This two-stage approach achieves near-cross-encoder quality at near-bi-encoder latency for the final top-k results, making it the standard architecture for production RAG systems where retrieval quality is critical.

Approximate nearest neighbor (ANN) index configuration significantly affects both retrieval latency and recall quality. HNSW (Hierarchical Navigable Small World) graphs are the most commonly used ANN index type, with two key parameters: M (the number of connections per node, typically 16–64) and ef_construction (the search width during index building, typically 100–400). Higher M and ef_construction values produce better recall at the cost of slower index building and larger memory footprint. The ef parameter at query time (ef_search) provides a runtime tradeoff between recall and latency — increasing ef_search from 50 to 200 typically improves recall by 1–5 percentage points at the cost of 2–4x higher query latency.

Batch embedding generation for large document corpora requires attention to throughput optimization. GPU-based embedding inference using sentence-transformers achieves 1,000–5,000 embeddings per second depending on model size and sequence length when using optimal batch sizes (typically 32–256). CPU-based inference on quantized models achieves 100–500 embeddings per second, which is sufficient for real-time re-embedding of updated documents but too slow for initial corpus embedding of millions of documents. Cloud embedding APIs like OpenAI text-embedding-3 provide throughput of thousands of embeddings per minute with rate limits, requiring parallel requests and backoff logic for large-scale indexing jobs.
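The batching-plus-backoff pattern can be sketched as follows; embed_fn is a hypothetical stand-in for whatever embedding call (local model or cloud API) is being rate-limited, and the exception type would be adapted to the real client:

```python
import random
import time

def embed_with_backoff(texts, embed_fn, batch_size=64, max_retries=5):
    """Embed texts in batches, retrying each batch with exponential backoff.

    embed_fn(batch) -> list of embeddings; assumed to raise RuntimeError
    on transient failures such as rate limits.
    """
    out = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        for attempt in range(max_retries):
            try:
                out.extend(embed_fn(batch))
                break
            except RuntimeError:
                if attempt == max_retries - 1:
                    raise                                  # give up after max_retries
                time.sleep(2 ** attempt + random.random()) # jittered backoff
    return out
```

Parallelising batches (e.g. with a bounded thread pool) layers on top of this; the per-batch retry logic stays the same.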

Negative mining strategy for contrastive learning directly affects the quality of fine-tuned retrieval models. Random negatives — sampling documents from the corpus at random as non-relevant examples — produce easy negatives that are already well-separated from the positive in embedding space and provide weak training signal. Hard negatives — selecting the top-k retrieved documents that are not the true positive — provide much stronger training signal because the model must learn to separate genuinely similar but non-relevant documents from the true positive. BM25-mined hard negatives, where BM25 retrieves false positives that share keywords with the query but are not the correct document, are particularly valuable training examples for improving dense retrieval recall.
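The hard-negative selection step itself is a one-liner over ranked retrieval output; the doc ids below are hypothetical:

```python
def mine_hard_negatives(retrieved_ids, positive_ids, n_neg=4):
    """Top-ranked retrieved docs that are not labelled positives make
    the strongest contrastive negatives."""
    return [d for d in retrieved_ids if d not in positive_ids][:n_neg]

retrieved = ["d7", "d3", "d9", "d1", "d5", "d2"]   # ranked retriever output
print(mine_hard_negatives(retrieved, positive_ids={"d3"}))
# ['d7', 'd9', 'd1', 'd5']
```

The same function works whether the ranking comes from BM25 or from the dense model being fine-tuned; mining from both sources and mixing the negatives is a common choice.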

Retrieval recall at different k values determines the upper bound on downstream answer quality in RAG pipelines. If the relevant document is not in the top-k retrieved results, no reranking or generation strategy can recover it. Measuring recall@5, recall@10, and recall@20 on a labeled evaluation set before tuning generation reveals whether retrieval or generation is the binding constraint on end-to-end quality. Teams that focus on generation quality often discover through this analysis that recall@5 is low — say 55%, meaning 45% of queries cannot possibly be answered correctly — making retrieval improvement the higher-leverage intervention.
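Recall@k is a few lines on top of the same input shapes as the Hit@k/MRR harness in SECTION 06 (a sketch; ids are hypothetical):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant docs that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mean_recall_at_k(runs, labels, k):
    """Average recall@k over queries; runs[i] is the ranked ids for query i."""
    return sum(recall_at_k(r, rel, k) for r, rel in zip(runs, labels)) / len(runs)

runs = [["d1", "d4", "d2"], ["d9", "d8", "d7"]]    # retrieved ids per query
labels = [["d2", "d5"], ["d8"]]                    # relevant ids per query
print(mean_recall_at_k(runs, labels, k=3))         # (0.5 + 1.0) / 2 = 0.75
```

Computing this at k = 5, 10, and 20 on the same labelled set shows how much headroom a larger candidate set would give a downstream reranker.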

FAISS index persistence and warm loading are operationally important for dense retrieval services that restart frequently. Saving a FAISS index to disk with faiss.write_index(index, path) and loading it on startup with faiss.read_index(path) restores the full ANN index without recomputing it from scratch. For large indexes (millions of vectors), this saves minutes of rebuild time on each restart. Production services should checkpoint the index to persistent storage after each major update and load from the checkpoint on startup, treating index rebuild as a fallback for corrupted-checkpoint recovery rather than normal operation.