Post-Retrieval

Reranking

A second-stage ranking step that uses a cross-encoder or LLM to reorder the initial retrieval results by relevance, improving precision before passing context to the LLM.

Architecture: cross-encoder
Typical precision gain: +20–40%
Typical reduction: top-20 → top-3

SECTION 01

Why reranking exists

Initial retrieval (BM25 or dense) optimises for recall — finding most of the relevant documents in the top-N results. But it sacrifices precision — position 1 might not be the most relevant document, and irrelevant documents might sneak into the top-5.

For RAG, you pass the top-k retrieved chunks as context to the LLM. Passing 20 chunks is expensive (tokens) and noisy (irrelevant context confuses the model). A reranker acts as a second-stage filter: retrieve 20 candidates cheaply, rerank with a slower but more accurate model, pass only the top-3 to the LLM.

SECTION 02

Bi-encoder vs cross-encoder

Retrieval uses a bi-encoder: query and documents are encoded independently, enabling pre-indexing. Reranking uses a cross-encoder: query and document are concatenated and fed through the model together, so the model can compute fine-grained cross-attention between every query token and every document token.

This joint encoding is far more accurate but O(n) in inference cost — you can't pre-encode documents, so you can only apply it to a small set of candidates (typically top-20 to top-100). That's the two-stage design:

Query → Fast retrieval (top-100) → Slow reranker (top-5) → LLM
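The cost asymmetry can be made concrete with a toy back-of-envelope sketch. This counts model forward passes only (no real models, illustrative numbers, not benchmarks):

```python
# Toy cost model: count model forward passes under each design.

def bi_encoder_passes(n_docs: int, n_queries: int) -> int:
    # Documents are encoded once, offline; each query costs one more pass.
    # Similarity search over the pre-built index is cheap vector math.
    return n_docs + n_queries

def cross_encoder_passes(n_candidates: int, n_queries: int) -> int:
    # Every (query, candidate) pair needs its own joint forward pass.
    return n_queries * n_candidates

# Scoring a 1M-doc corpus directly with a cross-encoder is infeasible...
print(cross_encoder_passes(1_000_000, 100))  # 100,000,000 passes
# ...but reranking only the top-20 retrieval candidates is cheap.
print(cross_encoder_passes(20, 100))         # 2,000 passes
print(bi_encoder_passes(1_000_000, 100))     # 1,000,100 passes, mostly offline
```

The bi-encoder pays its large cost once at index time; the cross-encoder pays per query, which is why it only ever sees a small candidate set.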

SECTION 03

Reranking with Cohere

import cohere

co = cohere.Client("your-api-key")

query = "What is the return policy?"
candidates = [
    "Our return window is 30 days from purchase.",
    "We ship to over 50 countries worldwide.",
    "To return an item, email support@example.com with your order ID.",
    "Free shipping on orders over $50.",
    "Refunds are processed in 5 business days.",
]

# Rerank: returns documents sorted by relevance
response = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=candidates,
    top_n=3   # return only top 3
)

print("Top reranked results:")
for result in response.results:
    # Each result references the original candidate by index
    print(f"Score {result.relevance_score:.4f}: {candidates[result.index]}")

Cohere's reranker is an API call, so no GPU is needed. It scores each (query, document) pair and returns a relevance score between 0 and 1; each result carries the index of the original candidate. Use top_n to return only the results you need.

SECTION 04

Reranking with a local cross-encoder

from sentence_transformers.cross_encoder import CrossEncoder

# Load a cross-encoder model (runs locally, no API cost)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)

query = "What is the return policy?"
candidates = [
    "Our return window is 30 days from purchase.",
    "We ship to over 50 countries worldwide.",
    "To return an item, email support with your order ID.",
    "Free shipping on orders over $50.",
    "Refunds are processed in 5 business days.",
]

# Cross-encoder scores each (query, candidate) pair
pairs = [[query, c] for c in candidates]
scores = reranker.predict(pairs)

# Sort candidates by score, highest first (key avoids comparing texts on ties)
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
print("Top 3 after reranking:")
for score, doc in ranked[:3]:
    print(f"Score {score:.3f}: {doc}")

Other strong local rerankers: BAAI/bge-reranker-large, mixedbread-ai/mxbai-rerank-large-v1.

SECTION 05

LLM reranking

For the highest precision, use the same LLM you're generating with to rerank:

import anthropic, json

client = anthropic.Anthropic()

def llm_rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    numbered = "\n".join(f"{i+1}. {c}" for i, c in enumerate(candidates))
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=256,
        messages=[{"role": "user", "content": f'''
Given this question: "{query}"

Rank these passages from most to least relevant. Return ONLY a JSON array of
1-based indices in order of relevance, e.g. [3, 1, 5, 2, 4].

Passages:
{numbered}
'''}]
    )
    # Assumes the model returns bare JSON as instructed; validate in production
    indices = json.loads(response.content[0].text.strip())
    return [candidates[i - 1] for i in indices[:top_n]]

LLM reranking is the most accurate but slowest and most expensive. Use it when precision is paramount and latency allows.

SECTION 06

Integrating into a RAG pipeline

import anthropic
import cohere
from your_retriever import dense_retrieve  # your retrieval function

client = anthropic.Anthropic()
co = cohere.Client("your-cohere-key")

def rag_with_reranking(query: str) -> str:
    # Step 1: Retrieve top-20 candidates (fast, recall-focused)
    candidates = dense_retrieve(query, k=20)

    # Step 2: Rerank to top-3 (slow, precision-focused)
    reranked = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=[c["text"] for c in candidates],
        top_n=3
    )
    context = "\n\n".join(
        candidates[r.index]["text"] for r in reranked.results
    )

    # Step 3: Generate with focused context
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        system="Answer the question using only the provided context.",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}]
    )
    return response.content[0].text

SECTION 07

Gotchas

Reranker max input length. Cross-encoders have a token limit (typically 512 tokens for the combined query + document). Longer inputs get silently truncated, so relevance signals near the end of a chunk are lost. Either chunk smaller or use a reranker with a longer context window (e.g., BAAI/bge-reranker-v2-m3, which handles much longer sequences).

Retrieve more than you think you need. If you ultimately want top-5 for the LLM, retrieve top-50 to top-100 before reranking. The reranker needs a diverse candidate set to work effectively — reranking top-5 from retrieval produces minimal improvement.

Don't rerank in a tight latency budget. A cross-encoder adds 100–500ms per query depending on hardware. For sub-100ms response requirements, use a smaller cross-encoder model or skip reranking for high-confidence queries.

Reranking method comparison

The choice of reranking method involves tradeoffs between latency, quality, cost, and operational complexity. Cross-encoder models provide the highest relevance quality but add 50–300ms of latency depending on the candidate set size and model size. LLM-based reranking using listwise or pairwise prompts achieves comparable quality to cross-encoders on many tasks but at higher cost and latency. Lightweight cross-encoders like ms-marco-MiniLM provide most of the quality improvement at a fraction of the latency.

Method                    Latency (top-50)   Quality     Cost            Deployment
BM25 only (baseline)      N/A                Medium      Free            Simple
MiniLM cross-encoder      ~50 ms             High        Low (local)     Simple
Cohere Rerank API         ~200 ms            High        Per-call API    Zero infra
LLM listwise rerank       1–3 s              Very high   High (tokens)   Simple
ColBERT late interaction  ~100 ms            Very high   Low (local)     Complex index

The typical production reranking stack combines dense bi-encoder retrieval (top-100 candidates) with a lightweight cross-encoder reranker (top-10 output). This two-stage approach provides a good quality-latency balance: the bi-encoder handles scale efficiently, and the cross-encoder's 50ms overhead is acceptable for most interactive applications. For very latency-sensitive applications, disabling reranking and improving the initial retrieval quality through better embedding models or hybrid search is often more effective than accepting the reranking latency.

Reranking quality evaluation requires a dedicated labeled dataset distinct from the retrieval evaluation dataset, because reranking operates on the already-retrieved candidate set rather than the full document corpus. The key metric for reranker quality is nDCG@k on the reranked output compared to nDCG@k on the pre-reranked candidate set, measuring how much the reranker improves over the initial retrieval ordering. A well-calibrated reranker should improve nDCG@10 by 5–20 percentage points over the initial bi-encoder ranking on held-out queries. If improvement is below 5 points, the initial retrieval quality is likely the binding constraint and resources are better spent improving the retrieval stage.
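A minimal way to run this comparison is to compute nDCG@k on both orderings with graded relevance labels. The helper below is a self-contained sketch; the document IDs and labels are hypothetical, standing in for a real eval set:

```python
import math

def ndcg_at_k(ranked_ids: list[str], relevant: dict[str, int], k: int = 10) -> float:
    """nDCG@k: discounted gain of a ranking versus the ideal ordering."""
    dcg = sum(
        relevant.get(doc_id, 0) / math.log2(i + 2)
        for i, doc_id in enumerate(ranked_ids[:k])
    )
    ideal = sorted(relevant.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Graded labels for one query (2 = highly relevant, 1 = partially relevant)
labels = {"d1": 2, "d4": 1}

before = ndcg_at_k(["d3", "d1", "d2", "d4"], labels)  # bi-encoder order
after = ndcg_at_k(["d1", "d4", "d3", "d2"], labels)   # reranked order
print(f"nDCG@10 before: {before:.3f}, after: {after:.3f}")
```

Averaging the before/after gap across held-out queries gives the percentage-point improvement the paragraph above describes.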

Candidate set size for reranking involves a tradeoff between coverage and latency. Reranking a larger candidate set reduces the probability that the true relevant document was not retrieved in the first place (first-stage recall determines the ceiling on reranking recall), but increases reranking latency linearly with candidate count. The optimal candidate set size depends on first-stage recall at different cutoffs — if the bi-encoder achieves 80% recall at top-20 and 85% at top-50, reranking 50 candidates provides marginal recall improvement at 2.5x the latency cost of reranking 20. Measuring first-stage recall at multiple cutoffs empirically identifies the point of diminishing returns for candidate set expansion.
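Measuring first-stage recall at several cutoffs is a few lines of code. This sketch uses a tiny hypothetical eval set (retrieved ID lists plus gold relevant IDs) to show the shape of the measurement:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant docs that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

# Hypothetical eval set: (retrieved doc IDs in rank order, gold relevant IDs)
eval_set = [
    (["d2", "d9", "d1", "d5", "d7"], {"d1", "d7"}),
    (["d4", "d3", "d8", "d6", "d2"], {"d3", "d5"}),
]

for k in (1, 3, 5):
    avg = sum(recall_at_k(ret, gold, k) for ret, gold in eval_set) / len(eval_set)
    print(f"recall@{k}: {avg:.2f}")
```

Wherever the recall curve flattens is the natural candidate-set size: reranking beyond that cutoff buys latency, not coverage.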

Reranking can be applied selectively to only high-uncertainty queries, reducing average latency while maintaining quality on queries where the initial ranking is unreliable. Confidence scoring on the initial retrieval — using the score gap between the top-1 and top-2 retrieved documents as a proxy for ranking certainty — identifies queries where the initial ordering is ambiguous and reranking is most likely to change the result. Routing only low-confidence queries to the reranker reduces the fraction of requests that incur reranking latency from 100% to 20–40%, with minimal overall quality impact because the reranker provides little benefit for queries where the initial ranking is already confident.
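One simple implementation of this routing is a score-gap heuristic. The threshold below is an illustrative assumption you would tune on held-out queries, not a recommended value:

```python
def needs_rerank(scores: list[float], gap_threshold: float = 0.1) -> bool:
    """Route to the reranker only when the top-1/top-2 retrieval score gap
    is small, i.e. the initial ordering is ambiguous."""
    if len(scores) < 2:
        return False
    top = sorted(scores, reverse=True)
    return (top[0] - top[1]) < gap_threshold

# Confident ranking: large gap between top-1 and top-2, skip the reranker
print(needs_rerank([0.92, 0.55, 0.41]))  # False
# Ambiguous ranking: near-tie at the top, send to the reranker
print(needs_rerank([0.78, 0.76, 0.70]))  # True
```

In the pipeline from Section 06, this check would sit between retrieval and the rerank call, falling back to the raw retrieval order when the gap is large.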

Reranker model fine-tuning on domain-specific relevance data consistently outperforms general-purpose rerankers for specialized applications. A general ms-marco-trained cross-encoder reranker performs well on web-style queries but may underperform on technical documentation, legal text, or scientific literature where query-document relevance patterns differ significantly from web search. Collecting 1,000–5,000 domain-specific query-document relevance labels — using existing user interaction logs, domain expert annotation, or LLM-generated labels — and fine-tuning a cross-encoder on this data typically produces 5–15 percentage point nDCG@10 improvements over the general-purpose reranker on the target domain.

Reranking with listwise LLM prompts — where the full ranked list is presented to the LLM and it returns a reordered list — generally outperforms pointwise scoring (scoring each document independently) because the model can compare documents directly rather than scoring in isolation. The RankGPT approach presents the top-20 retrieved passages in a single prompt and asks the model to rank them by relevance, achieving cross-encoder-level quality at the cost of processing 20 passages worth of input tokens per query. The latency of listwise reranking depends on the total tokens in the 20 passages, typically 3,000–8,000 tokens, making it suitable for applications that prioritize quality over strict latency constraints.