Retrieval gives you candidates; it doesn't guarantee they're the best ones for generation. A dense retriever may return semantically similar documents that are actually tangential. Sparse (BM25) retrievers excel at keyword matching but miss semantic nuance. And the lost-in-the-middle effect (Liu et al. 2023) shows that LLMs attend worst to content in the middle of their context — best at the start and end.
Post-retrieval processing closes this gap: rerank for relevance, compress to remove noise, reorder to put critical context at the edges.
💡Retrieval gets candidates; post-retrieval makes the final cut. Top-k retrieved chunks ≠ best chunks for generation. A second pass adds precision.
Key Distinctions
Relevance vs Faithfulness: A chunk may be relevant to the query but contradict other retrieved chunks. Reranking optimizes for relevance; compression and ordering handle consistency.
02 — Overview
Post-Retrieval Techniques
Each technique addresses a different bottleneck:
| Technique | Purpose | Latency added | Quality gain | Complexity |
|---|---|---|---|---|
| Cross-encoder reranking | Precision boost | +50–200ms | High | Medium |
| Context compression | Reduce noise/tokens | +100–500ms | Medium | Medium |
| RRF fusion | Combine sparse + dense | +5ms | Medium | Low |
| Relevance filtering | Remove irrelevant chunks | +5ms | Low–medium | Low |
| Ordering optimisation | Reduce lost-in-the-middle | +1ms | Medium | Low |
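Relevance filtering is the cheapest row in the table and needs almost no machinery. A minimal sketch, assuming retrieval already produced parallel lists of chunks and scores, with a hypothetical threshold of 0.3:

```python
def filter_relevant(chunks, scores, threshold=0.3):
    """Drop chunks whose retrieval score falls below the threshold."""
    return [c for c, s in zip(chunks, scores) if s >= threshold]

kept = filter_relevant(
    chunks=["intro boilerplate", "on-topic passage", "footer text"],
    scores=[0.12, 0.91, 0.05],
)
# Only the on-topic passage survives
```

The threshold is corpus- and retriever-specific; tune it on a labelled sample rather than reusing a value across systems.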
03 — Precision
Cross-Encoder Reranking
Bi-encoder: Embed query and documents separately, compute similarity via dot product. Fast at retrieval time. But embeddings ignore document context — similarity is shallow.
Cross-encoder: Pass query + document as a pair to the model. The model attends to both, producing a relevance score. Slower (no pre-computed embeddings) but more accurate for reranking.
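The structural difference can be shown with a toy sketch (not a real model; the "encoders" here are crude stand-ins): bi-encoders embed documents once offline and score with a dot product, while cross-encoders must score every (query, document) pair at query time.

```python
# Toy stand-in encoder: a 3-dim bag-of-feature vector, NOT a real embedding model
def embed(text):
    return [len(text), text.count("model"), text.count("train")]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

docs = ["how to train a model", "cooking pasta", "model training tips"]

# Bi-encoder: document vectors are precomputed once and stored in the index
doc_vecs = [embed(d) for d in docs]

def bi_score(query):
    q = embed(query)                       # one encode per query...
    return [dot(q, v) for v in doc_vecs]   # ...then one cheap dot product per doc

# Cross-encoder: each (query, doc) pair is scored jointly at query time.
# A real cross-encoder attends across the concatenated pair; this fake
# just counts shared tokens to mimic a joint score.
def cross_score(query, doc):
    return len(set(query.split()) & set(doc.split()))

pair_scores = [cross_score("train a model", d) for d in docs]
```

The point of the sketch: `doc_vecs` never changes per query, which is why bi-encoders scale to millions of documents, while the cross-encoder loop grows with the candidate count, which is why it is used only to rerank a small top-k.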
Popular Cross-Encoder Models
Cohere Rerank API: Managed service, lowest latency, best for production.
BGE Reranker: Open-source BERT-based model; good quality, self-hosted.
FlashRank: Lightweight local model; fast inference on CPU.
Using Cohere Rerank
```python
from cohere import Client

co = Client(api_key="...")

# Retrieve 10 candidates with dense retriever
retrieved = [
    "Document A content...",
    "Document B content...",
    # ... 8 more
]

# Rerank
results = co.rerank(
    query="How do I train a language model?",
    documents=retrieved,
    top_n=3,  # Return top 3
    model="rerank-english-v2.0",
)

# Results ordered by relevance score
for result in results.results:
    print(f"Index: {result.index}, Score: {result.relevance_score}")
    print(retrieved[result.index])
```
Score Calibration
Reranker scores aren't directly comparable across queries. A score of 0.8 on one query doesn't mean the same relevance as 0.8 on another. Use scores only to rank within a query, not across queries.
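Because raw scores only order results within one query, convert them to ranks before any cross-query aggregation. A minimal sketch:

```python
def to_ranks(scored_docs):
    """Map doc_id -> rank (1 = best) within a single query's results."""
    ordered = sorted(scored_docs, key=lambda x: x[1], reverse=True)
    return {doc_id: rank for rank, (doc_id, _) in enumerate(ordered, start=1)}

ranks = to_ranks([("doc_a", 0.81), ("doc_b", 0.93), ("doc_c", 0.40)])
# {"doc_b": 1, "doc_a": 2, "doc_c": 3}
```

Rank-based representations like this are also what makes fusion methods such as RRF robust to incomparable score scales.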
04 — Compression
Context Compression
Retrieved chunks often contain noise — irrelevant sentences, metadata, boilerplate. Compression removes it, keeping only salient content. Two approaches:
Extractive Compression
Select key sentences from each chunk. Fast (no LLM call), but may break narrative flow.
```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

def compress_extract(text: str, ratio: float = 0.5) -> str:
    """Keep roughly the top `ratio` of sentences, ranked by TextRank."""
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = TextRankSummarizer()
    n_sentences = max(1, int(len(parser.document.sentences) * ratio))
    summary = summarizer(parser.document, sentences_count=n_sentences)
    return " ".join(str(s) for s in summary)
```
Abstractive Compression with LLMLingua
LLMLingua uses targeted token pruning. It identifies tokens that are least important to the query, removes them, and keeps the compressed text coherent.
```python
from llmlingua import PromptCompressor

# LLMLingua scores token importance with a small local LM
# (the package's default is a Llama-2-7B checkpoint)
llm_lingua = PromptCompressor()

context = "This is a long document that contains..."
query = "What is the main topic?"

# Compress toward a target token count
result = llm_lingua.compress_prompt(
    context,
    question=query,
    target_token=150,
    iterative_size=200,
)
compressed = result["compressed_prompt"]

print(f"Original: {len(context.split())} words")
print(f"Compressed: {len(compressed.split())} words")
print(compressed)
```
LongLLMLingua for Very Long Contexts
For documents beyond ~10K tokens, LongLLMLingua adds question-aware coarse-to-fine compression: it first ranks whole passages against the query and drops the least useful, then prunes tokens within the survivors. This keeps the compression step itself from exceeding model context limits.
Compression Ratio Tradeoffs
🎯 Retain 70% of tokens: ~30% token savings with minimal quality loss. Safe default.
⚡ Retain 50% of tokens: half the tokens, but some context loss. Use for very long documents.
05 — Fusion
Reciprocal Rank Fusion
Combine results from multiple retrievers (dense, sparse, metadata search) using Reciprocal Rank Fusion (RRF). Each retriever votes on relevance; documents ranked high by multiple retrievers bubble to the top.
RRF Formula
score(d) = Σᵢ 1/(k + rankᵢ(d))
where k = 60 (a smoothing constant) and rankᵢ(d) is d's position in retriever i's results
Python Implementation
```python
def reciprocal_rank_fusion(results_list, k=60):
    """
    results_list: list of [doc_ids] from each retriever
    Returns: list of (doc_id, RRF score) tuples, best first
    """
    rrf_scores = {}
    for results in results_list:
        for rank, doc_id in enumerate(results, start=1):
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k + rank)
    # Sort by score, descending
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

# Example: combine BM25 + dense retriever + metadata search
bm25_results = ["doc_5", "doc_12", "doc_3", ...]
dense_results = ["doc_3", "doc_9", "doc_5", ...]
metadata_results = ["doc_9", "doc_1", "doc_12", ...]

fused = reciprocal_rank_fusion(
    [bm25_results, dense_results, metadata_results]
)

# Result: [(doc_id, score), ...]
for doc_id, score in fused[:5]:
    print(f"{doc_id}: {score:.4f}")
```
Advantages
No re-retrieval needed; combines existing results
Latency: negligible (~5ms)
Works with any number of retrievers
Empirically improves over single retrievers
06 — Optimization
Context Window Management
Your token budget is fixed: query tokens + history + context + response. Allocate wisely.
The Lost-in-the-Middle Effect
LLMs attend least to content in the middle of context windows. Most important information should be at the start (right after the query) or at the very end, just before the generation prompt.
Chunk Ordering Strategy
Best position (start): most relevant 1–2 chunks.
Middle (worst): least critical supporting docs.
End (strong): second-most relevant chunks.
```python
def arrange_context(chunks, scores):
    """Order chunks to mitigate the lost-in-the-middle effect."""
    # Sort by score, best first
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    if len(ranked) == 0:
        return []
    if len(ranked) == 1:
        return [ranked[0][0]]
    best = ranked[0][0]
    rest = [chunk for chunk, _ in ranked[1:]]  # still descending by score
    half = len(rest) // 2
    middle = rest[half:]   # lowest-scored chunks sit in the middle
    end = rest[:half]
    end.reverse()          # ascending, so the second-best chunk lands last
    return [best] + middle + end

# Usage
context_string = "\n\n".join(arrange_context(
    chunks=retrieved_docs,
    scores=rerank_scores,
))
```
Metadata Injection
Prepend each chunk with metadata: document title, section, date. Helps the model understand context and filter:
[Document: "AI Safety" | Section: "Alignment" | Date: 2026-03-24]
The alignment problem refers to ensuring AI systems behave
according to human values...
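A minimal helper that produces the header format above (the field names are illustrative, not a standard):

```python
def inject_metadata(chunk_text, title, section, date):
    """Prepend a one-line metadata header to a chunk."""
    header = f'[Document: "{title}" | Section: "{section}" | Date: {date}]'
    return f"{header}\n{chunk_text}"

print(inject_metadata(
    "The alignment problem refers to ensuring AI systems behave "
    "according to human values...",
    title="AI Safety", section="Alignment", date="2026-03-24",
))
```

Note that headers consume context tokens, so count them against the retrieved-context budget.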
Token Budget Allocation
Example for 4K token limit:
Query + system: 300 tokens
Conversation history: 500 tokens
Retrieved context: 2000 tokens (5–10 chunks)
Response space (reserved): 1200 tokens
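The allocation above is just arithmetic on a fixed budget; a sketch that derives the context allowance from the other reservations (the defaults mirror the 4K example):

```python
def context_budget(total=4000, system=300, history=500, response=1200):
    """Tokens left for retrieved context after fixed reservations."""
    remaining = total - system - history - response
    if remaining <= 0:
        raise ValueError("Reservations exceed the total token budget")
    return remaining

print(context_budget())  # 2000 tokens for retrieved context
```

Guarding against a non-positive remainder matters in practice: long conversation histories can silently crowd out retrieved context.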
max_context_tokens Parameter
Set dynamically based on query complexity:
def estimate_tokens(query: str) -> int:
"""Estimate query complexity; allocate context tokens."""
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
query_tokens = len(enc.encode(query))
# Simple queries get less context (faster inference)
# Complex queries get more (better coverage)
if query_tokens < 20:
return 1500 # Short, likely FAQ
elif query_tokens < 50:
return 2500 # Medium complexity
else:
return 4000 # Deep research question
max_ctx = estimate_tokens("What is machine learning?")
07 — Metrics
Evaluation
Measure the impact of post-retrieval processing on RAG quality.
Ranking Metrics: MRR@k, NDCG@k
MRR (Mean Reciprocal Rank): the reciprocal of the rank of the first relevant result, averaged over queries. Rewards putting a relevant doc high.
```python
def mrr(retrieved_docs, relevant_doc_ids, k=5):
    """Reciprocal rank@k for one query (average over queries to get MRR)."""
    for rank, doc_id in enumerate(retrieved_docs[:k], start=1):
        if doc_id in relevant_doc_ids:
            return 1.0 / rank
    return 0.0

# Example: relevant doc at position 2
mrr(["doc_5", "doc_3", "doc_9", ...], ["doc_3"], k=5)
# Returns 1/2 = 0.5
```
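The section header also names NDCG@k, which, unlike MRR, credits every relevant result and discounts it by a logarithm of its position. A minimal binary-relevance sketch:

```python
import math

def ndcg(retrieved_docs, relevant_doc_ids, k=5):
    """NDCG@k with binary relevance (1 if the doc is relevant, else 0)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved_docs[:k], start=1)
        if doc_id in relevant_doc_ids
    )
    # Ideal DCG: all relevant docs packed into the top positions
    n_rel = min(len(relevant_doc_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, n_rel + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Single relevant doc at position 2: DCG = 1/log2(3), IDCG = 1/log2(2)
ndcg(["doc_5", "doc_3", "doc_9"], ["doc_3"], k=5)  # ≈ 0.631
```

Graded relevance (e.g. 0–3 judgments) drops into the same formula by replacing the binary gain with `2**rel - 1`.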