RAG · Retrieval

Post-Retrieval Processing

What happens between retrieval and generation — reranking, context compression, and fusion strategies

5 techniques
8 sections
Python first
Contents
  1. Post-retrieval gap
  2. Techniques overview
  3. Cross-encoder reranking
  4. Context compression
  5. Reciprocal rank fusion
  6. Context window management
  7. Evaluation metrics
  8. Tools & resources
01 — The Problem

The Post-Retrieval Gap

Retrieval gives you candidates; it doesn't guarantee they're the best ones for generation. A dense retriever may return semantically similar documents that are actually tangential. Sparse (BM25) retrievers excel at keyword matching but miss semantic nuance. And the lost-in-the-middle effect (Liu et al. 2023) shows that LLMs attend worst to content in the middle of their context — best at the start and end.

Post-retrieval processing closes this gap: rerank for relevance, compress to remove noise, reorder to put critical context at the edges.

💡 Retrieval gets candidates; post-retrieval makes the final cut. Top-k retrieved chunks ≠ best chunks for generation. A second pass adds precision.

Key Distinctions

Relevance vs Faithfulness: A chunk may be relevant to the query but contradict other retrieved chunks. Reranking optimizes for relevance; compression and ordering handle consistency.

02 — Overview

Post-Retrieval Techniques

Each technique addresses a different bottleneck:

Technique               | Purpose                | Latency added | Quality gain | Complexity
Cross-encoder reranking | Precision boost        | +50–200 ms    | High         | Medium
Context compression     | Reduce noise/tokens    | +100–500 ms   | Medium       | Medium
RRF fusion              | Combine sparse + dense | +5 ms         | Medium       | Low
Relevance filtering     | Remove irrelevant hits | +5 ms         | Low–medium   | Low
Ordering optimisation   | Reduce lost-in-middle  | +1 ms         | Medium       | Low
03 — Precision

Cross-Encoder Reranking

Bi-encoder: Embed query and documents separately, compute similarity via dot product. Fast at retrieval time. But embeddings ignore document context — similarity is shallow.

Cross-encoder: Pass query + document as a pair to the model. The model attends to both, producing a relevance score. Slower (no pre-computed embeddings) but more accurate for reranking.

Popular Cross-Encoder Models

  • cross-encoder/ms-marco-MiniLM-L-6-v2 (sentence-transformers): small, fast baseline
  • BAAI/bge-reranker-base and bge-reranker-large: open-source, strong quality
  • Cohere rerank-english-v2.0: managed API (used below)

Using Cohere Rerank

from cohere import Client

co = Client(api_key="...")

# Retrieve 10 candidates with a dense retriever
retrieved = [
    "Document A content...",
    "Document B content...",
    # ... 8 more
]

# Rerank
results = co.rerank(
    query="How do I train a language model?",
    documents=retrieved,
    top_n=3,  # Return top 3
    model="rerank-english-v2.0",
)

# Results ordered by relevance score
for result in results:
    print(f"Index: {result.index}, Score: {result.relevance_score}")
    print(retrieved[result.index])

Score Calibration

Reranker scores aren't directly comparable across queries. A score of 0.8 on one query doesn't mean the same relevance as 0.8 on another. Use scores only to rank within a query, not across queries.
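One way to enforce this discipline is to discard raw scores early and work with within-query ranks. A minimal sketch (`scores_to_ranks` is a hypothetical helper, plain Python):

```python
def scores_to_ranks(scores):
    """Convert raw reranker scores to within-query ranks (0 = best).
    Ranks, unlike raw scores, are safe to use in cross-query logic."""
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks
```

Any downstream thresholding or fusion then operates on positions, never on the uncalibrated score values.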

When to Skip Reranking

Reranking adds 50–200ms latency. Skip if: (1) latency budget is tight, (2) retrieval quality is already high, (3) you're doing real-time interactive search (< 100ms required).
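These conditions can be wrapped in a small gate. A sketch with illustrative defaults only (the 150 ms cost and 0.9 score cutoff are assumptions, not recommendations):

```python
def should_rerank(latency_budget_ms: float, top_score: float,
                  rerank_cost_ms: float = 150.0,
                  score_threshold: float = 0.9) -> bool:
    """Heuristic gate: skip reranking when the latency budget cannot
    absorb it, or when first-stage retrieval already looks confident."""
    if latency_budget_ms < rerank_cost_ms:
        return False          # budget too tight for a second pass
    if top_score >= score_threshold:
        return False          # retrieval quality already high
    return True
```

The thresholds should be tuned per deployment; the point is that the skip decision is cheap and can be made per query.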

04 — Efficiency

Context Compression

Retrieved chunks often contain noise — irrelevant sentences, metadata, boilerplate. Compression removes it, keeping only salient content. Two approaches:

Extractive Compression

Select key sentences from each chunk. Fast (no LLM call), but may break narrative flow.

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

def compress_extract(text: str, ratio: float = 0.5) -> str:
    """Keep roughly the top `ratio` of sentences, ranked by TextRank."""
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = TextRankSummarizer()
    n_sentences = max(1, int(len(parser.document.sentences) * ratio))
    summary = summarizer(parser.document, n_sentences)
    return " ".join(str(s) for s in summary)

Token-Level Compression with LLMLingua

LLMLingua uses targeted token pruning. It identifies tokens that are least important to the query, removes them, and keeps the compressed text coherent.

from llmlingua import PromptCompressor

# PromptCompressor loads a small LM to score token importance
# (pass model_name=... to choose a lighter model than the default)
compressor = PromptCompressor()

context = "This is a long document that contains..."
query = "What is the main topic?"

# Compress toward an absolute token target
result = compressor.compress_prompt(
    context,
    question=query,
    target_token=150,
)

print(f"Original: {result['origin_tokens']} tokens")
print(f"Compressed: {result['compressed_tokens']} tokens")
print(result["compressed_prompt"])

LongLLMLingua for Very Long Contexts

For documents >10K tokens, split into chunks, compress each, then summarize. Avoids exceeding model context limits during compression.
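The split-compress-join step can be sketched independently of any particular compressor; `compress_long` is a hypothetical helper, and `compress_chunk` stands in for a chunk-level call such as LLMLingua:

```python
def compress_long(document: str, compress_chunk, chunk_size: int = 2000) -> str:
    """Hierarchical compression sketch: split a long document into
    word-based chunks, compress each independently, then rejoin.

    `compress_chunk` is any chunk-level compressor; `chunk_size`
    is measured in words here for simplicity, not tokens.
    """
    words = document.split()
    chunks = [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
    return "\n".join(compress_chunk(chunk) for chunk in chunks)
```

When the joined output still exceeds the budget, a second summarization pass over it (as LongLLMLingua's hierarchical approach suggests) shrinks it further.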

Compression Ratio Tradeoffs

🎯 Keep 70% of tokens

  • Minimal quality loss
  • 30% token savings
  • Safe default

Keep 50% of tokens

  • Half the tokens
  • Some context loss
  • Use for long docs
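Compressors that take an absolute token target rather than a ratio need a small conversion first; `target_token_count` is a hypothetical helper:

```python
def target_token_count(original_tokens: int, keep_ratio: float) -> int:
    """Convert a keep-ratio (0.7 = keep 70% of tokens) into the
    absolute token target a compressor expects. Never returns < 1."""
    return max(1, int(original_tokens * keep_ratio))
```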
05 — Fusion

Reciprocal Rank Fusion

Combine results from multiple retrievers (dense, sparse, metadata search) using Reciprocal Rank Fusion (RRF). Each retriever votes on relevance; documents ranked high by multiple retrievers bubble to the top.

RRF Formula

score(d) = Σ 1/(k + rank_i)

where k=60 (constant), rank_i = position in retriever i's results

Python Implementation

def reciprocal_rank_fusion(results_list, k=60):
    """
    results_list: list of [doc_ids] from each retriever
    Returns: list of (doc_id, RRF score) pairs, best first
    """
    rrf_scores = {}
    for results in results_list:
        for rank, doc_id in enumerate(results, start=1):
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k + rank)
    # Sort by score, descending
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

# Example: combine BM25 + dense retriever + metadata search
bm25_results = ["doc_5", "doc_12", "doc_3", ...]
dense_results = ["doc_3", "doc_9", "doc_5", ...]
metadata_results = ["doc_9", "doc_1", "doc_12", ...]

fused = reciprocal_rank_fusion(
    [bm25_results, dense_results, metadata_results]
)

# Result: [(doc_id, score), ...]
for doc_id, score in fused[:5]:
    print(f"{doc_id}: {score:.4f}")

Advantages

  • No score calibration needed: RRF uses only ranks, so retrievers with incomparable score scales combine cleanly.
  • A single constant (k=60) that works well across datasets.
  • Negligible cost: a few dictionary updates per result list.

06 — Optimization

Context Window Management

Your token budget is fixed: query tokens + history + context + response. Allocate wisely.

The Lost-in-the-Middle Effect

LLMs attend least to content in the middle of context windows. Most important information should be at the start (right after the query) or at the very end, just before the generation prompt.

Chunk Ordering Strategy

  • Start (best): most relevant 1–2 chunks
  • Middle (worst): least critical supporting docs
  • End (strong): second-most relevant chunks

def arrange_context(chunks, scores):
    """Order chunks to minimize lost-in-the-middle."""
    # Sort by score, best first
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    if len(ranked) == 0:
        return []
    if len(ranked) == 1:
        return [ranked[0][0]]
    best = ranked[0][0]
    rest = [chunk for chunk, _ in ranked[1:]]  # still descending by score
    # Put the best chunk first, the weakest in the middle,
    # and the second-best at the very end
    half = len(rest) // 2
    middle = rest[half:]   # lower-scored chunks
    end = rest[:half]      # higher-scored chunks, descending
    end.reverse()          # ascending, so second-best lands last
    return [best] + middle + end

# Usage
context_string = "\n\n".join(arrange_context(
    chunks=retrieved_docs,
    scores=rerank_scores,
))

Metadata Injection

Prepend each chunk with metadata: document title, section, date. Helps the model understand context and filter:

[Document: "AI Safety" | Section: "Alignment" | Date: 2026-03-24]
The alignment problem refers to ensuring AI systems behave according to human values...
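A minimal formatter for this header (hypothetical helper, matching the bracketed style above):

```python
def with_metadata(chunk: str, title: str, section: str, date: str) -> str:
    """Prepend a bracketed metadata header to a chunk."""
    header = f'[Document: "{title}" | Section: "{section}" | Date: {date}]'
    return f"{header}\n{chunk}"
```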

Token Budget Allocation

Example for 4K token limit:
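One plausible split, with illustrative numbers only (the actual shares depend on the model, prompt, and workload):

```python
# Illustrative split of a 4,096-token budget (all numbers are assumptions)
TOKEN_BUDGET = {
    "system_prompt": 300,       # instructions, persona
    "query_and_history": 500,   # user turn plus recent history
    "retrieved_context": 2500,  # post-retrieval chunks
    "response_reserve": 796,    # room left for the answer
}

assert sum(TOKEN_BUDGET.values()) == 4096
```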

max_context_tokens Parameter

Set dynamically based on query complexity:

def estimate_tokens(query: str) -> int:
    """Estimate query complexity; allocate context tokens."""
    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")
    query_tokens = len(enc.encode(query))
    # Simple queries get less context (faster inference);
    # complex queries get more (better coverage)
    if query_tokens < 20:
        return 1500   # Short, likely FAQ
    elif query_tokens < 50:
        return 2500   # Medium complexity
    else:
        return 4000   # Deep research question

max_ctx = estimate_tokens("What is machine learning?")
07 — Metrics

Evaluation

Measure the impact of post-retrieval processing on RAG quality.

Ranking Metrics: MRR@k, NDCG@k

MRR (Mean Reciprocal Rank): Reciprocal of the rank of the first relevant result, averaged over queries. Rewards putting relevant docs high.

def mrr(retrieved_docs, relevant_doc_ids, k=5):
    """Mean reciprocal rank@k."""
    for rank, doc_id in enumerate(retrieved_docs[:k], start=1):
        if doc_id in relevant_doc_ids:
            return 1.0 / rank
    return 0.0

# Example: relevant doc at position 2
mrr(["doc_5", "doc_3", "doc_9", ...], ["doc_3"], k=5)  # Returns 1/2 = 0.5

NDCG (Normalized Discounted Cumulative Gain): Discounts lower-ranked relevant docs; normalized against perfect ranking.

from math import log2

def ndcg(retrieved_scores, relevant_doc_ids, k=5):
    """NDCG@k: penalizes relevant docs appearing late.
    retrieved_scores: list of (doc_id, score) pairs, best first."""
    dcg = 0.0
    for rank, (doc_id, score) in enumerate(retrieved_scores[:k], start=1):
        if doc_id in relevant_doc_ids:
            dcg += 1.0 / log2(rank + 1)
    # Ideal DCG: all relevant docs ranked first
    idcg = sum(1.0 / log2(i + 1)
               for i in range(1, min(k, len(relevant_doc_ids)) + 1))
    return dcg / idcg if idcg > 0 else 0.0

Faithfulness Score

After compression, evaluate if compressed context still supports the answer:

def check_faithfulness(compressed_context: str, original_context: str, answer: str):
    """Ask an LLM whether the answer is faithful to both versions.
    `llm` is a placeholder for any chat-completion call."""
    prompt = f"""
Original context: {original_context}
Compressed context: {compressed_context}
Generated answer: {answer}

Is the answer faithfully supported by the compressed context?
Answer: yes/no
Confidence: 0-100
"""
    response = llm(prompt)
    return response

Ablation Study

Measure incremental gains from each post-retrieval step:

Results on benchmark: answer_relevance_f1

  Baseline (retrieval only):            0.72
  + Reranking:                          0.78  (+6 points)
  + Reranking + Compression:            0.80  (+2 points)
  + Reranking + Compression + Reorder:  0.82  (+2 points)

Conclusion: Reranking dominates; compression and ordering add marginal but consistent gains.
08 — Ecosystem

Tools & Resources

Post-Retrieval Processing Stack

API Service
Cohere Rerank
Managed cross-encoder reranking; lowest latency; production-ready
Model
BGE Reranker
Open-source BERT cross-encoder; self-hosted; good quality
Model
FlashRank
Lightweight local reranker; runs on CPU; fast inference
Library
LLMLingua
Token-level context compression; selective pruning; preserves meaning
Library
LongLLMLingua
Compression for very long contexts; hierarchical summarization
Framework
LangChain
RAG pipeline orchestration; integrations with retrievers, rerankers, compression
Framework
LlamaIndex
Document indexing and retrieval; post-retrieval node processing
Evaluation
RAGAS
RAG evaluation suite; faithfulness, relevance, answer quality metrics
Papers & Research
  • Paper Jiang, H. et al. (2023). LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. arXiv:2310.05736 — arxiv.org ↗
  • Paper Liu, N. F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172 — arxiv.org ↗