RAG · Retrieval

Post-Retrieval Processing

What happens between retrieval and generation — reranking, context compression, and fusion strategies

5 techniques
8 sections
Python first
Contents
  1. Post-retrieval gap
  2. Techniques overview
  3. Cross-encoder reranking
  4. Context compression
  5. Reciprocal rank fusion
  6. Context window management
  7. Evaluation metrics
  8. Tools & resources
01 — The Problem

The Post-Retrieval Gap

Retrieval gives you candidates; it doesn't guarantee they're the best ones for generation. A dense retriever may return semantically similar documents that are actually tangential. Sparse (BM25) retrievers excel at keyword matching but miss semantic nuance. And the lost-in-the-middle effect (Liu et al. 2023) shows that LLMs attend worst to content in the middle of their context — best at the start and end.

Post-retrieval processing closes this gap: rerank for relevance, compress to remove noise, reorder to put critical context at the edges.

💡 Retrieval gets candidates; post-retrieval makes the final cut. Top-k retrieved chunks ≠ best chunks for generation. A second pass adds precision.

Key Distinctions

Relevance vs Faithfulness: A chunk may be relevant to the query but contradict other retrieved chunks. Reranking optimizes for relevance; compression and ordering handle consistency.

02 — Overview

Post-Retrieval Techniques

Each technique addresses a different bottleneck:

Technique               | Purpose                | Latency added | Quality gain | Complexity
Cross-encoder reranking | Precision boost        | +50–200 ms    | High         | Medium
Context compression     | Reduce noise/tokens    | +100–500 ms   | Medium       | Medium
RRF fusion              | Combine sparse + dense | +5 ms         | Medium       | Low
Relevance filtering     | Remove irrelevant hits | +5 ms         | Low–medium   | Low
Ordering optimisation   | Reduce lost-in-middle  | +1 ms         | Medium       | Low
03 — Precision

Cross-Encoder Reranking

Bi-encoder: Embed query and documents separately, compute similarity via dot product. Fast at retrieval time. But embeddings ignore document context — similarity is shallow.

Cross-encoder: Pass query + document as a pair to the model. The model attends to both, producing a relevance score. Slower (no pre-computed embeddings) but more accurate for reranking.

Popular Cross-Encoder Models

  • cross-encoder/ms-marco-MiniLM-L-6-v2 (sentence-transformers): small, fast baseline
  • BAAI/bge-reranker-base and bge-reranker-large: open-source, strong quality
  • Cohere rerank-english-v2.0: managed API (used below)

Using Cohere Rerank

from cohere import Client

co = Client(api_key="...")

# Retrieve 10 candidates with a dense retriever
retrieved = [
    "Document A content...",
    "Document B content...",
    # ... 8 more
]

# Rerank
results = co.rerank(
    query="How do I train a language model?",
    documents=retrieved,
    top_n=3,  # Return top 3
    model="rerank-english-v2.0",
)

# Results ordered by relevance score
for result in results:
    print(f"Index: {result.index}, Score: {result.relevance_score}")
    print(retrieved[result.index])

Score Calibration

Reranker scores aren't directly comparable across queries. A score of 0.8 on one query doesn't mean the same relevance as 0.8 on another. Use scores only to rank within a query, not across queries.
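One way to enforce this discipline is to discard raw scores early and work with within-query ranks. A minimal sketch (`scores_to_ranks` is a hypothetical helper, plain Python):

```python
def scores_to_ranks(scores):
    """Convert raw reranker scores to within-query ranks (0 = best).
    Ranks, unlike raw scores, are safe to use in cross-query logic."""
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks
```

Any downstream thresholding or fusion then operates on positions, never on the uncalibrated score values.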

When to Skip Reranking

Reranking adds 50–200ms latency. Skip if: (1) latency budget is tight, (2) retrieval quality is already high, (3) you're doing real-time interactive search (< 100ms required).
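These conditions can be wrapped in a small gate. A sketch with illustrative defaults only (the 150 ms cost and 0.9 score cutoff are assumptions, not recommendations):

```python
def should_rerank(latency_budget_ms: float, top_score: float,
                  rerank_cost_ms: float = 150.0,
                  score_threshold: float = 0.9) -> bool:
    """Heuristic gate: skip reranking when the latency budget cannot
    absorb it, or when first-stage retrieval already looks confident."""
    if latency_budget_ms < rerank_cost_ms:
        return False          # budget too tight for a second pass
    if top_score >= score_threshold:
        return False          # retrieval quality already high
    return True
```

The thresholds should be tuned per deployment; the point is that the skip decision is cheap and can be made per query.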

04 — Efficiency

Context Compression

Retrieved chunks often contain noise — irrelevant sentences, metadata, boilerplate. Compression removes it, keeping only salient content. Two approaches:

Extractive Compression

Select key sentences from each chunk. Fast (no LLM call), but may break narrative flow.

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

def compress_extract(text: str, ratio: float = 0.5) -> str:
    """Keep roughly the top `ratio` of sentences, ranked by TextRank."""
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = TextRankSummarizer()
    n_sentences = max(1, int(len(parser.document.sentences) * ratio))
    summary = summarizer(parser.document, n_sentences)
    return " ".join(str(s) for s in summary)

Token-Level Compression with LLMLingua

LLMLingua uses targeted token pruning. It identifies tokens that are least important to the query, removes them, and keeps the compressed text coherent.

from llmlingua import PromptCompressor

# PromptCompressor loads a small LM to score token importance
# (pass model_name=... to choose a lighter model than the default)
compressor = PromptCompressor()

context = "This is a long document that contains..."
query = "What is the main topic?"

# Compress toward an absolute token target
result = compressor.compress_prompt(
    context,
    question=query,
    target_token=150,
)

print(f"Original: {result['origin_tokens']} tokens")
print(f"Compressed: {result['compressed_tokens']} tokens")
print(result["compressed_prompt"])

LongLLMLingua for Very Long Contexts

For documents >10K tokens, split into chunks, compress each, then summarize. Avoids exceeding model context limits during compression.
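The split-compress-join step can be sketched independently of any particular compressor; `compress_long` is a hypothetical helper, and `compress_chunk` stands in for a chunk-level call such as LLMLingua:

```python
def compress_long(document: str, compress_chunk, chunk_size: int = 2000) -> str:
    """Hierarchical compression sketch: split a long document into
    word-based chunks, compress each independently, then rejoin.

    `compress_chunk` is any chunk-level compressor; `chunk_size`
    is measured in words here for simplicity, not tokens.
    """
    words = document.split()
    chunks = [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
    return "\n".join(compress_chunk(chunk) for chunk in chunks)
```

When the joined output still exceeds the budget, a second summarization pass over it (as LongLLMLingua's hierarchical approach suggests) shrinks it further.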

Compression Ratio Tradeoffs

🎯 Keep 70% of tokens

  • Minimal quality loss
  • 30% token savings
  • Safe default

Keep 50% of tokens

  • Half the tokens
  • Some context loss
  • Use for long docs
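Compressors that take an absolute token target rather than a ratio need a small conversion first; `target_token_count` is a hypothetical helper:

```python
def target_token_count(original_tokens: int, keep_ratio: float) -> int:
    """Convert a keep-ratio (0.7 = keep 70% of tokens) into the
    absolute token target a compressor expects. Never returns < 1."""
    return max(1, int(original_tokens * keep_ratio))
```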
05 — Fusion

Reciprocal Rank Fusion

Combine results from multiple retrievers (dense, sparse, metadata search) using Reciprocal Rank Fusion (RRF). Each retriever votes on relevance; documents ranked high by multiple retrievers bubble to the top.

RRF Formula

score(d) = Σ 1/(k + rank_i)

where k=60 (constant), rank_i = position in retriever i's results

Python Implementation

def reciprocal_rank_fusion(results_list, k=60):
    """
    results_list: list of [doc_ids] from each retriever
    Returns: list of (doc_id, RRF score) pairs, best first
    """
    rrf_scores = {}
    for results in results_list:
        for rank, doc_id in enumerate(results, start=1):
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k + rank)
    # Sort by score, descending
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

# Example: combine BM25 + dense retriever + metadata search
bm25_results = ["doc_5", "doc_12", "doc_3", ...]
dense_results = ["doc_3", "doc_9", "doc_5", ...]
metadata_results = ["doc_9", "doc_1", "doc_12", ...]

fused = reciprocal_rank_fusion(
    [bm25_results, dense_results, metadata_results]
)

# Result: [(doc_id, score), ...]
for doc_id, score in fused[:5]:
    print(f"{doc_id}: {score:.4f}")

Advantages

  • No score calibration needed: RRF uses only ranks, so retrievers with incomparable score scales combine cleanly.
  • A single constant (k=60) that works well across datasets.
  • Negligible cost: a few dictionary updates per result list.

06 — Optimization

Context Window Management

Your token budget is fixed: query tokens + history + context + response. Allocate wisely.

The Lost-in-the-Middle Effect

LLMs attend least to content in the middle of context windows. Most important information should be at the start (right after the query) or at the very end, just before the generation prompt.

Chunk Ordering Strategy

  • Start (best): most relevant 1–2 chunks
  • Middle (worst): least critical supporting docs
  • End (strong): second-most relevant chunks

def arrange_context(chunks, scores):
    """Order chunks to minimize lost-in-the-middle."""
    # Sort by score, best first
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    if len(ranked) == 0:
        return []
    if len(ranked) == 1:
        return [ranked[0][0]]
    best = ranked[0][0]
    rest = [chunk for chunk, _ in ranked[1:]]  # still descending by score
    # Put the best chunk first, the weakest in the middle,
    # and the second-best at the very end
    half = len(rest) // 2
    middle = rest[half:]   # lower-scored chunks
    end = rest[:half]      # higher-scored chunks, descending
    end.reverse()          # ascending, so second-best lands last
    return [best] + middle + end

# Usage
context_string = "\n\n".join(arrange_context(
    chunks=retrieved_docs,
    scores=rerank_scores,
))

Metadata Injection

Prepend each chunk with metadata: document title, section, date. Helps the model understand context and filter:

[Document: "AI Safety" | Section: "Alignment" | Date: 2026-03-24]
The alignment problem refers to ensuring AI systems behave according to human values...
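A minimal formatter for this header (hypothetical helper, matching the bracketed style above):

```python
def with_metadata(chunk: str, title: str, section: str, date: str) -> str:
    """Prepend a bracketed metadata header to a chunk."""
    header = f'[Document: "{title}" | Section: "{section}" | Date: {date}]'
    return f"{header}\n{chunk}"
```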

Token Budget Allocation

Example for 4K token limit:
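One plausible split, with illustrative numbers only (the actual shares depend on the model, prompt, and workload):

```python
# Illustrative split of a 4,096-token budget (all numbers are assumptions)
TOKEN_BUDGET = {
    "system_prompt": 300,       # instructions, persona
    "query_and_history": 500,   # user turn plus recent history
    "retrieved_context": 2500,  # post-retrieval chunks
    "response_reserve": 796,    # room left for the answer
}

assert sum(TOKEN_BUDGET.values()) == 4096
```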

max_context_tokens Parameter

Set dynamically based on query complexity:

def estimate_tokens(query: str) -> int:
    """Estimate query complexity; allocate context tokens."""
    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")
    query_tokens = len(enc.encode(query))
    # Simple queries get less context (faster inference);
    # complex queries get more (better coverage)
    if query_tokens < 20:
        return 1500   # Short, likely FAQ
    elif query_tokens < 50:
        return 2500   # Medium complexity
    else:
        return 4000   # Deep research question

max_ctx = estimate_tokens("What is machine learning?")
07 — Metrics

Evaluation

Measure the impact of post-retrieval processing on RAG quality.

Ranking Metrics: MRR@k, NDCG@k

MRR (Mean Reciprocal Rank): Reciprocal of the rank of the first relevant result, averaged over queries. Rewards putting relevant docs high.

def mrr(retrieved_docs, relevant_doc_ids, k=5):
    """Mean reciprocal rank@k."""
    for rank, doc_id in enumerate(retrieved_docs[:k], start=1):
        if doc_id in relevant_doc_ids:
            return 1.0 / rank
    return 0.0

# Example: relevant doc at position 2
mrr(["doc_5", "doc_3", "doc_9", ...], ["doc_3"], k=5)  # Returns 1/2 = 0.5

NDCG (Normalized Discounted Cumulative Gain): Discounts lower-ranked relevant docs; normalized against perfect ranking.

from math import log2

def ndcg(retrieved_scores, relevant_doc_ids, k=5):
    """NDCG@k: penalizes relevant docs appearing late.
    retrieved_scores: list of (doc_id, score) pairs, best first."""
    dcg = 0.0
    for rank, (doc_id, score) in enumerate(retrieved_scores[:k], start=1):
        if doc_id in relevant_doc_ids:
            dcg += 1.0 / log2(rank + 1)
    # Ideal DCG: all relevant docs ranked first
    idcg = sum(1.0 / log2(i + 1)
               for i in range(1, min(k, len(relevant_doc_ids)) + 1))
    return dcg / idcg if idcg > 0 else 0.0

Faithfulness Score

After compression, evaluate if compressed context still supports the answer:

def check_faithfulness(compressed_context: str, original_context: str, answer: str):
    """Ask an LLM whether the answer is faithful to both versions.
    `llm` is a placeholder for any chat-completion call."""
    prompt = f"""
Original context: {original_context}
Compressed context: {compressed_context}
Generated answer: {answer}

Is the answer faithfully supported by the compressed context?
Answer: yes/no
Confidence: 0-100
"""
    response = llm(prompt)
    return response

Ablation Study

Measure incremental gains from each post-retrieval step:

Results on benchmark: answer_relevance_f1

  Baseline (retrieval only):            0.72
  + Reranking:                          0.78  (+6 points)
  + Reranking + Compression:            0.80  (+2 points)
  + Reranking + Compression + Reorder:  0.82  (+2 points)

Conclusion: Reranking dominates; compression and ordering add marginal but consistent gains.
08 — Ecosystem

Tools & Resources

Post-Retrieval Processing Stack

API Service
Cohere Rerank
Managed cross-encoder reranking; lowest latency; production-ready
Model
BGE Reranker
Open-source BERT cross-encoder; self-hosted; good quality
Model
FlashRank
Lightweight local reranker; runs on CPU; fast inference
Library
LLMLingua
Token-level context compression; selective pruning; preserves meaning
Library
LongLLMLingua
Compression for very long contexts; hierarchical summarization
Framework
LangChain
RAG pipeline orchestration; integrations with retrievers, rerankers, compression
Framework
LlamaIndex
Document indexing and retrieval; post-retrieval node processing
Evaluation
RAGAS
RAG evaluation suite; faithfulness, relevance, answer quality metrics
Papers & Research
  • Paper Jiang, H. et al. (2023). LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. arXiv:2310.05736 — arxiv.org ↗
  • Paper Liu, N. F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172 — arxiv.org ↗