Techniques to reduce the amount of retrieved context passed to the LLM by extracting only relevant sentences or compressing passages, improving quality and reducing cost.
You retrieve the top-5 chunks. Each chunk is 512 tokens. But the user's question is about a single sentence in chunk 3 — the other 4 chunks and 90% of chunk 3 are noise. You're sending 2560 tokens to the LLM when 50 tokens would suffice.
Two problems follow: cost (you pay for all 2560 tokens) and quality ("lost in the middle" — LLMs are less accurate when the relevant content is buried deep in a long context). Context compression solves both by stripping out the irrelevant parts before they reach the LLM.
Extract only the sentences or fragments from the retrieved chunks that are relevant to the query, discarding the rest:
```python
import anthropic

client = anthropic.Anthropic()

def extract_relevant_sentences(query: str, passage: str) -> str:
    """Extract only the sentences from passage that answer the query."""
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=512,
        messages=[{"role": "user", "content": f"""
Extract only the sentences from the passage below that are directly relevant
to answering the question. If no sentences are relevant, return "IRRELEVANT".
Return only the extracted sentences, nothing else.

Question: {query}

Passage:
{passage}
"""}],
    )
    result = response.content[0].text.strip()
    return "" if result == "IRRELEVANT" else result

# Usage
passage = """
Our company was founded in 2015 by Jane Smith.
Returns are accepted within 30 days of purchase with the original receipt.
We have offices in New York, London, and Singapore.
Refunds are processed to the original payment method within 5 business days.
Our CEO has over 20 years of retail experience.
"""

compressed = extract_relevant_sentences("How long does a refund take?", passage)
print(compressed)
# "Refunds are processed to the original payment method within 5 business days."
```
Instead of extracting sentences verbatim, ask the LLM to summarize the relevant portions:
```python
import anthropic

client = anthropic.Anthropic()

def compress_to_answer(query: str, passages: list[str]) -> str:
    """Compress multiple retrieved passages into a focused context."""
    combined = "\n\n---\n\n".join(passages)
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=300,
        messages=[{"role": "user", "content": f"""
Given the following retrieved passages and a question, extract and condense
only the information relevant to answering the question.
Be concise. If a passage has no relevant information, skip it.

Question: {query}

Retrieved passages:
{combined}

Condensed relevant information:
"""}],
    )
    return response.content[0].text.strip()

query = "What is the refund policy?"
passages = [
    "Our company was founded in 2015. We have 500 employees globally.",
    "Returns are accepted within 30 days. Refunds take 5 business days.",
    "We ship worldwide. Free shipping on orders over $50.",
]

compressed = compress_to_answer(query, passages)
print(compressed)
# "Returns are accepted within 30 days of purchase. Refunds are processed in 5 business days."
```
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_anthropic import ChatAnthropic
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.schema import Document

# Base retriever
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
docs = [
    Document(page_content="Founded 2015, 500 employees globally."),
    Document(page_content="Returns within 30 days. Refunds in 5 business days."),
    Document(page_content="Ships worldwide. Free shipping over $50."),
]
vectorstore = Chroma.from_documents(docs, embeddings)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Compression layer
llm = ChatAnthropic(model="claude-3-5-haiku-20241022")
compressor = LLMChainExtractor.from_llm(llm)

# Combine: retrieves 5 docs, compresses each, returns only relevant snippets
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)

compressed_docs = compression_retriever.invoke("What is the refund timeline?")
for doc in compressed_docs:
    print(doc.page_content)
# Only: "Refunds in 5 business days."
```
| Situation | Recommended approach |
|---|---|
| Retrieved chunks are long (512+ tokens) but mostly irrelevant | Extractive compression or LLM extraction |
| Multiple chunks retrieved, only 1-2 are truly relevant | Reranking (cheaper than LLM compression) |
| High QPS, latency-sensitive | Smaller chunk size at indexing time (avoids runtime cost) |
| Best quality, latency not critical | Rerank first, then compress the top results |
| Cost is primary concern | Shorter chunks at indexing + BM25 pre-filter |
Compression adds latency and cost. LLM-based compression requires an LLM call per retrieved chunk. For 5 retrieved chunks, that's 5 extra API calls before the main generation. Use a cheap, fast model (Haiku) for compression, not your main generation model.
Compression can drop critical context. Extractive compression might remove sentences that seem irrelevant but are needed for context. Test on edge cases where the answer requires background context that's not directly about the query.
Chunking at indexing time is the upstream fix. If your chunks are noisy, the root cause is often that chunk size is too large. Smaller, more focused chunks at indexing time reduce the need for runtime compression — and reduce latency and cost at query time.
Filter before compress. Always rerank or filter to top-3 before compressing. Compressing 20 chunks is expensive; compressing 3 well-chosen chunks is reasonable.
Selective compression passes only the most relevant spans of retrieved context to the generation model rather than compressing uniformly across all retrieved text. Sentence-level extractive compression scores each sentence independently using a cross-encoder relevance model and retains only the top-k sentences by relevance score. This approach is faster and more interpretable than LLM-based compression because it makes discrete inclusion decisions rather than generating paraphrased summaries, and the retained sentences are guaranteed to be verbatim from the source — important for applications where attribution fidelity matters.
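A minimal runnable sketch of sentence-level extractive compression. To keep the example free of model downloads, a lexical-overlap scorer stands in for the cross-encoder; a production system would swap in something like sentence-transformers' `CrossEncoder` (the `ms-marco-MiniLM-L-6-v2` model named in the comment is one common choice, not a requirement) and keep the rest unchanged:

```python
import re

def lexical_overlap(query: str, sentence: str) -> float:
    # Stand-in relevance scorer: fraction of query words that appear in the
    # sentence. A real implementation would use a cross-encoder score here,
    # e.g. sentence-transformers' CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").
    q = set(re.findall(r"\w+", query.lower()))
    s = set(re.findall(r"\w+", sentence.lower()))
    return len(q & s) / len(q) if q else 0.0

def extract_top_sentences(query: str, passage: str, k: int = 2) -> str:
    # Score each sentence independently, keep the top-k by relevance, and
    # preserve original order so the extract stays verbatim and readable.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", passage.strip()) if s]
    top = sorted(sentences, key=lambda s: lexical_overlap(query, s), reverse=True)[:k]
    kept = set(top)
    return " ".join(s for s in sentences if s in kept)

passage = (
    "Our company was founded in 2015 by Jane Smith. "
    "Refunds are processed to the original payment method within 5 business days. "
    "Our CEO has over 20 years of retail experience."
)
print(extract_top_sentences("How many days do refunds take?", passage, k=1))
# Refunds are processed to the original payment method within 5 business days.
```

Because the output is assembled from unmodified source sentences, attribution is exact: every retained span can be traced back to the original passage.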
Integrating context compression into a production RAG pipeline requires benchmarking the quality-latency tradeoff for the specific application. LLM-based compression adds one full generation call to each RAG request, roughly doubling end-to-end latency. This overhead is justified when the compressed context produces measurably better generation quality — for instance, when the retrieval system returns many partially relevant passages that confuse the generation model. If retrieval precision is already high (retrieved passages are mostly relevant), compression overhead rarely pays for itself in quality improvements.
| Compression method | Latency overhead | Quality improvement | Best for |
|---|---|---|---|
| Extractive (sentence scoring) | Low (~20ms) | Medium | Long passages with noise |
| LLM rewrite | High (~500ms+) | High | Complex multi-passage contexts |
| LLMLingua token pruning | Medium (~100ms) | High | Token budget constraints |
| No compression (reranking only) | None additional | Low | High-precision retrieval |
LLMLingua's token-level compression approach uses a small language model (typically a 125M–7B parameter model) to score each token's importance to the query by measuring perplexity changes when tokens are removed. Tokens with low information content relative to the query — filler words, redundant phrases, tangential sentences — are pruned from the context, preserving only the tokens most relevant to answering the specific question. Compression ratios of 3–5x are achievable with minimal quality degradation, reducing the effective context length fed to the large generation model while preserving the key information needed for accurate generation.
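The pruning loop can be sketched without a real model. In this toy version a stopword list stands in for "low information content relative to the query"; actual LLMLingua scores every token with a small causal LM and drops those whose removal least changes perplexity:

```python
# Toy sketch of LLMLingua-style token pruning. The stopword heuristic below is
# a stand-in for per-token perplexity from a 125M-7B parameter scoring model.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "on", "our", "we", "with", "within"}

def prune_tokens(context: str, keep_ratio: float = 0.5) -> str:
    tokens = context.split()
    budget = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices: informative (non-stopword) tokens first, ties broken
    # by original position; then keep the highest-ranked `budget` tokens.
    ranked = sorted(range(len(tokens)),
                    key=lambda i: (tokens[i].lower().strip(".,!?") in STOPWORDS, i))
    keep = sorted(ranked[:budget])
    return " ".join(tokens[i] for i in keep)

context = "Refunds are processed to the original payment method within 5 business days."
print(prune_tokens(context, keep_ratio=0.7))
# Refunds processed original payment method 5 business days.
```

The pruned text is no longer grammatical, which is the point: generation models tolerate telegraphic input well, so the budget is spent on content-bearing tokens rather than connective tissue.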
Context compression quality degrades when applied to structured content like code, tables, and lists. Removing "low-perplexity" tokens from these formats often breaks their structural integrity: deleting a comma from a table row or a bracket from a code block invalidates the whole structure. Compression systems should detect structured content blocks and either pass them through uncompressed or apply alternative strategies, such as summarizing the structure's schema without the individual values. Separate compression policies for unstructured prose versus structured data are essential for maintaining content integrity in mixed-content RAG pipelines.
Reranking and compression are complementary techniques that operate on different aspects of the retrieved context quality problem. Reranking addresses the ordering problem — ensuring the most relevant passages appear first in the context window where model attention is strongest. Compression addresses the noise problem — removing irrelevant content from within individual passages to reduce the total tokens the generation model must process. Applying both in sequence — rerank to select top passages, then compress each passage to remove noise — consistently produces better generation quality than either technique alone, with a latency cost of approximately 200–700ms for the combined pipeline.
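The two-stage sequence can be sketched end to end. Both stages use toy lexical scorers here, as stand-ins for a cross-encoder reranker and an LLM extractor, so the control flow is runnable as-is:

```python
import re

def _words(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def rerank_then_compress(query: str, passages: list[str], top_k: int = 2) -> list[str]:
    # Stage 1 (rerank): order whole passages by query-word overlap, keep top_k.
    # A production system would use a cross-encoder relevance model here.
    ranked = sorted(passages, key=lambda p: len(_words(query) & _words(p)), reverse=True)[:top_k]
    # Stage 2 (compress): inside each survivor, drop sentences that share no
    # vocabulary with the query. A production system would use an LLM extractor.
    q = _words(query)
    out = []
    for p in ranked:
        kept = [s for s in re.split(r"(?<=[.!?])\s+", p) if q & _words(s)]
        if kept:
            out.append(" ".join(kept))
    return out

passages = [
    "Our company was founded in 2015. We have 500 employees globally.",
    "Returns are accepted within 30 days. Refunds take 5 business days.",
    "We ship worldwide. Free shipping on orders over $50.",
]
print(rerank_then_compress("How many days until refunds arrive?", passages))
# ['Returns are accepted within 30 days. Refunds take 5 business days.']
```

Note that the company-history and shipping passages are dropped entirely after stage 2 leaves them empty, so the generation model never sees them.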
Selective passage inclusion — choosing which retrieved passages to include in the context at all, rather than compressing all passages — is often more effective than compressing every passage. A relevance threshold that excludes low-scoring passages entirely produces cleaner contexts than including all passages after aggressive compression. The optimal strategy depends on retrieval precision: high-precision retrievers benefit from threshold-based exclusion, while lower-precision retrievers benefit more from compression that reduces noise within included passages. Evaluating both strategies on a quality benchmark before production deployment identifies which approach provides better quality-efficiency tradeoffs for the specific retrieval system.
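Threshold-based exclusion is simple to implement; the hard part is picking the cutoff. The 0.5 default below is purely illustrative and should be tuned on a quality benchmark for your retriever's score distribution:

```python
def select_passages(scored: list[tuple[str, float]], threshold: float = 0.5) -> list[str]:
    # Include a passage only if its retrieval/rerank score clears the
    # threshold; low scorers are excluded entirely rather than compressed.
    return [text for text, score in scored if score >= threshold]

candidates = [
    ("Returns within 30 days. Refunds in 5 business days.", 0.91),
    ("Founded 2015, 500 employees globally.", 0.22),
    ("Ships worldwide. Free shipping over $50.", 0.18),
]
print(select_passages(candidates))
# ['Returns within 30 days. Refunds in 5 business days.']
```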
Context-window limits make compression a practical necessity for long-document RAG applications rather than an optional optimization. A RAG pipeline that retrieves 10 passages of 500 tokens each produces 5,000 input tokens of context, consuming a large fraction of an 8K context limit and leaving little room for conversation history and output. Compression that reduces the retrieved context to 1,500 tokens while preserving the key information makes room for more retrieved knowledge, longer conversation histories, and more detailed system prompts within the available window, directly improving answer quality for context-constrained applications.
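The budget-packing step can be sketched as a greedy loop. The 10-token budget and whitespace token counting below are illustrative simplifications; in production, count tokens with the generation model's own tokenizer:

```python
def fit_to_budget(passages: list[str], budget_tokens: int) -> list[str]:
    # Greedily pack passages (assumed sorted most-relevant first) until the
    # token budget is spent. Whitespace splitting is a crude token estimate.
    kept, used = [], 0
    for p in passages:
        cost = len(p.split())
        if used + cost > budget_tokens:
            break
        kept.append(p)
        used += cost
    return kept

ranked = [
    "Refunds take 5 business days.",            # 5 tokens, most relevant
    "Returns accepted within 30 days.",         # 5 tokens
    "Founded in 2015 with offices worldwide.",  # 6 tokens, dropped: over budget
]
print(fit_to_budget(ranked, budget_tokens=10))
# ['Refunds take 5 business days.', 'Returns accepted within 30 days.']
```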