Techniques to reduce the amount of retrieved context passed to the LLM by extracting only relevant sentences or compressing passages, improving quality and reducing cost.
You retrieve the top-5 chunks. Each chunk is 512 tokens. But the user's question is about a single sentence in chunk 3 — the other 4 chunks and 90% of chunk 3 are noise. You're sending 2560 tokens to the LLM when 50 tokens would suffice.
Two problems follow: cost (you pay for all 2560 tokens) and quality ("lost in the middle" — LLMs are less accurate when the relevant content is buried deep in a long context). Context compression solves both by stripping out the irrelevant parts before they reach the LLM.
Extract only the sentences or fragments from the retrieved chunks that are relevant to the query, discarding the rest:
```python
import anthropic

client = anthropic.Anthropic()

def extract_relevant_sentences(query: str, passage: str) -> str:
    """Extract only the sentences from passage that answer the query."""
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=512,
        messages=[{"role": "user", "content": f"""
Extract only the sentences from the passage below that are directly relevant
to answering the question. If no sentences are relevant, return "IRRELEVANT".
Return only the extracted sentences, nothing else.

Question: {query}

Passage:
{passage}
"""}],
    )
    result = response.content[0].text.strip()
    return "" if result == "IRRELEVANT" else result

# Usage
passage = """
Our company was founded in 2015 by Jane Smith.
Returns are accepted within 30 days of purchase with the original receipt.
We have offices in New York, London, and Singapore.
Refunds are processed to the original payment method within 5 business days.
Our CEO has over 20 years of retail experience.
"""

compressed = extract_relevant_sentences("How long does a refund take?", passage)
print(compressed)
# "Refunds are processed to the original payment method within 5 business days."
```
Instead of extracting sentences verbatim, ask the LLM to summarize the relevant portions:
```python
import anthropic

client = anthropic.Anthropic()

def compress_to_answer(query: str, passages: list[str]) -> str:
    """Compress multiple retrieved passages into a focused context."""
    combined = "\n\n---\n\n".join(passages)
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=300,
        messages=[{"role": "user", "content": f"""
Given the following retrieved passages and a question, extract and condense
only the information relevant to answering the question.
Be concise. If a passage has no relevant information, skip it.

Question: {query}

Retrieved passages:
{combined}

Condensed relevant information:
"""}],
    )
    return response.content[0].text.strip()

query = "What is the refund policy?"
passages = [
    "Our company was founded in 2015. We have 500 employees globally.",
    "Returns are accepted within 30 days. Refunds take 5 business days.",
    "We ship worldwide. Free shipping on orders over $50.",
]

compressed = compress_to_answer(query, passages)
print(compressed)
# "Returns are accepted within 30 days of purchase. Refunds are processed in 5 business days."
```
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_anthropic import ChatAnthropic
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.schema import Document

# Base retriever
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
docs = [
    Document(page_content="Founded 2015, 500 employees globally."),
    Document(page_content="Returns within 30 days. Refunds in 5 business days."),
    Document(page_content="Ships worldwide. Free shipping over $50."),
]
vectorstore = Chroma.from_documents(docs, embeddings)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Compression layer
llm = ChatAnthropic(model="claude-3-5-haiku-20241022")
compressor = LLMChainExtractor.from_llm(llm)

# Combine: retrieves 5 docs, compresses each, returns only relevant snippets
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)

compressed_docs = compression_retriever.invoke("What is the refund timeline?")
for doc in compressed_docs:
    print(doc.page_content)
# Only: "Refunds in 5 business days."
```
| Situation | Recommended approach |
|---|---|
| Retrieved chunks are long (512+ tokens) but mostly irrelevant | Extractive compression or LLM extraction |
| Multiple chunks retrieved, only 1-2 are truly relevant | Reranking (cheaper than LLM compression) |
| High QPS, latency-sensitive | Smaller chunk size at indexing time (avoids runtime cost) |
| Best quality, latency not critical | Rerank first, then compress the top results |
| Cost is primary concern | Shorter chunks at indexing + BM25 pre-filter |
Compression adds latency and cost. LLM-based compression requires an LLM call per retrieved chunk. For 5 retrieved chunks, that's 5 extra API calls before the main generation. Use a cheap, fast model (Haiku) for compression, not your main generation model.
Compression can drop critical context. Extractive compression might remove sentences that seem irrelevant but are needed for context. Test on edge cases where the answer requires background context that's not directly about the query.
Chunking at indexing time is the upstream fix. If your chunks are noisy, the root cause is often that chunk size is too large. Smaller, more focused chunks at indexing time reduce the need for runtime compression — and reduce latency and cost at query time.
Filter before compress. Always rerank or filter to top-3 before compressing. Compressing 20 chunks is expensive; compressing 3 well-chosen chunks is reasonable.
Selective compression passes only the most relevant spans of retrieved context to the generation model rather than compressing uniformly across all retrieved text. Sentence-level extractive compression scores each sentence independently using a cross-encoder relevance model and retains only the top-k sentences by relevance score. This approach is faster and more interpretable than LLM-based compression because it makes discrete inclusion decisions rather than generating paraphrased summaries, and the retained sentences are guaranteed to be verbatim from the source — important for applications where attribution fidelity matters.
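A minimal runnable sketch of sentence-level extractive compression. To keep the example free of model downloads, a lexical-overlap scorer stands in for the cross-encoder; a production system would swap in something like sentence-transformers' `CrossEncoder` (the `ms-marco-MiniLM-L-6-v2` model named in the comment is one common choice, not a requirement) and keep the rest unchanged:

```python
import re

def lexical_overlap(query: str, sentence: str) -> float:
    # Stand-in relevance scorer: fraction of query words that appear in the
    # sentence. A real implementation would use a cross-encoder score here,
    # e.g. sentence-transformers' CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").
    q = set(re.findall(r"\w+", query.lower()))
    s = set(re.findall(r"\w+", sentence.lower()))
    return len(q & s) / len(q) if q else 0.0

def extract_top_sentences(query: str, passage: str, k: int = 2) -> str:
    # Score each sentence independently, keep the top-k by relevance, and
    # preserve original order so the extract stays verbatim and readable.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", passage.strip()) if s]
    top = sorted(sentences, key=lambda s: lexical_overlap(query, s), reverse=True)[:k]
    kept = set(top)
    return " ".join(s for s in sentences if s in kept)

passage = (
    "Our company was founded in 2015 by Jane Smith. "
    "Refunds are processed to the original payment method within 5 business days. "
    "Our CEO has over 20 years of retail experience."
)
print(extract_top_sentences("How many days do refunds take?", passage, k=1))
# Refunds are processed to the original payment method within 5 business days.
```

Because the output is assembled from unmodified source sentences, attribution is exact: every retained span can be traced back to the original passage.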
Integrating context compression into a production RAG pipeline requires benchmarking the quality-latency tradeoff for the specific application. LLM-based compression adds one full generation call to each RAG request, roughly doubling end-to-end latency. This overhead is justified when the compressed context produces measurably better generation quality — for instance, when the retrieval system returns many partially relevant passages that confuse the generation model. If retrieval precision is already high (retrieved passages are mostly relevant), compression overhead rarely pays for itself in quality improvements.
| Compression method | Latency overhead | Quality improvement | Best for |
|---|---|---|---|
| Extractive (sentence scoring) | Low (~20ms) | Medium | Long passages with noise |
| LLM rewrite | High (~500ms+) | High | Complex multi-passage contexts |
| LLMLingua token pruning | Medium (~100ms) | High | Token budget constraints |
| No compression (reranking only) | None additional | Low | High-precision retrieval |
LLMLingua's token-level compression approach uses a small language model (typically a 125M–7B parameter model) to score each token's importance to the query by measuring perplexity changes when tokens are removed. Tokens with low information content relative to the query — filler words, redundant phrases, tangential sentences — are pruned from the context, preserving only the tokens most relevant to answering the specific question. Compression ratios of 3–5x are achievable with minimal quality degradation, reducing the effective context length fed to the large generation model while preserving the key information needed for accurate generation.
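The pruning loop can be sketched without a real model. In this toy version a stopword list stands in for "low information content relative to the query"; actual LLMLingua scores every token with a small causal LM and drops those whose removal least changes perplexity:

```python
# Toy sketch of LLMLingua-style token pruning. The stopword heuristic below is
# a stand-in for per-token perplexity from a 125M-7B parameter scoring model.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "on", "our", "we", "with", "within"}

def prune_tokens(context: str, keep_ratio: float = 0.5) -> str:
    tokens = context.split()
    budget = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices: informative (non-stopword) tokens first, ties broken
    # by original position; then keep the highest-ranked `budget` tokens.
    ranked = sorted(range(len(tokens)),
                    key=lambda i: (tokens[i].lower().strip(".,!?") in STOPWORDS, i))
    keep = sorted(ranked[:budget])
    return " ".join(tokens[i] for i in keep)

context = "Refunds are processed to the original payment method within 5 business days."
print(prune_tokens(context, keep_ratio=0.7))
# Refunds processed original payment method 5 business days.
```

The pruned text is no longer grammatical, which is the point: generation models tolerate telegraphic input well, so the budget is spent on content-bearing tokens rather than connective tissue.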
Context compression quality degrades when applied to structured content like code, tables, and lists. Removing "low-perplexity" tokens from these formats often breaks their structural integrity: deleting a comma from a table row or a bracket from a code block invalidates the whole structure. Compression systems should detect structured content blocks and either pass them through uncompressed or apply alternative strategies, such as summarizing the structure's schema without the individual values. Separate compression policies for unstructured prose versus structured data are essential for maintaining content integrity in mixed-content RAG pipelines.
Reranking and compression are complementary techniques that operate on different aspects of the retrieved context quality problem. Reranking addresses the ordering problem — ensuring the most relevant passages appear first in the context window where model attention is strongest. Compression addresses the noise problem — removing irrelevant content from within individual passages to reduce the total tokens the generation model must process. Applying both in sequence — rerank to select top passages, then compress each passage to remove noise — consistently produces better generation quality than either technique alone, with a latency cost of approximately 200–700ms for the combined pipeline.
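The two-stage sequence can be sketched end to end. Both stages use toy lexical scorers here, as stand-ins for a cross-encoder reranker and an LLM extractor, so the control flow is runnable as-is:

```python
import re

def _words(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def rerank_then_compress(query: str, passages: list[str], top_k: int = 2) -> list[str]:
    # Stage 1 (rerank): order whole passages by query-word overlap, keep top_k.
    # A production system would use a cross-encoder relevance model here.
    ranked = sorted(passages, key=lambda p: len(_words(query) & _words(p)), reverse=True)[:top_k]
    # Stage 2 (compress): inside each survivor, drop sentences that share no
    # vocabulary with the query. A production system would use an LLM extractor.
    q = _words(query)
    out = []
    for p in ranked:
        kept = [s for s in re.split(r"(?<=[.!?])\s+", p) if q & _words(s)]
        if kept:
            out.append(" ".join(kept))
    return out

passages = [
    "Our company was founded in 2015. We have 500 employees globally.",
    "Returns are accepted within 30 days. Refunds take 5 business days.",
    "We ship worldwide. Free shipping on orders over $50.",
]
print(rerank_then_compress("How many days until refunds arrive?", passages))
# ['Returns are accepted within 30 days. Refunds take 5 business days.']
```

Note that the company-history and shipping passages are dropped entirely after stage 2 leaves them empty, so the generation model never sees them.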
Selective passage inclusion — choosing which retrieved passages to include in the context at all, rather than compressing all passages — is often more effective than compressing every passage. A relevance threshold that excludes low-scoring passages entirely produces cleaner contexts than including all passages after aggressive compression. The optimal strategy depends on retrieval precision: high-precision retrievers benefit from threshold-based exclusion, while lower-precision retrievers benefit more from compression that reduces noise within included passages. Evaluating both strategies on a quality benchmark before production deployment identifies which approach provides better quality-efficiency tradeoffs for the specific retrieval system.
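Threshold-based exclusion is simple to implement; the hard part is picking the cutoff. The 0.5 default below is purely illustrative and should be tuned on a quality benchmark for your retriever's score distribution:

```python
def select_passages(scored: list[tuple[str, float]], threshold: float = 0.5) -> list[str]:
    # Include a passage only if its retrieval/rerank score clears the
    # threshold; low scorers are excluded entirely rather than compressed.
    return [text for text, score in scored if score >= threshold]

candidates = [
    ("Returns within 30 days. Refunds in 5 business days.", 0.91),
    ("Founded 2015, 500 employees globally.", 0.22),
    ("Ships worldwide. Free shipping over $50.", 0.18),
]
print(select_passages(candidates))
# ['Returns within 30 days. Refunds in 5 business days.']
```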
Context-window limits make compression a practical necessity for long-document RAG applications rather than an optional optimization. A RAG pipeline that retrieves 10 passages of 500 tokens each produces 5,000 input tokens of context, consuming a large fraction of an 8K context limit and leaving little room for conversation history and output. Compression that reduces the retrieved context to 1,500 tokens while preserving the key information makes room for more retrieved knowledge, longer conversation histories, and more detailed system prompts within the available window, directly improving answer quality for context-constrained applications.
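The budget-packing step can be sketched as a greedy loop. The 10-token budget and whitespace token counting below are illustrative simplifications; in production, count tokens with the generation model's own tokenizer:

```python
def fit_to_budget(passages: list[str], budget_tokens: int) -> list[str]:
    # Greedily pack passages (assumed sorted most-relevant first) until the
    # token budget is spent. Whitespace splitting is a crude token estimate.
    kept, used = [], 0
    for p in passages:
        cost = len(p.split())
        if used + cost > budget_tokens:
            break
        kept.append(p)
        used += cost
    return kept

ranked = [
    "Refunds take 5 business days.",            # 5 tokens, most relevant
    "Returns accepted within 30 days.",         # 5 tokens
    "Founded in 2015 with offices worldwide.",  # 6 tokens, dropped: over budget
]
print(fit_to_budget(ranked, budget_tokens=10))
# ['Refunds take 5 business days.', 'Returns accepted within 30 days.']
```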