Embed individual sentences for precise retrieval, but return a window of surrounding sentences to the LLM for context. This separates the retrieval unit (the sentence) from the synthesis unit (the window), giving you both precision and coherence.
Sentence window chunking solves a core RAG tension: small chunks are better for retrieval (more precise semantic match) but worse for synthesis (LLM lacks context). The solution: decouple the embedding unit from the context unit.
Indexing: Split the document into individual sentences. Embed each sentence separately. Store the sentence's position (document + index within document) in metadata.
Retrieval: Query returns the most semantically similar sentences — precise matches to the question.
Context expansion: Before passing to the LLM, replace each retrieved sentence with a window of sentences around it (e.g., ±3 sentences). The LLM sees coherent paragraphs; retrieval used precise sentence matches.
In reported evaluations, this approach typically outperforms fixed-size chunking on answer-quality benchmarks (improvements on the order of 5–15%) while using less context than large fixed-size chunks.
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
from llama_index.core.query_engine import RetrieverQueryEngine

# Parse documents into sentence nodes with window metadata
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # ±3 sentences around the retrieved sentence
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

documents = SimpleDirectoryReader("./docs").load_data()
nodes = node_parser.get_nodes_from_documents(documents)

# Build index on sentence-level nodes
index = VectorStoreIndex(nodes)
retriever = index.as_retriever(similarity_top_k=5)

# MetadataReplacementPostProcessor swaps sentence → window before the LLM sees it
postprocessor = MetadataReplacementPostProcessor(target_metadata_key="window")
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[postprocessor],
)

response = query_engine.query("What is the attention mechanism in transformers?")
print(response)
```
```python
import nltk
import numpy as np
from typing import Callable, List

nltk.download("punkt")

def build_sentence_window_index(text: str, window: int = 3) -> List[dict]:
    sentences = nltk.sent_tokenize(text)
    nodes = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - window)
        hi = min(len(sentences), i + window + 1)
        nodes.append({
            "sentence": sent,
            "window": " ".join(sentences[lo:hi]),
            "sentence_idx": i,
            "window_range": (lo, hi),
        })
    return nodes

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_with_window(query: str, nodes: List[dict],
                         embed: Callable[[str], np.ndarray],
                         top_k: int = 3) -> List[str]:
    # Embed query and sentences, rank by similarity, return windows (not sentences)
    query_emb = embed(query)
    sims = [cosine_sim(query_emb, embed(n["sentence"])) for n in nodes]
    top_indices = np.argsort(sims)[-top_k:][::-1]
    return [nodes[i]["window"] for i in top_indices]

# long_document (str) and embed_fn (str -> np.ndarray) are supplied by the caller
nodes = build_sentence_window_index(long_document, window=3)
contexts = retrieve_with_window("What is self-attention?", nodes, embed_fn)
```
Window size controls the trade-off between precision and context:
For technical documentation, window=2 often works well (code examples and their explanations are usually close together). For narrative text or legal documents, window=4 preserves more necessary context.
Sentence window is particularly effective when:
- The query targets a specific fact stated in one sentence (exact retrieval, plus surrounding context for the LLM).
- Documents have consistent sentence-level structure (technical docs, news articles).
- You need high precision and want to avoid returning irrelevant context.
- Topics are distributed uniformly across the document (no very long sections about a single topic).

Fixed-size chunking is still preferred when:
- Documents lack natural sentence boundaries (code, structured data, tables).
- Computational cost matters (sentence tokenisation plus per-sentence embeddings is slower).
- Topics span long multi-sentence passages that need to be retrieved as units.
Sentence boundary detection: NLTK's sent_tokenize and similar tools make mistakes on technical text (abbreviations like "e.g.", "Fig.", "Dr." trigger false boundaries), code mixed with prose, and bullet-pointed lists. Consider a custom sentence splitter for specialised domains.
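A minimal abbreviation-aware splitter can be sketched as follows. The abbreviation list and placeholder scheme are illustrative, not exhaustive; a trained tokenizer with custom rules (e.g. spaCy) remains preferable for production use:

```python
import re

# Protect common abbreviations before splitting on sentence-final punctuation.
# This list is a small example; extend it for your domain.
ABBREVIATIONS = ["e.g.", "i.e.", "Fig.", "Dr.", "etc.", "vs."]

def split_sentences(text: str) -> list:
    protected = text
    for i, abbr in enumerate(ABBREVIATIONS):
        protected = protected.replace(abbr, f"<ABBR{i}>")
    parts = re.split(r"(?<=[.!?])\s+", protected)  # split after ., !, ?
    sentences = []
    for part in parts:
        for i, abbr in enumerate(ABBREVIATIONS):
            part = part.replace(f"<ABBR{i}>", abbr)  # restore abbreviations
        if part.strip():
            sentences.append(part.strip())
    return sentences
```

This keeps "e.g." and "Fig." inside their sentences instead of treating them as boundaries.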
Embedding cost: Embedding every sentence means 3–10× more embeddings than fixed-size chunking. For a 1M-word document corpus, this is significant. Mitigate by using a fast, cheap embedding model (text-embedding-3-small at $0.02/1M tokens).
Retrieval deduplication: If two similar sentences are nearby, they may both be retrieved, and their windows will overlap significantly. Add deduplication: if retrieved windows overlap by more than 50%, keep only the higher-scoring one.
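One way to implement this, assuming each retrieved hit carries the `window_range` (sentence indices) and a similarity `score` as in the index structure above:

```python
def deduplicate_windows(hits, overlap_threshold=0.5):
    """Drop lower-scoring windows that overlap an already-kept window
    by more than overlap_threshold of the smaller window."""
    kept = []
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        lo, hi = hit["window_range"]
        duplicate = False
        for k in kept:
            klo, khi = k["window_range"]
            overlap = max(0, min(hi, khi) - max(lo, klo))
            smaller = min(hi - lo, khi - klo)
            if smaller and overlap / smaller > overlap_threshold:
                duplicate = True
                break
        if not duplicate:
            kept.append(hit)
    return kept
```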
Very short sentences score poorly: Sentence embeddings work poorly for very short sentences ("Yes." "True." "See above."). These match almost any query weakly and nothing specifically. Filter out sentences below a minimum length (e.g., 20 characters) before indexing.
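A sketch of this pre-indexing filter; the 20-character threshold is the example value above and should be tuned per corpus:

```python
MIN_CHARS = 20  # example threshold; tune per corpus

def filter_short_sentences(sentences):
    # Drop fragments like "Yes." or "See above." that embed poorly.
    return [s for s in sentences if len(s.strip()) >= MIN_CHARS]
```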
Window storage overhead: Each node stores both the sentence and the window text. For a 100,000-sentence corpus, this roughly doubles storage vs storing sentences alone. Pre-compute windows at index time (as LlamaIndex does) rather than at retrieval time to keep latency low.
Cross-section contamination: If your document has hard section boundaries (chapter breaks, headings), windows should not cross them. A sentence at the end of Chapter 1 should not have a window containing Chapter 2 sentences. Add section-aware window construction that clips at section boundaries.
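A possible sketch of section-aware window construction, assuming a `section_of` list (built while parsing headings, hypothetical here) that maps each sentence index to its section id:

```python
def build_windows_with_sections(sentences, section_of, window=3):
    """Build sentence windows that never cross a section boundary.
    section_of[i] gives the section id of sentence i."""
    nodes = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - window)
        hi = min(len(sentences), i + window + 1)
        # Shrink until every sentence in the window shares sentence i's section
        while lo < i and section_of[lo] != section_of[i]:
            lo += 1
        while hi > i + 1 and section_of[hi - 1] != section_of[i]:
            hi -= 1
        nodes.append({
            "sentence": sent,
            "window": " ".join(sentences[lo:hi]),
            "window_range": (lo, hi),
        })
    return nodes
```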
Sentence window retrieval indexes individual sentences for high-precision retrieval but returns a window of surrounding sentences to the LLM, providing context that the isolated sentence embedding could not. It combines the retrieval precision of sentence-level indexing with the generation context quality of paragraph-level content, addressing the fundamental trade-off between chunk size and retrieval accuracy.
| Window Size | Context Provided | Retrieval Precision | Token Cost per Hit | Best For |
|---|---|---|---|---|
| ±0 (sentence only) | Single sentence | Highest | Lowest | Keyword fact lookup |
| ±1 (3 sentences) | Immediate context | Very high | Low | Most factual QA |
| ±2 (5 sentences) | Full thought context | High | Moderate | Technical explanation |
| ±5 (11 sentences) | Paragraph context | Moderate | Higher | Narrative or complex topics |
The window expansion step happens post-retrieval: after identifying the most relevant sentences through embedding similarity, the system fetches the neighboring sentences from the original document using the stored sentence position metadata. This requires maintaining a mapping from each sentence embedding's vector ID to its position within the source document, typically stored as metadata in the vector database. Document boundaries must be respected — sentence windows cannot extend across document boundaries, as neighboring sentences from different documents would be contextually misleading.
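A minimal sketch of this post-retrieval lookup, assuming (hypothetically) that each hit's stored metadata carries `doc_id` and `sentence_idx`, and that `docs` maps document ids to their sentence lists:

```python
def expand_hit(hit, docs, window=3):
    """Replace a retrieved sentence with its surrounding window using
    stored position metadata. Clamping lo/hi to the document's own
    sentence list means windows never cross document boundaries."""
    sentences = docs[hit["doc_id"]]
    i = hit["sentence_idx"]
    lo = max(0, i - window)
    hi = min(len(sentences), i + window + 1)
    return " ".join(sentences[lo:hi])
```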
Sentence window retrieval works particularly well for structured reference documents like technical manuals, API documentation, and regulatory texts where individual sentences contain high-density facts that are precisely what queries target. For narrative or argumentative content where meaning spans multiple sentences inherently, the sentence-level indexing may retrieve a sentence that contains relevant keywords but lacks the surrounding context needed to answer the query, producing returned windows that are relevant at the retrieval stage but insufficiently informative at the generation stage.
```python
# Sentence window: index sentences, return expanded window
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# Parse into sentence nodes with ±2 sentence window stored as metadata
parser = SentenceWindowNodeParser.from_defaults(window_size=2)
nodes = parser.get_nodes_from_documents(documents)

# At query time, replace retrieved sentence with its stored window
postprocessor = MetadataReplacementPostProcessor(
    target_metadata_key="window"  # swap sentence → window before LLM
)
```
Deduplication of sentence window results prevents the same source content from appearing multiple times in the retrieved context when adjacent sentences all rank highly for the same query. If sentences 12, 13, and 14 from a document are all highly relevant, naively returning three separate windows of ±2 sentences each would include sentences 10–16 three times with significant overlap. Deduplication by source document and position — merging overlapping windows into a single span — reduces context redundancy and allows the token budget to cover a wider span of the source document.
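Merging by document and position could look like this, assuming each hit carries a `doc_id` and a half-open `window_range` of sentence indices:

```python
def merge_window_spans(hits):
    """Merge overlapping or touching windows from the same document
    into single spans. window_range is a half-open (lo, hi) pair."""
    merged = []
    for hit in sorted(hits, key=lambda h: (h["doc_id"], h["window_range"])):
        lo, hi = hit["window_range"]
        if merged and merged[-1]["doc_id"] == hit["doc_id"] \
                and lo <= merged[-1]["window_range"][1]:
            # Overlaps the previous span from the same document: extend it
            plo, phi = merged[-1]["window_range"]
            merged[-1]["window_range"] = (plo, max(phi, hi))
        else:
            merged.append({"doc_id": hit["doc_id"], "window_range": (lo, hi)})
    return merged
```

For the example above, the three ±2 windows around sentences 12, 13, and 14 collapse into one span covering sentences 10–16.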
Production deployment of sentence window retrieval requires consistent sentence boundary detection between the indexing and retrieval phases. Using different tokenization libraries, spaCy models, or NLTK sentence tokenizers at index time versus query time can produce sentence boundary disagreements that cause window lookups to fail or return incorrect spans. Storing the sentence tokenization alongside the chunk metadata at index time — rather than re-tokenizing at query time — ensures consistency and allows the retrieval system to be updated without requiring full re-indexing of the document corpus.
Hybrid retrieval combining sentence window retrieval with BM25 keyword matching provides complementary signals for different query types. Dense embedding retrieval excels at semantic similarity and concept-level matching; BM25 excels at exact keyword matching, product names, technical identifiers, and rare terms with low semantic embedding coverage. Combining the two ranked lists using reciprocal rank fusion and then expanding to sentence windows produces retrieval results that are robust to both semantic and keyword query patterns, covering the strengths of both approaches without requiring separate index infrastructure.
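Reciprocal rank fusion itself is only a few lines. This sketch fuses two rankings of node ids before window expansion; `k=60` is the commonly used smoothing constant:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Combine several ranked lists of ids: each id scores the sum of
    1 / (k + rank) over the lists it appears in (rank is 1-based)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, node_id in enumerate(ranking, start=1):
            scores[node_id] = scores.get(node_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Items ranked highly by both the dense retriever and BM25 rise to the top; items found by only one list still survive with a lower fused score.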