Embed individual sentences for precise retrieval, but return a window of surrounding sentences to the LLM for context. This separates the retrieval unit (the sentence) from the synthesis unit (the window), giving you both precision and coherence.
Sentence window chunking solves a core RAG tension: small chunks are better for retrieval (more precise semantic match) but worse for synthesis (LLM lacks context). The solution: decouple the embedding unit from the context unit.
Indexing: Split the document into individual sentences. Embed each sentence separately. Store the sentence's position (document + index within document) in metadata.
Retrieval: Query returns the most semantically similar sentences — precise matches to the question.
Context expansion: Before passing to the LLM, replace each retrieved sentence with a window of sentences around it (e.g., ±3 sentences). The LLM sees coherent paragraphs; retrieval used precise sentence matches.
In reported evaluations, this approach typically outperforms fixed-size chunking on answer-quality benchmarks (improvements on the order of 5–15%) while using less context than large fixed-size chunks.
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
from llama_index.core.query_engine import RetrieverQueryEngine

# Parse documents into sentence nodes with window metadata
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # ±3 sentences around the retrieved sentence
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

documents = SimpleDirectoryReader("./docs").load_data()
nodes = node_parser.get_nodes_from_documents(documents)

# Build index on sentence-level nodes
index = VectorStoreIndex(nodes)
retriever = index.as_retriever(similarity_top_k=5)

# MetadataReplacementPostProcessor swaps sentence → window before the LLM sees it
postprocessor = MetadataReplacementPostProcessor(target_metadata_key="window")
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[postprocessor],
)

response = query_engine.query("What is the attention mechanism in transformers?")
print(response)
```
```python
import nltk
import numpy as np
from typing import Callable, List

nltk.download("punkt")

def build_sentence_window_index(text: str, window: int = 3) -> List[dict]:
    sentences = nltk.sent_tokenize(text)
    nodes = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - window)
        hi = min(len(sentences), i + window + 1)
        nodes.append({
            "sentence": sent,
            "window": " ".join(sentences[lo:hi]),
            "sentence_idx": i,
            "window_range": (lo, hi),
        })
    return nodes

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_with_window(query: str, nodes: List[dict],
                         embed: Callable[[str], np.ndarray],
                         top_k: int = 3) -> List[str]:
    # Embed query and sentences, rank by similarity, return windows (not sentences)
    query_emb = embed(query)
    sims = [cosine_sim(query_emb, embed(n["sentence"])) for n in nodes]
    top_indices = np.argsort(sims)[-top_k:][::-1]
    return [nodes[i]["window"] for i in top_indices]

# long_document (str) and embed_fn (str -> np.ndarray) are supplied by the caller
nodes = build_sentence_window_index(long_document, window=3)
contexts = retrieve_with_window("What is self-attention?", nodes, embed_fn)
```
Window size controls the trade-off between precision and context:
For technical documentation, window=2 often works well (code examples and their explanations are usually close together). For narrative text or legal documents, window=4 preserves more necessary context.
Sentence window is particularly effective when:
- The query targets a specific fact stated in one sentence (exact retrieval, plus surrounding context for the LLM).
- Documents have consistent sentence-level structure (technical docs, news articles).
- You need high precision and want to avoid returning irrelevant context.
- Topics are distributed uniformly across the document (no very long sections about a single topic).

Fixed-size chunking is still preferred when:
- Documents lack natural sentence boundaries (code, structured data, tables).
- Computational cost matters (sentence tokenisation plus per-sentence embeddings is slower).
- Topics span long multi-sentence passages that need to be retrieved as units.
Sentence boundary detection: NLTK's sent_tokenize and similar tools make mistakes on technical text (abbreviations like "e.g.", "Fig.", "Dr." trigger false boundaries), code mixed with prose, and bullet-pointed lists. Consider a custom sentence splitter for specialised domains.
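A minimal abbreviation-aware splitter can be sketched as follows. The abbreviation list and placeholder scheme are illustrative, not exhaustive; a trained tokenizer with custom rules (e.g. spaCy) remains preferable for production use:

```python
import re

# Protect common abbreviations before splitting on sentence-final punctuation.
# This list is a small example; extend it for your domain.
ABBREVIATIONS = ["e.g.", "i.e.", "Fig.", "Dr.", "etc.", "vs."]

def split_sentences(text: str) -> list:
    protected = text
    for i, abbr in enumerate(ABBREVIATIONS):
        protected = protected.replace(abbr, f"<ABBR{i}>")
    parts = re.split(r"(?<=[.!?])\s+", protected)  # split after ., !, ?
    sentences = []
    for part in parts:
        for i, abbr in enumerate(ABBREVIATIONS):
            part = part.replace(f"<ABBR{i}>", abbr)  # restore abbreviations
        if part.strip():
            sentences.append(part.strip())
    return sentences
```

This keeps "e.g." and "Fig." inside their sentences instead of treating them as boundaries.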
Embedding cost: Embedding every sentence means 3–10× more embeddings than fixed-size chunking. For a 1M-word document corpus, this is significant. Mitigate by using a fast, cheap embedding model (text-embedding-3-small at $0.02/1M tokens).
Retrieval deduplication: If two similar sentences are nearby, they may both be retrieved, and their windows will overlap significantly. Add deduplication: if retrieved windows overlap by more than 50%, keep only the higher-scoring one.
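One way to implement this, assuming each retrieved hit carries the `window_range` (sentence indices) and a similarity `score` as in the index structure above:

```python
def deduplicate_windows(hits, overlap_threshold=0.5):
    """Drop lower-scoring windows that overlap an already-kept window
    by more than overlap_threshold of the smaller window."""
    kept = []
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        lo, hi = hit["window_range"]
        duplicate = False
        for k in kept:
            klo, khi = k["window_range"]
            overlap = max(0, min(hi, khi) - max(lo, klo))
            smaller = min(hi - lo, khi - klo)
            if smaller and overlap / smaller > overlap_threshold:
                duplicate = True
                break
        if not duplicate:
            kept.append(hit)
    return kept
```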
Very short sentences score poorly: Sentence embeddings work poorly for very short sentences ("Yes." "True." "See above."). These match almost any query weakly and nothing specifically. Filter out sentences below a minimum length (e.g., 20 characters) before indexing.
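A sketch of this pre-indexing filter; the 20-character threshold is the example value above and should be tuned per corpus:

```python
MIN_CHARS = 20  # example threshold; tune per corpus

def filter_short_sentences(sentences):
    # Drop fragments like "Yes." or "See above." that embed poorly.
    return [s for s in sentences if len(s.strip()) >= MIN_CHARS]
```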
Window storage overhead: Each node stores both the sentence and the window text. For a 100,000-sentence corpus, this roughly doubles storage vs storing sentences alone. Pre-compute windows at index time (as LlamaIndex does) rather than at retrieval time to keep latency low.
Cross-section contamination: If your document has hard section boundaries (chapter breaks, headings), windows should not cross them. A sentence at the end of Chapter 1 should not have a window containing Chapter 2 sentences. Add section-aware window construction that clips at section boundaries.
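A possible sketch of section-aware window construction, assuming a `section_of` list (built while parsing headings, hypothetical here) that maps each sentence index to its section id:

```python
def build_windows_with_sections(sentences, section_of, window=3):
    """Build sentence windows that never cross a section boundary.
    section_of[i] gives the section id of sentence i."""
    nodes = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - window)
        hi = min(len(sentences), i + window + 1)
        # Shrink until every sentence in the window shares sentence i's section
        while lo < i and section_of[lo] != section_of[i]:
            lo += 1
        while hi > i + 1 and section_of[hi - 1] != section_of[i]:
            hi -= 1
        nodes.append({
            "sentence": sent,
            "window": " ".join(sentences[lo:hi]),
            "window_range": (lo, hi),
        })
    return nodes
```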
Sentence window retrieval indexes individual sentences for high-precision retrieval but returns a window of surrounding sentences to the LLM, providing context that the isolated sentence embedding could not. It combines the retrieval precision of sentence-level indexing with the generation context quality of paragraph-level content, addressing the fundamental trade-off between chunk size and retrieval accuracy.
| Window Size | Context Provided | Retrieval Precision | Token Cost per Hit | Best For |
|---|---|---|---|---|
| ±0 (sentence only) | Single sentence | Highest | Lowest | Keyword fact lookup |
| ±1 (3 sentences) | Immediate context | Very high | Low | Most factual QA |
| ±2 (5 sentences) | Full thought context | High | Moderate | Technical explanation |
| ±5 (11 sentences) | Paragraph context | Moderate | Higher | Narrative or complex topics |
The window expansion step happens post-retrieval: after identifying the most relevant sentences through embedding similarity, the system fetches the neighboring sentences from the original document using the stored sentence position metadata. This requires maintaining a mapping from each sentence embedding's vector ID to its position within the source document, typically stored as metadata in the vector database. Document boundaries must be respected — sentence windows cannot extend across document boundaries, as neighboring sentences from different documents would be contextually misleading.
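A minimal sketch of this post-retrieval lookup, assuming (hypothetically) that each hit's stored metadata carries `doc_id` and `sentence_idx`, and that `docs` maps document ids to their sentence lists:

```python
def expand_hit(hit, docs, window=3):
    """Replace a retrieved sentence with its surrounding window using
    stored position metadata. Clamping lo/hi to the document's own
    sentence list means windows never cross document boundaries."""
    sentences = docs[hit["doc_id"]]
    i = hit["sentence_idx"]
    lo = max(0, i - window)
    hi = min(len(sentences), i + window + 1)
    return " ".join(sentences[lo:hi])
```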
Sentence window retrieval works particularly well for structured reference documents like technical manuals, API documentation, and regulatory texts where individual sentences contain high-density facts that are precisely what queries target. For narrative or argumentative content where meaning spans multiple sentences inherently, the sentence-level indexing may retrieve a sentence that contains relevant keywords but lacks the surrounding context needed to answer the query, producing returned windows that are relevant at the retrieval stage but insufficiently informative at the generation stage.
```python
# Sentence window: index sentences, return expanded window
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# Parse into sentence nodes with ±2 sentence window stored as metadata
parser = SentenceWindowNodeParser.from_defaults(window_size=2)
nodes = parser.get_nodes_from_documents(documents)

# At query time, replace retrieved sentence with its stored window
postprocessor = MetadataReplacementPostProcessor(
    target_metadata_key="window"  # swap sentence → window before LLM
)
```
Deduplication of sentence window results prevents the same source content from appearing multiple times in the retrieved context when adjacent sentences all rank highly for the same query. If sentences 12, 13, and 14 from a document are all highly relevant, naively returning three separate windows of ±2 sentences each would include sentences 10–16 three times with significant overlap. Deduplication by source document and position — merging overlapping windows into a single span — reduces context redundancy and allows the token budget to cover a wider span of the source document.
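Merging by document and position could look like this, assuming each hit carries a `doc_id` and a half-open `window_range` of sentence indices:

```python
def merge_window_spans(hits):
    """Merge overlapping or touching windows from the same document
    into single spans. window_range is a half-open (lo, hi) pair."""
    merged = []
    for hit in sorted(hits, key=lambda h: (h["doc_id"], h["window_range"])):
        lo, hi = hit["window_range"]
        if merged and merged[-1]["doc_id"] == hit["doc_id"] \
                and lo <= merged[-1]["window_range"][1]:
            # Overlaps the previous span from the same document: extend it
            plo, phi = merged[-1]["window_range"]
            merged[-1]["window_range"] = (plo, max(phi, hi))
        else:
            merged.append({"doc_id": hit["doc_id"], "window_range": (lo, hi)})
    return merged
```

For the example above, the three ±2 windows around sentences 12, 13, and 14 collapse into one span covering sentences 10–16.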
Production deployment of sentence window retrieval requires consistent sentence boundary detection between the indexing and retrieval phases. Using different tokenization libraries, spaCy models, or NLTK sentence tokenizers at index time versus query time can produce sentence boundary disagreements that cause window lookups to fail or return incorrect spans. Storing the sentence tokenization alongside the chunk metadata at index time — rather than re-tokenizing at query time — ensures consistency and allows the retrieval system to be updated without requiring full re-indexing of the document corpus.
Hybrid retrieval combining sentence window retrieval with BM25 keyword matching provides complementary signals for different query types. Dense embedding retrieval excels at semantic similarity and concept-level matching; BM25 excels at exact keyword matching, product names, technical identifiers, and rare terms with low semantic embedding coverage. Combining the two ranked lists using reciprocal rank fusion and then expanding to sentence windows produces retrieval results that are robust to both semantic and keyword query patterns, covering the strengths of both approaches without requiring separate index infrastructure.
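Reciprocal rank fusion itself is only a few lines. This sketch fuses two rankings of node ids before window expansion; `k=60` is the commonly used smoothing constant:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Combine several ranked lists of ids: each id scores the sum of
    1 / (k + rank) over the lists it appears in (rank is 1-based)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, node_id in enumerate(ranking, start=1):
            scores[node_id] = scores.get(node_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Items ranked highly by both the dense retriever and BM25 rise to the top; items found by only one list still survive with a lower fused score.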