How you split documents shapes retrieval quality — fixed, semantic, and hierarchical approaches compared
In RAG systems, the chunk is the retrieval unit. A query retrieves chunks, not full documents. How you split determines what context the LLM sees — and thus output quality.
Too large: Chunks > 2000 tokens become noise-heavy. The LLM must sift through irrelevant sentences to find the signal. Retrieval precision drops.
Too small: Chunks < 100 tokens lose context. A single proposition isolated from its surrounding sentences becomes ambiguous. The LLM must stitch many chunks together to recover meaning.
The golden chunk is ~300–500 tokens, semantically cohesive, dense with signal, and self-contained enough that an LLM can understand it without external context.
User query → Embed → Search vector space → Retrieve top-k chunks → Augment prompt → LLM answer. Each stage depends on chunk quality. A coherent passage split into 7 fragments eats 7 of your k retrieval slots. A 10,000-token chunk with 9,800 tokens of noise poisons the context window.
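The loop above can be sketched end to end. Everything here is a toy stand-in: `embed()` is a bag-of-words set rather than a real vector model, and `search()` ranks by word overlap rather than cosine similarity, but the stage boundaries are the same.

```python
# Minimal sketch of the retrieve-augment-generate loop.
# embed() and search() are toy stand-ins for a real embedding model and
# vector search; a real system would send the final prompt to an LLM.
def embed(text: str) -> set[str]:
    # Toy "embedding": a bag of lowercase words.
    return set(text.lower().split())

def search(query_vec: set[str], chunks: list[str], k: int = 2) -> list[str]:
    # Rank chunks by word overlap with the query (stand-in for cosine similarity).
    scored = sorted(chunks, key=lambda c: len(query_vec & embed(c)), reverse=True)
    return scored[:k]

def answer(query: str, chunks: list[str]) -> str:
    top = search(embed(query), chunks)
    # Augment: retrieved chunks become the context block of the prompt.
    return "Context:\n" + "\n".join(top) + f"\nQuestion: {query}"

chunks = [
    "Paris is the capital of France.",
    "The Nile flows north.",
    "Rust has no garbage collector.",
]
print(answer("What is the capital of France?", chunks))
```

Note that the quality ceiling of `answer()` is set before the query arrives: whatever chunking produced `chunks` determines what `search()` can possibly return.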
| Strategy | Granularity | Overlap | Compute | Best For |
|---|---|---|---|---|
| Fixed-size | Uniform tokens/chars | Optional | O(n) | Simple baseline, fast ingestion |
| Recursive | Splits on delimiters (\n\n, \n, space) | Optional | O(n log n) | Markdown, code, docs |
| Semantic | Embedding-based boundaries | Yes | O(n²) | Long-form articles, narratives |
| Proposition | Atomic facts via LLM | No | O(n · k) | Factual precision, QA |
| Hierarchical | Small chunks + parent refs | No | O(n + k) | Context at retrieval + cheap full expansion |
| Agentic | Task-dependent splits | No | O(n · m) | Multi-step reasoning, edge cases |
Fixed & Recursive: Cheapest. No embeddings needed. Works well for structured text (code, markdown). Blind to semantics — may split meaningful units.
Semantic & Proposition: Expensive. Require embedding or LLM calls. Semantic respects meaning boundaries. Proposition extracts facts — highest precision for factual retrieval.
Hierarchical: Moderate cost. Layers retrieval: fast retrieval of small chunks, optional expansion to parent for context. Scales well.
Fixed-size: Split every N tokens or M characters. Simple. Predictable. Blind to boundaries.
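Fixed-size splitting fits in a few lines. A minimal sketch by character count (token-based variants work the same way, counting tokens instead of characters); the overlap keeps boundary sentences from being orphaned in exactly one chunk:

```python
# Fixed-size chunking: slide a window of `size` characters, stepping by
# size - overlap so consecutive chunks share an overlap region.
def fixed_size_chunks(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

Simple and predictable, as the text says, but the window will happily cut through a sentence, a word, or a code block.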
Recursive: Split on delimiters (paragraphs, sentences, words) in order of preference. LangChain's RecursiveCharacterTextSplitter uses this: try "\n\n" first, then "\n", then " ", then "".
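The same idea in miniature: a simplified sketch in the spirit of LangChain's RecursiveCharacterTextSplitter, not the library's actual code. It tries the coarsest delimiter first, merges pieces greedily up to the size limit, and recurses with finer delimiters on anything still too long:

```python
# Simplified recursive splitter: coarse delimiters first, then finer ones.
def recursive_split(text: str, max_len: int = 200,
                    seps: tuple = ("\n\n", "\n", " ", "")) -> list[str]:
    if len(text) <= max_len:
        return [text]
    sep = next(s for s in seps if s in text)  # "" always matches as last resort
    if sep == "":
        # No delimiter left: hard character cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    rest = seps[seps.index(sep) + 1:]
    chunks, buf = [], ""
    for part in text.split(sep):
        candidate = buf + sep + part if buf else part
        if len(candidate) <= max_len:
            buf = candidate           # still fits: keep merging
        else:
            if buf:
                chunks.append(buf)    # flush the full chunk
            buf = part
    if buf:
        chunks.append(buf)
    # Recurse with finer delimiters on any piece that is still too long.
    out = []
    for c in chunks:
        out.extend(recursive_split(c, max_len, rest) if len(c) > max_len else [c])
    return out
```

The payoff over fixed-size: paragraph and sentence boundaries survive whenever the size budget allows.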
Semantic chunking: Embed sentences. Cluster sentences with high embedding similarity. When similarity drops, split. Respects meaning, not delimiters.
Proposition chunking: Extract atomic facts from text using an LLM ("Claude, list 5 propositions from this paragraph"). Group propositions into chunks. Highest precision for factual retrieval.
Semantic: Articles, essays, long-form content where meaning changes smoothly. Works well with embedding similarity metrics.
Proposition: Factual documents (research papers, specifications, regulatory text). Highest precision for specific facts. More expensive (LLM calls).
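Semantic chunking reduces to one decision per sentence pair: does similarity hold, or did it drop? A sketch with Jaccard word overlap standing in for cosine similarity over real sentence embeddings; the structure is identical with a real model:

```python
# Semantic chunking sketch: start a new chunk where similarity between
# consecutive sentences drops below a threshold. jaccard() is a toy
# stand-in for embedding cosine similarity.
def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if jaccard(prev, cur) >= threshold:
            chunks[-1].append(cur)   # similar enough: same chunk
        else:
            chunks.append([cur])     # similarity dropped: new chunk
    return chunks
```

The O(n²) cost cited in the table comes from more thorough variants that compare windows of sentences rather than only adjacent pairs; this adjacent-pair version is O(n) in similarity calls.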
Create small chunks for retrieval (~300 tokens). Link each to a parent chunk (larger context, ~1200 tokens). Retriever fetches small chunks, expands to parent only if needed.
Benefit: Fast retrieval (search on small chunks, low noise). Flexible expansion (fetch parent for full context). Cheaper than semantic chunking (no embedding calls needed at split time).
Long documents (books, theses). Multi-section reports. Code repositories. Context matters: small chunk pins retrieval, parent chunk provides full function/section/chapter.
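The parent-child structure is just a back-reference. A minimal sketch, splitting by characters for simplicity (the 1200/300 sizes mirror the token budgets above but are illustrative):

```python
# Hierarchical chunking sketch: small child chunks carry a reference to
# their larger parent. Retrieval searches children; expand() swaps a hit
# for its full parent when more context is needed.
def build_hierarchy(text: str, parent_size: int = 1200, child_size: int = 300):
    parents, children = [], []
    for p_start in range(0, len(text), parent_size):
        parent = text[p_start:p_start + parent_size]
        pid = len(parents)
        parents.append(parent)
        for c_start in range(0, len(parent), child_size):
            children.append({"text": parent[c_start:c_start + child_size],
                             "parent_id": pid})
    return parents, children

def expand(child: dict, parents: list[str]) -> str:
    # Follow the back-reference from a retrieved child to its parent.
    return parents[child["parent_id"]]
```

In practice only the child texts are embedded and indexed; parents live in a plain document store keyed by `parent_id`, so expansion is a cheap lookup rather than a vector search.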
Raw chunks are orphaned. Attach metadata: document title, source section, page number, creation date, author. This context helps the LLM assess relevance and trust.
Add chunk summaries (1–2 sentences). Extract named entities. Compute TF-IDF relevance. These signals help ranking and reranking stages.
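Concretely, an enriched chunk is just the text plus a metadata record. A sketch with illustrative field names (use whatever schema your vector store supports); the truncated summary is a placeholder where a real pipeline would call an LLM:

```python
# Attach provenance metadata and a summary to a raw chunk.
# Field names are illustrative, not a required schema.
def enrich(chunk_text: str, doc_title: str, section: str, page: int) -> dict:
    return {
        "text": chunk_text,
        "metadata": {
            "title": doc_title,
            "section": section,
            "page": page,
            # Placeholder summary: a real system generates 1-2 sentences
            # with an LLM at ingestion time.
            "summary": chunk_text[:80],
        },
    }
```

At retrieval time these fields do double duty: they are filterable (restrict search by section or date) and they travel into the prompt so the LLM can cite and weigh its sources.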
RAGAS framework: Measures context relevance, answer relevance, faithfulness. Run on held-out questions to benchmark chunking strategies.
Chunk size ablation: Vary chunk sizes (128, 256, 512, 1024 tokens). Track retrieval precision and LLM answer quality. Find your system's sweet spot.
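The ablation loop can be sketched with a toy proxy metric: a question counts as a hit only if its answer string survives intact inside a single chunk. A real evaluation would score retrieval precision or run RAGAS, but the loop structure is the same:

```python
# Chunk-size ablation sketch. The "score" is a toy proxy: the fraction of
# questions whose answer string appears unbroken in some chunk. Swap in
# real retrieval + RAGAS metrics for production use.
def ablate(corpus: str, questions: list[dict],
           sizes: tuple = (128, 256, 512, 1024)) -> dict[int, float]:
    results = {}
    for size in sizes:
        chunks = [corpus[i:i + size] for i in range(0, len(corpus), size)]
        hits = sum(any(q["answer"] in c for c in chunks) for q in questions)
        results[size] = hits / len(questions)
    return results
```

Even this crude proxy exposes the failure mode from the manual-inspection tip above: small sizes lose points when answers are cut at chunk boundaries.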
Chunking is the first decision in every RAG system and has outsized impact on quality. Master it before optimising retrieval:
LangChain's RecursiveCharacterTextSplitter with chunk_size=512, chunk_overlap=64 is a robust default. Use it as your baseline before trying anything fancier.
Inspect retrieved chunks manually for a sample of queries. If the key information is always split across chunk boundaries, you need smaller chunks or a smarter splitter.
Embed small child chunks for precision; fetch their parent chunk for context. LlamaIndex's HierarchicalNodeParser implements this pattern. Best for books, contracts, and long reports.
Split where the embedding similarity between consecutive sentences drops sharply. Computationally expensive but produces the most coherent chunks for heterogeneous documents.