RAG · Data Processing

Chunking Strategies for RAG

How you split documents shapes retrieval quality — fixed, semantic, and hierarchical approaches compared

6 strategies
9 sections
Python-first code examples
Contents
  1. Why chunking matters
  2. 6 strategies compared
  3. Fixed & recursive
  4. Semantic & proposition
  5. Hierarchical chunks
  6. Metadata enrichment
  7. Evaluating chunks
  8. Tools & libraries
  9. Learning path
01 — Foundation

Why Chunking Matters

In RAG systems, the chunk is the retrieval unit. A query retrieves chunks, not full documents. How you split determines what context the LLM sees — and thus output quality.

Too large: Chunks > 2000 tokens become noise-heavy. The LLM must sift through irrelevant sentences to find the signal. Retrieval precision drops.

Too small: Chunks < 100 tokens lose context. A single proposition isolated from surrounding sentences becomes ambiguous. The LLM must chain many chunks together to understand.

The golden chunk is ~300–500 tokens, semantically cohesive, dense with signal, and self-contained enough that an LLM can understand it without external context.

💡 Key insight: Chunking is a retrieval problem first, not a storage problem. Optimize for what the retriever can find and what the LLM can use, not for convenience of splitting.

The Retrieval Cascade

User query → Embed → Search vector space → Retrieve top-k chunks → Augment prompt → LLM answer. Each stage depends on chunk quality. A coherent passage split into seven fragments consumes seven of your top-k slots; a 10,000-token chunk with 9,800 tokens of noise poisons the context window.
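The cascade above can be sketched in a few lines. Here `embed`, `vector_index`, and `llm` are hypothetical stand-ins for your embedding model, vector store, and generation call, not any particular library's API:

```python
def answer(query: str, vector_index, embed, llm, k: int = 5) -> str:
    """Minimal sketch of the retrieval cascade."""
    query_vec = embed(query)                       # 1. embed the query
    chunks = vector_index.search(query_vec, k=k)   # 2. retrieve top-k chunks
    context = "\n\n".join(c.text for c in chunks)  # 3. augment the prompt
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)                             # 4. generate the answer
```

Every design decision in this article lives in step 2: what a "chunk" is determines what the search can return.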

02 — Comparison

Chunking Strategies Compared

| Strategy     | Granularity                              | Overlap  | Compute    | Best For                                      |
|--------------|------------------------------------------|----------|------------|-----------------------------------------------|
| Fixed-size   | Uniform tokens/chars                     | Optional | O(n)       | Simple baseline, fast ingestion               |
| Recursive    | Splits on delimiters ("\n\n", "\n", ". ")| Optional | O(n log n) | Markdown, code, docs                          |
| Semantic     | Embedding-based boundaries               | Yes      | O(n²)      | Long-form articles, narratives                |
| Proposition  | Atomic facts via LLM                     | No       | O(n · k)   | Factual precision, QA                         |
| Hierarchical | Small chunks + parent refs               | No       | O(n + k)   | Fast retrieval + cheap full-context expansion |
| Agentic      | Task-dependent splits                    | No       | O(n · m)   | Multi-step reasoning, edge cases              |

Cost vs. Quality Tradeoff

Fixed & Recursive: Cheapest. No embeddings needed. Works well for structured text (code, markdown). Blind to semantics — may split meaningful units.

Semantic & Proposition: Expensive. Require embedding or LLM calls. Semantic respects meaning boundaries. Proposition extracts facts — highest precision for factual retrieval.

Hierarchical: Moderate cost. Layered retrieval: search over small chunks is fast and low-noise, with optional expansion to the parent when full context is needed. Scales well.

03 — Classic

Fixed & Recursive Chunking

Fixed-size: Split every N tokens or M characters. Simple. Predictable. Blind to boundaries.

Recursive: Split on delimiters (paragraphs, sentences, words) in order of preference. LangChain's RecursiveCharacterTextSplitter uses this: try "\n\n" first, then "\n", then " ", then "".

Python: Fixed-Size Splitting

from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,    # characters (use TokenTextSplitter for tokens)
    chunk_overlap=50,  # overlap for context
)
chunks = splitter.split_text(document)
# Overlap helps: a fact split across a boundary
# is recoverable in both chunks

Python: Recursive Splitting

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=[
        "\n\n",  # paragraph
        "\n",    # line
        ". ",    # sentence
        " ",     # word
        "",      # character (fallback)
    ],
)
chunks = splitter.split_text(markdown_doc)
# Respects doc structure. Better for code/markdown.

Best Practices

Do This

  • Use 300–600 token chunks
  • Add 50–100 token overlap
  • Recursive for markdown/code
  • Track chunk boundaries

Avoid

  • Fixed-size for natural text
  • No overlap (boundary loss)
  • Chunks > 2000 tokens
  • Ignoring document structure
04 — Intelligent

Semantic & Proposition Chunking

Semantic chunking: Embed sentences. Cluster sentences with high embedding similarity. When similarity drops, split. Respects meaning, not delimiters.

Proposition chunking: Extract atomic facts from text using an LLM ("Claude, list 5 propositions from this paragraph"). Group propositions into chunks. Highest precision for factual retrieval.

Python: Semantic Chunking

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = text.split('. ')
# Normalize so the dot product equals cosine similarity
embeddings = model.encode(sentences, normalize_embeddings=True)

# Split where similarity between consecutive sentences drops
breakpoints = []
threshold = 0.5
for i in range(len(embeddings) - 1):
    sim = np.dot(embeddings[i], embeddings[i + 1])
    if sim < threshold:
        breakpoints.append(i)

chunks = []
start = 0
for bp in breakpoints:
    chunks.append('. '.join(sentences[start:bp + 1]))
    start = bp + 1
chunks.append('. '.join(sentences[start:]))  # don't drop the final chunk

Python: Proposition Extraction

from anthropic import Anthropic

client = Anthropic()

def extract_propositions(text: str) -> list[str]:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Extract 3-5 atomic propositions
(subject-predicate-object facts) from this text:

{text}""",
        }],
    )
    # Parse response: one proposition per line, skip blanks
    return [p for p in response.content[0].text.split('\n') if p.strip()]

# Group propositions into chunks of three
props = extract_propositions(document)
chunks = ['\n'.join(props[i:i + 3]) for i in range(0, len(props), 3)]

When to Use

Semantic: Articles, essays, long-form content where meaning changes smoothly. Works well with embedding similarity metrics.

Proposition: Factual documents (research papers, specifications, regulatory text). Highest recall for specific facts. More expensive (LLM calls).

05 — Advanced

Hierarchical & Parent-Child Chunking

Create small chunks for retrieval (~300 tokens). Link each to a parent chunk (larger context, ~1200 tokens). Retriever fetches small chunks, expands to parent only if needed.

Benefit: Fast retrieval (search over small chunks, low noise). Flexible expansion (fetch the parent for full context). Cheaper to build than semantic chunking (no per-boundary embedding or LLM calls at split time).

Python: Parent-Child with LlamaIndex

from llama_index.core.node_parser import HierarchicalNodeParser

parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128],  # parent, child, grandchild
)
nodes = parser.get_nodes_from_documents(documents)
# Each node keeps a relationship to its parent node
# (NodeRelationship.PARENT), so the retriever can fetch
# small leaf nodes and expand to the parent for context

Python: Parent-Document Retriever

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Small chunks for search, large parents for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=256)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2048)

store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,  # any LangChain vector store, for child embeddings
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)
# Search runs over small child chunks;
# results come back as parent documents (full context)
results = retriever.invoke("query")

When Hierarchical Shines

Long documents (books, theses). Multi-section reports. Code repositories. Context matters: small chunk pins retrieval, parent chunk provides full function/section/chapter.

06 — Context

Metadata & Enrichment

Raw chunks are orphaned. Attach metadata: document title, source section, page number, creation date, author. This context helps the LLM assess relevance and trust.

Metadata Best Practices

Essential

  • Source document name
  • Section or heading
  • Page number
  • Document type

+ Helpful

  • Chunk summary
  • Key entities
  • Timestamp
  • Confidence score

Python: Attach Metadata

from langchain_core.documents import Document

docs_with_meta = []
for chunk in chunks:
    doc = Document(
        page_content=chunk,
        metadata={
            "source": "research_paper_v2.pdf",
            "section": "Methods",
            "page": 3,
            "type": "academic",
            "document_id": "paper-001",
            "created": "2024-01-15",
        },
    )
    docs_with_meta.append(doc)

# The retriever preserves metadata; include it in the prompt:
# "From {source} ({section}):"

Enrichment: Summaries & Keywords

Add chunk summaries (1–2 sentences). Extract named entities. Compute TF-IDF relevance. These signals help ranking and reranking stages.

07 — Measurement

Evaluating Chunks

Metrics to Track

📊 Retrieval

  • Chunk recall @ k
  • Precision (noise %)
  • MRR (rank)

🎯 Quality

  • Answer EM/F1
  • Context relevance
  • Factuality
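The retrieval metrics above are straightforward to compute once you have relevance judgments per query; a minimal sketch (function names are ours):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for cid in retrieved[:k] if cid in relevant)
    return hits / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk (0 if none retrieved)."""
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0
```

Average these over a held-out query set; precision at k is the complement of the noise percentage in the retrieved list.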

RAGAS framework: Measures context relevance, answer relevance, faithfulness. Run on held-out questions to benchmark chunking strategies.

Chunk size ablation: Vary chunk sizes (128, 256, 512, 1024 tokens). Track retrieval precision and LLM answer quality. Find your system's sweet spot.
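An ablation harness might look like the sketch below; `build_index`, `retrieve`, and `eval_questions` are hypothetical hooks into your own pipeline, and character slices stand in for token windows to keep the example self-contained:

```python
def ablate_chunk_sizes(document, eval_questions, build_index, retrieve,
                       sizes=(128, 256, 512, 1024), k=5):
    """Map each chunk size to mean retrieval precision on eval_questions."""
    results = {}
    for size in sizes:
        # Naive fixed-size split (characters as a stand-in for tokens)
        chunks = [document[i:i + size] for i in range(0, len(document), size)]
        index = build_index(chunks)
        scores = []
        for question, relevant_span in eval_questions:
            retrieved = retrieve(index, question, k=k)
            # precision: fraction of retrieved chunks containing the answer span
            hits = sum(1 for c in retrieved if relevant_span in c)
            scores.append(hits / len(retrieved))
        results[size] = sum(scores) / len(scores)
    return results
```

Plot the resulting size-to-precision curve alongside answer quality; the two often peak at different sizes, which is exactly the tradeoff hierarchical chunking exploits.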

08 — Ecosystem

Tools & Libraries

Framework
LangChain
RecursiveCharacterTextSplitter, CharacterTextSplitter, TokenTextSplitter. Most widely used.
Framework
LlamaIndex
HierarchicalNodeParser, SemanticSplitter. Integrates with vector stores.
Parsing
Unstructured
PDF, DOCX, PPTX parsing. Preserves structure. Smart chunking.
Chunking
Chonkie
Fast semantic chunking. Embedding-based boundaries.
Chunking
Semantic Chunker
SOTA semantic splitting. Embedding similarity.
Parsing
Docling
IBM's document parser. Tables, layout preservation.
NLP
NLTK
Sentence tokenization. Baseline for segmentation.
NLP
spaCy
Fast NLP. Sentence splitting, entity extraction.
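As a sentence-segmentation baseline, spaCy's rule-based sentencizer runs without downloading a statistical model; a minimal sketch:

```python
import spacy

# Blank English pipeline + rule-based sentence splitter (no model download)
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def split_sentences(text: str) -> list[str]:
    """Split text into sentences as candidate units for chunking."""
    return [sent.text.strip() for sent in nlp(text).sents]
```

Sentence lists like this feed directly into the semantic-chunking loop from section 04, replacing the naive `text.split('. ')`.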
09 — Further Reading

Learning Path

Chunking is the first decision in every RAG system and has outsized impact on quality. Master it before optimising retrieval:

Fixed-size: simplest baseline
Sentence / Para: semantic boundaries
Recursive: hierarchy-aware
Semantic: embedding-based split
Agentic: LLM-directed
1. Start with 512-token recursive chunking

LangChain's RecursiveCharacterTextSplitter with chunk_size=512, overlap=64 is a robust default. Use it as your baseline before trying anything fancier.

2. Measure chunk quality, not just pipeline quality

Inspect retrieved chunks manually for a sample of queries. If the key information is always split across chunk boundaries, you need smaller chunks or a smarter splitter.

3. Use parent-child for long documents

Embed small child chunks for precision; fetch their parent chunk for context. LlamaIndex's HierarchicalNodeParser implements this. Best for books, contracts, and long reports.

4. Try semantic chunking for unstructured text

Split where the embedding similarity between consecutive sentences drops sharply. Computationally expensive but produces the most coherent chunks for heterogeneous documents.