RAG · Data Processing

Chunking Strategies for RAG

How you split documents shapes retrieval quality — fixed, semantic, and hierarchical approaches compared

6 strategies
9 sections
Python-first code examples
Contents
  1. Why chunking matters
  2. 6 strategies compared
  3. Fixed & recursive
  4. Semantic & proposition
  5. Hierarchical chunks
  6. Metadata enrichment
  7. Evaluating chunks
  8. Tools & libraries
  9. Learning path
01 — Foundation

Why Chunking Matters

In RAG systems, the chunk is the retrieval unit. A query retrieves chunks, not full documents. How you split determines what context the LLM sees — and thus output quality.

Too large: Chunks > 2000 tokens become noise-heavy. The LLM must sift through irrelevant sentences to find the signal. Retrieval precision drops.

Too small: Chunks < 100 tokens lose context. A single proposition isolated from surrounding sentences becomes ambiguous. The LLM must chain many chunks together to understand.

The golden chunk is ~300–500 tokens, semantically cohesive, dense with signal, and self-contained enough that an LLM can understand it without external context.

💡 Key insight: Chunking is a retrieval problem first, not a storage problem. Optimize for what the retriever can find and what the LLM can use, not for convenience of splitting.

The Retrieval Cascade

User query → Embed → Search vector space → Retrieve top-k chunks → Augment prompt → LLM answer. Each stage depends on chunk quality. A coherent passage split into seven fragments consumes seven of your top-k slots; a 10,000-token chunk with 9,800 tokens of noise poisons the context window.
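The cascade above can be sketched in a few lines. Here `embed`, `vector_index`, and `llm` are hypothetical stand-ins for your embedding model, vector store, and generation call, not any particular library's API:

```python
def answer(query: str, vector_index, embed, llm, k: int = 5) -> str:
    """Minimal sketch of the retrieval cascade."""
    query_vec = embed(query)                       # 1. embed the query
    chunks = vector_index.search(query_vec, k=k)   # 2. retrieve top-k chunks
    context = "\n\n".join(c.text for c in chunks)  # 3. augment the prompt
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)                             # 4. generate the answer
```

Every design decision in this article lives in step 2: what a "chunk" is determines what the search can return.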

02 — Comparison

Chunking Strategies Compared

| Strategy     | Granularity                              | Overlap  | Compute    | Best For                                      |
|--------------|------------------------------------------|----------|------------|-----------------------------------------------|
| Fixed-size   | Uniform tokens/chars                     | Optional | O(n)       | Simple baseline, fast ingestion               |
| Recursive    | Splits on delimiters ("\n\n", "\n", ". ")| Optional | O(n log n) | Markdown, code, docs                          |
| Semantic     | Embedding-based boundaries               | Yes      | O(n²)      | Long-form articles, narratives                |
| Proposition  | Atomic facts via LLM                     | No       | O(n · k)   | Factual precision, QA                         |
| Hierarchical | Small chunks + parent refs               | No       | O(n + k)   | Fast retrieval + cheap full-context expansion |
| Agentic      | Task-dependent splits                    | No       | O(n · m)   | Multi-step reasoning, edge cases              |

Cost vs. Quality Tradeoff

Fixed & Recursive: Cheapest. No embeddings needed. Works well for structured text (code, markdown). Blind to semantics — may split meaningful units.

Semantic & Proposition: Expensive. Require embedding or LLM calls. Semantic respects meaning boundaries. Proposition extracts facts — highest precision for factual retrieval.

Hierarchical: Moderate cost. Layered retrieval: search over small chunks is fast and low-noise, with optional expansion to the parent when full context is needed. Scales well.

03 — Classic

Fixed & Recursive Chunking

Fixed-size: Split every N tokens or M characters. Simple. Predictable. Blind to boundaries.

Recursive: Split on delimiters (paragraphs, sentences, words) in order of preference. LangChain's RecursiveCharacterTextSplitter uses this: try "\n\n" first, then "\n", then " ", then "".

Python: Fixed-Size Splitting

from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,    # characters (use TokenTextSplitter for tokens)
    chunk_overlap=50,  # overlap for context
)
chunks = splitter.split_text(document)
# Overlap helps: a fact split across a boundary
# is recoverable in both chunks

Python: Recursive Splitting

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=[
        "\n\n",  # paragraph
        "\n",    # line
        ". ",    # sentence
        " ",     # word
        "",      # character (fallback)
    ],
)
chunks = splitter.split_text(markdown_doc)
# Respects doc structure. Better for code/markdown.

Best Practices

Do This

  • Use 300–600 token chunks
  • Add 50–100 token overlap
  • Recursive for markdown/code
  • Track chunk boundaries

Avoid

  • Fixed-size for natural text
  • No overlap (boundary loss)
  • Chunks > 2000 tokens
  • Ignoring document structure
04 — Intelligent

Semantic & Proposition Chunking

Semantic chunking: Embed sentences. Cluster sentences with high embedding similarity. When similarity drops, split. Respects meaning, not delimiters.

Proposition chunking: Extract atomic facts from text using an LLM ("Claude, list 5 propositions from this paragraph"). Group propositions into chunks. Highest precision for factual retrieval.

Python: Semantic Chunking

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = text.split('. ')
# Normalize so the dot product equals cosine similarity
embeddings = model.encode(sentences, normalize_embeddings=True)

# Split where similarity between consecutive sentences drops
breakpoints = []
threshold = 0.5
for i in range(len(embeddings) - 1):
    sim = np.dot(embeddings[i], embeddings[i + 1])
    if sim < threshold:
        breakpoints.append(i)

chunks = []
start = 0
for bp in breakpoints:
    chunks.append('. '.join(sentences[start:bp + 1]))
    start = bp + 1
chunks.append('. '.join(sentences[start:]))  # don't drop the final chunk

Python: Proposition Extraction

from anthropic import Anthropic

client = Anthropic()

def extract_propositions(text: str) -> list[str]:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Extract 3-5 atomic propositions
(subject-predicate-object facts) from this text:

{text}""",
        }],
    )
    # Parse response: one proposition per line, skip blanks
    return [p for p in response.content[0].text.split('\n') if p.strip()]

# Group propositions into chunks of three
props = extract_propositions(document)
chunks = ['\n'.join(props[i:i + 3]) for i in range(0, len(props), 3)]

When to Use

Semantic: Articles, essays, long-form content where meaning changes smoothly. Works well with embedding similarity metrics.

Proposition: Factual documents (research papers, specifications, regulatory text). Highest recall for specific facts. More expensive (LLM calls).

05 — Advanced

Hierarchical & Parent-Child Chunking

Create small chunks for retrieval (~300 tokens). Link each to a parent chunk (larger context, ~1200 tokens). Retriever fetches small chunks, expands to parent only if needed.

Benefit: Fast retrieval (search over small chunks, low noise). Flexible expansion (fetch the parent for full context). Cheaper to build than semantic chunking (no per-boundary embedding or LLM calls at split time).

Python: Parent-Child with LlamaIndex

from llama_index.core.node_parser import HierarchicalNodeParser

parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128],  # parent, child, grandchild
)
nodes = parser.get_nodes_from_documents(documents)
# Each node keeps a relationship to its parent node
# (NodeRelationship.PARENT), so the retriever can fetch
# small leaf nodes and expand to the parent for context

Python: Parent-Document Retriever

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Small chunks for search, large parents for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=256)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2048)

store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,  # any LangChain vector store, for child embeddings
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)
# Search runs over small child chunks;
# results come back as parent documents (full context)
results = retriever.invoke("query")

When Hierarchical Shines

Long documents (books, theses). Multi-section reports. Code repositories. Context matters: small chunk pins retrieval, parent chunk provides full function/section/chapter.

06 — Context

Metadata & Enrichment

Raw chunks are orphaned. Attach metadata: document title, source section, page number, creation date, author. This context helps the LLM assess relevance and trust.

Metadata Best Practices

Essential

  • Source document name
  • Section or heading
  • Page number
  • Document type

+ Helpful

  • Chunk summary
  • Key entities
  • Timestamp
  • Confidence score

Python: Attach Metadata

from langchain_core.documents import Document

docs_with_meta = []
for chunk in chunks:
    doc = Document(
        page_content=chunk,
        metadata={
            "source": "research_paper_v2.pdf",
            "section": "Methods",
            "page": 3,
            "type": "academic",
            "document_id": "paper-001",
            "created": "2024-01-15",
        },
    )
    docs_with_meta.append(doc)

# The retriever preserves metadata; include it in the prompt:
# "From {source} ({section}):"

Enrichment: Summaries & Keywords

Add chunk summaries (1–2 sentences). Extract named entities. Compute TF-IDF relevance. These signals help ranking and reranking stages.

07 — Measurement

Evaluating Chunks

Metrics to Track

📊 Retrieval

  • Chunk recall @ k
  • Precision (noise %)
  • MRR (rank)

🎯 Quality

  • Answer EM/F1
  • Context relevance
  • Factuality
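The retrieval metrics above are straightforward to compute once you have relevance judgments per query; a minimal sketch (function names are ours):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for cid in retrieved[:k] if cid in relevant)
    return hits / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk (0 if none retrieved)."""
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0
```

Average these over a held-out query set; precision at k is the complement of the noise percentage in the retrieved list.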

RAGAS framework: Measures context relevance, answer relevance, faithfulness. Run on held-out questions to benchmark chunking strategies.

Chunk size ablation: Vary chunk sizes (128, 256, 512, 1024 tokens). Track retrieval precision and LLM answer quality. Find your system's sweet spot.
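An ablation harness might look like the sketch below; `build_index`, `retrieve`, and `eval_questions` are hypothetical hooks into your own pipeline, and character slices stand in for token windows to keep the example self-contained:

```python
def ablate_chunk_sizes(document, eval_questions, build_index, retrieve,
                       sizes=(128, 256, 512, 1024), k=5):
    """Map each chunk size to mean retrieval precision on eval_questions."""
    results = {}
    for size in sizes:
        # Naive fixed-size split (characters as a stand-in for tokens)
        chunks = [document[i:i + size] for i in range(0, len(document), size)]
        index = build_index(chunks)
        scores = []
        for question, relevant_span in eval_questions:
            retrieved = retrieve(index, question, k=k)
            # precision: fraction of retrieved chunks containing the answer span
            hits = sum(1 for c in retrieved if relevant_span in c)
            scores.append(hits / len(retrieved))
        results[size] = sum(scores) / len(scores)
    return results
```

Plot the resulting size-to-precision curve alongside answer quality; the two often peak at different sizes, which is exactly the tradeoff hierarchical chunking exploits.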

08 — Ecosystem

Tools & Libraries

Framework
LangChain
RecursiveCharacterTextSplitter, CharacterTextSplitter, TokenTextSplitter. Most widely used.
Framework
LlamaIndex
HierarchicalNodeParser, SemanticSplitter. Integrates with vector stores.
Parsing
Unstructured
PDF, DOCX, PPTX parsing. Preserves structure. Smart chunking.
Chunking
Chonkie
Fast semantic chunking. Embedding-based boundaries.
Chunking
Semantic Chunker
SOTA semantic splitting. Embedding similarity.
Parsing
Docling
IBM's document parser. Tables, layout preservation.
NLP
NLTK
Sentence tokenization. Baseline for segmentation.
NLP
spaCy
Fast NLP. Sentence splitting, entity extraction.
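As a sentence-segmentation baseline, spaCy's rule-based sentencizer runs without downloading a statistical model; a minimal sketch:

```python
import spacy

# Blank English pipeline + rule-based sentence splitter (no model download)
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def split_sentences(text: str) -> list[str]:
    """Split text into sentences as candidate units for chunking."""
    return [sent.text.strip() for sent in nlp(text).sents]
```

Sentence lists like this feed directly into the semantic-chunking loop from section 04, replacing the naive `text.split('. ')`.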
09 — Further Reading

Learning Path

Chunking is the first decision in every RAG system and has outsized impact on quality. Master it before optimising retrieval:

Fixed-size: simplest baseline
Sentence / Para: semantic boundaries
Recursive: hierarchy-aware
Semantic: embedding-based split
Agentic: LLM-directed
1. Start with 512-token recursive chunking

LangChain's RecursiveCharacterTextSplitter with chunk_size=512, overlap=64 is a robust default. Use it as your baseline before trying anything fancier.

2. Measure chunk quality, not just pipeline quality

Inspect retrieved chunks manually for a sample of queries. If the key information is always split across chunk boundaries, you need smaller chunks or a smarter splitter.

3. Use parent-child for long documents

Embed small child chunks for precision; fetch their parent chunk for context. LlamaIndex's HierarchicalNodeParser implements this. Best for books, contracts, and long reports.

4. Try semantic chunking for unstructured text

Split where the embedding similarity between consecutive sentences drops sharply. Computationally expensive but produces the most coherent chunks for heterogeneous documents.