Chunking Strategies

Parent-Child Chunking

Store large parent chunks and small child chunks. Retrieve small child chunks for precision, but return their parent chunks to the LLM for full context. The Goldilocks strategy: embedding precision without context poverty.

Small chunks for embedding. Large parents for synthesis. The best of both worlds.

SECTION 01

The parent-child idea

Parent-child chunking (also called hierarchical chunking or small-to-big retrieval) decouples the retrieval unit from the synthesis unit — just like sentence windows, but applied to larger document structures.

The problem with fixed-size chunks: Large chunks (1024 tokens) give LLMs good context but embed poorly — the embedding averages over too many concepts and matches imprecisely. Small chunks (128 tokens) embed precisely but leave the LLM with too little context to synthesise a good answer.

The solution: Create two chunk sizes. Small "child" chunks (128–256 tokens) are embedded and indexed for retrieval — they point sharply at specific facts. Large "parent" chunks (1024–2048 tokens) are stored separately. When a child chunk is retrieved, the system looks up its parent and returns the parent chunk to the LLM instead.
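The core mechanic can be sketched in a few lines of plain Python. This is a toy illustration, not a real pipeline: word overlap stands in for embedding similarity, and the `parents`/`children` data is made up for the example — but the key move is visible: match the query against the small child chunks, then return the large parent chunk.

```python
# Parent chunks: large spans stored for synthesis.
parents = {
    "p1": "Self-attention lets every position attend to every other position. "
          "Scores come from query-key dot products, scaled and softmaxed.",
}

# Child chunks: small spans indexed for retrieval, each pointing at its parent.
children = {
    "c1": {"parent_id": "p1", "text": "query-key dot products"},
    "c2": {"parent_id": "p1", "text": "scaled and softmaxed scores"},
}

def retrieve_parent(query: str) -> str:
    def overlap(a: str, b: str) -> int:
        return len(set(a.lower().split()) & set(b.lower().split()))
    # Match the query against the small, precise child chunks...
    best = max(children.values(), key=lambda c: overlap(query, c["text"]))
    # ...but hand the large parent chunk to the LLM.
    return parents[best["parent_id"]]

print(retrieve_parent("how are dot products used?"))
```

In a real system the `overlap` function is replaced by vector similarity search, and the two dicts become a vector store and a document store — but the child-retrieve/parent-return indirection is exactly this.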

SECTION 02

LlamaIndex implementation

from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.storage.docstore import SimpleDocumentStore

# Create hierarchical nodes: 2048 -> 512 -> 128 token chunks
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]  # parent -> intermediate -> leaf
)

from llama_index.core import SimpleDirectoryReader  # loads files from a directory
documents = SimpleDirectoryReader("./docs").load_data()
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)  # only the 128-token chunks

# Index: embed only leaf nodes, but store all nodes
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)

storage_context = StorageContext.from_defaults(docstore=docstore)
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# AutoMergingRetriever: if most children of a parent are retrieved, return the parent
retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=10),
    storage_context=storage_context,
    simple_ratio_thresh=0.3,  # if 30% of children are retrieved, merge to parent
)

nodes = retriever.retrieve("What are the key components of attention?")  # returns merged nodes

SECTION 03

LangChain parent document retriever

from langchain.retrievers import ParentDocumentRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Parent splitter: larger chunks for synthesis
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
# Child splitter: smaller chunks for retrieval
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)

vectorstore = Chroma(embedding_function=OpenAIEmbeddings())
docstore = InMemoryStore()  # stores parent chunks

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add documents — splits into children for embedding, stores parents
# (`documents` is a list of LangChain Document objects loaded earlier)
retriever.add_documents(documents)

# At retrieval time: query hits child embeddings → returns parent chunks
results = retriever.invoke("How does attention work in transformers?")
for r in results:
    print(f"Parent chunk: {len(r.page_content)} chars")
    print(r.page_content[:300])

SECTION 04

Hierarchical chunk design

Three-level hierarchies (for example 2048 → 512 → 128 tokens, or sentence → paragraph → section) give even more flexibility: queries match at the finest level, while the returned context can scale up through the hierarchy as needed.

LlamaIndex's AutoMergingRetriever handles this automatically: if a threshold of sibling chunks are retrieved, it replaces them with their parent. This is the "auto-merging" behaviour — the system automatically scales up the context when many related passages are retrieved.
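The merge rule itself is simple. Here is a minimal sketch of the ratio logic — the function name, the `hierarchy` data, and the 0.5 threshold are illustrative, not LlamaIndex's internals:

```python
def auto_merge(retrieved_ids, parent_to_children, thresh=0.5):
    """Replace sibling leaf chunks with their parent when enough of them
    were retrieved (a sketch of the auto-merging ratio rule)."""
    retrieved = set(retrieved_ids)
    merged, consumed = [], set()
    for parent, kids in parent_to_children.items():
        hits = retrieved & set(kids)
        if kids and len(hits) / len(kids) >= thresh:
            merged.append(parent)   # enough siblings hit: upgrade to parent
            consumed |= hits
    # Keep any retrieved chunks whose parents didn't cross the threshold.
    merged.extend(c for c in retrieved_ids if c not in consumed)
    return merged

hierarchy = {"sec1": ["a", "b", "c"], "sec2": ["d", "e", "f"]}
print(auto_merge(["a", "b", "d"], hierarchy))  # ['sec1', 'd']
```

With two of sec1's three children retrieved (2/3 ≥ 0.5), they merge into sec1; the lone hit in sec2 stays a leaf.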

SECTION 05

Storage architecture

# Two separate stores needed:
# 1. Vector store: child chunk embeddings (Chroma, Pinecone, Weaviate)
# 2. Document store: parent chunk text (InMemoryStore, Redis, MongoDB)

# The child chunk stores a reference to its parent:
{
  "chunk_id": "doc1_chunk_023",
  "parent_id": "doc1_section_003",   # pointer to parent
  "embedding": [...],                 # the embedded vector
  "text": "attention scores are computed by...",  # child text (may not be needed post-retrieval)
}

# Parent chunk in docstore:
{
  "section_id": "doc1_section_003",
  "text": "Self-attention allows each position in the sequence to attend to all positions in the previous layer... [~1024 tokens]",
  "children": ["doc1_chunk_021", "doc1_chunk_022", "doc1_chunk_023"],
}

# Retrieval: vector search on children → lookup parent → return parent text to LLM

SECTION 06

When it outperforms alternatives

Parent-child retrieval consistently outperforms both fixed-size and sentence-window on benchmarks that test answer comprehensiveness — questions that require integrating information from multiple nearby sentences. It's particularly strong for: technical documentation where precise sub-section retrieval + full section context is needed, legal and compliance documents where narrowly-scoped facts must be returned with full surrounding context, and multi-hop reasoning chains where small child chunks enable precise retrieval of each reasoning step.

It's roughly equivalent to sentence-window for simple factoid retrieval but significantly better for complex synthesis tasks.

SECTION 07

Gotchas

Storage complexity: Requires two storage systems (vector store + document store). Simple deployments can use an in-memory docstore, but production needs persistent storage. This adds operational complexity vs single-store approaches.

Parent size calibration: Parents that are too large waste LLM context window; too small don't provide enough synthesis context. Profile your documents — count average tokens per section — and set parent chunk size to cover 2–3 typical sections.
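The profiling step can be as simple as the sketch below. The ~4-characters-per-token ratio is a rough heuristic (a real tokenizer such as tiktoken is more accurate), and the function name and sample data are illustrative:

```python
def suggest_parent_size(sections, sections_per_parent=2):
    """Rough heuristic: estimate tokens per section (~4 chars/token) and
    size parents to cover 2-3 typical sections."""
    avg_tokens = sum(len(s) // 4 for s in sections) / len(sections)
    return int(avg_tokens * sections_per_parent)

# Placeholder section texts standing in for a real document sample.
sample_sections = ["..." * 300, "..." * 250, "..." * 350]
print(suggest_parent_size(sample_sections))
```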

Merging threshold: The auto-merging threshold in LlamaIndex (ratio of children retrieved before upgrading to parent) needs tuning per use case. Too aggressive = always returning full documents; too conservative = never merging, losing the hierarchy benefit. Start at 0.3 and adjust based on context waste metrics.

Parent-Child Chunking vs. Other Strategies

Parent-child chunking creates two levels of document segmentation: small child chunks that are indexed for retrieval precision, and larger parent chunks that are returned to the LLM for answer generation. When a query retrieves relevant child chunks, the system fetches their parent document segments, providing broader context to the language model than the small retrieved chunks alone would supply.

Strategy             | Indexed Unit     | Returned to LLM | Retrieval Precision | Context Quality
Fixed-size chunking  | Same chunk       | Same chunk      | Moderate            | May miss context
Parent-child         | Child (small)    | Parent (large)  | High                | Full context preserved
Sentence window      | Sentence         | Sentence ± N    | Very high           | Good local context
Full document        | Document summary | Full document   | Low                 | Excellent but noisy

The key insight behind parent-child chunking is that the optimal chunk size for embedding-based retrieval differs from the optimal chunk size for LLM reading. Embeddings encode dense semantic summaries — shorter chunks produce embeddings that more precisely represent a single concept, making retrieval more accurate. But LLMs benefit from seeing complete paragraphs, sections, or document spans that provide full reasoning context. Parent-child chunking decouples these two requirements by maintaining separate chunk sizes for each phase of the pipeline.

Implementation requires maintaining a parent-child mapping in the document store or vector database metadata. When child chunks are retrieved, a post-retrieval step looks up each child's parent ID and fetches the corresponding parent chunk from the document store. If multiple child chunks share the same parent, the parent is only fetched once and deduplicated before inclusion in the LLM context window. This deduplication step is important for preventing the same parent chunk from appearing multiple times and consuming excess context tokens.
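The parent lookup and deduplication step described above can be sketched as follows — the names (`fetch_parents`, `child_to_parent`, `docstore`) are illustrative, standing in for whatever mapping and store the pipeline actually uses:

```python
def fetch_parents(child_hits, child_to_parent, docstore):
    """Look up each retrieved child's parent and deduplicate, preserving
    retrieval order so the highest-ranked parent comes first."""
    seen, parents = set(), []
    for child_id in child_hits:
        pid = child_to_parent[child_id]
        if pid not in seen:          # fetch each parent only once
            seen.add(pid)
            parents.append(docstore[pid])
    return parents

docstore = {"p1": "full section text...", "p2": "another section..."}
mapping = {"c1": "p1", "c2": "p1", "c3": "p2"}
print(fetch_parents(["c2", "c1", "c3"], mapping, docstore))
```

Two children of p1 were retrieved, but p1's text enters the context window only once.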

The optimal ratio between child and parent chunk sizes depends on the document structure. For technical documentation with long, detail-dense paragraphs, a child-to-parent ratio of 1:4 (e.g., 128 tokens child / 512 tokens parent) works well. For conversational or narrative content with shorter, context-dependent sentences, a tighter ratio of 1:2 better preserves the flow of meaning. Testing retrieval precision at multiple ratio configurations on a representative document sample is more reliable than applying a universal default ratio across all document types.
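A ratio sweep is easy to enumerate; the evaluation itself (rebuilding the index and scoring labelled queries at each configuration) is pipeline-specific. The sizes and ratios below are illustrative defaults, not recommendations:

```python
from itertools import product

def ratio_configs(child_sizes=(128, 256, 400), ratios=(2, 4)):
    """Enumerate (child, parent) token-size pairs to benchmark."""
    return [(c, c * r) for c, r in product(child_sizes, ratios)]

for child, parent in ratio_configs():
    # For each pair: rebuild the index at these sizes, run the labelled
    # query set, and record retrieval precision.
    print(f"child={child} tokens -> parent={parent} tokens")
```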

Hierarchical parent-child structures can extend beyond two levels for very long or structured documents. A three-level hierarchy — sentence → paragraph → section — enables the retrieval system to return exactly the right amount of context for each query type. Short factual queries benefit from sentence-level precision; complex analytical queries benefit from paragraph or section context. The SmallToLarge retrieval pattern starts with the smallest chunk and progressively expands to the parent if the retrieved context is insufficient for generating a confident answer.
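The progressive-expansion loop described above can be sketched generically. Everything here is a stand-in: `levels` maps level names to retrieval functions, and `is_sufficient` stands in for a real confidence check (e.g. an LLM self-assessment of the retrieved context):

```python
def small_to_large(query, levels, is_sufficient):
    """Start at the finest retrieval level and step up to coarser levels
    only while the retrieved context looks insufficient."""
    for level_name, retrieve in levels:
        context = retrieve(query)
        if is_sufficient(context):
            return level_name, context
    return level_name, context  # fall back to the largest unit tried

levels = [
    ("sentence", lambda q: "Attention weights sum to 1."),
    ("paragraph", lambda q: "Attention weights sum to 1. They come from a "
                            "softmax over scaled query-key dot products."),
]
name, ctx = small_to_large("why do weights sum to 1?", levels,
                           is_sufficient=lambda c: "softmax" in c)
print(name)  # the sentence lacked the explanation, so we expanded: paragraph
```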

Storage considerations for parent-child chunking require maintaining the original parent documents alongside the child chunk index. Each child chunk's metadata must include its parent identifier and optionally the byte offsets within the parent document for efficient extraction. Vector databases like Chroma and Qdrant support metadata filtering that can accelerate parent lookup by tenant or document type. For large-scale deployments processing millions of documents, storing parent chunks in a separate high-performance key-value store (Redis, DynamoDB) rather than the vector database reduces the operational complexity of the retrieval pipeline.