A document chunking strategy that uses an LLM to determine chunk boundaries based on content coherence — producing semantically complete, variable-length chunks without fixed window sizes.
Agentic chunking uses an LLM as a document understanding agent: it reads the document and decides where to split based on semantic coherence — keeping related content together and splitting at natural topic boundaries. The result is variable-length chunks that respect document structure: a code block stays with its explanation, a table stays with its header row.
The LLM is given the document (or a rolling window of it) and asked: 'Is this content complete as a standalone chunk, or does it need more context?' If complete, split here. If not, include the next paragraph and re-evaluate. This produces chunks that are semantically complete — each chunk answers a coherent question without requiring surrounding context.
```python
from openai import OpenAI

client = OpenAI()

def agentic_chunk(document: str, target_size: int = 500) -> list[str]:
    paragraphs = document.split("\n\n")
    chunks = []
    current = []
    current_len = 0
    for para in paragraphs:
        current.append(para)
        current_len += len(para.split())
        if current_len >= target_size:
            # Ask the LLM whether the accumulated text is a coherent chunk
            candidate = "\n\n".join(current)
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{
                    "role": "user",
                    "content": (
                        "Is the following text a semantically complete, standalone unit? "
                        "Reply with YES or NO.\n\n" + candidate[:1000]
                    ),
                }],
                max_tokens=5,
                temperature=0,
            )
            if "YES" in response.choices[0].message.content.upper():
                chunks.append(candidate)
                current = []
                current_len = 0
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```
Use agentic chunking for: heterogeneous documents (technical manuals, reports with mixed content), documents where structure matters (code + explanation, tables + captions), and high-value corpora where retrieval quality justifies indexing cost. Avoid for: homogeneous text corpora (news articles), cost-sensitive pipelines, or real-time indexing.
- Fixed-size: O(1) cost, low quality, good baseline.
- Recursive character: O(1) cost, medium quality, respects some structure.
- Semantic: O(n·embedding) cost, good quality for narrative text.
- Proposition: O(n·LLM) cost, best precision for factoid QA.
- Agentic: O(n·LLM) cost, best coherence for complex documents.

Choose based on the tradeoff between indexing cost and retrieval quality requirements.
A powerful two-stage pipeline: (1) Agentic chunking produces semantically complete parent chunks. (2) Proposition extraction produces fine-grained retrieval units from each chunk. At retrieval time, propositions are retrieved for precision; parent chunks are returned for context richness. This is the highest-quality RAG retrieval strategy, at the highest indexing cost.
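The parent/child mapping behind this pipeline can be sketched in a few lines. The proposition extractor and the keyword-overlap scoring below are illustrative stand-ins: a real system would extract propositions with an LLM and retrieve them with embeddings, but the index shape (each proposition pointing back to its parent chunk) is the same.

```python
def build_index(parent_chunks: list[str], extract) -> list[tuple[str, int]]:
    """Map each fine-grained proposition to the index of its parent chunk."""
    index = []
    for i, chunk in enumerate(parent_chunks):
        for prop in extract(chunk):
            index.append((prop, i))
    return index

def retrieve(query: str, index, parent_chunks, top_k: int = 1) -> list[str]:
    """Score propositions against the query, return parent chunks for context."""
    q_terms = set(query.lower().split())
    scored = sorted(
        index,
        key=lambda pair: len(q_terms & set(pair[0].lower().split())),
        reverse=True,
    )
    # Deduplicate parents while preserving score order
    seen, parents = set(), []
    for _, parent_id in scored[:top_k * 3]:
        if parent_id not in seen:
            seen.add(parent_id)
            parents.append(parent_chunks[parent_id])
    return parents[:top_k]
```

Note that retrieval matches against the small, precise propositions, but what comes back is the larger parent chunk, which is what gives the generator its context richness.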
Agentic chunking uses an LLM to determine where to break documents into chunks. The LLM evaluates coherence: "does this section logically conclude here?" Variable-length chunks preserve semantic units—a short paragraph stays whole while a long section might split across 2-3 chunks. This reduces retrieval latency (fetching fewer, more complete chunks) and improves downstream LLM context quality.
```python
import json
import anthropic

def agentic_chunk(document_text, max_chunk_tokens=1500):
    """Use Claude to determine chunk boundaries."""
    client = anthropic.Anthropic()
    # Ask Claude to propose character offsets where a new topic begins.
    # Note: only the first 5000 characters are shown; longer documents
    # need a rolling window.
    proposal_prompt = f"""
Document:
{document_text[:5000]}...

Identify natural semantic breaks where a new topic or section begins.
Return JSON: {{"boundaries": [position1, position2, ...]}}
"""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": proposal_prompt}]
    )
    boundaries = json.loads(response.content[0].text)["boundaries"]
    # Include the document start and end so no text is dropped.
    boundaries = [0] + boundaries + [len(document_text)]
    chunks = [document_text[boundaries[i]:boundaries[i + 1]]
              for i in range(len(boundaries) - 1)]
    return chunks
```

Cost vs. quality tradeoff: calling an LLM per document adds overhead (~$0.001 per document for Claude 3.5 Sonnet). But the quality gain can justify the cost: agentic chunks reduce retrieval latency by 20-30% and improve final answer quality by 5-15% because chunks are more semantically coherent. Batch processing amortizes costs further: chunk a whole corpus offline, cache results, reuse indefinitely.
```python
# Batch chunking with caching
import hashlib
import json
import os

def batch_chunk_documents(documents, cache_dir="/tmp/chunks"):
    """Efficiently chunk multiple documents with result caching."""
    chunks_map = {}
    for doc_id, text in documents.items():
        doc_hash = hashlib.md5(text.encode()).hexdigest()
        cache_path = f"{cache_dir}/{doc_id}_{doc_hash}.json"
        if os.path.exists(cache_path):
            with open(cache_path) as f:
                chunks_map[doc_id] = json.load(f)
            continue
        # Agentic chunking (expensive)
        chunks = agentic_chunk(text)
        chunks_map[doc_id] = chunks
        # Cache the result
        os.makedirs(cache_dir, exist_ok=True)
        with open(cache_path, "w") as f:
            json.dump(chunks, f)
    return chunks_map
```

| Chunking Strategy | Avg Chunk Size (tokens) | Semantic Coherence | Cost per Doc |
|---|---|---|---|
| Fixed window (512 tokens) | 512 | Poor | Free |
| Sentence boundaries | 200-800 | Good | Free |
| Recursive split | 300-1000 | Good | Free |
| Agentic (Claude) | 400-1500 | Excellent | $0.001 |
Implementation considerations: agentic chunking works best with longer documents (10+ pages). For short documents, sentence-level chunking is sufficient. For medium documents (2-10 pages), it's borderline—measure the quality improvement and cost tradeoff. Batch all documents of the same type together and chunk them with a single API call to amortize the overhead.
Chunk size distribution: agentic chunking produces variable-size chunks (100-3000 tokens typically). This is desirable because semantically complete units have natural boundaries. However, downstream systems might have fixed-size expectations (e.g., vector database pages are 512 tokens). Solution: keep agentic chunks as-is for retrieval and QA, but split them further only when necessary for storage optimization.
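A minimal sketch of that storage-side split, using word count as a rough token proxy. It only subdivides chunks that exceed the limit, splitting at paragraph boundaries; a single paragraph longer than the limit is left intact in this sketch.

```python
def enforce_max_size(chunks: list[str], max_tokens: int = 512) -> list[str]:
    """Split any chunk exceeding max_tokens (word-count proxy) into pieces
    at paragraph boundaries; smaller chunks pass through untouched."""
    out = []
    for chunk in chunks:
        if len(chunk.split()) <= max_tokens:
            out.append(chunk)
            continue
        piece, piece_len = [], 0
        for para in chunk.split("\n\n"):
            n = len(para.split())
            if piece and piece_len + n > max_tokens:
                out.append("\n\n".join(piece))
                piece, piece_len = [], 0
            piece.append(para)
            piece_len += n
        if piece:
            out.append("\n\n".join(piece))
    return out
```

Keeping the original agentic chunk as the retrieval unit and mapping split pieces back to it preserves the semantic boundaries the LLM found.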
Fixed-window chunking (sliding window of 512 tokens) is simple but semantically broken: chunks cut arbitrarily through topics. Sentence-level chunking respects sentence boundaries but can produce chunks of wildly different sizes (5 tokens to 500 tokens). Recursive splitting (split by paragraph, then by sentence, then by token) is a good heuristic but still doesn't understand semantics.
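A stripped-down version of the recursive heuristic: try the coarsest separator first, and recurse with finer ones only on pieces that are still too long. Production splitters also merge adjacent small pieces back up toward the target size, which this sketch omits.

```python
def recursive_split(text: str, max_words: int = 200,
                    seps: tuple[str, ...] = ("\n\n", ". ", " ")) -> list[str]:
    """Split by paragraph, then sentence, then word, only as needed."""
    if len(text.split()) <= max_words or not seps:
        return [text]
    sep, rest = seps[0], seps[1:]
    parts = [p for p in text.split(sep) if p]
    if len(parts) == 1:               # separator absent; try the next finer one
        return recursive_split(text, max_words, rest)
    out = []
    for part in parts:
        out.extend(recursive_split(part, max_words, rest))
    return out
```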
Agentic chunking improves over all of these by letting an LLM decide boundaries. The cost is higher (one API call per document) but the payoff is substantial: fewer chunks per document, a 20-30% reduction in retrieval latency, and a 5-15% improvement in final QA accuracy. For large corpora, the cost amortizes: $0.001 per document is negligible compared to the quality gains.
Hybrid approach: use agentic chunking offline for static documents, use sentence-level for real-time streaming documents. Cache agentic chunks and reuse across different RAG applications. Invest in agentic chunking if your document corpus is large and static; use heuristics for streaming or one-off documents.
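One way to wire that routing, with the two chunkers injected as callables so either path can be swapped out (the function names here are illustrative, not from any library):

```python
def make_router(agentic_chunker, sentence_chunker):
    """Return a chunking function that routes static documents to the
    expensive agentic path and streaming documents to the cheap heuristic."""
    def chunk(text: str, is_static: bool) -> list[str]:
        return agentic_chunker(text) if is_static else sentence_chunker(text)
    return chunk
```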
Agentic chunks integrate well with dense retrieval systems (embedding-based search). Because chunks are semantically coherent, their embeddings are more meaningful—retrieval precision improves. The chunk boundary quality directly impacts downstream QA accuracy: good boundaries mean fewer irrelevant chunks retrieved, better context for generation.
Two-stage retrieval: first stage retrieves agentic chunks (fast, semantic), second stage reranks chunks based on relevance to the query (precise but slower). This hybrid approach achieves high recall (semantic retrieval) and high precision (reranking). For critical applications (legal research, medical information), combine with chain-of-thought verification: LLM verifies that the chunk actually answers the question before returning it to the user.
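The two stages can be sketched with injectable scorers. The defaults below use keyword overlap purely for illustration; a real pipeline would use embedding similarity for the recall stage and a cross-encoder reranker for the precision stage.

```python
def two_stage_retrieve(query: str, chunks: list[str],
                       recall_k: int = 10, final_k: int = 3,
                       fast_score=None, rerank_score=None) -> list[str]:
    """Stage 1: cheap score over all chunks for recall.
    Stage 2: expensive score over the shortlist for precision."""
    fast_score = fast_score or (lambda q, c: len(set(q.split()) & set(c.split())))
    rerank_score = rerank_score or fast_score
    shortlist = sorted(chunks, key=lambda c: fast_score(query, c),
                       reverse=True)[:recall_k]
    return sorted(shortlist, key=lambda c: rerank_score(query, c),
                  reverse=True)[:final_k]
```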
Robustness and failure modes: agentic chunking can fail if the LLM hallucinates (claims a boundary where none exists) or misunderstands the document structure. Mitigate by using a reliable LLM (e.g., Claude 3.5 Sonnet), providing clear examples in the prompt, and validating chunks post-hoc (verify each chunk is non-empty and appears verbatim in the source). For critical applications, combine with human review of a sample of chunks.
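A cheap post-hoc validator along those lines, assuming chunks are meant to be verbatim spans of the source document:

```python
def validate_chunks(chunks: list[str], original: str) -> list[str]:
    """Sanity-check LLM-produced chunks: drop empties and anything that is
    not a verbatim substring of the source (a sign of hallucinated text)."""
    valid = []
    for chunk in chunks:
        text = chunk.strip()
        if text and text in original:
            valid.append(chunk)
    return valid
```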
Iterative refinement: start with sentence-level chunking, measure downstream quality, then upgrade to agentic chunking if justified. This way, you invest in the technique only where it matters. Some applications get 95% of the way with sentence chunking; others need agentic chunking. Data-driven decision-making prevents over-engineering.
Agentic chunking improves retrieval quality at moderate cost. Recommended for large static corpora where quality is critical. Use simpler chunking for small or streaming documents. Measure improvement before investing in LLM-based approaches. Data-driven decisions prevent over-engineering.
Scaling: batch documents into groups of 100-1000, chunk each batch with single API call, parallelize across machines. A distributed system can chunk a billion documents in hours. Cache results with checksums: if documents never change, cached chunks remain valid forever. This infrastructure investment is justified when retrieval quality is critical and documents are stable. Enterprise documents and knowledge bases benefit most from agentic chunking.
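A thread-pool sketch of the parallel step; since chunking calls are I/O-bound API requests, threads are enough on a single machine, and the chunker is injected so any of the implementations above can be used:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_chunk(documents: dict, chunker, workers: int = 8) -> dict:
    """Chunk many documents concurrently; results keyed by document id."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {doc_id: pool.submit(chunker, text)
                   for doc_id, text in documents.items()}
    return {doc_id: f.result() for doc_id, f in futures.items()}
```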
Key takeaway: the value of agentic chunking compounds over time. A well-chunked corpus improves every retrieval pipeline built on top of it, and cached chunks are reused across applications indefinitely. Invest in chunk quality early, measure continuously, and optimize based on data.
Future directions: newer LLMs may be able to chunk even more intelligently by reasoning about content semantics and downstream usage patterns. Combine agentic chunking with dynamic retrieval strategies where chunk sizes adapt to query characteristics. Research continues on optimal chunk boundaries for different retrieval and generation tasks. Stay engaged with research to leverage new techniques as they emerge.