Chunking Strategies

Semantic Chunking

Split documents at semantic boundaries rather than fixed token counts: embed consecutive sentences, detect embedding-space topic shifts, and break chunks where the topic changes. The result is coherent chunks that align with the document's actual structure.

Topic-aware split points
Cosine-distance break detection
Variable chunk lengths

SECTION 01

How semantic chunking works

Semantic chunking (introduced in Greg Kamradt's "5 levels of chunking") uses embedding similarity to detect where a document's topic changes, then splits at those boundaries.

The algorithm:

  1. Split the document into sentences.
  2. Embed each sentence (or small group of sentences).
  3. Compute cosine distance between consecutive sentence embeddings.
  4. Identify "breakpoints" — positions where cosine distance is unusually high (topic shift).
  5. Split the document at breakpoints. Everything between two breakpoints becomes one chunk.

The key parameter is the breakpoint threshold: typically the 95th percentile of all pairwise distances. Sentences whose distance exceeds this threshold trigger a split. The result: chunks that align with the document's natural topic structure rather than arbitrary token counts.
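The percentile rule can be demonstrated on a toy distance series (synthetic numbers standing in for real consecutive-sentence distances):

```python
import numpy as np

# Toy distances for 8 consecutive sentence pairs: most transitions are smooth
# (~0.1), with two sharp topic shifts (~0.6).
distances = np.array([0.10, 0.12, 0.60, 0.09, 0.11, 0.58, 0.13, 0.10])

# Breakpoints are the positions whose distance exceeds the chosen percentile.
threshold = np.percentile(distances, 75)
breakpoints = [i + 1 for i, d in enumerate(distances) if d > threshold]
print(breakpoints)  # [3, 6] — sentence indices where a new chunk starts
```

With the 75th-percentile threshold, only the two sharp shifts trigger splits, yielding three chunks.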

SECTION 02

Implementation

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([e.embedding for e in resp.data])

def semantic_chunk(text: str, percentile: float = 95) -> list[str]:
    import nltk  # requires the "punkt" tokenizer: run nltk.download("punkt") once beforehand
    sentences = nltk.sent_tokenize(text)
    if len(sentences) < 2:
        return [text]

    embeddings = embed(sentences)
    # Compute distance between consecutive sentences
    distances = []
    for i in range(len(embeddings) - 1):
        sim = cosine_similarity([embeddings[i]], [embeddings[i+1]])[0][0]
        distances.append(1 - sim)  # cosine distance

    # Find breakpoints at high-distance positions
    threshold = np.percentile(distances, percentile)
    breakpoints = [i+1 for i, d in enumerate(distances) if d > threshold]

    # Build chunks
    chunks, prev = [], 0
    for bp in breakpoints:
        chunks.append(" ".join(sentences[prev:bp]))
        prev = bp
    chunks.append(" ".join(sentences[prev:]))
    return chunks

chunks = semantic_chunk(long_document, percentile=90)  # long_document: your source text
print(f"Created {len(chunks)} semantic chunks")
for i, c in enumerate(chunks):
    print(f"Chunk {i}: {len(c.split())} words")

SECTION 03

Breakpoint percentile tuning

The percentile threshold controls chunk granularity:

import matplotlib.pyplot as plt

# Visualise the distance curve to choose a threshold
distances = [1 - cosine_similarity([embeddings[i]], [embeddings[i+1]])[0][0]
             for i in range(len(embeddings) - 1)]
plt.figure(figsize=(12, 4))
plt.plot(distances, alpha=0.7)
plt.axhline(y=np.percentile(distances, 90), color='r', linestyle='--', label='90th pct')
plt.axhline(y=np.percentile(distances, 95), color='g', linestyle='--', label='95th pct')
plt.xlabel("Sentence index")
plt.ylabel("Cosine distance to next sentence")
plt.legend()
plt.title("Semantic distance profile — breakpoint detection")
plt.savefig("semantic_chunking_viz.png")

SECTION 04

Comparison with fixed-size

On RAG benchmarks like RAGAS and HotpotQA, semantic chunking typically outperforms fixed-size by 5–12% on answer faithfulness and context precision metrics. The main wins come from: fewer "partial context" retrievals (chunks don't cut mid-topic), better embedding quality (each chunk represents a coherent concept), and reduced noise in retrieved context (LLM gets on-topic text, not arbitrary windows).

The cost: semantic chunking embeds every sentence during ingestion, while fixed-size splitting needs no embedding calls at all to decide boundaries. A 1M-token corpus is roughly 50,000 sentences (at ~20 tokens per sentence); batched embedding requests (up to 2,048 inputs each) reduce this to a few dozen API calls. Pre-compute and cache embeddings at ingestion time to amortise this cost.
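One way to amortise the cost is a content-addressed embedding cache, so re-ingesting unchanged sentences never repeats an API call. A minimal sketch (the `fake_embed` stand-in is illustrative; swap in a real embedder):

```python
import hashlib

# In-memory cache keyed by a hash of the sentence text.
_cache: dict[str, list[float]] = {}

def cached_embed(text: str, embed_fn) -> list[float]:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = embed_fn(text)  # only called on a cache miss
    return _cache[key]

calls = 0
def fake_embed(text: str) -> list[float]:
    # Stand-in embedder that counts how often it is actually invoked.
    global calls
    calls += 1
    return [float(len(text))]

cached_embed("hello world", fake_embed)
cached_embed("hello world", fake_embed)  # cache hit, no second call
print(calls)  # 1
```

In production, back the cache with persistent storage (e.g. the vector store itself) so the cache survives re-ingestion runs.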

SECTION 05

When to use semantic chunking

Semantic chunking is the right choice when: documents cover multiple topics within a single file (technical reports, Wikipedia articles), you need high answer precision and can afford longer ingestion time, and the embedding model you use is strong enough to distinguish topic changes (larger models work better here).

Stick with fixed-size when: documents are already topically uniform (each file covers one topic), you need fast, cheap ingestion (batch jobs processing millions of documents), or you're working with code/structured data where topic boundaries don't apply.

SECTION 06

LlamaIndex and LangChain integrations

# LangChain SemanticChunker
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",  # or "standard_deviation", "interquartile"
    breakpoint_threshold_amount=95,
)
chunks = splitter.create_documents([document_text])

# LlamaIndex SemanticSplitterNodeParser
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

splitter = SemanticSplitterNodeParser(
    buffer_size=1,                 # sentences to group before computing distance
    breakpoint_percentile_threshold=95,
    embed_model=OpenAIEmbedding(model="text-embedding-3-small"),
)
nodes = splitter.get_nodes_from_documents(documents)
print(f"Created {len(nodes)} semantic nodes")

SECTION 07

Gotchas

Embedding cost scales with document size: Each sentence requires an embedding call. Batch sentences (max 2048 per OpenAI call) to minimise API round-trips. For large corpora, this is the main cost driver of semantic chunking vs fixed-size.
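Batching is a one-liner; a sketch of grouping sentences into requests of at most 2,048 inputs:

```python
def batch_sentences(sentences: list[str], batch_size: int = 2048) -> list[list[str]]:
    # Group sentences into batches of at most `batch_size` inputs, so a
    # 5,000-sentence document costs 3 API round-trips instead of 5,000.
    return [sentences[i:i + batch_size] for i in range(0, len(sentences), batch_size)]

batches = batch_sentences(["s"] * 5000)
print(len(batches), [len(b) for b in batches])  # 3 [2048, 2048, 904]
```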

Very short or very long chunks: Semantic chunking can produce degenerate chunks: very short (a single transition sentence) or very long (if a document discusses one topic for 50 pages). Add min/max chunk size guardrails: merge chunks below 100 tokens, split chunks above 2000 tokens with a fallback fixed-size split.
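A minimal sketch of such guardrails, using whitespace word counts as a stand-in for token counts (use a real tokenizer in production):

```python
def apply_guardrails(chunks: list[str], min_tokens: int = 100, max_tokens: int = 2000) -> list[str]:
    # Pass 1: merge undersized chunks forward into their neighbour.
    merged: list[str] = []
    for chunk in chunks:
        if merged and len(merged[-1].split()) < min_tokens:
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    # Pass 2: fallback fixed-size split for oversized chunks.
    out: list[str] = []
    for chunk in merged:
        words = chunk.split()
        for i in range(0, len(words), max_tokens):
            out.append(" ".join(words[i:i + max_tokens]))
    return out

demo = apply_guardrails(["one two", "three four five"], min_tokens=3, max_tokens=5)
print(demo)  # ['one two three four five'] — the 2-word chunk was merged forward
```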

Domain specificity: Standard embedding models may not detect topic shifts in highly specialised technical text. The embeddings for "quantum error correction" and "quantum gate fidelity" may be similar enough to prevent splitting, even though these are different topics. Fine-tuned domain embeddings help here.

Semantic Chunking vs. Fixed-Size Strategies

Semantic chunking groups text into chunks based on topical coherence rather than fixed character or token counts. By detecting semantic boundaries — points where the topic or subject shifts — semantic chunking produces chunks that each represent a unified idea, improving both retrieval precision and the quality of context provided to the LLM.

| Strategy | Boundary Detection | Chunk Size | Precision | Compute Cost |
| --- | --- | --- | --- | --- |
| Fixed-size | Token/character count | Uniform | Moderate | Minimal |
| Sentence splitting | Punctuation/NLP | Variable (1–5 sentences) | Good | Low |
| Semantic chunking | Embedding similarity drop | Variable (topic-based) | High | Medium (embeddings) |
| Agentic chunking | LLM-based segmentation | Variable | Highest | High (LLM calls) |

The breakpoint threshold is the primary hyperparameter in semantic chunking. A high distance percentile (e.g., the 95th) only splits when content changes dramatically, producing large, coherent chunks that risk exceeding the embedding model's effective representation range. A lower percentile (e.g., the 80th) splits more aggressively, producing many small, focused chunks at the cost of fragmenting related content that flows naturally across topical micro-transitions. Calibrating the threshold on a representative document sample is more reliable than applying domain-agnostic defaults.
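Calibration can be as simple as counting how many chunks each candidate percentile would produce on a sample. A sketch with synthetic distances standing in for a real corpus:

```python
import numpy as np

# Synthetic consecutive-sentence distances: skewed low, as real corpora tend to be.
rng = np.random.default_rng(0)
sample_distances = rng.beta(2, 8, size=500)

# Sweep candidate percentiles and report the resulting chunk count.
for pct in (80, 90, 95):
    threshold = np.percentile(sample_distances, pct)
    n_breaks = int((sample_distances > threshold).sum())
    print(f"{pct}th percentile -> threshold {threshold:.3f}, {n_breaks + 1} chunks")
```

Pick the percentile whose chunk count and average chunk length match your retrieval budget, then spot-check a few boundaries by hand.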

Semantic chunking's main limitation is that it requires embedding every sentence in the document during ingestion, multiplying the embedding API calls compared to fixed-size approaches. For a 100-page technical manual, fixed-size chunking into 512-token chunks produces about 400 chunks; semantic chunking of the same document embeds each sentence (2,000–4,000 sentences) to detect boundaries, then stores 300–500 semantically coherent chunks. The ingestion cost is 5–10× higher, but retrieval quality improvements justify this investment for knowledge bases queried frequently.

Adaptive semantic chunking adjusts the similarity threshold based on document section characteristics. Introduction sections, which often contain broad topic overviews, may naturally have lower sentence-to-sentence similarity and benefit from a higher threshold to avoid excessive fragmentation. Technical detail sections, where adjacent sentences discuss closely related implementation specifics, benefit from a lower threshold that creates more fine-grained chunks. Section-aware threshold adaptation can be implemented by detecting section headers and applying different thresholds to different document regions.
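A sketch of section-aware threshold selection, assuming Markdown-style "#" headers; the section names and per-section percentiles here are illustrative, not tuned values:

```python
import re

# Hypothetical per-section percentiles; unknown sections fall back to the default.
SECTION_PERCENTILES = {"introduction": 95, "implementation": 85}
DEFAULT_PERCENTILE = 90

def percentile_for_section(header: str) -> int:
    name = header.lstrip("#").strip().lower()
    return SECTION_PERCENTILES.get(name, DEFAULT_PERCENTILE)

def split_into_sections(doc: str) -> list[tuple[str, str]]:
    # Returns (header, body) pairs for each "#"-headed region.
    parts = re.split(r"(?m)^(#+ .+)$", doc)
    sections = []
    for i in range(1, len(parts) - 1, 2):
        sections.append((parts[i], parts[i + 1].strip()))
    return sections

doc = "# Introduction\nBroad overview.\n# Implementation\nDetails here."
for header, body in split_into_sections(doc):
    print(header, "->", percentile_for_section(header))
```

Each region is then chunked with its own percentile, rather than one global threshold for the whole document.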

Combining semantic chunking with chunk size constraints prevents degenerate cases where the semantic segmentation creates extremely long or extremely short chunks. A maximum chunk size limit (e.g., 1500 tokens) forces a split even when semantic similarity remains high, preventing single chunks from spanning multiple pages. A minimum chunk size threshold merges tiny chunks that result from back-to-back semantic boundary detections, ensuring every chunk contains sufficient content to produce a meaningful embedding. These guard rails make semantic chunking production-ready without sacrificing its core quality advantages.

Evaluating semantic chunking quality requires measuring both retrieval precision and chunk coherence. Retrieval precision can be measured with standard information retrieval metrics on a held-out evaluation set. Chunk coherence — whether each chunk discusses a single unified topic — can be approximated by computing the pairwise cosine similarity of sentences within each chunk; high intra-chunk similarity indicates semantic coherence while low similarity suggests the boundary detection missed a topic transition. Combining these two metrics provides a complete quality assessment of the chunking configuration.
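The intra-chunk coherence metric described above can be sketched with NumPy (synthetic vectors stand in for real sentence embeddings):

```python
import numpy as np

def intra_chunk_coherence(sentence_embeddings: np.ndarray) -> float:
    # Mean pairwise cosine similarity over all sentence pairs in one chunk.
    normed = sentence_embeddings / np.linalg.norm(sentence_embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(normed)
    # Average over off-diagonal pairs only (the diagonal is always 1).
    return float((sims.sum() - n) / (n * (n - 1)))

# Synthetic check: near-identical vectors score ~1, orthogonal vectors score 0.
coherent = np.array([[1.0, 0.01], [1.0, 0.02], [1.0, 0.0]])
mixed = np.array([[1.0, 0.0], [0.0, 1.0]])
print(round(intra_chunk_coherence(coherent), 3))  # ~1.0
print(round(intra_chunk_coherence(mixed), 3))     # 0.0
```

Low scores flag chunks whose boundary detection likely missed a topic transition and are worth manual review.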