Split text into fixed-size token or character chunks with configurable overlap. The simplest chunking strategy and surprisingly competitive — good baseline for most RAG systems before optimising.
Most RAG systems retrieve documents at the chunk level, not the document level. Chunking is the process of splitting source documents into smaller pieces before embedding and indexing them. The quality of retrieval — and therefore the quality of the generated answer — depends heavily on how well your chunks represent coherent, self-contained pieces of information.
Fixed-size chunking splits text into chunks of N tokens (or characters) with M tokens of overlap between consecutive chunks. The overlap ensures that sentences or ideas that span a chunk boundary aren't cut in half — the context bleeds into the next chunk. It's the simplest strategy and a good starting point for any RAG system.
```python
from langchain.text_splitter import TokenTextSplitter, RecursiveCharacterTextSplitter

# Token-based chunking (recommended: counts align with model context windows)
splitter = TokenTextSplitter(
    chunk_size=512,               # tokens per chunk
    chunk_overlap=50,             # overlap between consecutive chunks
    encoding_name="cl100k_base",  # tiktoken encoding (GPT-4, ada-002)
)
chunks = splitter.split_text(document_text)  # list of strings

# Character-based chunking (simpler, language-agnostic)
char_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,   # characters per chunk (roughly ~500 tokens)
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],  # try to split at natural boundaries
)
docs = char_splitter.split_documents(documents)  # list of Document objects

print(f"Document: {len(document_text)} chars → {len(docs)} chunks")
for i, doc in enumerate(docs[:3]):
    print(f"Chunk {i}: {len(doc.page_content)} chars | {doc.page_content[:80]}...")
```
The right chunk size depends on your embedding model's context window and the nature of your documents; the size/overlap trade-off table below summarises the common configurations.
Overlap: A 10–15% overlap is usually sufficient. For 512-token chunks, use 50–75 tokens overlap. Too much overlap wastes index space; too little causes missed context at boundaries.
```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def token_chunker(text: str, chunk_size: int = 512, overlap: int = 50):
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(enc.decode(tokens[start:end]))
        start += chunk_size - overlap
    return chunks

# Compare sizes
text = "A" * 10_000  # 10K chars
char_chunks = [text[i:i + 2000] for i in range(0, len(text), 1800)]  # 200-char overlap
token_chunks = token_chunker(text, chunk_size=512, overlap=50)
print(f"Character chunks: {len(char_chunks)}")
print(f"Token chunks: {len(token_chunks)}")
# Character count per chunk is unpredictable in token terms for multilingual text.
# Token count per chunk is exact; use it for LLM context budgeting.
```
Use token-based chunking when you need predictable context-window usage. Character-based chunking is faster to compute and language-agnostic, but the token count of a fixed character budget varies across languages (CJK scripts run roughly 1–2 tokens per character, while ASCII text averages about 4 characters per token).
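Those ratios can power a tokenizer-free estimate when you only need a rough budget rather than an exact count. A minimal sketch; the `estimate_tokens` helper and its 4-chars-per-token / 1.5-tokens-per-CJK-char constants are illustrative assumptions derived from the heuristics above, not measured values:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 chars per token for ASCII, ~1.5 tokens per CJK char."""
    # Count characters in the CJK Unified Ideographs block (assumption:
    # good enough for a rough estimate; ignores kana, hangul, etc.)
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    ascii_like = len(text) - cjk
    return round(ascii_like / 4 + cjk * 1.5)

print(estimate_tokens("a" * 400))  # → 100
print(estimate_tokens("中" * 10))  # → 15
```

An estimate like this is fine for pre-filtering or cost projection; for actual context budgeting, run the real tokenizer.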
On standard RAG benchmarks (RAGAS, TruLens), fixed-size chunking at 512 tokens with 50-token overlap typically achieves 70–85% of the performance of more sophisticated strategies (semantic chunking, sentence-window). For many production use cases, the simplicity and speed advantage of fixed-size chunking outweighs the quality gap.
Fixed-size chunking breaks down when: documents have highly variable density (a 512-token chunk of legal boilerplate contains less information than 512 tokens of dense technical content), sentences span chunk boundaries (partially captured sentences hurt both embedding quality and LLM coherence), and documents have natural semantic units (chapters, sections) that should stay together.
Recommendation: start with fixed-size 512 tokens + 50 overlap, measure retrieval quality, then try sentence-window or semantic chunking if you need improvement.
```python
from langchain.text_splitter import TokenTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Load and chunk documents (`documents` is assumed to be an iterable of
# parsed docs exposing .text, .source, and .page)
splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.create_documents(
    texts=[doc.text for doc in documents],
    metadatas=[{"source": doc.source, "page": doc.page} for doc in documents],
)

# Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)

# Query
results = vectorstore.similarity_search("What is attention?", k=4)
for r in results:
    print(f"[{r.metadata['source']}] {r.page_content[:200]}")
```
Mid-sentence splits: Fixed-size chunking can cut sentences in half. Use RecursiveCharacterTextSplitter with sentence-boundary separators to reduce this. Overlap also helps — the end of chunk N and start of chunk N+1 overlap, so a query that needs the full sentence can retrieve it from one of them.
Table and code chunking: Fixed-size chunking destroys table structure and code blocks. For documents with tables, use a dedicated table parser. For codebases, use AST-based chunking (split at function/class boundaries) rather than fixed-size.
Chunk size ≠ embedding window: Ensure your chunk_size fits within the embedding model's token limit. text-embedding-3-small handles 8191 tokens, but smaller chunk sizes (256-512) often yield better retrieval precision because the embedding captures a more focused concept.
Re-chunking after document updates: When a source document is updated, you need to re-chunk and re-embed the entire document. There's no "partial update" for chunks — track document version hashes to detect changes and trigger re-ingestion.
Fixed-size chunking divides documents into segments of a specified number of characters or tokens, with an optional overlap between consecutive chunks. It is the simplest and most widely used chunking strategy due to its predictability, computational efficiency, and compatibility with any document type. The primary design decisions are chunk size and overlap amount.
| Chunk Size | Overlap | Retrieval Precision | Context per Chunk | Index Size |
|---|---|---|---|---|
| 256 tokens | 0 | High (focused) | Low | Large |
| 512 tokens | 50 | Good | Moderate | Medium |
| 1024 tokens | 100 | Moderate | Good | Medium |
| 2048 tokens | 200 | Lower (diluted) | High | Small |
Chunk overlap prevents information loss at chunk boundaries. Without overlap, a sentence that spans the boundary between two chunks gets split, with the first half embedded in one chunk and the second half in the next. Neither chunk's embedding accurately represents the split sentence. Adding overlap ensures that boundary content appears in at least one complete chunk. However, overlap increases storage and index size proportionally, so the overlap should be large enough to capture typical sentence and clause lengths (50–100 tokens) without being so large that it duplicates significant content between chunks.
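The boundary effect is easy to see with a toy sliding window over words; the `sliding_chunks` helper below is a hypothetical stand-in for a real token-level splitter:

```python
def sliding_chunks(words, size, overlap):
    """Fixed-size windows over a word list, with `overlap` words shared between neighbours."""
    step = size - overlap
    return [words[i:i + size] for i in range(0, len(words), step)]

words = [f"w{i}" for i in range(20)]
no_overlap = sliding_chunks(words, size=8, overlap=0)
with_overlap = sliding_chunks(words, size=8, overlap=2)

# Without overlap, w7 ends chunk 0 and w8 starts chunk 1: no chunk sees both.
# With overlap=2, chunk 1 spans w6..w13, so w7 and w8 appear together.
```

A "sentence" straddling the w7/w8 boundary is recoverable from a single chunk only in the overlapped version, which is exactly the failure mode overlap exists to prevent.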
Chunking at the token level versus character level produces different behaviors across languages. Character-based chunking with a limit of 500 characters produces very different chunk sizes for English (about 100–125 tokens) versus Chinese or Japanese (about 250–350 tokens due to higher information density per character). Token-based chunking with an explicit tokenizer produces consistent semantic density across languages but requires running the tokenizer at ingestion time, adding a preprocessing step that character-based chunking avoids.
Recursive character text splitting improves on naive fixed-size chunking by respecting the natural structural hierarchy of documents. Instead of splitting on a fixed character count, recursive splitting attempts to divide at the highest-level separator available — paragraph breaks first, then sentence boundaries, then word boundaries, then characters as a last resort. This produces chunks that are bounded in size but respect document structure, avoiding the worst cases of fixed-size splitting where a single sentence gets fragmented across chunk boundaries.
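A minimal sketch of the idea, assuming a three-level separator hierarchy (paragraph, sentence, word) with a hard character split as last resort; this illustrates the approach, not the LangChain implementation:

```python
def recursive_split(text, max_len, seps=("\n\n", ". ", " ")):
    """Split at the highest-level separator that works; recurse on oversize pieces."""
    text = text.strip()
    if len(text) <= max_len:
        return [text]
    for sep in seps:
        parts = [p for p in text.split(sep) if p]
        if len(parts) < 2:
            continue  # this separator doesn't divide the text; try the next level
        # Greedily pack pieces into buckets of at most max_len characters
        chunks, current = [], ""
        for i, part in enumerate(parts):
            piece = part + (sep if i < len(parts) - 1 else "")
            if current and len(current) + len(piece) > max_len:
                chunks.append(current)
                current = ""
            current += piece
        if current:
            chunks.append(current)
        if len(chunks) > 1:
            # Any bucket still over the limit recurses down the hierarchy
            return [c for ch in chunks for c in recursive_split(ch, max_len, seps)]
    # Last resort: hard character split
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

text = "Para one sentence. Another sentence here.\n\nPara two is a bit longer with many words in it."
print(recursive_split(text, 40))
```

Note how the first paragraph splits at the sentence boundary rather than mid-sentence, even though a fixed 40-character window would have cut it arbitrarily.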
Metadata preservation during chunking is often more important than the chunking strategy itself for production RAG systems. Each chunk should carry source URL or file path, section heading, page number, and any relevant document-level metadata (author, date, document type). This metadata enables post-retrieval filtering — returning only chunks from documents published after a specific date, or only from documents matching a certain category — and powers citations that link LLM answers back to specific source locations in the original documents.
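A sketch of what chunk records with carried-forward metadata might look like, and the filtering and citation patterns they enable (all field names here are hypothetical):

```python
from datetime import date

# Hypothetical chunk records: each chunk carries its document-level metadata
chunks = [
    {"text": "Attention weights are computed as...", "source": "paper.pdf",
     "page": 3, "published": date(2017, 6, 12), "category": "research"},
    {"text": "v2.1 changes the default retriever...", "source": "notes.md",
     "page": 1, "published": date(2024, 3, 4), "category": "changelog"},
]

# Post-retrieval filtering: keep only chunks published after a cutoff date
recent = [c for c in chunks if c["published"] > date(2020, 1, 1)]

# Citation string assembled from chunk metadata
citation = f'[{recent[0]["source"]}, p.{recent[0]["page"]}]'  # "[notes.md, p.1]"
```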
Incremental chunking for document update pipelines re-processes only modified or newly added documents rather than re-chunking the entire corpus on each update. Maintaining a hash of each source document in the chunk metadata enables efficient change detection: if the document hash matches the stored hash, no re-chunking is needed; if it differs, the old chunks are deleted and the document is re-chunked and re-embedded. This incremental approach reduces embedding API costs and index update latency for document corpora with frequent partial updates.
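The change-detection step can be sketched with a content hash per document; the `docs_to_reingest` helper and its dict-based hash store are assumptions for illustration:

```python
import hashlib

def doc_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reingest(docs: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """Return ids of documents that are new or whose content hash has changed."""
    changed = []
    for doc_id, text in docs.items():
        if stored_hashes.get(doc_id) != doc_hash(text):
            changed.append(doc_id)  # delete its old chunks, then re-chunk and re-embed
    return changed

stored = {"a": doc_hash("old text"), "b": doc_hash("unchanged text")}
current = {"a": "updated text", "b": "unchanged text", "c": "brand new doc"}
print(docs_to_reingest(current, stored))  # → ['a', 'c']
```

Unchanged documents ("b" above) are skipped entirely, so embedding cost scales with the size of the diff rather than the corpus.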
Evaluating chunking strategy quality requires measuring retrieval precision and recall on a representative question set before committing to a production configuration. A common evaluation protocol generates 50–100 question-answer pairs from the document corpus, retrieves the top-K chunks for each question, and checks whether the chunk containing the answer appears in the retrieved set. Comparing multiple chunking configurations on this evaluation set provides empirical evidence for configuration decisions that would otherwise be made by intuition or convention.
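The protocol reduces to a hit-rate computation. A sketch, assuming a `retrieve(question, k)` callable that returns scored chunk ids and an eval set pairing each question with the id of its answer-bearing chunk (both names are hypothetical):

```python
def hit_rate(eval_set, retrieve, k=4):
    """Fraction of questions whose answer-bearing chunk appears in the top-k results."""
    hits = 0
    for question, answer_chunk_id in eval_set:
        retrieved_ids = [chunk_id for chunk_id, _score in retrieve(question, k)]
        hits += answer_chunk_id in retrieved_ids
    return hits / len(eval_set)
```

Run this once per candidate configuration (256/0, 512/50, 1024/100, ...) over the same question set, and the configuration decision becomes a comparison of numbers instead of a guess.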