Chunking Strategies

Late Chunking

Embed the entire document first to capture full context, then pool token embeddings into chunk-level vectors. Each chunk's embedding reflects its meaning in the full document context — solving the 'lost context' problem of standard chunking.

Full doc: context for embedding
Post-encode: chunk extraction
Jina AI: primary implementation


SECTION 01

The late chunking insight

Standard chunking has a fundamental problem: when you split a document into chunks and embed each chunk independently, you lose the document-level context. A chunk containing "it was announced last year" loses the antecedent of "it" — the embedding encodes an ambiguous fragment. Similarly, technical abbreviations defined earlier in the document are opaque in later chunks.

Late chunking (Günther et al. 2024, Jina AI) inverts the order of operations. Instead of: split → embed chunks, it does: embed whole document → split embeddings.

The key insight: modern transformer encoders produce token-level embeddings that are contextualised by the entire input sequence (via attention). If you feed the full document through the encoder and then pool the resulting token embeddings into chunk-level vectors, each chunk's embedding reflects its meaning in context, not in isolation. A chunk about "it" now has an embedding shaped by everything that came before.

SECTION 02

How it works technically

Standard chunking: for a document of N tokens, split into chunks [C₁, C₂, ..., Cₖ]. Embed each Cᵢ independently: e(Cᵢ) = mean-pool(encoder(Cᵢ)).
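As a toy sketch of the standard pipeline (the encoder here is a random stand-in, not a real model; chunk token ids are illustrative):

```python
import numpy as np

# Toy stand-in for an encoder: one 4-dim vector per token.
# (Illustrative only; a real encoder contextualises tokens via attention,
# but only within the chunk it is given.)
def encoder(chunk_token_ids):
    rng = np.random.default_rng(0)
    return rng.random((len(chunk_token_ids), 4))

chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Standard chunking: each chunk is embedded with no visibility of the others
embeddings = [encoder(c).mean(axis=0) for c in chunks]
```

Each resulting vector reflects only the tokens inside its own chunk, which is exactly the context-loss problem described above.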

Late chunking for the same document:

  1. Feed the entire document through the long-context encoder: token_embeddings = encoder(D) — shape (N, d).
  2. Determine chunk boundaries: positions [0, b₁, b₂, ..., N].
  3. Pool token embeddings within each chunk boundary: e(Cᵢ) = mean(token_embeddings[bᵢ:bᵢ₊₁]).

The embedding for chunk Cᵢ is now computed from token representations that were contextualised by the full document — they've "seen" every other token via self-attention before being pooled.
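The three steps above can be sketched with a toy token-embedding matrix (the values are stand-ins for an encoder's last hidden state):

```python
import numpy as np

# Step 1 stand-in: N=10 "contextualised" token embeddings of dimension d=4
token_embeddings = np.arange(40, dtype=float).reshape(10, 4)

# Step 2: chunk boundaries in token positions [0, b1, b2, ..., N]
boundaries = [0, 4, 7, 10]

# Step 3: pool token embeddings within each boundary pair:
# e(C_i) = mean(token_embeddings[b_i : b_{i+1}])
chunk_embeddings = [
    token_embeddings[boundaries[i]:boundaries[i + 1]].mean(axis=0)
    for i in range(len(boundaries) - 1)
]
# -> 3 chunk vectors, each of dimension 4
```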

Requirement: the encoder must have a context window large enough to process the full document (hence Jina's 8K token encoder). For documents longer than the encoder context, late chunking must be applied to document segments rather than the full doc.

SECTION 03

Implementation with Jina embeddings

import numpy as np
from transformers import AutoModel, AutoTokenizer
import torch

model_name = "jinaai/jina-embeddings-v2-base-en"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

def late_chunking(text: str, chunk_size_tokens: int = 256) -> list:
    # Tokenize the full document
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=8192)

    # Forward pass: get token-level embeddings from full document context
    with torch.no_grad():
        outputs = model(**inputs)
    token_embeddings = outputs.last_hidden_state[0]  # (seq_len, hidden_dim)

    # Determine chunk boundaries in token space, skipping the special
    # tokens ([CLS]/[SEP]) the tokenizer adds at the start and end
    input_ids = inputs["input_ids"][0]
    boundaries = list(range(1, len(input_ids) - 1, chunk_size_tokens))
    boundaries.append(len(input_ids) - 1)

    # Pool token embeddings within each chunk boundary
    chunk_embeddings = []
    for i in range(len(boundaries) - 1):
        chunk_tokens = token_embeddings[boundaries[i]:boundaries[i+1]]
        chunk_emb = chunk_tokens.mean(dim=0).numpy()
        chunk_embeddings.append(chunk_emb)

    return chunk_embeddings

document_text = "..."  # the full document string to embed
embeddings = late_chunking(document_text, chunk_size_tokens=256)
print(f"Created {len(embeddings)} contextualised chunk embeddings")

SECTION 04

Late chunking vs standard chunking

On retrieval benchmarks, late chunking outperforms standard chunking by 4–12% on documents where context is important for disambiguation. The gains are largest for: documents with many anaphora (pronouns, "this", "the above"), technical documents where acronyms are defined early and used throughout, legal documents with cross-references, and scientific papers where a finding mentioned in the abstract is referenced by number later.

The gains are smaller on: short documents (where context loss is less severe), documents about a single well-defined topic (less ambiguity), and factoid retrieval from simple FAQs.

A key result from the Jina AI paper: late chunking matches or exceeds parent-child and sentence-window chunking on the BEIR benchmark while using a simpler retrieval architecture (no separate parent store needed).

SECTION 05

Limitations

Encoder context window: The full document must fit in the encoder's context window. Jina v2 supports 8192 tokens. For longer documents, you must either: truncate (losing late-chunking's main benefit for later sections), segment the document into 8K windows and apply late chunking to each, or use a long-context encoder (Jina v3 or similar).
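The segmentation option can be sketched as overlapping macro-windows over the token sequence, with late chunking applied within each window. The 8192 figure matches Jina v2's context; the 512-token overlap is an illustrative assumption:

```python
def macro_windows(n_tokens: int, window: int = 8192, overlap: int = 512):
    """Yield (start, end) token spans covering a document longer than the
    encoder's context; late chunking is then applied within each span."""
    step = window - overlap
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        yield (start, end)
        if end == n_tokens:
            break
        start += step

spans = list(macro_windows(20000))
# a 20K-token document is covered by three overlapping 8K windows
```

Chunks near a window boundary only see context from their own window, so some of late chunking's benefit is lost at the seams; the overlap mitigates this.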

Inference cost: One forward pass over a full 8K-token document uses more VRAM, and more attention compute, than K separate forward passes over K small chunks, because self-attention cost grows quadratically with sequence length. For batch processing at scale, measure the actual throughput difference on your hardware.
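A back-of-envelope comparison of the attention term only (this ignores the feed-forward layers, which scale linearly and dominate at short lengths, so it overstates the real gap):

```python
# Attention cost scales ~ O(N^2) in sequence length.
N = 8192           # full-document tokens
K = 32             # number of 256-token chunks
chunk_len = N // K

full_doc_attention = N ** 2               # one long forward pass
chunked_attention = K * chunk_len ** 2    # K independent short passes

ratio = full_doc_attention / chunked_attention
# the quadratic term alone is K times larger for the full-document pass
```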

Requires compatible model: Late chunking requires an encoder that exposes token-level embeddings and has enough context to contextualise the full document. Not all embedding APIs expose token-level embeddings — OpenAI's embeddings API, for example, returns only a single pooled vector per input, making late chunking impossible via the API.

SECTION 06

When to use it

Late chunking is the right choice when: documents have strong cross-references or contextual dependencies, you're using a self-hosted encoder (no API token-level embedding restrictions), document length fits within 8K tokens, and retrieval quality is a higher priority than ingestion speed/cost.

For most production systems today, parent-child or sentence-window provides comparable quality with broader infrastructure compatibility. Monitor this space — as long-context encoders improve, late chunking's advantages will compound.

SECTION 07

Gotchas

Pooling strategy matters: Mean pooling across all tokens in a chunk boundary is the standard approach. Max pooling and CLS token approaches have also been tried. Mean pooling is most robust and matches Jina's implementation.
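The difference between the two main pooling choices, on toy values:

```python
import numpy as np

# Three "contextualised" token embeddings of dimension 2 (toy values)
chunk_tokens = np.array([[0.0, 1.0],
                         [2.0, 3.0],
                         [4.0, 9.0]])

mean_pooled = chunk_tokens.mean(axis=0)  # averages all tokens in the chunk
max_pooled = chunk_tokens.max(axis=0)    # keeps the per-dimension maximum
```

Mean pooling blends every token's contribution, while max pooling lets a single extreme token dominate a dimension, which is one reason mean pooling tends to be the more robust default.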

Special tokens at boundaries: Tokenizers add special tokens ([CLS], [SEP]) at the start and end of the full document. These shouldn't be included in chunk pooling. Slice the token embeddings starting from index 1 and ending at -1 to exclude special tokens.
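The slicing looks like this (random values stand in for real encoder output):

```python
import numpy as np

# Suppose the encoder returned embeddings for [CLS] t1 ... t6 [SEP]
seq_len, d = 8, 4
token_embeddings = np.random.rand(seq_len, d)

# Exclude the special tokens before pooling: keep indices 1 .. seq_len-2
content = token_embeddings[1:-1]  # (6, d) content-token embeddings only

# Chunk boundaries are now relative to the content tokens, not the raw sequence
boundaries = [0, 3, len(content)]
chunks = [content[boundaries[i]:boundaries[i + 1]].mean(axis=0)
          for i in range(len(boundaries) - 1)]
```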

Batching for production: For bulk ingestion, batch documents that fit within the context window together. Jina's model can process ~32 documents of 256 tokens in a single forward pass on a 24GB GPU.
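When batching, shorter documents are padded, and padding tokens must be excluded from pooling via the attention mask. A minimal sketch of mask-aware mean pooling (random values stand in for encoder output; in late chunking the same mask logic applies within each chunk span):

```python
import numpy as np

# Batched "encoder output": 2 padded documents, 6 token slots, 4 dims
token_embeddings = np.random.rand(2, 6, 4)
attention_mask = np.array([[1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 0, 0]])  # doc 2 has 2 padding slots

def masked_mean(emb, mask):
    # Zero out padding positions, then divide by the count of real tokens
    mask = mask[..., None].astype(float)  # (batch, seq, 1)
    return (emb * mask).sum(axis=1) / mask.sum(axis=1)

doc_embeddings = masked_mean(token_embeddings, attention_mask)
```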

Late Chunking vs. Standard Chunking

Late chunking is a retrieval technique that generates embeddings from full-document context before splitting into retrievable chunks, rather than embedding chunks in isolation. This preserves cross-chunk semantic context in each chunk's embedding, addressing a fundamental weakness of standard chunking where a chunk's embedding only reflects the information visible within that chunk's token window.

| Approach                   | When Context Applied | Chunk Embedding Quality | Index Size | Compute                |
|----------------------------|----------------------|-------------------------|------------|------------------------|
| Standard chunking          | At chunk level only  | Context-blind           | Normal     | Low                    |
| Late chunking              | Full document first  | Context-aware           | Normal     | Higher (full doc pass) |
| Hypothetical doc embedding | Query side           | Context-blind           | Normal     | LLM per query          |
| ColBERT multi-vector       | Token level          | Per-token               | Large      | High                   |

The mechanism of late chunking processes the entire document through a long-context embedding model (such as jina-embeddings-v2 with 8K context), obtaining contextualized token embeddings for every token in the document. These contextualized embeddings are then mean-pooled within each chunk's token span to produce the chunk embedding. Because each token embedding already incorporates information from the surrounding document context, the resulting chunk embeddings carry implicit cross-chunk information that standard mean-pooling over isolated chunks cannot achieve.

Late chunking works best for documents where adjacent chunks are semantically linked by reference, pronoun coreference, or continuing argument. Technical documents that use abbreviations defined earlier in the text, research papers that reference methodology described in previous sections, and legal documents where terms are defined once and used throughout benefit significantly from late chunking. For documents with independent, self-contained sections — FAQs, product catalogs, recipe collections — the quality difference between late and standard chunking is smaller since cross-chunk context adds less value.

# Late chunking: embed full doc, then pool over chunk spans
from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained("jinaai/jina-embeddings-v2-base-en", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v2-base-en", trust_remote_code=True)

def late_chunk_embed(document: str, chunk_spans: list[tuple[int, int]]):
    # Tokenize the full document, keeping character offsets so the
    # character-level chunk_spans can be mapped onto token positions
    tokens = tokenizer(document, return_tensors="pt", truncation=True,
                       max_length=8192, return_offsets_mapping=True)
    offsets = tokens.pop("offset_mapping")[0].tolist()  # per-token (char_start, char_end)
    with torch.no_grad():
        output = model(**tokens)  # last_hidden_state: (1, seq_len, hidden)
    token_embeddings = output.last_hidden_state[0]  # (seq_len, hidden)

    chunk_embeddings = []
    for start, end in chunk_spans:
        # Select tokens whose character range overlaps the chunk span;
        # special tokens have (0, 0) offsets and are excluded by e > s
        idx = [i for i, (s, e) in enumerate(offsets)
               if s < end and e > start and e > s]
        chunk_embeddings.append(token_embeddings[idx].mean(dim=0))
    return chunk_embeddings

Late chunking is not universally superior to standard chunking and incurs a meaningful compute cost premium. For short documents where every chunk can be embedded with full local context visible within the chunk window, standard chunking performs comparably to late chunking at lower cost. Late chunking provides the clearest benefits for long documents with many intra-document references, documents that use defined abbreviations extensively, and technical content where the meaning of a sentence depends heavily on earlier sections establishing the conceptual framework.

The choice of long-context embedding model for late chunking significantly affects result quality. Embedding models pretrained on short sequences (512 tokens) and used beyond their training context produce degraded representations for tokens far from the sequence start. Models specifically trained for long-context embedding — jina-embeddings-v2, e5-mistral-7b, and similar — maintain representation quality across their full supported context length and should be selected specifically for late chunking pipelines rather than repurposing a standard sentence embedding model.