SECTION 01
Why Tokenization Matters
Large language models don't read text character-by-character. Instead, they operate on sequences of discrete units called tokens. A token is typically 2-4 characters for English, but can range from a single byte to entire words.
Tokenization sits at the boundary between the user (who thinks in characters and words) and the model (which thinks in token IDs). Understanding tokenization is crucial because it affects three key areas:
1. Cost & Billing — API pricing is per token, not per character. The same prompt can have different token counts depending on the tokenizer. A 1000-character prompt might be 200 tokens (common English) or 500 tokens (rare symbols, mixed languages). That's a 2.5x cost difference.
2. Context Window — Model context length is measured in tokens, not characters. GPT-4 Turbo has a 128k token context window, but in practice you get fewer characters than you might expect because each token is ~4 characters on average. A 128k token window is roughly 500k characters of English text—substantial, but not infinite.
3. Behavior & Safety — Tokenization affects how the model processes text. Unusual tokenization (e.g., token splitting that separates semantically related concepts) can cause unexpected behavior, jailbreaks, or unsafe outputs. For example, if a safety word is tokenized across multiple tokens, some safety fine-tuning might not apply.
Historically, models operated on character or word vocabularies, which was inefficient. Character-level models needed thousands of steps for short documents. Word-level models couldn't handle out-of-vocabulary words (OOV). BPE solved this by learning a vocabulary of subword units that balances efficiency and expressiveness.
Key Insight: Tokenization is lossy compression. We map human-readable text to a compact sequence of integers. The tokenizer is deterministic, but not all information is preserved—sometimes important semantic or syntactic information gets scattered across token boundaries.
SECTION 02
Byte-Pair Encoding (BPE)
Byte-Pair Encoding is the dominant tokenization algorithm, used by GPT-2, GPT-3, GPT-4, and many other models. BPE builds a vocabulary by iteratively merging the most frequent byte pairs in a corpus.
How BPE Works
Starting with raw bytes (256 basic tokens representing 0-255), BPE repeatedly finds the most common adjacent pair and merges them into a new token:
Iteration 1: Vocabulary = [0-255 bytes]
Corpus: "hello world hello"
In bytes: [h=104, e=101, l=108, l=108, o=111, ...]
Count pairs:
"he" → 2 occurrences
"el" → 2 occurrences
"ll" → 2 occurrences
"lo" → 2 occurrences
"o " → 1 occurrence
...
Most frequent: "he" (or any 2-occurrence pair)
Action: Add new token 256 → "he"
Update corpus: [256, l=108, l=108, o=111, ...]
Iteration 2: Find next most frequent pair
"ll" → 2 occurrences
Action: Add token 257 → "ll"
...
After 50,000 iterations: Vocabulary ≈ 50k tokens
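The merge loop above can be sketched in a few lines of Python. This is a toy trainer for illustration, not the optimized implementation real tokenizers use:

```python
from collections import Counter

def train_bpe(data: bytes, num_merges: int):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    ids = list(data)                      # start from raw bytes (0-255)
    merges = {}                           # (a, b) -> new token id
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break                         # nothing left worth merging
        merges[(a, b)] = next_id
        # Replace every occurrence of the pair with the new token
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and ids[i] == a and ids[i + 1] == b:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return merges, ids

merges, ids = train_bpe(b"hello world hello", 2)
print(merges)  # two new tokens with ids 256 and 257
print(ids)     # corpus re-encoded with the merged tokens
```

Each merge shortens the corpus by replacing a frequent two-token pair with a single new token, which is exactly why BPE compresses common text well.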
Properties of BPE
- Learned from data: Common sequences in training data become tokens; rare sequences stay split.
- Language-aware: English tokenizers learn "ing", "ed", "tion" as single tokens. Chinese/Japanese learn character n-grams.
- Lossless (within BPE): Any byte sequence can be represented; BPE never fails (unlike WordPiece, which falls back to [UNK]).
- Deterministic: Same text always produces the same tokens (given the same vocab).
BPE Tokenization Process
Given a trained vocab, tokenizing text is straightforward:
Vocabulary (excerpt):
256: "he"
257: "ll"
258: "hello"
259: "world"
...
Text: "hello world"
Step 1: Encode as bytes: [h=104, e=101, l=108, l=108, o=111, ...]
Step 2: Greedily merge from left to right:
- h=104, e=101 → found "he" (256) ✓
- l=108, l=108 → found "ll" (257) ✓
- o=111, space=32 → not in vocab
- space=32, w=119 → not in vocab
...
Result: [256 (he), 257 (ll), 111 (o), ...]
Note: This example uses simple left-to-right greedy matching for clarity, and it is not optimal: it yields [he, ll, o] even though "hello" (258) is in the vocabulary. Real BPE implementations instead apply the learned merge rules in priority order (earliest-learned merge first), which usually produces better segmentations.
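Production BPE encoders apply the learned merge table in rank order: the earliest-learned merge has the highest priority. A minimal sketch, reusing the merge IDs from the training example above ("he" → 256, "ll" → 257):

```python
def bpe_encode(data: bytes, merges: dict) -> list:
    """Apply merge rules in the order they were learned (lowest id first)."""
    ids = list(data)
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # Pick the pair whose merge was learned earliest
        best = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if best not in merges:
            break                         # no applicable merges remain
        new_id = merges[best]
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

# Merges learned as in the training example: "he" -> 256, "ll" -> 257
merges = {(104, 101): 256, (108, 108): 257}
print(bpe_encode(b"hello", merges))  # [256, 257, 111]
```

The ranking matters: if two merges overlap, the one learned earlier wins, so the same merge table always yields the same tokenization.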
GPT Tokenizers (tiktoken)
OpenAI's models use a modified BPE, implemented in the open-source Python package tiktoken. Key differences from vanilla BPE:
- Regex preprocessing: Splits on whitespace and punctuation before BPE, improving space handling
- Special tokens: Reserved tokens like <|im_start|>, <|im_end|> for chat formats
- Larger vocab: GPT-4 has ~100k tokens (larger than GPT-2's ~50k)
# Using tiktoken (Python)
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Hello, world!")
print(tokens) # [9906, 11, 1917, 0]
print(enc.decode(tokens)) # "Hello, world!"
# Count tokens in a string
def count_tokens(text, model="gpt-4"):
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

print(count_tokens("The quick brown fox"))  # 4 with cl100k_base
print(count_tokens("Supercalifragilisticexpialidocious"))  # splits into several subword tokens
BPE Advantage: BPE is simple, fast, and effective. It's learned from data, so it adapts to any language. It always finds a tokenization (lossless). The downside: token boundaries are learned, not linguistic, so occasional quirks emerge.
SECTION 03
SentencePiece & WordPiece
While BPE dominates for large models, alternatives exist with different tradeoffs:
SentencePiece (Google)
SentencePiece is a language-agnostic tokenization framework used by many non-English and multilingual models (T5, mBERT, LLaMA). Key features:
- Space preservation: Treats spaces as explicit tokens (▁), making decoding lossless and readable
- BPE or Unigram: Can use BPE or Unigram LM as the merge algorithm
- Multilingual: No language-specific preprocessing; operates on raw text (optionally with byte fallback)
- Reversible: Can always recover original text from tokens
SentencePiece tokens (excerpt):
<s>: sentence start
</s>: sentence end
▁: space marker (the "meta symbol", U+2581)
e, l, o, h, w, r, d: common characters
▁The, ▁quick, ▁brown: subword units
Text: "hello world"
Tokens: [h, e, l, l, o, ▁, w, o, r, l, d]
Or with merging learned:
Tokens: [hello, ▁world]
Decoding: hello + ▁ + world = "hello world"
Space is explicit, so no ambiguity.
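Because the ▁ symbol records exactly where spaces were, detokenization is a pure string operation. A minimal sketch (simplified: real SentencePiece strips only the single dummy prefix space it adds during encoding):

```python
def sp_decode(tokens):
    """Lossless SentencePiece-style detokenization: the meta symbol
    (U+2581) marks where a space appeared in the original text."""
    text = "".join(tokens).replace("\u2581", " ")
    return text.lstrip(" ")  # drop the artificial leading space, if any

print(sp_decode(["hello", "\u2581world"]))       # hello world
print(sp_decode(["\u2581The", "\u2581quick"]))   # The quick
```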
WordPiece (Google/BERT)
WordPiece is similar to BPE but uses a different merge strategy. It is common in BERT and other encoder models:
- Greedy longest-match: At each position, match the longest token in vocab (vs BPE's frequency-based merging)
- [UNK] fallback: If a character sequence can't be tokenized, emit the unknown token [UNK]
- Subword prefix: Uses ## prefix for non-initial subwords (e.g., "un" + "##breakable")
WordPiece tokenization of "unbreakable":
Step 1: Check "unbreakable" → not in vocab
Step 2: Check "unbreak" → not in vocab
Step 3: Check "unbr" → not in vocab
Step 4: Check "un" → found! Use it
Remaining: "breakable" (a continuation, so candidates carry the ## prefix)
Step 5: Check "##breakable" → not in vocab
Step 6: Check "##break" → found! Use it
Remaining: "able"
Step 7: Check "##able" → found! Use it
Tokens: ["un", "##break", "##able"]
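The longest-match procedure above is straightforward to implement. A sketch with a toy vocabulary (hypothetical, for illustration):

```python
def wordpiece_tokenize(word: str, vocab: set) -> list:
    """Greedy longest-match WordPiece: at each position take the longest
    vocab entry; non-initial pieces carry the ## prefix."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece      # continuation pieces are prefixed
            if piece in vocab:
                cur = piece
                break
            end -= 1                      # try a shorter match
        if cur is None:
            return ["[UNK]"]              # fallback when nothing matches
        tokens.append(cur)
        start = end
    return tokens

vocab = {"un", "##break", "##able", "break", "able"}
print(wordpiece_tokenize("unbreakable", vocab))  # ['un', '##break', '##able']
```

Note that "break" and "able" are in the vocab but are never used mid-word: only the ##-prefixed forms match after the first piece.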
Comparison
| Aspect |
BPE |
SentencePiece |
WordPiece |
| Algorithm |
Frequency-based merging |
BPE or Unigram LM |
Greedy longest-match |
| Language-specific? |
Requires preprocessing |
Language-agnostic |
Requires preprocessing |
| Space handling |
Via preprocessing |
Explicit ▁ token |
Explicit space token |
| Fallback |
Lossless (always works) |
Lossless |
on failure |
| Models |
GPT-2, GPT-4, Llama |
T5, mBERT, Llama 2 |
BERT, ALBERT |
SECTION 04
Tokenizer Quirks
Tokenization is not always intuitive. Here are common surprises:
1. Unicode Fragmentation
Non-ASCII characters often tokenize into multiple tokens. A single emoji or accented character can be 2-4 tokens:
Text: "café"
Tokens: [caf, é] or [c, a, f, é], depending on the vocab
The accented "é" is two UTF-8 bytes, so it can cost 1-2 tokens by itself.
Text: "naïve"
Tokens: [na, ï, ve] or [n, a, ï, v, e]
Depends on training data.
Text: "你好" (Chinese "hello")
Tokens: [你, 好], or several byte-level tokens per character
Depends on vocab size and training data.
Cost implications: Non-ASCII text is more expensive!
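One reason for the extra cost: byte-level BPE operates on UTF-8 bytes, and non-ASCII characters take 2-4 bytes each. Counting bytes gives a quick sense of the worst case (one token per byte):

```python
def utf8_bytes(text: str) -> int:
    """Number of UTF-8 bytes; for a byte-level BPE this bounds how
    badly a string can fragment (worst case: one token per byte)."""
    return len(text.encode("utf-8"))

for s in ["cafe", "café", "你好"]:
    print(s, len(s), "chars,", utf8_bytes(s), "bytes")
# "café" is 4 characters but 5 bytes; "你好" is 2 characters but 6 bytes.
```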
2. Whitespace Sensitivity
Leading/trailing spaces or multiple spaces tokenize differently:
Text: "hello" → tokens: [hello]
Text: " hello" → tokens: [space, hello]
Text: " hello" → tokens: [space, space, hello] or [2-space-token, hello]
Cost: Same word with vs without leading space costs differently.
3. Number Tokenization
Numbers are tokenized digit-by-digit or as number words:
Text: "12345"
Tokens: [12345] or [123, 45] or [1, 2, 3, 4, 5]
GPT-4 usually: [123, 45] or similar splits
Implications:
- Math problems with large numbers are token-expensive
- Model may struggle with out-of-distribution numbers
- Arithmetic over 3-4 digit numbers is harder for models
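Newer GPT tokenizers pre-split digit runs into groups of at most three before BPE runs. A loose approximation of that behavior (the real pre-tokenization regex is more involved):

```python
import re

def split_digits(number: str) -> list:
    """Split a digit string into left-to-right groups of up to three,
    loosely mimicking the digit grouping in newer GPT pre-tokenizers."""
    return re.findall(r"\d{1,3}", number)

print(split_digits("12345"))    # ['123', '45']
print(split_digits("1234567"))  # ['123', '456', '7']
```

This grouping caps how long any single numeric token can be, which is one reason long numbers are token-expensive.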
4. Case Sensitivity
Uncased BERT variants (e.g., bert-base-uncased) lowercase all input, so they are case-insensitive. GPT models are case-sensitive:
BERT (uncased): "HELLO" and "hello" → same tokens
GPT-4: "HELLO" → [HELLO] or [H, ELL, O]
"hello" → [hello]
Cost: All-caps text may tokenize differently.
5. Punctuation Attachment
Punctuation is sometimes attached to words, sometimes separate:
Text: "hello, world"
Tokens: [hello, ",", " world"] or [hello, ",", world]
Depends on training data and vocab.
Text: "don't"
Tokens: [don, ', t] or [don't] or [do, n't]
Practical Tip: Always count tokens for your specific text and model before assuming cost. Use tiktoken or the model's tokenizer directly. A 1000-character prompt might be 150 tokens or 600 tokens depending on language and punctuation.
SECTION 05
Tokenizer Arithmetic
Estimating and managing token count is a practical necessity when building LLM applications.
Counting Tokens with tiktoken
import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Count tokens in text for a given model."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def estimate_api_cost(
    text: str,
    model: str = "gpt-4-turbo",
    input_cost_per_1k: float = 0.01,
    output_cost_per_1k: float = 0.03,
) -> dict:
    """Estimate API cost for a prompt."""
    input_tokens = count_tokens(text, model)
    # Estimate output (rough: 30% of input, capped at 2000)
    output_estimate = min(int(input_tokens * 0.3), 2000)
    input_cost = (input_tokens / 1000) * input_cost_per_1k
    output_cost = (output_estimate / 1000) * output_cost_per_1k
    return {
        "input_tokens": input_tokens,
        "output_estimate": output_estimate,
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": input_cost + output_cost,
    }

# Example
text = "Explain quantum computing in 500 words."
result = estimate_api_cost(text)
print(f"Tokens: {result['input_tokens']}")
print(f"Est. cost: ${result['total_cost']:.4f}")

# Context window management
def truncate_to_context(
    messages: list[dict],
    max_tokens: int = 128000,
    model: str = "gpt-4",
) -> list[dict]:
    """Remove oldest messages if the context budget is exceeded."""
    enc = tiktoken.encoding_for_model(model)
    total = 0
    kept = []
    # Count tokens, keeping the newest messages first
    for msg in reversed(messages):
        msg_tokens = len(enc.encode(msg["content"]))
        if total + msg_tokens <= max_tokens:
            kept.append(msg)
            total += msg_tokens
        else:
            break
    return list(reversed(kept))  # Restore chronological order
Rule-of-Thumb Estimates
For quick estimation without calling the API:
- English: ~4 characters per token (divide text length by 4)
- Code: ~3 characters per token (denser, more punctuation)
- Non-ASCII: ~2 characters per token (unicode overhead)
- Math/symbols: ~2 characters per token
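These rules of thumb are easy to package as a quick offline estimator. The ratios below are the ones listed above, not measured values; always use the real tokenizer for billing:

```python
# Rough characters-per-token ratios from the rules of thumb above
CHARS_PER_TOKEN = {"english": 4, "code": 3, "non_ascii": 2, "math": 2}

def rough_token_estimate(text: str, kind: str = "english") -> int:
    """Quick offline token estimate; no tokenizer or API call needed."""
    return max(1, round(len(text) / CHARS_PER_TOKEN[kind]))

print(rough_token_estimate("x" * 400))          # ~100 tokens as English
print(rough_token_estimate("x" * 400, "code"))  # ~133 tokens as code
```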
Cost Optimization Strategies
- Summarize context: Instead of including full documents, summarize them (shorter, cheaper)
- Use smaller models: For simple tasks, use GPT-3.5 instead of GPT-4 (cheaper per token, faster)
- Batch requests: Process many requests in one API call if possible
- Cache results: Store responses to avoid re-processing identical prompts
- Compress input: Remove unnecessary whitespace, combine similar requests
Example Savings: Summarizing a 5000-token document to 500 tokens saves ~90% on input cost. If you process this 1000 times/month, that's substantial savings.
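A quick sanity check of that claim; the $0.01 per 1k input tokens used here is an assumed example price, not a quoted rate:

```python
def summarization_savings(orig_tokens, summary_tokens, runs_per_month,
                          cost_per_1k=0.01):
    """Monthly input-cost savings from summarizing context up front."""
    before = orig_tokens / 1000 * cost_per_1k * runs_per_month
    after = summary_tokens / 1000 * cost_per_1k * runs_per_month
    return before - after, 1 - after / before

saved, pct = summarization_savings(5000, 500, 1000)
print(f"${saved:.2f} saved/month ({pct:.0%})")  # $45.00 saved/month (90%)
```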
SECTION 06
Multilingual Tokenization
English tokenizers are optimized for English but work (inefficiently) on other languages. This creates a fairness issue: non-English users pay more.
Token Efficiency by Language
Tokenization efficiency varies by language. The table below shows approximate characters per token and the resulting cost multiplier ("fertility ratio") relative to English:
| Language | Chars/Token | Fertility Ratio | Notes |
| English | ~4 | 1.0 (baseline) | Optimized for GPT tokenizers |
| Spanish/French | ~3.5 | 1.1x | Slightly less efficient |
| German | ~3.2 | 1.25x | Longer words |
| Japanese | ~1.5 | 2.7x | Each character is a token |
| Chinese | ~1.3 | 3.1x | Each character is a token |
| Arabic | ~2 | 2.0x | Diacritics, ligatures |
Why the Difference?
English GPT tokenizers learn "the", "and", "ing" as single tokens. Chinese has no space-delimited word boundaries, so each character is usually its own token, and Chinese text costs roughly 3x more tokens per character than English.
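The fertility ratios in the table follow directly from characters per token: the relative cost is just the English baseline divided by the language's ratio.

```python
def relative_cost(chars_per_token: float, baseline: float = 4.0) -> float:
    """Cost multiplier vs. English, given average characters per token.
    Fewer characters per token means more tokens for the same text."""
    return baseline / chars_per_token

print(f"{relative_cost(1.3):.1f}x")  # Chinese: 3.1x
print(f"{relative_cost(1.5):.1f}x")  # Japanese: 2.7x
```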
Implications for Multilingual Models
Multilingual models (like mBERT or mT5) use language-agnostic tokenizers (usually SentencePiece) that are fairer across languages, but less efficient than language-specific tokenizers. The tradeoff:
Language-specific (GPT):
- English: Very efficient
- Japanese: Inefficient (no Japanese-specific vocab)
Language-agnostic (SentencePiece):
- English: Moderate (slightly less efficient than BPE)
- Japanese: Better (learned Japanese subwords)
Conclusion: Multilingual models often have lower average efficiency but more equitable cost across languages.
Handling Mixed-Language Text
If you're mixing languages (code-switching), be aware:
- Switching between English and Japanese in a single sentence is expensive
- The tokenizer can't efficiently handle both alphabets
- Each language fragment may require extra tokens
Fairness Concern: Non-English speakers pay 2-3x more per character for the same API, due to tokenization inefficiency. This is a known issue in the AI industry. Language-agnostic tokenizers and multilingual models partially address this, but the problem persists.
SECTION 07
Special Tokens
Beyond regular tokens, models use special tokens for control, structure, and safety. These are reserved tokens with predefined meanings.
Common Special Tokens
- <BOS>: Beginning of Sequence (marks start of generation)
- <EOS>: End of Sequence (marks end)
- <PAD>: Padding (for batch processing, fills unused positions)
- <UNK>: Unknown token (fallback for OOV words)
- <MASK>: Masked token (used in masked language modeling)
Chat Format Special Tokens
ChatML format (OpenAI GPT models; also adopted by many open models):
<|im_start|>user
What is tokenization?
<|im_end|>
<|im_start|>assistant
Tokenization is...
<|im_end|>
Each <|im_start|> and <|im_end|> is a special token.
These tokens are reserved; they are not part of the regular learned vocabulary.
Llama 3 chat format:
<|start_header_id|>user<|end_header_id|>
What is tokenization?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
Tokenization is...<|eot_id|>
Similar structure with different tokens.
Adding Custom Tokens
If you're fine-tuning a model, you may add new special tokens:
# Using Hugging Face Transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Original size (len() includes added tokens; tokenizer.vocab_size does not)
print(f"Original: {len(tokenizer)}")  # 50257

# Add custom tokens (names here are illustrative)
new_tokens = ["<|code|>"]
tokenizer.add_tokens(new_tokens)
print(f"After adding: {len(tokenizer)}")  # 50258

# Save updated tokenizer
tokenizer.save_pretrained("./custom_tokenizer")

# When fine-tuning, resize model embeddings:
model.resize_token_embeddings(len(tokenizer))
# This adds new rows to the token embedding matrix
Token Efficiency in Chat Protocols
Special tokens add overhead. A typical chat message costs extra tokens for structure:
Message: "Hello"
Raw tokens: [Hello] = 1 token
With chat format: [<|im_start|>, user, \n, Hello, \n, <|im_end|>]
= 6-7 tokens
Cost: 6-7x overhead for a single-token message!
Optimization: Batch messages, minimize chat wrapper overhead.
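A tiny illustration of where the overhead comes from; chatml_wrap is a hypothetical helper (real chat APIs apply the template server-side):

```python
def chatml_wrap(role: str, content: str) -> str:
    """Wrap a message in ChatML-style delimiters. Each delimiter, the
    role name, and the newlines all cost tokens on top of the content."""
    return f"<|im_start|>{role}\n{content}\n<|im_end|>\n"

wrapped = chatml_wrap("user", "Hello")
print(wrapped)
# A one-token message carries several tokens of wrapper, so short
# messages have proportionally large structural overhead.
```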
Reserved Tokens & Safety
Some models reserve token ranges for safety. For example, certain token IDs might trigger refusals. This is not well-documented for commercial models, but is sometimes used in research:
- Token IDs in range [50000, 50100] might be reserved for internal use
- Trying to force-decode these tokens may cause errors or unexpected behavior
- Fine-tuning should avoid modifying reserved ranges
Best Practice: Understand your model's special tokens. Document custom tokens. When fine-tuning, resize embeddings properly. Monitor that token counts match expectations (off-by-one errors are common).