SECTION 01
Why Tokenization Matters
Large language models don't read text character-by-character. Instead, they operate on sequences of discrete units called tokens. A token is typically 2-4 characters for English, but can range from a single byte to entire words.
Tokenization sits at the boundary between the user (who thinks in characters and words) and the model (which thinks in token IDs). Understanding tokenization is crucial because it affects three key areas:
1. Cost & Billing — API pricing is per token, not per character. The same prompt can have different token counts depending on the tokenizer. A 1000-character prompt might be 200 tokens (common English) or 500 tokens (rare symbols, mixed languages). That's a 2.5x cost difference.
2. Context Window — Model context length is measured in tokens, not characters. GPT-4 Turbo has a 128k token context window, but in practice you get fewer characters than you might expect because each token is ~4 characters on average. A 128k token window is roughly 500k characters of English text—substantial, but not infinite.
3. Behavior & Safety — Tokenization affects how the model processes text. Unusual tokenization (e.g., token splitting that separates semantically related concepts) can cause unexpected behavior, jailbreaks, or unsafe outputs. For example, if a safety word is tokenized across multiple tokens, some safety fine-tuning might not apply.
Historically, models operated on character or word vocabularies, which was inefficient. Character-level models needed thousands of steps for short documents. Word-level models couldn't handle out-of-vocabulary words (OOV). BPE solved this by learning a vocabulary of subword units that balances efficiency and expressiveness.
Key Insight: Tokenization is lossy compression. We map human-readable text to a compact sequence of integers. The tokenizer is deterministic, but not all information is preserved—sometimes important semantic or syntactic information gets scattered across token boundaries.
SECTION 02
Byte-Pair Encoding (BPE)
Byte-Pair Encoding is the dominant tokenization algorithm, used by GPT-2, GPT-3, GPT-4, and many other models. BPE builds a vocabulary by iteratively merging the most frequent byte pairs in a corpus.
How BPE Works
Starting with raw bytes (256 basic tokens representing 0-255), BPE repeatedly finds the most common adjacent pair and merges them into a new token:
Iteration 1: Vocabulary = [0-255 bytes]
Corpus: "hello world hello"
In bytes: [h=104, e=101, l=108, l=108, o=111, ...]
Count pairs:
"he" → 2 occurrences
"el" → 2 occurrences
"ll" → 2 occurrences
"lo" → 2 occurrences
"o " → 1 occurrence
...
Most frequent: "he" (or any 2-occurrence pair)
Action: Add new token 256 → "he"
Update corpus: [256, l=108, l=108, o=111, ...]
Iteration 2: Find next most frequent pair
"ll" → 2 occurrences
Action: Add token 257 → "ll"
...
After 50,000 iterations: Vocabulary ≈ 50k tokens
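The merge loop above can be sketched in a few lines of Python. This is a toy trainer for illustration, not the optimized implementation real tokenizers use:

```python
from collections import Counter

def train_bpe(data: bytes, num_merges: int):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    ids = list(data)                      # start from raw bytes (0-255)
    merges = {}                           # (a, b) -> new token id
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break                         # nothing left worth merging
        merges[(a, b)] = next_id
        # Replace every occurrence of the pair with the new token
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and ids[i] == a and ids[i + 1] == b:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return merges, ids

merges, ids = train_bpe(b"hello world hello", 2)
print(merges)  # two new tokens with ids 256 and 257
print(ids)     # corpus re-encoded with the merged tokens
```

Each merge shortens the corpus by replacing a frequent two-token pair with a single new token, which is exactly why BPE compresses common text well.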
Properties of BPE
- Learned from data: Common sequences in training data become tokens; rare sequences stay split.
- Language-aware: English tokenizers learn "ing", "ed", "tion" as single tokens. Chinese/Japanese learn character n-grams.
- Lossless (within BPE): Any byte sequence can be represented; BPE never fails (unlike WordPiece, which falls back to [UNK]).
- Deterministic: Same text always produces the same tokens (given the same vocab).
BPE Tokenization Process
Given a trained vocab, tokenizing text is straightforward:
Vocabulary (excerpt):
256: "he"
257: "ll"
258: "hello"
259: "world"
...
Text: "hello world"
Step 1: Encode as bytes: [h=104, e=101, l=108, l=108, o=111, ...]
Step 2: Greedily merge from left to right:
- h=104, e=101 → found "he" (256) ✓
- l=108, l=108 → found "ll" (257) ✓
- o=111, space=32 → not in vocab
- space=32, w=119 → not in vocab
...
Result: [256 (he), 257 (ll), 111 (o), ...]
Note: This example uses simple left-to-right greedy matching for clarity, and it is not optimal: it yields [he, ll, o] even though "hello" (258) is in the vocabulary. Real BPE implementations instead apply the learned merge rules in priority order (earliest-learned merge first), which usually produces better segmentations.
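Production BPE encoders apply the learned merge table in rank order: the earliest-learned merge has the highest priority. A minimal sketch, reusing the merge IDs from the training example above ("he" → 256, "ll" → 257):

```python
def bpe_encode(data: bytes, merges: dict) -> list:
    """Apply merge rules in the order they were learned (lowest id first)."""
    ids = list(data)
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # Pick the pair whose merge was learned earliest
        best = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if best not in merges:
            break                         # no applicable merges remain
        new_id = merges[best]
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

# Merges learned as in the training example: "he" -> 256, "ll" -> 257
merges = {(104, 101): 256, (108, 108): 257}
print(bpe_encode(b"hello", merges))  # [256, 257, 111]
```

The ranking matters: if two merges overlap, the one learned earlier wins, so the same merge table always yields the same tokenization.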
GPT Tokenizers (tiktoken)
OpenAI's models use a modified BPE, implemented in the open-source Python package tiktoken. Key differences from vanilla BPE:
- Regex preprocessing: Splits on whitespace and punctuation before BPE, improving space handling
- Special tokens: Reserved tokens like <|im_start|>, <|im_end|> for chat formats
- Larger vocab: GPT-4 has ~100k tokens (larger than GPT-2's ~50k)
# Using tiktoken (Python)
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Hello, world!")
print(tokens) # [9906, 11, 1917, 0]
print(enc.decode(tokens)) # "Hello, world!"
# Count tokens in a string
def count_tokens(text, model="gpt-4"):
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

print(count_tokens("The quick brown fox"))  # 4 with cl100k_base
print(count_tokens("Supercalifragilisticexpialidocious"))  # splits into several subword tokens
BPE Advantage: BPE is simple, fast, and effective. It's learned from data, so it adapts to any language. It always finds a tokenization (lossless). The downside: token boundaries are learned, not linguistic, so occasional quirks emerge.
SECTION 03
SentencePiece & WordPiece
While BPE dominates for large models, alternatives exist with different tradeoffs:
SentencePiece (Google)
SentencePiece is a language-agnostic tokenization framework used by many non-English and multilingual models (T5, mBERT, LLaMA). Key features:
- Space preservation: Treats spaces as explicit tokens (▁), making decoding lossless and readable
- BPE or Unigram: Can use BPE or Unigram LM as the merge algorithm
- Multilingual: No language-specific preprocessing; operates on raw text (optionally with byte fallback)
- Reversible: Can always recover original text from tokens
SentencePiece tokens (excerpt):
<s>: sentence start
</s>: sentence end
▁: space marker (the "meta symbol", U+2581)
e, l, o, h, w, r, d: common characters
▁The, ▁quick, ▁brown: subword units
Text: "hello world"
Tokens: [h, e, l, l, o, ▁, w, o, r, l, d]
Or with merging learned:
Tokens: [hello, ▁world]
Decoding: hello + ▁ + world = "hello world"
Space is explicit, so no ambiguity.
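Because the ▁ symbol records exactly where spaces were, detokenization is a pure string operation. A minimal sketch (simplified: real SentencePiece strips only the single dummy prefix space it adds during encoding):

```python
def sp_decode(tokens):
    """Lossless SentencePiece-style detokenization: the meta symbol
    (U+2581) marks where a space appeared in the original text."""
    text = "".join(tokens).replace("\u2581", " ")
    return text.lstrip(" ")  # drop the artificial leading space, if any

print(sp_decode(["hello", "\u2581world"]))       # hello world
print(sp_decode(["\u2581The", "\u2581quick"]))   # The quick
```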
WordPiece (Google/BERT)
WordPiece is similar to BPE but uses a different merge strategy. It is common in BERT and other encoder models:
- Greedy longest-match: At each position, match the longest token in vocab (vs BPE's frequency-based merging)
- [UNK] fallback: If a character sequence can't be tokenized, emit the unknown token [UNK]
- Subword prefix: Uses ## prefix for non-initial subwords (e.g., "un" + "##breakable")
WordPiece tokenization of "unbreakable":
Step 1: Check "unbreakable" → not in vocab
Step 2: Check "unbreak" → not in vocab
Step 3: Check "unbr" → not in vocab
Step 4: Check "un" → found! Use it
Remaining: "breakable" (a continuation, so candidates carry the ## prefix)
Step 5: Check "##breakable" → not in vocab
Step 6: Check "##break" → found! Use it
Remaining: "able"
Step 7: Check "##able" → found! Use it
Tokens: ["un", "##break", "##able"]
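The longest-match procedure above is straightforward to implement. A sketch with a toy vocabulary (hypothetical, for illustration):

```python
def wordpiece_tokenize(word: str, vocab: set) -> list:
    """Greedy longest-match WordPiece: at each position take the longest
    vocab entry; non-initial pieces carry the ## prefix."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece      # continuation pieces are prefixed
            if piece in vocab:
                cur = piece
                break
            end -= 1                      # try a shorter match
        if cur is None:
            return ["[UNK]"]              # fallback when nothing matches
        tokens.append(cur)
        start = end
    return tokens

vocab = {"un", "##break", "##able", "break", "able"}
print(wordpiece_tokenize("unbreakable", vocab))  # ['un', '##break', '##able']
```

Note that "break" and "able" are in the vocab but are never used mid-word: only the ##-prefixed forms match after the first piece.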
Comparison
| Aspect |
BPE |
SentencePiece |
WordPiece |
| Algorithm |
Frequency-based merging |
BPE or Unigram LM |
Greedy longest-match |
| Language-specific? |
Requires preprocessing |
Language-agnostic |
Requires preprocessing |
| Space handling |
Via preprocessing |
Explicit ▁ token |
Explicit space token |
| Fallback |
Lossless (always works) |
Lossless |
on failure |
| Models |
GPT-2, GPT-4, Llama |
T5, mBERT, Llama 2 |
BERT, ALBERT |
SECTION 04
Tokenizer Quirks
Tokenization is not always intuitive. Here are common surprises:
1. Unicode Fragmentation
Non-ASCII characters often tokenize into multiple tokens. A single emoji or accented character can be 2-4 tokens:
Text: "café"
Tokens: [caf, é] or [c, a, f, é], depending on the vocab
The accented "é" is two UTF-8 bytes, so it can cost 1-2 tokens by itself.
Text: "naïve"
Tokens: [na, ï, ve] or [n, a, ï, v, e]
Depends on training data.
Text: "你好" (Chinese "hello")
Tokens: [你, 好], or several byte-level tokens per character
Depends on vocab size and training data.
Cost implications: Non-ASCII text is more expensive!
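One reason for the extra cost: byte-level BPE operates on UTF-8 bytes, and non-ASCII characters take 2-4 bytes each. Counting bytes gives a quick sense of the worst case (one token per byte):

```python
def utf8_bytes(text: str) -> int:
    """Number of UTF-8 bytes; for a byte-level BPE this bounds how
    badly a string can fragment (worst case: one token per byte)."""
    return len(text.encode("utf-8"))

for s in ["cafe", "café", "你好"]:
    print(s, len(s), "chars,", utf8_bytes(s), "bytes")
# "café" is 4 characters but 5 bytes; "你好" is 2 characters but 6 bytes.
```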
2. Whitespace Sensitivity
Leading/trailing spaces or multiple spaces tokenize differently:
Text: "hello" → tokens: [hello]
Text: " hello" → tokens: [space, hello]
Text: " hello" → tokens: [space, space, hello] or [2-space-token, hello]
Cost: Same word with vs without leading space costs differently.
3. Number Tokenization
Numbers are tokenized digit-by-digit or as number words:
Text: "12345"
Tokens: [12345] or [123, 45] or [1, 2, 3, 4, 5]
GPT-4 usually: [123, 45] or similar splits
Implications:
- Math problems with large numbers are token-expensive
- Model may struggle with out-of-distribution numbers
- Arithmetic over 3-4 digit numbers is harder for models
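Newer GPT tokenizers pre-split digit runs into groups of at most three before BPE runs. A loose approximation of that behavior (the real pre-tokenization regex is more involved):

```python
import re

def split_digits(number: str) -> list:
    """Split a digit string into left-to-right groups of up to three,
    loosely mimicking the digit grouping in newer GPT pre-tokenizers."""
    return re.findall(r"\d{1,3}", number)

print(split_digits("12345"))    # ['123', '45']
print(split_digits("1234567"))  # ['123', '456', '7']
```

This grouping caps how long any single numeric token can be, which is one reason long numbers are token-expensive.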
4. Case Sensitivity
Uncased BERT variants (e.g., bert-base-uncased) lowercase all input, so they are case-insensitive. GPT models are case-sensitive:
BERT (uncased): "HELLO" and "hello" → same tokens
GPT-4: "HELLO" → [HELLO] or [H, ELL, O]
"hello" → [hello]
Cost: All-caps text may tokenize differently.
5. Punctuation Attachment
Punctuation is sometimes attached to words, sometimes separate:
Text: "hello, world"
Tokens: [hello, ",", " world"] or [hello, ",", world]
Depends on training data and vocab.
Text: "don't"
Tokens: [don, ', t] or [don't] or [do, n't]
Practical Tip: Always count tokens for your specific text and model before assuming cost. Use tiktoken or the model's tokenizer directly. A 1000-character prompt might be 150 tokens or 600 tokens depending on language and punctuation.
SECTION 05
Tokenizer Arithmetic
Estimating and managing token count is a practical necessity when building LLM applications.
Counting Tokens with tiktoken
import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Count tokens in text for a given model."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def estimate_api_cost(
    text: str,
    model: str = "gpt-4-turbo",
    input_cost_per_1k: float = 0.01,
    output_cost_per_1k: float = 0.03,
) -> dict:
    """Estimate API cost for a prompt."""
    input_tokens = count_tokens(text, model)
    # Estimate output (rough: 30% of input, capped at 2000)
    output_estimate = min(int(input_tokens * 0.3), 2000)
    input_cost = (input_tokens / 1000) * input_cost_per_1k
    output_cost = (output_estimate / 1000) * output_cost_per_1k
    return {
        "input_tokens": input_tokens,
        "output_estimate": output_estimate,
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": input_cost + output_cost,
    }

# Example
text = "Explain quantum computing in 500 words."
result = estimate_api_cost(text)
print(f"Tokens: {result['input_tokens']}")
print(f"Est. cost: ${result['total_cost']:.4f}")

# Context window management
def truncate_to_context(
    messages: list[dict],
    max_tokens: int = 128000,
    model: str = "gpt-4",
) -> list[dict]:
    """Remove oldest messages if the context budget is exceeded."""
    enc = tiktoken.encoding_for_model(model)
    total = 0
    kept = []
    # Count tokens, keeping the newest messages first
    for msg in reversed(messages):
        msg_tokens = len(enc.encode(msg["content"]))
        if total + msg_tokens <= max_tokens:
            kept.append(msg)
            total += msg_tokens
        else:
            break
    return list(reversed(kept))  # Restore chronological order
Rule-of-Thumb Estimates
For quick estimation without calling the API:
- English: ~4 characters per token (divide text length by 4)
- Code: ~3 characters per token (denser, more punctuation)
- Non-ASCII: ~2 characters per token (unicode overhead)
- Math/symbols: ~2 characters per token
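These rules of thumb are easy to package as a quick offline estimator. The ratios below are the ones listed above, not measured values; always use the real tokenizer for billing:

```python
# Rough characters-per-token ratios from the rules of thumb above
CHARS_PER_TOKEN = {"english": 4, "code": 3, "non_ascii": 2, "math": 2}

def rough_token_estimate(text: str, kind: str = "english") -> int:
    """Quick offline token estimate; no tokenizer or API call needed."""
    return max(1, round(len(text) / CHARS_PER_TOKEN[kind]))

print(rough_token_estimate("x" * 400))          # ~100 tokens as English
print(rough_token_estimate("x" * 400, "code"))  # ~133 tokens as code
```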
Cost Optimization Strategies
- Summarize context: Instead of including full documents, summarize them (shorter, cheaper)
- Use smaller models: For simple tasks, use GPT-3.5 instead of GPT-4 (cheaper per token, faster)
- Batch requests: Process many requests in one API call if possible
- Cache results: Store responses to avoid re-processing identical prompts
- Compress input: Remove unnecessary whitespace, combine similar requests
Example Savings: Summarizing a 5000-token document to 500 tokens saves ~90% on input cost. If you process this 1000 times/month, that's substantial savings.
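A quick sanity check of that claim; the $0.01 per 1k input tokens used here is an assumed example price, not a quoted rate:

```python
def summarization_savings(orig_tokens, summary_tokens, runs_per_month,
                          cost_per_1k=0.01):
    """Monthly input-cost savings from summarizing context up front."""
    before = orig_tokens / 1000 * cost_per_1k * runs_per_month
    after = summary_tokens / 1000 * cost_per_1k * runs_per_month
    return before - after, 1 - after / before

saved, pct = summarization_savings(5000, 500, 1000)
print(f"${saved:.2f} saved/month ({pct:.0%})")  # $45.00 saved/month (90%)
```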
SECTION 06
Multilingual Tokenization
English tokenizers are optimized for English but work (inefficiently) on other languages. This creates a fairness issue: non-English users pay more.
Token Efficiency by Language
Tokenization efficiency varies by language. The table below shows approximate characters per token and the resulting cost multiplier ("fertility ratio") relative to English:
| Language | Chars/Token | Fertility Ratio | Notes |
| English | ~4 | 1.0 (baseline) | Optimized for GPT tokenizers |
| Spanish/French | ~3.5 | 1.1x | Slightly less efficient |
| German | ~3.2 | 1.25x | Longer words |
| Japanese | ~1.5 | 2.7x | Each character is a token |
| Chinese | ~1.3 | 3.1x | Each character is a token |
| Arabic | ~2 | 2.0x | Diacritics, ligatures |
Why the Difference?
English GPT tokenizers learn "the", "and", "ing" as single tokens. Chinese has no space-delimited word boundaries, so each character is usually its own token, and Chinese text costs roughly 3x more tokens per character than English.
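The fertility ratios in the table follow directly from characters per token: the relative cost is just the English baseline divided by the language's ratio.

```python
def relative_cost(chars_per_token: float, baseline: float = 4.0) -> float:
    """Cost multiplier vs. English, given average characters per token.
    Fewer characters per token means more tokens for the same text."""
    return baseline / chars_per_token

print(f"{relative_cost(1.3):.1f}x")  # Chinese: 3.1x
print(f"{relative_cost(1.5):.1f}x")  # Japanese: 2.7x
```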
Implications for Multilingual Models
Multilingual models (like mBERT or mT5) use language-agnostic tokenizers (usually SentencePiece) that are fairer across languages, but less efficient than language-specific tokenizers. The tradeoff:
Language-specific (GPT):
- English: Very efficient
- Japanese: Inefficient (no Japanese-specific vocab)
Language-agnostic (SentencePiece):
- English: Moderate (slightly less efficient than BPE)
- Japanese: Better (learned Japanese subwords)
Conclusion: Multilingual models often have lower average efficiency but more equitable cost across languages.
Handling Mixed-Language Text
If you're mixing languages (code-switching), be aware:
- Switching between English and Japanese in a single sentence is expensive
- The tokenizer can't efficiently handle both alphabets
- Each language fragment may require extra tokens
Fairness Concern: Non-English speakers pay 2-3x more per character for the same API, due to tokenization inefficiency. This is a known issue in the AI industry. Language-agnostic tokenizers and multilingual models partially address this, but the problem persists.
SECTION 07
Special Tokens
Beyond regular tokens, models use special tokens for control, structure, and safety. These are reserved tokens with predefined meanings.
Common Special Tokens
- <BOS>: Beginning of Sequence (marks start of generation)
- <EOS>: End of Sequence (marks end)
- <PAD>: Padding (for batch processing, fills unused positions)
- <UNK>: Unknown token (fallback for OOV words)
- <MASK>: Masked token (used in masked language modeling)
Chat Format Special Tokens
ChatML format (OpenAI GPT models; also adopted by many open models):
<|im_start|>user
What is tokenization?
<|im_end|>
<|im_start|>assistant
Tokenization is...
<|im_end|>
Each <|im_start|> and <|im_end|> is a special token.
These tokens are reserved; they are not part of the regular learned vocabulary.
Llama 3 chat format:
<|start_header_id|>user<|end_header_id|>
What is tokenization?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
Tokenization is...<|eot_id|>
Similar structure with different tokens.
Adding Custom Tokens
If you're fine-tuning a model, you may add new special tokens:
# Using Hugging Face Transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Original size (len() includes added tokens; tokenizer.vocab_size does not)
print(f"Original: {len(tokenizer)}")  # 50257

# Add custom tokens (names here are illustrative)
new_tokens = ["<|code|>"]
tokenizer.add_tokens(new_tokens)
print(f"After adding: {len(tokenizer)}")  # 50258

# Save updated tokenizer
tokenizer.save_pretrained("./custom_tokenizer")

# When fine-tuning, resize model embeddings:
model.resize_token_embeddings(len(tokenizer))
# This adds new rows to the token embedding matrix
Token Efficiency in Chat Protocols
Special tokens add overhead. A typical chat message costs extra tokens for structure:
Message: "Hello"
Raw tokens: [Hello] = 1 token
With chat format: [<|im_start|>, user, \n, Hello, \n, <|im_end|>]
= 6-7 tokens
Cost: 6-7x overhead for a single-token message!
Optimization: Batch messages, minimize chat wrapper overhead.
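A tiny illustration of where the overhead comes from; chatml_wrap is a hypothetical helper (real chat APIs apply the template server-side):

```python
def chatml_wrap(role: str, content: str) -> str:
    """Wrap a message in ChatML-style delimiters. Each delimiter, the
    role name, and the newlines all cost tokens on top of the content."""
    return f"<|im_start|>{role}\n{content}\n<|im_end|>\n"

wrapped = chatml_wrap("user", "Hello")
print(wrapped)
# A one-token message carries several tokens of wrapper, so short
# messages have proportionally large structural overhead.
```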
Reserved Tokens & Safety
Some models reserve token ranges for safety. For example, certain token IDs might trigger refusals. This is not well-documented for commercial models, but is sometimes used in research:
- Token IDs in range [50000, 50100] might be reserved for internal use
- Trying to force-decode these tokens may cause errors or unexpected behavior
- Fine-tuning should avoid modifying reserved ranges
Best Practice: Understand your model's special tokens. Document custom tokens. When fine-tuning, resize embeddings properly. Monitor that token counts match expectations (off-by-one errors are common).