Foundations

Tokenization

How language models break down text into discrete units called tokens, and why this matters for cost, performance, and behavior.

BPE
Dominant Algorithm
~4 chars
Per Token (English)
100k
GPT-4 Vocab Size


SECTION 01

Why Tokenization Matters

Large language models don't read text character-by-character. Instead, they operate on sequences of discrete units called tokens. A token is typically 2-4 characters for English, but can range from a single byte to entire words.

Tokenization sits at the boundary between the user (who thinks in characters and words) and the model (which thinks in token IDs). Understanding tokenization is crucial because it affects three key areas:

1. Cost & Billing — API pricing is per token, not per character. The same prompt can have different token counts depending on the tokenizer. A 1000-character prompt might be 200 tokens (common English) or 500 tokens (rare symbols, mixed languages). That's a 2.5x cost difference.

2. Context Window — Model context length is measured in tokens, not characters. GPT-4 has a 128k token context window, but in practice you get fewer characters because each token is ~4 characters on average. A 128k token window is roughly 500k characters of English text—substantial, but not infinite.

3. Behavior & Safety — Tokenization affects how the model processes text. Unusual tokenization (e.g., token splitting that separates semantically related concepts) can cause unexpected behavior, jailbreaks, or unsafe outputs. For example, if a safety word is tokenized across multiple tokens, some safety fine-tuning might not apply.
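The cost and context arithmetic from points 1 and 2 can be sketched in a few lines. The prices below are hypothetical placeholders, and the chars-per-token figures are rough averages, not actual tokenizer output:

```python
# Rough token/cost arithmetic using chars-per-token averages.
# Prices are hypothetical placeholders, not current API rates.

CHARS_PER_TOKEN = 4.0          # English average; much lower for CJK text
PRICE_PER_1K_TOKENS = 0.01     # hypothetical input price in USD

def estimate_tokens(n_chars: int, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Estimate token count from character count."""
    return round(n_chars / chars_per_token)

def estimate_cost(n_tokens: int, price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Estimate cost in USD from a token count."""
    return n_tokens / 1000 * price_per_1k

# The same 1000-character prompt at two different efficiencies:
common_english = estimate_tokens(1000, chars_per_token=5.0)  # ~200 tokens
mixed_symbols = estimate_tokens(1000, chars_per_token=2.0)   # ~500 tokens
print(common_english, mixed_symbols)                          # 200 500
print(f"cost ratio: {mixed_symbols / common_english:.1f}x")   # 2.5x
```

Real billing projections should use the model's actual tokenizer; this heuristic only bounds the estimate.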

Historically, models operated on character or word vocabularies, which was inefficient. Character-level models needed thousands of steps for short documents. Word-level models couldn't handle out-of-vocabulary words (OOV). BPE solved this by learning a vocabulary of subword units that balances efficiency and expressiveness.

Key Insight: Tokenization is lossy compression. We map human-readable text to a compact sequence of integers. The tokenizer is deterministic, but not all information is preserved—sometimes important semantic or syntactic information gets scattered across token boundaries.
SECTION 02

Byte-Pair Encoding (BPE)

Byte-Pair Encoding is the dominant tokenization algorithm, used by GPT-2, GPT-3, GPT-4, and many other models. BPE builds a vocabulary by iteratively merging the most frequent byte pairs in a corpus.

How BPE Works

Starting with raw bytes (256 basic tokens representing 0-255), BPE repeatedly finds the most common adjacent pair and merges them into a new token:

Iteration 1:
  Vocabulary = [0-255 bytes]
  Corpus: "hello world hello"
  In bytes: [h=104, e=101, l=108, l=108, o=111, ...]
  Count pairs:
    "he" → 2 occurrences
    "el" → 2 occurrences
    "ll" → 2 occurrences
    "lo" → 2 occurrences
    "o " → 1 occurrence
    ...
  Most frequent: "he" (or any 2-occurrence pair)
  Action: Add new token 256 → "he"
  Update corpus: [256, l=108, l=108, o=111, ...]

Iteration 2:
  Find next most frequent pair: "ll" → 2 occurrences
  Action: Add token 257 → "ll"
  ...

After ~50,000 iterations: Vocabulary ≈ 50k tokens
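The training loop above can be sketched in a few dozen lines of Python. This is a toy illustration of the algorithm, not tiktoken's actual implementation (which adds regex pre-splitting and heavy optimization):

```python
# Minimal BPE training sketch over a toy corpus (illustrative only).
from collections import Counter

def most_frequent_pair(ids: list[int]) -> tuple[int, int]:
    """Return the most common adjacent token pair."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text: str, num_merges: int) -> dict[tuple[int, int], int]:
    """Learn `num_merges` merge rules starting from raw bytes (0-255)."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, 256 + num_merges):
        pair = most_frequent_pair(ids)
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges

merges = train_bpe("hello world hello", num_merges=2)
print(merges)  # first merge is (104, 101), i.e. "he" -> token 256
```

Note that ties (several pairs with the same count) are broken arbitrarily here; real implementations fix a deterministic tie-break rule.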


BPE Tokenization Process

Given a trained vocab, tokenizing text is straightforward:

Vocabulary (excerpt):
  256: "he"
  257: "ll"
  258: "hello"
  259: "world"
  ...

Text: "hello world"

Step 1: Encode as bytes: [h=104, e=101, l=108, l=108, o=111, ...]
Step 2: Greedily merge from left to right:
  - h=104, e=101 → found "he" (256) ✓
  - l=108, l=108 → found "ll" (257) ✓
  - o=111, space=32 → not in vocab
  - space=32, w=119 → not in vocab
  ...

Result: [256 (he), 257 (ll), 111 (o), ...]

Note: Greedy merging (left-to-right) is simple but not optimal — here it
misses the single-token "hello" (258).
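An alternative to the left-to-right scan, and closer to how GPT-style BPE actually encodes, is to replay the learned merges in training order. A minimal sketch, assuming a merge table like the one trained above:

```python
def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Encode text by applying learned merges in training order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

# Toy merge table from the example: "he" -> 256, "ll" -> 257
merges = {(104, 101): 256, (108, 108): 257}
print(encode("hello world", merges))
# [256, 257, 111, 32, 119, 111, 114, 108, 100]
```

Because merges replay in the order they were learned, the result is deterministic for a given merge table.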

GPT Tokenizers (tiktoken)

OpenAI's GPT models use a byte-level variant of BPE, implemented in the open-source tiktoken Python library. Key differences from vanilla BPE: text is pre-split by a regex (on spaces, punctuation, and contractions) before merging, and the base vocabulary covers all 256 bytes, so encoding never fails:

# Using tiktoken (Python)
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Hello, world!")
print(tokens)              # [9906, 11, 1917, 0]
print(enc.decode(tokens))  # "Hello, world!"

# Count tokens in a string
def count_tokens(text, model="gpt-4"):
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

print(count_tokens("The quick brown fox"))                 # 4
print(count_tokens("Supercalifragilisticexpialidocious"))  # a few subword tokens
BPE Advantage: BPE is simple, fast, and effective. It's learned from data, so it adapts to any language. It always finds a tokenization (lossless). The downside: token boundaries are learned, not linguistic, so occasional quirks emerge.
SECTION 03

SentencePiece & WordPiece

While BPE dominates for large models, alternatives exist with different tradeoffs:

SentencePiece (Google)

SentencePiece is a language-agnostic tokenization framework used by many non-English and multilingual models (T5, mT5, LLaMA). Key features:

SentencePiece tokens (excerpt): : sentence start : sentence end ▁: space (called "underbar") e, l, o, h, w, r, d: common characters ▁The, ▁quick, ▁brown: subword units Text: "hello world" Tokens: [h, e, l, l, o, ▁, w, o, r, l, d] Or with merging learned: Tokens: [hello, ▁world] Decoding: hello + ▁ + world = "hello world" Space is explicit, so no ambiguity.

WordPiece (Google/BERT)

WordPiece is similar to BPE but uses a different merge strategy during training and greedy longest-match-first segmentation at inference. It is common in BERT and other encoder models:

WordPiece tokenization of "unbreakable":

Step 1: Check "unbreakable" → not in vocab
Step 2: Check "unbreak"     → not in vocab
Step 3: Check "unbr"        → not in vocab
Step 4: Check "un"          → found! Use it
        Remaining: "breakable"
Step 5: Check "##breakable" → not in vocab
Step 6: Check "##break"     → found! Use it
        Remaining: "able"
Step 7: Check "##able"      → found! Use it

Tokens: ["un", "##break", "##able"]
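The longest-match-first procedure above is easy to sketch. This is an illustrative implementation of the segmentation step only, not BERT's full tokenizer (which also lowercases, splits punctuation, and handles Unicode normalization):

```python
def wordpiece_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first WordPiece segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # WordPiece falls back to [UNK], unlike byte-level BPE
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##break", "##able", "break", "able"}
print(wordpiece_tokenize("unbreakable", vocab))  # ['un', '##break', '##able']
```

The `[UNK]` fallback is the key behavioral difference from byte-level BPE, which can always encode any input.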

Comparison

Aspect             | BPE                     | SentencePiece       | WordPiece
-------------------|-------------------------|---------------------|---------------------------
Algorithm          | Frequency-based merging | BPE or Unigram LM   | Greedy longest-match
Language-specific? | Requires preprocessing  | Language-agnostic   | Requires preprocessing
Space handling     | Via preprocessing       | Explicit ▁ token    | Via preprocessing (## marks continuations)
Fallback           | Lossless (byte-level)   | Byte/char fallback  | [UNK] on failure
Models             | GPT-2, GPT-3, GPT-4     | T5, mT5, Llama 2    | BERT, DistilBERT
SECTION 04

Tokenizer Quirks

Tokenization is not always intuitive. Here are common surprises:

1. Unicode Fragmentation

Non-ASCII characters often tokenize into multiple tokens. A single emoji or accented character can be 2-4 tokens:

Text: "café"
Tokens (illustrative): [c, a, f, é] → 4 tokens for 4 characters
(é is 2 bytes in UTF-8, so it alone may cost 2 byte-level tokens)

Text: "naïve"
Tokens: [na, ï, ve] or [n, a, ï, v, e] — depends on training data

Text: "你好" (Chinese "hello")
Tokens: [你, 好], or more if each character splits into multiple byte tokens
— depends on vocab size and training data

Cost implication: non-ASCII text is often more expensive!
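Because byte-level tokenizers operate on UTF-8 bytes, a character's byte length is its worst-case token cost (before any merges apply). A quick stdlib check:

```python
# Byte-level BPE sees UTF-8 bytes, so a character's byte length is an upper
# bound on how many tokens it can cost before merges apply.
for text in ["cafe", "café", "你好", "👍"]:
    print(f"{text!r}: {len(text)} chars, {len(text.encode('utf-8'))} UTF-8 bytes")

# 'cafe': 4 chars, 4 bytes  -- ASCII: 1 byte per char
# 'café': 4 chars, 5 bytes  -- é is 2 bytes
# '你好': 2 chars, 6 bytes   -- CJK: 3 bytes per char
# '👍': 1 chars, 4 bytes     -- emoji: 4 bytes
```

Merges learned during training usually pull these bytes back together, but rare characters may stay fragmented.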

2. Whitespace Sensitivity

Leading/trailing spaces or multiple spaces tokenize differently:

Text: "hello"    → tokens: [hello]
Text: " hello"   → tokens: [space, hello] or a single space-prefixed token
Text: "  hello"  → tokens: [space, space, hello] or [2-space-token, hello]

Cost: the same word with vs. without a leading space can cost differently.

3. Number Tokenization

Numbers are tokenized digit-by-digit or as number words:

Text: "12345"
Tokens: [12345] or [123, 45] or [1, 2, 3, 4, 5]
GPT-4 usually: [123, 45] or similar splits

Implications:
- Math problems with large numbers are token-expensive
- The model may struggle with out-of-distribution numbers
- Arithmetic over 3-4 digit numbers is harder for models
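Recent GPT tokenizers cap digit runs at three digits per token. The real splitter is a regex inside tiktoken; this is a simplified illustration of the chunking behavior, not the actual rule:

```python
import re

def chunk_digits(number: str, size: int = 3) -> list[str]:
    """Split a digit string into left-to-right chunks of at most `size`
    digits, roughly mimicking cl100k-style digit-run capping."""
    return re.findall(rf"\d{{1,{size}}}", number)

print(chunk_digits("12345"))    # ['123', '45']
print(chunk_digits("1234567"))  # ['123', '456', '7']
```

One consequence: digit alignment shifts as numbers grow, which is part of why multi-digit arithmetic is hard for models.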

4. Case Sensitivity

BERT's uncased variants lowercase all input, making them case-insensitive. GPT models are case-sensitive:

BERT (uncased): "HELLO" and "hello" → same tokens
GPT-4:          "HELLO" → [HELLO] or [H, ELL, O]
                "hello" → [hello]

Cost: all-caps text may tokenize into more tokens.

5. Punctuation Attachment

Punctuation is sometimes attached to words, sometimes separate:

Text: "hello, world"
Tokens: [hello, ",", world] or [hello, ", world"]
Depends on training data and vocab.

Text: "don't"
Tokens: [don, ', t] or [don't] or [do, n't]
Practical Tip: Always count tokens for your specific text and model before assuming cost. Use tiktoken or the model's tokenizer directly. A 1000-character prompt might be 150 tokens or 600 tokens depending on language and punctuation.
SECTION 05

Tokenizer Arithmetic

Estimating and managing token count is a practical necessity when building LLM applications.

Counting Tokens with tiktoken

import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Count tokens in text for a given model."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def estimate_api_cost(
    text: str,
    model: str = "gpt-4-turbo",
    input_cost_per_1k: float = 0.01,
    output_cost_per_1k: float = 0.03,
) -> dict:
    """Estimate API cost for a prompt."""
    input_tokens = count_tokens(text, model)
    # Estimate output (rough: 30% of input, capped at 2000)
    output_estimate = min(int(input_tokens * 0.3), 2000)
    input_cost = (input_tokens / 1000) * input_cost_per_1k
    output_cost = (output_estimate / 1000) * output_cost_per_1k
    return {
        "input_tokens": input_tokens,
        "output_estimate": output_estimate,
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": input_cost + output_cost,
    }

# Example
text = "Explain quantum computing in 500 words."
result = estimate_api_cost(text)
print(f"Tokens: {result['input_tokens']}")
print(f"Est. cost: ${result['total_cost']:.4f}")

# Context window management
def truncate_to_context(
    messages: list[dict],
    max_tokens: int = 128000,
    model: str = "gpt-4",
) -> list[dict]:
    """Remove oldest messages if the context limit is exceeded."""
    enc = tiktoken.encoding_for_model(model)
    total = 0
    kept = []
    # Count tokens, working backwards from the newest message
    for msg in reversed(messages):
        msg_tokens = len(enc.encode(msg["content"]))
        if total + msg_tokens <= max_tokens:
            kept.append(msg)
            total += msg_tokens
        else:
            break
    return list(reversed(kept))  # Restore chronological order

Rule-of-Thumb Estimates

For quick estimation without calling the API, use the standard rule of thumb: 1 token ≈ 4 characters ≈ 0.75 words of English text.
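A rough estimator combining the character and word heuristics might look like this (a sketch, not a substitute for the real tokenizer):

```python
def quick_estimate(text: str) -> int:
    """Rule-of-thumb token estimate: the larger of chars/4 and words/0.75.
    A heuristic only -- use the model's real tokenizer for billing."""
    by_chars = len(text) / 4
    by_words = len(text.split()) / 0.75
    return round(max(by_chars, by_words))

print(quick_estimate("The quick brown fox jumps over the lazy dog"))  # 12
```

Taking the max of the two estimates makes the heuristic conservative, which is usually what you want for budgeting.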

Cost Optimization Strategies

Example Savings: Summarizing a 5000-token document to 500 tokens saves ~90% on input cost. If you process this 1000 times/month, that's substantial savings.
SECTION 06

Multilingual Tokenization

English tokenizers are optimized for English but work (inefficiently) on other languages. This creates a fairness issue: non-English users pay more.

Token Efficiency by Language

Different languages have different token "fertility"—how many characters, on average, each token covers:

Language       | Chars/Token | Fertility Ratio | Notes
---------------|-------------|-----------------|------------------------------
English        | ~4          | 1.0 (baseline)  | Optimized for GPT tokenizers
Spanish/French | ~3.5        | 1.1x            | Slightly less efficient
German         | ~3.2        | 1.25x           | Longer words
Japanese       | ~1.5        | 2.7x            | Each character is often a token
Chinese        | ~1.3        | 3.1x            | Each character is often a token
Arabic         | ~2          | 2.0x            | Diacritics, ligatures

Why the Difference?

English GPT tokenizers learn "the", "and", "ing" as single tokens. Chinese doesn't have word boundaries, so each character is usually its own token. A 100-character Chinese sentence costs ~3x more than the equivalent English sentence.
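The per-language cost gap can be estimated with the chars-per-token figures from the table above. These are approximate averages, not exact tokenizer behavior:

```python
# Illustrative per-language token estimate using approximate averages
# from the fertility table (not exact tokenizer behavior).
CHARS_PER_TOKEN = {
    "english": 4.0, "spanish": 3.5, "german": 3.2,
    "japanese": 1.5, "chinese": 1.3, "arabic": 2.0,
}

def estimate_tokens(n_chars: int, language: str) -> int:
    """Estimate tokens for n_chars of text in the given language."""
    return round(n_chars / CHARS_PER_TOKEN[language])

en = estimate_tokens(100, "english")  # 25
zh = estimate_tokens(100, "chinese")  # 77
print(f"Chinese costs ~{zh / en:.1f}x more per character")  # ~3.1x
```

A table like this, built from a sample of your own traffic, is more reliable than any published averages.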

Implications for Multilingual Models

Multilingual models (like mT5 or XLM-R) use language-agnostic tokenizers (usually SentencePiece) that are fairer across languages, but less efficient than language-specific tokenizers. The tradeoff:

Language-specific (GPT):
  - English: very efficient
  - Japanese: inefficient (little Japanese-specific vocab)

Language-agnostic (SentencePiece, multilingual corpus):
  - English: moderate (slightly less efficient than English-optimized BPE)
  - Japanese: better (learned Japanese subwords)

Conclusion: multilingual models often have lower average efficiency
but more equitable cost across languages.

Handling Mixed-Language Text

If you're mixing languages (code-switching), token counts become hard to predict: the tokenizer shifts between efficient and inefficient regimes mid-text, so measure with the actual tokenizer rather than relying on per-language averages.

Fairness Concern: Non-English speakers pay 2-3x more per character for the same API, due to tokenization inefficiency. This is a known issue in the AI industry. Language-agnostic tokenizers and multilingual models partially address this, but the problem persists.
SECTION 07

Special Tokens

Beyond regular tokens, models use special tokens for control, structure, and safety. These are reserved tokens with predefined meanings.

Common Special Tokens

Chat Format Special Tokens (ChatML, Llama)

ChatML format (used by OpenAI GPT models):

<|im_start|>user
What is tokenization?
<|im_end|>
<|im_start|>assistant
Tokenization is...
<|im_end|>

Each <|im_start|> and <|im_end|> is a single special token,
reserved outside the regular vocabulary.

Llama 3 chat format:

<|start_header_id|>user<|end_header_id|>
What is tokenization?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
Tokenization is...

Similar structure with different tokens.

Adding Custom Tokens

If you're fine-tuning a model, you may add new special tokens:

# Using Hugging Face Transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Original vocab size (len() includes added tokens; .vocab_size does not)
print(f"Original: {len(tokenizer)}")  # 50257

# Add custom tokens (names here are illustrative)
new_tokens = ["<|user|>", "<|assistant|>", "<|code|>"]
tokenizer.add_tokens(new_tokens)
print(f"After adding: {len(tokenizer)}")  # 50260

# Save updated tokenizer
tokenizer.save_pretrained("./custom_tokenizer")

# When fine-tuning, resize model embeddings:
# model.resize_token_embeddings(len(tokenizer))
# This adds new rows to the token embedding matrix

Token Efficiency in Chat Protocols

Special tokens add overhead. A typical chat message costs extra tokens for structure:

Message: "Hello"
Raw tokens: [Hello] = 1 token

With chat format:
[<|im_start|>, user, \n, Hello, \n, <|im_end|>] = 6-7 tokens

Cost: 6-7x overhead for a single-token message!
Optimization: batch messages, minimize chat wrapper overhead.
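The overhead arithmetic is worth internalizing: the wrapper is a fixed per-message cost that amortizes over longer content. A sketch, assuming a 6-token wrapper (the real value varies by chat format):

```python
def chat_overhead(content_tokens: int, wrapper_tokens: int = 6) -> float:
    """Ratio of total to content tokens for one chat-formatted message.
    `wrapper_tokens` is an assumed per-message framing cost."""
    return (content_tokens + wrapper_tokens) / content_tokens

print(f"{chat_overhead(1):.0f}x")    # 7x for a one-token message
print(f"{chat_overhead(100):.2f}x")  # 1.06x -- overhead amortizes
```

This is why batching several short requests into one message saves tokens: you pay the wrapper cost once instead of per message.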

Reserved Tokens & Safety

Some models reserve token ranges for safety. For example, certain token IDs might trigger refusals. This is not well-documented for commercial models, but is sometimes used in research.

Best Practice: Understand your model's special tokens. Document custom tokens. When fine-tuning, resize embeddings properly. Monitor that token counts match expectations (off-by-one errors are common).
SECTION 08

Tokenizer Selection Guide

When building a system that uses multiple models, tokenizer incompatibilities are a common source of subtle bugs. Tokenizers are model-specific: a string tokenised by GPT-4 produces different tokens — and a different count — than the same string tokenised by Claude or Llama. Never use one model's tokenizer to estimate token counts for a different model; always use the exact tokenizer shipped with the model.

For cost estimation, the most common error is applying a global "4 characters per token" rule of thumb to non-English text. Asian languages (Chinese, Japanese, Korean) average 1–2 characters per token; code with long identifiers may average 5–6 characters per token. Use the actual tokenizer for accurate billing projections, and build a per-language token-to-character ratio table from a representative sample of your real traffic.

# Count tokens accurately with tiktoken (OpenAI models) or transformers tokenizers
import tiktoken
from transformers import AutoTokenizer

def count_tokens_openai(text: str, model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def count_tokens_hf(text: str, model_id: str = "meta-llama/Llama-3.1-8B") -> int:
    tok = AutoTokenizer.from_pretrained(model_id)
    return len(tok.encode(text, add_special_tokens=False))

sample = "Explain transformer attention in three sentences."
print(f"GPT-4o:    {count_tokens_openai(sample)} tokens")
print(f"Llama-3.1: {count_tokens_hf(sample)} tokens")