01 — TEXT TO TOKENS
Tokenization: Text to Tokens
LLMs don't see characters or words — they see tokens (subword units from BPE or WordPiece). Understanding tokenization is foundational because it determines how text enters the model, affects cost, and can introduce subtle bugs.
Byte-Pair Encoding (BPE)
BPE starts from raw bytes (or characters) and repeatedly merges the most frequent adjacent pair of symbols until the target vocabulary size is reached. Vocabularies typically contain 32K–200K tokens depending on the model. After training on massive text corpora, BPE learns to encode common subwords efficiently:
- 1 token ≈ 0.75 words in English on average
- "tokenization" = 3 tokens
- "hello" = 1 token
- "antidisestablishmentarianism" = 8 tokens (exact counts vary by tokenizer)
Rare words, non-English text, and code tokenize less efficiently → more tokens → higher cost and longer context usage.
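The merge loop described above can be sketched in a few lines. This is a toy illustration on whole words (real BPE implementations operate on bytes and handle many edge cases); the function name and corpus are made up for the example:

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for word, freq in corpus.items():  # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        corpus = merged
    return merges

print(bpe_train(["low", "low", "lower", "lowest"], num_merges=2))
# → [('l', 'o'), ('lo', 'w')] — frequent subwords get merged first
```

After enough merges, frequent words like "low" become single tokens while rare words stay split into several pieces, which is exactly why rare and non-English text costs more tokens.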
Example: Tokenizing with tiktoken
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
text = "The quick brown fox jumps over the lazy dog."
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}") # 10 tokens
# Decode back
decoded = enc.decode(tokens)
# Chinese text tokenizes differently
chinese = "快速的棕色狐狸"
print(len(enc.encode(chinese))) # ~7 tokens for 7 chars
⚠️
Tokenization artifacts cause subtle bugs. "9.11 > 9.9" is often answered incorrectly because "9", ".", "11" are separate tokens — the model sees numbers as strings, not values.
02 — THE WORKING MEMORY
Context Window: The Working Memory
Context window = total token budget for input + output in a single forward pass. Everything the model "knows" about your task must fit in the context window — there is no other memory during inference.
Lost in the Middle Effect
Research by Liu et al. (2023) shows that retrieval accuracy degrades for content in the middle of long contexts. Content at the start and end is remembered better. This has important implications for prompt engineering with long documents.
Context Window Comparison Across Use Cases
| Context needed | Example task | Suitable models |
| --- | --- | --- |
| 4K–8K | Single document QA, short chat | All models |
| 32K | Long document, book chapter | GPT-4o, Claude, Gemini |
| 128K | Full codebase, long report | GPT-4o, Claude 3.5, Llama 3.1 |
| 1M+ | Entire novel, large repo | Gemini 1.5 (Claude 3.5 tops out at 200K) |
💡
Long context does not mean infinite memory. Cost scales linearly with token count — a 1M-token request costs on the order of $2.50 with Gemini 1.5 Pro. Summarize or filter input before reaching for a bigger context window.
03 — HOW LLMS GENERATE
Next-Token Prediction: How LLMs Generate
Training objective: predict the next token given all previous tokens. That's it. Generalization, reasoning, and instruction-following all emerge from this simple objective.
Autoregressive Generation
The generation process is sequential: generate token 1 → append to context → generate token 2 → ... → stop token or max_tokens reached. Each token is sampled from a probability distribution over the vocabulary (~100K tokens). The sampling strategy determines which token to pick from that distribution.
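The loop just described can be sketched with a toy stand-in model. The `model` function here is a placeholder that returns a probability distribution over a tiny vocabulary, not a real LLM:

```python
import random

def generate(model, prompt_tokens, max_new_tokens, eos_id):
    """Autoregressive loop: sample a token, append it, feed the longer sequence back."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model(tokens)  # probability distribution over the vocabulary
        next_id = random.choices(range(len(probs)), weights=probs)[0]
        tokens.append(next_id)
        if next_id == eos_id:  # stop token ends generation early
            break
    return tokens

# Toy "model": always predicts token 2 with certainty.
toy = lambda ctx: [0.0, 0.0, 1.0]
print(generate(toy, [0], max_new_tokens=5, eos_id=2))  # → [0, 2]
```

Note that each step depends on all previous tokens — this is why generation is inherently sequential and why output tokens are priced higher than input tokens.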
Streaming Token-by-Token Generation
from openai import OpenAI
client = OpenAI()
# Stream tokens to the terminal as the model produces them
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Count from 1 to 5"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks (e.g. the final one) carry no text
        print(delta, end="", flush=True)
04 — CONTROLLING RANDOMNESS
Sampling Strategies
Once the model outputs a probability distribution over tokens, we need to decide which token to pick. Different sampling strategies create different behaviors.
Core Sampling Methods
- Temperature: scales logits before softmax. T=0 → always pick highest-probability token (greedy). T=1 → sample from raw distribution. T>1 → more random.
- Top-p (nucleus sampling): sample from smallest set of tokens whose cumulative probability ≥ p. p=0.9 → ignore long tail of low-probability tokens.
- Top-k: sample only from top-k highest probability tokens. k=50 is common.
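The three methods above compose naturally: scale logits by temperature, apply softmax, then filter with top-k and/or top-p before sampling. A minimal sketch (pure Python, no batching or tie-breaking subtleties):

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Temperature scaling, then optional top-k and top-p (nucleus) filtering."""
    if temperature == 0:  # greedy decoding: always the highest-logit token
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]        # numerically stable softmax
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    if top_k:
        ranked = ranked[:top_k]                     # keep only the k most likely
    if top_p < 1.0:                                 # smallest set with cum. prob >= p
        kept, cum = [], 0.0
        for p, i in ranked:
            kept.append((p, i))
            cum += p
            if cum >= top_p:
                break
        ranked = kept
    return random.choices([i for _, i in ranked], weights=[p for p, _ in ranked])[0]

print(sample_token([1.0, 4.0, 2.0], temperature=0))  # → 1 (greedy picks the max logit)
```

With `top_k=1` or a very small `top_p`, sampling collapses to greedy decoding; with high temperature and `top_p=1`, the long tail of unlikely tokens stays in play.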
Sampling Strategy Comparison
| Strategy | Deterministic | Diversity | Use case |
| --- | --- | --- | --- |
| Greedy (T=0) | Yes | None | Exact extraction, structured output |
| Temperature only | No | Controlled | General chat, creative |
| Top-p (0.9) | No | Controlled | Standard production default |
| Top-k (50) | No | Controlled | Alternative to top-p |
| T=1, top-p=1 | No | Max | Brainstorming, diversity |
Temperature in Action
Same prompt, different temperatures show dramatically different outputs:
Prompt: "The sky is"
T=0 (greedy): "blue" (always, deterministic)
T=0.7 (default): "a beautiful shade of blue" or "painted with clouds" or "clear today"
T=1.5 (high): "cerulean" or "an eternal canvas" or "weeping with soft tears"
Usage guide:
- T=0: JSON extraction, code generation, classification
- T=0.7: default chat, summarization
- T=1.0+: creative writing, brainstorming
✓
For structured output (JSON mode, function calling), always use T=0 or near-zero. Sampling randomness causes JSON parse failures and function call errors.
05 — CONFIDENCE AND UNCERTAINTY
Logprobs and Calibration
Models can expose log-probabilities for each output token. This is useful for: confidence estimation, multiple choice evaluation, re-ranking candidate responses, and detecting when a model is uncertain.
What is Calibration?
A well-calibrated model's stated 80% confidence should correspond to 80% actual accuracy. Many LLMs are overconfident — they claim high certainty on tasks they actually fail at.
Token Probability for Classification
Instead of asking a model "Is this positive or negative?", feed "Positive" and "Negative" as candidate tokens and compare their logprobs. This is more reliable than asking the model to say one or the other, because it reduces the model's degrees of freedom.
Example: Using Logprobs for Classification
from openai import OpenAI
import math
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Sentiment: 'I love this product!' Answer: Positive or Negative"}],
logprobs=True,
top_logprobs=5,
max_tokens=1
)
top_tokens = response.choices[0].logprobs.content[0].top_logprobs
for t in top_tokens:
prob = math.exp(t.logprob)
print(f"{t.token}: {prob:.3f}")
# Example output:
# Positive: 0.987
# Negative: 0.011
06 — ROLES AND CHAT TEMPLATES
System Prompts, Roles, and Chat Templates
Chat models are trained with a specific conversation format: system / user / assistant turns. The system prompt sets context, persona, rules, and output format; models are trained to treat its instructions as higher priority than later user turns.
Chat Template Importance
The exact formatting tokens the model was trained on matters significantly. Each model family has its own chat template (e.g., Llama 3 uses <|begin_of_text|>, <|start_header_id|>, etc.). Using the wrong template causes silent quality degradation.
Llama 3 Chat Template Example
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What is 2+2?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
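The template above can be mimicked with a toy formatter to make the structure concrete. This is purely illustrative — in practice you should never hand-build templates; use `tokenizer.apply_chat_template()`, which knows each model's exact format:

```python
def format_llama3(messages):
    """Toy illustration of the Llama 3 template shown above.
    Real code should use tokenizer.apply_chat_template() instead."""
    out = "<|begin_of_text|>"
    for msg in messages:
        out += (f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
                f"{msg['content']}<|eot_id|>")
    # The trailing assistant header cues the model to generate its reply.
    out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out

prompt = format_llama3([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
])
print(prompt)
```

Note how the prompt ends with an open assistant header rather than a complete turn — the model's job is to continue from exactly that point.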
System Prompt Best Practices
- Be explicit about tone, style, and constraints
- Specify output format (JSON, markdown, etc.)
- For critical tasks, include examples of correct/incorrect behavior
- Keep system prompts under 1000 tokens for efficiency
⚠️
Using the wrong chat template for a fine-tuned model causes silent quality degradation. Always check the model card for the expected format and use tokenizer.apply_chat_template().
07 — LIMITS AND BOUNDARIES
Stop Tokens, Max Tokens, and Limits
Controlling where generation stops is critical for structured outputs and cost management. Stop tokens signal end-of-response, and max_tokens enforces hard limits on output length.
Key Concepts
- Stop tokens: special tokens that signal end of response (EOS). Most APIs also accept custom stop sequences, e.g. a closing delimiter from your output format.
- max_tokens / max_completion_tokens: hard limit on output length. Truncates mid-sentence if hit — always set generously.
- Token counting for cost: input_tokens × input_price + output_tokens × output_price
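The cost formula above is simple enough to fold into a helper. The prices here are placeholders per 1M tokens, not any provider's real rates:

```python
def estimate_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost in dollars; input_price / output_price are per 1M tokens (placeholder rates)."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# e.g. a 12K-token prompt with an 800-token reply at $2.50 / $10.00 per 1M tokens:
print(f"${estimate_cost(12_000, 800, 2.50, 10.00):.4f}")  # → $0.0380
```

Because output tokens are typically several times more expensive than input tokens, trimming verbose completions often saves more than trimming prompts.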
Common LLM API Parameters
| Parameter | What it controls | Recommended default |
| --- | --- | --- |
| temperature | Randomness | 0 for structured, 0.7 for chat |
| max_tokens | Output length limit | 2× expected output |
| top_p | Nucleus sampling | 0.9 (leave temperature unchanged) |
| stop | Custom stop sequences | Closing delimiter for structured outputs |
| seed | Reproducibility | Set for testing/evals |
| logprobs | Token probabilities | true for evals, classification |
References
- Liu, N. F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172.