01 — TEXT TO TOKENS
Tokenization: Text to Tokens
LLMs don't see characters or words — they see tokens (subword units from BPE or WordPiece). Understanding tokenization is foundational because it determines how text enters the model, affects cost, and can introduce subtle bugs.
Byte-Pair Encoding (BPE)
BPE starts from raw bytes (or characters) and repeatedly merges the most frequent adjacent pair of symbols until the target vocabulary size is reached. Vocabularies typically contain 32K–200K tokens depending on the model. After training on massive text corpora, BPE learns to encode common subwords efficiently:
- 1 token ≈ 0.75 words in English on average
- "tokenization" = 3 tokens
- "hello" = 1 token
- "antidisestablishmentarianism" = 8 tokens (exact counts vary by tokenizer)
Rare words, non-English text, and code tokenize less efficiently → more tokens → higher cost and longer context usage.
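The merge loop described above can be sketched in a few lines. This is a toy illustration on whole words (real BPE implementations operate on bytes and handle many edge cases); the function name and corpus are made up for the example:

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for word, freq in corpus.items():  # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        corpus = merged
    return merges

print(bpe_train(["low", "low", "lower", "lowest"], num_merges=2))
# → [('l', 'o'), ('lo', 'w')] — frequent subwords get merged first
```

After enough merges, frequent words like "low" become single tokens while rare words stay split into several pieces, which is exactly why rare and non-English text costs more tokens.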
Example: Tokenizing with tiktoken
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
text = "The quick brown fox jumps over the lazy dog."
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}") # 10 tokens
# Decode back
decoded = enc.decode(tokens)
# Chinese text tokenizes differently
chinese = "快速的棕色狐狸"
print(len(enc.encode(chinese))) # ~7 tokens for 7 chars
⚠️
Tokenization artifacts cause subtle bugs. "9.11 > 9.9" is often answered incorrectly because "9", ".", "11" are separate tokens — the model sees numbers as strings, not values.
02 — THE WORKING MEMORY
Context Window: The Working Memory
Context window = total token budget for input + output in a single forward pass. Everything the model "knows" about your task must fit in the context window — there is no other memory during inference.
Lost in the Middle Effect
Research by Liu et al. (2023) shows that retrieval accuracy degrades for content in the middle of long contexts. Content at the start and end is remembered better. This has important implications for prompt engineering with long documents.
Context Window Comparison Across Use Cases
| Context needed | Example task | Suitable models |
| --- | --- | --- |
| 4K–8K | Single document QA, short chat | All models |
| 32K | Long document, book chapter | GPT-4o, Claude, Gemini |
| 128K | Full codebase, long report | GPT-4o, Claude 3.5, Llama 3.1 |
| 1M+ | Entire novel, large repo | Gemini 1.5 (Claude 3.5 tops out at 200K) |
💡
Long context does not mean infinite memory. Cost scales linearly with token count — a 1M-token request costs on the order of $2.50 with Gemini 1.5 Pro. Summarize or filter input before reaching for a bigger context window.
03 — HOW LLMS GENERATE
Next-Token Prediction: How LLMs Generate
Training objective: predict the next token given all previous tokens. That's it. Generalization, reasoning, and instruction-following all emerge from this simple objective.
Autoregressive Generation
The generation process is sequential: generate token 1 → append to context → generate token 2 → ... → stop token or max_tokens reached. Each token is sampled from a probability distribution over the vocabulary (~100K tokens). The sampling strategy determines which token to pick from that distribution.
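The loop just described can be sketched with a toy stand-in model. The `model` function here is a placeholder that returns a probability distribution over a tiny vocabulary, not a real LLM:

```python
import random

def generate(model, prompt_tokens, max_new_tokens, eos_id):
    """Autoregressive loop: sample a token, append it, feed the longer sequence back."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model(tokens)  # probability distribution over the vocabulary
        next_id = random.choices(range(len(probs)), weights=probs)[0]
        tokens.append(next_id)
        if next_id == eos_id:  # stop token ends generation early
            break
    return tokens

# Toy "model": always predicts token 2 with certainty.
toy = lambda ctx: [0.0, 0.0, 1.0]
print(generate(toy, [0], max_new_tokens=5, eos_id=2))  # → [0, 2]
```

Note that each step depends on all previous tokens — this is why generation is inherently sequential and why output tokens are priced higher than input tokens.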
Streaming Token-by-Token Generation
from openai import OpenAI
client = OpenAI()
# Stream tokens to the terminal as the model produces them
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Count from 1 to 5"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks (e.g. the final one) carry no text
        print(delta, end="", flush=True)
04 — CONTROLLING RANDOMNESS
Sampling Strategies
Once the model outputs a probability distribution over tokens, we need to decide which token to pick. Different sampling strategies create different behaviors.
Core Sampling Methods
- Temperature: scales logits before softmax. T=0 → always pick highest-probability token (greedy). T=1 → sample from raw distribution. T>1 → more random.
- Top-p (nucleus sampling): sample from smallest set of tokens whose cumulative probability ≥ p. p=0.9 → ignore long tail of low-probability tokens.
- Top-k: sample only from top-k highest probability tokens. k=50 is common.
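The three methods above compose naturally: scale logits by temperature, apply softmax, then filter with top-k and/or top-p before sampling. A minimal sketch (pure Python, no batching or tie-breaking subtleties):

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Temperature scaling, then optional top-k and top-p (nucleus) filtering."""
    if temperature == 0:  # greedy decoding: always the highest-logit token
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]        # numerically stable softmax
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    if top_k:
        ranked = ranked[:top_k]                     # keep only the k most likely
    if top_p < 1.0:                                 # smallest set with cum. prob >= p
        kept, cum = [], 0.0
        for p, i in ranked:
            kept.append((p, i))
            cum += p
            if cum >= top_p:
                break
        ranked = kept
    return random.choices([i for _, i in ranked], weights=[p for p, _ in ranked])[0]

print(sample_token([1.0, 4.0, 2.0], temperature=0))  # → 1 (greedy picks the max logit)
```

With `top_k=1` or a very small `top_p`, sampling collapses to greedy decoding; with high temperature and `top_p=1`, the long tail of unlikely tokens stays in play.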
Sampling Strategy Comparison
| Strategy | Deterministic | Diversity | Use case |
| --- | --- | --- | --- |
| Greedy (T=0) | Yes | None | Exact extraction, structured output |
| Temperature only | No | Controlled | General chat, creative |
| Top-p (0.9) | No | Controlled | Standard production default |
| Top-k (50) | No | Controlled | Alternative to top-p |
| T=1, top-p=1 | No | Max | Brainstorming, diversity |
Temperature in Action
Same prompt, different temperatures show dramatically different outputs:
Prompt: "The sky is"
T=0 (greedy): "blue" (always, deterministic)
T=0.7 (default): "a beautiful shade of blue" or "painted with clouds" or "clear today"
T=1.5 (high): "cerulean" or "an eternal canvas" or "weeping with soft tears"
Usage guide:
- T=0: JSON extraction, code generation, classification
- T=0.7: default chat, summarization
- T=1.0+: creative writing, brainstorming
✓
For structured output (JSON mode, function calling), always use T=0 or near-zero. Sampling randomness causes JSON parse failures and function call errors.
05 — CONFIDENCE AND UNCERTAINTY
Logprobs and Calibration
Models can expose log-probabilities for each output token. This is useful for: confidence estimation, multiple choice evaluation, re-ranking candidate responses, and detecting when a model is uncertain.
What is Calibration?
A well-calibrated model's stated 80% confidence should correspond to 80% actual accuracy. Many LLMs are overconfident — they claim high certainty on tasks they actually fail at.
Token Probability for Classification
Instead of asking a model "Is this positive or negative?", feed "Positive" and "Negative" as candidate tokens and compare their logprobs. This is more reliable than asking the model to say one or the other, because it reduces the model's degrees of freedom.
Example: Using Logprobs for Classification
from openai import OpenAI
import math
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Sentiment: 'I love this product!' Answer: Positive or Negative"}],
logprobs=True,
top_logprobs=5,
max_tokens=1
)
top_tokens = response.choices[0].logprobs.content[0].top_logprobs
for t in top_tokens:
prob = math.exp(t.logprob)
print(f"{t.token}: {prob:.3f}")
# Example output:
# Positive: 0.987
# Negative: 0.011
06 — ROLES AND CHAT TEMPLATES
System Prompts, Roles, and Chat Templates
Chat models are trained with a specific conversation format: system / user / assistant turns. The system prompt sets context, persona, rules, and output format; models are trained to treat its instructions as higher priority than later user turns.
Chat Template Importance
The exact formatting tokens the model was trained on matters significantly. Each model family has its own chat template (e.g., Llama 3 uses <|begin_of_text|>, <|start_header_id|>, etc.). Using the wrong template causes silent quality degradation.
Llama 3 Chat Template Example
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What is 2+2?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
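The template above can be mimicked with a toy formatter to make the structure concrete. This is purely illustrative — in practice you should never hand-build templates; use `tokenizer.apply_chat_template()`, which knows each model's exact format:

```python
def format_llama3(messages):
    """Toy illustration of the Llama 3 template shown above.
    Real code should use tokenizer.apply_chat_template() instead."""
    out = "<|begin_of_text|>"
    for msg in messages:
        out += (f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
                f"{msg['content']}<|eot_id|>")
    # The trailing assistant header cues the model to generate its reply.
    out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out

prompt = format_llama3([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
])
print(prompt)
```

Note how the prompt ends with an open assistant header rather than a complete turn — the model's job is to continue from exactly that point.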
System Prompt Best Practices
- Be explicit about tone, style, and constraints
- Specify output format (JSON, markdown, etc.)
- For critical tasks, include examples of correct/incorrect behavior
- Keep system prompts under 1000 tokens for efficiency
⚠️
Using the wrong chat template for a fine-tuned model causes silent quality degradation. Always check the model card for the expected format and use tokenizer.apply_chat_template().
07 — LIMITS AND BOUNDARIES
Stop Tokens, Max Tokens, and Limits
Controlling where generation stops is critical for structured outputs and cost management. Stop tokens signal end-of-response, and max_tokens enforces hard limits on output length.
Key Concepts
- Stop tokens: special tokens that signal end of response (EOS). Most APIs also accept custom stop sequences, e.g. a closing delimiter from your output format.
- max_tokens / max_completion_tokens: hard limit on output length. Truncates mid-sentence if hit — always set generously.
- Token counting for cost: input_tokens × input_price + output_tokens × output_price
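The cost formula above is simple enough to fold into a helper. The prices here are placeholders per 1M tokens, not any provider's real rates:

```python
def estimate_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost in dollars; input_price / output_price are per 1M tokens (placeholder rates)."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# e.g. a 12K-token prompt with an 800-token reply at $2.50 / $10.00 per 1M tokens:
print(f"${estimate_cost(12_000, 800, 2.50, 10.00):.4f}")  # → $0.0380
```

Because output tokens are typically several times more expensive than input tokens, trimming verbose completions often saves more than trimming prompts.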
Common LLM API Parameters
| Parameter | What it controls | Recommended default |
| --- | --- | --- |
| temperature | Randomness | 0 for structured, 0.7 for chat |
| max_tokens | Output length limit | 2× expected output |
| top_p | Nucleus sampling | 0.9 (leave temperature unchanged) |
| stop | Custom stop sequences | Closing delimiter for structured outputs |
| seed | Reproducibility | Set for testing/evals |
| logprobs | Token probabilities | true for evals, classification |
References
- Liu, N. F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172.