HOW LLMS WORK

LLM Internals

Tokenization, context windows, sampling, and the mechanics that explain LLM behavior

token ≠ word: the first thing to internalize
context = memory: the hard limit
temperature controls randomness: the key sampling knob
Contents
  1. Tokenization
  2. Context window
  3. Next-token prediction
  4. Sampling strategies
  5. Logprobs & calibration
  6. System prompts & roles
  7. Stop tokens & limits
01 — TEXT TO TOKENS

Tokenization: Text to Tokens

LLMs don't see characters or words — they see tokens (subword units from BPE or WordPiece). Understanding tokenization is foundational because it determines how text enters the model, affects cost, and can introduce subtle bugs.

Byte-Pair Encoding (BPE)

BPE starts from bytes (or characters) and repeatedly merges the most frequent adjacent pair of symbols, adding each merge as a new vocabulary entry until the target vocabulary size is reached. Vocabularies typically contain 32K–200K tokens depending on the model. After training on massive text corpora, BPE encodes common subwords efficiently:
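The merge loop above can be sketched in a few lines of pure Python. This is a toy illustration of the training procedure, not a production tokenizer; the corpus, frequencies, and number of merge steps are invented for the example:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word starts as a tuple of characters
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("low"): 7}
for _ in range(3):  # three merge steps
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(pair, "->", list(corpus))
```

After a few steps the frequent substring "low" becomes a single symbol, which is exactly why common subwords cost few tokens.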

Rare words, non-English text, and code tokenize less efficiently → more tokens → higher cost and longer context usage.

Example: Tokenizing with tiktoken

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
text = "The quick brown fox jumps over the lazy dog."
tokens = enc.encode(text)

print(f"Text: {text}")
print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}")  # 10 tokens

# Decode back
decoded = enc.decode(tokens)

# Chinese text tokenizes differently
chinese = "快速的棕色狐狸"
print(len(enc.encode(chinese)))  # ~7 tokens for 7 chars
⚠️ Tokenization artifacts cause subtle bugs. "9.11 > 9.9" is often answered incorrectly because "9", ".", "11" are separate tokens — the model sees numbers as strings, not values.
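The failure mode can be reproduced without a model: if the integer and fractional parts are compared separately, as the token split encourages, 11 > 9 yields the wrong answer. The `naive_compare` helper below is a hypothetical illustration of that string-level reasoning, not anything a real model executes:

```python
# The model sees "9.11" as pieces like "9", ".", "11", not as one number.
# Comparing the fractional parts as integers (11 > 9) reproduces the mistake:
def naive_compare(a: str, b: str) -> str:
    a_int, a_frac = a.split(".")
    b_int, b_frac = b.split(".")
    if int(a_int) != int(b_int):
        return a if int(a_int) > int(b_int) else b
    return a if int(a_frac) > int(b_frac) else b  # treats "11" as bigger than "9"

print(naive_compare("9.11", "9.9"))  # "9.11" -- the classic wrong answer
print(max(9.11, 9.9))                # 9.9 -- the correct numeric comparison
```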
02 — THE WORKING MEMORY

Context Window: The Working Memory

Context window = total token budget for input + output in a single forward pass. Everything the model "knows" about your task must fit in the context window — there is no other memory during inference.
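Since input and output share one budget, a request must satisfy prompt tokens plus requested output tokens ≤ window. A minimal budget check, with the 128K window as an illustrative figure:

```python
def fits_context(prompt_tokens: int, max_output_tokens: int,
                 context_window: int = 128_000) -> bool:
    """Input and output share one budget: prompt + completion <= window."""
    return prompt_tokens + max_output_tokens <= context_window

# A 120K-token prompt leaves no room for a 16K-token completion in a 128K window
print(fits_context(120_000, 16_000))  # False
print(fits_context(100_000, 16_000))  # True
```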

Context Lengths Across Models

Lost in the Middle Effect

Research by Liu et al. (2023) shows that retrieval accuracy degrades for content in the middle of long contexts. Content at the start and end is remembered better. This has important implications for prompt engineering with long documents.
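One practical response is to reorder retrieved documents so the most relevant ones sit at the start and end of the prompt. This is a sketch of one such heuristic, not a standard library function:

```python
def order_for_long_context(docs_by_relevance):
    """Alternate documents between the front and back of the prompt, pushing
    the least relevant ones into the middle, where recall is weakest."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

docs = ["doc1", "doc2", "doc3", "doc4", "doc5"]  # sorted most -> least relevant
print(order_for_long_context(docs))  # ['doc1', 'doc3', 'doc5', 'doc4', 'doc2']
```

The two most relevant documents end up in the first and last positions, the two positions the model recalls best.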

Context Window Comparison Across Use Cases

Context needed | Example task                   | Suitable models
4K–8K          | Single document QA, short chat | All models
32K            | Long document, book chapter    | GPT-4o, Claude, Gemini
128K           | Full codebase, long report     | GPT-4o, Claude 3.5, Llama 3.1
1M+            | Entire novel, large repo       | Gemini 1.5 (Claude 3.5 tops out at 200K)
💡 Long context does not mean infinite memory. Cost scales linearly with token count: a 1M-token prompt costs roughly $2.50 in input tokens with Gemini 1.5 Pro. Always compress before expanding context.
03 — HOW LLMS GENERATE

Next-Token Prediction: How LLMs Generate

Training objective: predict the next token given all previous tokens. That's it. Generalization, reasoning, and instruction-following all emerge from this simple objective.

Autoregressive Generation

The generation process is sequential: generate token 1 → append to context → generate token 2 → ... → stop token or max_tokens reached. Each token is sampled from a probability distribution over the vocabulary (~100K tokens). The sampling strategy determines which token to pick from that distribution.
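The loop described above can be made concrete with a toy model. Here `toy_model` is a hypothetical stand-in that returns logits over a 4-token vocabulary; a real model would return ~100K logits from a forward pass:

```python
import math
import random

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(model, context, max_tokens, stop_token):
    """Autoregressive loop: each sampled token is appended and fed back in."""
    out = list(context)
    for _ in range(max_tokens):
        probs = softmax(model(out))  # distribution over the vocabulary
        token = random.choices(range(len(probs)), weights=probs)[0]
        if token == stop_token:
            break
        out.append(token)
    return out

# Hypothetical 4-token "model" that strongly prefers token (last + 1) % 4
toy_model = lambda ctx: [10.0 if i == (ctx[-1] + 1) % 4 else -10.0 for i in range(4)]
print(generate(toy_model, [0], max_tokens=5, stop_token=3))  # [0, 1, 2], then stop
```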

Manual Token-by-Token Generation

from openai import OpenAI

client = OpenAI()

# Stream tokens as they are generated
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Count from 1 to 5"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
04 — CONTROLLING RANDOMNESS

Sampling Strategies

Once the model outputs a probability distribution over tokens, we need to decide which token to pick. Different sampling strategies create different behaviors.

Core Sampling Methods
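The core methods compose into one pipeline: scale logits by temperature, optionally truncate the distribution with top-k and/or top-p, then sample from what remains. A minimal sketch in plain Python (a real implementation would vectorize this over tensors):

```python
import math
import random

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    """Temperature scaling, then optional top-k / top-p truncation, then sample."""
    if temperature == 0:  # greedy decoding: always take the argmax
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    probs = [math.exp(x - m) for x in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k is not None:  # keep only the k most likely tokens
        ranked = ranked[:top_k]
    if top_p is not None:  # nucleus: smallest prefix with cumulative mass >= top_p
        kept, mass = [], 0.0
        for i in ranked:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        ranked = kept
    weights = [probs[i] for i in ranked]
    return random.choices(ranked, weights=weights)[0]

logits = [2.0, 1.0, 0.1, -1.0]
print(sample(logits, temperature=0))  # 0 (greedy picks the highest logit)
```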

Sampling Strategy Comparison

Strategy         | Deterministic | Diversity  | Use case
Greedy (T=0)     | Yes           | None       | Exact extraction, structured output
Temperature only | No            | Controlled | General chat, creative
Top-p (0.9)      | No            | Controlled | Standard production default
Top-k (50)       | No            | Controlled | Alternative to top-p
T=1, top-p=1     | No            | Max        | Brainstorming, diversity

Temperature in Action

Same prompt, different temperatures show dramatically different outputs:

Prompt: "The sky is"

T=0 (greedy):    "blue" (always, deterministic)
T=0.7 (default): "a beautiful shade of blue" or "painted with clouds" or "clear today"
T=1.5 (high):    "cerulean" or "an eternal canvas" or "weeping with soft tears"

Usage guide:
T=0 for: JSON extraction, code generation, classification
T=0.7 for: default chat, summarization
T=1.0+ for: creative writing, brainstorming
For structured output (JSON mode, function calling), always use T=0 or near-zero. Sampling randomness causes JSON parse failures and function call errors.
05 — CONFIDENCE AND UNCERTAINTY

Logprobs and Calibration

Models can expose log-probabilities for each output token. This is useful for: confidence estimation, multiple choice evaluation, re-ranking candidate responses, and detecting when a model is uncertain.

What is Calibration?

A well-calibrated model's stated 80% confidence should correspond to 80% actual accuracy. Many LLMs are overconfident — they claim high certainty on tasks they actually fail at.
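Calibration is commonly measured with expected calibration error (ECE): predictions are bucketed by stated confidence, and each bucket's average confidence is compared to its actual accuracy. A minimal sketch on invented toy data:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: |accuracy - avg confidence| per equal-width confidence bin,
    weighted by the fraction of samples that fall in each bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece

# An overconfident model: claims 90% confidence but is right only half the time
confs = [0.9, 0.9, 0.9, 0.9]
hits = [True, False, True, False]
print(round(expected_calibration_error(confs, hits), 2))  # 0.4
```

A perfectly calibrated model scores 0; the 0.4 here is exactly the 40-point gap between claimed and actual accuracy.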

Token Probability for Classification

Instead of asking a model "Is this positive or negative?", feed "Positive" and "Negative" as candidate tokens and compare their logprobs. This is more reliable than asking the model to say one or the other, because it reduces the model's degrees of freedom.

Example: Using Logprobs for Classification

from openai import OpenAI
import math

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content":
        "Sentiment: 'I love this product!' Answer: Positive or Negative"}],
    logprobs=True,
    top_logprobs=5,
    max_tokens=1,
)

top_tokens = response.choices[0].logprobs.content[0].top_logprobs
for t in top_tokens:
    prob = math.exp(t.logprob)
    print(f"{t.token}: {prob:.3f}")
# Positive: 0.987
# Negative: 0.011
06 — ROLES AND CHAT TEMPLATES

System Prompts, Roles, and Chat Templates

Chat models are trained with a specific conversation format: system / user / assistant turns. The system prompt sets context, persona, rules, and output format; models are trained to prioritize its instructions over later user turns.

Chat Template Importance

The exact formatting tokens the model was trained on matter significantly. Each model family has its own chat template (e.g., Llama 3 uses <|begin_of_text|>, <|start_header_id|>, etc.). Using the wrong template causes silent quality degradation.

Llama 3 Chat Template Example

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is 2+2?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

System Prompt Best Practices

⚠️ Using the wrong chat template for a fine-tuned model causes silent quality degradation. Always check the model card for the expected format and use tokenizer.apply_chat_template().
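To make the template above concrete, here is a manual rendering of the Llama 3 format as a plain string builder. This is an illustration of what the template produces; in real code, prefer tokenizer.apply_chat_template() so the format always matches the model:

```python
def llama3_chat_template(messages):
    """Manually render messages in the Llama 3 chat format.
    In practice, use tokenizer.apply_chat_template() instead."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        parts.append(f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
                     f"{msg['content']}<|eot_id|>")
    # Trailing assistant header cues the model to start its reply
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
]
print(llama3_chat_template(messages))
```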
07 — LIMITS AND BOUNDARIES

Stop Tokens, Max Tokens, and Limits

Controlling where generation stops is critical for structured outputs and cost management. Stop tokens signal end-of-response, and max_tokens enforces hard limits on output length.
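The two mechanisms interact as a simple decode-loop rule: emit tokens until a stop sequence appears (the stop sequence itself is not returned) or the max_tokens cap is hit. A minimal sketch over a hypothetical token stream:

```python
def truncate_generation(tokens, stop_sequences, max_tokens):
    """Emit tokens until a stop sequence appears or max_tokens is reached."""
    out = []
    for tok in tokens:
        if tok in stop_sequences:  # stop sequence: excluded from the output
            break
        out.append(tok)
        if len(out) >= max_tokens:  # hard length cap
            break
    return out

stream = ["{", '"a"', ":", "1", "}", "<|eot|>", "extra"]
print(truncate_generation(stream, {"<|eot|>"}, max_tokens=10))  # stops before <|eot|>
```

Note that hitting max_tokens truncates mid-thought, which is why the table below recommends budgeting about twice the expected output.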

Key Concepts

Common LLM API Parameters

Parameter   | What it controls      | Recommended default
temperature | Randomness            | 0 for structured, 0.7 for chat
max_tokens  | Output length limit   | 2× expected output
top_p       | Nucleus sampling      | 0.9 (tune this or temperature, not both)
stop        | Custom stop sequences | Task-specific delimiters for structured outputs
seed        | Reproducibility       | Set for testing/evals
logprobs    | Token probabilities   | true for evals, classification

Tools for Token Management

tiktoken (OpenAI): Fast token counting for OpenAI models. Essential for cost estimation and validation.
tokenizers (HuggingFace): Fast BPE tokenizer implementation. Works with any model that ships a Hugging Face tokenizer.
OpenAI API (API): Native token counting via API for production workflows.
Anthropic API (API): Token counting for Claude models in API calls.
vLLM (Serving): LLM inference engine with detailed token-level metrics.
ollama (Local): Run local LLMs with full tokenization control.