ML Foundations

Probability & Stats

Probability theory, Bayes' theorem, and entropy underpin loss functions, beam search, sampling strategies, and everything statistical in LLMs.

Cross-Entropy (loss function) · Bayes (key theorem) · Softmax (distribution)


SECTION 01

Probability in LLMs

Language models are probability distributions over token sequences. P(token | context) is what the model outputs; probability theory is the native language of LLMs.
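The chain rule behind this view can be sketched in a few lines: the probability of a sequence is the product of per-token conditional probabilities, or equivalently the sum of their logs. The logits below are random stand-ins for what a real model would produce from the context.

```python
import torch
import torch.nn.functional as F

# Toy "model": fixed logits over a 5-token vocab at each of 3 positions.
# A real LLM would compute these from the context; here they are synthetic.
torch.manual_seed(0)
logits = torch.randn(3, 5)            # (seq_len, vocab_size)
tokens = torch.tensor([2, 0, 4])      # the sequence actually generated

log_probs = F.log_softmax(logits, dim=-1)          # log P(token | context)
seq_log_prob = log_probs[torch.arange(3), tokens].sum()

# Chain rule: P(t1, t2, t3) = P(t1) * P(t2 | t1) * P(t3 | t1, t2)
seq_prob = seq_log_prob.exp()
print(f"log P(sequence) = {seq_log_prob:.3f}, P(sequence) = {seq_prob:.5f}")
```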

SECTION 02

Distributions & Sampling

import torch
import torch.nn.functional as F

# LLM outputs logits (raw scores) for each vocab token
logits = torch.randn(50000)  # one score per vocab token

# Softmax: convert to a valid probability distribution (sum = 1, all >= 0)
probs = F.softmax(logits, dim=-1)
print(f"Sum: {probs.sum():.4f}")  # 1.0000

# Greedy decoding: always pick the most probable token
next_token = probs.argmax()

# Multinomial sampling: sample from the distribution
next_token = torch.multinomial(probs, num_samples=1)

# Top-k sampling: only sample from the k most probable tokens
k = 50
top_k_probs, top_k_ids = probs.topk(k)
top_k_probs = top_k_probs / top_k_probs.sum()  # renormalize
next_token = top_k_ids[torch.multinomial(top_k_probs, 1)]
SECTION 03

Cross-Entropy Loss

Cross-entropy is the standard loss for language models. It is the average negative log-probability the model assigns to the true next token; measured in base 2, it is the average number of bits needed to encode the true token under the model's distribution.

import torch
import torch.nn.functional as F

# For next-token prediction:
#   logits: (batch, seq_len, vocab_size)
#   labels: (batch, seq_len) -- integer token IDs
logits = torch.randn(2, 10, 50000)        # batch=2, seq=10
labels = torch.randint(0, 50000, (2, 10))

# Standard cross-entropy loss
loss = F.cross_entropy(
    logits.view(-1, 50000),  # (20, 50000)
    labels.view(-1),         # (20,)
)

# What it computes, per position: L = -log(P(correct_token))
#   If P(correct) = 0.9  -> L = -log(0.9)  = 0.105 (good)
#   If P(correct) = 0.01 -> L = -log(0.01) = 4.6   (bad)

# Perplexity = exp(avg cross-entropy)
perplexity = torch.exp(loss)
print(f"Loss: {loss:.4f}, Perplexity: {perplexity:.1f}")
SECTION 04

Bayes' Theorem

P(A|B) = P(B|A) · P(A) / P(B). In ML: posterior = likelihood × prior / evidence.
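A tiny worked example makes the formula concrete. The numbers below are made up for illustration: classify an email as spam given that it contains the word "free".

```python
# Worked Bayes example with hypothetical numbers.
p_spam = 0.2                 # prior P(spam)
p_free_given_spam = 0.6      # likelihood P("free" | spam)
p_free_given_ham = 0.05      # likelihood P("free" | not spam)

# Evidence: P("free") = sum over both hypotheses
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Posterior via Bayes' theorem: P(spam | "free")
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(f"P(spam | 'free') = {p_spam_given_free:.3f}")  # 0.750
```

Even though most emails are not spam (prior 0.2), observing one strongly spam-associated word flips the posterior to 75%.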

# Bayes in the context of RAG / classification:
#   P(document is relevant | query) ∝ P(query | document) × P(document)
#
#   P(query | document) = likelihood: does this doc produce this query?
#   P(document)         = prior: how common/important is this document?
#
# In practice: BM25 estimates P(query|doc); embedding similarity ≈ P(doc|query)

# Bayesian inference in uncertainty estimation:
#   The model outputs P(y|x), but we want P(model correct | output).
#   Calibration ensures output probabilities match actual accuracy.
#   Example: model says 90% confidence -> should be right ~90% of the time.
#   Overconfident models say 99% but are only right 70% -> need calibration.

import numpy as np

def expected_calibration_error(confidences, correct):
    """Measure calibration -- lower is better."""
    bins = np.linspace(0, 1, 11)
    ece = 0.0
    for low, high in zip(bins[:-1], bins[1:]):
        mask = (confidences >= low) & (confidences < high)
        if mask.sum() == 0:
            continue
        avg_conf = confidences[mask].mean()
        avg_acc = correct[mask].mean()
        ece += mask.mean() * abs(avg_conf - avg_acc)
    return ece
SECTION 05

Temperature & Top-p Sampling

Temperature and top-p control the randomness of LLM generation by shaping the output probability distribution.

import torch
import torch.nn.functional as F

def sample_with_temperature(logits, temperature=1.0, top_p=0.9):
    """Temperature + nucleus (top-p) sampling."""
    # Temperature scaling: divide logits before softmax
    #   temperature < 1.0 -> sharper (more deterministic, picks top tokens)
    #   temperature > 1.0 -> flatter (more random, explores more tokens)
    #   temperature -> 0  -> greedy decoding
    scaled_logits = logits / temperature
    probs = F.softmax(scaled_logits, dim=-1)

    # Top-p (nucleus) sampling: keep the smallest set of tokens summing to p
    sorted_probs, sorted_ids = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)

    # Zero out tokens past the nucleus (keep the token that crosses p)
    sorted_probs[cumulative - sorted_probs > top_p] = 0
    sorted_probs /= sorted_probs.sum()

    # Sample from the nucleus
    idx = torch.multinomial(sorted_probs, 1)
    return sorted_ids[idx]

# Typical settings:
#   Creative writing: temperature=0.8, top_p=0.9
#   Code generation:  temperature=0.2, top_p=0.95
#   Factual Q&A:      temperature=0.0 (greedy)
Choosing temperature: Higher temperature = more creative/varied. Lower = more focused/factual. Start at 0.7 for chat, 0.2 for code, experiment from there.
SECTION 06

Information Theory Basics

Information theory formalizes uncertainty measurement. Shannon entropy, KL divergence, and mutual information are used throughout ML.

import numpy as np

# Shannon entropy: H(P) = -Σ P(x) log P(x)
#   High entropy = high uncertainty (uniform distribution)
#   Low entropy  = low uncertainty (peaked distribution)

def entropy(probs):
    probs = np.clip(probs, 1e-9, 1)
    return -(probs * np.log2(probs)).sum()

uniform = np.ones(4) / 4                     # [0.25, 0.25, 0.25, 0.25]
peaked = np.array([0.97, 0.01, 0.01, 0.01])
print(f"Uniform entropy: {entropy(uniform):.2f} bits")  # 2.00
print(f"Peaked entropy:  {entropy(peaked):.2f} bits")   # ~0.24

# KL divergence: D_KL(P || Q) = Σ P(x) log(P(x)/Q(x))
# Measures how different Q is from P. Used in VAEs and the RLHF KL penalty.

def kl_divergence(P, Q):
    P, Q = np.clip(P, 1e-9, 1), np.clip(Q, 1e-9, 1)
    return (P * np.log(P / Q)).sum()

# RLHF KL penalty: D_KL(policy || reference_model)
# Prevents the policy from drifting too far from the pre-trained model.
Perplexity in plain English: Perplexity of 50 means the model is as uncertain as if uniformly choosing among 50 options at each step. Lower = better model.
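That intuition can be checked directly: a model that is uniformly uncertain over V options at each step has cross-entropy log(V), so its perplexity is exactly V.

```python
import numpy as np

# Uniform uncertainty over V=50 options -> perplexity exactly 50.
V = 50
p_correct = 1.0 / V                     # uniform P(correct token)
cross_entropy = -np.log(p_correct)      # average -log P per token
perplexity = np.exp(cross_entropy)
print(f"Perplexity: {perplexity:.1f}")  # 50.0
```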

Probability in attention mechanisms

The softmax function converts raw attention logits, the dot products between query and key vectors, into a probability distribution over positions. This distribution determines how much each position's value vector contributes to the output representation. Scaling the logits by the square root of the key dimension acts as a fixed temperature: it prevents the distribution from becoming too peaked on a single position when the key dimension is large, so gradients are not concentrated on the highest-scoring position and attention can spread across multiple relevant positions.
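A minimal sketch of this scaling effect, with random query/key/value tensors standing in for real activations: dividing the scores by √d_k flattens the attention distribution compared to using the raw dot products.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 64
q = torch.randn(1, d_k)      # one query vector
K = torch.randn(6, d_k)      # keys for 6 positions
V = torch.randn(6, d_k)      # values for 6 positions

scores = q @ K.T                                   # raw dot products, (1, 6)
weights_unscaled = F.softmax(scores, dim=-1)
weights_scaled = F.softmax(scores / d_k ** 0.5, dim=-1)

# Scaling flattens the distribution: a lower max weight means attention
# is spread over more positions instead of collapsing onto one.
print(f"max weight, unscaled: {weights_unscaled.max():.3f}")
print(f"max weight, scaled:   {weights_scaled.max():.3f}")

output = weights_scaled @ V   # weighted sum of value vectors, (1, d_k)
```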

Log-probabilities and token scoring

Language models produce a logit for each vocabulary token at every position; log-softmax converts these into log-probabilities. Log-probabilities are numerically more stable than raw probabilities because they avoid the floating-point underflow that occurs when multiplying many small probabilities together (as when computing the probability of a long sequence). The sequence-level log-probability is the sum of token-level log-probabilities, which corresponds to the product of the individual token probabilities. This relationship makes perplexity, the exponential of the average negative per-token log-probability, the standard measure of language model quality on a held-out test set.
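The underflow problem is easy to demonstrate: multiplying 500 probabilities of 0.1 underflows a float64 to zero, while the equivalent sum of logs stays perfectly representable.

```python
import numpy as np

# 500 tokens each with probability 0.1 (synthetic values for illustration)
p = 0.1
n = 500
prob_product = p ** n            # 1e-500 underflows float64 to 0.0
log_prob_sum = n * np.log(p)     # about -1151.3, no underflow

print(prob_product)              # 0.0
print(f"{log_prob_sum:.1f}")

# Perplexity from the average negative log-probability per token
perplexity = np.exp(-log_prob_sum / n)
print(f"Perplexity: {perplexity:.1f}")   # 10.0
```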

Concept | Formula | LLM application
Softmax | exp(xᵢ) / Σⱼ exp(xⱼ) | Token probability distribution, attention weights
Cross-entropy loss | -log P(y|x) | Training objective for next-token prediction
Perplexity | exp(mean(-log P(yᵢ))) | Evaluation metric; lower = better
Temperature scaling | softmax(logits / T) | Sharpens (T<1) or flattens (T>1) the distribution

The multinomial distribution describes how tokens are sampled from a language model's vocabulary distribution at each generation step. At temperature 1.0, the model samples proportionally to the probability of each token. Top-p sampling (nucleus sampling) restricts the sampling distribution to the smallest subset of tokens whose cumulative probability exceeds p, dynamically adjusting the effective vocabulary size based on distribution sharpness. A peaked distribution with one dominant token produces a small nucleus; a flat distribution produces a large nucleus. This adaptive behavior makes top-p sampling more robust to distribution variation across generation steps than fixed top-k sampling.
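The adaptive nucleus size can be measured directly. The sketch below (with synthetic distributions) counts how many tokens fall inside the nucleus for a peaked versus a flat distribution at p=0.9.

```python
import numpy as np

def nucleus_size(probs, p=0.9):
    """Number of tokens in the smallest set with cumulative probability >= p."""
    sorted_probs = np.sort(probs)[::-1]
    cumulative = np.cumsum(sorted_probs)
    return int(np.searchsorted(cumulative, p) + 1)

peaked = np.array([0.95, 0.02, 0.01, 0.01, 0.01])  # one dominant token
flat = np.ones(4) / 4                              # 4 equally likely tokens

print(nucleus_size(peaked))  # 1  (the top token alone covers 0.95 >= 0.9)
print(nucleus_size(flat))    # 4  (need all 4 uniform tokens to reach 0.9)
```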

Calibration, the alignment between a model's stated confidence and actual accuracy, is an important probabilistic property for production LLM applications. A perfectly calibrated model would be correct 80% of the time on questions where it assigns 80% confidence. Most LLMs are overconfident: they assign high probabilities to incorrect answers more often than their accuracy justifies. Calibration error can be measured using Expected Calibration Error (ECE) on a labeled dataset, and mitigated through post-hoc temperature scaling that adjusts the model's logit scale to reduce overconfidence without changing its accuracy.
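Post-hoc temperature scaling can be sketched as a one-parameter fit: choose the scalar T that minimizes negative log-likelihood on held-out data. The grid search and the synthetic "3x too sharp" model below are simplifications for illustration; production code usually optimizes log T with LBFGS.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, candidates=torch.linspace(0.5, 5.0, 46)):
    """Pick the scalar T minimizing NLL of logits / T on held-out data.
    Dividing logits by T preserves the argmax, so accuracy is unchanged."""
    best_T, best_nll = 1.0, float("inf")
    for T in candidates:
        nll = F.cross_entropy(logits / T, labels).item()
        if nll < best_nll:
            best_T, best_nll = float(T), nll
    return best_T

# Hypothetical overconfident model: logits 3x sharper than the distribution
# the labels were actually drawn from.
torch.manual_seed(0)
true_logits = torch.randn(1000, 10)
labels = torch.multinomial(F.softmax(true_logits, dim=-1), 1).squeeze(1)
overconfident = true_logits * 3.0

T = fit_temperature(overconfident, labels)
print(f"Fitted temperature: {T:.1f}")  # roughly 3, undoing the sharpening
```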

The relationship between perplexity and downstream task performance is non-linear and task-dependent. Reducing perplexity on a language modeling evaluation set (typically WikiText or C4) through continued pre-training does not guarantee proportional improvement on downstream tasks. Models trained on code-heavy corpora may have higher general-domain perplexity but lower perplexity on coding tasks, making them better code generators despite worse headline perplexity numbers. Task-specific evaluation benchmarks remain necessary alongside perplexity as quality indicators, because optimizing for perplexity alone can degrade performance on under-represented task types.

Beam search decoding uses probability theory to select sequences with higher overall likelihood than greedy decoding. Instead of selecting the most probable token at each step, beam search maintains k candidate sequences (the beam) and expands each by selecting the top tokens, keeping only the k highest-scoring expanded sequences. The beam score is the sum of log-probabilities of all selected tokens. Increasing beam width k improves sequence quality up to a point (typically k=4–8 for summarization) beyond which marginal quality improvements are negligible while compute cost increases linearly with k. Length normalization, dividing the sum of log-probabilities by sequence length, prevents beam search from systematically preferring shorter sequences.
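The expand-and-prune loop can be sketched against a toy model where the next-token distribution depends only on the position; a real LLM would condition on the full prefix, which is where beam search actually diverges from greedy decoding.

```python
import numpy as np

def beam_search(step_log_probs, k=2):
    """Minimal beam search over a toy position-only model.
    step_log_probs[t] holds one log-probability per vocab token at step t."""
    beams = [([], 0.0)]                  # (token sequence, summed log-prob)
    for log_probs in step_log_probs:
        candidates = []
        for seq, score in beams:
            # Expand every beam by every possible next token
            for token, lp in enumerate(log_probs):
                candidates.append((seq + [token], score + lp))
        # Prune: keep only the k highest-scoring expanded sequences
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]
    return beams

vocab, steps = 5, 3
rng = np.random.default_rng(0)
logits = rng.normal(size=(steps, vocab))
step_log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

for seq, score in beam_search(step_log_probs, k=2):
    print(seq, f"{score:.3f}")
```

Because this toy model's steps are independent of the prefix, the top beam here coincides with greedy decoding; with a real autoregressive model, a locally sub-optimal token can lead to a higher-scoring full sequence.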

Monte Carlo estimation in LLM evaluation approximates the expected quality of model outputs by averaging over multiple sampled responses to the same prompt. Because LLM outputs are stochastic (different at temperature > 0), single-sample evaluation has high variance: the measured quality of a single response may be unrepresentative of the model's average behavior. Running 10–50 samples per prompt and averaging quality scores provides more reliable quality estimates for noisy metrics, and sampling diverse responses enables better understanding of the model's output distribution, including its worst-case behaviors.
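The variance reduction follows the usual 1/√n law, which a quick simulation with synthetic quality scores makes visible: averaging 50 samples per prompt shrinks the spread of the estimate by roughly √50 ≈ 7x.

```python
import numpy as np

# Synthetic per-sample quality scores for one prompt: stochastic generations
# produce noisy scores around a true mean (all values are made up).
rng = np.random.default_rng(0)
true_quality = 0.7
scores = rng.normal(loc=true_quality, scale=0.15, size=(1000, 50))

single_sample = scores[:, 0]        # estimate from 1 sample per trial
averaged = scores.mean(axis=1)      # estimate from 50 samples per trial

# Averaging n samples shrinks the std of the estimate by about sqrt(n)
print(f"std, 1 sample:   {single_sample.std():.3f}")   # ~0.15
print(f"std, 50 samples: {averaged.std():.3f}")        # ~0.02
```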