Probability theory, Bayes' theorem, and entropy underpin loss functions, beam search, sampling strategies, and everything statistical in LLMs.
Language models are probability distributions over token sequences. P(token | context) is what the model outputs; probability theory is the native language of LLMs.
Cross-entropy is the standard loss for language models. It measures the average number of bits needed to encode the true token under the model's distribution.
P(A|B) = P(B|A)·P(A) / P(B). In ML: posterior = likelihood × prior / evidence.
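As a worked numeric sketch of Bayes' theorem (the spam-filter numbers below are made up for illustration):

```python
# Hypothetical numbers for illustration: a classifier flags spam.
p_spam = 0.2             # prior P(A): fraction of messages that are spam
p_flag_given_spam = 0.9  # likelihood P(B|A): flag rate on spam
p_flag_given_ham = 0.1   # flag rate on non-spam

# Evidence P(B) via the law of total probability
p_flag = p_flag_given_spam * p_spam + p_flag_given_ham * (1 - p_spam)

# Posterior P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_flag = p_flag_given_spam * p_spam / p_flag
print(round(p_spam_given_flag, 3))  # 0.692
```

Note how a high likelihood (0.9) still yields a moderate posterior because the prior is low.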
Temperature and top-p control the randomness of LLM generation by shaping the output probability distribution.
Information theory formalizes uncertainty measurement. Shannon entropy, KL divergence, and mutual information are used throughout ML.
The softmax function converts raw attention logits (the dot products between query and key vectors) into a probability distribution over positions. This distribution determines how much each position's value vector contributes to the output representation. The scaling inside the softmax (dividing the logits by the square root of the key dimension, √d_k) prevents the distribution from becoming too peaked on a single position when the key dimension is large, ensuring that gradients during training are not concentrated on the highest-scoring position and that attention can be distributed across multiple relevant positions simultaneously.
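A minimal NumPy sketch of these scaled attention weights; the function name and shapes are illustrative, not from any particular library:

```python
import numpy as np

def scaled_attention_weights(q, k):
    """Softmax over query-key dot products, scaled by sqrt(d_k).

    q: (d_k,) query vector; k: (n, d_k) matrix of key vectors.
    Returns a length-n probability distribution over positions.
    """
    d_k = q.shape[-1]
    logits = k @ q / np.sqrt(d_k)  # scaling keeps logits O(1) for large d_k
    logits -= logits.max()         # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

rng = np.random.default_rng(0)
q = rng.normal(size=64)
k = rng.normal(size=(5, 64))
w = scaled_attention_weights(q, k)  # non-negative weights summing to 1
```

Without the √d_k division, the variance of the dot products grows with d_k and the softmax saturates toward a one-hot distribution.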
Language models generate log-probabilities for each token in the vocabulary at every position, which are converted to probabilities via the softmax function. Log-probabilities are numerically more stable than raw probabilities because they avoid the floating-point underflow that occurs when multiplying many small probabilities together (as when computing the probability of a long sequence). Sequence-level log-probability is the sum of token-level log-probabilities, which corresponds to the product of the individual token probabilities. This relationship makes perplexity, the exponential of the average per-token negative log-probability, the standard measure of language model quality on a held-out test set.
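The sum-of-log-probs and perplexity relationship can be checked on a toy example (the per-token probabilities below are invented):

```python
import math

# Hypothetical per-token probabilities for a 4-token sequence.
token_probs = [0.5, 0.25, 0.125, 0.5]

# Sequence log-probability: sum of per-token log-probs
# (equals the log of the product of probabilities, avoiding underflow).
seq_logprob = sum(math.log(p) for p in token_probs)

# Perplexity: exp of the average negative log-probability per token.
ppl = math.exp(-seq_logprob / len(token_probs))
print(round(ppl, 4))  # 3.3636
```

A perplexity of ~3.4 means the model is, on average, about as uncertain as a uniform choice among 3.4 tokens.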
| Concept | Formula | LLM application |
|---|---|---|
| Softmax | exp(xα΅’) / Ξ£exp(xβ±Ό) | Token probability distribution, attention weights |
| Cross-entropy loss | -log P(y|x) | Training objective for next-token prediction |
| Perplexity | exp(mean(-log P(yα΅’))) | Evaluation metric; lower = better |
| Temperature scaling | softmax(logits / T) | Sharpens (T<1) or flattens (T>1) distribution |
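The table's softmax and temperature-scaling rows can be demonstrated directly; the `softmax` helper below is a hand-rolled illustration, not a library function:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """softmax(logits / T): T < 1 sharpens, T > 1 flattens."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
print(softmax(logits, temperature=1.0).round(3))
print(softmax(logits, temperature=0.5).round(3))  # sharper: top token gains mass
print(softmax(logits, temperature=2.0).round(3))  # flatter: mass spreads out
```

In the limit T → 0 this approaches greedy (argmax) decoding; as T → ∞ it approaches a uniform distribution.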
The multinomial distribution describes how tokens are sampled from a language model's vocabulary distribution at each generation step. At temperature 1.0, the model samples proportionally to the probability of each token. Top-p sampling (nucleus sampling) restricts the sampling distribution to the smallest subset of tokens whose cumulative probability exceeds p, dynamically adjusting the effective vocabulary size based on distribution sharpness. A peaked distribution with one dominant token produces a small nucleus; a flat distribution produces a large nucleus. This adaptive behavior makes top-p sampling more robust to distribution variation across generation steps than fixed top-k sampling.
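A minimal nucleus-sampling sketch, assuming a plain probability vector as input (the function name is hypothetical):

```python
import numpy as np

def top_p_sample(probs, p, rng):
    """Sample a token index from the smallest prefix of tokens (by
    descending probability) whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]              # tokens, most probable first
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1    # first index crossing p
    nucleus = order[:cutoff]
    renorm = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=renorm))

rng = np.random.default_rng(0)
peaked = np.array([0.90, 0.05, 0.03, 0.02])
flat = np.array([0.25, 0.25, 0.25, 0.25])
# With p = 0.9, the peaked distribution keeps a 1-token nucleus, while the
# flat one keeps all 4 tokens (cumulative 0.25, 0.50, 0.75, 1.00).
print(top_p_sample(peaked, 0.9, rng))  # always 0: only token 0 is in the nucleus
```

This shows the adaptive behavior described above: the same p yields a nucleus of 1 token for the peaked distribution and 4 for the flat one.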
Calibration, the alignment between a model's stated confidence and its actual accuracy, is an important probabilistic property for production LLM applications. A perfectly calibrated model would be correct 80% of the time on questions where it assigns 80% confidence. Most LLMs are overconfident: they assign high probabilities to incorrect answers more often than their accuracy justifies. Calibration error can be measured using Expected Calibration Error (ECE) on a labeled dataset, and mitigated through post-hoc temperature scaling that adjusts the model's logit scale to reduce overconfidence without changing its accuracy.
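A hand-rolled ECE sketch with equal-width confidence bins; the confidence/correctness values are invented for illustration:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average |mean confidence - accuracy| over
    equal-width confidence bins, weighted by bin occupancy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap   # weight by fraction of samples in bin
    return ece

# Hypothetical predictions from an overconfident model:
conf = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
hit  = [1,    1,    0,    0,    1,    0]
print(round(expected_calibration_error(conf, hit), 3))  # 0.35
```

Here the model claims 95% confidence but is right only half the time in that bin, which dominates the ECE.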
The relationship between perplexity and downstream task performance is non-linear and task-dependent. Reducing perplexity on a language modeling evaluation set (typically WikiText or C4) through continued pre-training does not guarantee proportional improvement on downstream tasks. Models trained on code-heavy corpora may have higher general-domain perplexity but lower perplexity on coding tasks, making them better code generators despite worse headline perplexity numbers. Task-specific evaluation benchmarks remain necessary alongside perplexity as quality indicators, because optimizing for perplexity alone can degrade performance on under-represented task types.
Beam search decoding uses probability theory to select sequences with higher overall likelihood than greedy decoding. Instead of selecting the most probable token at each step, beam search maintains k candidate sequences (the beam) and expands each by selecting the top tokens, keeping only the k highest-scoring expanded sequences. The beam score is the sum of log-probabilities of all selected tokens. Increasing beam width k improves sequence quality up to a point (typically k = 4 to 8 for summarization), beyond which marginal quality improvements are negligible while compute cost increases linearly with k. Length normalization, dividing the sum of log-probabilities by sequence length, prevents beam search from systematically preferring shorter sequences.
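The mechanics can be sketched with a toy bigram "model" (the transition table below is invented; a real LLM would supply the next-token log-probabilities):

```python
import heapq
import math

def beam_search(next_logprobs, start, steps, k, alpha=1.0):
    """Toy beam search. next_logprobs(token) -> {next_token: logprob}.
    Ranks candidates by sum of log-probs / len**alpha (length normalization)."""
    beams = [([start], 0.0)]  # (sequence, summed log-probability)
    for _ in range(steps):
        candidates = []
        for seq, lp in beams:
            for tok, tok_lp in next_logprobs(seq[-1]).items():
                candidates.append((seq + [tok], lp + tok_lp))
        # keep only the k highest-scoring expanded sequences
        beams = heapq.nlargest(k, candidates,
                               key=lambda c: c[1] / len(c[0]) ** alpha)
    return beams

# Hypothetical bigram probabilities over a 3-token vocabulary.
table = {
    "a": {"b": math.log(0.6), "c": math.log(0.4)},
    "b": {"a": math.log(0.7), "c": math.log(0.3)},
    "c": {"a": math.log(0.9), "b": math.log(0.1)},
}
best = beam_search(lambda t: table[t], "a", steps=3, k=2)
print(best[0][0])  # ['a', 'b', 'a', 'b']
```

With k = 1 this reduces to greedy decoding; larger k explores more of the sequence space at linearly higher cost, exactly as described above.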
Monte Carlo estimation in LLM evaluation approximates the expected quality of model outputs by averaging over multiple sampled responses to the same prompt. Because LLM outputs are stochastic (different at temperature > 0), single-sample evaluation has high variance: the measured quality of a single response may be unrepresentative of the model's average behavior. Running 10 to 50 samples per prompt and averaging quality scores provides more reliable quality estimates for noisy metrics, and sampling diverse responses enables better understanding of the model's output distribution, including its worst-case behaviors.
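A sketch of the averaging step, with a stand-in generator in place of a real model call (all names here are hypothetical):

```python
import random

def mc_quality(generate, score, prompt, n=30, seed=0):
    """Monte Carlo estimate of expected output quality: sample n
    responses to the same prompt and average their scores."""
    rng = random.Random(seed)
    scores = [score(generate(prompt, rng)) for _ in range(n)]
    mean = sum(scores) / n
    return mean, min(scores)  # average and worst-case sampled quality

# Stand-in for a stochastic LLM call: pretend each sampled response has a
# quality score drawn around 0.7 (no real model is invoked here).
def fake_generate(prompt, rng):
    return rng.gauss(0.7, 0.1)

mean_q, worst_q = mc_quality(fake_generate, lambda s: s, "prompt", n=50)
```

The standard error of the mean shrinks as 1/√n, which is why 10 to 50 samples per prompt markedly reduces the variance of a noisy metric.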