How transformers know token order — sinusoidal, learned, RoPE, ALiBi, and YaRN context extension
The problem: Transformers use self-attention: each token attends to all other tokens via weighted sums. Attention is permutation-invariant — the order of tokens doesn't matter to the mechanism. The attention weights for "Alice ate an apple" and "apple an ate Alice" are identical up to the same reordering.
But order is crucial to meaning. Position encoding injects order information into the model.
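The invariance is easy to verify numerically. A minimal sketch (toy sizes, random weights; `attn_weights` is a hypothetical helper, not a library API):

```python
import numpy as np

def attn_weights(x, Wq, Wk):
    """Softmax attention weights for token embeddings x, with no
    position encoding anywhere in the computation."""
    q, k = x @ Wq, x @ Wk
    scores = q @ k.T / np.sqrt(k.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # 4 token embeddings, dim 8
Wq = rng.normal(size=(8, 8))
Wk = rng.normal(size=(8, 8))

perm = [2, 0, 3, 1]                  # reorder the tokens
a = attn_weights(x, Wq, Wk)
a_perm = attn_weights(x[perm], Wq, Wk)

# The weights are identical up to the same relabelling of rows/columns:
assert np.allclose(a_perm, a[np.ix_(perm, perm)])
```

Shuffling the input only shuffles the weight matrix the same way — the mechanism itself never sees the order.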
Absolute position: Token 1 is "Alice", token 2 is "ate". Token position is distinct information.
Relative position: Token j is 5 positions after token i; the gap between tokens is what matters. This is more generalisable: with absolute positions, a model trained on sequences ≤1000 tokens struggles on 2000-token sequences because positions 1001–2000 are out-of-distribution, whereas relative distances mostly stay in-distribution.
| Method | Extrapolation | Memory | Relative | Used in |
|---|---|---|---|---|
| Sinusoidal | Poor | O(1) | No | Original Transformer |
| Learned absolute | Poor | O(max_len) | No | BERT, GPT-2 |
| RoPE | Excellent | O(1) | Yes (implicit) | LLaMA, PaLM |
| ALiBi | Excellent | O(1) | Yes (explicit) | BLOOM, MPT |
| NoPE | Good on some tasks | O(1) | Implicit in data | Research (not production) |
Extrapolation: Can the model generalise to longer sequences than training? Sinusoidal + learned absolute: no. RoPE + ALiBi: yes, but with some quality degradation.
Memory: Does the method require storing large tables? Sinusoidal and RoPE: no (computed on-the-fly). Learned: yes (embedding table grows with max_len).
Sinusoidal encoding (Vaswani et al., 2017): encode position m as a d-dimensional vector built from sine/cosine functions at different frequencies:
Formula:
PE(m, 2i) = sin(m / 10000^(2i/d))
PE(m, 2i+1) = cos(m / 10000^(2i/d))
For position m and dimension index i, compute either sine (even indices) or cosine (odd indices) at a frequency that decreases with dimension.
Why sinusoids: Different dimension pairs oscillate at different rates — the wavelengths form a geometric progression from 2π (≈6 tokens) for the first pair up to 10000·2π for the last. Together the dimensions give each position a unique signature, much like the digits of a number in a mixed-radix counter.
Linear transformation property: For any fixed offset k, PE(m+k) = T_k·PE(m), where T_k is a position-independent linear map (a block rotation). In principle this lets attention learn to look at relative offsets.
In practice extrapolation is weak: a model trained on sequences ≤1024 sees PE values at unfamiliar positions when run on longer inputs, and the learned patterns don't transfer well.
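Both the formula and the linearity claim can be checked in a few lines of NumPy; `sinusoidal_pe` is my own name for the sketch, not a library function:

```python
import numpy as np

def sinusoidal_pe(max_len, d):
    """PE[m, 2i] = sin(m / 10000^(2i/d)), PE[m, 2i+1] = cos(m / 10000^(2i/d))."""
    pe = np.zeros((max_len, d))
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    freq = 10000.0 ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    pe[:, 0::2] = np.sin(pos * freq)
    pe[:, 1::2] = np.cos(pos * freq)
    return pe

d = 64
pe = sinusoidal_pe(128, d)

# Linear-transformation property: PE(m+k) is a fixed rotation of PE(m),
# because sin((m+k)f) = sin(mf)cos(kf) + cos(mf)sin(kf), pair by pair:
m, k = 10, 7
freq = 10000.0 ** (-np.arange(0, d, 2) / d)
rotated = pe[m, 0::2] * np.cos(k * freq) + pe[m, 1::2] * np.sin(k * freq)
assert np.allclose(rotated, pe[m + k, 0::2])
```

The rotation depends only on the offset k, which is exactly why a model could in principle learn offset-based attention on top of these encodings.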
Rotary Position Embedding (RoPE). Core idea: Represent positions as rotations in 2D planes. Query and key vectors are rotated by an angle proportional to their position.
For head dimension d, group the d dimensions into d/2 pairs. Pair i, (x_{2i}, x_{2i+1}), is treated as the complex number x_{2i} + i·x_{2i+1} and, at position m, rotated by the angle m·θ_i, where θ_i = 10000^(−2i/d) — each pair has its own frequency, just as in the sinusoidal scheme.
Formula (per 2D block):
[cos(m·θ)  −sin(m·θ)] [q_{2i}  ]
[sin(m·θ)   cos(m·θ)] [q_{2i+1}]
Why it works: The angle difference between positions m and n is (m−n)·θ. When attention computes q_m · k_n, the two rotations compose into a single rotation by (m−n)·θ inside the dot product, so the score depends on positions only through their difference. Relative position is baked into the similarity score.
Extrapolation: A model trained with 4k context can handle 8k by rescaling the base frequency (see context extension below). Quality degrades, but far more gracefully than with sinusoidal encodings.
Memory efficient: No embedding table. Rotations are applied on-the-fly.
Relative position implicit: Attention dot product naturally encodes (m - n).
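This relative-position property can be verified directly. A minimal sketch, where `rope_rotate` is a hypothetical helper (real implementations usually use a half-split layout rather than interleaved pairs):

```python
import numpy as np

def rope_rotate(x, m, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) of vector x by angle m * theta_i,
    where theta_i = base^(-2i/d)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    ang = m * theta
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# The score depends only on the offset m - n, not on absolute positions:
s_near = rope_rotate(q, 5) @ rope_rotate(k, 2)        # offset 3
s_far = rope_rotate(q, 1005) @ rope_rotate(k, 1002)   # offset 3, shifted
assert np.isclose(s_near, s_far)
```

Shifting both positions by 1000 leaves the dot product unchanged — the rotations cancel except for the relative offset.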
ALiBi (Attention with Linear Biases). Simplest idea: Don't embed position at all. Instead, bias attention scores by the distance between tokens.
Compute attention normally: scores = (Q·K^T) / √d. But before softmax, add a linear bias penalty:
scores[i, j] ← scores[i, j] − α·|i − j|
The slope α is a fixed, non-learned constant per attention head: with H heads, the slopes form the geometric sequence 2^(−8h/H), e.g. 1/2, 1/4, …, 1/256 for 8 heads. Distant tokens get their attention reduced. No position vectors needed.
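A sketch of the bias matrix, following the paper's per-head geometric slopes (symmetric |i−j| form as shown above; causal models apply the bias only for j ≤ i):

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Per-head ALiBi biases: slope_h = 2^(-8h/H) for h = 1..H, and
    bias[h, i, j] = -slope_h * |i - j|, added to scores before softmax."""
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])
    return -slopes[:, None, None] * dist          # (heads, seq, seq)

bias = alibi_bias(seq_len=5, num_heads=8)
assert bias.shape == (8, 5, 5)
assert np.isclose(bias[0, 0, 1], -0.5)  # first head's slope is 2^-1 = 1/2
# usage: scores = q @ k.T / np.sqrt(d) + bias[h], then softmax
```

Because the bias depends only on |i − j|, the same matrix slice works for any sequence length — which is exactly what makes extrapolation cheap.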
Advantages: No embedding table (O(1) memory), an explicit relative bias, and excellent length extrapolation — ALiBi was designed for "train short, test long". It is used in BLOOM and MPT. Empirically, ALiBi generalises to longer sequences far better than sinusoidal or learned encodings, though RoPE-based models edge it out on some benchmarks.
Context extension. The problem: a model trained on 4k tokens can't handle 8k without position information going out-of-distribution. How do we extend?
For RoPE, the base (typically 10000) determines the frequency spectrum: the lowest frequencies set how far apart two positions can be before their angles wrap around. Raising the base stretches the usable position range — but done naively it shifts every frequency the model was trained on, so how the scaling is applied matters.
YaRN (Yet Another RoPE extensioN): Rescale the RoPE frequencies according to the ratio of target to training context, treating frequency bands differently.
Formula (per dimension pair): θ'_i = θ_i / s_i, where the scale s_i ramps from 1 for high-frequency (local) pairs up to the context ratio for low-frequency (long-range) pairs; a mild attention-temperature correction is applied on top.
Why it works: Long-range rotations are compressed back into the angle range the model saw during training, while local positional detail is preserved, so attention patterns learned at 4k transfer to the longer context.
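The simplest member of this family, NTK-aware base scaling (listed as "NTK scaling" in the table below), is easy to sketch; YaRN's per-dimension ramp and attention-temperature correction are omitted here:

```python
import numpy as np

def rope_freqs(d, base=10000.0):
    """Standard RoPE frequencies, one per dimension pair."""
    return base ** (-np.arange(0, d, 2) / d)

def ntk_scaled_freqs(d, scale, base=10000.0):
    """NTK-aware scaling: raise the base so that the lowest frequency is
    stretched by `scale` while the highest is left unchanged."""
    new_base = base * scale ** (d / (d - 2))
    return new_base ** (-np.arange(0, d, 2) / d)

d, scale = 64, 2.0        # e.g. extending a 4k-context model toward 8k
f0, f1 = rope_freqs(d), ntk_scaled_freqs(d, scale)

assert np.isclose(f1[0], f0[0])            # local (high-freq) pairs intact
assert np.isclose(f0[-1] / f1[-1], scale)  # long-range pairs stretched 2x
```

The exponent d/(d−2) is chosen precisely so that the last (lowest-frequency) pair is slowed down by exactly the context ratio while the first pair is untouched.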
LongRoPE extends further by searching for per-dimension scaling factors: low frequencies (long-range) are scaled more aggressively, high frequencies (local) less, and fine-tuning on a sample of long sequences refines the result.
| Method | Base → Target | Perplexity Degradation | Fine-tuning Required |
|---|---|---|---|
| No extension | 4k → 8k | 2–5% (severe) | Not helpful |
| NTK scaling | 4k → 8k | 0.5–1% | No (but improves) |
| YaRN | 4k → 32k | < 0.5% | Optional |
| LongRoPE | 4k → 128k | < 0.3% | Recommended |
Practical recommendation: For 2–4× extension, NTK scaling works. For 8–32× extension, use YaRN + fine-tuning. For 32×+ extension, use LongRoPE.