Transformers · Architecture

Positional Encoding

How transformers know token order — sinusoidal, learned, RoPE, ALiBi, and YaRN context extension

Contents
  1. Why position matters
  2. Comparison of methods
  3. Sinusoidal
  4. RoPE
  5. ALiBi
  6. Context extension
  7. Tools & libraries
  8. References
01 — Foundation

Why Position Matters

The problem: Transformers use self-attention: each token attends to all other tokens with weighted sums. Attention is permutation-invariant — the order of tokens doesn't matter to the mechanism. "Alice ate an apple" and "apple an ate Alice" have identical attention scores.

But order is crucial to meaning. Position encoding injects order information into the model.
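Permutation invariance is easy to verify directly. A minimal NumPy sketch (identity Q/K/V projections for simplicity) shows that shuffling the input tokens merely shuffles the output rows, so attention alone carries no order signal:

```python
import numpy as np

def self_attention(X):
    # Single-head self-attention with identity Q/K/V projections
    scores = X @ X.T / np.sqrt(X.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))          # 4 tokens, 8 dims, no position info
perm = np.array([2, 0, 3, 1])
out, out_perm = self_attention(X), self_attention(X[perm])
# Shuffling the input only shuffles the output rows: no order signal
assert np.allclose(out[perm], out_perm)
```

Without a positional signal, "Alice ate an apple" and its shuffle produce the same set of output vectors, just in a different order.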

Absolute position: Token 1 is "Alice", token 2 is "ate". Token position is distinct information.

Relative position: Token j is 5 positions after token i; the gap between tokens is what matters. Relative encodings generalise better: with absolute positions, a model trained on sequences of ≤1000 tokens struggles at 2000 tokens because the unseen positions are out-of-distribution.

💡 Key insight: Good positional encodings enable length generalisation: train on short sequences, test on longer ones. ALiBi and RoPE are superior because they encode relative position naturally.
02 — Comparison

Positional Encoding Methods

| Method | Extrapolation | Memory | Relative | Used in |
|---|---|---|---|---|
| Sinusoidal | Poor | O(1) | No | Original Transformer |
| Learned absolute | Poor | O(max_len) | No | BERT, GPT-2 |
| RoPE | Excellent | O(1) | Yes (implicit) | LLaMA, GPT-NeoX, PaLM |
| ALiBi | Excellent | O(1) | Yes (explicit) | BLOOM, MPT |
| NoPE | Limited | O(1) | Implicit (causal mask) | Research (not production) |

Extrapolation: Can the model generalise to longer sequences than training? Sinusoidal + learned absolute: no. RoPE + ALiBi: yes, but with some quality degradation.

Memory: Does the method require storing large tables? Sinusoidal and RoPE: no (computed on-the-fly). Learned: yes (embedding table grows with max_len).

03 — Classic

Sinusoidal Encoding

Vaswani et al. (2017): encode position m in dimension d using sine/cosine functions at different frequencies:

Formula:
PE(m, 2i) = sin(m / 10000^(2i/d))
PE(m, 2i+1) = cos(m / 10000^(2i/d))

For position m and dimension index i, compute either sine (even indices) or cosine (odd indices) at a frequency that decreases with dimension.

Why sinusoids: Different dimension pairs oscillate at different rates: the wavelengths form a geometric progression from 2π for the fastest pair up to 10000·2π for the slowest. Together they give every position a unique signature.

Linear transformation property: For any fixed offset k, PE(m+k) is a linear transformation of PE(m): each (sin, cos) pair is rotated by an angle k·ω_i that does not depend on m. This makes relative offsets easy for attention to pick up, though it does not by itself give reliable extrapolation.
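The shift property can be checked numerically for a single frequency pair: the offset-k rotation matrix is fixed, whatever m is.

```python
import numpy as np

w = 1.0 / 10000 ** (4 / 512)   # frequency of one (sin, cos) dimension pair
k = 7                          # fixed offset
# Rotation mapping (sin(m*w), cos(m*w)) to (sin((m+k)*w), cos((m+k)*w))
R = np.array([[np.cos(k * w),  np.sin(k * w)],
              [-np.sin(k * w), np.cos(k * w)]])
for m in (0, 13, 999):
    pe_m = np.array([np.sin(m * w), np.cos(m * w)])
    pe_mk = np.array([np.sin((m + k) * w), np.cos((m + k) * w)])
    assert np.allclose(R @ pe_m, pe_mk)   # same R works for every m
```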

Python: Sinusoidal PE (Numpy)

import numpy as np

def sinusoidal_pe(seq_len, d_model):
    position = np.arange(seq_len).reshape(-1, 1)
    dim = np.arange(0, d_model, 2)            # dim = 2i for i = 0..d/2-1
    # Frequency divisor: 10000^(2i/d); dim already equals 2i
    div_term = 10000.0 ** (dim / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(position / div_term)  # even dims: sin
    pe[:, 1::2] = np.cos(position / div_term)  # odd dims: cos
    return pe

pe = sinusoidal_pe(100, 512)  # 100 tokens, 512 dims
# Shape: (100, 512)

Limitations

Extrapolation is weak: a model trained on sequences ≤1024 will see PE values at new positions during inference on longer sequences. The learned patterns don't transfer well.

04 — Modern

Rotary Position Embedding (RoPE)

Core idea: Represent positions as rotations in 2D complex planes. Query and key vectors are rotated by an angle proportional to their position.

For position m and model dimension d, group the d dimensions into d/2 pairs. Each pair (x_{2i}, x_{2i+1}) is treated as a complex number x_{2i} + i·x_{2i+1} and rotated by the angle m·θ_i, where θ_i = θ_0^(−2i/d) and θ_0 is the base frequency (typically 10000).

Formula (per 2D block):
[cos(m·θ) -sin(m·θ)] [q_{2i} ]
[sin(m·θ) cos(m·θ)] [q_{2i+1}]

Why it works: The angle difference between position m and n is (m-n)·θ. When attention computes q_m · k_n^T, the rotation matrices interact to encode relative position in the similarity score. Relative position information is baked into the dot product.
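This can be checked with plain 2D rotations (a toy sketch of one RoPE block, not the full mechanism): shifting both positions by the same amount leaves the score unchanged, so only the gap m − n matters.

```python
import numpy as np

def rotate(v, angle):
    # Rotate a 2D vector counter-clockwise by `angle`
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

theta = 0.1
q, k = np.array([0.3, -1.2]), np.array([0.7, 0.5])
score = lambda m, n: rotate(q, m * theta) @ rotate(k, n * theta)
# Only the gap m - n matters: shifting both positions changes nothing
assert np.isclose(score(5, 2), score(105, 102))
# A different gap gives a different score
assert not np.isclose(score(5, 2), score(5, 4))
```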

Key Advantages

Extrapolation: Trained with a 4k context, a model can reach 8k by adjusting the base frequency. Quality degrades, but far more gracefully than with sinusoidal encodings.

Memory efficient: No embedding table. Rotations are applied on-the-fly.

Relative position implicit: Attention dot product naturally encodes (m - n).

Python: RoPE Implementation

import torch

def rope_rotations(seq_len, d_model, theta_0=10000):
    inv_freq = 1.0 / (
        theta_0 ** (torch.arange(0, d_model, 2).float() / d_model)
    )
    t = torch.arange(seq_len, dtype=inv_freq.dtype)
    freqs = torch.einsum("i,j->ij", t, inv_freq)
    # Duplicate so cos/sin cover the full d_model (half-split convention)
    emb = torch.cat([freqs, freqs], dim=-1)
    return emb.cos(), emb.sin()

def rotate_half(x):
    # (x1, x2) -> (-x2, x1), where x1/x2 are the two halves of the last dim
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(q, k, cos, sin):
    # Pairwise rotation: (x, y) -> (x*cos - y*sin, y*cos + x*sin)
    q_out = q * cos + rotate_half(q) * sin
    k_out = k * cos + rotate_half(k) * sin
    return q_out, k_out
⚠️ RoPE base frequency: The θ_0 (often 10000) determines extrapolation range. Larger θ_0 → longer natural context. For 32k context, use θ_0 ≈ 1M. For dynamic context, use NTK-aware scaling (see Context Extension).
05 — Linear Bias

ALiBi: Attention with Linear Biases

Simplest idea: Don't embed position. Instead, bias attention scores by the distance between tokens.

Compute attention normally: scores = (Q·K^T) / √d. But before softmax, add a linear bias penalty:

scores[i, j] ← scores[i, j] − α_h·|i − j|

The slope α_h is head-specific: for n heads, the slopes form the geometric sequence 2^(−8h/n) for h = 1..n, so each head attends over a different range. Tokens far apart get reduced attention. No position vectors are needed.

Advantages: no embedding table (O(1) memory), relative position is explicit in the score, and the bias formula is identical at every sequence length, which is what makes extrapolation strong.

Python: ALiBi in Attention

import torch

def apply_alibi(scores, n_heads, seq_len):
    # scores shape: (batch, n_heads, seq_len, seq_len)
    # Distance matrix |i - j|
    i = torch.arange(seq_len, device=scores.device)
    distance = torch.abs(i.unsqueeze(1) - i.unsqueeze(0))
    # Head-specific slopes: geometric sequence 2^(-8h/n_heads),
    # so each head attends over a different range
    head_id = torch.arange(1, n_heads + 1, device=scores.device)
    alpha = 1.0 / (2 ** (8 * head_id / n_heads))
    bias = -alpha.view(1, n_heads, 1, 1) * distance
    return scores + bias

When to Use ALiBi

ALiBi is excellent for long-context models; it is used in BLOOM and MPT. Empirically, ALiBi generalises better than sinusoidal or learned encodings under extrapolation, though RoPE variants edge it out on some benchmarks.

06 — Scaling

Context Length Extension Methods

The problem: a model trained on 4k tokens can't handle 8k without position information going out-of-distribution. How do we extend?

NTK-Aware Scaling and YaRN

For RoPE, the base frequency θ_0 controls the usable context. Scaling it up stretches the rotation frequencies so that longer sequences stay within the angle range seen in training.

YaRN (Yet another RoPE extensioN) builds on NTK-aware scaling: instead of one global factor, it interpolates per frequency band. High-frequency (local) dimensions are left largely untouched, low-frequency (long-range) dimensions are interpolated, and the attention logits are rescaled for stability.

Formula (NTK-aware): θ_0' = θ_0 · s^(d/(d−2)), where s is the ratio of target context to training context.

Why it works: Rotations are still meaningful but compressed into the original position range. The model's attention patterns (learned on 4k) can extrapolate.
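A minimal numeric sketch of the NTK-aware base adjustment, using the common θ' = θ·s^(d/(d−2)) heuristic (exact recipes vary between implementations). Under this scaling the slowest RoPE frequency drops by exactly 1/s, so its wavelength stretches to cover the longer context:

```python
# NTK-aware base scaling for RoPE context extension (heuristic sketch)
d = 128                      # per-head dimension
s = 8192 / 4096              # extension factor: 4k trained -> 8k target
theta_0 = 10000.0
theta_scaled = theta_0 * s ** (d / (d - 2))

# Slowest RoPE frequency: theta^(-(d-2)/d), reached at the last dim pair.
# After scaling it is exactly 1/s of the original, so the slowest rotation
# spans s times as many positions before repeating.
old_min_freq = theta_0 ** (-(d - 2) / d)
new_min_freq = theta_scaled ** (-(d - 2) / d)
assert abs(new_min_freq * s - old_min_freq) < 1e-12
```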

LongRoPE: Dynamic Scaling

Extends further by selectively scaling different dimensions. Low frequencies (long-range) are scaled more aggressively, high frequencies (local) less. Fine-tuning on a sample of long sequences refines the scaling.

Comparison: Extension Methods

| Method | Base → Target | Perplexity degradation | Fine-tuning required |
|---|---|---|---|
| No extension | 4k → 8k | 2–5% (severe) | Not helpful |
| NTK scaling | 4k → 8k | 0.5–1% | No (but improves) |
| YaRN | 4k → 32k | < 0.5% | Optional |
| LongRoPE | 4k → 128k | < 0.3% | Recommended |

Practical recommendation: For 2–4× extension, NTK scaling works. For 8–32× extension, use YaRN + fine-tuning. For 32×+ extension, use LongRoPE.

💡 Research frontier: Context length extension is an active area. The next generation of LLMs will have native 100k+ context windows via architectural changes (sparse attention, memory augmentation) rather than position encoding tricks.
07 — Ecosystem

Tools & Libraries

Framework
HuggingFace Transformers
Standard implementations of RoPE, ALiBi, sinusoidal. Easy swapping.
Inference
llama.cpp
Optimised inference. Built-in RoPE scaling for long context.
Inference
vLLM
High-throughput serving. Supports RoPE extension natively.
Inference
ExLlamaV2
Fast LLaMA inference. RoPE optimisations.
Training
MegaBlocks
Efficient mixture-of-experts (MoE) training kernels.
Training
FlexAttention (PyTorch)
Custom attention kernels. Implement custom position encodings easily.
Fine-tuning
torchtune
Meta's fine-tuning framework. YaRN context extension built-in.
08 — Further Reading

References

Academic Papers

Vaswani et al. (2017). Attention Is All You Need.
Su et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding.
Press et al. (2021). Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.
Peng et al. (2023). YaRN: Efficient Context Window Extension of Large Language Models.
Ding et al. (2024). LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens.