How transformers know token order — sinusoidal, learned, RoPE, ALiBi, and YaRN context extension
The problem: Transformers use self-attention: each token attends to all other tokens via weighted sums. Attention is permutation-invariant — the order of tokens doesn't matter to the mechanism. The attention weights for "Alice ate an apple" and "apple an ate Alice" are identical up to the same reordering.
But order is crucial to meaning. Position encoding injects order information into the model.
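The invariance is easy to verify numerically. A minimal sketch (toy sizes, random weights; `attn_weights` is a hypothetical helper, not a library API):

```python
import numpy as np

def attn_weights(x, Wq, Wk):
    """Softmax attention weights for token embeddings x, with no
    position encoding anywhere in the computation."""
    q, k = x @ Wq, x @ Wk
    scores = q @ k.T / np.sqrt(k.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # 4 token embeddings, dim 8
Wq = rng.normal(size=(8, 8))
Wk = rng.normal(size=(8, 8))

perm = [2, 0, 3, 1]                  # reorder the tokens
a = attn_weights(x, Wq, Wk)
a_perm = attn_weights(x[perm], Wq, Wk)

# The weights are identical up to the same relabelling of rows/columns:
assert np.allclose(a_perm, a[np.ix_(perm, perm)])
```

Shuffling the input only shuffles the weight matrix the same way — the mechanism itself never sees the order.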
Absolute position: Token 1 is "Alice", token 2 is "ate". Token position is distinct information.
Relative position: Token j is 5 positions after token i; the gap between tokens is what matters. This is more generalisable: with absolute positions, a model trained on sequences ≤1000 tokens struggles on 2000-token sequences because positions 1001–2000 are out-of-distribution, whereas relative distances mostly stay in-distribution.
| Method | Extrapolation | Memory | Relative | Used in |
|---|---|---|---|---|
| Sinusoidal | Poor | O(1) | No | Original Transformer |
| Learned absolute | Poor | O(max_len) | No | BERT, GPT-2 |
| RoPE | Excellent | O(1) | Yes (implicit) | LLaMA, PaLM |
| ALiBi | Excellent | O(1) | Yes (explicit) | BLOOM, MPT |
| NoPE | Good on some tasks | O(1) | Implicit in data | Research (not production) |
Extrapolation: Can the model generalise to longer sequences than training? Sinusoidal + learned absolute: no. RoPE + ALiBi: yes, but with some quality degradation.
Memory: Does the method require storing large tables? Sinusoidal and RoPE: no (computed on-the-fly). Learned: yes (embedding table grows with max_len).
Sinusoidal encoding (Vaswani et al., 2017): encode position m as a d-dimensional vector built from sine/cosine functions at different frequencies:
Formula:
PE(m, 2i) = sin(m / 10000^(2i/d))
PE(m, 2i+1) = cos(m / 10000^(2i/d))
For position m and dimension index i, compute either sine (even indices) or cosine (odd indices) at a frequency that decreases with dimension.
Why sinusoids: Different dimension pairs oscillate at different rates — the wavelengths form a geometric progression from 2π (≈6 tokens) for the first pair up to 10000·2π for the last. Together the dimensions give each position a unique signature, much like the digits of a number in a mixed-radix counter.
Linear transformation property: For any fixed offset k, PE(m+k) = T_k·PE(m), where T_k is a position-independent linear map (a block rotation). In principle this lets attention learn to look at relative offsets.
In practice extrapolation is weak: a model trained on sequences ≤1024 sees PE values at unfamiliar positions when run on longer inputs, and the learned patterns don't transfer well.
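Both the formula and the linearity claim can be checked in a few lines of NumPy; `sinusoidal_pe` is my own name for the sketch, not a library function:

```python
import numpy as np

def sinusoidal_pe(max_len, d):
    """PE[m, 2i] = sin(m / 10000^(2i/d)), PE[m, 2i+1] = cos(m / 10000^(2i/d))."""
    pe = np.zeros((max_len, d))
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    freq = 10000.0 ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    pe[:, 0::2] = np.sin(pos * freq)
    pe[:, 1::2] = np.cos(pos * freq)
    return pe

d = 64
pe = sinusoidal_pe(128, d)

# Linear-transformation property: PE(m+k) is a fixed rotation of PE(m),
# because sin((m+k)f) = sin(mf)cos(kf) + cos(mf)sin(kf), pair by pair:
m, k = 10, 7
freq = 10000.0 ** (-np.arange(0, d, 2) / d)
rotated = pe[m, 0::2] * np.cos(k * freq) + pe[m, 1::2] * np.sin(k * freq)
assert np.allclose(rotated, pe[m + k, 0::2])
```

The rotation depends only on the offset k, which is exactly why a model could in principle learn offset-based attention on top of these encodings.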
Rotary Position Embedding (RoPE). Core idea: Represent positions as rotations in 2D planes. Query and key vectors are rotated by an angle proportional to their position.
For head dimension d, group the d dimensions into d/2 pairs. Pair i, (x_{2i}, x_{2i+1}), is treated as the complex number x_{2i} + i·x_{2i+1} and, at position m, rotated by the angle m·θ_i, where θ_i = 10000^(−2i/d) — each pair has its own frequency, just as in the sinusoidal scheme.
Formula (per 2D block):
[cos(m·θ)  −sin(m·θ)] [q_{2i}  ]
[sin(m·θ)   cos(m·θ)] [q_{2i+1}]
Why it works: The angle difference between positions m and n is (m−n)·θ. When attention computes q_m · k_n, the two rotations compose into a single rotation by (m−n)·θ inside the dot product, so the score depends on positions only through their difference. Relative position is baked into the similarity score.
Extrapolation: A model trained with 4k context can handle 8k by rescaling the base frequency (see context extension below). Quality degrades, but far more gracefully than with sinusoidal encodings.
Memory efficient: No embedding table. Rotations are applied on-the-fly.
Relative position implicit: Attention dot product naturally encodes (m - n).
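This relative-position property can be verified directly. A minimal sketch, where `rope_rotate` is a hypothetical helper (real implementations usually use a half-split layout rather than interleaved pairs):

```python
import numpy as np

def rope_rotate(x, m, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) of vector x by angle m * theta_i,
    where theta_i = base^(-2i/d)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    ang = m * theta
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# The score depends only on the offset m - n, not on absolute positions:
s_near = rope_rotate(q, 5) @ rope_rotate(k, 2)        # offset 3
s_far = rope_rotate(q, 1005) @ rope_rotate(k, 1002)   # offset 3, shifted
assert np.isclose(s_near, s_far)
```

Shifting both positions by 1000 leaves the dot product unchanged — the rotations cancel except for the relative offset.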
ALiBi (Attention with Linear Biases). Simplest idea: Don't embed position at all. Instead, bias attention scores by the distance between tokens.
Compute attention normally: scores = (Q·K^T) / √d. But before softmax, add a linear bias penalty:
scores[i, j] ← scores[i, j] − α·|i − j|
The slope α is a fixed, non-learned constant per attention head: with H heads, the slopes form the geometric sequence 2^(−8h/H), e.g. 1/2, 1/4, …, 1/256 for 8 heads. Distant tokens get their attention reduced. No position vectors needed.
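A sketch of the bias matrix, following the paper's per-head geometric slopes (symmetric |i−j| form as shown above; causal models apply the bias only for j ≤ i):

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Per-head ALiBi biases: slope_h = 2^(-8h/H) for h = 1..H, and
    bias[h, i, j] = -slope_h * |i - j|, added to scores before softmax."""
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])
    return -slopes[:, None, None] * dist          # (heads, seq, seq)

bias = alibi_bias(seq_len=5, num_heads=8)
assert bias.shape == (8, 5, 5)
assert np.isclose(bias[0, 0, 1], -0.5)  # first head's slope is 2^-1 = 1/2
# usage: scores = q @ k.T / np.sqrt(d) + bias[h], then softmax
```

Because the bias depends only on |i − j|, the same matrix slice works for any sequence length — which is exactly what makes extrapolation cheap.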
Advantages: No embedding table (O(1) memory), an explicit relative bias, and excellent length extrapolation — ALiBi was designed for "train short, test long". It is used in BLOOM and MPT. Empirically, ALiBi generalises to longer sequences far better than sinusoidal or learned encodings, though RoPE-based models edge it out on some benchmarks.
Context extension. The problem: a model trained on 4k tokens can't handle 8k without position information going out-of-distribution. How do we extend?
For RoPE, the base (typically 10000) determines the frequency spectrum: the lowest frequencies set how far apart two positions can be before their angles wrap around. Raising the base stretches the usable position range — but done naively it shifts every frequency the model was trained on, so how the scaling is applied matters.
YaRN (Yet Another RoPE extensioN): Rescale the RoPE frequencies according to the ratio of target to training context, treating frequency bands differently.
Formula (per dimension pair): θ'_i = θ_i / s_i, where the scale s_i ramps from 1 for high-frequency (local) pairs up to the context ratio for low-frequency (long-range) pairs; a mild attention-temperature correction is applied on top.
Why it works: Long-range rotations are compressed back into the angle range the model saw during training, while local positional detail is preserved, so attention patterns learned at 4k transfer to the longer context.
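The simplest member of this family, NTK-aware base scaling (listed as "NTK scaling" in the table below), is easy to sketch; YaRN's per-dimension ramp and attention-temperature correction are omitted here:

```python
import numpy as np

def rope_freqs(d, base=10000.0):
    """Standard RoPE frequencies, one per dimension pair."""
    return base ** (-np.arange(0, d, 2) / d)

def ntk_scaled_freqs(d, scale, base=10000.0):
    """NTK-aware scaling: raise the base so that the lowest frequency is
    stretched by `scale` while the highest is left unchanged."""
    new_base = base * scale ** (d / (d - 2))
    return new_base ** (-np.arange(0, d, 2) / d)

d, scale = 64, 2.0        # e.g. extending a 4k-context model toward 8k
f0, f1 = rope_freqs(d), ntk_scaled_freqs(d, scale)

assert np.isclose(f1[0], f0[0])            # local (high-freq) pairs intact
assert np.isclose(f0[-1] / f1[-1], scale)  # long-range pairs stretched 2x
```

The exponent d/(d−2) is chosen precisely so that the last (lowest-frequency) pair is slowed down by exactly the context ratio while the first pair is untouched.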
LongRoPE extends further by searching for per-dimension scaling factors: low frequencies (long-range) are scaled more aggressively, high frequencies (local) less, and fine-tuning on a sample of long sequences refines the result.
| Method | Base → Target | Perplexity Degradation | Fine-tuning Required |
|---|---|---|---|
| No extension | 4k → 8k | 2–5% (severe) | Not helpful |
| NTK scaling | 4k → 8k | 0.5–1% | No (but improves) |
| YaRN | 4k → 32k | < 0.5% | Optional |
| LongRoPE | 4k → 128k | < 0.3% | Recommended |
Practical recommendation: For 2–4× extension, NTK scaling works. For 8–32× extension, use YaRN + fine-tuning. For 32×+ extension, use LongRoPE.