Positional Encoding

RoPE

Rotary Position Embeddings encode position by rotating query and key vectors in pairs of dimensions — giving relative positional awareness that generalises to longer sequences than seen during training.

Relative · Position aware · Used in Llama / Mistral · Extrapolates to long context

SECTION 01

Why positional encoding matters

Self-attention is permutation-equivariant: shuffle the input tokens and you get the same attention outputs, just in a different order. The model has no built-in sense of token order. For language, order is everything — "dog bites man" and "man bites dog" are opposite meanings.
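The permutation-equivariance is easy to check directly. A minimal single-head sketch (toy weights, no mask, softmax defined inline): shuffling the input rows produces the same outputs, shuffled the same way.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # Single-head scaled dot-product attention, no positional information
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V

rng = np.random.default_rng(0)
d = 8
X = rng.standard_normal((5, d))                              # 5 tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

perm = np.array([3, 0, 4, 1, 2])
out = attention(X, Wq, Wk, Wv)
out_perm = attention(X[perm], Wq, Wk, Wv)

# Shuffled input -> identically shuffled output: no sense of order
assert np.allclose(out[perm], out_perm)
```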

Positional encoding injects position information into token representations. The original transformer used fixed sinusoidal functions; BERT used learned absolute position embeddings. Both have a problem: they encode absolute position, not relative position. Token i at position 5 has no inherent relationship to token j at position 8 — you have to infer "3 apart" from their absolute encodings.

RoPE (Su et al. 2021) encodes position directly into the query-key dot product in a way that makes the attention score depend on the relative distance between tokens — regardless of their absolute positions. This is more natural (grammar cares about relative distance) and enables better length generalisation.

SECTION 02

How RoPE works

The key insight: if you rotate vector Q[m] (at position m) and vector K[n] (at position n) by angles proportional to their positions, their dot product Q[m]·K[n] will naturally depend only on (m-n) — the relative distance.

RoPE works on pairs of dimensions. For a d-dimensional vector, it rotates dimension pairs (0,1), (2,3), (4,5), ... at different frequencies. Each pair (2i, 2i+1) is rotated by angle m·θ_i at position m, where θ_i = base^(-2i/d) and base = 10000 (the same frequency schedule as sinusoidal PE).

For a query q at position m:

RoPE(q, m)[2i]   = q[2i]   · cos(m·θ_i) - q[2i+1] · sin(m·θ_i)
RoPE(q, m)[2i+1] = q[2i+1] · cos(m·θ_i) + q[2i]   · sin(m·θ_i)

The rotation is applied to both Q and K before the dot product. The result: (QKᵀ)[m,n] = f(q, k, m−n) — only relative position matters, not absolute.

SECTION 03

The rotation formula

import numpy as np

def get_rope_frequencies(d: int, base: float = 10000.0) -> np.ndarray:
    # Compute theta for each dimension pair: theta_i = base^(-2i/d)
    i = np.arange(0, d, 2)           # [0, 2, 4, ..., d-2]
    return 1.0 / (base ** (i / d))   # shape: (d/2,)

def apply_rope(x: np.ndarray, position: int, freqs: np.ndarray) -> np.ndarray:
    # x: (d,) vector at position `position`
    # freqs: (d/2,) rotation frequencies
    angles = position * freqs          # (d/2,)
    cos_a = np.cos(angles)             # (d/2,)
    sin_a = np.sin(angles)             # (d/2,)

    # Split into even/odd dimensions
    x_even = x[0::2]   # (d/2,)
    x_odd  = x[1::2]   # (d/2,)

    # Rotate each pair
    out = np.empty_like(x)
    out[0::2] = x_even * cos_a - x_odd * sin_a
    out[1::2] = x_odd  * cos_a + x_even * sin_a
    return out

# Verify: dot product depends only on relative position
d = 64
freqs = get_rope_frequencies(d)
q = np.random.randn(d)
k = np.random.randn(d)

q5  = apply_rope(q, position=5, freqs=freqs)
q10 = apply_rope(q, position=10, freqs=freqs)
k8  = apply_rope(k, position=8, freqs=freqs)
k13 = apply_rope(k, position=13, freqs=freqs)

# Both pairs are 3 apart, so the scores match exactly (up to float error)
print(f"q5·k8   (dist=3): {q5 @ k8:.4f}")
print(f"q10·k13 (dist=3): {q10 @ k13:.4f}")  # same value

SECTION 04

Python implementation

import torch

def precompute_rope_freqs(d: int, seq_len: int, base: float = 10000.0):
    # Precompute cos/sin for all positions up to seq_len
    theta = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))  # (d/2,)
    positions = torch.arange(seq_len).float()                       # (seq_len,)
    freqs = torch.outer(positions, theta)                           # (seq_len, d/2)
    return torch.cos(freqs), torch.sin(freqs)   # each (seq_len, d/2)

def apply_rope_torch(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    # x: (batch, heads, seq, d_k)
    # cos, sin: (seq, d_k/2)
    assert x.shape[-1] == 2 * cos.shape[-1]  # d_k must match the precomputed freqs
    x_even = x[..., 0::2]   # (batch, heads, seq, d_k/2)
    x_odd  = x[..., 1::2]

    # Broadcast cos/sin over batch and heads dimensions
    cos = cos.unsqueeze(0).unsqueeze(0)   # (1, 1, seq, d_k/2)
    sin = sin.unsqueeze(0).unsqueeze(0)

    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_odd  * cos + x_even * sin
    return out

# In a transformer forward pass:
d_k, seq_len = 64, 512
cos_freqs, sin_freqs = precompute_rope_freqs(d_k, seq_len)

Q = torch.randn(2, 8, seq_len, d_k)   # (batch, heads, seq, d_k)
K = torch.randn(2, 8, seq_len, d_k)

Q_rope = apply_rope_torch(Q, cos_freqs, sin_freqs)
K_rope = apply_rope_torch(K, cos_freqs, sin_freqs)
# Use Q_rope, K_rope in scaled dot-product attention

SECTION 05

RoPE vs sinusoidal vs ALiBi

Sinusoidal PE (original transformer): adds fixed sin/cos patterns to token embeddings before the first layer. Encodes absolute position. Doesn't generalise well beyond training length — no mechanism to extrapolate. Simple and still used in some models.

Learned absolute PE (BERT, GPT-2): same as sinusoidal but learned from data. Hard cap at training sequence length — a BERT-base trained on 512 tokens has no positional embedding for position 513. Fine for fixed-length tasks, bad for variable-length generation.
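The hard cap is easy to demonstrate with a plain lookup table. In the sketch below, the 512-row array is a stand-in for a learned position-embedding matrix; asking for position 513 simply has no row to return.

```python
import numpy as np

max_len, d = 512, 64
pos_table = np.zeros((max_len, d))   # stand-in for a learned position-embedding table

_ = pos_table[511]        # last trained position: fine
try:
    _ = pos_table[512]    # the 513th position: no row was ever learned
except IndexError:
    print("no embedding exists beyond the training length")
```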

RoPE (Llama, Mistral, Qwen, GPT-NeoX): rotates Q and K. Relative positions emerge naturally from the dot product. With extensions like YaRN (Yet another RoPE extensioN), models can be taken from 4K to 128K context post-training by rescaling the rotation frequencies.

ALiBi (BLOOM, MPT): instead of modifying embeddings, subtracts a linear bias from attention scores proportional to distance. Simpler than RoPE, strong extrapolation, but slightly lower quality on long-context retrieval tasks. Chosen for models where simplicity and extrapolation matter more than peak performance.
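A minimal single-head sketch of the ALiBi bias (the slope value here is illustrative; the actual models assign each head a slope from a fixed geometric sequence). The bias is simply subtracted from the attention scores before softmax, penalising distant tokens linearly.

```python
import numpy as np

def alibi_bias(seq_len: int, slope: float) -> np.ndarray:
    # bias[m, n] = -slope * (m - n) for n <= m; future positions get 0
    # here (they are removed by the causal mask in a real model anyway)
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]        # (seq, seq), entry = m - n
    return -slope * np.maximum(dist, 0)

bias = alibi_bias(5, slope=0.5)
# In attention: scores = Q @ K.T / sqrt(d_k) + bias, then softmax as usual
```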


SECTION 06

Long-context extrapolation

RoPE's base frequency determines how positions are encoded. With base=10000 and d=64, the lowest-frequency pair has θ = 10000^(-62/64) ≈ 1/7500 — completing one full rotation only every ~47,000 tokens. This is why RoPE doesn't degrade until sequences reach roughly 10K tokens with the default base.
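To make the frequency spread concrete, a quick computation of the rotation period of each dimension pair under the default schedule (the same θ_i formula as get_rope_frequencies above):

```python
import numpy as np

d, base = 64, 10000.0
theta = 1.0 / (base ** (np.arange(0, d, 2) / d))   # (d/2,) frequencies
periods = 2 * np.pi / theta                         # tokens per full rotation

# The fastest pair completes a rotation every 2*pi ≈ 6.3 tokens (local detail);
# the slowest pair takes tens of thousands of tokens (long-range signal).
print(f"fastest pair: one rotation every {periods[0]:.1f} tokens")
print(f"slowest pair: one rotation every {periods[-1]:.0f} tokens")
```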

To extend context, increase the base (YaRN uses base up to 500,000 for 128K context):

# Standard base — good to ~8K context
cos, sin = precompute_rope_freqs(d=64, seq_len=8192, base=10000.0)

# Extended base (as used in Llama-3.1 for 128K context)
cos, sin = precompute_rope_freqs(d=64, seq_len=131072, base=500000.0)

# Dynamic NTK scaling — increases base as needed at inference
def dynamic_ntk_base(d: int, seq_len: int, max_trained: int, alpha: float = 1.0):
    if seq_len <= max_trained:
        return 10000.0
    # Increases base proportionally when seq_len exceeds training length
    return 10000.0 * ((alpha * seq_len / max_trained) - (alpha - 1)) ** (d / (d - 2))

SECTION 07

Gotchas

RoPE is applied to the projected Q and K, not to the input embeddings. It rotates Q and K as part of the attention computation, inside every layer. Applying it to the raw embeddings (like adding sinusoidal PE to X before the first layer) does not produce the relative-position property in the attention scores; the rotation must happen on the per-head projected Q and K vectors.
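A compact sketch of the correct ordering: project to per-head Q and K first, then rotate. Feeding the same token vector at every position makes the relative-position property visible in the scores (rope_rotate here is a small single-head numpy re-implementation for illustration, not a library call):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    # x: (seq, d) already-projected vectors; rotate each dim pair by pos * theta_i
    seq, d = x.shape
    theta = 1.0 / (base ** (np.arange(0, d, 2) / d))      # (d/2,)
    ang = pos[:, None] * theta[None, :]                    # (seq, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 1::2] * cos + x[:, 0::2] * sin
    return out

rng = np.random.default_rng(0)
seq, d_model, d_k = 6, 16, 8
x_tok = rng.standard_normal(d_model)
X = np.tile(x_tok, (seq, 1))          # same token at every position
Wq = rng.standard_normal((d_model, d_k))
Wk = rng.standard_normal((d_model, d_k))
pos = np.arange(seq)

# Correct order: project first, THEN rotate the projected Q and K
Q = rope_rotate(X @ Wq, pos)
K = rope_rotate(X @ Wk, pos)
scores = Q @ K.T / np.sqrt(d_k)

# Identical tokens -> score depends only on the offset m - n
assert np.allclose(scores[1, 0], scores[4, 3])   # both offset 1
```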

Extending context requires fine-tuning, not just changing the base. Simply increasing the base at inference time degrades model quality because the model hasn't seen those relative position encodings during training. Effective long-context extension requires continued pre-training on long documents with the new base, plus techniques like YaRN's interpolation.

RoPE is not applied to V. Only Q and K are rotated; V is left as-is. This is intentional: the rotation only needs to affect the dot-product similarity score, not the values that are summed. Rotating V would leak position-dependent rotations into the output values without adding any positional signal to the attention scores.

RoPE Variants and Context Extension

Rotary Position Embedding (RoPE) encodes position by rotating query and key vectors in a position-dependent manner, so that the attention dot product naturally captures relative distance between tokens. This design makes RoPE more generalizable to sequence lengths beyond those seen during training compared to absolute position embeddings, and has made it the standard choice in most modern open-weight models.

Variant                  Max Context       Method                     Quality at Extension
Base RoPE                Training length   Fixed theta                Degrades beyond training
Position Interpolation   4–8× training     Scale positions down       Good with fine-tuning
YaRN                     4–32× training    NTK-aware interpolation    Strong, minimal fine-tuning
LongRoPE                 2M+ tokens        Non-uniform rescaling      Strong across all lengths

The theta parameter in RoPE controls how quickly rotations accumulate with distance. Larger theta values rotate more slowly, effectively compressing more positional information into the same angular range and allowing the model to handle longer sequences without the rotation cycling issue. Models like Llama 3 use a significantly higher base theta (500,000) compared to the original Llama (10,000); combined with continued training on long sequences, this is the primary mechanism behind their extended context windows.

YaRN (Yet Another RoPE extensioN) applies different scaling factors to different frequency components of the RoPE embedding. Low-frequency dimensions, which encode long-range positional relationships, are interpolated more aggressively, while high-frequency dimensions, which encode local structure, are kept closer to their original values. This targeted approach preserves local attention patterns — critical for coherent phrase-level generation — while successfully extending the effective context window.
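A toy illustration of the idea — the linear ramp below is NOT YaRN's actual formula (the real method uses wavelength-based thresholds and an attention-temperature correction), but it shows the shape of the approach: low-frequency pairs are interpolated by the scale factor while high-frequency pairs keep their original theta.

```python
import numpy as np

def blended_freqs(d: int, scale: float, base: float = 10000.0) -> np.ndarray:
    # Frequency-dependent interpolation sketch: ramp = 0 for the highest-
    # frequency (local) pair, 1 for the lowest-frequency (long-range) pair.
    # Low frequencies are divided by `scale`; high frequencies are untouched.
    theta = 1.0 / (base ** (np.arange(0, d, 2) / d))
    ramp = np.linspace(0.0, 1.0, theta.shape[0])
    return theta * (1 - ramp) + (theta / scale) * ramp

f = blended_freqs(64, scale=8.0)
orig = 1.0 / (10000.0 ** (np.arange(0, 64, 2) / 64))
# First pair unchanged (local structure preserved);
# last pair fully interpolated (long range stretched 8x)
```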

The choice of RoPE base theta has downstream effects on retrieval and attention patterns. With small theta values, position rotations cycle rapidly, causing the model to confuse distant tokens with nearby ones at long context lengths — a phenomenon sometimes called "position aliasing." Increasing theta delays the cycling, but if theta is too large, the model's pre-training never exposed it to meaningful rotational variation, potentially reducing its sensitivity to fine-grained position information at short ranges.

When fine-tuning a model on longer sequences to extend its effective context window, it is important to include long-document examples in the fine-tuning data that require the model to retrieve information from the beginning of the context to answer questions at the end. Without such training signal, the model may learn to handle longer position embeddings technically but never develop the attention patterns needed to actually use distant tokens effectively for multi-hop reasoning.