The original Transformer positional encoding: add fixed sin/cos waves of different frequencies to token embeddings. Foundational — understanding it is prerequisite for RoPE, ALiBi, and modern variants.
The self-attention mechanism is permutation-invariant: if you shuffle the tokens in the input, the attention computation produces the same result (just shuffled). This is fundamentally wrong for language — word order matters. Positional encoding injects position information into token embeddings so the model can distinguish "dog bites man" from "man bites dog". Sinusoidal PE was the original solution proposed in "Attention Is All You Need" (Vaswani et al. 2017).
For a token at position pos in the sequence and embedding dimension pair i:

```
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
```
This creates a unique pattern for each position: each dimension oscillates at a different frequency, from very fast (i=0, wavelength 2π) to very slow (i = d_model/2 − 1, wavelength approaching 10000·2π). The superposition of these frequencies creates a unique "fingerprint" for each position, analogous to binary counting, where different bit positions toggle at different frequencies.
```python
import torch
import math

def sinusoidal_pe(max_seq_len: int, d_model: int) -> torch.Tensor:
    pe = torch.zeros(max_seq_len, d_model)
    position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
    # div_term: 1 / 10000^(2i/d_model) for i = 0, 1, ..., d_model/2 - 1
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(position * div_term)  # even dims
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dims
    return pe  # (max_seq_len, d_model)
```
```python
class TransformerWithPE(torch.nn.Module):
    def __init__(self, vocab_size, d_model, max_seq_len):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, d_model)
        self.register_buffer("pe", sinusoidal_pe(max_seq_len, d_model))

    def forward(self, x):
        # x: (batch, seq_len)
        seq_len = x.shape[1]
        emb = self.embedding(x)        # (batch, seq, d_model)
        emb = emb + self.pe[:seq_len]  # broadcast PE over batch
        return emb
```
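A quick sanity check of the addition in `forward`: the PE slice broadcasts over the batch dimension, and all encoding values stay bounded. The snippet restates the PE function compactly so it runs standalone; the sizes are arbitrary.

```python
import math
import torch

def sinusoidal_pe(max_seq_len: int, d_model: int) -> torch.Tensor:
    # Same computation as above, restated so this snippet is self-contained.
    pe = torch.zeros(max_seq_len, d_model)
    position = torch.arange(max_seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_pe(128, 32)
emb = torch.randn(4, 10, 32)          # (batch, seq_len, d_model)
out = emb + pe[:10]                   # (10, 32) broadcasts over the batch dim
assert out.shape == (4, 10, 32)
assert pe.abs().max().item() <= 1.0   # sin/cos values are bounded in [-1, 1]
```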
```python
# Visualise PE
import matplotlib.pyplot as plt

pe = sinusoidal_pe(100, 64)
plt.figure(figsize=(10, 4))
plt.pcolormesh(pe.numpy().T, cmap="RdBu")
plt.xlabel("Position"); plt.ylabel("Dimension")
plt.colorbar(); plt.title("Sinusoidal Positional Encoding"); plt.show()
```
Three key properties make sinusoidal PE work:

- Uniqueness: the superposition of frequencies gives every position a distinct fingerprint.
- Relative positions: PE(pos + k) is a fixed linear transformation of PE(pos), so attention can recover relative distances.
- No parameters, any length: the encoding is a fixed function, so it needs no training and is defined for any sequence length.
Vaswani et al. tested both fixed sinusoidal PE and learned positional embeddings and found nearly identical performance on translation tasks. Sinusoidal PE has practical advantages: no parameters to train, works for any sequence length at inference time (no hard cutoff), and can in principle extrapolate beyond training length (though in practice this degrades). Learned PE has a hard cutoff at the training max length but can be extended with techniques like position interpolation.
Sinusoidal PE inspired RoPE (Rotary Position Embedding, Su et al. 2021). Where sinusoidal PE is added to embeddings before attention, RoPE is applied directly to query and key vectors by rotating them in a position-dependent way. RoPE inherits sinusoidal PE's nice relative-position properties but integrates more naturally with the attention mechanism and supports efficient context extension via YaRN and dynamic NTK scaling. RoPE is now the dominant positional encoding in modern LLMs.
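To make the contrast concrete, here is a minimal sketch of the rotation RoPE applies: each consecutive (even, odd) pair of a query or key vector is rotated by a position-dependent angle. The function name and the plain-list representation are illustrative simplifications, not a production implementation.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of a query/key vector by a
    position-dependent angle -- the core operation of RoPE (sketch)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / base ** (i / d)       # angle for this dimension pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

# Key property: the dot product of rotated q and k depends only on the
# relative offset between their positions, not on the absolute positions.
q, k = [0.3, -1.2, 0.7, 0.5], [1.1, 0.4, -0.6, 0.9]
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
d1 = dot(rope_rotate(q, 10), rope_rotate(k, 7))    # offset 3
d2 = dot(rope_rotate(q, 25), rope_rotate(k, 22))   # offset 3, shifted absolute positions
assert abs(d1 - d2) < 1e-9
```

The assertion holds because 2D rotations are orthogonal and compose additively, so the score between positions m and n reduces to a function of n − m alone. This is exactly the relative-position property that makes RoPE integrate naturally with attention.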
| Property | Sinusoidal | RoPE | ALiBi |
|---|---|---|---|
| Relative distance awareness | Implicit (via frequencies) | Explicit (rotation) | Explicit (bias) |
| Extrapolates to longer seqs | Poorly (degrades) | Yes (with scaling) | Yes |
| Learnable | Can be | No | No |
| Memory efficient | Very | Very | Very |
Why sinusoids for position encoding: sinusoids have the property that PE[pos + k] is a fixed linear transformation of PE[pos], namely a rotation of each (sin, cos) pair whose angle depends only on the offset k. This means relative position information is naturally encoded: the attention mechanism can learn relative distances without explicit biasing. Different frequency components (high-frequency for nearby tokens, low-frequency for distant ones) capture multi-scale positional structure, enabling the model to generalize position information, to some extent, to sequences longer than those seen during training.
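The linear-transformation property can be verified numerically for a single frequency pair: rotating (sin(pos·ω), cos(pos·ω)) by the fixed matrix for offset k reproduces the encoding at pos + k exactly. The helper names here are illustrative.

```python
import math

def pe_pair(pos: float, omega: float):
    """One (sin, cos) frequency pair of the sinusoidal PE at a given position."""
    return (math.sin(pos * omega), math.cos(pos * omega))

def shift_by_k(pair, k: float, omega: float):
    """Apply the fixed 2x2 rotation that depends only on the offset k."""
    s, c = pair
    ck, sk = math.cos(k * omega), math.sin(k * omega)
    # sin((p+k)w) = sin(pw)cos(kw) + cos(pw)sin(kw)
    # cos((p+k)w) = cos(pw)cos(kw) - sin(pw)sin(kw)
    return (s * ck + c * sk, c * ck - s * sk)

omega = 1.0 / 10000 ** (2 * 3 / 64)   # frequency of dimension pair i=3, d_model=64
pos, k = 17.0, 5.0
predicted = shift_by_k(pe_pair(pos, omega), k, omega)
actual = pe_pair(pos + k, omega)
assert all(abs(a - b) < 1e-12 for a, b in zip(predicted, actual))
```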
The base frequency 10000 was chosen somewhat arbitrarily in the original Transformer paper. The frequencies 1/10000^(2i/d_model) form a geometric progression: wavelengths range from 2π up toward 10000·2π, with adjacent dimension pairs differing by a constant factor, creating a balanced representation across timescales. Modern research has explored alternative bases (like 2^(2i/d_model)) and found that the exact base matters less than this geometric (log-spaced) structure and the frequency range covered.
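The geometric spacing is easy to check directly: every adjacent pair of wavelengths differs by the same constant factor, base^(2/d_model).

```python
import math

d_model, base = 64, 10000.0
# Wavelength of dimension pair i is 2*pi * base^(2i/d_model): a geometric
# progression from 2*pi (i=0) up toward 2*pi*base (i = d_model/2 - 1).
wavelengths = [2 * math.pi * base ** (2 * i / d_model) for i in range(d_model // 2)]
ratios = [wavelengths[j + 1] / wavelengths[j] for j in range(len(wavelengths) - 1)]
# Adjacent wavelengths always differ by the same factor
assert all(abs(r - base ** (2 / d_model)) < 1e-9 for r in ratios)
```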
Learned positional embeddings (added to token embeddings, one vector per position) offer more flexibility than sinusoidal encodings but don't extrapolate. Rotary embeddings (RoPE) apply sinusoidal-inspired rotations directly to the query and key vectors, combining the relative position awareness of sinusoids with the simplicity of rotations. ALiBi takes a different approach with linear biases. Understanding the tradeoffs guides architecture choices for different problem domains and sequence length requirements.
Recent work questions whether any position encoding is necessary. Some architectures achieve competitive performance using only token-relative positions derived implicitly from attention patterns. Others use explicit continuous position functions instead of discrete encodings. The field hasn't converged on a single best approach, indicating that position encoding design remains an open research problem.
T5 bias terms (similar in spirit to ALiBi) and other bias-based approaches show that explicit relative position encoding isn't necessary if the attention mechanism learns to infer relative positions. This suggests position information is somewhat redundant if the model has sufficient capacity and appropriate inductive biases. Simplifying architectures by removing position encodings could be viable if the model learns position implicitly.
Extrapolation beyond training sequence lengths remains imperfect for most position encodings: feeding the model positions it never saw naively degrades performance. Interpolation strategies (scaling position indices so longer sequences fit the trained range), refinements like YaRN (designed for RoPE), and ALiBi's inherent extrapolation properties are active research areas. Understanding fundamentally why and how models can generalize position knowledge could improve behavior on out-of-distribution lengths.
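A sketch of the basic interpolation idea: positions beyond the training length are rescaled back into the trained range before the encoding is computed, so the model never sees out-of-range position values. This is illustrative only; methods like YaRN additionally adjust per-frequency scaling.

```python
import math

def sinusoidal_pe_row(pos: float, d_model: int):
    """Sinusoidal PE for a single (possibly fractional) position."""
    row = []
    for i in range(0, d_model, 2):
        freq = 1.0 / 10000 ** (i / d_model)
        row += [math.sin(pos * freq), math.cos(pos * freq)]
    return row

def interpolated_pe_row(pos: int, d_model: int, train_len: int, target_len: int):
    """Linear position interpolation: squeeze target positions into the
    trained range (a sketch; real methods usually also fine-tune)."""
    scaled = pos * train_len / target_len
    return sinusoidal_pe_row(scaled, d_model)

# Position 3000 in a 4096-token context maps to trained position 1500
row = interpolated_pe_row(3000, 64, train_len=2048, target_len=4096)
assert row == sinusoidal_pe_row(1500.0, 64)
```

The cost is that nearby positions become fractionally spaced, which compresses the high-frequency components; this is one reason interpolation usually requires fine-tuning.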
Analyzing positional encodings in the Fourier domain provides theoretical insights. Sinusoidal encodings decompose positions into frequency components, where high frequencies represent local details and low frequencies represent large-scale structure. This multi-scale decomposition is why sinusoids work: the model naturally separates position information across frequency bands. Other encodings (learned, RoPE, ALiBi) use different mechanisms to achieve a similar multi-scale representation.
The attention mechanism's ability to extract relative positions from mixed sinusoidal encodings is remarkable. Even without explicit teaching, attention heads learn to compute relative distances by comparing position encoding patterns. This implicit relative position learning explains sinusoidal encodings' effectiveness despite their mathematical simplicity. Understanding this mechanism deepens appreciation for transformer architecture design.
Future work on position encodings might explore other mathematical structures: wavelets for better localization, sparse encodings for ultra-long sequences, or continuous position functions for fine-grained position control. The fundamental requirement—enabling the model to distinguish positions and learn relative distances—can be satisfied by many mechanisms. The field continues exploring trade-offs between simplicity, parameter efficiency, and generalization.
While positional encodings originated in NLP with transformers, they're now essential in vision transformers (ViTs) and multimodal models. 2D positional encodings for image patches, 3D for video, and hierarchical encodings for graph structures extend the concept. The fundamental requirement—enabling the model to understand spatial or structural relationships—applies across domains. Extending sinusoidal or RoPE encodings to multiple dimensions and non-Euclidean spaces remains an active research area.
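One common way to extend sinusoidal PE to image patches is the 2D variant used in some ViTs, where half the channels encode the row index and half the column index. The function names here are illustrative.

```python
import math

def sinusoidal_pe_1d(pos: int, dim: int):
    """Standard 1-D sinusoidal encoding for one position."""
    out = []
    for i in range(0, dim, 2):
        freq = 1.0 / 10000 ** (i / dim)
        out += [math.sin(pos * freq), math.cos(pos * freq)]
    return out

def sinusoidal_pe_2d(row: int, col: int, d_model: int):
    """2-D variant: first half of the channels encodes the row index,
    second half the column index (a common ViT-style construction)."""
    half = d_model // 2
    return sinusoidal_pe_1d(row, half) + sinusoidal_pe_1d(col, half)

pe2d = sinusoidal_pe_2d(3, 7, 64)
assert len(pe2d) == 64
# Patches in the same row share the row half of the encoding
assert sinusoidal_pe_2d(3, 9, 64)[:32] == pe2d[:32]
```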
In multimodal models (text + image + audio), aligning positional encodings across modalities is non-trivial. Audio has continuous time, images have 2D space, text has sequence order. Some models learn joint positional encodings; others use modality-specific encodings with learned fusion layers. Understanding positional encodings' role in multimodal fusion helps debug models that fail to integrate information across modalities effectively.
Absolute vs. relative positional encodings remain debated in certain domains. Relative encodings (like RoPE or ALiBi) theoretically enable better length generalization, but empirical results are mixed. Some practitioners prefer simplicity and robustness of learned absolute embeddings. Domain-specific empirical evaluation remains necessary—no universal winner exists yet.