The original Transformer positional encoding: add fixed sin/cos waves of different frequencies to token embeddings. Foundational — understanding it is prerequisite for RoPE, ALiBi, and modern variants.
The self-attention mechanism is permutation-invariant: if you shuffle the tokens in the input, the attention computation produces the same result (just shuffled). This is fundamentally wrong for language — word order matters. Positional encoding injects position information into token embeddings so the model can distinguish "dog bites man" from "man bites dog". Sinusoidal PE was the original solution proposed in "Attention Is All You Need" (Vaswani et al. 2017).
For a token at position pos in the sequence and embedding dimension pair i:

```
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
```
This creates a unique pattern for each position: each dimension oscillates at a different frequency, from very fast (i=0, wavelength 2π) to very slow (i = d_model/2 − 1, wavelength approaching 10000·2π). The superposition of these frequencies creates a unique "fingerprint" for each position, analogous to binary counting, where different bit positions toggle at different frequencies.
```python
import torch
import math

def sinusoidal_pe(max_seq_len: int, d_model: int) -> torch.Tensor:
    pe = torch.zeros(max_seq_len, d_model)
    position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
    # div_term: 1 / 10000^(2i/d_model) for i = 0, 1, ..., d_model/2 - 1
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(position * div_term)  # even dims
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dims
    return pe  # (max_seq_len, d_model)
```
```python
class TransformerWithPE(torch.nn.Module):
    def __init__(self, vocab_size, d_model, max_seq_len):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, d_model)
        self.register_buffer("pe", sinusoidal_pe(max_seq_len, d_model))

    def forward(self, x):
        # x: (batch, seq_len)
        seq_len = x.shape[1]
        emb = self.embedding(x)        # (batch, seq, d_model)
        emb = emb + self.pe[:seq_len]  # broadcast PE over batch
        return emb
```
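A quick sanity check of the addition in `forward`: the PE slice broadcasts over the batch dimension, and all encoding values stay bounded. The snippet restates the PE function compactly so it runs standalone; the sizes are arbitrary.

```python
import math
import torch

def sinusoidal_pe(max_seq_len: int, d_model: int) -> torch.Tensor:
    # Same computation as above, restated so this snippet is self-contained.
    pe = torch.zeros(max_seq_len, d_model)
    position = torch.arange(max_seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_pe(128, 32)
emb = torch.randn(4, 10, 32)          # (batch, seq_len, d_model)
out = emb + pe[:10]                   # (10, 32) broadcasts over the batch dim
assert out.shape == (4, 10, 32)
assert pe.abs().max().item() <= 1.0   # sin/cos values are bounded in [-1, 1]
```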
```python
# Visualise PE
import matplotlib.pyplot as plt

pe = sinusoidal_pe(100, 64)
plt.figure(figsize=(10, 4))
plt.pcolormesh(pe.numpy().T, cmap="RdBu")
plt.xlabel("Position"); plt.ylabel("Dimension")
plt.colorbar(); plt.title("Sinusoidal Positional Encoding"); plt.show()
```
Three key properties make sinusoidal PE work:

- Uniqueness: the superposition of frequencies gives every position a distinct fingerprint.
- Relative positions: PE(pos + k) is a fixed linear transformation of PE(pos), so attention can recover relative distances.
- No parameters, any length: the encoding is a fixed function, so it needs no training and is defined for any sequence length.
Vaswani et al. tested both fixed sinusoidal PE and learned positional embeddings and found nearly identical performance on translation tasks. Sinusoidal PE has practical advantages: no parameters to train, works for any sequence length at inference time (no hard cutoff), and can in principle extrapolate beyond training length (though in practice this degrades). Learned PE has a hard cutoff at the training max length but can be extended with techniques like position interpolation.
Sinusoidal PE inspired RoPE (Rotary Position Embedding, Su et al. 2021). Where sinusoidal PE is added to embeddings before attention, RoPE is applied directly to query and key vectors by rotating them in a position-dependent way. RoPE inherits sinusoidal PE's nice relative-position properties but integrates more naturally with the attention mechanism and supports efficient context extension via YaRN and dynamic NTK scaling. RoPE is now the dominant positional encoding in modern LLMs.
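To make the contrast concrete, here is a minimal sketch of the rotation RoPE applies: each consecutive (even, odd) pair of a query or key vector is rotated by a position-dependent angle. The function name and the plain-list representation are illustrative simplifications, not a production implementation.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of a query/key vector by a
    position-dependent angle -- the core operation of RoPE (sketch)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / base ** (i / d)       # angle for this dimension pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

# Key property: the dot product of rotated q and k depends only on the
# relative offset between their positions, not on the absolute positions.
q, k = [0.3, -1.2, 0.7, 0.5], [1.1, 0.4, -0.6, 0.9]
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
d1 = dot(rope_rotate(q, 10), rope_rotate(k, 7))    # offset 3
d2 = dot(rope_rotate(q, 25), rope_rotate(k, 22))   # offset 3, shifted absolute positions
assert abs(d1 - d2) < 1e-9
```

The assertion holds because 2D rotations are orthogonal and compose additively, so the score between positions m and n reduces to a function of n − m alone. This is exactly the relative-position property that makes RoPE integrate naturally with attention.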
| Property | Sinusoidal | RoPE | ALiBi |
|---|---|---|---|
| Relative distance awareness | Implicit (via frequencies) | Explicit (rotation) | Explicit (bias) |
| Extrapolates to longer seqs | Poorly (degrades) | Yes (with scaling) | Yes |
| Learnable | Can be | No | No |
| Memory efficient | Very | Very | Very |
Why sinusoids for position encoding: sinusoids have the property that PE[pos + k] is a fixed linear transformation of PE[pos], namely a rotation of each (sin, cos) pair whose angle depends only on the offset k. This means relative position information is naturally encoded: the attention mechanism can learn relative distances without explicit biasing. Different frequency components (high-frequency for nearby tokens, low-frequency for distant ones) capture multi-scale positional structure, enabling the model to generalize position information, to some extent, to sequences longer than those seen during training.
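The linear-transformation property can be verified numerically for a single frequency pair: rotating (sin(pos·ω), cos(pos·ω)) by the fixed matrix for offset k reproduces the encoding at pos + k exactly. The helper names here are illustrative.

```python
import math

def pe_pair(pos: float, omega: float):
    """One (sin, cos) frequency pair of the sinusoidal PE at a given position."""
    return (math.sin(pos * omega), math.cos(pos * omega))

def shift_by_k(pair, k: float, omega: float):
    """Apply the fixed 2x2 rotation that depends only on the offset k."""
    s, c = pair
    ck, sk = math.cos(k * omega), math.sin(k * omega)
    # sin((p+k)w) = sin(pw)cos(kw) + cos(pw)sin(kw)
    # cos((p+k)w) = cos(pw)cos(kw) - sin(pw)sin(kw)
    return (s * ck + c * sk, c * ck - s * sk)

omega = 1.0 / 10000 ** (2 * 3 / 64)   # frequency of dimension pair i=3, d_model=64
pos, k = 17.0, 5.0
predicted = shift_by_k(pe_pair(pos, omega), k, omega)
actual = pe_pair(pos + k, omega)
assert all(abs(a - b) < 1e-12 for a, b in zip(predicted, actual))
```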
The base frequency 10000 was chosen somewhat arbitrarily in the original Transformer paper. The frequencies 1/10000^(2i/d_model) form a geometric progression: wavelengths range from 2π up toward 10000·2π, with adjacent dimension pairs differing by a constant factor, creating a balanced representation across timescales. Modern research has explored alternative bases (like 2^(2i/d_model)) and found that the exact base matters less than this geometric (log-spaced) structure and the frequency range covered.
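The geometric spacing is easy to check directly: every adjacent pair of wavelengths differs by the same constant factor, base^(2/d_model).

```python
import math

d_model, base = 64, 10000.0
# Wavelength of dimension pair i is 2*pi * base^(2i/d_model): a geometric
# progression from 2*pi (i=0) up toward 2*pi*base (i = d_model/2 - 1).
wavelengths = [2 * math.pi * base ** (2 * i / d_model) for i in range(d_model // 2)]
ratios = [wavelengths[j + 1] / wavelengths[j] for j in range(len(wavelengths) - 1)]
# Adjacent wavelengths always differ by the same factor
assert all(abs(r - base ** (2 / d_model)) < 1e-9 for r in ratios)
```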
Learned positional embeddings (added to token embeddings, one vector per position) offer more flexibility than sinusoidal encodings but don't extrapolate. Rotary embeddings (RoPE) apply sinusoidal-inspired rotations directly to the query and key vectors, combining the relative position awareness of sinusoids with the simplicity of rotations. ALiBi takes a different approach with linear biases. Understanding the tradeoffs guides architecture choices for different problem domains and sequence length requirements.
Recent work questions whether any position encoding is necessary. Some architectures achieve competitive performance using only token-relative positions derived implicitly from attention patterns. Others use explicit continuous position functions instead of discrete encodings. The field hasn't converged on a single best approach, indicating that position encoding design remains an open research problem.
T5 bias terms (similar in spirit to ALiBi) and other bias-based approaches show that explicit relative position encoding isn't necessary if the attention mechanism learns to infer relative positions. This suggests position information is somewhat redundant if the model has sufficient capacity and appropriate inductive biases. Simplifying architectures by removing position encodings could be viable if the model learns position implicitly.
Extrapolation beyond training sequence lengths remains imperfect for most position encodings: feeding the model positions it never saw naively degrades performance. Interpolation strategies (scaling position indices so longer sequences fit the trained range), refinements like YaRN (designed for RoPE), and ALiBi's inherent extrapolation properties are active research areas. Understanding fundamentally why and how models can generalize position knowledge could improve behavior on out-of-distribution lengths.
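A sketch of the basic interpolation idea: positions beyond the training length are rescaled back into the trained range before the encoding is computed, so the model never sees out-of-range position values. This is illustrative only; methods like YaRN additionally adjust per-frequency scaling.

```python
import math

def sinusoidal_pe_row(pos: float, d_model: int):
    """Sinusoidal PE for a single (possibly fractional) position."""
    row = []
    for i in range(0, d_model, 2):
        freq = 1.0 / 10000 ** (i / d_model)
        row += [math.sin(pos * freq), math.cos(pos * freq)]
    return row

def interpolated_pe_row(pos: int, d_model: int, train_len: int, target_len: int):
    """Linear position interpolation: squeeze target positions into the
    trained range (a sketch; real methods usually also fine-tune)."""
    scaled = pos * train_len / target_len
    return sinusoidal_pe_row(scaled, d_model)

# Position 3000 in a 4096-token context maps to trained position 1500
row = interpolated_pe_row(3000, 64, train_len=2048, target_len=4096)
assert row == sinusoidal_pe_row(1500.0, 64)
```

The cost is that nearby positions become fractionally spaced, which compresses the high-frequency components; this is one reason interpolation usually requires fine-tuning.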
Analyzing positional encodings in the Fourier domain provides theoretical insights. Sinusoidal encodings decompose positions into frequency components, where high frequencies represent local details and low frequencies represent large-scale structure. This multi-scale decomposition is why sinusoids work: the model naturally separates position information across frequency bands. Other encodings (learned, RoPE, ALiBi) use different mechanisms to achieve a similar multi-scale representation.
The attention mechanism's ability to extract relative positions from mixed sinusoidal encodings is remarkable. Even without explicit teaching, attention heads learn to compute relative distances by comparing position encoding patterns. This implicit relative position learning explains sinusoidal encodings' effectiveness despite their mathematical simplicity. Understanding this mechanism deepens appreciation for transformer architecture design.
Future work on position encodings might explore other mathematical structures: wavelets for better localization, sparse encodings for ultra-long sequences, or continuous position functions for fine-grained position control. The fundamental requirement—enabling the model to distinguish positions and learn relative distances—can be satisfied by many mechanisms. The field continues exploring trade-offs between simplicity, parameter efficiency, and generalization.
While positional encodings originated in NLP with transformers, they're now essential in vision transformers (ViTs) and multimodal models. 2D positional encodings for image patches, 3D for video, and hierarchical encodings for graph structures extend the concept. The fundamental requirement—enabling the model to understand spatial or structural relationships—applies across domains. Extending sinusoidal or RoPE encodings to multiple dimensions and non-Euclidean spaces remains an active research area.
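One common way to extend sinusoidal PE to image patches is the 2D variant used in some ViTs, where half the channels encode the row index and half the column index. The function names here are illustrative.

```python
import math

def sinusoidal_pe_1d(pos: int, dim: int):
    """Standard 1-D sinusoidal encoding for one position."""
    out = []
    for i in range(0, dim, 2):
        freq = 1.0 / 10000 ** (i / dim)
        out += [math.sin(pos * freq), math.cos(pos * freq)]
    return out

def sinusoidal_pe_2d(row: int, col: int, d_model: int):
    """2-D variant: first half of the channels encodes the row index,
    second half the column index (a common ViT-style construction)."""
    half = d_model // 2
    return sinusoidal_pe_1d(row, half) + sinusoidal_pe_1d(col, half)

pe2d = sinusoidal_pe_2d(3, 7, 64)
assert len(pe2d) == 64
# Patches in the same row share the row half of the encoding
assert sinusoidal_pe_2d(3, 9, 64)[:32] == pe2d[:32]
```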
In multimodal models (text + image + audio), aligning positional encodings across modalities is non-trivial. Audio has continuous time, images have 2D space, text has sequence order. Some models learn joint positional encodings; others use modality-specific encodings with learned fusion layers. Understanding positional encodings' role in multimodal fusion helps debug models that fail to integrate information across modalities effectively.
Absolute vs. relative positional encodings remain debated in certain domains. Relative encodings (like RoPE or ALiBi) theoretically enable better length generalization, but empirical results are mixed. Some practitioners prefer simplicity and robustness of learned absolute embeddings. Domain-specific empirical evaluation remains necessary—no universal winner exists yet.