Attention with Linear Biases: add a static, head-specific bias to attention scores that grows linearly with distance. Enables extrapolation to sequences longer than training without any learned positional embeddings.
Standard positional encodings (sinusoidal or learned) fail when you try to generate sequences longer than the training context window: the model has never seen those position indices and performance degrades sharply. ALiBi (Press et al., 2021) addresses this by replacing positional embeddings with a simple bias on attention scores that doesn't require any learnable parameters and generalises gracefully to unseen lengths.
Instead of adding positional information to token embeddings, ALiBi adds a static bias matrix to the query-key attention scores before the softmax. For a query at position i attending to a key at position j, the bias is:
bias(i, j) = -m · |i - j|
where m is a head-specific slope. The slopes form a geometric sequence: for H heads, m_h = 2^(-8h/H). With 8 heads: slopes are 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/256.
Different heads use different slopes: small slopes allow attending far back (large receptive field); large slopes focus on local context. This gives the model multi-scale temporal structure for free.
```python
import math
import torch

def get_alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric sequence of slopes as in the paper
    def get_slopes_power_of_2(n):
        start = 2 ** (-(2 ** -(math.log2(n) - 3)))
        ratio = start
        return [start * ratio**i for i in range(n)]

    if math.log2(n_heads).is_integer():
        slopes = get_slopes_power_of_2(n_heads)
    else:
        # Interpolate for non-power-of-2 head counts
        closest_power = 2 ** math.floor(math.log2(n_heads))
        base_slopes = get_slopes_power_of_2(closest_power)
        extra = get_slopes_power_of_2(2 * closest_power)[0::2][:n_heads - closest_power]
        slopes = base_slopes + extra
    return torch.tensor(slopes, dtype=torch.float32)

def build_alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    slopes = get_alibi_slopes(n_heads)  # (n_heads,)
    # positions[i, j] = j - i: relative offset of key j from query i
    positions = torch.arange(seq_len).unsqueeze(0) - torch.arange(seq_len).unsqueeze(1)
    # bias = -slope * |i - j| = slope * (j - i) for causal attention (j <= i, so j - i <= 0)
    bias = slopes.unsqueeze(-1).unsqueeze(-1) * positions.unsqueeze(0)  # (n_heads, seq, seq)
    return bias

# Usage: add to attention scores before softmax
# scores: (batch, n_heads, seq, seq); bias: (n_heads, seq, seq), broadcast over batch
bias = build_alibi_bias(n_heads=8, seq_len=512)
# scores = scores + bias
```
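To show where the bias slots into a full attention computation, here is a minimal sketch of causal attention with ALiBi. For self-containment it rebuilds the bias inline using the power-of-2 slope schedule m_h = 2^-(h+1) (valid when the head count is a power of two); the helper name `causal_alibi_attention` is illustrative, not from the paper.

```python
import math
import torch
import torch.nn.functional as F

def causal_alibi_attention(q, k, v):
    """Causal scaled dot-product attention with ALiBi biases (sketch).

    q, k, v: (batch, n_heads, seq_len, head_dim). Slopes assume a
    power-of-2 head count: m_h = 2^-(h+1).
    """
    batch, n_heads, seq_len, head_dim = q.shape
    slopes = torch.tensor([2.0 ** -(h + 1) for h in range(n_heads)])
    # bias[h, i, j] = slope_h * (j - i), i.e. -slope_h * |i - j| in the causal region
    rel = torch.arange(seq_len).unsqueeze(0) - torch.arange(seq_len).unsqueeze(1)
    bias = slopes.view(n_heads, 1, 1) * rel  # (n_heads, seq, seq)

    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
    scores = scores + bias  # static ALiBi bias, broadcast over batch
    causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Tiny random example
q = torch.randn(2, 8, 16, 32)
out = causal_alibi_attention(q, q, q)
print(out.shape)  # torch.Size([2, 8, 16, 32])
```

Note that the bias is recomputed per call here for clarity; in practice it would be cached or generated inside a fused kernel.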
ALiBi can extrapolate to sequences 2–4× longer than training without modification. The bias penalises distant tokens at a fixed linear rate, so at longer sequences the model naturally relies more on recent context, a sensible inductive bias for language. In the original paper, models trained with 1024-token contexts achieved near-identical perplexity at 2048 tokens, while sinusoidal-PE models degraded sharply beyond the training length.
BLOOM (BigScience, 2022): 176B multilingual model, one of the first large models to use ALiBi. MPT (MosaicML): 7B, 30B models trained on long contexts; ALiBi enables efficient context extension. BloombergGPT: Financial domain LLM based on BLOOM architecture. ALiBi has largely been superseded by RoPE in newer models, but it remains relevant for understanding the design space of positional encodings.
FlashAttention supports ALiBi natively via the `alibi_slopes` argument to `flash_attn_func`. A minimal standalone implementation of the bias:

```python
import torch

def alibi_bias(seq_len, num_heads, device):
    """
    ALiBi: Attention with Linear Biases.
    Compute m values (slopes) and position biases.
    """
    # Each head gets a different slope m_i (assumes a power-of-2 head count)
    slopes = torch.tensor([1 / (2 ** (i + 1)) for i in range(num_heads)], device=device)
    # Position indices [0, 1, 2, ..., seq_len-1]
    positions = torch.arange(seq_len, device=device).unsqueeze(0)
    # Relative distance matrix: distance[i, j] = j - i
    distance = positions - positions.T
    # Apply each head's slope to the shared distance matrix
    alibi = slopes.unsqueeze(-1).unsqueeze(-1) * distance.unsqueeze(0)
    return alibi  # Shape: (num_heads, seq_len, seq_len)
```

| Aspect | ALiBi | RoPE | Sinusoidal PE |
|---|---|---|---|
| Extrapolation | Excellent | Good (with adjustments) | Poor |
| Compute overhead | Minimal | Minimal | Minimal |
| No. of parameters | 0 (hardcoded) | 0 | 0 |
| Length generalization | Strong | Moderate | Weak |
| Adoption | MPT, BLOOM | LLaMA, Mistral | Original Transformer |
Why ALiBi slopes decrease geometrically: The slope values 1/2, 1/4, 1/8, ... form a geometric sequence. This design calibrates the attention bias magnitude per head without any learned parameters. Head 0 (with slope 1/2) attends strongly only to nearby tokens, while later heads (with smaller slopes) develop longer-range dependencies. This geometric progression loosely mirrors the multi-scale frequency decomposition in sinusoidal positional encodings.
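To make the multi-scale effect concrete: holding content scores equal, adding -m·d to a logit reduces that token's attention weight by a factor of exp(-m·d) at distance d. A quick sketch of how fast a steep, a medium, and a shallow head decay:

```python
import math

# Relative attention weight of a token at distance d versus distance 0,
# assuming identical content scores: exp(-m * d). Illustrative only.
for m in (1 / 2, 1 / 16, 1 / 256):
    weights = [math.exp(-m * d) for d in (1, 8, 64, 512)]
    print(f"m={m}: {[round(w, 4) for w in weights]}")
```

With m = 1/2 a token 64 positions back is effectively invisible, while with m = 1/256 it still retains most of its weight, which is exactly the "different receptive field per head" behaviour described above.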
ALiBi's strength lies in its length extrapolation, though this is an empirical property rather than a formal guarantee. When a model trained on sequences of length 2048 encounters a 4096-token prompt at inference, ALiBi's linear bias extends naturally beyond the training horizon. RoPE requires careful interpolation or extrapolation methods (such as YaRN) to achieve similar performance on longer sequences, making ALiBi the simpler option for length generalization in some cases.
Production deployment of ALiBi-based models requires no special handling: the bias is added before softmax, making it compatible with all standard attention implementations including flash-attention variants. However, the geometric slope selection is a hyperparameter that can be tuned; research on optimal slope schedules continues.
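As a concrete illustration of that compatibility, here is a sketch using PyTorch's `scaled_dot_product_attention`, which accepts an additive float `attn_mask`; the ALiBi bias and the causal mask fold into a single mask tensor. The power-of-2 slope schedule is again the illustrative one.

```python
import torch
import torch.nn.functional as F

n_heads, seq_len, head_dim = 8, 128, 64

# ALiBi bias with illustrative slopes m_h = 2^-(h+1)
slopes = torch.tensor([2.0 ** -(h + 1) for h in range(n_heads)])
rel = torch.arange(seq_len).unsqueeze(0) - torch.arange(seq_len).unsqueeze(1)
alibi = slopes.view(n_heads, 1, 1) * rel  # (n_heads, seq, seq)

# Fold causality into the same additive mask: -inf above the diagonal (j > i)
mask = alibi.masked_fill(rel > 0, float("-inf"))

q = torch.randn(2, n_heads, seq_len, head_dim)
out = F.scaled_dot_product_attention(q, q, q, attn_mask=mask)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```

The mask broadcasts over the batch dimension, so no per-example work is needed; the same pattern works with any backend that accepts an additive attention mask.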
ALiBi emerged from research questioning whether explicit position encodings were necessary at all. Studying attention patterns in transformers suggested that much of what position embeddings provide can be replaced by a simple distance-dependent penalty on attention scores. This insight led to ALiBi, which requires no learnable position embeddings at all.
Variants and improvements continue to emerge: grouped ALiBi applies different slope schedules to different attention heads (like multi-head attention itself), frequency-combined ALiBi mixes geometric and arithmetic progressions of slopes, and dynamic ALiBi adjusts slopes during training. Each variant targets specific extrapolation scenarios or sequence length distributions.
Empirical comparisons across ALiBi, RoPE, and sinusoidal encodings depend heavily on the task and training setup. Some models show ALiBi superior for length extrapolation, while others find RoPE more robust with fewer hyperparameters. The field continues evolving with no clear universal winner: practitioners should benchmark on their specific task.
Implementing ALiBi efficiently in modern hardware-accelerated attention kernels requires some care: the bias must be added to the scores before the softmax, so fused kernels typically generate it on the fly from the per-head slopes rather than materializing a full (num_heads, seq, seq) tensor. Flash-attention variants do this while maintaining kernel efficiency, and the CUDA kernel work on ALiBi yielded lessons about attention computation that influenced subsequent efficient-attention research.
ALiBi's design is theoretically motivated: linear relative position biases are the simplest form that preserves relative position information while remaining parameter-free. This simplicity carries computational benefits: no position embedding matrices to store or compute. The lack of trainable position encodings also means ALiBi models can be more easily adapted to different sequence lengths without fine-tuning position embeddings.
Training ALiBi models requires no special considerations: gradients flow normally through the bias addition. The fixed slopes mean no position-related hyperparameters to tune, simplifying architecture choices. Research has explored learned slopes but found that geometric progressions provide good default behavior. This simplicity vs. flexibility trade-off favored simplicity in practice.
Deploying ALiBi models requires no special attention kernels or position embedding lookups, making integration with existing inference systems straightforward. vLLM, text-generation-webui, and other inference frameworks support ALiBi without code changes. The overhead is minimal: just adding a static bias before softmax. This simplicity is one reason ALiBi adoption is growing in production systems where engineering simplicity correlates with reliability.
Fine-tuning ALiBi models on task-specific data is identical to fine-tuning other transformers. No position-related hyperparameters change. The learned weights adapt to the new task while positional biases remain fixed. This stability makes ALiBi attractive for practitioners who want fewer hyperparameters to tune during transfer learning, enabling faster experimentation cycles.
Longer-context fine-tuning (training on sequences longer than pretraining) works well with ALiBi because linear biases naturally extrapolate. Fine-tuning a 4K-pretrained ALiBi model on 32K sequences requires no interpolation tricks: just train longer and the model quickly adapts. This extrapolation advantage translates to significant engineering wins in production where context length must flex.
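The "no interpolation tricks" claim is easy to check directly: the bias built at a longer length is an exact extension of the one built at the shorter length, with zero parameters involved either way. A sketch (using the illustrative power-of-2 slope schedule; smaller lengths than 4K/32K, but the identity holds at any length):

```python
import torch

def alibi_bias(seq_len, num_heads):
    # Same parameter-free recipe at any length (power-of-2 slopes for illustration)
    slopes = torch.tensor([2.0 ** -(h + 1) for h in range(num_heads)])
    rel = torch.arange(seq_len).unsqueeze(0) - torch.arange(seq_len).unsqueeze(1)
    return slopes.view(num_heads, 1, 1) * rel

bias_short = alibi_bias(256, 8)
bias_long = alibi_bias(1024, 8)
# The longer bias contains the shorter one verbatim: no interpolation,
# no resized embedding table, nothing to fine-tune.
print(torch.equal(bias_long[:, :256, :256], bias_short))  # True
```

Contrast this with learned position embeddings, where extending the context requires allocating and training new embedding rows.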