ALiBi

What ALiBi solves
ALiBi mechanism
Implementation in PyTorch
Extrapolation properties
ALiBi vs RoPE vs sinusoidal
Models using ALiBi
Gotchas

SECTION 01

What ALiBi solves

Standard positional encodings (sinusoidal or learned) fail when you try to generate sequences longer than the training context window — the model has never seen those position indices and performance degrades sharply. ALiBi (Press et al. 2021) addresses this by replacing positional embeddings with a simple bias on attention scores that doesn't require any learnable parameters and generalises gracefully to unseen lengths.

SECTION 02

ALiBi mechanism

Instead of adding positional information to token embeddings, ALiBi adds a static bias matrix to the query-key attention scores before the softmax. For a query at position i attending to a key at position j, the bias is:

bias(i, j) = -m · |i - j|

where m is a head-specific slope. The slopes form a geometric sequence: for H heads, m_h = 2^(-8h/H). With 8 heads: slopes are 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/256.

Different heads use different slopes — small slopes allow attending far back (large receptive field); large slopes focus on local context. This gives the model multi-scale temporal structure for free.

SECTION 03

Implementation in PyTorch

import torch
import math

def get_alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric sequence of slopes as in the paper
    def get_slopes_power_of_2(n):
        start = 2 ** (-(2 ** -(math.log2(n) - 3)))
        ratio = start
        return [start * ratio**i for i in range(n)]

    if math.log2(n_heads).is_integer():
        slopes = get_slopes_power_of_2(n_heads)
    else:
        # Interpolate for non-power-of-2 head counts
        closest_power = 2 ** math.floor(math.log2(n_heads))
        base_slopes = get_slopes_power_of_2(closest_power)
        extra = get_slopes_power_of_2(2 * closest_power)[0::2][:n_heads - closest_power]
        slopes = base_slopes + extra

    return torch.tensor(slopes, dtype=torch.float32)

def build_alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    slopes = get_alibi_slopes(n_heads)  # (n_heads,)
    # positions: relative distances 0, -1, -2, ..., -(seq_len-1)
    positions = torch.arange(seq_len).unsqueeze(0) - torch.arange(seq_len).unsqueeze(1)
    # positions shape: (seq_len, seq_len), values: j - i
    # bias = -slope * |i - j| = slope * (j - i) for causal (j <= i always negative)
    bias = slopes.unsqueeze(-1).unsqueeze(-1) * positions.unsqueeze(0)  # (n_heads, seq, seq)
    return bias

# Usage: add to attention scores before softmax
# scores: (batch, n_heads, seq, seq)
# bias: (n_heads, seq, seq)  — broadcast over batch
bias = build_alibi_bias(n_heads=8, seq_len=512)
# scores = scores + bias

SECTION 04

Extrapolation properties

ALiBi can extrapolate to sequences 2–4× longer than training without modification. The bias penalises distant tokens with a fixed linear rate, so at longer sequences the model naturally relies more on recent context — a sensible inductive bias for language. In the original paper, models trained with 1024-token contexts achieved near-identical perplexity at 2048 tokens, while sinusoidal-PE models degraded sharply beyond the training length.

SECTION 05

ALiBi vs RoPE vs sinusoidal

Sinusoidal PE: Added to input embeddings. Fixed pattern; poor extrapolation. Used in original Transformer, BERT.
Learned PE: Trained position embeddings; hard cutoff at max train length. Used in GPT-2.
ALiBi: Bias on attention scores; no parameters; good extrapolation. Used in MPT, BLOOM, BloombergGPT.
RoPE: Rotational encoding applied to Q/K; better extrapolation with YaRN/dynamic NTK scaling; dominant in modern LLMs (Llama, Mistral). RoPE wins on most benchmarks; ALiBi is simpler to implement.

SECTION 06

Models using ALiBi

BLOOM (BigScience, 2022): 176B multilingual model, one of the first large models to use ALiBi. MPT (MosaicML): 7B, 30B models trained on long contexts; ALiBi enables efficient context extension. BloombergGPT: Financial domain LLM based on BLOOM architecture. ALiBi has largely been superseded by RoPE in newer models, but it remains relevant for understanding the design space of positional encodings.

SECTION 07

Gotchas

Causal vs bidirectional: For causal language models, the bias only applies to j ≤ i (no future attending). For bidirectional models (BERT-style), apply the bias symmetrically.
Not combined with positional embeddings: ALiBi replaces positional embeddings — don't use both. If your model has learned position embeddings in the token embedding layer, remove them before adding ALiBi.
Flash Attention compatibility: Flash Attention 2 has native ALiBi support via the alibi_slopes argument to flash_attn_func.

SECTION 09

ALiBi vs RoPE Comparison

Aspect	ALiBi	RoPE	Sinusoidal PE
Extrapolation	Excellent	Good (with adjustments)	Poor
Compute overhead	Minimal	Minimal	Minimal
No. of parameters	0 (hardcoded)	0	0
Length generalization	Strong	Moderate	Weak
Adoption	MPT, BLOOM	LLaMA, Mistral	Original Transformer

Why ALiBi slopes decrease geometrically: The slope values 1/2, 1/4, 1/8, ... form a geometric series. This design ensures that the attention bias magnitude is calibrated per head in a learnable-free way. Head 0 (with slope 1/2) attends to nearby tokens more strongly, while later heads (with smaller slopes) develop longer-range dependencies. This geometric progression mirrors frequency decomposition in sinusoidal positional encodings.

ALiBi's strength lies in its explicit extrapolation guarantees. When a model trained on sequences of length 2048 encounters a 4096-token prompt at inference, ALiBi's linear bias naturally extends beyond the training horizon. RoPE requires careful interpolation or extrapolation methods (like YaRN) to achieve similar performance on longer sequences, making ALiBi simpler for length generalization in some cases.

Production deployment of ALiBi-based models requires no special handling—the bias is added before softmax, making it compatible with all standard attention implementations including flash-attention variants. However, the geometric slope selection is a hyperparameter that can be tuned; research on optimal slope schedules continues.

EXTRA

ALiBi Research and Variants

ALiBi emerged from research questioning whether explicit position encodings were necessary. By studying attention patterns in transformers, researchers found that self-attention naturally learns relative positional biases if given implicit linear biases. This insight led to ALiBi, which requires no learnable position embeddings at all.

Variants and improvements continue to emerge: grouped ALiBi applies different slope schedules to different attention heads (like multi-head attention itself), frequency-combined ALiBi mixes geometric and arithmetic progressions of slopes, and dynamic ALiBi adjusts slopes during training. Each variant targets specific extrapolation scenarios or sequence length distributions.

Empirical comparisons across ALiBi, RoPE, and sinusoidal encodings depend heavily on the task and training setup. Some models show ALiBi superior for length extrapolation, while others find RoPE more robust with fewer hyperparameters. The field continues evolving, with no clear universal winner—practitioners should benchmark on their specific task.

Implementing ALiBi efficiently in modern hardware-accelerated attention kernels requires careful design. The bias must be added before softmax to affect attention distributions. Flash-attention variants maintain kernel efficiency while incorporating ALiBi biases. CUDA kernel development for ALiBi taught practitioners valuable lessons about attention computation optimization that influenced subsequent efficient attention research.

ALiBi's design is theoretically motivated: linear relative position biases are the simplest form that preserves relative position information while remaining parameter-free. This simplicity carries computational benefits—no position embedding matrices to store or compute. The lack of trainable position encodings also means ALiBi models can be more easily adapted to different sequence lengths without fine-tuning position embeddings.

Training ALiBi models requires no special considerations—gradients flow normally through the bias addition. The fixed slopes mean no position-related hyperparameters to tune, simplifying architecture choices. Research has explored learned slopes but found that geometric progressions provide good default behavior. This simplicity vs. flexibility trade-off favored simplicity in practice.

BOOST

ALiBi in Production LLM Inference

Deploying ALiBi models requires no special attention kernels or position embedding lookups, making integration with existing inference systems straightforward. vLLM, text-generation-webui, and other inference frameworks support ALiBi without code changes. The overhead is minimal—just adding a static bias before softmax. This simplicity is one reason ALiBi adoption is growing in production systems where engineering simplicity correlates with reliability.

Fine-tuning ALiBi models on task-specific data is identical to fine-tuning other transformers. No position-related hyperparameters change. The learned weights adapt to the new task while positional biases remain fixed. This stability makes ALiBi attractive for practitioners who want fewer hyperparameters to tune during transfer learning, enabling faster experimentation cycles.

Longer-context fine-tuning (training on sequences longer than pretraining) works well with ALiBi because linear biases naturally extrapolate. Fine-tuning a 4K-pretrained ALiBi model on 32K sequences requires no interpolation tricks—just train longer and the model quickly adapts. This extrapolation advantage translates to significant engineering wins in production where context length must flex.

ALiBi

Table of Contents

What ALiBi solves

ALiBi mechanism

Implementation in PyTorch

Extrapolation properties

ALiBi vs RoPE vs sinusoidal

Models using ALiBi

Gotchas

ALiBi Implementation Details

ALiBi vs RoPE Comparison

ALiBi Research and Variants

ALiBi in Production LLM Inference

ALiBi

Table of Contents

What ALiBi solves

ALiBi mechanism

Implementation in PyTorch

Extrapolation properties

ALiBi vs RoPE vs sinusoidal

Models using ALiBi

Gotchas

ALiBi Implementation Details

ALiBi vs RoPE Comparison

ALiBi Research and Variants

ALiBi in Production LLM Inference

Related concepts