Mamba SSM

Selective state space model that replaces attention with input-dependent state transitions. Linear time complexity in sequence length, constant memory per token, and competitive with transformers on language modelling.

Highlights:
  - O(L) time vs O(L²) for attention
  - Selective: input-dependent SSM parameters
  - Mamba-2: improved parallelism

SECTION 01

State space models primer

State space models (SSMs) are a family of sequence models from control theory. They model a sequence as a linear dynamical system: a hidden state h(t) evolves over time according to h'(t) = Ah(t) + Bx(t), and the output y(t) = Ch(t) + Dx(t). Discretised for sequences: h_t = Āh_{t-1} + B̄x_t and y_t = Ch_t. The matrices A, B, C, D govern the dynamics. SSMs process sequences in O(L) time with O(1) memory per step (no KV cache), making them appealing for long sequences where attention's O(L²) cost is prohibitive.
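The discretised recurrence above can be sketched in a few lines. This is a minimal NumPy illustration with random, illustrative matrices (all variable names are this sketch's own, not from any SSM library); the point is the O(L) loop and the O(1) carried state:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 10                      # state dimension, sequence length
A_bar = 0.9 * np.eye(N)           # discretised state matrix (stable: eigenvalues < 1)
B_bar = rng.normal(size=(N, 1))   # discretised input matrix
C = rng.normal(size=(1, N))       # output matrix
x = rng.normal(size=L)            # scalar input sequence

h = np.zeros((N, 1))              # hidden state carried across steps: O(1) memory
ys = []
for t in range(L):                # O(L) time: one matrix-vector product per token
    h = A_bar @ h + B_bar * x[t]  # h_t = A_bar h_{t-1} + B_bar x_t
    ys.append((C @ h).item())     # y_t = C h_t

print(len(ys))  # 10: one output per input token
```

Note there is no growing cache anywhere: the only state between steps is `h`, whose size is independent of the sequence length.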

Earlier SSMs like S4 and H3 used fixed (input-independent) state matrices. This made them fast but limited expressiveness — they couldn't selectively filter information like attention can.

SECTION 02

Selective state spaces: Mamba's key innovation

Mamba (Gu and Dao, 2023) makes the SSM parameters input-dependent: B, C, and Δ (which controls the discretisation step) are functions of the current input x_t. This "selective" mechanism lets Mamba decide which inputs to incorporate into the state and which to forget — mimicking the selectivity of attention without the quadratic cost. Crucially, this selection is applied per token, per channel, allowing the model to focus on relevant inputs across the full context.

The catch: making parameters input-dependent breaks the convolution trick that made early SSMs fast. Mamba solves this with a hardware-aware parallel scan that runs efficiently on GPUs via CUDA kernels.
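To make the selectivity concrete, here is a hypothetical NumPy reference of a selective SSM step: B, C, and the step size Δ are computed per token from the input, so the recurrence can choose what to store. A real implementation runs this as a hardware-aware parallel scan; the sequential loop and all weight names (W_B, W_C, W_dt) here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, L = 8, 4, 16                     # channels, state size, sequence length
A = -np.exp(rng.normal(size=(D, N)))   # fixed negative (diagonal-style) A per channel
W_B = rng.normal(size=(N, D)) * 0.1    # input -> B_t projection
W_C = rng.normal(size=(N, D)) * 0.1    # input -> C_t projection
W_dt = rng.normal(size=(D, D)) * 0.1   # input -> step size Δ_t projection
x = rng.normal(size=(L, D))

h = np.zeros((D, N))
y = np.zeros((L, D))
for t in range(L):
    dt = np.log1p(np.exp(W_dt @ x[t]))         # softplus: positive step size per channel
    B_t, C_t = W_B @ x[t], W_C @ x[t]          # input-dependent SSM parameters
    A_bar = np.exp(dt[:, None] * A)            # ZOH-style discretisation, modulated by Δ_t
    h = A_bar * h + (dt[:, None] * B_t[None]) * x[t][:, None]  # selective state update
    y[t] = h @ C_t                             # readout with input-dependent C

print(y.shape)  # (16, 8)
```

A large Δ_t lets a token overwrite the state (attend to itself); a tiny Δ_t leaves the state nearly untouched (ignore the token). That per-token, per-channel choice is the selection mechanism.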

SECTION 03

Mamba architecture

A Mamba block replaces the attention + FFN structure of a transformer layer with a single selective SSM sub-block:

  1. Input x → Linear projection → splits into two branches
  2. Branch 1: SSM path — 1D depthwise conv → activation → selective SSM
  3. Branch 2: Gating path — SiLU activation
  4. Elementwise multiply branches, project back to model dimension

This design is similar to the gated MLP in modern transformers but replaces the token-mixing attention with the SSM. The entire model can be stacked like transformer layers.
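The four steps above can be sketched as a PyTorch module. This is a hypothetical toy, not the real implementation: the selective SSM is replaced by a simple causal running-mean mixer purely to keep the sketch self-contained, and the class name is this sketch's own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaStyleBlock(nn.Module):
    def __init__(self, d_model: int, expand: int = 2, d_conv: int = 4):
        super().__init__()
        d_inner = d_model * expand
        self.in_proj = nn.Linear(d_model, 2 * d_inner)       # step 1: split into two branches
        self.conv = nn.Conv1d(d_inner, d_inner, d_conv,      # step 2: depthwise causal conv
                              groups=d_inner, padding=d_conv - 1)
        self.out_proj = nn.Linear(d_inner, d_model)          # step 4: back to model dimension

    def forward(self, x):                                    # x: (batch, length, d_model)
        a, gate = self.in_proj(x).chunk(2, dim=-1)
        a = self.conv(a.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        a = F.silu(a)
        # Stand-in token mixer (causal running mean) where the selective SSM would go:
        a = a.cumsum(dim=1) / torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
        return self.out_proj(a * F.silu(gate))               # step 3: gate, step 4: project

x = torch.randn(2, 64, 16)
block = MambaStyleBlock(d_model=16)
print(block(x).shape)  # torch.Size([2, 64, 16])
```

Swapping the running-mean line for a selective scan recovers the real block; everything else (projection, conv, SiLU gate, output projection) matches the structure listed above.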

SECTION 04

Using Mamba in Python

pip install mamba-ssm causal-conv1d  # requires an NVIDIA GPU and CUDA toolchain

from mamba_ssm import Mamba
import torch

# Single Mamba layer
batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")

mamba_layer = Mamba(
    d_model=dim,        # model dimension
    d_state=16,         # SSM state dimension (N)
    d_conv=4,           # local convolution width
    expand=2,           # inner dimension = d_model * expand
).to("cuda")

y = mamba_layer(x)
print(y.shape)  # (2, 64, 16)

# Full Mamba LM (using transformers or mamba-minimal)
from transformers import AutoTokenizer, MambaForCausalLM

model = MambaForCausalLM.from_pretrained("state-spaces/mamba-2.8b-hf")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

input_ids = tokenizer("Mamba is a sequence model", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0]))

SECTION 05

Mamba vs Transformer trade-offs

SECTION 06

Hybrid approaches

Because Mamba and transformers have complementary strengths, hybrid architectures interleave SSM and attention layers:

  - Jamba (AI21 Labs): 52B model with alternating Mamba and transformer layers plus MoE; achieves the long-context efficiency of SSMs and the reasoning quality of attention.
  - Zamba: dense hybrid with 7B parameters.
  - OLMo-Hybrid: research model showing the hybrid beats both pure Mamba and pure transformer at equivalent FLOP budgets.

The emerging consensus: SSM layers for context compression, attention layers for precise retrieval.
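The interleaving idea can be made concrete with a toy layer schedule, e.g. "mostly Mamba, periodic attention". The ratio and names below are illustrative only, not any published model's exact recipe:

```python
def hybrid_schedule(n_layers: int, attn_every: int = 4):
    """Return a layer-type list with one attention layer per `attn_every` layers."""
    return ["attention" if (i + 1) % attn_every == 0 else "mamba"
            for i in range(n_layers)]

print(hybrid_schedule(8))
# ['mamba', 'mamba', 'mamba', 'attention', 'mamba', 'mamba', 'mamba', 'attention']
```

Real hybrids tune this ratio empirically; Jamba, for instance, uses far more Mamba layers than attention layers, since a few attention layers suffice for precise retrieval.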

SECTION 07

Mamba practical deployment and benchmarks

Mamba's linear-time inference is its defining practical advantage over transformers for long-sequence tasks. Processing a 16K-token sequence with Mamba costs roughly 16x the compute of a 1K-token sequence (linear scaling), whereas a transformer's attention computation grows by 256x (quadratic in sequence length). This efficiency makes Mamba compelling for genomics, long-document understanding, and time-series forecasting, where sequence lengths of tens of thousands of tokens are common and transformer quadratic scaling becomes prohibitive.
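For concreteness, the cost ratios when going from 1K to 16K tokens are pure arithmetic (no assumptions beyond the two sequence lengths):

```python
short, long_ = 1_000, 16_000    # sequence lengths being compared
print(long_ / short)            # linear-time (Mamba) cost ratio: 16.0
print((long_ / short) ** 2)     # quadratic (attention) cost ratio: 256.0
```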

Model           | Context scaling  | Inference VRAM       | Best for
Transformer     | O(n²) attention  | Grows with sequence  | Short-to-medium sequences, NLU
Mamba (SSM)     | O(n) recurrence  | Fixed (state size)   | Very long sequences
Jamba (hybrid)  | O(n) mostly      | Low                  | Long context + reasoning

Sequence Modeling with State Space Models: Theory and Practice

State space models represent sequences through the differential equations h'(t) = Ah(t) + Bx(t) and y(t) = Ch(t) + Dx(t), where h is the state, x the input, and y the output. Discretising via the bilinear transformation (Tustin approximation, typical in the SSM literature) converts the continuous system to the recurrent form h_t = Āh_{t-1} + B̄x_t, which runs in O(L) time; when the parameters are input-independent, the same computation can also be done in parallel as a long convolution, which is what made early SSMs fast to train.

Traditional recurrent models (LSTMs, GRUs) use fixed nonlinear recurrences; SSMs instead learn the state matrices A, B, C, D, allowing adaptation to task-specific dynamics. In theory, SSMs can approximate any bounded linear time-invariant (LTI) system given sufficient state dimension; in practice, SSM layers are alternated with nonlinearities to model complex sequence dependencies.

Mamba's innovation is selectivity: the selection mechanism makes B, C, and the step size Δ depend on the input tokens (with Δ in turn modulating the discretised Ā), allowing dynamic, attention-like behaviour where each position determines how much history to keep. This addresses the core limitation of earlier SSMs, whose fixed state mixing could not implement selection (e.g. recalling specific tokens from long ago), and makes SSMs competitive with transformers on long-context tasks.
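The bilinear (Tustin) discretisation mentioned above can be written directly from its definition. A small NumPy sketch, with the variable names being this sketch's own:

```python
import numpy as np

def bilinear_discretize(A, B, dt):
    """Convert continuous (A, B) with step size dt to discrete (A_bar, B_bar)."""
    I = np.eye(A.shape[0])
    left = np.linalg.inv(I - (dt / 2) * A)   # (I - Δ/2 A)^-1
    A_bar = left @ (I + (dt / 2) * A)        # Ā = (I - Δ/2 A)^-1 (I + Δ/2 A)
    B_bar = left @ (dt * B)                  # B̄ = (I - Δ/2 A)^-1 ΔB
    return A_bar, B_bar

A = np.array([[-1.0, 0.0], [0.0, -2.0]])     # stable continuous dynamics (negative eigenvalues)
B = np.array([[1.0], [1.0]])
A_bar, B_bar = bilinear_discretize(A, B, dt=0.1)
print(np.abs(np.linalg.eigvals(A_bar)).max() < 1.0)  # True: discrete system is also stable
```

The check at the end illustrates why this transform is popular: it maps a stable continuous system (eigenvalues with negative real part) to a stable discrete one (eigenvalue magnitudes below 1), so the recurrence h_t = Āh_{t-1} + B̄x_t does not blow up.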

Mamba vs. Transformers: Computational and Architectural Trade-offs

Transformers compute attention in O(n²) by comparing every pair of tokens; Mamba processes sequences in O(n) with constant memory per step. For short sequences (n < 2048), highly optimised attention kernels (e.g. FlashAttention) keep transformers faster per token. For long sequences (n > 4096), Mamba's O(n) complexity becomes advantageous; at 64K tokens, Mamba can run 10-50x faster than a standard transformer.

Memory is the starker contrast: a transformer's KV cache grows O(n) with sequence length, while Mamba's state has a fixed dimension regardless of length. However, Mamba's selectivity mechanism (computing B, C, and Δ per token) costs more per step than a single cached-attention step; one reported comparison puts Mamba-7B at about 3.5 tokens/second versus 4.2 tokens/second for a comparable transformer on an A100 GPU. The gap reverses at longer contexts: at 32K tokens, Mamba sustains around 3.0 tokens/sec while the transformer drops to roughly 0.5 tokens/sec under KV-cache pressure.

The lack of explicit attention also changes interpretability: attention maps directly show which tokens influence each prediction, whereas Mamba's state-based processing is opaque, complicating debugging. In practice, Mamba excels at long-document QA (legal contracts, arXiv papers), time-series forecasting, and DNA sequence modelling; transformers remain superior for short-context tasks and wherever pre-training scale dominates, since transformer-based models currently benefit from vastly larger pre-training corpora and tooling investment.
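The KV-cache-versus-fixed-state contrast is easy to quantify. The figures below (layer count, head dims, fp16) are hypothetical, chosen only to show the scaling shape, not to describe any particular model:

```python
layers, heads, head_dim, bytes_per = 32, 32, 128, 2   # fp16 = 2 bytes per value
d_inner, d_state = 8192, 16                           # illustrative SSM dimensions

def kv_cache_bytes(seq_len):
    """Transformer KV cache: one K and one V vector per token, per layer, per head."""
    return 2 * layers * heads * head_dim * seq_len * bytes_per

ssm_state = layers * d_inner * d_state * bytes_per    # independent of sequence length

for n in (1_000, 32_000):
    print(n, kv_cache_bytes(n) // 2**20, "MiB KV cache")
print("SSM state:", ssm_state // 2**20, "MiB (any sequence length)")
```

Under these assumptions the KV cache grows from hundreds of MiB at 1K tokens to tens of GiB at 32K, while the SSM state stays a few MiB regardless of context, which is the "KV-cache pressure" described above.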

Hybrid Architectures and Multi-Modal Sequence Processing

Hybrid architectures interleave attention layers (short-range precision, interpretable) with SSM layers (long-range, efficient) to balance their strengths. Some hybrid designs place attention in the first few layers to capture local patterns and use Mamba layers for the remaining depth; others, like Jamba, interleave attention periodically throughout. Such hybrids come close to transformer accuracy on standard benchmarks while retaining most of Mamba's efficiency gains, often measured as tokens-per-second-per-dollar across GPU tiers.

For multi-modal tasks (text plus images), hybrid models can apply attention within each modality (image patches, text tokens) and SSM layers for cross-modal alignment, avoiding quadratic attention over the entire combined sequence.

Streaming inference (processing a token stream with no lookahead) is a major advantage for Mamba: the recurrent form h_t = Āh_{t-1} + B̄x_t processes each token in constant time and constant memory, which suits live transcription or chat. Causal transformers can also decode incrementally, but their KV cache grows with the stream; windowed-attention variants cap the cache at the cost of discarding distant context. As a rough (and size-mismatched) throughput illustration, LLaMA-2-70B chat on 8xH100 serves on the order of 50 tokens/sec per GPU, while Mamba-7B has been reported near 1000 tokens/sec per GPU.

The trade-off: Mamba models so far lack the massive pre-training and instruction-tuning corpora behind leading transformers, so their capabilities lag behind equivalently-sized transformers trained at scale; as Mamba gains adoption, this gap should narrow.
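The streaming point can be shown in a few lines: the recurrent form carries a fixed-size state, so each incoming token costs the same O(1) work no matter how long the stream has run. A minimal NumPy sketch with illustrative random dynamics (not a real API):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
A_bar = np.diag(rng.uniform(0.5, 0.99, size=N))   # stable discrete dynamics
B_bar = rng.normal(size=(N,))
C = rng.normal(size=(N,))

def stream_step(state, token):
    """Consume one token, return (new_state, output) in constant time and memory."""
    state = A_bar @ state + B_bar * token
    return state, C @ state

state = np.zeros(N)                               # fixed-size state, never grows
outputs = []
for token in rng.normal(size=1000):               # e.g. a live transcription stream
    state, y = stream_step(state, token)
    outputs.append(y)

print(state.shape, len(outputs))  # (8,) 1000
```

After 1000 tokens the carried state is still just 8 numbers; a transformer's KV cache at the same point would hold entries for all 1000 tokens.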