Self-attention powers every modern LLM — the architecture that changed everything
Recurrent Neural Networks (RNNs, LSTMs) process sequences token-by-token, one step at a time. This sequential dependency made them slow to train (no parallelization) and poor at capturing long-range dependencies. Vanishing gradients meant information from early tokens faded by the time you reached token 512.
Transformers (Vaswani et al., 2017) solved this with a radical insight: replace sequential processing with parallel attention. Every token attends to every other token simultaneously, computing relationships in a single pass. This enables massive parallelization and removes the gradient flow bottleneck.
| Property | RNN/LSTM | Transformer |
|---|---|---|
| Parallelization | Sequential only | Fully parallel |
| Long-range dependencies | Weak (vanishing gradients) | Strong (direct connections) |
| Training speed | Slow (O(n) steps) | Fast (O(1) steps) |
| Memory efficiency | O(d) hidden state | O(n²) attention |
| Inference latency | Low | Higher (mitigated by KV cache) |
Self-attention computes a weighted sum of token representations, where the weights reflect relevance between pairs of tokens. The mechanism requires three transformations: Query (Q), Key (K), Value (V).
For a sequence of token embeddings X, we project each token into three spaces with learned weight matrices: Q = X @ W_q, K = X @ W_k, V = X @ W_v. Attention is then:

softmax(Q @ K.T / sqrt(d_k)) @ V

The division by sqrt(d_k) stabilizes gradients; the softmax ensures weights sum to 1. The result is that each token becomes a weighted blend of all tokens' values, guided by query-key similarity.
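The whole computation fits in a few lines of NumPy. This is a minimal sketch: the sizes are illustrative and the projection matrices are random rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # [n, n] pairwise query-key similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted blend of value vectors

rng = np.random.default_rng(0)
n, d = 4, 8                              # 4 tokens, dimension 8
X = rng.normal(size=(n, d))              # stand-in token embeddings
# In a real model W_q, W_k, W_v are learned; here they are random.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)  # (4, 8): one blended vector per token
```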
A single attention head captures one type of relationship. Multi-head attention runs several heads in parallel (8 in the original Transformer, 12 in GPT-2, more in larger models), each learning different attention patterns (syntax, semantics, long-range, short-range). The outputs are concatenated and projected back to the model dimension.
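The split-attend-concatenate dance can be sketched as follows; head count and dimensions here are illustrative, and the weights are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    n, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # reshape [n, d_model] -> [h, n, d_k]: each head sees its own slice
    split = lambda M: M.reshape(n, h, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)  # [h, n, n]
    heads = softmax(scores) @ Vh                        # [h, n, d_k]
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o  # project back to the model dimension

rng = np.random.default_rng(0)
n, d_model, h = 4, 16, 4
X = rng.normal(size=(n, d_model))
W = [rng.normal(size=(d_model, d_model)) for _ in range(4)]
out = multi_head_attention(X, *W, h=h)
print(out.shape)  # (4, 16)
```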
Causal (decoder-only, GPT-style): Token i can only attend to tokens 0..i (past and present), not future tokens. Enforced by setting scores to -∞ for j > i before softmax. Non-causal (BERT-style): All tokens attend to all tokens. Used in encoder-only models where future context is available.
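Enforcing causality is just a mask applied before the softmax, as a small sketch shows (random scores stand in for real query-key products):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n = 4
scores = np.random.default_rng(0).normal(size=(n, n))
# Mask out future positions: token i may not attend to j > i.
future = np.triu(np.ones((n, n), dtype=bool), k=1)
weights = softmax(np.where(future, -np.inf, scores), axis=-1)
print(np.round(weights, 2))
# Upper triangle is exactly 0; each row still sums to 1.
```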
Transformers come in three flavors, each suited to different tasks.
Bidirectional attention: tokens see all other tokens. Trained with masked language modeling (15% of tokens masked, model predicts them). Excels at understanding and classification tasks (sentiment, NER, token classification). No autoregressive generation.
Causal attention: each token attends only to past tokens. Trained with next-token prediction (given tokens 0..i, predict i+1). Naturally generates sequences autoregressively, one token at a time. Nearly all modern LLMs are decoder-only.
Encoder processes input bidirectionally; decoder attends causally to its own tokens and cross-attends to encoder outputs. Designed for tasks with separate input and output (translation, summarization, question answering).
| Architecture | Attention type | Tasks | Examples |
|---|---|---|---|
| Encoder-only | Bidirectional | Classification, tagging | BERT, RoBERTa |
| Decoder-only | Causal | Language generation | GPT-4, Llama, Falcon |
| Encoder-decoder | Causal + cross | Seq2seq, translation | T5, BART, Pegasus |
Transformers have no notion of sequence order — attention is permutation-invariant. To inject position information, we add positional encodings to token embeddings before feeding them to the first layer.
Sinusoidal (Vaswani, 2017): For position p and model dimension d_model, compute sin and cos at geometrically spaced frequencies. Fixed rather than learned, so it is defined for any position in principle. Still used in some models.
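A sketch of the sinusoidal scheme (the 10000 base comes from the original paper; the sequence length and dimension here are arbitrary):

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """Fixed positional encodings: sin/cos at geometric frequencies."""
    p = np.arange(max_len)[:, None]        # positions, shape [max_len, 1]
    i = np.arange(0, d_model, 2)[None, :]  # even dimension indices
    freq = 1.0 / (10000 ** (i / d_model))  # one frequency per sin/cos pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(p * freq)         # even dims get sin
    pe[:, 1::2] = np.cos(p * freq)         # odd dims get cos
    return pe

pe = sinusoidal_encoding(32, 16)
print(pe.shape)  # (32, 16)
print(pe[0])     # position 0: sin terms are 0, cos terms are 1
```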
RoPE (Rotary Position Embeddings, Su et al., 2021) applies rotation matrices in the query-key dot product. More sample-efficient than sinusoidal; naturally extrapolates to longer sequences. Used in Llama, PaLM, modern models.
Attention with Linear Biases (ALiBi, Press et al., 2022) adds a position-dependent bias directly to attention logits before softmax. Extremely simple; extrapolates well. Used in some smaller models.
Transformers are powerful but expensive. Three key innovations dramatically reduce computation and memory.
During inference, we generate tokens one at a time. Without caching, we'd recompute K and V for all previous tokens on every generation step. The KV cache stores K and V for all past tokens and reuses them, cutting attention computation from O(n²) to O(n) per generated token.
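A single-head sketch of the caching loop; random vectors stand in for real token embeddings, and the projections are random rather than learned:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

K_cache = np.zeros((0, d))
V_cache = np.zeros((0, d))

def decode_step(x, K_cache, V_cache):
    """Attend the new token to all cached tokens plus itself."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    K_cache = np.vstack([K_cache, k])  # append instead of recomputing history
    V_cache = np.vstack([V_cache, v])
    w = softmax(q @ K_cache.T / np.sqrt(d))  # one query row vs. all cached keys
    return w @ V_cache, K_cache, V_cache

for _ in range(5):                      # generate 5 tokens
    x = rng.normal(size=(1, d))         # stand-in for the new token's embedding
    out, K_cache, V_cache = decode_step(x, K_cache, V_cache)

print(K_cache.shape)  # (5, 8): one cached K row per generated token
```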
Dao et al. (2022) observed that standard attention implementations read/write to slow GPU memory repeatedly. Flash Attention reorders computation to exploit fast SRAM (on-chip memory). Same result, 4–10x speedup, 2–5x less memory. Now standard in modern libraries.
Standard multi-head attention keeps a separate K and V projection for every head, ballooning KV cache size. GQA uses a small number of shared K, V heads (e.g., 2) with multiple Q heads attending to each. This shrinks the KV cache by the grouping factor (8x in Llama 2 70B, which pairs 64 query heads with 8 KV heads) with minimal quality loss.
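A shape-level sketch of the grouping (head counts here are illustrative; real implementations fuse this rather than materializing the repeats):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_k = 4, 8
n_q_heads, n_kv_heads = 8, 2
group = n_q_heads // n_kv_heads            # 4 query heads share each KV head

Q = rng.normal(size=(n_q_heads, n, d_k))
K = rng.normal(size=(n_kv_heads, n, d_k))  # only 2 K/V heads to cache
V = rng.normal(size=(n_kv_heads, n, d_k))

# Each query head attends to its group's shared K/V head.
K_shared = np.repeat(K, group, axis=0)     # [8, n, d_k]
V_shared = np.repeat(V, group, axis=0)
w = softmax(Q @ K_shared.transpose(0, 2, 1) / np.sqrt(d_k))
out = w @ V_shared
print(out.shape)  # (8, 4, 8): full set of query heads, but only 2 cached KV heads
```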
Full attention is O(n²); for very long contexts, we can't afford it. Sliding window attention limits each token to attending to the last W tokens (e.g., W=4096 in Mistral). This reduces complexity to O(n·W). Used in Mistral and Longformer.
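The window mask itself is a one-liner; W here is tiny so the pattern is visible:

```python
import numpy as np

n, W = 6, 3
i = np.arange(n)[:, None]
j = np.arange(n)[None, :]
# Token i attends to j in (i - W, i]: causal, and at most W tokens back.
allowed = (j <= i) & (j > i - W)
print(allowed.astype(int))
# Each row has min(i + 1, W) ones, ending at the diagonal.
```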
| Technique | Impact | Trade-off |
|---|---|---|
| KV cache | 4–100x speedup at inference | Memory for cache storage |
| Flash Attention | 4–10x training speedup | Implementation complexity |
| Grouped-Query Attention | 8x KV cache reduction | Minimal quality loss |
| Sliding window | Linear complexity for long context | Loses very long-range attention |
Let's load a pre-trained transformer and inspect its internal representations, attention weights, and how self-attention works in practice.
- `last_hidden_state`: `[1, 10, 768]` — the final-layer representation of each token; 768 is GPT-2's hidden dimension.
- `attentions`: tuple of 12 tensors (one per layer), each `[batch, num_heads, seq_len, seq_len]`, showing which tokens the model attends to.
- `tokens`: `["The", "Ġtransformer", "Ġarchitecture", ...]` — BPE tokenization splits words into subwords; `"Ġ"` marks a preceding space.
We've covered the architecture and efficiency. The ecosystem of transformer variants is vast. Here are four child concept pages that dive deeper:
Dive into self-attention, cross-attention, different mask types, and modern variants like sparse attention and linear attention. Compare trade-offs.
→ Attention Mechanisms

Understand how transformers encode position, why it matters, and which methods extrapolate best. Explore interpolation vs. extrapolation.

→ Positional Encoding

Deep dive into encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) designs. When to use each. Training objectives and fine-tuning.

→ Transformer Architectures

Master key-value caching, grouped-query attention, Flash Attention, speculative decoding. How production systems generate tokens at scale.

→ KV Cache & Efficiency

The transformer is the architectural foundation of all modern LLMs. Build your understanding layer by layer:
Write softmax(Q @ K.T / sqrt(d_k)) @ V from scratch. This 5-line calculation is the entire core of the transformer. Everything else is scaffolding.
The 2017 paper is still the clearest description of the architecture. Focus on Figure 1 and Section 3. The math follows directly from your NumPy implementation.
Andrej Karpathy's "nanoGPT" (GitHub) is the canonical exercise. Train a character-level model on Shakespeare. You'll see all the components fit together in <300 lines.