ARCHITECTURE DEEP DIVE

Transformer Architecture

Attention, feed-forward layers, positional encoding, and how the pieces create a language model

Q, K, V, O: the four attention weight matrices
12–96 layers: typical depth of modern LLMs
Attention is O(n²): the scaling bottleneck
Contents
  1. The big picture
  2. Self-attention
  3. Multi-head attention
  4. Feed-forward networks
  5. Positional encoding
  6. Training stability
  7. Architecture variants
01 — THE STACK

The Big Picture

A Transformer is a stack of identical layers. Each layer contains two main sublayers: a multi-head attention block and a feed-forward network. The input is a sequence of token embeddings (each vocabulary token mapped to a d_model-dimensional vector), and the output is a probability distribution over the vocabulary for the next token.

Model Hyperparameters

The following hyperparameters define model size and capacity: d_model (the hidden/embedding dimension), the number of layers, the number of attention heads, and d_ff (the FFN inner dimension).

Example Model Configurations

Model        | d_model | Layers | Heads | d_ff  | Params
GPT-2 Small  | 768     | 12     | 12    | 3072  | 117M
Llama 2-7B   | 4096    | 32     | 32    | 11008 | 7B
Llama 3-70B  | 8192    | 80     | 64    | 28672 | 70B
GPT-4 (est.) | 12288   | 96     | 96    | 49152 | ~1.8T (MoE)
02 — THE CORE MECHANISM

Self-Attention: The Core Mechanism

Each token attends to all other tokens in the sequence, asking "which other tokens are relevant to me?" This is the fundamental operation that gives Transformers their power and expressiveness.

The Computation

For input X, three learned weight matrices project the input into Query (Q), Key (K), and Value (V) spaces:

Q = X W_Q,  K = X W_K,  V = X W_V

The attention output is then: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, i.e. a weighted sum of the value vectors, where d_k is the key dimension.

Causal Masking

In decoder-only models (for generation), each token can only attend to previous tokens (causal masking). The upper triangle of the attention matrix is masked to -inf before softmax, preventing the model from looking into the future.

Scaled Dot-Product Attention in NumPy

import numpy as np

def attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # [seq, seq]
    if mask is not None:
        scores = scores + mask * -1e9        # causal mask
    scores = scores - scores.max(-1, keepdims=True)  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax
    return weights @ V                       # [seq, d_v]

# For a 4-token sequence, d_k = 64
seq_len, d_k = 4, 64
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

# Causal mask: token i cannot see token j > i
mask = np.triu(np.ones((seq_len, seq_len)), k=1)
out = attention(Q, K, V, mask)
⚠️ The sqrt(d_k) scaling prevents dot products from becoming too large, which would push softmax into saturated regions with near-zero gradients during backprop. This is crucial for training stability.
03 — MULTIPLE RELATIONSHIP TYPES

Multi-Head Attention

Run attention H times in parallel, each with different learned W_Q, W_K, W_V projections. Different heads learn different relationship types: one head might focus on syntax, another on coreference, another on positional relationships. This diversity improves representation quality.

Head Output Combination

Concatenate all head outputs and project back to d_model with W_O (output projection). This allows the model to combine diverse attention perspectives.
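The concatenate-and-project step can be sketched in NumPy. This is a minimal illustration, not a production implementation: the function name `multi_head_attention` and the convention of slicing one big projection matrix per head are our own choices for brevity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    outputs = []
    for h in range(n_heads):
        # Each head uses its own slice of the projection matrices
        Q = X @ W_Q[:, h*d_head:(h+1)*d_head]
        K = X @ W_K[:, h*d_head:(h+1)*d_head]
        V = X @ W_V[:, h*d_head:(h+1)*d_head]
        scores = Q @ K.T / np.sqrt(d_head)
        outputs.append(softmax(scores) @ V)
    # Concatenate all heads, then project back to d_model with W_O
    return np.concatenate(outputs, axis=-1) @ W_O

d_model, n_heads, seq_len = 64, 4, 6
rng = np.random.default_rng(0)
X = rng.standard_normal((seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads)  # shape [seq, d_model]
```

Real implementations batch all heads into one tensor operation rather than looping, but the math is the same.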

GQA: Grouped Query Attention

Modern models use Grouped Query Attention (GQA) instead of full Multi-Head Attention. GQA shares K/V heads across groups of Q heads, reducing KV cache memory by 4-8× with minimal quality loss. This is critical for long-context inference.

Attention Head Variants Comparison

Variant                            | K/V heads per Q head | KV cache size | Used by
MHA (Multi-Head Attention)         | 1:1                  | 100%          | GPT-2, early BERT
GQA (Grouped Query Attention, g=4) | 1:4                  | 25%           | Llama 3, Mistral, Gemma
MQA (Multi-Query Attention)        | 1 shared by all      | 1/H           | Falcon, early PaLM
Modern frontier models use GQA as standard. If benchmarking memory for a model, check whether it uses MHA or GQA — it changes KV cache requirements significantly. This is why newer models have better memory efficiency at scale.
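The cache-size column follows from simple arithmetic. A back-of-the-envelope sketch, assuming fp16 storage and Llama-2-7B-ish shapes (the helper `kv_cache_bytes` is hypothetical, not a library function):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per_elem=2):
    # Two tensors (K and V) per layer, each [batch, n_kv_heads, seq_len, d_head]
    return 2 * n_layers * batch * n_kv_heads * seq_len * d_head * bytes_per_elem

# 32 layers, 32 query heads, d_head = 128, 4K context, fp16
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, seq_len=4096)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8,  d_head=128, seq_len=4096)
print(mha / 2**30, gqa / 2**30)  # 2.0 GiB vs 0.5 GiB: GQA with g=4 needs 1/4 the cache
```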
04 — KNOWLEDGE STORAGE

Feed-Forward Network (FFN)

After attention, each token representation passes through a 2-layer FFN independently (no cross-token interaction). The FFN acts as a "knowledge store": research shows factual knowledge is stored in FFN weights, while attention routes information.

FFN Structure

FFN = Linear(d_model → d_ff) → Activation → Linear(d_ff → d_model). d_ff is typically 4× d_model, creating an expand-then-contract pattern.

SwiGLU Activation

Modern models (Llama, PaLM, Gemini) use the SwiGLU activation instead of ReLU or GELU. SwiGLU projects the input into two parallel intermediate paths; one passes through the SiLU activation and gates the other by elementwise multiplication. It is empirically superior to older activations.

SwiGLU FFN in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # up projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

Parameter Distribution

The FFN accounts for roughly 2/3 of a model's parameters. This is where most factual knowledge lives. When fine-tuning, research shows that LoRA targeting FFN layers often outperforms targeting only attention layers for knowledge-intensive tasks.
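The "roughly 2/3" figure follows from simple parameter counting. A sketch assuming a Llama-2-7B-style layer with SwiGLU (three FFN matrices) and full multi-head attention, no biases; models with GQA or different d_ff ratios will shift the split somewhat:

```python
# Per-layer parameter count, Llama-2-7B-like sizes
d_model, d_ff = 4096, 11008

attn_params = 4 * d_model * d_model  # W_Q, W_K, W_V, W_O
ffn_params = 3 * d_model * d_ff      # SwiGLU: w1 (gate), w3 (up), w2 (down)

total = attn_params + ffn_params
print(ffn_params / total)  # ~0.67: the FFN holds roughly 2/3 of block parameters
```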

05 — WORD ORDER

Positional Encoding

Attention is permutation-invariant — without position information, the model has no sense of word order. Positional encoding adds position information to embeddings so the model understands sequence structure.
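As a concrete example, the original Transformer's sinusoidal scheme adds fixed sine/cosine waves of varying frequency to the embeddings. A minimal NumPy sketch (the function name `sinusoidal_pe` is ours):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]          # [seq, 1]
    i = np.arange(0, d_model, 2)[None, :]      # [1, d_model/2]
    angles = pos / (10000 ** (i / d_model))    # [seq, d_model/2]
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(seq_len=128, d_model=64)
# Added to token embeddings before the first layer: X = embeddings + pe
```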


Positional Encoding Comparison

Method           | Extrapolation         | Relative positions | Used by
Sinusoidal       | Poor                  | No                 | Original Transformer
Learned absolute | None (hard cap)       | No                 | GPT-2, BERT
RoPE             | Good (with extension) | Yes                | Llama, Mistral, Gemini
ALiBi            | Good                  | Yes                | BLOOM, MPT

RoPE Extension for Long Context

RoPE can be extended beyond training context length using "RoPE scaling" techniques like YaRN or dynamic NTK scaling. This is how Llama 3's 8K training context was extended to 128K without major quality loss.
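The rotary mechanism itself is compact. A minimal NumPy sketch under our own naming (`apply_rope`): each adjacent pair of dimensions is rotated by a position-dependent angle, so dot products between rotated Q and K depend only on relative position. In real models this is applied per head to Q and K, and scaling methods like dynamic NTK effectively adjust the `base` frequency.

```python
import numpy as np

def apply_rope(x, base=10000.0):
    # x: [seq_len, d_head], d_head even
    seq_len, d = x.shape
    inv_freq = 1.0 / base ** (np.arange(0, d, 2) / d)        # [d/2]
    theta = np.arange(seq_len)[:, None] * inv_freq[None, :]  # [seq, d/2]
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin   # 2D rotation of each (x1, x2) pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))   # 8 positions, head dim 16
x_rot = apply_rope(x)              # position 0 is unchanged (rotation angle 0)
```

Because rotation preserves vector norms, RoPE changes only the directions of Q and K, never their magnitudes.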

⚠️ RoPE scaling is critical for long-context LLMs. When extending context beyond training length, proper positional encoding scaling prevents the model from "losing" position information in the middle of long sequences.
06 — NORMALIZATION AND RESIDUALS

LayerNorm, Residual Connections, and Training Stability

Deep networks require techniques to prevent gradient vanishing and activation explosion. Modern Transformers use residual connections and LayerNorm to maintain stable gradient flow and activation magnitudes throughout training.

Residual Connections

Output = x + sublayer(x). Allows gradients to flow directly back to earlier layers without being multiplied by small values, preventing vanishing gradients in deep networks.

LayerNorm

Normalize activations across features (not batch) at each sublayer. Prevents activation explosion and keeps training stable.

Pre-norm vs Post-norm

Modern models use Pre-LayerNorm: normalize input BEFORE sublayer. This improves training stability compared to the original post-norm design (normalize AFTER).

Transformer Block in PyTorch (Pre-norm Architecture)

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ffn = SwiGLUFFN(d_model, d_ff)
        self.norm1 = nn.RMSNorm(d_model)  # pre-norm
        self.norm2 = nn.RMSNorm(d_model)

    def forward(self, x, mask=None):
        # Pre-norm: normalize BEFORE each sublayer
        x = x + self.attn(self.norm1(x), mask)  # residual + attention
        x = x + self.ffn(self.norm2(x))         # residual + FFN
        return x

RMSNorm

Root Mean Square Norm: simpler than LayerNorm (no mean centering). Used by Llama, Mistral — slightly faster and empirically equivalent.
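A few lines of NumPy show how little RMSNorm actually does (the function name and shapes are ours):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Scale each row by its root-mean-square; no mean subtraction, no bias
    rms = np.sqrt((x * x).mean(-1, keepdims=True) + eps)
    return x / rms * weight

d_model = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((4, d_model))
w = np.ones(d_model)          # learned gain, initialized to 1
y = rms_norm(x, w)            # each row now has RMS ~= 1
```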

07 — ENCODER, DECODER, OR BOTH

Encoder-Only, Decoder-Only, Encoder-Decoder

Different task requirements call for different architectural patterns. Understanding when to use each is essential for model selection and fine-tuning design.

Architecture Variants Comparison

Architecture    | Masking                 | KV cache | Use cases                       | Examples
Encoder-only    | Bidirectional           | No       | Embeddings, classification, NER | BERT, RoBERTa, DeBERTa
Decoder-only    | Causal (left-to-right)  | Yes      | Text generation, chat, coding   | GPT, Llama, Mistral, Claude
Encoder-decoder | Enc: bidir, Dec: causal | Yes      | Translation, summarization, T2T | T5, FLAN-T5, mT5, Whisper

Modern LLM Landscape

Modern LLMs are almost exclusively decoder-only: simpler to scale, KV cache enables fast autoregressive generation, in-context learning works naturally. Encoder-only models (BERT-family) remain dominant for: embedding generation, token classification, and sentence similarity tasks.

Tools for Architecture Implementation

- transformers (HF) (framework): standard library for all model architectures; essential reference for training and inference.
- PyTorch (framework): build custom layers and architectures; used by all major implementations.
- FlashAttention-2 (optimization): fast attention implementation; reduces memory and compute for long sequences.
- xformers (optimization): memory-efficient transformer implementations; experimental but powerful.
- DeepSpeed (scaling): distributed training for huge models; ZeRO stages enable billion-parameter training.
- FSDP (scaling): PyTorch's native fully sharded data parallel; standard for distributed training.