INFRASTRUCTURE

AI Hardware Guide

H100 vs A100 vs MI300X, cloud vs on-prem, GPU memory math, and picking the right hardware for training and inference.

HBM3e memory bandwidth: the inference bottleneck
NVLink + NVSwitch: the multi-GPU interconnect
TCO over 3 years: the real cost metric
Contents
  1. Why hardware matters
  2. H100 deep dive
  3. A100 vs H100 vs H200
  4. AMD MI300X & alternatives
  5. Cloud vs on-prem
  6. Memory math & sizing
  7. Multi-GPU topology
01 — Fundamentals

Why Hardware Matters for LLMs

Critical insight: LLM inference is memory-bandwidth bound, not compute bound. The bottleneck is moving weights from HBM to compute cores, not doing the multiplications.

Memory bandwidth determines tokens/second: more bandwidth = faster inference. This is why H100 SXM (3.35 TB/s) is roughly 1.7× faster at decode than A100 (2 TB/s) for the same model — the speedup tracks the bandwidth ratio, not the TFLOPS ratio.
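A back-of-envelope check (a sketch, not a benchmark: it assumes every decoded token streams the full weights from HBM once and ignores KV-cache traffic):

```python
def decode_tokens_per_sec_ceiling(params_billions: float,
                                  bytes_per_param: float,
                                  bandwidth_tb_s: float) -> float:
    """Upper bound on single-stream decode speed: each generated token
    must read all model weights from HBM at least once."""
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes

# Llama-3-70B in FP16 (140 GB of weights):
for gpu, bw in [("A100", 2.0), ("H100 SXM", 3.35), ("H200", 4.8)]:
    print(f"{gpu}: ~{decode_tokens_per_sec_ceiling(70, 2, bw):.0f} tok/s ceiling")
# A100 ≈ 14, H100 ≈ 24, H200 ≈ 34 — the H100/A100 ratio is the 1.7× above
```

Real serving stacks batch many requests, so aggregate throughput is far higher, but the per-stream ceiling scales with TB/s exactly as shown.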

VRAM determines maximum model size: model weights + KV cache + activations must fit in GPU memory. An A100 (80 GB) can serve roughly a 35B-parameter model in FP16 once KV cache and overhead are accounted for; H100 has the same capacity but more bandwidth.

GPU            | VRAM         | Memory BW | FP16 TFLOPS | FP8 TFLOPS | TDP   | Street price
A100 SXM 80GB  | 80 GB HBM2e  | 2 TB/s    | 312         | n/a        | 400 W | ~$25K
H100 SXM 80GB  | 80 GB HBM3   | 3.35 TB/s | 989         | 1979       | 700 W | ~$30K
H200 SXM 141GB | 141 GB HBM3e | 4.8 TB/s  | 989         | 1979       | 700 W | ~$35K
A10G 24GB      | 24 GB GDDR6  | 600 GB/s  | 125         | n/a        | 150 W | ~$3.5K
L4 24GB        | 24 GB GDDR6  | 300 GB/s  | 242         | 485        | 72 W  | ~$2.5K
L40S 48GB      | 48 GB GDDR6  | 864 GB/s  | 362         | 733        | 350 W | ~$10K
💡 Memory bandwidth is the real bottleneck for inference. A lower-cost GPU with higher bandwidth can outperform a more expensive GPU with lower bandwidth on serving latency. Always check TB/s, not just TFLOPS.
02 — NVIDIA's Flagship

H100 Deep Dive

Three SKUs: H100 SXM (fastest, only in DGX/HGX nodes, NVLink), H100 PCIe (fits standard servers, slower), H100 NVL (dual-GPU module).

Key Features

Transformer Engine: hardware-accelerated FP8 matrix multiply — 2× throughput vs FP16 at near-identical quality with per-tensor scaling.

NVLink 4.0: 900 GB/s GPU-to-GPU bandwidth within a node (8 GPUs). Critical for tensor parallelism.

FlashAttention-3 optimized: uses async tensor cores + warp specialization for maximum utilization on Hopper architecture.

H100 SXM is the standard for serious training and high-throughput inference. H100 PCIe is 30–40% slower due to its lower power limit (350 W vs 700 W) and PCIe interconnect — avoid it for large models or high QPS.
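The per-tensor scaling idea behind the Transformer Engine can be sketched in plain Python. This is a simplified simulation with a hypothetical `fp8_e4m3_quantize` helper — it keeps 3 mantissa bits and ignores E4M3 denormals and saturation handling, so it is illustrative only:

```python
import math

def fp8_e4m3_quantize(values, fp8_max=448.0):
    """Per-tensor scaling sketch: map the tensor's largest |value| onto
    the FP8 E4M3 dynamic range, then keep ~3 bits of mantissa."""
    amax = max(abs(v) for v in values)
    scale = amax / fp8_max                # one scale factor per tensor
    out = []
    for v in values:
        x = v / scale                     # now within [-448, 448]
        if x != 0.0:
            e = math.floor(math.log2(abs(x)))
            step = 2.0 ** (e - 3)         # 3 mantissa bits of resolution
            x = round(x / step) * step
        out.append(x)
    return out, scale

weights = [0.013, -0.42, 3.1, -0.0007]
q, scale = fp8_e4m3_quantize(weights)
restored = [x * scale for x in q]
worst = max(abs(a - b) / abs(a) for a, b in zip(weights, restored))
print(f"worst relative error: {worst:.4f}")  # bounded by 2**-4 = 6.25%
```

The single per-tensor scale is why FP8 preserves quality: values are rescaled into the representable range before rounding, so only mantissa precision is lost, not dynamic range.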
03 — Selection Guide

A100 vs H100 vs H200

GPU       | Best for                              | Avoid when
A100 80GB | Cost-effective inference, fine-tuning | Frontier training, FP8 needed
H100 SXM  | Training, high-QPS inference, FP8     | Budget constrained
H200      | Very large models needing 141 GB VRAM | You don't need the extra memory
A10G/L4   | Dev/staging, small model serving      | Any large model
L40S      | Inference (no NVLink), rendering      | Multi-GPU training

H200 vs H100: The Decision Point

H200 vs H100: same compute (Hopper architecture), but 141 GB HBM3e vs 80 GB HBM3, and 4.8 TB/s vs 3.35 TB/s bandwidth. An 8×H200 node (1,128 GB) holds Llama 3 405B in BF16 (~810 GB of weights), while an 8×H100 node (640 GB) cannot — you need two nodes or quantization.

⚠️ The H200's main advantage is VRAM capacity, not compute. For models that fit in H100's 80 GB, the performance difference is bandwidth only (~43% faster). Calculate whether you actually need 141 GB before paying the premium.
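A quick way to run that calculation yourself (the `gpus_needed` helper and its 80% usable-VRAM headroom are illustrative assumptions, not vendor guidance):

```python
import math

def gpus_needed(params_billions: float, bytes_per_param: float,
                vram_gb: float, usable_fraction: float = 0.8) -> int:
    """GPUs required just to hold the weights; usable_fraction leaves
    headroom for KV cache, activations, and framework overhead."""
    weights_gb = params_billions * bytes_per_param
    return math.ceil(weights_gb / (vram_gb * usable_fraction))

# Llama 3 405B in BF16 (~810 GB of weights):
print(gpus_needed(405, 2, 80))    # H100 80GB  -> 13 GPUs (two 8-GPU nodes)
print(gpus_needed(405, 2, 141))   # H200 141GB -> 8 GPUs (one node)
```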
04 — Competition

AMD MI300X and Alternatives

AMD MI300X

192 GB HBM3 (the largest VRAM of any data-center GPU), 5.3 TB/s memory bandwidth, and FP16 matrix throughput that is competitive with H100 on paper; since inference is usually bandwidth-bound anyway, the 5.3 TB/s often matters more than peak TFLOPS.

MI300X advantage: fits Llama 3 405B entirely in one 8-GPU node (8 × 192 GB = 1.5 TB). No model parallelism needed.
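The arithmetic behind that claim, with an assumed ~20% allowance for KV cache and overhead:

```python
# Llama 3 405B in FP16: 810 GB of weights + ~20% for KV cache/overhead
llama_405b_fp16_gb = 405 * 2 * 1.2   # = 972 GB

for node, gb in [("8x H100 80GB", 8 * 80),
                 ("8x H200 141GB", 8 * 141),
                 ("8x MI300X 192GB", 8 * 192)]:
    verdict = "fits" if gb >= llama_405b_fp16_gb else "does not fit"
    print(f"{node}: {gb} GB total -> {verdict}")
```

Only the H100 node falls short; MI300X fits it with the most room to spare, which is what removes the need for cross-node model parallelism.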

Software maturity: ROCm is improving but still lags CUDA ecosystem. Some libraries (FlashAttention, some kernels) are CUDA-only or slower on ROCm.

Aspect                | NVIDIA H100             | AMD MI300X             | Google TPU v5p
VRAM                  | 80 GB                   | 192 GB                 | 96 GB
Memory BW             | 3.35 TB/s               | 5.3 TB/s               | 2.8 TB/s
Software              | CUDA (excellent)        | ROCm (good)            | XLA/JAX (excellent)
Availability          | Cloud + on-prem         | Cloud + on-prem        | Google Cloud only
Large model inference | Needs model parallelism | Often fits in one node | Needs model parallelism

Google TPUs: TPU v5p is competitive for large training runs. Only available on Google Cloud. Excellent for JAX/XLA workloads (Gemma training).

05 — Deployment Model

Cloud vs On-Premises

Cloud

No upfront cost, instant scaling, latest hardware (H100/H200/MI300X available), no maintenance. Pay per hour.

On-Prem

3-year TCO often 2–5× cheaper than cloud at sustained utilization (>60%), data sovereignty, no egress fees, customizable networking.

Break-Even Analysis

8×H100 server ~$400K + $50K/year opex. Cloud equivalent: ~$32/hr × 8,760 hr/year ≈ $280K/year. Break-even ≈ $400K / ($280K − $50K) ≈ 21 months at 100% utilization.
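The same break-even arithmetic, parameterized by utilization (the `breakeven_months` helper is a sketch; it ignores financing, power-price changes, and hardware resale value):

```python
def breakeven_months(capex: float, opex_per_year: float,
                     cloud_per_hour: float, utilization: float = 1.0) -> float:
    """Months until owning beats renting. Cloud cost accrues only for
    hours actually used; on-prem opex accrues regardless."""
    cloud_per_year = cloud_per_hour * 8760 * utilization
    savings_per_year = cloud_per_year - opex_per_year
    if savings_per_year <= 0:
        return float("inf")   # at this utilization, renting never loses
    return capex / savings_per_year * 12

print(f"{breakeven_months(400_000, 50_000, 32.0):.1f}")       # 20.8 months at 100%
print(f"{breakeven_months(400_000, 50_000, 32.0, 0.5):.1f}")  # 53.2 months at 50%
```

Note how sensitive the result is to utilization: at 50% the payback period more than doubles, which is why bursty workloads usually belong in the cloud.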

Instance         | GPU       | GPUs | $/hr | Provider
p4d.24xlarge     | A100 40GB | 8    | $32  | AWS
p4de.24xlarge    | A100 80GB | 8    | $40  | AWS
p5.48xlarge      | H100 80GB | 8    | $98  | AWS
a3-highgpu-8g    | H100 80GB | 8    | $98  | GCP
Standard_ND96asr | A100 80GB | 8    | $36  | Azure
H100 x8          | H100 80GB | 8    | $32  | Lambda Labs
Lambda Labs and CoreWeave often have H100 capacity at $2–4/GPU/hr vs $12/GPU/hr on major clouds. For batch training jobs that aren't latency-sensitive, these GPU clouds are 3–5× cheaper.
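For a concrete (hypothetical) job, using the per-GPU rates implied by the table above ($98/8 ≈ $12.25 on a hyperscaler vs $32/8 = $4.00 on a GPU cloud):

```python
def job_cost(n_gpus: int, hours: float, rate_per_gpu_hr: float) -> float:
    """Total cost of a fixed-length batch job: GPU-hours × hourly rate."""
    return n_gpus * hours * rate_per_gpu_hr

# A 72-hour fine-tuning run on 8 GPUs (576 GPU-hours):
print(f"Hyperscaler @ $12.25/GPU-hr: ${job_cost(8, 72, 12.25):,.0f}")  # $7,056
print(f"GPU cloud   @  $4.00/GPU-hr: ${job_cost(8, 72, 4.00):,.0f}")   # $2,304
```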
06 — Sizing Deployments

Memory Math: Sizing Your Deployment

Three Components

Model weights: parameters × bytes per parameter (FP16 = 2, INT4 = 0.5)

KV cache: 2 × layers × heads × head_dim × seq_len × batch_size × 2 bytes (FP16)

Rule of thumb: for inference at max batch, budget 1.5–2× model weight size for KV cache + activations

Memory Calculator

def estimate_vram(params_B: float, dtype_bytes: float, seq_len: int,
                  batch_size: int, n_layers: int, n_kv_heads: int,
                  head_dim: int) -> float:
    weights_gb = params_B * 1e9 * dtype_bytes / 1e9
    # KV cache is stored in FP16 (2 bytes) regardless of weight precision
    kv_cache_gb = (2 * n_layers * n_kv_heads * head_dim
                   * seq_len * batch_size * 2) / 1e9
    overhead_gb = weights_gb * 0.2  # activations, framework overhead
    return weights_gb + kv_cache_gb + overhead_gb

# Llama 3.1 70B in INT4, batch=4, seq=4096
# n_layers=80, n_kv_heads=8 (GQA, not the 64 attention heads), head_dim=128
vram = estimate_vram(70, 0.5, 4096, 4, 80, 8, 128)
print(f"Estimated: {vram:.1f} GB")  # ~47 GB — fits in 1×H100

# Same model in FP16:
vram_fp16 = estimate_vram(70, 2.0, 4096, 4, 80, 8, 128)
print(f"FP16: {vram_fp16:.1f} GB")  # ~173 GB — needs 3×H100 or 2×H200

Key takeaway: Quantization (INT8, INT4) dramatically reduces VRAM footprint — INT8 halves and INT4 quarters the weight memory relative to FP16, often turning a multi-GPU deployment into a single-GPU one.

07 — Distributed Architecture

Multi-GPU Topology and Interconnects

NVLink (NVIDIA): proprietary high-bandwidth GPU-to-GPU interconnect. NVLink 4.0 = 900 GB/s total bidirectional bandwidth per GPU (18 links). NVSwitch allows any-to-any communication at full bandwidth in an 8-GPU node (DGX H100).

PCIe: all GPUs in a standard server connect via PCIe. PCIe 5.0 = 128 GB/s per GPU. Sufficient for data parallelism; limiting for tensor parallelism.

InfiniBand vs Ethernet: for multi-node training, InfiniBand HDR (200 Gb/s) or NDR (400 Gb/s) per port provides low-latency all-reduce. RoCE (RDMA over Converged Ethernet) is the lower-cost alternative.

Topology                    | Bandwidth    | Use case
NVLink 4.0 (intra-node)     | 900 GB/s     | Tensor parallelism across 8 GPUs
PCIe 5.0 (intra-node)       | 128 GB/s     | Data parallelism only
InfiniBand NDR (inter-node) | 400 Gb/s     | Distributed training across nodes
RoCE v2 (inter-node)        | 100–400 Gb/s | Lower-cost alternative to IB
⚠️ Tensor parallelism requires NVLink bandwidth. If you're running tensor parallel across GPUs connected only by PCIe, performance degrades dramatically. Use NVLink-connected nodes for tensor parallel; save PCIe setups for data parallel only.
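To see why the interconnect dominates, here is a rough ring all-reduce estimate (assuming per-direction link bandwidth is half the bidirectional figures quoted above, and ignoring latency terms):

```python
def ring_allreduce_ms(payload_gb: float, n_gpus: int, link_gb_s: float) -> float:
    """Bandwidth-optimal ring all-reduce: each GPU sends and receives
    2*(N-1)/N of the payload over its link; latency is ignored."""
    traffic_gb = 2 * (n_gpus - 1) / n_gpus * payload_gb
    return traffic_gb / link_gb_s * 1e3

# All-reduce 1 GB of gradients across 8 GPUs:
print(f"NVLink 4.0 (~450 GB/s per direction):  {ring_allreduce_ms(1, 8, 450):.2f} ms")
print(f"PCIe 5.0 x16 (~64 GB/s per direction): {ring_allreduce_ms(1, 8, 64):.2f} ms")
```

The ~7× gap per collective compounds across every layer of every step, which is why tensor parallelism over PCIe degrades so badly.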

Tools & Frameworks

NVIDIA NGC (containers): optimized containers for training and inference.
vLLM (serving): high-throughput LLM serving with paged attention.
TensorRT-LLM (optimization): NVIDIA's inference optimization library.
DeepSpeed (distributed training): Microsoft's distributed training framework.
Megatron-LM (distributed training): NVIDIA's model parallelism library.
Lambda Labs (GPU cloud): affordable H100 cloud with burst capacity.
CoreWeave (GPU cloud): GPU cloud with competitive H100/MI300X pricing.
AWS p5 / p5e (cloud): AWS's latest H100/H200 instances with NVLink.
Python · GPU memory calculator: will your model fit?
def estimate_vram_gb(
    params_billions: float,
    precision: str = "float16",
    batch_size: int = 1,
    seq_len: int = 2048,
    kv_heads: int = 8,
    head_dim: int = 128,
    num_layers: int = 32,
    safety_margin: float = 1.2
) -> dict:
    """Estimate VRAM needed for LLM inference."""
    bytes_per_param = {"float32": 4, "float16": 2, "bfloat16": 2,
                       "int8": 1, "int4": 0.5}.get(precision, 2)

    # Model weights
    model_gb = params_billions * 1e9 * bytes_per_param / 1e9

    # KV cache: 2 (K+V) × layers × kv_heads × head_dim × seq_len × batch × dtype
    # (simplification: KV uses the weight dtype here; real deployments often
    #  keep the KV cache in FP16 even when weights are int4/int8)
    kv_gb = (2 * num_layers * kv_heads * head_dim * seq_len * batch_size
             * bytes_per_param) / 1e9

    # Activations (rough estimate)
    activations_gb = (batch_size * seq_len * 4096 * bytes_per_param) / 1e9

    total = (model_gb + kv_gb + activations_gb) * safety_margin

    return {
        "model_weights_gb": round(model_gb, 2),
        "kv_cache_gb": round(kv_gb, 2),
        "activations_gb": round(activations_gb, 2),
        "total_estimated_gb": round(total, 2),
        "fits_on": [gpu for gpu, mem in
                    [("RTX 4090", 24), ("A10G", 24), ("A100 40GB", 40),
                     ("A100 80GB", 80), ("H100 80GB", 80), ("H100 NVL 94GB", 94)]
                    if mem >= total]
    }

# Common models
for name, params, prec in [
    ("Llama-3-8B fp16",   8,  "float16"),
    ("Llama-3-8B int4",   8,  "int4"),
    ("Llama-3-70B fp16",  70, "float16"),
    ("Llama-3-70B int4",  70, "int4"),
    ("Llama-3-405B int4", 405, "int4"),
]:
    r = estimate_vram_gb(params, prec)
    print(f"{name}: {r['total_estimated_gb']:.1f}GB → fits: {r['fits_on']}")
Python · Multi-GPU memory and bandwidth benchmarking
import torch, time

def benchmark_gpu() -> dict:
    """Measure key GPU specs that matter for LLM workloads."""
    if not torch.cuda.is_available():
        return {"error": "No CUDA GPU found"}

    device = torch.device("cuda:0")
    props = torch.cuda.get_device_properties(device)

    # Memory bandwidth test (crucial for LLM inference — memory-bound)
    size = 1024 * 1024 * 256  # 256M elements × 2 bytes (fp16) = 0.5 GB
    x = torch.randn(size, device=device, dtype=torch.float16)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(10):
        y = x * 2.0     # single elementwise op — one read + one write per element
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    bytes_moved = size * 2 * 2 * 10  # read + write, float16=2 bytes, 10 iters
    bandwidth_gb_s = bytes_moved / elapsed / 1e9

    # FLOPS test (matters more for prefill than decode)
    A = torch.randn(4096, 4096, device=device, dtype=torch.float16)
    B = torch.randn(4096, 4096, device=device, dtype=torch.float16)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        C = A @ B
    torch.cuda.synchronize()
    flops_elapsed = time.perf_counter() - start
    tflops = (2 * 4096**3 * 100) / flops_elapsed / 1e12

    return {
        "gpu": props.name,
        "vram_gb": round(props.total_memory / 1e9, 1),
        "memory_bandwidth_gb_s": round(bandwidth_gb_s, 1),
        "compute_tflops_fp16": round(tflops, 1),
        "sm_count": props.multi_processor_count,
    }

stats = benchmark_gpu()
print(stats)
# Spec-sheet peaks for comparison (measured values land somewhat below these):
# H100 SXM: 3,350 GB/s memory BW, 989 dense FP16 TFLOPS
# A100 SXM: 2,000 GB/s memory BW, 312 dense FP16 TFLOPS