H100 vs A100 vs MI300X, cloud vs on-prem, GPU memory math, and picking the right hardware for training and inference.
Critical insight: LLM inference is memory-bandwidth bound, not compute bound. The bottleneck is moving weights from HBM to compute cores, not doing the multiplications.
Memory bandwidth determines tokens/second: more bandwidth = faster decode. This is why H100 SXM (3.35 TB/s) is roughly 1.7× faster at batch-1 inference than A100 (2 TB/s) for the same model: the speedup tracks the bandwidth ratio, not the FLOPS ratio.
VRAM determines maximum model size: model weights + KV cache + activations must fit in GPU memory. An A100 (80 GB) can hold the FP16 weights of a ~40B model, but only a ~35B model leaves headroom for KV cache; H100 has the same 80 GB capacity but far more bandwidth.
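A back-of-envelope consequence: during decode, every generated token must stream all the weights from HBM, so peak bandwidth divided by model size gives an upper bound on single-stream tokens/second. A rough sketch (it ignores KV-cache traffic and kernel overhead, so real numbers land below these):

```python
def max_decode_tokens_per_s(params_b: float, bytes_per_param: float,
                            bandwidth_tb_s: float) -> float:
    """Roofline bound: tokens/s <= bandwidth / bytes of weights read per token."""
    model_bytes = params_b * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / model_bytes

# Llama-3-70B in FP16 (140 GB of weights, spread across the GPUs):
for gpu, bw in [("A100", 2.0), ("H100 SXM", 3.35), ("H200", 4.8)]:
    print(f"{gpu}: ~{max_decode_tokens_per_s(70, 2, bw):.0f} tok/s upper bound")
```

The same arithmetic shows why quantization speeds up decode: INT4 weights move 4× fewer bytes per token, so the bound rises 4×.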
| GPU | VRAM | Memory BW | FP16 TFLOPS | FP8 TFLOPS | TDP | Street price |
|---|---|---|---|---|---|---|
| A100 SXM 80GB | 80 GB HBM2e | 2 TB/s | 312 | — | 400W | ~$25K |
| H100 SXM 80GB | 80 GB HBM3 | 3.35 TB/s | 989 | 1979 | 700W | ~$30K |
| H200 SXM 141GB | 141 GB HBM3e | 4.8 TB/s | 989 | 1979 | 700W | ~$35K |
| A10G 24GB | 24 GB GDDR6 | 600 GB/s | 125 | — | 300W | ~$3.5K |
| L4 24GB | 24 GB GDDR6 | 300 GB/s | 121 | 242 | 72W | ~$2.5K |
| L40S 48GB | 48 GB GDDR6 | 864 GB/s | 362 | 733 | 350W | ~$10K |
Three SKUs: H100 SXM (fastest, only in DGX/HGX nodes, NVLink), H100 PCIe (fits standard servers, slower), H100 NVL (dual-GPU module).
Transformer Engine: hardware-accelerated FP8 matrix multiply — 2× throughput vs FP16 at near-identical quality with per-tensor scaling.
NVLink 4.0: 900 GB/s GPU-to-GPU bandwidth within a node (8 GPUs). Critical for tensor parallelism.
FlashAttention-3 optimized: uses async tensor cores + warp specialization for maximum utilization on Hopper architecture.
| GPU | Best for | Avoid when |
|---|---|---|
| A100 80GB | Cost-effective inference, fine-tuning | Frontier training, FP8 needed |
| H100 SXM | Training, high-QPS inference, FP8 | Budget constrained |
| H200 | Very large models needing 141GB VRAM | Don't need the extra memory |
| A10G/L4 | Dev/staging, small model serving | Any large model |
| L40S | Inference (no NVLink), rendering | Multi-GPU training |
H200 vs H100: same compute (Hopper architecture), but 141 GB HBM3e vs 80 GB HBM3, and 4.8 TB/s vs 3.35 TB/s bandwidth. Llama 3 405B in BF16 needs ~810 GB for weights alone: it fits on a single 8×H200 node (1,128 GB) but not on 8×H100 (640 GB), which requires FP8 weights or a second node.
AMD MI300X: 192 GB HBM3 (largest VRAM of any current datacenter GPU), 5.3 TB/s memory bandwidth, and strong FP16 throughput (~1,307 TFLOPS theoretical vs H100's 989; memory bandwidth, not compute, usually limits inference anyway).
MI300X advantage: fits Llama 3 405B entirely in one 8-GPU node (8 × 192 GB = 1.5 TB). No model parallelism needed.
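Whether a model "fits" reduces to simple division. A quick sketch (the 0.8 headroom factor, reserving 20% of VRAM for KV cache and activations, is an assumption, not a vendor figure):

```python
import math

def min_gpus_for_weights(params_b: float, bytes_per_param: float,
                         vram_gb: float, headroom: float = 0.8) -> int:
    """GPUs needed just to hold the weights, reserving (1 - headroom) of each
    GPU's VRAM for KV cache/activations. A rough sketch; real deployments
    also pay parallelism overheads."""
    weights_gb = params_b * bytes_per_param
    return math.ceil(weights_gb / (vram_gb * headroom))

# Llama 3 405B in BF16 ≈ 810 GB of weights:
print(min_gpus_for_weights(405, 2, 80))   # H100 80GB → 13 (needs 2 nodes)
print(min_gpus_for_weights(405, 2, 141))  # H200 → 8 (one HGX node)
print(min_gpus_for_weights(405, 2, 192))  # MI300X → 6 (fits one node easily)
```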
Software maturity: ROCm is improving but still lags CUDA ecosystem. Some libraries (FlashAttention, some kernels) are CUDA-only or slower on ROCm.
| Aspect | NVIDIA H100 | AMD MI300X | Google TPU v5p |
|---|---|---|---|
| VRAM | 80 GB | 192 GB | 96 GB |
| Memory BW | 3.35 TB/s | 5.3 TB/s | 2.8 TB/s |
| Software | CUDA (excellent) | ROCm (good) | XLA/JAX (excellent) |
| Availability | Cloud + on-prem | Cloud + on-prem | Google Cloud only |
| Large model inference | Needs model parallelism | Often fits in one node | Needs model parallelism |
Google TPUs: TPU v5p is competitive for large training runs. Only available on Google Cloud. Excellent for JAX/XLA workloads (Gemma training).
Cloud: no upfront cost, instant scaling, latest hardware (H100/H200/MI300X available), no maintenance. Pay per hour.
On-prem: 3-year TCO often 2–5× cheaper than cloud at sustained utilization (>60%), data sovereignty, no egress fees, customizable networking.
8×H100 server ~$400K + $50K/year opex. Cloud equivalent: ~$32/hr × 8,760 hr/year ≈ $280K/year. Break-even ≈ 21 months at 100% utilization: $400K capex ÷ ($280K/yr cloud minus $50K/yr opex) ≈ 1.74 years.
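The break-even arithmetic generalizes to any capex/opex/rate combination. A small sketch (a deliberately linear model: it ignores depreciation, power pricing, financing, and hardware refresh cycles):

```python
def breakeven_months(capex: float, opex_per_year: float,
                     cloud_per_hour: float, utilization: float = 1.0) -> float:
    """Months until cumulative cloud spend exceeds on-prem capex + opex."""
    cloud_per_year = cloud_per_hour * 8760 * utilization
    if cloud_per_year <= opex_per_year:
        return float("inf")  # cloud never catches up at this utilization
    return 12 * capex / (cloud_per_year - opex_per_year)

print(f"{breakeven_months(400_000, 50_000, 32):.0f} months at 100% utilization")
print(f"{breakeven_months(400_000, 50_000, 32, 0.5):.0f} months at 50%")
```

Halving utilization more than doubles the break-even point, which is why the on-prem case only holds at sustained high utilization.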
| Instance | GPU | GPUs | $/hr | Provider |
|---|---|---|---|---|
| p4d.24xlarge | A100 40GB | 8 | $32 | AWS |
| p4de.24xlarge | A100 80GB | 8 | $40 | AWS |
| p5.48xlarge | H100 80GB | 8 | $98 | AWS |
| a3-highgpu-8g | H100 80GB | 8 | $98 | GCP |
| Standard_ND96amsr_A100_v4 | A100 80GB | 8 | $36 | Azure |
| H100 x8 | H100 80GB | 8 | $32 | Lambda Labs |
Model weights: parameters × bytes per parameter (FP16 = 2, INT4 = 0.5)
KV cache: 2 × layers × kv_heads × head_dim × seq_len × batch_size × 2 bytes (FP16 cache; kv_heads, not query heads, for GQA models)
Rule of thumb: for inference at max batch, budget 1.5–2× model weight size for KV cache + activations
Key takeaway: Quantization (INT4, INT8) dramatically reduces VRAM footprint — often making 2× more models fit in the same hardware.
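To make the KV-cache formula concrete, here is the arithmetic for Llama-3-70B's published geometry (80 layers, 8 KV heads via GQA, head_dim 128, FP16 cache):

```python
# Worked example of the KV-cache formula above for Llama-3-70B.
layers, kv_heads, head_dim, kv_bytes = 80, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * kv_bytes  # K + V
print(per_token)  # 327,680 bytes ≈ 0.31 MB per token of context

seq_len, batch = 8192, 32
cache_gb = per_token * seq_len * batch / 1e9
print(f"{cache_gb:.1f} GB")  # ≈ 85.9 GB — more than one H100's entire VRAM
```

This is why serving stacks obsess over KV-cache management (paging, quantized caches): at long context and high batch, the cache dwarfs everything except the weights themselves.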
NVLink (NVIDIA): proprietary high-bandwidth GPU-to-GPU interconnect. NVLink 4.0 = 900 GB/s bidirectional per GPU pair. NVSwitch allows any-to-any at full bandwidth in an 8-GPU node (DGX H100).
PCIe: all GPUs in a standard server connect via PCIe. PCIe 5.0 = 128 GB/s per GPU. Sufficient for data parallelism; limiting for tensor parallelism.
InfiniBand vs Ethernet: for multi-node training, InfiniBand HDR (200 Gb/s) or NDR (400 Gb/s) per link provides low-latency all-reduce. RoCE (RDMA over Converged Ethernet) is the lower-cost alternative.
| Topology | Bandwidth | Use case |
|---|---|---|
| NVLink 4.0 (intra-node) | 900 GB/s | Tensor parallelism across 8 GPUs |
| PCIe 5.0 (intra-node) | 128 GB/s | Data parallelism only |
| InfiniBand NDR (inter-node) | 400 Gb/s | Distributed training across nodes |
| RoCE v2 (inter-node) | 100–400 Gb/s | Lower-cost alternative to IB |
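The table's bandwidth numbers translate directly into gradient-sync time. A minimal estimate using the standard ring all-reduce cost model, in which each rank moves 2(N-1)/N of the data over its slowest link (latency and compute overlap are ignored):

```python
def allreduce_seconds(grad_gb: float, n_gpus: int, link_gb_s: float) -> float:
    """Ring all-reduce lower bound: each rank sends and receives
    2*(N-1)/N of the payload over its link."""
    return 2 * (n_gpus - 1) / n_gpus * grad_gb / link_gb_s

# Syncing 16 GB of FP16 gradients (an 8B-param model) across 8 GPUs:
for name, bw in [("NVLink 4.0", 900), ("PCIe 5.0", 128),
                 ("InfiniBand NDR", 50)]:  # 400 Gb/s = 50 GB/s
    print(f"{name}: {allreduce_seconds(16, 8, bw) * 1000:.0f} ms per step")
```

The ~7× gap between NVLink and PCIe is why tensor parallelism, which synchronizes far more often than data parallelism, is confined to NVLink domains.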
```python
def estimate_vram_gb(
    params_billions: float,
    precision: str = "float16",
    batch_size: int = 1,
    seq_len: int = 2048,
    kv_heads: int = 8,
    head_dim: int = 128,
    num_layers: int = 32,
    safety_margin: float = 1.2,
) -> dict:
    """Estimate VRAM needed for LLM inference.

    Defaults approximate Llama-3-8B geometry; pass real layer/head counts
    for other models.
    """
    bytes_per_param = {"float32": 4, "float16": 2, "bfloat16": 2,
                       "int8": 1, "int4": 0.5}.get(precision, 2)
    # Model weights
    model_gb = params_billions * 1e9 * bytes_per_param / 1e9
    # KV cache: 2 (K+V) × layers × kv_heads × head_dim × seq_len × batch.
    # The cache normally stays FP16 even when weights are quantized.
    kv_bytes = 2
    kv_gb = (2 * num_layers * kv_heads * head_dim * seq_len * batch_size
             * kv_bytes) / 1e9
    # Activations (rough estimate, FP16)
    activations_gb = (batch_size * seq_len * 4096 * 2) / 1e9
    total = (model_gb + kv_gb + activations_gb) * safety_margin
    return {
        "model_weights_gb": round(model_gb, 2),
        "kv_cache_gb": round(kv_gb, 2),
        "activations_gb": round(activations_gb, 2),
        "total_estimated_gb": round(total, 2),
        "fits_on": [gpu for gpu, mem in
                    [("RTX 4090", 24), ("A10G", 24), ("A100 40GB", 40),
                     ("A100 80GB", 80), ("H100 80GB", 80), ("H100 NVL 94GB", 94)]
                    if mem >= total],
    }

# Common models
for name, params, prec in [
    ("Llama-3-8B fp16", 8, "float16"),
    ("Llama-3-8B int4", 8, "int4"),
    ("Llama-3-70B fp16", 70, "float16"),
    ("Llama-3-70B int4", 70, "int4"),
    ("Llama-3-405B int4", 405, "int4"),
]:
    r = estimate_vram_gb(params, prec)
    print(f"{name}: {r['total_estimated_gb']:.1f}GB → fits: {r['fits_on']}")
```
```python
import time

import torch

def benchmark_gpu() -> dict:
    """Measure key GPU specs that matter for LLM workloads."""
    if not torch.cuda.is_available():
        return {"error": "No CUDA GPU found"}
    device = torch.device("cuda:0")
    props = torch.cuda.get_device_properties(device)
    # Memory bandwidth test (crucial for LLM inference — memory-bound)
    size = 1024 * 1024 * 256  # 256M float16 elements = 0.5 GB
    x = torch.randn(size, device=device, dtype=torch.float16)
    y = x * 2.0 + 1.0  # warm-up so kernel launch isn't timed
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(10):
        y = x * 2.0 + 1.0  # elementwise — bandwidth-bound
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    bytes_moved = size * 2 * 2 * 10  # read + write, float16 = 2 bytes, 10 iters
    bandwidth_gb_s = bytes_moved / elapsed / 1e9
    # FLOPS test (matters more for prefill than decode)
    A = torch.randn(4096, 4096, device=device, dtype=torch.float16)
    B = torch.randn(4096, 4096, device=device, dtype=torch.float16)
    C = A @ B  # warm-up
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        C = A @ B
    torch.cuda.synchronize()
    flops_elapsed = time.perf_counter() - start
    tflops = (2 * 4096**3 * 100) / flops_elapsed / 1e12
    return {
        "gpu": props.name,
        "vram_gb": round(props.total_memory / 1e9, 1),
        "memory_bandwidth_gb_s": round(bandwidth_gb_s, 1),
        "compute_tflops_fp16": round(tflops, 1),
        "sm_count": props.multi_processor_count,
    }

stats = benchmark_gpu()
print(stats)
# Datasheet peaks for reference (measured results land somewhat below these):
# H100 SXM: 3350 GB/s, 989 FP16 TFLOPS; A100 SXM: 2000 GB/s, 312 FP16 TFLOPS
```