H100 vs A100 vs MI300X, cloud vs on-prem, GPU memory math, and picking the right hardware for training and inference.
Critical insight: LLM inference is memory-bandwidth bound, not compute bound. The bottleneck is moving weights from HBM to compute cores, not doing the multiplications.
Memory bandwidth determines tokens/second: more bandwidth = faster decode. This is why H100 SXM (3.35 TB/s) is roughly 1.7× faster at batch-1 inference than A100 (2 TB/s) for the same model: the speedup tracks the bandwidth ratio, not the FLOPS ratio.
VRAM determines maximum model size: model weights + KV cache + activations must fit in GPU memory. An A100 (80 GB) can hold the FP16 weights of a ~40B model, but only a ~35B model leaves headroom for KV cache; H100 has the same 80 GB capacity but far more bandwidth.
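A back-of-envelope consequence: during decode, every generated token must stream all the weights from HBM, so peak bandwidth divided by model size gives an upper bound on single-stream tokens/second. A rough sketch (it ignores KV-cache traffic and kernel overhead, so real numbers land below these):

```python
def max_decode_tokens_per_s(params_b: float, bytes_per_param: float,
                            bandwidth_tb_s: float) -> float:
    """Roofline bound: tokens/s <= bandwidth / bytes of weights read per token."""
    model_bytes = params_b * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / model_bytes

# Llama-3-70B in FP16 (140 GB of weights, spread across the GPUs):
for gpu, bw in [("A100", 2.0), ("H100 SXM", 3.35), ("H200", 4.8)]:
    print(f"{gpu}: ~{max_decode_tokens_per_s(70, 2, bw):.0f} tok/s upper bound")
```

The same arithmetic shows why quantization speeds up decode: INT4 weights move 4× fewer bytes per token, so the bound rises 4×.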
| GPU | VRAM | Memory BW | FP16 TFLOPS | FP8 TFLOPS | TDP | Street price |
|---|---|---|---|---|---|---|
| A100 SXM 80GB | 80 GB HBM2e | 2 TB/s | 312 | — | 400W | ~$25K |
| H100 SXM 80GB | 80 GB HBM3 | 3.35 TB/s | 989 | 1979 | 700W | ~$30K |
| H200 SXM 141GB | 141 GB HBM3e | 4.8 TB/s | 989 | 1979 | 700W | ~$35K |
| A10G 24GB | 24 GB GDDR6 | 600 GB/s | 125 | — | 300W | ~$3.5K |
| L4 24GB | 24 GB GDDR6 | 300 GB/s | 121 | 242 | 72W | ~$2.5K |
| L40S 48GB | 48 GB GDDR6 | 864 GB/s | 362 | 733 | 350W | ~$10K |
Three SKUs: H100 SXM (fastest, only in DGX/HGX nodes, NVLink), H100 PCIe (fits standard servers, slower), H100 NVL (dual-GPU module).
Transformer Engine: hardware-accelerated FP8 matrix multiply — 2× throughput vs FP16 at near-identical quality with per-tensor scaling.
NVLink 4.0: 900 GB/s GPU-to-GPU bandwidth within a node (8 GPUs). Critical for tensor parallelism.
FlashAttention-3 optimized: uses async tensor cores + warp specialization for maximum utilization on Hopper architecture.
| GPU | Best for | Avoid when |
|---|---|---|
| A100 80GB | Cost-effective inference, fine-tuning | Frontier training, FP8 needed |
| H100 SXM | Training, high-QPS inference, FP8 | Budget constrained |
| H200 | Very large models needing 141GB VRAM | Don't need the extra memory |
| A10G/L4 | Dev/staging, small model serving | Any large model |
| L40S | Inference (no NVLink), rendering | Multi-GPU training |
H200 vs H100: same compute (Hopper architecture), but 141 GB HBM3e vs 80 GB HBM3, and 4.8 TB/s vs 3.35 TB/s bandwidth. Llama 3 405B in BF16 needs ~810 GB for weights alone: it fits on a single 8×H200 node (1,128 GB) but not on 8×H100 (640 GB), which requires FP8 weights or a second node.
AMD MI300X: 192 GB HBM3 (largest VRAM of any current datacenter GPU), 5.3 TB/s memory bandwidth, and strong FP16 throughput (~1,307 TFLOPS theoretical vs H100's 989; memory bandwidth, not compute, usually limits inference anyway).
MI300X advantage: fits Llama 3 405B entirely in one 8-GPU node (8 × 192 GB = 1.5 TB). No model parallelism needed.
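Whether a model "fits" reduces to simple division. A quick sketch (the 0.8 headroom factor, reserving 20% of VRAM for KV cache and activations, is an assumption, not a vendor figure):

```python
import math

def min_gpus_for_weights(params_b: float, bytes_per_param: float,
                         vram_gb: float, headroom: float = 0.8) -> int:
    """GPUs needed just to hold the weights, reserving (1 - headroom) of each
    GPU's VRAM for KV cache/activations. A rough sketch; real deployments
    also pay parallelism overheads."""
    weights_gb = params_b * bytes_per_param
    return math.ceil(weights_gb / (vram_gb * headroom))

# Llama 3 405B in BF16 ≈ 810 GB of weights:
print(min_gpus_for_weights(405, 2, 80))   # H100 80GB → 13 (needs 2 nodes)
print(min_gpus_for_weights(405, 2, 141))  # H200 → 8 (one HGX node)
print(min_gpus_for_weights(405, 2, 192))  # MI300X → 6 (fits one node easily)
```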
Software maturity: ROCm is improving but still lags CUDA ecosystem. Some libraries (FlashAttention, some kernels) are CUDA-only or slower on ROCm.
| Aspect | NVIDIA H100 | AMD MI300X | Google TPU v5p |
|---|---|---|---|
| VRAM | 80 GB | 192 GB | 96 GB |
| Memory BW | 3.35 TB/s | 5.3 TB/s | 2.8 TB/s |
| Software | CUDA (excellent) | ROCm (good) | XLA/JAX (excellent) |
| Availability | Cloud + on-prem | Cloud + on-prem | Google Cloud only |
| Large model inference | Needs model parallelism | Often fits in one node | Needs model parallelism |
Google TPUs: TPU v5p is competitive for large training runs. Only available on Google Cloud. Excellent for JAX/XLA workloads (Gemma training).
Cloud: no upfront cost, instant scaling, latest hardware (H100/H200/MI300X available), no maintenance. Pay per hour.
On-prem: 3-year TCO often 2–5× cheaper than cloud at sustained utilization (>60%), data sovereignty, no egress fees, customizable networking.
8×H100 server ~$400K + $50K/year opex. Cloud equivalent: ~$32/hr × 8,760 hr/year ≈ $280K/year. Break-even ≈ 21 months at 100% utilization: $400K capex ÷ ($280K/yr cloud minus $50K/yr opex) ≈ 1.74 years.
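The break-even arithmetic generalizes to any capex/opex/rate combination. A small sketch (a deliberately linear model: it ignores depreciation, power pricing, financing, and hardware refresh cycles):

```python
def breakeven_months(capex: float, opex_per_year: float,
                     cloud_per_hour: float, utilization: float = 1.0) -> float:
    """Months until cumulative cloud spend exceeds on-prem capex + opex."""
    cloud_per_year = cloud_per_hour * 8760 * utilization
    if cloud_per_year <= opex_per_year:
        return float("inf")  # cloud never catches up at this utilization
    return 12 * capex / (cloud_per_year - opex_per_year)

print(f"{breakeven_months(400_000, 50_000, 32):.0f} months at 100% utilization")
print(f"{breakeven_months(400_000, 50_000, 32, 0.5):.0f} months at 50%")
```

Halving utilization more than doubles the break-even point, which is why the on-prem case only holds at sustained high utilization.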
| Instance | GPU | GPUs | $/hr | Provider |
|---|---|---|---|---|
| p4d.24xlarge | A100 40GB | 8 | $32 | AWS |
| p4de.24xlarge | A100 80GB | 8 | $40 | AWS |
| p5.48xlarge | H100 80GB | 8 | $98 | AWS |
| a3-highgpu-8g | H100 80GB | 8 | $98 | GCP |
| Standard_ND96amsr_A100_v4 | A100 80GB | 8 | $36 | Azure |
| H100 x8 | H100 80GB | 8 | $32 | Lambda Labs |
Model weights: parameters × bytes per parameter (FP16 = 2, INT4 = 0.5)
KV cache: 2 × layers × kv_heads × head_dim × seq_len × batch_size × 2 bytes (FP16 cache; kv_heads, not query heads, for GQA models)
Rule of thumb: for inference at max batch, budget 1.5–2× model weight size for KV cache + activations
Key takeaway: Quantization (INT4, INT8) dramatically reduces VRAM footprint — often making 2× more models fit in the same hardware.
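To make the KV-cache formula concrete, here is the arithmetic for Llama-3-70B's published geometry (80 layers, 8 KV heads via GQA, head_dim 128, FP16 cache):

```python
# Worked example of the KV-cache formula above for Llama-3-70B.
layers, kv_heads, head_dim, kv_bytes = 80, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * kv_bytes  # K + V
print(per_token)  # 327,680 bytes ≈ 0.31 MB per token of context

seq_len, batch = 8192, 32
cache_gb = per_token * seq_len * batch / 1e9
print(f"{cache_gb:.1f} GB")  # ≈ 85.9 GB — more than one H100's entire VRAM
```

This is why serving stacks obsess over KV-cache management (paging, quantized caches): at long context and high batch, the cache dwarfs everything except the weights themselves.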
NVLink (NVIDIA): proprietary high-bandwidth GPU-to-GPU interconnect. NVLink 4.0 = 900 GB/s bidirectional per GPU pair. NVSwitch allows any-to-any at full bandwidth in an 8-GPU node (DGX H100).
PCIe: all GPUs in a standard server connect via PCIe. PCIe 5.0 = 128 GB/s per GPU. Sufficient for data parallelism; limiting for tensor parallelism.
InfiniBand vs Ethernet: for multi-node training, InfiniBand HDR (200 Gb/s) or NDR (400 Gb/s) per link provides low-latency all-reduce. RoCE (RDMA over Converged Ethernet) is the lower-cost alternative.
| Topology | Bandwidth | Use case |
|---|---|---|
| NVLink 4.0 (intra-node) | 900 GB/s | Tensor parallelism across 8 GPUs |
| PCIe 5.0 (intra-node) | 128 GB/s | Data parallelism only |
| InfiniBand NDR (inter-node) | 400 Gb/s | Distributed training across nodes |
| RoCE v2 (inter-node) | 100–400 Gb/s | Lower-cost alternative to IB |
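The table's bandwidth numbers translate directly into gradient-sync time. A minimal estimate using the standard ring all-reduce cost model, in which each rank moves 2(N-1)/N of the data over its slowest link (latency and compute overlap are ignored):

```python
def allreduce_seconds(grad_gb: float, n_gpus: int, link_gb_s: float) -> float:
    """Ring all-reduce lower bound: each rank sends and receives
    2*(N-1)/N of the payload over its link."""
    return 2 * (n_gpus - 1) / n_gpus * grad_gb / link_gb_s

# Syncing 16 GB of FP16 gradients (an 8B-param model) across 8 GPUs:
for name, bw in [("NVLink 4.0", 900), ("PCIe 5.0", 128),
                 ("InfiniBand NDR", 50)]:  # 400 Gb/s = 50 GB/s
    print(f"{name}: {allreduce_seconds(16, 8, bw) * 1000:.0f} ms per step")
```

The ~7× gap between NVLink and PCIe is why tensor parallelism, which synchronizes far more often than data parallelism, is confined to NVLink domains.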
```python
def estimate_vram_gb(
    params_billions: float,
    precision: str = "float16",
    batch_size: int = 1,
    seq_len: int = 2048,
    kv_heads: int = 8,
    head_dim: int = 128,
    num_layers: int = 32,
    safety_margin: float = 1.2,
) -> dict:
    """Estimate VRAM needed for LLM inference.

    Defaults approximate Llama-3-8B geometry; pass real layer/head counts
    for other models.
    """
    bytes_per_param = {"float32": 4, "float16": 2, "bfloat16": 2,
                       "int8": 1, "int4": 0.5}.get(precision, 2)
    # Model weights
    model_gb = params_billions * 1e9 * bytes_per_param / 1e9
    # KV cache: 2 (K+V) × layers × kv_heads × head_dim × seq_len × batch.
    # The cache normally stays FP16 even when weights are quantized.
    kv_bytes = 2
    kv_gb = (2 * num_layers * kv_heads * head_dim * seq_len * batch_size
             * kv_bytes) / 1e9
    # Activations (rough estimate, FP16)
    activations_gb = (batch_size * seq_len * 4096 * 2) / 1e9
    total = (model_gb + kv_gb + activations_gb) * safety_margin
    return {
        "model_weights_gb": round(model_gb, 2),
        "kv_cache_gb": round(kv_gb, 2),
        "activations_gb": round(activations_gb, 2),
        "total_estimated_gb": round(total, 2),
        "fits_on": [gpu for gpu, mem in
                    [("RTX 4090", 24), ("A10G", 24), ("A100 40GB", 40),
                     ("A100 80GB", 80), ("H100 80GB", 80), ("H100 NVL 94GB", 94)]
                    if mem >= total],
    }

# Common models
for name, params, prec in [
    ("Llama-3-8B fp16", 8, "float16"),
    ("Llama-3-8B int4", 8, "int4"),
    ("Llama-3-70B fp16", 70, "float16"),
    ("Llama-3-70B int4", 70, "int4"),
    ("Llama-3-405B int4", 405, "int4"),
]:
    r = estimate_vram_gb(params, prec)
    print(f"{name}: {r['total_estimated_gb']:.1f}GB → fits: {r['fits_on']}")
```
```python
import time

import torch

def benchmark_gpu() -> dict:
    """Measure key GPU specs that matter for LLM workloads."""
    if not torch.cuda.is_available():
        return {"error": "No CUDA GPU found"}
    device = torch.device("cuda:0")
    props = torch.cuda.get_device_properties(device)
    # Memory bandwidth test (crucial for LLM inference — memory-bound)
    size = 1024 * 1024 * 256  # 256M float16 elements = 0.5 GB
    x = torch.randn(size, device=device, dtype=torch.float16)
    y = x * 2.0 + 1.0  # warm-up so kernel launch isn't timed
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(10):
        y = x * 2.0 + 1.0  # elementwise — bandwidth-bound
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    bytes_moved = size * 2 * 2 * 10  # read + write, float16 = 2 bytes, 10 iters
    bandwidth_gb_s = bytes_moved / elapsed / 1e9
    # FLOPS test (matters more for prefill than decode)
    A = torch.randn(4096, 4096, device=device, dtype=torch.float16)
    B = torch.randn(4096, 4096, device=device, dtype=torch.float16)
    C = A @ B  # warm-up
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        C = A @ B
    torch.cuda.synchronize()
    flops_elapsed = time.perf_counter() - start
    tflops = (2 * 4096**3 * 100) / flops_elapsed / 1e12
    return {
        "gpu": props.name,
        "vram_gb": round(props.total_memory / 1e9, 1),
        "memory_bandwidth_gb_s": round(bandwidth_gb_s, 1),
        "compute_tflops_fp16": round(tflops, 1),
        "sm_count": props.multi_processor_count,
    }

stats = benchmark_gpu()
print(stats)
# Datasheet peaks for reference (measured results land somewhat below these):
# H100 SXM: 3350 GB/s, 989 FP16 TFLOPS; A100 SXM: 2000 GB/s, 312 FP16 TFLOPS
```