Infrastructure

NVIDIA GPUs

GPU hardware selection and configuration for LLM training and inference — covering the H100, A100, L40S, and consumer cards, with memory, throughput, and cost tradeoffs.

Top GPU
H100 SXM5 80GB
Memory bandwidth
3.35 TB/s (H100)
Key library
CUDA 12 + cuDNN

Table of Contents

SECTION 01

GPU Tiers for LLMs

Data centre tier (H100, A100): maximum memory (80GB), NVLink interconnect for multi-GPU, Tensor Cores for FP16/BF16/FP8, ECC memory. Professional tier (L40S, A40): good balance of VRAM (48GB) and cost, ideal for inference. Consumer tier (RTX 4090, 3090): 24GB VRAM, no NVLink, but very cost-effective for fine-tuning small models.
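The three tiers can be captured in a small lookup table for capacity-planning scripts. A minimal sketch — the structure and field names below are our own, with figures taken from this section:

```python
# Spec summary of the tiers above (illustrative data structure, not an API)
GPU_TIERS = {
    "H100 SXM5": {"tier": "data centre",  "vram_gb": 80, "nvlink": True},
    "A100 SXM4": {"tier": "data centre",  "vram_gb": 80, "nvlink": True},
    "L40S":      {"tier": "professional", "vram_gb": 48, "nvlink": False},
    "A40":       {"tier": "professional", "vram_gb": 48, "nvlink": False},
    "RTX 4090":  {"tier": "consumer",     "vram_gb": 24, "nvlink": False},
    "RTX 3090":  {"tier": "consumer",     "vram_gb": 24, "nvlink": False},
}

def multi_gpu_capable(name: str) -> bool:
    """NVLink cards scale cleanly to multi-GPU training; PCIe cards do not."""
    return GPU_TIERS[name]["nvlink"]
```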

SECTION 02

Memory Bandwidth vs Compute

LLM inference is memory-bandwidth bound during decode (one token at a time, all model weights must be loaded). Compute (FLOPS) matters more for prefill (processing the prompt). H100 SXM5: 3.35 TB/s bandwidth — can decode 70B at ~30 tokens/s with int4. A100 SXM4: 2.0 TB/s. RTX 4090: 1.0 TB/s. Rule: for inference, bandwidth wins. For training, compute wins.
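The arithmetic behind the decode figures above is a roofline-style bound: each decoded token must stream every weight from HBM once, so bandwidth divided by model size caps tokens/s. A sketch of that estimate (an upper bound, not a benchmark):

```python
def decode_tokens_per_s_ceiling(bandwidth_tb_s: float, params_b: float,
                                bytes_per_param: float) -> float:
    """Upper bound on decode speed: every token reads all weights once."""
    model_bytes = params_b * 1e9 * bytes_per_param
    return (bandwidth_tb_s * 1e12) / model_bytes

# H100 SXM5 (3.35 TB/s) decoding a 70B model in int4 (~0.5 bytes/param):
# ceiling ≈ 96 tokens/s. The ~30 tokens/s quoted above reflects real-world
# losses: KV-cache traffic, kernel efficiency, and batching overheads.
print(round(decode_tokens_per_s_ceiling(3.35, 70, 0.5)))  # 96
```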

SECTION 03

H100 vs A100 vs L40S

H100 SXM5 80GB: top training + inference GPU, $2.50–$4/hr cloud, NVLink 4.0 at 900 GB/s inter-GPU, FP8 support. A100 SXM4 80GB: previous gen but widely available, $1.50–$3/hr, excellent for most training workloads. L40S 48GB: inference-optimised, $0.80–$1.50/hr, PCIe (no NVLink), best cost-per-token for serving 7B–34B models. Choose H100 for training large models; L40S for production inference serving.
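Cost-per-token is what ties the hourly rates above to serving economics. A minimal sketch — the hourly rates are from this section, but the throughput numbers in the comments are hypothetical placeholders, not measured benchmarks:

```python
def usd_per_million_tokens(hourly_usd: float, tokens_per_s: float) -> float:
    """Serving cost per million generated tokens at a sustained decode rate."""
    return hourly_usd / (tokens_per_s * 3600) * 1e6

# Hypothetical comparison (throughputs are illustrative only):
# H100 at $3.00/hr sustaining 30 tok/s → ~$27.8 per million tokens
# L40S at $1.20/hr sustaining 15 tok/s → ~$22.2 per million tokens
```

A cheaper card with half the throughput can still win on cost per token, which is why the L40S is the serving pick despite its lower specs.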

SECTION 04

Consumer GPU Options

RTX 4090 24GB: $1,500–$2,000 new, excellent for LoRA fine-tuning of 7B–13B models, QLoRA of 70B with quantisation. Not suitable for 70B+ full fine-tuning. RTX 3090 24GB: cheaper ($700–$1,000 used), same 24GB VRAM, lower bandwidth (936 GB/s). RTX 4080 16GB: insufficient for 13B+ models. For home labs: two 4090s give 48GB combined with PCIe (no NVLink, slower inter-GPU).
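The quantisation arithmetic behind these fit/no-fit calls is simple: weight memory is parameter count times bits per parameter. A sketch (weights only; gradients, optimiser state, and KV cache come on top):

```python
def weight_gb(params_b: float, bits_per_param: int) -> float:
    """VRAM taken by the model weights alone, in GB
    (training state and KV cache require additional headroom)."""
    return params_b * bits_per_param / 8

# 13B in 4-bit ≈ 6.5 GB  — comfortable on a 24GB RTX 4090
# 70B in 4-bit ≈ 35 GB   — needs the dual-4090 (48GB combined) setup above
print(weight_gb(13, 4), weight_gb(70, 4))  # 6.5 35.0
```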

SECTION 05

Multi-GPU Configurations

NVLink: high-bandwidth GPU-to-GPU interconnect, standard in H100/A100 DGX systems. DGX H100: 8× H100 SXM5, 640GB total VRAM, 900 GB/s NVLink — enough to train 70B+ models in BF16. PCIe multi-GPU: much slower (32 GB/s vs 900 GB/s) but available on consumer cards. Use tensor parallelism (Megatron-LM) over NVLink and pipeline parallelism over PCIe.

# Check your GPU setup:
import torch
print(torch.cuda.device_count())          # number of GPUs
print(torch.cuda.get_device_name(0))      # e.g. "NVIDIA H100 80GB HBM3"
print(torch.cuda.mem_get_info(0))         # (free_bytes, total_bytes)
# List GPUs and VRAM with nvidia-smi (memory bandwidth is not an
# nvidia-smi query field — check the vendor datasheet for that):
# nvidia-smi --query-gpu=name,memory.total --format=csv

SECTION 06

Practical Selection Guide

Budget <$500/month (cloud): rent L40S or A10G for inference, spot A100 for training. Budget $500–$5K/month: L40S ×2 for serving, A100 for training. Home lab: RTX 4090 ×2 for LoRA fine-tuning + serving 7B. Large-scale training: H100 DGX pod. Always check: VRAM ≥ model weights in inference dtype (e.g. 70B × 2 bytes/param in BF16 = 140GB → need 2× H100 80GB).
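The VRAM rule of thumb above can be turned into a quick sizing helper. A sketch that counts weights only, matching the 70B BF16 example (KV cache and activations need headroom beyond this):

```python
import math

def gpus_for_weights(params_b: float, bytes_per_param: float,
                     vram_gb: int) -> int:
    """Minimum GPUs whose combined VRAM covers the model weights alone."""
    return math.ceil(params_b * bytes_per_param / vram_gb)

# 70B × 2 bytes/param (BF16) = 140GB → 2 × H100 80GB, as in the rule above
print(gpus_for_weights(70, 2, 80))   # 2
# 70B in int4 (~0.5 bytes/param) = 35GB → 2 × RTX 4090 24GB
print(gpus_for_weights(70, 0.5, 24)) # 2
```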

SECTION 07

Advanced Implementation

This section walks through the basic PyTorch CUDA workflow: checking GPU availability, moving models and tensors onto the device, and compiling functions for faster execution.

import torch
import torch.nn as nn

# Check GPU availability
print(torch.cuda.is_available())      # True if a CUDA GPU is visible
print(torch.cuda.get_device_name(0))  # GPU device name

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move model to GPU
model = nn.Linear(10, 5).to(device)

# NVIDIA CUDA-optimised computation: tensors created directly on the device
x = torch.randn(1000, 1000, device=device)
y = torch.randn(1000, 1000, device=device)

# This matmul runs on the GPU
result = torch.matmul(x, y)

# Use torch.compile for kernel fusion and graph optimisation
optimized_fn = torch.compile(your_function)

SECTION 08

Comparison & Evaluation

Criteria     | Description                          | Consideration
Performance  | Latency and throughput metrics       | Measure against baselines
Scalability  | Horizontal and vertical scaling      | Plan for growth
Integration  | Compatibility with ecosystem         | Reduce friction
Cost         | Operational and infrastructure costs | Total cost of ownership

Understanding the fundamentals enables practitioners to make informed decisions about tool selection and implementation strategy. These foundational concepts shape how systems are architected and operated in production environments.

Production deployments require careful consideration of operational characteristics including resource consumption, latency profiles, and failure modes. Comprehensive testing against real-world scenarios helps validate assumptions and identify edge cases.

Community adoption and ecosystem maturity directly impact long-term viability. Active maintenance, thorough documentation, and responsive support channels significantly reduce implementation friction and maintenance burden.

Cost considerations extend beyond initial implementation to include ongoing operational expenses, training requirements, and opportunity costs of technology choices. A holistic cost analysis accounts for both direct and indirect expenses over the system lifetime.

Integration patterns and interoperability with existing infrastructure determine deployment success. Compatibility layers, standardized interfaces, and clear migration paths smooth the adoption process for teams with legacy systems.

Monitoring and observability are critical aspects of production systems. Establishing comprehensive metrics, logging, and alerting mechanisms enables rapid detection and resolution of issues before they impact end users.

Security considerations span multiple dimensions including authentication, authorization, encryption, data protection, and compliance with regulatory frameworks. Implementing defense-in-depth strategies with multiple layers of security controls reduces risk exposure. Regular security audits, penetration testing, and vulnerability assessments help identify and remediate weaknesses proactively before they become exploitable.

Scalability architecture decisions influence system behavior under load and determine capacity for future growth. Horizontal and vertical scaling approaches present different tradeoffs in terms of complexity, cost, and operational overhead. Designing systems with scalability in mind from inception prevents costly refactoring and ensures smooth expansion as demand increases.