# AWS, GCP, and Azure model serving: managed APIs vs self-hosted, auto-scaling, and cost optimisation
Managed LLM APIs (OpenAI, Anthropic, Bedrock) trade control for simplicity. You send a request; they handle infrastructure, scaling, model updates. Self-hosted models (LLaMA, Mistral) run on your infrastructure. You control fine-tuning, caching, data residency, cost.
| Option | Latency SLA | Cost model | Control | Cold start |
|---|---|---|---|---|
| Managed API | 50–300 ms | Per-token | Model & params only | None |
| Container on GPU | 500–2,000 ms | Hourly instance | Full (infra + model) | 1–2 min |
| Serverless GPU | 2–10 s | Per-second | Full (model only) | 2–5 s |
| Spot cluster | 1–5 s | Spot rate (50–80% off) | Full | 30 s–1 min |
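The managed-vs-self-hosted decision often comes down to a break-even calculation: per-token API pricing beats a dedicated GPU until utilization is high enough. A back-of-envelope sketch, where every rate (API price, GPU hourly cost, sustained throughput) is an illustrative assumption, not a vendor quote:

```python
# Break-even between a per-token managed API and an hourly GPU instance.
# All prices below are assumptions for the sketch, not real quotes.

API_COST_PER_1K_TOKENS = 0.002   # assumed blended $/1K tokens
GPU_COST_PER_HOUR = 4.00         # assumed on-demand GPU instance $/hour
GPU_TOKENS_PER_SECOND = 1_000    # assumed sustained throughput

def api_cost(tokens: int) -> float:
    """Cost of serving `tokens` through the per-token API."""
    return tokens / 1000 * API_COST_PER_1K_TOKENS

def gpu_cost(tokens: int) -> float:
    """Cost of serving `tokens` on the GPU at full sustained throughput."""
    hours = tokens / GPU_TOKENS_PER_SECOND / 3600
    return hours * GPU_COST_PER_HOUR

# Tokens/hour at which one GPU-hour costs the same as the API:
breakeven_tokens_per_hour = GPU_COST_PER_HOUR / API_COST_PER_1K_TOKENS * 1000
print(f"break-even: {breakeven_tokens_per_hour:,.0f} tokens/hour")
```

With these numbers the GPU wins only above ~2M tokens/hour of steady demand; a fully utilized GPU produces 3.6M tokens/hour, so self-hosting pays off only with consistent traffic, which is exactly the pattern the table above reflects.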
**Amazon SageMaker**: Managed endpoint serving. Deploy any model; SageMaker handles autoscaling and load balancing. Good for production workloads with consistent traffic. Pricing: you pay for instance time (e.g., ml.p4d.24xlarge at ~$37/hour). Cold starts are fast; scaling is automatic.
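Invoking a SageMaker real-time endpoint is a single boto3 call. A minimal sketch; the endpoint name and the TGI-style request schema are assumptions and must match whatever container you actually deployed:

```python
import json

def build_payload(prompt: str, max_new_tokens: int = 256) -> bytes:
    """Serialize a request body in the HF TGI-style schema (assumed)."""
    return json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }).encode()

def invoke(endpoint_name: str, prompt: str) -> str:
    """Call a deployed SageMaker endpoint (requires AWS credentials)."""
    import boto3
    client = boto3.client("sagemaker-runtime")
    resp = client.invoke_endpoint(
        EndpointName=endpoint_name,       # hypothetical endpoint name
        ContentType="application/json",
        Body=build_payload(prompt),
    )
    return resp["Body"].read().decode()
```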
**EKS + vLLM**: Kubernetes on AWS. Deploy vLLM (a high-throughput LLM inference engine) on EKS clusters. More control; scales via Karpenter. Better for burstable traffic and cost-sensitive workloads; requires DevOps expertise.
**AWS Inferentia**: AWS custom silicon, optimized for inference; ~70% cheaper than GPUs for throughput workloads. Tradeoff: limited model compatibility. Good for specific models (BERT, smaller LLMs).
**Amazon Bedrock**: Managed API. Claude, Llama, and Mixtral available per-token, with no infrastructure to manage. Best for teams wanting API convenience with data residency on AWS.
**Vertex AI**: GCP's hosted LLM platform. Serve Claude, Gemini, and Llama; fine-tuning is supported for some of them. Autoscaling and monitoring included. Simpler than raw GKE; less control than containers.
**GKE + vLLM**: Google Kubernetes Engine. Deploy vLLM for maximum throughput and cost control; use Workload Identity for IAM. Good for high-volume batch inference and real-time serving.
**Cloud TPU**: Google custom silicon, extremely fast for models optimized for the TPU architecture. v5e is newer and cheaper; v4 is proven. Requires model optimization; not all models run well on TPUs.
**Cloud Run**: Serverless containers. Deploy lightweight models (e.g., Orca-style small models or fine-tunes) for autoscaling without managing servers. Cold starts ~5 s.
**Azure OpenAI Service**: Managed API for GPT-4, GPT-3.5, and embeddings. HIPAA/FedRAMP compliant; per-token pricing; no infrastructure to manage. Best for enterprises needing Microsoft integration and compliance.
**AKS + Triton**: Azure Kubernetes Service with NVIDIA Triton Inference Server. Full control over model serving, batching, and GPU optimization. Good for complex serving pipelines and custom kernels.
**Azure GPU VMs**: Virtual machines with H100 GPUs. Pay-as-you-go or reserved instances (~30% discount). Deploy vLLM or TGI. Often pricier than comparable AWS/GCP instances, but with Azure ecosystem integration.
**Azure ML managed endpoints**: Managed inference endpoints. Deploy Hugging Face models, fine-tuned models, or ONNX. Autoscaling and monitoring included. Simpler than raw VMs; less flexible than AKS.
**Horizontal Pod Autoscaler (HPA)**: Scale on GPU utilization. If average GPU use exceeds 70%, spawn new pods. Works well for steady traffic; slower for traffic spikes (30–60 s to provision).
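Kubernetes HPA computes the replica count as `ceil(currentReplicas * currentMetric / targetMetric)`. A small sketch of that rule applied to the 70% GPU-utilization target from the text:

```python
import math

def desired_replicas(current: int, avg_gpu_util: float, target: float = 0.70) -> int:
    """HPA scaling rule: desired = ceil(current * metric / target), floor of 1."""
    return max(1, math.ceil(current * avg_gpu_util / target))

# 4 pods running hot at 90% utilization -> scale out to 6
print(desired_replicas(4, 0.90))
# 4 pods idling at 35% utilization -> scale in to 2
print(desired_replicas(4, 0.35))
```

This also shows why utilization-based scaling lags behind spikes: the metric must first climb above the target, and each step only adds a proportional number of pods.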
**Queue-based autoscaling**: Scale on request queue length rather than resource use. If queue length exceeds 10, scale out. Reacts faster to spikes but requires a queue (AWS SQS, a Kubernetes custom metric).
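A queue-driven scaler needs only two pieces: read the queue depth and apply a threshold with some hysteresis. A sketch using SQS; the thresholds, step size, and queue URL are assumptions:

```python
def scale_decision(queue_len: int, replicas: int,
                   scale_out_at: int = 10, scale_in_at: int = 2) -> int:
    """Return the new replica count given current queue depth.

    Separate out/in thresholds add hysteresis so the scaler doesn't flap.
    """
    if queue_len > scale_out_at:
        return replicas + 1
    if queue_len < scale_in_at and replicas > 1:
        return replicas - 1
    return replicas

def sqs_queue_len(queue_url: str) -> int:
    """Read queue depth from SQS (requires AWS credentials to run)."""
    import boto3
    sqs = boto3.client("sqs")
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])
```

In production this loop is usually delegated to KEDA or an HPA external metric rather than hand-rolled, but the decision logic is the same.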
**Karpenter**: Open-source node auto-scaler for Kubernetes. Provisions optimal node types (GPU, CPU, spot) and complements HPA by adding capacity faster than classic cluster autoscaling; integrates with spot instances for 50–80% cost savings. Built for AWS, with an Azure provider also available.
A minimal multi-cloud router with latency-aware provider ordering and failover:

```python
import logging
import time

import anthropic
from openai import OpenAI

logger = logging.getLogger(__name__)


class MultiCloudRouter:
    """Route LLM requests across providers with latency tracking and failover."""

    def __init__(self):
        self.providers = {
            "openai": {"client": OpenAI(), "model": "gpt-4o-mini",
                       "latencies": [], "errors": 0},
            "anthropic": {"client": anthropic.Anthropic(),
                          "model": "claude-haiku-4-5-20251001",
                          "latencies": [], "errors": 0},
        }

    def _call_openai(self, p: dict, prompt: str) -> str:
        return p["client"].chat.completions.create(
            model=p["model"],
            messages=[{"role": "user", "content": prompt}],
            max_tokens=256,
        ).choices[0].message.content

    def _call_anthropic(self, p: dict, prompt: str) -> str:
        return p["client"].messages.create(
            model=p["model"],
            max_tokens=256,
            messages=[{"role": "user", "content": prompt}],
        ).content[0].text

    def route(self, prompt: str, prefer_fast: bool = True) -> dict:
        # Sort providers by recent average latency; untried providers sort last.
        def avg_lat(name):
            lats = self.providers[name]["latencies"][-5:]
            return sum(lats) / len(lats) if lats else float("inf")

        order = sorted(self.providers, key=avg_lat) if prefer_fast else list(self.providers)
        for name in order:
            p = self.providers[name]
            t0 = time.perf_counter()
            try:
                call_fn = self._call_openai if name == "openai" else self._call_anthropic
                result = call_fn(p, prompt)
                latency = time.perf_counter() - t0
                p["latencies"].append(latency)
                return {"provider": name, "response": result,
                        "latency_ms": round(latency * 1000)}
            except Exception as e:
                p["errors"] += 1
                logger.warning("%s failed: %s, trying next provider", name, e)
        raise RuntimeError("All providers failed")


if __name__ == "__main__":
    router = MultiCloudRouter()
    result = router.route("What is 42 * 37?")
    print(f"[{result['provider']}] {result['latency_ms']}ms: {result['response']}")
```
**Spot instances**: AWS Spot / GCP Spot (formerly preemptible) instances cost 50–80% less than on-demand. Tradeoff: AWS gives a 2-minute interruption notice, GCP only ~30 seconds. Works for stateless inference; use with Karpenter for seamless failover.
**Reserved capacity**: Commit to 1- or 3-year terms and save 30–50% on instance costs. Locks in capacity but requires demand forecasting. Good for baseline traffic.
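The usual pattern is to combine the two: reserved instances cover the predictable baseline, spot covers bursts. A blended-cost sketch in which every rate and discount is an illustrative assumption:

```python
# Blended monthly cost: reserved covers baseline, spot covers bursts.
# All rates and discounts below are assumptions, not vendor pricing.

ON_DEMAND = 4.00              # $/GPU-hour, assumed
RESERVED = ON_DEMAND * 0.6    # ~40% off for a 1-year commit (assumed)
SPOT = ON_DEMAND * 0.3        # ~70% off, within the 50-80% range above

def monthly_cost(baseline_gpus: int, burst_gpu_hours: float) -> float:
    """Reserved baseline running 24/7 plus burst hours served on spot."""
    hours = 730  # average hours per month
    return baseline_gpus * hours * RESERVED + burst_gpu_hours * SPOT

all_on_demand = (4 * 730 + 500) * ON_DEMAND   # same load, all on-demand
blended = monthly_cost(4, 500)                # 4 baseline GPUs + 500 burst hours
print(f"on-demand: ${all_on_demand:,.0f}  blended: ${blended:,.0f}")
```

With these assumed rates the blended plan roughly halves the bill; the real saving depends entirely on how accurately you can forecast the baseline.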
**Right-size GPUs**: Use the smallest GPU type that meets your latency SLA. An A100 is overkill for many models; try an L40S, A10G, or L4. Monitor GPU utilization; oversized instances waste money.
**Batching**: Batch inference multiplies throughput. Ten sequential requests at ~1 s each take ~10 s; a single batched forward pass can serve all ten in roughly a second. Trades latency for throughput; good for async workloads.
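The arithmetic behind that tradeoff is simple enough to sketch. The per-request time and the batch-overhead factor below are made-up assumptions for illustration:

```python
# Latency/throughput math for batched inference. The assumption is that one
# batched forward pass costs only slightly more than a single request,
# up to the GPU's memory limit. Numbers are illustrative.

SINGLE_REQ_S = 1.0      # one request served alone (assumed)
BATCH_OVERHEAD = 1.3    # a batch costs ~1.3x a single pass (assumed)

def sequential_time(n: int) -> float:
    """Total wall time to serve n requests one at a time."""
    return n * SINGLE_REQ_S

def batched_time(n: int) -> float:
    """Wall time for one batched pass over n requests (n within memory limits)."""
    return SINGLE_REQ_S * BATCH_OVERHEAD

n = 10
print(f"sequential: {sequential_time(n):.1f}s  batched: {batched_time(n):.1f}s")
# Throughput rises from 1 req/s to n / batched_time(n) req/s, while each
# request's latency grows from 1.0s to 1.3s -- the latency-for-throughput trade.
```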
**Decoding optimizations**: Speculative decoding uses a small draft model to propose tokens that the large model verifies in parallel, cutting generation latency without changing output. Prompt caching (Claude, Gemini) reuses computation on repeated context (30–50% token-cost savings on cached prefixes).
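Prompt-caching savings scale with how much of each request is a repeated prefix. A cost sketch; the per-token rate, the 90% cache discount, and the token counts are all assumptions, not any provider's actual pricing:

```python
# Prompt-caching savings: cached prefix tokens are billed at a steep
# discount on subsequent calls. All rates below are assumptions.

def cost(prompt_tokens: int, cached_tokens: int,
         rate_per_1k: float = 0.003, cache_discount: float = 0.9) -> float:
    """Input cost when `cached_tokens` of the prompt hit the cache."""
    uncached = prompt_tokens - cached_tokens
    return (uncached + cached_tokens * (1 - cache_discount)) / 1000 * rate_per_1k

no_cache = cost(8000, 0)
with_cache = cost(8000, 7000)   # assumed 7K-token system prompt/context reused
savings = 1 - with_cache / no_cache
print(f"savings: {savings:.0%}")
```

With a large shared prefix the saving exceeds the headline 30–50% per-token figure; the realized number depends on the provider's discount and your prefix reuse rate.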
- **Amazon SageMaker**: AWS managed endpoints. Deploy, scale, and monitor LLMs; native integration with the AWS ecosystem.
- **Vertex AI**: GCP managed platform. Fine-tune, deploy, monitor; native for Gemini, Claude.
- **Azure OpenAI Service**: Managed API on Azure. GPT-4, compliance, Entra ID integration.
- **vLLM**: High-throughput LLM inference engine. Deploy on any cloud; PagedAttention for memory efficiency.
- **NVIDIA Triton**: Inference server. Multi-model, batching, custom kernels. Complex but powerful.
- **KServe**: Kubernetes-native serving. Model registry, traffic splitting, autoscaling.
- **Karpenter**: Fast node provisioning for Kubernetes. Spot instance support, cost optimization.
- **Kubernetes**: Container orchestration. Industry standard; steep learning curve but full control.
- **Serverless GPU platforms**: Deploy functions; autoscale serverlessly. Simple; less control.