Production · Infrastructure

Cloud Deployment for LLMs

AWS, GCP, and Azure model serving — managed APIs vs self-hosted, auto-scaling, and cost optimisation

3 clouds
8 sections
9 services
Contents
  1. Managed API vs self-hosted
  2. AWS options
  3. GCP options
  4. Azure options
  5. Auto-scaling patterns
  6. Cost optimisation
  7. Tools & platforms
  8. References
01 — Architecture

Managed API vs Self-Hosted

Managed LLM APIs (OpenAI, Anthropic, Bedrock) trade control for simplicity. You send a request; they handle infrastructure, scaling, model updates. Self-hosted models (LLaMA, Mistral) run on your infrastructure. You control fine-tuning, caching, data residency, cost.

| Option | Latency SLA | Cost model | Control | Cold start |
|---|---|---|---|---|
| Managed API | 50–300 ms | Per-token | Model & params only | None |
| Container on GPU | 500–2000 ms | Hourly instance | Full (infra + model) | 1–2 min |
| Serverless GPU | 2–10 sec | Per-second | Full (model only) | 2–5 sec |
| Spot cluster | 1–5 sec | Spot rate (50–80% off) | Full | 30 sec–1 min |
💡 Decision rule: Use managed API if privacy is not critical and cost per request is < $0.01. Use self-hosted if you need sub-100ms latency, custom fine-tuning, or full data control.
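The decision rule can be encoded directly. The $0.01 threshold comes from the tip above; the boolean criteria mirror the self-hosted triggers (sub-100ms latency, fine-tuning, data control), and the function name is illustrative:

```python
def choose_deployment(cost_per_request: float,
                      needs_sub_100ms: bool = False,
                      needs_fine_tuning: bool = False,
                      strict_data_control: bool = False) -> str:
    """Apply the managed-vs-self-hosted decision rule from the text."""
    # Any hard requirement forces self-hosting regardless of cost.
    if needs_sub_100ms or needs_fine_tuning or strict_data_control:
        return "self-hosted"
    # Otherwise, cheap-enough per-request cost favors a managed API.
    return "managed-api" if cost_per_request < 0.01 else "self-hosted"
```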
02 — AWS Ecosystem

AWS Options

1. SageMaker Real-Time Endpoints

Managed endpoint serving. Deploy any model; SageMaker handles autoscaling and load balancing. Good for production workloads with consistent traffic. Pricing: you pay for instance time (e.g., ml.p4d.24xlarge at $37/hour). Cold starts are fast; scaling is automatic.

2. EKS + vLLM

Kubernetes on AWS. Deploy vLLM (high-throughput LLM inference engine) on EKS clusters. More control; scales via Karpenter. Better for burstable traffic and cost-sensitive workloads. Requires DevOps expertise.

3. Inferentia2 Chips

AWS custom silicon. Optimized for inference; ~70% cheaper than GPUs for throughput workloads. Tradeoff: limited model compatibility. Good for specific models (BERT, smaller LLMs).

4. AWS Bedrock

Managed API. Claude, Llama, Mixtral available via API. No infrastructure to manage. Per-token pricing. Best for teams wanting API convenience with data residency on AWS.

Python boto3 Example

import json

import boto3

sm_client = boto3.client('sagemaker-runtime')
response = sm_client.invoke_endpoint(
    EndpointName='my-llm-endpoint',
    Body=json.dumps({'messages': [...]}),
    ContentType='application/json',
)
result = json.loads(response['Body'].read().decode())
print(result['generated_text'])
💡 SageMaker vs Bedrock: SageMaker for production inference; Bedrock for third-party models. Bedrock is simpler but has less flexibility.
03 — GCP Ecosystem

GCP Options

1. Vertex AI Model Garden

GCP's hosted LLM platform. Deploy fine-tuned versions of Claude, Gemini, Llama. Autoscaling, monitoring included. Simpler than raw GKE; less control than containers.

2. GKE + vLLM

Google Kubernetes Engine. Deploy vLLM for maximum throughput and cost control. Scale with Workload Identity for IAM. Good for high-volume batch inference and real-time serving.

3. TPU v4/v5e

Google's custom silicon. Very fast for models optimized for TPUs. v5e is newer and cheaper; v4 is proven. Requires model-level optimization; not all models run well on TPUs.

4. Cloud Run with Models

Serverless containers. Deploy lightweight models (ORCA, fine-tuned small models) to Cloud Run for autoscaling without managing servers. Cold starts ~5 sec.

⚠️ TPU commitment: Committed TPU capacity typically requires a 1-year term. Good for stable workloads; risky for experimental projects. Start with on-demand A100 GPUs on Compute Engine, and move to TPU after 3 months of stable demand.
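The commit-vs-on-demand tradeoff in the warning above reduces to a utilization breakeven. A minimal sketch, with placeholder hourly rates (not real quotes):

```python
def commitment_breaks_even(on_demand_hourly: float,
                           committed_hourly: float,
                           expected_utilization: float) -> bool:
    """A commitment bills 24/7 at a discounted rate; on-demand bills only
    the hours you actually run. Commit only when expected utilization is
    high enough that the discounted always-on bill beats pay-per-use."""
    return committed_hourly < on_demand_hourly * expected_utilization
```

With a hypothetical 50% commitment discount ($4.00 on-demand vs $2.00 committed), breakeven sits at 50% utilization: a workload busy 60% of the time should commit, one busy 40% should stay on-demand.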
04 — Azure Ecosystem

Azure Options

1. Azure OpenAI Service

Managed API for GPT-4, GPT-3.5, embeddings. HIPAA/FedRAMP compliant. Per-token pricing. No infrastructure to manage. Best for enterprises needing Microsoft integration and compliance.

2. AKS + Triton

Azure Kubernetes Service with NVIDIA Triton Inference Server. Full control over model serving, batching, GPU optimization. Good for complex serving pipelines and custom kernels.

3. NC-series (H100) VMs

Virtual machines with H100 GPUs. Pay-as-you-go or reserved instances (30% discount). Deploy vLLM or TGI. More expensive than AWS/GCP but with Azure ecosystem integration.

4. Azure Machine Learning

Managed inference endpoints. Deploy Hugging Face models, fine-tuned models, ONNX. Autoscaling, monitoring included. Simpler than raw VMs; less flexible than AKS.

💡 Azure for enterprise: If you're already in the Microsoft ecosystem (Entra ID, Dynamics, Power BI), Azure's integrated auth and compliance make it compelling despite slightly higher costs.
05 — Scaling Patterns

Auto-Scaling Patterns

Horizontal Pod Autoscaler (HPA)

Scale based on GPU utilization. If average GPU use > 70%, spawn new pods. Works well for steady traffic; slower for traffic spikes (30–60 sec to provision).

Queue-Depth Scaling

Scale based on request queue length, not resource use. If queue length > 10, scale out. Faster response to spikes but requires queue implementation (AWS SQS, Kubernetes custom metric).
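A minimal queue-depth scaler, assuming the queue length comes from your own metric source (SQS, a Kubernetes custom metric) and the target backlog per replica is tuned for your workload:

```python
import math

def desired_replicas(queue_len: int, target_per_replica: int = 10,
                     min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Scale out when queue depth exceeds the target backlog per replica,
    clamped to the deployment's replica bounds."""
    want = math.ceil(queue_len / target_per_replica)
    return max(min_replicas, min(max_replicas, want))
```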

Karpenter for Node Provisioning

Open-source node auto-scaler for Kubernetes. Provisions optimal node types (GPU, CPU, spot). Faster than cluster-autoscaler-based flows; integrates with spot instances for 50–80% cost savings. Built for AWS; an Azure provider also exists.

YAML Example: HPA (memory as a proxy for GPU load)

GPU utilization is not a built-in HPA resource metric; exposing it requires a custom-metrics adapter (e.g., DCGM exporter plus Prometheus Adapter). The example below scales on memory utilization as a proxy.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
⚠️ Scaling latency: HPA adds 30–60 sec to spin up new pods. For bursty traffic, pre-scale or use spot instances with Karpenter. For real-time, maintain headroom (run at 50% capacity).
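The headroom advice above translates into a simple sizing formula; per-replica throughput is an assumed input you would measure for your own model:

```python
import math

def replicas_for_headroom(peak_rps: float, per_replica_rps: float,
                          target_utilization: float = 0.5) -> int:
    """Run replicas at ~50% capacity so a traffic spike lands on warm
    pods instead of waiting 30-60s for new ones to provision."""
    return math.ceil(peak_rps / (per_replica_rps * target_utilization))
```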
Python · Multi-cloud LLM router with latency-based failover
import time, logging
from openai import OpenAI
import anthropic

logger = logging.getLogger(__name__)

class MultiCloudRouter:
    """Route LLM requests across providers with latency tracking and failover."""
    def __init__(self):
        self.providers = {
            "openai":    {"client": OpenAI(), "model": "gpt-4o-mini",
                          "latencies": [], "errors": 0},
            "anthropic": {"client": anthropic.Anthropic(), "model": "claude-haiku-4-5-20251001",
                          "latencies": [], "errors": 0},
        }

    def _call_openai(self, p: dict, prompt: str) -> str:
        return p["client"].chat.completions.create(
            model=p["model"],
            messages=[{"role": "user", "content": prompt}],
            max_tokens=256
        ).choices[0].message.content

    def _call_anthropic(self, p: dict, prompt: str) -> str:
        return p["client"].messages.create(
            model=p["model"],
            max_tokens=256,
            messages=[{"role": "user", "content": prompt}]
        ).content[0].text

    def route(self, prompt: str, prefer_fast: bool = True) -> dict:
        # Sort providers by recent avg latency
        def avg_lat(name):
            lats = self.providers[name]["latencies"][-5:]
            return sum(lats)/len(lats) if lats else float('inf')

        order = sorted(self.providers.keys(), key=avg_lat) if prefer_fast \
            else list(self.providers.keys())

        for name in order:
            p = self.providers[name]
            t0 = time.perf_counter()
            try:
                call_fn = self._call_openai if name == "openai" else self._call_anthropic
                result = call_fn(p, prompt)
                latency = time.perf_counter() - t0
                p["latencies"].append(latency)
                return {"provider": name, "response": result, "latency_ms": round(latency*1000)}
            except Exception as e:
                p["errors"] += 1
                logger.warning(f"{name} failed: {e}, trying next provider")
        raise RuntimeError("All providers failed")

router = MultiCloudRouter()
result = router.route("What is 42 * 37?")
print(f"[{result['provider']}] {result['latency_ms']}ms: {result['response']}")
06 — Cost Optimisation

Cost Optimisation

Spot Instances & Preemptible VMs

AWS spot / GCP preemptible instances cost 50–80% less than on-demand. Tradeoff: can be terminated with 2-min notice. Works for stateless inference; use with Karpenter for seamless failover.

Reserved Instances

Commit to 1-year or 3-year terms; save 30–50% on instance costs. Lock in capacity; requires demand forecasting. Good for baseline traffic.
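One way to blend the two: reserve the baseline, pay on-demand for the burst above it. The discount and burst fraction below are illustrative, not quoted rates:

```python
def blended_monthly_cost(baseline_instances: int, peak_instances: int,
                         on_demand_hourly: float,
                         reserved_discount: float = 0.40,
                         hours: int = 730,
                         burst_fraction: float = 0.2) -> float:
    """Reserved instances cover baseline 24/7; on-demand covers the burst.
    burst_fraction = share of the month the extra capacity actually runs."""
    reserved = baseline_instances * on_demand_hourly * (1 - reserved_discount) * hours
    burst = (peak_instances - baseline_instances) * on_demand_hourly * hours * burst_fraction
    return reserved + burst
```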

Right-Sizing

Use smallest GPU type that meets latency SLA. A100 is overkill for many models; try H100, L40S, or even A10G. Monitor GPU utilization; oversized instances waste money.

Batching Requests

Batch inference multiplies throughput. A GPU that serves 10 sequential requests in ~10 sec can serve a batch of ~100 in a second or two of wall-clock time. Trade latency for throughput; good for async workloads.
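The throughput gain can be sketched with illustrative timings; the per-request and per-batch numbers are assumptions, not benchmarks:

```python
import math

def wall_clock_seconds(n_requests: int, sequential_s: float = 1.0,
                       batch_size: int = 100, batch_s: float = 1.5):
    """Compare one-at-a-time serving vs batched serving (toy numbers)."""
    sequential = n_requests * sequential_s           # each request waits its turn
    batched = math.ceil(n_requests / batch_size) * batch_s  # requests share a pass
    return sequential, batched
```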

Speculative Decoding & Caching

Speculative decoding uses a small draft model to propose tokens that the large model verifies in parallel, cutting generation latency and cost per token without changing output quality. Prompt caching (Claude, Gemini) reuses computation on repeated context (30–50% token-cost savings).

💡 Cost math: 100K requests/day, $0.01/request = $1000/day. Optimize model routing (40% to cheap model): $0.007/avg = $700/day. Implement caching: $490/day. Spot instances: $245/day. Total: 75% savings.
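The cost math in the tip above can be reproduced step by step; the way the discounts stack multiplicatively is the tip's own assumption:

```python
daily_requests = 100_000
baseline = daily_requests * 0.01    # $1000/day at $0.01/request
routed   = daily_requests * 0.007   # 40% on a cheap model -> $0.007 average
cached   = routed * 0.70            # prompt caching trims ~30% of remaining spend
spot     = cached * 0.50            # spot/preemptible halves the rest
savings  = 1 - spot / baseline      # ~0.755, i.e. roughly 75% overall
```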
07 — Ecosystem

Tools & Platforms

SageMaker

AWS managed endpoints. Deploy, scale, monitor LLMs. Native integration with the AWS ecosystem.

Vertex AI

GCP managed platform. Fine-tune, deploy, monitor. Native for Gemini, Claude.

Azure OpenAI

Managed API on Azure. GPT-4, compliance, Entra ID integration.

vLLM

High-throughput LLM inference. Deploy on any cloud. PagedAttention for efficiency.

Triton

NVIDIA inference server. Multi-model, batching, custom kernels. Complex but powerful.

KServe

Kubernetes-native serving. Model registry, traffic splitting, autoscaling.

Karpenter

Fast node provisioning for Kubernetes. Spot instance support, cost optimization.

Kubernetes

Container orchestration. Industry standard; steep learning curve but full control.

Modal

Serverless GPU. Deploy functions; autoscale serverlessly. Simple; less control.

08 — Further Reading

References
