Production & Infra

LiteLLM

Call 100+ LLM models with a unified OpenAI-compatible API. Handle routing, fallbacks, cost tracking, and observability.

100+
Providers
OpenAI-SDK
Interface
Built-in
Cost Tracking


SECTION 01

The Provider Fragmentation Problem

LLM providers each have unique SDKs, parameter names, authentication schemes, and error formats. Building a production system that can switch providers or use multiple providers simultaneously becomes a maintenance nightmare.

The Fragmentation

The Result: Code tightly coupled to specific providers. Switching models requires rewriting core logic. Fallback chains require extensive boilerplate. Cost tracking is manual. Observability is provider-specific.

LiteLLM Solution

LiteLLM provides a single, unified OpenAI-compatible interface for all providers. Write code once, swap providers with a single parameter change.

Core Value: "The fastest way to call any LLM is with one line of code." LiteLLM abstracts provider differences so your application code stays clean and portable.
SECTION 02

Core API

LiteLLM's API is OpenAI-compatible but works with any provider:

completion() - Text Generation

from litellm import completion

# Works with OpenAI
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7
)

# Same code, different provider (Anthropic)
response = completion(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7
)

# Or Gemini (Google AI Studio models use the "gemini/" prefix)
response = completion(
    model="gemini/gemini-pro",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7
)

print(response.choices[0].message.content)

acompletion() - Async Calls

import asyncio
from litellm import acompletion

async def call_llm():
    response = await acompletion(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello"}]
    )
    return response

result = asyncio.run(call_llm())

Response Format

Responses are normalized to OpenAI format regardless of provider:

# All providers return this structure
response.choices[0].message.content     # the text
response.choices[0].message.tool_calls  # if tool use
response.usage.prompt_tokens
response.usage.completion_tokens
response.model                          # "gpt-4", "claude-...", etc.

Environment Setup

Set API keys as environment variables; LiteLLM auto-detects them:

# Set env vars
export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GEMINI_API_KEY=...

# LiteLLM finds them automatically
response = completion(model="gpt-4", messages=...)
response = completion(model="claude-3-5-sonnet-20241022", messages=...)
Unified Model Naming: LiteLLM normalizes model names. Use `model="gpt-4"` for OpenAI, `model="claude-3-5-sonnet-20241022"` for Anthropic, etc. The interface stays the same.
SECTION 03

Provider Routing

LiteLLM's killer feature: seamless routing between providers for fallback, load balancing, and cost optimization.

Fallback: Try Expensive Model, Fall Back to Cheap

from litellm import completion

# Try GPT-4 (more capable, more expensive).
# If it fails, try Claude (fallback).
# If that fails, use Llama (cheapest).
fallback_models = [
    "gpt-4",
    "claude-3-5-sonnet-20241022",
    "replicate/llama-2-70b"
]

for model in fallback_models:
    try:
        response = completion(
            model=model,
            messages=[{"role": "user", "content": "Solve this math problem"}]
        )
        print(f"Success with {model}")
        break
    except Exception as e:
        print(f"{model} failed: {e}")
        continue
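The loop above can be wrapped into a reusable helper. This is a sketch, not a LiteLLM API: the `call_fn` parameter is injected so the same logic works with `litellm.completion` in production or a stub in tests, and `complete_with_fallbacks` is a hypothetical name.

```python
def complete_with_fallbacks(call_fn, models, messages):
    """Try each model in order; return (model, response) from the first success.

    call_fn: any completion-style callable, e.g. litellm.completion.
    models:  ordered list of model names, most preferred first.
    """
    last_error = None
    for model in models:
        try:
            return model, call_fn(model=model, messages=messages)
        except Exception as e:  # fall through to the next model
            last_error = e
    raise RuntimeError(f"All models failed; last error: {last_error!r}")
```

Because the completion callable is injected, the fallback logic itself can be unit-tested without API keys.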

Load Balancing: Distribute Across Providers

import random

models = ["gpt-4", "gpt-3.5-turbo", "claude-3-5-sonnet-20241022"]

# Random load balancing
chosen_model = random.choice(models)
response = completion(model=chosen_model, messages=...)

# Or weighted (e.g., 60% GPT-4, 40% Claude)
models_weighted = ["gpt-4"] * 6 + ["claude-3-5-sonnet-20241022"] * 4
chosen_model = random.choice(models_weighted)
response = completion(model=chosen_model, messages=...)
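Instead of duplicating entries in a list, `random.choices` accepts fractional weights directly. A small sketch (the `pick_model` helper and the weight values are illustrative):

```python
import random

def pick_model(weighted_models, rng=random):
    """Pick one model name according to its relative traffic weight.

    weighted_models: dict of model name -> weight, e.g. {"gpt-4": 0.6, "claude": 0.4}.
    rng is injectable so tests can seed or stub the randomness.
    """
    names = list(weighted_models)
    weights = [weighted_models[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

Usage: `model = pick_model({"gpt-4": 0.6, "claude-3-5-sonnet-20241022": 0.4})` then pass `model` to `completion()`.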

Cost-Based Routing

Route based on input length and cost:

# For short inputs, use a cheap model.
# For complex inputs, use an expensive model.
input_length = len(messages[0]["content"])

if input_length < 500:
    model = "gpt-3.5-turbo"  # ~$0.0005 per 1K input tokens
else:
    model = "gpt-4"          # ~$0.03 per 1K input tokens

response = completion(model=model, messages=messages)
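The same idea can route on an estimated token count rather than raw characters. A sketch, assuming the common rule of thumb of roughly 4 characters per token for English text; the threshold and model names are illustrative:

```python
def route_by_length(messages, cheap="gpt-3.5-turbo", expensive="gpt-4",
                    token_threshold=500):
    """Pick a model based on a rough token estimate of the whole message list."""
    # ~4 chars/token is a crude heuristic; use a real tokenizer for precision
    est_tokens = sum(len(m["content"]) for m in messages) // 4
    return cheap if est_tokens < token_threshold else expensive
```

For exact counts you could swap the heuristic for `litellm.token_counter`, keeping the routing decision unchanged.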

Retry Logic with Backoff

import time
from litellm import completion

def call_with_retries(model, messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = completion(model=model, messages=messages)
            return response
        except Exception as e:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # exponential backoff
                print(f"Attempt {attempt + 1} failed, retrying in {wait_time}s")
                time.sleep(wait_time)
            else:
                raise

result = call_with_retries("gpt-4", messages)
Production Pattern: Implement a multi-tier routing strategy: try primary model, fallback to secondary, finally fallback to reliable but expensive model. This maximizes cost efficiency while maintaining reliability.
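The multi-tier pattern described above can be sketched by combining per-tier retries with cross-tier fallback. This is an illustrative helper, not a LiteLLM API; `call_fn` and `sleep` are injected so the logic is testable without network access:

```python
import time

def tiered_completion(call_fn, tiers, messages, max_retries=2, sleep=time.sleep):
    """Try each tier in order; retry transient failures within a tier
    (with exponential backoff) before falling through to the next tier."""
    for model in tiers:
        for attempt in range(max_retries):
            try:
                return model, call_fn(model=model, messages=messages)
            except Exception:
                if attempt < max_retries - 1:
                    sleep(2 ** attempt)  # back off, then retry the same tier
    raise RuntimeError("All tiers exhausted")
```

In production you would pass `litellm.completion` as `call_fn` and a tier list like `["gpt-3.5-turbo", "claude-3-5-sonnet-20241022", "gpt-4"]`, cheapest first.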
SECTION 04

LiteLLM Proxy Server

The LiteLLM Proxy is a local, OpenAI-compatible server that sits between your application and all LLM providers. It enables virtual keys, unified logging, cost tracking, and simpler deployment.

Installation & Start

pip install 'litellm[proxy]'

# Start the proxy server
litellm --model gpt-4 --port 8000

# Or with a config file
litellm --config ./proxy_config.yaml

config.yaml: Define Models & Keys

model_list:
  - model_name: "gpt-4"
    litellm_params:
      model: "gpt-4"
      api_key: os.environ/OPENAI_API_KEY
  - model_name: "claude"
    litellm_params:
      model: "claude-3-5-sonnet-20241022"
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: "cheap"
    litellm_params:
      model: "replicate/llama-2-70b"
      api_key: os.environ/REPLICATE_API_KEY

router_settings:
  routing_strategy: "cost-based-routing"  # route by cost
  # Load balancing across deployments that share a model_name is automatic

general_settings:
  master_key: "sk-1234..."          # proxy admin key
  database_url: "postgresql://..."  # log spend to a DB

Client Calls to Proxy

from openai import OpenAI

# Point to the local proxy instead of OpenAI
client = OpenAI(
    base_url="http://localhost:8000",
    api_key="sk-1234..."  # proxy master key
)

# Use it like the normal OpenAI client
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

# Or request any model the proxy knows about
response = client.chat.completions.create(
    model="claude",
    messages=[...]
)

Virtual Keys: Per-User Cost Tracking

# Proxy admin creates a virtual key for each user
# POST /key/generate
{
  "user_id": "user-123",
  "max_budget": 10,              # $10 budget
  "budget_duration": "30d",      # resets every 30 days
  "models": ["gpt-4", "claude"]
}
# Returns: sk-virtual-user-123

# Client uses the virtual key
client = OpenAI(
    base_url="http://localhost:8000",
    api_key="sk-virtual-user-123"
)

# All calls are logged under user-123
# The proxy enforces the $10/30-day spend limit
Why Proxy Architecture: Instead of embedding LiteLLM in your app, run it as a service. Multiple apps can use it. Virtual keys let you track per-user spending. One place to update API keys and routing logic.
SECTION 05

Cost & Token Tracking

LiteLLM automatically calculates cost and token usage for every API call.

Get Cost from Response

from litellm import completion, completion_cost

response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

print(response.usage.prompt_tokens)      # tokens in the prompt
print(response.usage.completion_tokens)  # tokens in the response
print(response.usage.total_tokens)       # sum

# LiteLLM calculates cost from its built-in provider pricing table
cost = completion_cost(completion_response=response)  # USD (float)
print(f"This call cost ${cost:.4f}")

Track Cost Across Calls

from litellm import completion, completion_cost

total_cost = 0.0
total_tokens = 0

queries = [
    "What is machine learning?",
    "Explain neural networks",
    "How do transformers work?"
]

for query in queries:
    response = completion(
        model="gpt-4",
        messages=[{"role": "user", "content": query}]
    )
    cost = completion_cost(completion_response=response)
    total_cost += cost
    total_tokens += response.usage.total_tokens
    print(f"Query cost: ${cost:.4f}")

print(f"Total cost: ${total_cost:.2f} for {total_tokens} tokens")
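The running totals in the loop above are easy to factor into a small accumulator, so one object can be shared across request handlers. A sketch; the `CostTracker` class is illustrative, not a LiteLLM API:

```python
class CostTracker:
    """Accumulate spend and token counts across LLM calls.

    Feed record() the per-call cost and token count (e.g. from
    litellm.completion_cost and response.usage.total_tokens).
    """

    def __init__(self):
        self.total_cost = 0.0
        self.total_tokens = 0
        self.calls = 0

    def record(self, cost, total_tokens):
        """Add one call's numbers; returns the running total cost."""
        self.calls += 1
        self.total_cost += cost
        self.total_tokens += total_tokens
        return self.total_cost
```

Usage: after each completion, call `tracker.record(completion_cost(completion_response=response), response.usage.total_tokens)`.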

Set Spend Limits

import litellm
from litellm import completion

litellm.max_budget = 10.0  # $10 for this session

try:
    for i in range(100):
        response = completion(
            model="gpt-4",
            messages=[{"role": "user", "content": f"Prompt {i}"}]
        )
        # Once cumulative spend exceeds $10, LiteLLM raises BudgetExceededError
except litellm.BudgetExceededError:
    print("Session budget exceeded!")

Pricing Reference

LiteLLM has built-in pricing for 100+ models. Add custom pricing if needed:

import litellm
from litellm import completion, completion_cost

# Add custom model pricing
litellm.model_cost["custom-model"] = {
    "input_cost_per_token": 0.001,
    "output_cost_per_token": 0.002
}

response = completion(
    model="custom-model",
    messages=[...]
)

# Cost is calculated automatically using the custom pricing
print(completion_cost(completion_response=response))
Best Practice: Log cost alongside every LLM call. This data helps identify expensive queries, optimize routing, and forecast monthly spend.
SECTION 06

Observability Integration

LiteLLM integrates with leading observability platforms for monitoring, debugging, and auditing LLM calls.

LangSmith Integration

import os
import litellm
from litellm import completion

os.environ["LANGSMITH_API_KEY"] = "ls_..."
litellm.success_callback = ["langsmith"]  # log every successful call

response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)
# Automatically logged to LangSmith
# View latency, cost, and errors in the LangSmith dashboard

Langfuse Integration

import os
import litellm
from litellm import completion

os.environ["LANGFUSE_PUBLIC_KEY"] = "pk_..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk_..."
litellm.success_callback = ["langfuse"]

response = completion(
    model="gpt-4",
    messages=[...]
)
# Call logged to the Langfuse dashboard

Custom Callbacks

Define custom logging for any system:

import litellm
from litellm import completion

def log_to_datadog(kwargs, completion_response, start_time, end_time):
    """Called after every successful completion."""
    import datadog
    datadog.statsd.gauge(
        "llm.completion_time",
        (end_time - start_time).total_seconds()
    )
    datadog.statsd.gauge(
        "llm.cost",
        kwargs.get("response_cost", 0)
    )

litellm.success_callback = [log_to_datadog]

response = completion(model="gpt-4", messages=...)
# log_to_datadog is invoked automatically with call metadata

Production Observability: Connect LiteLLM to your monitoring stack. Track cost, latency, and errors across all providers. This visibility is crucial for cost control and debugging.
SECTION 07

Advanced Patterns

Beyond basic completion calls, LiteLLM supports advanced use cases:

Streaming Responses

from litellm import completion

response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True  # stream tokens as they arrive
)

for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry content=None
        print(delta, end="", flush=True)
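If you need the full text rather than incremental printing, the deltas can be accumulated with the same None-guard. A sketch; `accumulate_stream` is an illustrative helper, not a LiteLLM function:

```python
def accumulate_stream(chunks):
    """Join streamed deltas into the full response text.

    Skips empty/None deltas: in OpenAI-style streams the final chunk
    typically carries content=None alongside the finish reason.
    """
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts)
```

Usage: `text = accumulate_stream(completion(model="gpt-4", messages=..., stream=True))`.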

Function Calling (Tool Use)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                }
            }
        }
    }
]

response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "What's the weather in NYC?"}],
    tools=tools
)

# Works across providers (OpenAI, Claude, Gemini)
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    print(f"Call {tool_call.function.name} with {tool_call.function.arguments}")
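Once the model returns a tool call, you still have to execute it yourself. A minimal dispatch sketch, assuming a registry dict that maps tool names to local Python callables (the `dispatch_tool_call` helper is illustrative):

```python
import json

def dispatch_tool_call(tool_call, registry):
    """Invoke the local function the model asked for.

    tool_call: an OpenAI-format tool call (.function.name, .function.arguments).
    registry:  dict mapping tool name -> Python callable.
    """
    fn = registry[tool_call.function.name]
    args = json.loads(tool_call.function.arguments)  # arguments arrive as JSON text
    return fn(**args)
```

The result would then be appended to `messages` as a `{"role": "tool", ...}` entry for the follow-up completion.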

Vision Models

response = completion(
    model="gpt-4o",  # or "claude-3-5-sonnet-20241022" (also vision-capable)
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image"},
                {"type": "image_url", "image_url": {"url": "https://..."}}
            ]
        }
    ]
)

print(response.choices[0].message.content)

Embedding Unification

from litellm import embedding

# All embedding models through the same interface
response = embedding(
    model="text-embedding-3-small",  # OpenAI
    input=["Hello world"]
)

# Or switch to a different provider
response = embedding(
    model="voyage/voyage-2",  # Voyage AI
    input=["Hello world"]
)
# Same response format; cost calculated automatically
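A typical downstream use of those vectors is similarity search. A self-contained cosine-similarity sketch (note: only compare vectors produced by the same embedding model; vectors from different providers live in different spaces):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors; 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

In practice you would pull the vectors out of the embedding response (OpenAI format: `response.data[0]["embedding"]`) and compare query vectors against stored document vectors.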
Unified Abstractions: Streaming, tool calls, vision, embeddingsβ€”all work consistently across providers. LiteLLM normalizes these advanced features so you don't need provider-specific code.
SECTION 08

Provider Capability Comparison

LiteLLM abstracts provider differences, but understanding what each provider actually supports helps you avoid silent fallbacks and write more reliable routing logic.

| Feature                | OpenAI        | Anthropic      | Google Gemini   | AWS Bedrock           |
|------------------------|---------------|----------------|-----------------|-----------------------|
| Streaming              | ✓             | ✓              | ✓               | ✓ (model-dependent)   |
| Function/tool calling  | ✓ (parallel)  | ✓ (parallel)   | ✓               | ✓ (model-dependent)   |
| Vision input           | ✓ (GPT-4V+)   | ✓ (Claude 3+)  | ✓ (Gemini 1.5+) | ✓ (via Claude/Titan)  |
| JSON mode              | ✓ (native)    | ✓ (via prompt) | ✓ (native)      | Model-dependent       |
| Max context (flagship) | 128K          | 200K           | 1M+             | Varies by model       |
| Prompt caching         | ✓ (auto)      | ✓ (explicit)   | ✓ (auto)        | Model-dependent       |

When routing across providers with LiteLLM, specify model_list with explicit capability tags so the router can make intelligent fallback decisions. Set allowed_fails=2 and cooldown_time=60 to handle transient provider outages without cascading failures to your primary provider.
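One way to make those capability tags actionable is to keep a small capability map and build fallback chains from it. This is an application-level sketch, not a LiteLLM schema; the tag names and capability sets below are illustrative:

```python
# Illustrative capability tags per deployment (not a LiteLLM data structure)
MODEL_CAPS = {
    "gpt-4": {"tools", "vision", "json_mode"},
    "claude-3-5-sonnet-20241022": {"tools", "vision"},
    "gpt-3.5-turbo": {"tools", "json_mode"},
}

def fallback_chain(required_caps, caps=MODEL_CAPS):
    """Return models (in declaration order) that support every required capability."""
    need = set(required_caps)
    return [model for model, have in caps.items() if need <= have]
```

A request needing vision would then only ever be routed or failed over to vision-capable models, avoiding the silent-fallback problem the table highlights.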