SECTION 01
The Provider Fragmentation Problem
LLM providers each have unique SDKs, parameter names, authentication schemes, and error formats. Building a production system that can switch providers or use multiple providers simultaneously becomes a maintenance nightmare.
The Fragmentation
- OpenAI: `client.chat.completions.create()`, parameters like `temperature`, `max_tokens`, `messages`
- Anthropic: `client.messages.create()`, parameters like `max_tokens`, `system`, `messages`
- Google Gemini: Different request format, different parameter names
- Llama (via Together/Replicate): Different endpoints, different APIs
- Local Models (Ollama): Custom server, different protocol
The Result: Code tightly coupled to specific providers. Switching models requires rewriting core logic. Fallback chains require extensive boilerplate. Cost tracking is manual. Observability is provider-specific.
LiteLLM Solution
LiteLLM provides a single, unified OpenAI-compatible interface for all providers. Write code once, swap providers with a single parameter change.
Core Value: "The fastest way to call any LLM is with one line of code." LiteLLM abstracts provider differences so your application code stays clean and portable.
SECTION 02
Core API
LiteLLM's API is OpenAI-compatible but works with any provider:
completion() - Text Generation
from litellm import completion
# Works with OpenAI
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7
)
# Same code, different provider
response = completion(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7
)
# Or Gemini (note the provider prefix)
response = completion(
    model="gemini/gemini-pro",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7
)
print(response.choices[0].message.content)
acompletion() - Async Calls
import asyncio
from litellm import acompletion
async def call_llm():
    response = await acompletion(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello"}]
    )
    return response
result = asyncio.run(call_llm())
Response Format
Responses are normalized to OpenAI format regardless of provider:
# All providers return this structure
response.choices[0].message.content # The text
response.choices[0].message.tool_calls # If tool use
response.usage.prompt_tokens
response.usage.completion_tokens
response.model # "gpt-4", "claude-...", etc
Environment Setup
Set API keys as environment variables; LiteLLM auto-detects them:
# Set env vars
export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GOOGLE_API_KEY=...
# LiteLLM finds them automatically
response = completion(model="gpt-4", messages=...)
response = completion(model="claude-3-5-sonnet-20241022", messages=...)
Unified Model Naming: LiteLLM normalizes model names. Use `model="gpt-4"` for OpenAI, `model="claude-3-5-sonnet-20241022"` for Anthropic, and a provider prefix where required (e.g. `model="gemini/gemini-pro"` for Google AI Studio). The interface stays the same.
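Where a team wants friendlier names than raw model strings, a thin alias map keeps call sites stable. The alias names below are hypothetical; the `gemini/` prefix follows LiteLLM's provider-prefix convention.

```python
# Illustrative helper: map short in-house aliases to LiteLLM model strings.
MODEL_ALIASES = {
    "smart": "gpt-4",
    "claude": "claude-3-5-sonnet-20241022",
    "gemini": "gemini/gemini-pro",
}

def resolve_model(alias: str) -> str:
    """Return the LiteLLM model string for an alias, or the alias unchanged."""
    return MODEL_ALIASES.get(alias, alias)

# usage (requires litellm + API keys):
# from litellm import completion
# response = completion(model=resolve_model("claude"), messages=[...])
```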
SECTION 03
Provider Routing
LiteLLM's killer feature: seamless routing between providers for fallback, load balancing, and cost optimization.
Fallback: Try Expensive Model, Fall Back to Cheap
from litellm import completion
# Try GPT-4 (more capable, more expensive)
# If it fails, try Claude (fallback)
# If that fails, use Llama (cheapest)
fallback_models = [
    "gpt-4",
    "claude-3-5-sonnet-20241022",
    "replicate/llama-2-70b"
]
for model in fallback_models:
    try:
        response = completion(
            model=model,
            messages=[{"role": "user", "content": "Solve this math problem"}]
        )
        print(f"Success with {model}")
        break
    except Exception as e:
        print(f"{model} failed: {e}")
        continue
Load Balancing: Distribute Across Providers
import random
models = ["gpt-4", "gpt-3.5-turbo", "claude-3-5-sonnet-20241022"]
# Random load balancing
chosen_model = random.choice(models)
response = completion(model=chosen_model, messages=...)
# Or weighted (e.g., 60% GPT-4, 40% Claude)
models_weighted = ["gpt-4"] * 6 + ["claude-3-5-sonnet-20241022"] * 4
chosen_model = random.choice(models_weighted)
response = completion(model=chosen_model, messages=...)
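The duplicated-list trick above can be expressed more directly with `random.choices`, which accepts fractional weights. A minimal sketch using the same 60/40 split:

```python
import random
from typing import Optional

# Weighted routing without duplicating list entries: random.choices takes
# fractional weights directly. The 60/40 split mirrors the example above.
MODELS = ["gpt-4", "claude-3-5-sonnet-20241022"]
WEIGHTS = [0.6, 0.4]

def pick_weighted(rng: Optional[random.Random] = None) -> str:
    """Pick a model according to the configured weights."""
    chooser = rng or random
    return chooser.choices(MODELS, weights=WEIGHTS, k=1)[0]

# usage (requires litellm + API keys):
# from litellm import completion
# response = completion(model=pick_weighted(), messages=[...])
```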
Cost-Based Routing
Route based on input length and cost:
# For short inputs, use cheap model
# For complex inputs, use expensive model
input_length = len(messages[0]["content"])
if input_length < 500:
    model = "gpt-3.5-turbo"  # ~$0.0005 per 1k tokens
else:
    model = "gpt-4"  # ~$0.03 per 1k tokens
response = completion(model=model, messages=messages)
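Character length is only a rough proxy for tokens. The threshold logic can be isolated in a small pure function; the 500 cutoff mirrors the example above, and the `litellm.token_counter` helper mentioned in the comment gives a real token count instead of a character count.

```python
def pick_model_by_tokens(token_count: int, threshold: int = 500) -> str:
    """Route short prompts to the cheap model, long ones to the capable one."""
    return "gpt-3.5-turbo" if token_count < threshold else "gpt-4"

# usage (requires litellm + API keys):
# from litellm import completion, token_counter
# n = token_counter(model="gpt-4", messages=messages)
# response = completion(model=pick_model_by_tokens(n), messages=messages)
```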
Retry Logic with Backoff
import time
def call_with_retries(model, messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = completion(model=model, messages=messages)
            return response
        except Exception:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff
                print(f"Attempt {attempt+1} failed, retrying in {wait_time}s")
                time.sleep(wait_time)
            else:
                raise
result = call_with_retries("gpt-4", messages)
Production Pattern: Implement a multi-tier routing strategy: try the primary model, fall back to a secondary, and finally to a reliable but more expensive model. This maximizes cost efficiency while maintaining reliability.
SECTION 04
LiteLLM Proxy Server
The LiteLLM Proxy is a local OpenAI-compatible server that sits between your app and all LLM providers. It enables virtual keys, unified logging, cost tracking, and easier deployment.
Installation & Start
pip install 'litellm[proxy]'
# Start proxy server
litellm --model gpt-4 --port 8000
# Or with config file
litellm --config ./proxy_config.yaml
config.yaml: Define Models & Keys
model_list:
  - model_name: "gpt-4"
    litellm_params:
      model: "gpt-4"
      api_key: os.environ/OPENAI_API_KEY
  - model_name: "claude"
    litellm_params:
      model: "claude-3-5-sonnet-20241022"
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: "cheap"
    litellm_params:
      model: "replicate/llama-2-70b"
      api_key: os.environ/REPLICATE_API_KEY
router_settings:
  routing_strategy: "cost-based-routing"  # Route by cost
  # Deployments sharing a model_name are load balanced automatically
general_settings:
  master_key: "sk-1234..."  # Proxy key
  database_url: "postgresql://..."  # Log to DB
Client Calls to Proxy
from openai import OpenAI
# Point to the local proxy instead of OpenAI
client = OpenAI(
    base_url="http://localhost:8000",
    api_key="sk-1234..."  # Proxy master key
)
# Use like normal OpenAI
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)
# Or request any model the proxy knows about
response = client.chat.completions.create(
    model="claude",
    messages=[...]
)
Virtual Keys: Per-User Cost Tracking
# Proxy admin creates a virtual key for each user
# POST /key/generate
{
    "user_id": "user-123",
    "max_budget": 10,  # $10 budget
    "budget_duration": "30d",  # Resets monthly
    "models": ["gpt-4", "claude"]
}
# Returns a virtual key, e.g. sk-virtual-user-123
# Client uses the virtual key
client = OpenAI(
    base_url="http://localhost:8000",
    api_key="sk-virtual-user-123"
)
# All calls are logged under user-123
# The proxy rejects calls once the monthly budget is spent
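A minimal sketch of calling the key-generation endpoint from Python. The `max_budget` request field and the `key` field in the response are assumptions based on the proxy docs, so verify them against your proxy version; the master key and URL are placeholders.

```python
import json

def build_key_request(user_id: str, max_budget: float, models: list) -> dict:
    """Build the JSON body for the proxy's POST /key/generate endpoint.
    Field names follow the LiteLLM proxy docs; verify against your version."""
    return {"user_id": user_id, "max_budget": max_budget, "models": models}

# usage (requires a running proxy; master key is illustrative):
# import requests
# resp = requests.post(
#     "http://localhost:8000/key/generate",
#     headers={"Authorization": "Bearer sk-1234..."},
#     json=build_key_request("user-123", 10.0, ["gpt-4", "claude"]),
# )
# virtual_key = resp.json()["key"]

body = build_key_request("user-123", 10.0, ["gpt-4", "claude"])
print(json.dumps(body))
```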
Why Proxy Architecture: Instead of embedding LiteLLM in your app, run it as a service. Multiple apps can use it. Virtual keys let you track per-user spending. One place to update API keys and routing logic.
SECTION 05
Cost & Token Tracking
LiteLLM automatically calculates cost and token usage for every API call.
Get Cost from Response
from litellm import completion, completion_cost
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.usage.prompt_tokens)      # Tokens in prompt
print(response.usage.completion_tokens)  # Tokens in response
print(response.usage.total_tokens)       # Sum
# LiteLLM calculates cost from built-in provider pricing
cost = completion_cost(completion_response=response)  # USD (float)
print(f"This call cost ${cost:.4f}")
Track Cost Across Calls
from litellm import completion, completion_cost
total_cost = 0.0
total_tokens = 0
queries = [
    "What is machine learning?",
    "Explain neural networks",
    "How do transformers work?"
]
for query in queries:
    response = completion(
        model="gpt-4",
        messages=[{"role": "user", "content": query}]
    )
    cost = completion_cost(completion_response=response)
    total_cost += cost
    total_tokens += response.usage.total_tokens
    print(f"Query cost: ${cost:.4f}")
print(f"Total cost: ${total_cost:.2f} for {total_tokens} tokens")
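For in-app guardrails, the running total above can be wrapped in a small tracker that raises once a limit is hit. This `BudgetTracker` class is a hand-rolled sketch, not a LiteLLM API; LiteLLM's own `litellm.max_budget` is the built-in equivalent.

```python
class BudgetTracker:
    """Minimal in-app spend guard (hypothetical helper, not a LiteLLM API)."""
    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent = 0.0

    def record(self, cost: float) -> None:
        """Add one call's cost; raise once the cumulative spend exceeds the limit."""
        self.spent += cost
        if self.spent > self.limit_usd:
            raise RuntimeError(
                f"Budget exceeded: ${self.spent:.4f} > ${self.limit_usd:.2f}"
            )

tracker = BudgetTracker(limit_usd=10.0)
tracker.record(0.03)  # call tracker.record(...) with each call's cost
```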
Set Spend Limits
import litellm
from litellm import completion
litellm.max_budget = 10.0  # $10 for this session
# litellm.success_callback = [...]  # optionally log each call
try:
    for i in range(100):
        response = completion(
            model="gpt-4",
            messages=[{"role": "user", "content": f"Prompt {i}"}]
        )
        # Once total spend > $10, litellm raises BudgetExceededError
except litellm.BudgetExceededError:
    print("Session budget exceeded!")
Pricing Reference
LiteLLM has built-in pricing for 100+ models. Add custom pricing if needed:
import litellm
from litellm import completion, completion_cost
# Add custom model pricing
litellm.model_cost["custom-model"] = {
    "input_cost_per_token": 0.001,
    "output_cost_per_token": 0.002
}
response = completion(
    model="custom-model",
    messages=[...]
)
# Cost auto-calculated with the custom pricing
print(completion_cost(completion_response=response))
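The underlying arithmetic is simple per-token pricing. Using the custom rates above ($0.001 per input token, $0.002 per output token), the cost of a call works out as:

```python
def manual_cost(prompt_tokens: int, completion_tokens: int,
                in_rate: float, out_rate: float) -> float:
    """Cost = prompt tokens * input rate + completion tokens * output rate."""
    return prompt_tokens * in_rate + completion_tokens * out_rate

# 1000 prompt tokens and 500 completion tokens at the custom rates above:
cost = manual_cost(1000, 500, 0.001, 0.002)
print(round(cost, 6))
```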
Best Practice: Log cost alongside every LLM call. This data helps identify expensive queries, optimize routing, and forecast monthly spend.
SECTION 06
Observability Integration
LiteLLM integrates with leading observability platforms for monitoring, debugging, and auditing LLM calls.
LangSmith Integration
import os
import litellm
from litellm import completion
os.environ["LANGSMITH_API_KEY"] = "ls_..."
litellm.success_callback = ["langsmith"]  # Log successful calls to LangSmith
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)
# Automatically logged to LangSmith
# View in the LangSmith dashboard with latency, cost, errors
Langfuse Integration
import os
import litellm
from litellm import completion
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk_..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk_..."
litellm.success_callback = ["langfuse"]
response = completion(
    model="gpt-4",
    messages=[...]
)
# Call logged to the Langfuse dashboard
Custom Callbacks
Define custom logging for any system:
import litellm
from litellm import completion

def log_to_datadog(kwargs, completion_response, start_time, end_time):
    """Called after every successful completion."""
    import datadog
    duration = (end_time - start_time).total_seconds()
    datadog.statsd.gauge("llm.completion_time", duration)
    datadog.statsd.gauge("llm.cost", kwargs.get("response_cost", 0))

litellm.success_callback = [log_to_datadog]
response = completion(model="gpt-4", messages=...)
# log_to_datadog is automatically called with call stats
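A callback can also aggregate stats in memory instead of shipping them out. The class below follows the four-argument success-callback signature (an assumption to verify against your LiteLLM version) and is exercised here with simulated call data rather than live API calls.

```python
from collections import defaultdict

class StatsAggregator:
    """Collects per-model call counts and cost. The (kwargs, response,
    start_time, end_time) call shape mirrors LiteLLM's custom success
    callbacks; verify against your installed version."""
    def __init__(self):
        self.cost_by_model = defaultdict(float)
        self.calls_by_model = defaultdict(int)

    def __call__(self, kwargs, completion_response, start_time, end_time):
        model = kwargs.get("model", "unknown")
        self.calls_by_model[model] += 1
        self.cost_by_model[model] += kwargs.get("response_cost") or 0.0

stats = StatsAggregator()
# litellm.success_callback = [stats]  # register (requires litellm)

# Simulated call data to show the bookkeeping:
stats({"model": "gpt-4", "response_cost": 0.012}, None, None, None)
stats({"model": "gpt-4", "response_cost": 0.008}, None, None, None)
print(stats.calls_by_model["gpt-4"], round(stats.cost_by_model["gpt-4"], 3))
```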
Built-in Monitoring Metrics
- Latency (TTFB, full response time)
- Cost per call and aggregate
- Token usage (prompt, completion, total)
- Error rates and error types
- Model usage distribution
- API availability and uptime
Production Observability: Connect LiteLLM to your monitoring stack. Track cost, latency, and errors across all providers. This visibility is crucial for cost control and debugging.
SECTION 07
Advanced Patterns
Beyond basic completion calls, LiteLLM supports advanced use cases:
Streaming Responses
from litellm import completion
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True  # Stream tokens as they arrive
)
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:  # The final chunk's delta content can be None
        print(delta, end="", flush=True)
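The chunk-joining logic can be factored into a helper and tested without a live stream; the fake chunks below just mimic the `choices[0].delta.content` shape shown above.

```python
from types import SimpleNamespace as NS

def join_stream(chunks) -> str:
    """Concatenate streamed deltas into the full text, skipping chunks
    whose delta content is None (e.g. the final chunk)."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts)

def fake_chunk(text):
    """Stand-in for a streaming chunk, for demonstration only."""
    return NS(choices=[NS(delta=NS(content=text))])

full = join_stream([fake_chunk("Hel"), fake_chunk("lo"), fake_chunk(None)])
print(full)  # Hello
```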
Function Calling (Tool Use)
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                }
            }
        }
    }
]
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "What's the weather in NYC?"}],
    tools=tools
)
# Works across providers (OpenAI, Claude, Gemini)
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    print(f"Call {tool_call.function.name} with {tool_call.function.arguments}")
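After the model requests a tool call, the result is sent back as a `"role": "tool"` message (OpenAI's schema, which LiteLLM mirrors). A sketch of building that follow-up message list, using plain dicts in place of the SDK's message and tool-call objects:

```python
import json

def tool_result_messages(original_messages, assistant_message, tool_call, result):
    """Build the follow-up messages: the assistant's tool-call turn plus a
    'tool' role message carrying the function result, keyed by tool_call_id."""
    return original_messages + [
        assistant_message,
        {
            "role": "tool",
            "tool_call_id": tool_call["id"],
            "content": json.dumps(result),
        },
    ]

msgs = tool_result_messages(
    [{"role": "user", "content": "What's the weather in NYC?"}],
    {"role": "assistant", "tool_calls": [{"id": "call_1"}]},
    {"id": "call_1"},
    {"temp_f": 68},
)
# usage (requires litellm + API keys):
# followup = completion(model="gpt-4", messages=msgs, tools=tools)
```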
Vision Models
response = completion(
    model="gpt-4o",  # Vision-capable; or "claude-3-5-sonnet-20241022"
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image"},
                {"type": "image_url", "image_url": {"url": "https://..."}}
            ]
        }
    ]
)
print(response.choices[0].message.content)
Embedding Unification
from litellm import embedding
# All embedding models through the same interface
embeddings = embedding(
    model="text-embedding-3-small",  # OpenAI
    input="Hello world"
)
# Or switch to a different provider
embeddings = embedding(
    model="voyage/voyage-2",  # Voyage AI
    input="Hello world"
)
# Same response format, auto-cost calculation
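Embedding vectors are typically compared with cosine similarity. A self-contained helper; the `response.data[0]["embedding"]` access in the comment assumes LiteLLM's OpenAI-style embedding response shape.

```python
import math

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# usage (requires litellm + API keys):
# v1 = embedding(model="text-embedding-3-small",
#                input="Hello world").data[0]["embedding"]
# v2 = embedding(model="text-embedding-3-small",
#                input="Hi there").data[0]["embedding"]
# print(cosine_similarity(v1, v2))

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```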
Unified Abstractions: Streaming, tool calls, vision, and embeddings all work consistently across providers. LiteLLM normalizes these advanced features so you don't need provider-specific code.