Exponential backoff with jitter for transient LLM API errors — rate limits, 529s, and network blips are inevitable at scale. A good retry policy turns 99.5% availability into 99.95% from the client's perspective.
LLM API calls fail transiently more often than most APIs. Rate limits (429) are by design — providers enforce token and request quotas per minute. Overload errors (529 from Anthropic, 503 from OpenAI) occur when inference clusters are under load. Network blips drop connections mid-stream. These are transient failures — retrying after a brief wait usually succeeds.
Without retry logic, a 1% per-request failure rate means ~10% of 10-step agent pipelines fail. With good retry logic (3 attempts, exponential backoff), the same 1% failure rate produces ~0.001% pipeline failures — three orders of magnitude better.
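The arithmetic behind those numbers is easy to verify. A quick sketch (the helper name is ours, not from any library), assuming failures are independent:

```python
# A step fails only if all of its attempts fail; the pipeline fails if
# any of its steps fail. This reproduces the figures quoted above.
def pipeline_failure_rate(p_fail: float, steps: int, attempts: int = 1) -> float:
    p_step = p_fail ** attempts
    return 1 - (1 - p_step) ** steps

print(f"{pipeline_failure_rate(0.01, 10):.4f}")     # 0.0956 — roughly 10%
print(f"{pipeline_failure_rate(0.01, 10, 3):.3e}")  # 1.000e-05 — roughly 0.001%
```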
The naive fix — immediate retry — creates a thundering herd problem: if 1000 clients all hit a rate limit at once and all retry immediately, they all hit the rate limit again simultaneously. Exponential backoff spreads retries out; jitter randomises them further to prevent synchronisation across clients.
```python
import time
import random

def backoff_delay(attempt: int,
                  base: float = 1.0,
                  max_delay: float = 60.0,
                  jitter: bool = True) -> float:
    # Exponential: 1s, 2s, 4s, 8s, 16s, 32s, 60s (capped)
    delay = min(base * (2 ** attempt), max_delay)
    if jitter:
        # Full jitter: random in [0, delay] — best for reducing thundering herd
        delay = random.uniform(0, delay)
    return delay
```
```python
# Manual retry loop
def call_with_retry(fn, max_attempts=5, retryable_codes=(429, 500, 503, 529)):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as e:
            status = getattr(e, 'status_code', None)
            if status not in retryable_codes or attempt == max_attempts - 1:
                raise
            delay = backoff_delay(attempt)
            print(f"Attempt {attempt+1} failed ({status}). Retrying in {delay:.1f}s...")
            time.sleep(delay)
```
```python
# Delay sequence with full jitter (example — your values will differ):
for i in range(6):
    print(f"attempt {i}: delay = {backoff_delay(i):.2f}s")
# attempt 0: delay = 0.43s
# attempt 1: delay = 0.87s
# attempt 2: delay = 2.13s
# attempt 3: delay = 4.99s
# attempt 4: delay = 13.70s
# attempt 5: delay = 31.20s
```
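Full jitter is not the only option. Another strategy worth knowing is decorrelated jitter (described on the AWS Architecture Blog), where each delay is drawn relative to the previous delay rather than the attempt number — a sketch:

```python
import random

def decorrelated_jitter(prev_delay: float, base: float = 1.0,
                        max_delay: float = 60.0) -> float:
    # Each delay is sampled from [base, 3 * previous delay], capped.
    # Delays grow on average but never synchronise across clients.
    return min(max_delay, random.uniform(base, max(base, prev_delay * 3)))

delay = 1.0
for attempt in range(5):
    delay = decorrelated_jitter(delay)  # use as the sleep before each retry
```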
```bash
pip install tenacity
```
```python
from tenacity import (
    retry, stop_after_attempt, wait_exponential_jitter,
    retry_if_exception, before_sleep_log
)
import anthropic
import logging

logger = logging.getLogger(__name__)
client = anthropic.Anthropic()

def is_retryable(exc: Exception) -> bool:
    if isinstance(exc, anthropic.RateLimitError):
        return True
    if isinstance(exc, anthropic.APIStatusError):
        return exc.status_code in (500, 503, 529)
    if isinstance(exc, anthropic.APIConnectionError):
        return True
    return False

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential_jitter(initial=1, max=60),  # exp backoff + jitter
    retry=retry_if_exception(is_retryable),
    before_sleep=before_sleep_log(logger, logging.WARNING),
)
def call_claude(prompt: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

# Usage — automatic retry with logging
result = call_claude("Explain backoff algorithms.")
```
```python
from anthropic import (
    RateLimitError,      # 429 — too many requests
    APIStatusError,      # wraps HTTP errors
    APIConnectionError,  # network failure
    APITimeoutError,     # request timed out
)

RETRY_STATUS_CODES = {
    429,  # Rate limit — always retry with backoff
    500,  # Internal server error — usually transient
    503,  # Service unavailable — temporary overload
    529,  # Anthropic overload — retry with longer backoff
}

DO_NOT_RETRY = {
    400,  # Bad request — your input is malformed, retrying won't help
    401,  # Unauthorised — fix your API key
    403,  # Forbidden — permissions issue
    404,  # Not found — wrong endpoint
    422,  # Unprocessable entity — invalid parameters
}

def classify_error(exc: Exception) -> str:
    if isinstance(exc, RateLimitError):
        return "retry"  # always retry 429
    if isinstance(exc, APIStatusError):
        code = exc.status_code
        if code in RETRY_STATUS_CODES:
            return "retry"
        if code in DO_NOT_RETRY:
            return "fail"
        return "retry"  # unknown 5xx — optimistic retry
    if isinstance(exc, (APIConnectionError, APITimeoutError)):
        return "retry"
    return "fail"  # unknown exception — don't mask bugs with retries
```
```python
# Special case: respect Retry-After headers
def get_retry_delay(exc: Exception) -> float | None:
    if isinstance(exc, APIStatusError):
        retry_after = exc.response.headers.get("retry-after")
        if retry_after:
            return float(retry_after) + 0.1  # honour the header + small buffer
    return None  # fall back to exponential backoff
```
```python
import time
import random
from dataclasses import dataclass, field

@dataclass
class RetryBudget:
    total_deadline_s: float  # absolute wall-clock deadline for the operation
    max_attempts: int = 5
    _attempts: int = field(default=0, init=False)
    _start: float = field(default_factory=time.time, init=False)

    @property
    def time_remaining(self) -> float:
        return self.total_deadline_s - (time.time() - self._start)

    @property
    def can_retry(self) -> bool:
        return self._attempts < self.max_attempts and self.time_remaining > 0.5

    def record_attempt(self):
        self._attempts += 1

    def next_delay(self) -> float:
        base_delay = min(1.0 * (2 ** self._attempts), 30.0)
        delay = random.uniform(0, base_delay)
        # Never sleep longer than we have remaining
        return min(delay, self.time_remaining - 0.5)

def call_with_budget(fn, deadline_s: float = 30.0):
    budget = RetryBudget(total_deadline_s=deadline_s)
    last_error = None
    while budget.can_retry:
        budget.record_attempt()
        try:
            return fn()
        except Exception as e:
            if classify_error(e) != "retry":
                raise
            last_error = e
            delay = budget.next_delay()
            if delay > 0:
                time.sleep(delay)
    raise last_error or TimeoutError("Retry budget exhausted")
```
```python
import asyncio
import random
from anthropic import AsyncAnthropic

async_client = AsyncAnthropic()

async def async_call_with_retry(prompt: str, max_attempts: int = 5) -> str:
    for attempt in range(max_attempts):
        try:
            response = await async_client.messages.create(
                model="claude-haiku-4-5-20251001",
                max_tokens=512,
                messages=[{"role": "user", "content": prompt}]
            )
            return response.content[0].text
        except Exception as e:
            if classify_error(e) != "retry" or attempt == max_attempts - 1:
                raise
            delay = random.uniform(0, min(1.0 * (2 ** attempt), 30.0))
            await asyncio.sleep(delay)  # non-blocking sleep

# Run multiple requests with retry, in parallel. With return_exceptions=True,
# failed prompts come back as Exception objects instead of raising, hence
# the union return type.
async def batch_with_retry(prompts: list[str]) -> list[str | Exception]:
    tasks = [async_call_with_retry(p) for p in prompts]
    return await asyncio.gather(*tasks, return_exceptions=True)
```
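A useful complement to per-request retries in a batch: cap the number of in-flight requests so a large batch doesn't trigger the very 429s it then has to retry. A sketch (the default limit of 8 is an arbitrary assumption — tune it against your provider's rate limits):

```python
import asyncio

async def bounded_gather(coros, limit: int = 8):
    # A semaphore caps concurrency; results come back in input order.
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```

For example, `await bounded_gather([async_call_with_retry(p) for p in prompts], limit=8)`.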
```python
# tenacity also has async support:
from tenacity import AsyncRetrying, stop_after_attempt, wait_exponential_jitter

async def tenacity_async(prompt: str) -> str:
    async for attempt in AsyncRetrying(stop=stop_after_attempt(5),
                                       wait=wait_exponential_jitter()):
        with attempt:
            resp = await async_client.messages.create(
                model="claude-haiku-4-5-20251001", max_tokens=512,
                messages=[{"role": "user", "content": prompt}])
            return resp.content[0].text
```
Don't retry non-idempotent operations blindly. If your call has side effects (sends an email, creates a record, charges a payment), retrying on failure can cause duplicates. Use idempotency keys (Stripe's approach) or check-then-act patterns before retrying operations with side effects. LLM completions are idempotent; actions your agents take downstream may not be.
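The idempotency-key pattern can be sketched in a few lines (names here are illustrative, not a real library API): the caller mints a stable key per logical operation, and the executor performs the side effect at most once per key, so retries become safe.

```python
import uuid

_completed: dict[str, object] = {}

def run_once(idempotency_key: str, side_effect):
    # A duplicate retry with the same key returns the cached result
    # instead of repeating the side effect.
    if idempotency_key in _completed:
        return _completed[idempotency_key]
    result = side_effect()
    _completed[idempotency_key] = result
    return result

key = str(uuid.uuid4())  # minted once per logical operation, reused on retry
```

In production the cache would live in shared storage (e.g. a database with a unique constraint on the key), not process memory.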
Honour Retry-After headers. When Anthropic returns a 429, the response headers often include retry-after: 20 — wait at least 20 seconds before retrying. Ignoring this and retrying immediately based on your own backoff schedule will get your requests rejected again, wasting time. Always check for and respect Retry-After before applying your own backoff.
Log retry attempts with context. A retry loop that silently sleeps and retries is a debugging nightmare. Log at WARNING level on each retry: the attempt number, error code, delay, and the first ~100 chars of the prompt. This lets you distinguish "normal rate limiting" from "stuck in an infinite 429 loop due to a quota bug" during incident investigation.
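A minimal version of that logging (the format string and logger name are our choices):

```python
import logging

retry_logger = logging.getLogger("llm.retry")

def log_retry(attempt: int, status, delay: float, prompt: str) -> None:
    # One structured WARNING per retry: enough context to tell routine
    # rate limiting apart from a stuck 429 loop during an incident.
    retry_logger.warning(
        "retry attempt=%d status=%s delay=%.1fs prompt=%r",
        attempt, status, delay, prompt[:100],
    )
```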
| Error Type | HTTP Code | Retry? | Strategy | Max Attempts |
|---|---|---|---|---|
| Rate limit | 429 | Yes | Exponential backoff + jitter; respect Retry-After header | 5 |
| Server error | 500, 529 | Yes | Exponential backoff from 1s | 3 |
| Gateway / unavailable | 502, 503, 504 | Yes | Linear backoff from 2s | 3 |
| Auth error | 401, 403 | No | Alert immediately; token may be expired | 0 |
| Bad request | 400 | No | Fix request; retrying will always fail | 0 |
| Context too long | 400 (specific) | Maybe | Truncate input and retry once | 1 |
In async high-concurrency applications, add per-request deadline tracking alongside retry logic. A request that has already consumed 80% of its SLA budget should not attempt a full exponential backoff sequence — it should fail fast and let the caller handle degradation. Use asyncio.wait_for with a deadline calculated from the original request timestamp, not from the retry attempt timestamp.
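A sketch of that pattern (names are illustrative): the timeout passed to `asyncio.wait_for` is derived from the original request timestamp, never reset per retry.

```python
import asyncio
import time

async def call_with_deadline(coro_fn, request_start: float, sla_s: float = 30.0):
    # Remaining budget is measured from when the request ENTERED the
    # system, so retries can never push it past its SLA.
    remaining = sla_s - (time.monotonic() - request_start)
    if remaining <= 0:
        raise TimeoutError("SLA budget already exhausted")
    return await asyncio.wait_for(coro_fn(), timeout=remaining)
```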
When using the Anthropic Python SDK, built-in retry logic (max_retries parameter) handles 429 and 529 errors automatically. Disable it only if you need custom logic such as falling back to a different model on the third retry. Set anthropic.Anthropic(max_retries=0) and implement your own wrapper with model-switching using tenacity.
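One way to sketch such a wrapper, with the SDK call abstracted behind a caller-supplied function so the fallback logic stands alone (the second model name and the `call_model` helper are illustrative assumptions):

```python
import random
import time

def call_with_fallback(call_model, prompt: str,
                       models=("claude-haiku-4-5-20251001",
                               "claude-sonnet-4-5-20250929"),
                       attempts_per_model: int = 3,
                       base: float = 1.0):
    # call_model(model, prompt) would wrap client.messages.create, with
    # max_retries=0 on the client so all retry policy lives here.
    last_error = None
    for model in models:
        for attempt in range(attempts_per_model):
            try:
                return call_model(model, prompt)
            except Exception as e:
                last_error = e
                # Exponential backoff + full jitter between attempts
                time.sleep(random.uniform(0, min(base * 2 ** attempt, 30.0)))
    raise last_error
```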
For distributed microservice architectures, implement circuit-breaker logic alongside retry logic. After 5 consecutive failures to the same upstream provider, open the circuit and fail fast for 60 seconds before attempting a probe request. This prevents cascading failures where a slow LLM API causes your entire request queue to back up with retrying requests. Libraries like pybreaker implement this pattern with minimal boilerplate.
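A minimal hand-rolled version of the pattern, for illustration (pybreaker packages the same idea; the thresholds mirror the numbers above):

```python
import time

class CircuitBreaker:
    # After fail_max consecutive failures, fail fast for reset_timeout_s
    # seconds, then let a single probe request through (half-open state).
    def __init__(self, fail_max: int = 5, reset_timeout_s: float = 60.0):
        self.fail_max = fail_max
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open — failing fast")
            self.opened_at = None  # half-open: allow one probe
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.fail_max:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success closes the circuit
        return result
```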