Reliability Patterns

Retry & Backoff

Exponential backoff with jitter for transient LLM API errors — rate limits, 529s, and network blips are inevitable at scale. A good retry policy turns 99.5% availability into 99.95% from the client's perspective.

SECTION 01

Why LLM APIs need retry logic

LLM API calls fail transiently more often than most APIs. Rate limits (429) are by design — providers enforce token and request quotas per minute. Overload errors (529 from Anthropic, 503 from OpenAI) occur when inference clusters are under load. Network blips drop connections mid-stream. These are transient failures — retrying after a brief wait usually succeeds.

Without retry logic, a 1% per-request failure rate means ~10% of 10-step agent pipelines fail. With good retry logic (3 attempts, exponential backoff), the same 1% failure rate produces ~0.001% pipeline failures — three orders of magnitude better.
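The arithmetic above can be checked directly. A sketch (the function name is illustrative): a step fails only if every attempt fails, and a pipeline fails if any step fails.

```python
def pipeline_failure_rate(p: float, steps: int, attempts: int = 1) -> float:
    # A step fails only if all of its attempts fail;
    # the pipeline fails if any of its steps fails.
    per_step_failure = p ** attempts
    return 1 - (1 - per_step_failure) ** steps

# 1% per-request failure, 10 steps, no retries: ~9.6% pipeline failure
no_retry = pipeline_failure_rate(0.01, 10)

# Same, with 3 attempts per step: ~0.001% pipeline failure
with_retry = pipeline_failure_rate(0.01, 10, attempts=3)
```

This treats failures as independent, which holds for random blips but not for a sustained outage, where every attempt fails together.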

The naive fix — immediate retry — creates a thundering herd problem: if 1000 clients all hit a rate limit at once and all retry immediately, they all hit the rate limit again simultaneously. Exponential backoff spreads retries out; jitter randomises them further to prevent synchronisation across clients.

SECTION 02

Exponential backoff with jitter

import time
import random

def backoff_delay(attempt: int,
                   base: float = 1.0,
                   max_delay: float = 60.0,
                   jitter: bool = True) -> float:
    # Exponential: 1s, 2s, 4s, 8s, 16s, 32s, 60s (capped)
    delay = min(base * (2 ** attempt), max_delay)
    if jitter:
        # Full jitter: random in [0, delay] — best for reducing thundering herd
        delay = random.uniform(0, delay)
    return delay

# Manual retry loop
def call_with_retry(fn, max_attempts=5, retryable_codes=(429, 500, 503, 529)):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as e:
            status = getattr(e, 'status_code', None)
            if status not in retryable_codes or attempt == max_attempts - 1:
                raise
            delay = backoff_delay(attempt)
            print(f"Attempt {attempt+1} failed ({status}). Retrying in {delay:.1f}s...")
            time.sleep(delay)

# Delay sequence with full jitter (example):
for i in range(6):
    print(f"attempt {i}: delay = {backoff_delay(i):.2f}s")
# attempt 0: delay = 0.43s
# attempt 1: delay = 0.87s
# attempt 2: delay = 2.13s
# attempt 3: delay = 4.99s
# attempt 4: delay = 13.70s
# attempt 5: delay = 31.20s

SECTION 03

Using tenacity

pip install tenacity

from tenacity import (
    retry, stop_after_attempt, wait_exponential_jitter,
    retry_if_exception, before_sleep_log
)
import anthropic
import logging

logger = logging.getLogger(__name__)
client = anthropic.Anthropic()

def is_retryable(exc: Exception) -> bool:
    if isinstance(exc, anthropic.RateLimitError):
        return True
    if isinstance(exc, anthropic.APIStatusError):
        return exc.status_code in (500, 503, 529)
    if isinstance(exc, anthropic.APIConnectionError):
        return True
    return False

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential_jitter(initial=1, max=60),  # exp backoff + jitter
    retry=retry_if_exception(is_retryable),
    before_sleep=before_sleep_log(logger, logging.WARNING),
)
def call_claude(prompt: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

# Usage — automatic retry with logging
result = call_claude("Explain backoff algorithms.")

SECTION 04

Which errors to retry

from anthropic import (
    RateLimitError,        # 429 — too many requests
    APIStatusError,        # wraps HTTP errors
    APIConnectionError,    # network failure
    APITimeoutError,       # request timed out
)

RETRY_STATUS_CODES = {
    429,   # Rate limit — always retry with backoff
    500,   # Internal server error — usually transient
    503,   # Service unavailable — temporary overload
    529,   # Anthropic overload — retry with longer backoff
}

DO_NOT_RETRY = {
    400,   # Bad request — your input is malformed, retrying won't help
    401,   # Unauthorised — fix your API key
    403,   # Forbidden — permissions issue
    404,   # Not found — wrong endpoint
    422,   # Unprocessable entity — invalid parameters
}

def classify_error(exc: Exception) -> str:
    if isinstance(exc, RateLimitError):
        return "retry"  # always retry 429
    if isinstance(exc, APIStatusError):
        code = exc.status_code
        if code in RETRY_STATUS_CODES:
            return "retry"
        if code in DO_NOT_RETRY:
            return "fail"
        # Unknown 5xx — optimistic retry; unknown 4xx is a client error
        return "retry" if code >= 500 else "fail"
    if isinstance(exc, (APIConnectionError, APITimeoutError)):
        return "retry"
    return "fail"  # unknown exception — don't mask bugs with retries

# Special case: respect Retry-After headers
def get_retry_delay(exc: Exception) -> float | None:
    if isinstance(exc, APIStatusError):
        retry_after = exc.response.headers.get("retry-after")
        if retry_after:
            return float(retry_after) + 0.1   # honour the header + small buffer
    return None  # fall back to exponential backoff
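Putting these pieces together: a sketch of a retry loop that prefers the server's hint over its own schedule. The predicate and delay-hint callables are injected (e.g. classify_error and get_retry_delay above) to keep the sketch self-contained.

```python
import random
import time

def call_with_server_hint(fn, is_retryable, server_delay, max_attempts: int = 5):
    """Retry loop that prefers the server's Retry-After hint over local backoff.

    is_retryable(exc) -> bool and server_delay(exc) -> float | None are
    injected, e.g. lambda e: classify_error(e) == "retry" and get_retry_delay.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as e:
            if not is_retryable(e) or attempt == max_attempts - 1:
                raise
            delay = server_delay(e)           # server hint wins when present
            if delay is None:                 # no header: exponential + jitter
                delay = random.uniform(0, min(1.0 * (2 ** attempt), 60.0))
            time.sleep(delay)
```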

SECTION 05

Retry budgets

import time
import random
from dataclasses import dataclass, field

@dataclass
class RetryBudget:
    total_deadline_s: float        # absolute wall-clock deadline for the operation
    max_attempts: int = 5
    _attempts: int = field(default=0, init=False)
    _start: float = field(default_factory=time.time, init=False)

    @property
    def time_remaining(self) -> float:
        return self.total_deadline_s - (time.time() - self._start)

    @property
    def can_retry(self) -> bool:
        return self._attempts < self.max_attempts and self.time_remaining > 0.5

    def record_attempt(self):
        self._attempts += 1

    def next_delay(self) -> float:
        base_delay = min(1.0 * (2 ** self._attempts), 30.0)
        delay = random.uniform(0, base_delay)
        # Never sleep longer than we have remaining (and never negative)
        return max(0.0, min(delay, self.time_remaining - 0.5))

def call_with_budget(fn, deadline_s: float = 30.0):
    budget = RetryBudget(total_deadline_s=deadline_s)
    last_error = None
    while budget.can_retry:
        budget.record_attempt()
        try:
            return fn()
        except Exception as e:
            if classify_error(e) != "retry": raise
            last_error = e
            delay = budget.next_delay()
            if delay > 0: time.sleep(delay)
    raise last_error or TimeoutError("Retry budget exhausted")

SECTION 06

Async retry patterns

import asyncio
import random
from anthropic import AsyncAnthropic

async_client = AsyncAnthropic()

async def async_call_with_retry(prompt: str, max_attempts: int = 5) -> str:
    for attempt in range(max_attempts):
        try:
            response = await async_client.messages.create(
                model="claude-haiku-4-5-20251001",
                max_tokens=512,
                messages=[{"role": "user", "content": prompt}]
            )
            return response.content[0].text
        except Exception as e:
            if classify_error(e) != "retry" or attempt == max_attempts - 1:
                raise
            delay = random.uniform(0, min(1.0 * (2 ** attempt), 30.0))
            await asyncio.sleep(delay)   # non-blocking sleep

# Run multiple requests with retry, in parallel
async def batch_with_retry(prompts: list[str]) -> list[str]:
    tasks = [async_call_with_retry(p) for p in prompts]
    return await asyncio.gather(*tasks, return_exceptions=True)
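One caution with gather: launching hundreds of retrying requests at once can itself trigger the rate limits you are retrying around. A semaphore keeps in-flight requests bounded (the limit of 10 is an assumption; tune it to your quota):

```python
import asyncio

async def bounded_batch(prompts, worker, limit: int = 10):
    # worker is any coroutine function, e.g. async_call_with_retry above
    sem = asyncio.Semaphore(limit)

    async def guarded(prompt):
        async with sem:                      # at most `limit` in flight
            return await worker(prompt)

    return await asyncio.gather(*(guarded(p) for p in prompts),
                                return_exceptions=True)
```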

# tenacity also has async support:
from tenacity import AsyncRetrying, stop_after_attempt, wait_exponential_jitter

async def tenacity_async(prompt: str) -> str:
    async for attempt in AsyncRetrying(stop=stop_after_attempt(5),
                                        wait=wait_exponential_jitter()):
        with attempt:
            resp = await async_client.messages.create(
                model="claude-haiku-4-5-20251001", max_tokens=512,
                messages=[{"role": "user", "content": prompt}])
            return resp.content[0].text

SECTION 07

Gotchas

Don't retry non-idempotent operations blindly. If your call has side effects (sends an email, creates a record, charges a payment), retrying on failure can cause duplicates. Use idempotency keys (Stripe's approach) or check-then-act patterns before retrying operations with side effects. LLM completions are idempotent; actions your agents take downstream may not be.

Honour Retry-After headers. When Anthropic returns a 429, the response headers often include retry-after: 20 — wait at least 20 seconds before retrying. Ignoring this and retrying immediately based on your own backoff schedule will get your requests rejected again, wasting time. Always check for and respect Retry-After before applying your own backoff.

Log retry attempts with context. A retry loop that silently sleeps and retries is a debugging nightmare. Log at WARNING level on each retry: the attempt number, error code, delay, and the first ~100 chars of the prompt. This lets you distinguish "normal rate limiting" from "stuck in an infinite 429 loop due to a quota bug" during incident investigation.

SECTION 08

Retry Strategy Reference

Error Type        HTTP Code       Retry?  Strategy                                                  Max Attempts
Rate limit        429             Yes     Exponential backoff + jitter; respect Retry-After header  5
Server error      500, 529        Yes     Exponential backoff from 1s                               3
Gateway timeout   502, 503, 504   Yes     Linear backoff from 2s                                    3
Auth error        401, 403        No      Alert immediately; token may be expired                   0
Bad request       400             No      Fix request; retrying will always fail                    0
Context too long  400 (specific)  Maybe   Truncate input and retry once                             1

In async high-concurrency applications, add per-request deadline tracking alongside retry logic. A request that has already consumed 80% of its SLA budget should not attempt a full exponential backoff sequence — it should fail fast and let the caller handle degradation. Use asyncio.wait_for with a deadline calculated from the original request timestamp, not from the retry attempt timestamp.
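A sketch of deadline-aware calling (names are illustrative): the timeout handed to asyncio.wait_for shrinks with each retry because it is computed from the original start time, not per attempt.

```python
import asyncio
import time

async def call_with_deadline(worker, sla_s: float = 30.0, max_attempts: int = 3):
    start = time.monotonic()                 # original request timestamp
    for attempt in range(max_attempts):
        remaining = sla_s - (time.monotonic() - start)
        if remaining <= 0:
            raise TimeoutError("SLA budget exhausted")
        try:
            # Timeout is whatever SLA budget remains, not a fresh value
            return await asyncio.wait_for(worker(), timeout=remaining)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise
```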

When using the Anthropic Python SDK, built-in retry logic (max_retries parameter) handles 429 and 529 errors automatically. Disable it only if you need custom logic such as falling back to a different model on the third retry. Set anthropic.Anthropic(max_retries=0) and implement your own wrapper with model-switching using tenacity.
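A sketch of such a model-switching wrapper, written as a plain loop rather than tenacity to stay self-contained. The actual request is injected via do_call (e.g. a closure over anthropic.Anthropic(max_retries=0)), and the fallback order and switch point are assumptions:

```python
import random
import time

# Assumed fallback order: primary model first, a sturdier fallback second
MODELS = ["claude-haiku-4-5-20251001", "claude-sonnet-4-5-20250929"]

def call_with_fallback(prompt: str, do_call, max_attempts: int = 5,
                       switch_after: int = 2, base: float = 1.0) -> str:
    """do_call(model, prompt) performs the request, e.g. through an
    anthropic.Anthropic(max_retries=0) client so the SDK doesn't also retry."""
    for attempt in range(max_attempts):
        # Stay on the primary model for the first `switch_after` attempts
        model = MODELS[0] if attempt < switch_after else MODELS[1]
        try:
            return do_call(model, prompt)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(random.uniform(0, min(base * (2 ** attempt), 60)))
```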

For distributed microservice architectures, implement circuit-breaker logic alongside retry logic. After 5 consecutive failures to the same upstream provider, open the circuit and fail fast for 60 seconds before attempting a probe request. This prevents cascading failures where a slow LLM API causes your entire request queue to back up with retrying requests. Libraries like pybreaker implement this pattern with minimal boilerplate.
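A minimal circuit-breaker sketch with those thresholds (a simplified version of what pybreaker provides; thread safety and metrics are omitted):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                # Open: fail fast without touching the upstream provider
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # half-open: allow one probe request
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                  # any success closes the circuit
        return result
```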