Production Engineering

Circuit Breaker

The circuit breaker pattern stops cascading failures by short-circuiting calls to a failing service. Essential for LLM APIs where provider outages or rate limits can saturate your thread pool.

3 states: Closed → Open → Half-open
Fail fast: no queue pile-up
Self-healing: automatic recovery

SECTION 01

What circuit breakers solve

When an upstream service (e.g. an LLM provider API) degrades or goes down, naive retry logic causes a cascade: threads pile up waiting for timeouts, memory fills with queued requests, and your entire service grinds to a halt even for requests that don't need the failing service. The circuit breaker pattern detects failure and fast-fails subsequent requests immediately, giving the upstream time to recover while keeping your service responsive.

SECTION 02

Three states explained

Closed: normal operation. Calls pass through to the upstream; failures are counted, and once they reach the threshold the breaker trips.

Open: calls fail immediately without touching the upstream. After a recovery timeout elapses, the breaker moves to half-open.

Half-open: a limited number of probe calls are allowed through. If they succeed, the breaker closes and normal traffic resumes; if a probe fails, the breaker reopens and the timeout restarts.

SECTION 03

Python implementation

import time, threading
from enum import Enum
from dataclasses import dataclass, field

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5      # consecutive failures before opening
    recovery_timeout: float = 30.0  # seconds before half-open probe

    _state: State = field(default=State.CLOSED, init=False)
    _failures: int = field(default=0, init=False)
    _last_failure_time: float = field(default=0.0, init=False)
    _lock: threading.Lock = field(default_factory=threading.Lock, init=False)

    def call(self, func, *args, **kwargs):
        with self._lock:
            if self._state == State.OPEN:
                if time.time() - self._last_failure_time > self.recovery_timeout:
                    self._state = State.HALF_OPEN
                else:
                    raise RuntimeError("Circuit breaker OPEN: fast failing")

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        with self._lock:
            self._failures = 0
            self._state = State.CLOSED

    def _on_failure(self):
        with self._lock:
            self._failures += 1
            self._last_failure_time = time.time()
            if self._failures >= self.failure_threshold:
                self._state = State.OPEN

    @property
    def state(self) -> str:
        return self._state.value
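To see the transitions end to end, here is a condensed, self-contained demo. MiniBreaker is a trimmed, lock-free stand-in for the class above (with a shortened recovery_timeout so the half-open probe fires quickly); the thresholds are illustrative.

```python
import time

class MiniBreaker:
    # Condensed, single-threaded version of the CircuitBreaker above,
    # purely for demonstration.
    def __init__(self, failure_threshold=3, recovery_timeout=0.1):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = "closed"
        self.failures = 0
        self.last_failure = 0.0

    def call(self, func):
        if self.state == "open":
            if time.time() - self.last_failure > self.recovery_timeout:
                self.state = "half_open"   # allow one probe through
            else:
                raise RuntimeError("fast fail")
        try:
            result = func()
        except Exception:
            self.failures += 1
            self.last_failure = time.time()
            if self.failures >= self.failure_threshold:
                self.state = "open"
            raise
        self.failures = 0
        self.state = "closed"
        return result

cb = MiniBreaker()

def flaky():
    raise ConnectionError("provider down")

# Three failures trip the breaker: closed -> open.
for _ in range(3):
    try:
        cb.call(flaky)
    except ConnectionError:
        pass
assert cb.state == "open"

# While open, calls fast-fail without touching the provider.
try:
    cb.call(flaky)
except RuntimeError:
    pass

# After recovery_timeout, a probe is allowed; success closes the circuit.
time.sleep(0.15)
cb.call(lambda: "ok")
assert cb.state == "closed"
```
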

SECTION 04

Integrating with LLM API calls

import openai

cb = CircuitBreaker(failure_threshold=3, recovery_timeout=60.0)

def call_llm(prompt: str) -> str:
    def _inner():
        client = openai.OpenAI()
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            timeout=10.0,
        )
        return resp.choices[0].message.content

    try:
        return cb.call(_inner)
    except RuntimeError as e:
        if "OPEN" in str(e):
            # Fallback: cached response, simpler model, or error message
            return "Service temporarily unavailable. Please try again shortly."
        raise

# Usage
response = call_llm("Summarise this document...")
print(f"Circuit state: {cb.state}")

SECTION 05

Thresholds and tuning

The right thresholds depend on your SLA and traffic pattern. As a starting point for LLM APIs: trip on a failure rate of roughly 50% over a 60–120 second window rather than a handful of consecutive failures, hold the circuit open for 30–60 seconds before probing, and require 3–5 successful half-open probes before closing (see the parameter table under Gotchas).

SECTION 06

Monitoring circuit state

import prometheus_client as prom

circuit_state_gauge = prom.Gauge(
    "llm_circuit_breaker_state",
    "Circuit breaker state (0=closed, 1=open, 2=half_open)",
    ["service"],
)

# Update metric on each call
state_map = {"closed": 0, "open": 1, "half_open": 2}
circuit_state_gauge.labels(service="openai").set(state_map[cb.state])

Alert when the circuit stays open for more than 5 minutes: that indicates a prolonged provider outage, not just a transient error.
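A matching Prometheus alerting rule might look like this (a sketch assuming the llm_circuit_breaker_state gauge above is kept up to date; the alert name and severity label are placeholders):

```yaml
groups:
  - name: llm-circuit-breaker
    rules:
      - alert: LLMCircuitBreakerStuckOpen
        # Gauge value 1 == open (see the state_map above).
        expr: llm_circuit_breaker_state{service="openai"} == 1
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "LLM circuit breaker open for over 5 minutes: likely provider outage"
```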

SECTION 07

Gotchas

Circuit breaker configuration for LLM APIs

LLM API circuit breakers require different threshold calibration than circuit breakers for database or microservice calls. LLM APIs have naturally higher latency and more variable response times than typical backend services, making the default failure thresholds designed for fast services too sensitive for LLM API traffic. A circuit breaker configured for LLM APIs typically uses a longer rolling window (60–120 seconds), a higher failure rate threshold (50–70%), and a longer open-to-half-open reset timeout (30–60 seconds) to avoid tripping on normal LLM latency variance and short-lived API hiccups.

Parameter            Default services    LLM APIs             Rationale
Failure threshold    5 consecutive       50% rate in window   LLM errors are less predictable
Rolling window       10–30 s             60–120 s             Higher latency variability
Reset timeout        5–10 s              30–60 s              API recovery takes longer
Half-open probes     1                   3–5                  Verify sustained recovery
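The half-open probe count in the last row can be sketched as a small state machine: stay half-open until several consecutive probes succeed, and reopen on any probe failure. HalfOpenProbes and required_probes are illustrative names, not part of the implementation above:

```python
class HalfOpenProbes:
    """Require several consecutive half-open successes before closing."""
    def __init__(self, required_probes: int = 3):
        self.required_probes = required_probes
        self.state = "half_open"
        self._successes = 0

    def on_result(self, succeeded: bool) -> str:
        if self.state != "half_open":
            return self.state
        if not succeeded:
            self._successes = 0
            self.state = "open"        # any probe failure reopens immediately
        else:
            self._successes += 1
            if self._successes >= self.required_probes:
                self.state = "closed"  # sustained recovery verified
        return self.state

probes = HalfOpenProbes(required_probes=3)
assert probes.on_result(True) == "half_open"   # 1 success: not enough
assert probes.on_result(True) == "half_open"   # 2 successes: not enough
assert probes.on_result(True) == "closed"      # 3rd success closes the circuit
```

A single-probe breaker (like the Section 03 implementation) closes on one lucky response; requiring 3–5 probes filters out a provider that is only intermittently healthy.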

Fallback strategies when the circuit is open determine whether the user experience degrades gracefully or fails completely. Common LLM API fallback patterns include: returning a cached response to a similar previous request, routing to a backup model provider, returning a static canned response with a user-visible explanation, or queuing the request for retry when the circuit closes. The appropriate fallback depends on the application's tolerance for degraded-but-present responses versus accurate-or-nothing responses; most user-facing applications prefer degraded responses to complete failures.
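The tiered fallback described above can be sketched as a simple chain; with_fallbacks, the stub providers, and the dict cache are hypothetical stand-ins, not a library API:

```python
from typing import Callable

def with_fallbacks(prompt: str,
                   primary: Callable[[str], str],
                   secondary: Callable[[str], str],
                   cache: dict[str, str]) -> str:
    """Try primary, then secondary, then cache, then a canned message."""
    for tier in (primary, secondary):
        try:
            result = tier(prompt)
            cache[prompt] = result        # refresh cache on any success
            return result
        except Exception:
            continue                      # circuit open or call failed
    if prompt in cache:
        return cache[prompt]              # stale but present
    return "Service temporarily unavailable. Please try again shortly."

def down(prompt: str) -> str:
    # Simulate a provider whose breaker is open.
    raise ConnectionError("circuit open")

cache = {"summarise": "cached summary"}
assert with_fallbacks("summarise", down, down, cache) == "cached summary"
assert with_fallbacks("new prompt", down, down, cache).startswith("Service temporarily")
```
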

Fallback Patterns, Graceful Degradation, and Service Tiering

Circuit breaker patterns combine with fallback strategies to maintain availability during service degradation. A two-tier fallback (primary → secondary → cached response) ensures graceful degradation: if the primary inference service is down, route requests to a secondary (slower, cheaper) service; if both fail, return cached results from the last successful call. In a production LLM application this might be: primary = latest model (GPT-4), fallback = smaller model (GPT-3.5-turbo), cached = previous response. The circuit breaker detects primary-service issues (error rate > 5% or p99 latency > 2 seconds) and automatically switches to the fallback, verified by health checks every 30 seconds.

For search-based systems, cached results are tagged with their age; returning a cached vector embedding from 5 minutes ago introduces mild staleness (embeddings of new documents are missing) but maintains availability. In Python, the pybreaker library implements the breaker itself: a pybreaker.CircuitBreaker(fail_max=5, reset_timeout=60) instance wraps or decorates service calls, rejecting requests after 5 failures and probing for recovery after 60 seconds, and it is commonly paired with tenacity for retry logic. Multi-region deployments add geographic fallback: US East requests fail over to US West (a different cloud region, under 50 ms of additional latency), critical for maintaining 99.99% uptime SLAs.

Monitoring, Alerting, and Observability for Circuit States

Circuit breaker health requires deep observability: track state transitions (closed → open), failure rates, retry success rates, and response times from each tier. Prometheus metrics (circuit_breaker_state_changes_total, circuit_breaker_calls_total, circuit_breaker_failures_total) enable dashboards showing when each breaker is open, how often, and for how long. Alerts should fire on: (1) a circuit breaker open for more than 5 minutes (service degradation), (2) a rising fallback-service error rate (cascading failure), (3) a dropping cache hit rate (growing staleness). In Kubernetes, circuit breaker metrics can feed HPA decisions: sustained open time triggers pod autoscaling to add primary-service capacity.

Datadog and New Relic provide pre-built circuit breaker dashboards; self-hosted stacks use Prometheus + Grafana. Trace-level observability (via OpenTelemetry or Jaeger) shows which requests hit fallback paths and why: correlation IDs link a user request through primary → open → fallback, revealing the failure cause (timeout vs. service error vs. rate limiting). Real-world deployments log every circuit state change (timestamp, old_state, new_state, trigger_metric) for post-mortems; weekly summaries of circuit breaker events highlight systemic issues that deserve engineering attention.
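The state-change log record mentioned above might be emitted like this (field names are illustrative, not a fixed schema):

```python
import json
import time

def log_state_change(old_state: str, new_state: str,
                     trigger_metric: str, value: float) -> str:
    """Emit one structured JSON line per circuit state transition,
    suitable for post-mortem reconstruction."""
    record = {
        "ts": round(time.time(), 3),
        "old_state": old_state,
        "new_state": new_state,
        "trigger_metric": trigger_metric,   # what tripped the transition
        "value": value,                     # the metric's value at that moment
    }
    return json.dumps(record)

line = log_state_change("closed", "open", "error_rate", 0.62)
parsed = json.loads(line)
assert parsed["new_state"] == "open"
assert parsed["trigger_metric"] == "error_rate"
```
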

Threshold Tuning, Sensitivity Analysis, and Adaptive Configuration

Circuit breaker thresholds must balance responsiveness (detecting failures quickly) against flakiness (avoiding toggling on transient hiccups). Failure-detection criteria vary by use case: request-based (trip after 5 consecutive failures), rate-based (trip if the error rate exceeds 5%), and latency-based (trip if p99 latency exceeds 2× baseline). For latency-sensitive services the p99 threshold should be 1.5–2× baseline under normal load; for best-effort services even 5× is acceptable. The reset timeout (how long the circuit stays open before attempting recovery) is typically 30–60 seconds; too short causes rapid open/close cycling, too long delays recovery.

Empirically, monitoring error-rate and latency distributions (via percentile histograms) reveals appropriate thresholds: if 95th-percentile latency is 200 ms under load, set the p99 threshold to 400–500 ms to avoid false positives. A Kubernetes ConfigMap or environment variables enable per-service threshold tuning without code changes, and A/B testing different thresholds via canary deployments (10% of traffic sees stricter thresholds) reveals operational sweet spots. Machine-learning-driven threshold adaptation, which trains models on historical error and latency data to predict imminent failures, is emerging in advanced observability platforms but carries significant operational overhead for marginal gains over well-tuned static thresholds.
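Deriving a latency trip threshold from observed percentiles, as suggested above, can be sketched as follows (nearest-rank percentile; the function and parameter names are illustrative):

```python
import math

def latency_threshold(samples: list[float], percentile: float = 0.95,
                      headroom: float = 2.0) -> float:
    """Take a high percentile of baseline latency and multiply by a
    headroom factor, so normal variance doesn't trip the breaker."""
    ordered = sorted(samples)
    # Nearest-rank percentile: index of the smallest value covering
    # the requested fraction of samples.
    idx = max(0, math.ceil(percentile * len(ordered)) - 1)
    return ordered[idx] * headroom

# Baseline latencies in milliseconds under normal load.
baseline = [120, 150, 180, 190, 200, 200, 210, 195, 185, 175]
threshold = latency_threshold(baseline, percentile=0.95, headroom=2.0)
assert threshold == 420.0   # p95 of ~210 ms, doubled
```

This matches the guidance above: a p95 around 200 ms yields a trip threshold in the 400–500 ms range.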