Production Engineering

Circuit Breaker

The circuit breaker pattern stops cascading failures by short-circuiting calls to a failing service. Essential for LLM APIs where provider outages or rate limits can saturate your thread pool.

3 states: Closed → Open → Half-open
Fail fast: no queue pile-up
Self-healing: automatic recovery

SECTION 01

What circuit breakers solve

When an upstream service (e.g. an LLM provider API) degrades or goes down, naive retry logic causes a cascade: threads pile up waiting for timeouts, memory fills with queued requests, and your entire service grinds to a halt even for requests that don't need the failing service. The circuit breaker pattern detects failure and fast-fails subsequent requests immediately, giving the upstream time to recover while keeping your service responsive.

SECTION 02

Three states explained

Closed: normal operation. Calls pass through to the upstream; failures are counted, and once they reach the threshold the breaker trips.

Open: calls fail immediately without touching the upstream. After a recovery timeout elapses, the breaker moves to half-open.

Half-open: a limited number of probe calls are allowed through. If they succeed, the breaker closes and normal traffic resumes; if a probe fails, the breaker reopens and the timeout restarts.

SECTION 03

Python implementation

import time, threading
from enum import Enum
from dataclasses import dataclass, field

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5      # consecutive failures before opening
    recovery_timeout: float = 30.0  # seconds before half-open probe

    _state: State = field(default=State.CLOSED, init=False)
    _failures: int = field(default=0, init=False)
    _last_failure_time: float = field(default=0.0, init=False)
    _lock: threading.Lock = field(default_factory=threading.Lock, init=False)

    def call(self, func, *args, **kwargs):
        with self._lock:
            if self._state == State.OPEN:
                if time.time() - self._last_failure_time > self.recovery_timeout:
                    self._state = State.HALF_OPEN
                else:
                    raise RuntimeError("Circuit breaker OPEN: fast failing")

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        with self._lock:
            self._failures = 0
            self._state = State.CLOSED

    def _on_failure(self):
        with self._lock:
            self._failures += 1
            self._last_failure_time = time.time()
            if self._failures >= self.failure_threshold:
                self._state = State.OPEN

    @property
    def state(self) -> str:
        return self._state.value
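To see the transitions end to end, here is a condensed, self-contained demo. MiniBreaker is a trimmed, lock-free stand-in for the class above (with a shortened recovery_timeout so the half-open probe fires quickly); the thresholds are illustrative.

```python
import time

class MiniBreaker:
    # Condensed, single-threaded version of the CircuitBreaker above,
    # purely for demonstration.
    def __init__(self, failure_threshold=3, recovery_timeout=0.1):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = "closed"
        self.failures = 0
        self.last_failure = 0.0

    def call(self, func):
        if self.state == "open":
            if time.time() - self.last_failure > self.recovery_timeout:
                self.state = "half_open"   # allow one probe through
            else:
                raise RuntimeError("fast fail")
        try:
            result = func()
        except Exception:
            self.failures += 1
            self.last_failure = time.time()
            if self.failures >= self.failure_threshold:
                self.state = "open"
            raise
        self.failures = 0
        self.state = "closed"
        return result

cb = MiniBreaker()

def flaky():
    raise ConnectionError("provider down")

# Three failures trip the breaker: closed -> open.
for _ in range(3):
    try:
        cb.call(flaky)
    except ConnectionError:
        pass
assert cb.state == "open"

# While open, calls fast-fail without touching the provider.
try:
    cb.call(flaky)
except RuntimeError:
    pass

# After recovery_timeout, a probe is allowed; success closes the circuit.
time.sleep(0.15)
cb.call(lambda: "ok")
assert cb.state == "closed"
```
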

SECTION 04

Integrating with LLM API calls

import openai

cb = CircuitBreaker(failure_threshold=3, recovery_timeout=60.0)

def call_llm(prompt: str) -> str:
    def _inner():
        client = openai.OpenAI()
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            timeout=10.0,
        )
        return resp.choices[0].message.content

    try:
        return cb.call(_inner)
    except RuntimeError as e:
        if "OPEN" in str(e):
            # Fallback: cached response, simpler model, or error message
            return "Service temporarily unavailable. Please try again shortly."
        raise

# Usage
response = call_llm("Summarise this document...")
print(f"Circuit state: {cb.state}")

SECTION 05

Thresholds and tuning

The right thresholds depend on your SLA and traffic pattern. As a starting point for LLM APIs: trip on a failure rate of roughly 50% over a 60–120 second window rather than a handful of consecutive failures, hold the circuit open for 30–60 seconds before probing, and require 3–5 successful half-open probes before closing (see the parameter table under Gotchas).

SECTION 06

Monitoring circuit state

import prometheus_client as prom

circuit_state_gauge = prom.Gauge(
    "llm_circuit_breaker_state",
    "Circuit breaker state (0=closed, 1=open, 2=half_open)",
    ["service"],
)

# Update metric on each call
state_map = {"closed": 0, "open": 1, "half_open": 2}
circuit_state_gauge.labels(service="openai").set(state_map[cb.state])

Alert when the circuit stays open for more than 5 minutes: that indicates a prolonged provider outage, not just a transient error.
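A matching Prometheus alerting rule might look like this (a sketch assuming the llm_circuit_breaker_state gauge above is kept up to date; the alert name and severity label are placeholders):

```yaml
groups:
  - name: llm-circuit-breaker
    rules:
      - alert: LLMCircuitBreakerStuckOpen
        # Gauge value 1 == open (see the state_map above).
        expr: llm_circuit_breaker_state{service="openai"} == 1
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "LLM circuit breaker open for over 5 minutes: likely provider outage"
```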

SECTION 07

Gotchas

Circuit breaker configuration for LLM APIs

LLM API circuit breakers require different threshold calibration than circuit breakers for database or microservice calls. LLM APIs have naturally higher latency and more variable response times than typical backend services, making the default failure thresholds designed for fast services too sensitive for LLM API traffic. A circuit breaker configured for LLM APIs typically uses a longer rolling window (60–120 seconds), a higher failure rate threshold (50–70%), and a longer open-to-half-open reset timeout (30–60 seconds) to avoid tripping on normal LLM latency variance and short-lived API hiccups.

Parameter            Default services    LLM APIs             Rationale
Failure threshold    5 consecutive       50% rate in window   LLM errors are less predictable
Rolling window       10–30 s             60–120 s             Higher latency variability
Reset timeout        5–10 s              30–60 s              API recovery takes longer
Half-open probes     1                   3–5                  Verify sustained recovery
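The half-open probe count in the last row can be sketched as a small state machine: stay half-open until several consecutive probes succeed, and reopen on any probe failure. HalfOpenProbes and required_probes are illustrative names, not part of the implementation above:

```python
class HalfOpenProbes:
    """Require several consecutive half-open successes before closing."""
    def __init__(self, required_probes: int = 3):
        self.required_probes = required_probes
        self.state = "half_open"
        self._successes = 0

    def on_result(self, succeeded: bool) -> str:
        if self.state != "half_open":
            return self.state
        if not succeeded:
            self._successes = 0
            self.state = "open"        # any probe failure reopens immediately
        else:
            self._successes += 1
            if self._successes >= self.required_probes:
                self.state = "closed"  # sustained recovery verified
        return self.state

probes = HalfOpenProbes(required_probes=3)
assert probes.on_result(True) == "half_open"   # 1 success: not enough
assert probes.on_result(True) == "half_open"   # 2 successes: not enough
assert probes.on_result(True) == "closed"      # 3rd success closes the circuit
```

A single-probe breaker (like the Section 03 implementation) closes on one lucky response; requiring 3–5 probes filters out a provider that is only intermittently healthy.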

Fallback strategies when the circuit is open determine whether the user experience degrades gracefully or fails completely. Common LLM API fallback patterns include: returning a cached response to a similar previous request, routing to a backup model provider, returning a static canned response with a user-visible explanation, or queuing the request for retry when the circuit closes. The appropriate fallback depends on the application's tolerance for degraded-but-present responses versus accurate-or-nothing responses; most user-facing applications prefer degraded responses to complete failures.
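The tiered fallback described above can be sketched as a simple chain; with_fallbacks, the stub providers, and the dict cache are hypothetical stand-ins, not a library API:

```python
from typing import Callable

def with_fallbacks(prompt: str,
                   primary: Callable[[str], str],
                   secondary: Callable[[str], str],
                   cache: dict[str, str]) -> str:
    """Try primary, then secondary, then cache, then a canned message."""
    for tier in (primary, secondary):
        try:
            result = tier(prompt)
            cache[prompt] = result        # refresh cache on any success
            return result
        except Exception:
            continue                      # circuit open or call failed
    if prompt in cache:
        return cache[prompt]              # stale but present
    return "Service temporarily unavailable. Please try again shortly."

def down(prompt: str) -> str:
    # Simulate a provider whose breaker is open.
    raise ConnectionError("circuit open")

cache = {"summarise": "cached summary"}
assert with_fallbacks("summarise", down, down, cache) == "cached summary"
assert with_fallbacks("new prompt", down, down, cache).startswith("Service temporarily")
```
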

Fallback Patterns, Graceful Degradation, and Service Tiering

Circuit breaker patterns combine with fallback strategies to maintain availability during service degradation. A two-tier fallback (primary → secondary → cached response) ensures graceful degradation: if the primary inference service is down, route requests to a secondary (slower, cheaper) service; if both fail, return cached results from the last successful call. In a production LLM application this might be: primary = latest model (GPT-4), fallback = smaller model (GPT-3.5-turbo), cached = previous response. The circuit breaker detects primary-service issues (error rate > 5% or p99 latency > 2 seconds) and automatically switches to the fallback, verified by health checks every 30 seconds.

For search-based systems, cached results are tagged with their age; returning a cached vector embedding from 5 minutes ago introduces mild staleness (embeddings of new documents are missing) but maintains availability. In Python, the pybreaker library implements the breaker itself: a pybreaker.CircuitBreaker(fail_max=5, reset_timeout=60) instance wraps or decorates service calls, rejecting requests after 5 failures and probing for recovery after 60 seconds, and it is commonly paired with tenacity for retry logic. Multi-region deployments add geographic fallback: US East requests fail over to US West (a different cloud region, under 50 ms of additional latency), critical for maintaining 99.99% uptime SLAs.

Monitoring, Alerting, and Observability for Circuit States

Circuit breaker health requires deep observability: track state transitions (closed → open), failure rates, retry success rates, and response times from each tier. Prometheus metrics (circuit_breaker_state_changes_total, circuit_breaker_calls_total, circuit_breaker_failures_total) enable dashboards showing when each breaker is open, how often, and for how long. Alerts should fire on: (1) a circuit breaker open for more than 5 minutes (service degradation), (2) a rising fallback-service error rate (cascading failure), (3) a dropping cache hit rate (growing staleness). In Kubernetes, circuit breaker metrics can feed HPA decisions: sustained open time triggers pod autoscaling to add primary-service capacity.

Datadog and New Relic provide pre-built circuit breaker dashboards; self-hosted stacks use Prometheus + Grafana. Trace-level observability (via OpenTelemetry or Jaeger) shows which requests hit fallback paths and why: correlation IDs link a user request through primary → open → fallback, revealing the failure cause (timeout vs. service error vs. rate limiting). Real-world deployments log every circuit state change (timestamp, old_state, new_state, trigger_metric) for post-mortems; weekly summaries of circuit breaker events highlight systemic issues that deserve engineering attention.
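The state-change log record mentioned above might be emitted like this (field names are illustrative, not a fixed schema):

```python
import json
import time

def log_state_change(old_state: str, new_state: str,
                     trigger_metric: str, value: float) -> str:
    """Emit one structured JSON line per circuit state transition,
    suitable for post-mortem reconstruction."""
    record = {
        "ts": round(time.time(), 3),
        "old_state": old_state,
        "new_state": new_state,
        "trigger_metric": trigger_metric,   # what tripped the transition
        "value": value,                     # the metric's value at that moment
    }
    return json.dumps(record)

line = log_state_change("closed", "open", "error_rate", 0.62)
parsed = json.loads(line)
assert parsed["new_state"] == "open"
assert parsed["trigger_metric"] == "error_rate"
```
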

Threshold Tuning, Sensitivity Analysis, and Adaptive Configuration

Circuit breaker thresholds must balance responsiveness (detecting failures quickly) against flakiness (avoiding toggling on transient hiccups). Failure-detection criteria vary by use case: request-based (trip after 5 consecutive failures), rate-based (trip if the error rate exceeds 5%), and latency-based (trip if p99 latency exceeds 2× baseline). For latency-sensitive services the p99 threshold should be 1.5–2× baseline under normal load; for best-effort services even 5× is acceptable. The reset timeout (how long the circuit stays open before attempting recovery) is typically 30–60 seconds; too short causes rapid open/close cycling, too long delays recovery.

Empirically, monitoring error-rate and latency distributions (via percentile histograms) reveals appropriate thresholds: if 95th-percentile latency is 200 ms under load, set the p99 threshold to 400–500 ms to avoid false positives. A Kubernetes ConfigMap or environment variables enable per-service threshold tuning without code changes, and A/B testing different thresholds via canary deployments (10% of traffic sees stricter thresholds) reveals operational sweet spots. Machine-learning-driven threshold adaptation, which trains models on historical error and latency data to predict imminent failures, is emerging in advanced observability platforms but carries significant operational overhead for marginal gains over well-tuned static thresholds.
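Deriving a latency trip threshold from observed percentiles, as suggested above, can be sketched as follows (nearest-rank percentile; the function and parameter names are illustrative):

```python
import math

def latency_threshold(samples: list[float], percentile: float = 0.95,
                      headroom: float = 2.0) -> float:
    """Take a high percentile of baseline latency and multiply by a
    headroom factor, so normal variance doesn't trip the breaker."""
    ordered = sorted(samples)
    # Nearest-rank percentile: index of the smallest value covering
    # the requested fraction of samples.
    idx = max(0, math.ceil(percentile * len(ordered)) - 1)
    return ordered[idx] * headroom

# Baseline latencies in milliseconds under normal load.
baseline = [120, 150, 180, 190, 200, 200, 210, 195, 185, 175]
threshold = latency_threshold(baseline, percentile=0.95, headroom=2.0)
assert threshold == 420.0   # p95 of ~210 ms, doubled
```

This matches the guidance above: a p95 around 200 ms yields a trip threshold in the 400–500 ms range.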