Synchronous serving blocks until inference finishes; async serving uses non-blocking I/O to pipeline requests. Choosing the right serving pattern matters enormously for throughput, tail latency, and cost.
When a user sends an LLM request, your server must: receive the request, run inference (the slow part), and return the response. Synchronous serving means the worker thread blocks during inference — simple to reason about but wastes CPU cycles waiting for GPU work to finish. Asynchronous serving uses Python's asyncio or a similar event loop so the same thread can handle other requests while waiting for inference I/O to complete.
For LLM inference specifically, the GPU does most of the work and inference is compute-bound rather than I/O-bound. The main async win comes from batching multiple concurrent requests together, letting the inference engine amortise the fixed overhead of a forward pass over many tokens at once.
```python
from fastapi import FastAPI
from pydantic import BaseModel
import asyncio
import time

app = FastAPI()

class InferRequest(BaseModel):
    prompt: str
    max_tokens: int = 256

# Async endpoint: doesn't block the event loop
@app.post("/generate")
async def generate(req: InferRequest):
    # run_in_executor offloads blocking inference to a thread pool
    # so the event loop stays free for health checks and other requests
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(
        None,  # default ThreadPoolExecutor
        _blocking_generate,
        req.prompt,
        req.max_tokens,
    )
    return {"text": result, "tokens": len(result.split())}

def _blocking_generate(prompt: str, max_tokens: int) -> str:
    # your actual model call here (transformers, vllm, etc.)
    time.sleep(0.5)  # simulate inference
    return f"Response to: {prompt[:50]}"
```
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import asyncio

app = FastAPI()

async def token_stream(prompt: str):
    # Yields tokens as they are produced, reducing perceived time-to-first-token
    words = f"Streaming response to: {prompt}".split()
    for word in words:
        yield f"data: {word}\n\n"
        await asyncio.sleep(0.05)  # simulate per-token delay
    yield "data: [DONE]\n\n"

@app.get("/stream")
async def stream_generate(prompt: str = "Hello"):
    return StreamingResponse(
        token_stream(prompt),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )

# Client-side consumption with httpx:
# async with client.stream("GET", "/stream", params={"prompt": "Hi"}) as r:
#     async for line in r.aiter_lines():
#         if line.startswith("data: ") and line != "data: [DONE]":
#             print(line[6:], end=" ", flush=True)
```
Python's asyncio is designed for I/O-bound work. Model inference (even with GPU) involves Python overhead that holds the GIL. Use these patterns to prevent blocking the event loop:
Offload blocking inference to a dedicated thread pool; `run_in_executor` uses this. Keep the pool size ≤ the number of GPU replicas.

```python
from concurrent.futures import ThreadPoolExecutor
import asyncio

executor = ThreadPoolExecutor(max_workers=4)  # 4 model replicas

async def async_infer(prompt: str) -> str:
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, sync_model_call, prompt)
```
The fundamental tension in LLM serving: batching concurrent requests raises GPU throughput, but each batched request sees higher per-token latency.
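That trade-off, batching for throughput at the cost of per-request speed, can be sketched numerically. All step times below are hypothetical, purely for illustration:

```python
# Hypothetical forward-pass step times, for illustration only.
single_step = 0.025  # s per generated token at batch size 1 (assumed)
batch_step = 0.040   # s per generated token at batch size 8 (assumed)
batch_size = 8

single_throughput = 1 / single_step             # tokens/s serving one user at a time
per_request_rate = 1 / batch_step               # tokens/s each batched user sees
aggregate_throughput = batch_size / batch_step  # tokens/s across the whole batch

# Batching cuts each user's token rate (40 -> 25 tokens/s) but raises
# total throughput 5x (40 -> 200 tokens/s).
print(single_throughput, per_request_rate, aggregate_throughput)
```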
- **Use synchronous endpoints when:** latency matters more than throughput (single-user apps, debugging), you're using a simple framework like Flask, or inference is fast (<100ms) and parallelism isn't needed.
- **Use async endpoints when:** you need to handle many concurrent requests, you're calling remote inference APIs (network-I/O bound), or you want to combine inference with other async operations (database lookups, external APIs).
- **Use streaming when:** the response is long and the user is watching; streaming makes a 10-second response feel responsive immediately. Most chat interfaces expect SSE or WebSocket streaming.
Writing `await slow_sync_function()` inside an `async def` does NOT make it async: the function never yields, so it still blocks the event loop. Always offload blocking calls with `run_in_executor`. When serving behind a buffering proxy like nginx, set `X-Accel-Buffering: no` so streamed tokens reach the client immediately.

Selecting the right serving pattern requires matching the concurrency model to the inference workload characteristics. Synchronous serving with thread pools is simple to implement and debug but becomes a bottleneck when request counts exceed thread-pool capacity. Async serving enables high concurrency without proportionally increasing memory use, but requires the inference library to release the GIL during computation. For GPU-bound LLM inference, the GPU itself serializes requests regardless of the Python concurrency model, making the choice primarily about CPU-side request management and I/O handling.
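A quick way to see the blocking effect: measure how long the event loop is unresponsive while each style runs. This is a minimal sketch; `blocking_work` stands in for a synchronous model call:

```python
import asyncio
import time

def blocking_work():
    time.sleep(0.2)  # stands in for a synchronous model call

async def bad():
    blocking_work()  # blocks the event loop for the full 0.2 s

async def good():
    # offloaded to the default thread pool; the loop stays free
    await asyncio.get_running_loop().run_in_executor(None, blocking_work)

async def loop_stall(coro_fn):
    """Roughly how long the event loop is unresponsive while coro_fn runs."""
    task = asyncio.ensure_future(coro_fn())
    loop = asyncio.get_running_loop()
    start = loop.time()
    await asyncio.sleep(0)  # resumes only once the loop is free again
    stalled = loop.time() - start
    await task
    return stalled

async def main():
    return await loop_stall(bad), await loop_stall(good)

bad_stall, good_stall = asyncio.run(main())
print(f"bad() stalled the loop for {bad_stall:.2f}s, good() for {good_stall:.3f}s")
```

`bad()` stalls the loop for roughly the full 0.2 s sleep; `good()` leaves it free almost immediately.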
| Pattern | Concurrency model | Best for | Pitfall |
|---|---|---|---|
| Sync + thread pool | OS threads | Simple CPU inference | Thread overhead at high QPS |
| Async (asyncio) | Event loop | I/O-bound, streaming | Blocking calls stall loop |
| Async + process pool | Processes + event loop | Multi-GPU, CPU-bound | IPC serialization cost |
| Batching server (vLLM) | Continuous batching | High-throughput GPU serving | Added infrastructure complexity |
Streaming responses with server-sent events (SSE) change the serving pattern requirements significantly. A streaming endpoint must hold a connection open for the full generation duration, which for long outputs can be 10–30 seconds. Synchronous thread-pool implementations quickly exhaust threads under moderate streaming concurrency, making async SSE endpoints the strongly preferred pattern for streaming LLM responses in production.
Connection pool management is a critical consideration for async LLM serving deployments that call external APIs. Each async request to an OpenAI or Anthropic endpoint consumes a TCP connection from the aiohttp or httpx connection pool. Without explicit pool size configuration, the default pool may be too small for high-concurrency scenarios, causing requests to queue waiting for a connection rather than queuing on the upstream API rate limit. Configuring the connector pool size to match the expected concurrency level — typically 50–200 connections for production services — ensures that connection availability does not become the binding constraint on throughput.
Graceful shutdown handling in async LLM services prevents request truncation during deployment updates. When a Kubernetes pod receives a SIGTERM signal, in-flight LLM requests may be mid-generation with tokens actively streaming to clients. A well-implemented shutdown handler stops accepting new requests immediately but allows in-flight requests to complete up to a configurable drain timeout (typically 30–60 seconds). Without this graceful drain, rolling deployments consistently produce a burst of truncated streaming responses for users whose requests were in-flight during the pod shutdown, degrading user experience during routine deploys.
Backpressure implementation prevents async LLM services from accepting more requests than the inference backend can handle, which leads to unbounded queue growth and eventual out-of-memory crashes. A semaphore limiting the number of concurrent inference calls (typically set to the batch size capacity of the GPU) causes excess requests to receive 503 responses immediately rather than queuing indefinitely. Clients implementing exponential backoff retry logic handle 503 responses gracefully, distributing load more evenly over time. This backpressure approach degrades gracefully under traffic spikes rather than accumulating a backlog that degrades latency for all users until the backlog drains.
Request timeout configuration requires separate tuning for the connection timeout, the read timeout, and the total request timeout. LLM API connections typically establish quickly (100–500ms) but the first token may take 500ms–3s depending on server load and prompt length, and the full streaming response for a long generation may take 30–60 seconds. Setting a tight total timeout without distinguishing time-to-first-token from streaming duration causes legitimate long-generation requests to time out unnecessarily. Provider-specific timeout guidance and empirical P99 latency measurements on the target workload should inform timeout configuration rather than generic defaults.
Testing async LLM service endpoints requires async test clients and careful handling of streaming response assertions. The httpx.AsyncClient in combination with pytest-asyncio enables writing async test cases that send requests and validate streaming responses without blocking the event loop. Load testing async LLM endpoints with tools like locust or k6 reveals whether the async serving implementation actually achieves the expected concurrency benefits, as subtle blocking calls — synchronous file I/O, blocking third-party libraries, or CPU-bound processing in the event loop — can serialize requests despite async syntax.