Synchronous serving blocks until inference finishes; async serving uses non-blocking I/O to pipeline requests. Choosing the right serving pattern matters enormously for throughput, tail latency, and cost.
When a user sends an LLM request, your server must: receive the request, run inference (the slow part), and return the response. Synchronous serving means the worker thread blocks during inference — simple to reason about but wastes CPU cycles waiting for GPU work to finish. Asynchronous serving uses Python's asyncio or a similar event loop so the same thread can handle other requests while waiting for inference I/O to complete.
For LLM inference specifically, the GPU does most of the work and inference is compute-bound rather than I/O-bound. The main async win comes from batching multiple concurrent requests together, letting the inference engine amortise the fixed overhead of a forward pass over many tokens at once.
```python
from fastapi import FastAPI
from pydantic import BaseModel
import asyncio
import time

app = FastAPI()

class InferRequest(BaseModel):
    prompt: str
    max_tokens: int = 256

# Async endpoint: doesn't block the event loop
@app.post("/generate")
async def generate(req: InferRequest):
    # run_in_executor offloads blocking inference to a thread pool
    # so the event loop stays free for health checks and other requests
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(
        None,  # default ThreadPoolExecutor
        _blocking_generate,
        req.prompt,
        req.max_tokens,
    )
    return {"text": result, "tokens": len(result.split())}

def _blocking_generate(prompt: str, max_tokens: int) -> str:
    # your actual model call here (transformers, vllm, etc.)
    time.sleep(0.5)  # simulate inference
    return f"Response to: {prompt[:50]}"
```
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import asyncio

app = FastAPI()

async def token_stream(prompt: str):
    # Yields tokens as they are produced, reducing perceived time-to-first-token
    words = f"Streaming response to: {prompt}".split()
    for word in words:
        yield f"data: {word}\n\n"
        await asyncio.sleep(0.05)  # simulate per-token delay
    yield "data: [DONE]\n\n"

@app.get("/stream")
async def stream_generate(prompt: str = "Hello"):
    return StreamingResponse(
        token_stream(prompt),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )

# Client-side consumption with httpx:
# async with client.stream("GET", "/stream", params={"prompt": "Hi"}) as r:
#     async for line in r.aiter_lines():
#         if line.startswith("data: ") and line != "data: [DONE]":
#             print(line[6:], end=" ", flush=True)
```
Python's asyncio is designed for I/O-bound work. Model inference (even with GPU) involves Python overhead that holds the GIL. Use these patterns to prevent blocking the event loop:
Offload blocking inference to a dedicated thread pool; `run_in_executor` uses this. Keep the pool size ≤ the number of GPU replicas.

```python
from concurrent.futures import ThreadPoolExecutor
import asyncio

executor = ThreadPoolExecutor(max_workers=4)  # 4 model replicas

async def async_infer(prompt: str) -> str:
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, sync_model_call, prompt)
```
The fundamental tension in LLM serving: batching concurrent requests raises GPU throughput, but each batched request sees higher per-token latency.
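That trade-off, batching for throughput at the cost of per-request speed, can be sketched numerically. All step times below are hypothetical, purely for illustration:

```python
# Hypothetical forward-pass step times, for illustration only.
single_step = 0.025  # s per generated token at batch size 1 (assumed)
batch_step = 0.040   # s per generated token at batch size 8 (assumed)
batch_size = 8

single_throughput = 1 / single_step             # tokens/s serving one user at a time
per_request_rate = 1 / batch_step               # tokens/s each batched user sees
aggregate_throughput = batch_size / batch_step  # tokens/s across the whole batch

# Batching cuts each user's token rate (40 -> 25 tokens/s) but raises
# total throughput 5x (40 -> 200 tokens/s).
print(single_throughput, per_request_rate, aggregate_throughput)
```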
- **Use synchronous endpoints when:** latency matters more than throughput (single-user apps, debugging), you're using a simple framework like Flask, or inference is fast (<100ms) and parallelism isn't needed.
- **Use async endpoints when:** you need to handle many concurrent requests, you're calling remote inference APIs (network-I/O bound), or you want to combine inference with other async operations (database lookups, external APIs).
- **Use streaming when:** the response is long and the user is watching; streaming makes a 10-second response feel responsive immediately. Most chat interfaces expect SSE or WebSocket streaming.
Writing `await slow_sync_function()` inside an `async def` does NOT make it async: the function never yields, so it still blocks the event loop. Always offload blocking calls with `run_in_executor`. When serving behind a buffering proxy like nginx, set `X-Accel-Buffering: no` so streamed tokens reach the client immediately.

Selecting the right serving pattern requires matching the concurrency model to the inference workload characteristics. Synchronous serving with thread pools is simple to implement and debug but becomes a bottleneck when request counts exceed thread-pool capacity. Async serving enables high concurrency without proportionally increasing memory use, but requires the inference library to release the GIL during computation. For GPU-bound LLM inference, the GPU itself serializes requests regardless of the Python concurrency model, making the choice primarily about CPU-side request management and I/O handling.
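A quick way to see the blocking effect: measure how long the event loop is unresponsive while each style runs. This is a minimal sketch; `blocking_work` stands in for a synchronous model call:

```python
import asyncio
import time

def blocking_work():
    time.sleep(0.2)  # stands in for a synchronous model call

async def bad():
    blocking_work()  # blocks the event loop for the full 0.2 s

async def good():
    # offloaded to the default thread pool; the loop stays free
    await asyncio.get_running_loop().run_in_executor(None, blocking_work)

async def loop_stall(coro_fn):
    """Roughly how long the event loop is unresponsive while coro_fn runs."""
    task = asyncio.ensure_future(coro_fn())
    loop = asyncio.get_running_loop()
    start = loop.time()
    await asyncio.sleep(0)  # resumes only once the loop is free again
    stalled = loop.time() - start
    await task
    return stalled

async def main():
    return await loop_stall(bad), await loop_stall(good)

bad_stall, good_stall = asyncio.run(main())
print(f"bad() stalled the loop for {bad_stall:.2f}s, good() for {good_stall:.3f}s")
```

`bad()` stalls the loop for roughly the full 0.2 s sleep; `good()` leaves it free almost immediately.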
| Pattern | Concurrency model | Best for | Pitfall |
|---|---|---|---|
| Sync + thread pool | OS threads | Simple CPU inference | Thread overhead at high QPS |
| Async (asyncio) | Event loop | I/O-bound, streaming | Blocking calls stall loop |
| Async + process pool | Processes + event loop | Multi-GPU, CPU-bound | IPC serialization cost |
| Batching server (vLLM) | Continuous batching | High-throughput GPU serving | Added infrastructure complexity |
Streaming responses with server-sent events (SSE) change the serving pattern requirements significantly. A streaming endpoint must hold a connection open for the full generation duration, which for long outputs can be 10–30 seconds. Synchronous thread-pool implementations quickly exhaust threads under moderate streaming concurrency, making async SSE endpoints the strongly preferred pattern for streaming LLM responses in production.
Connection pool management is a critical consideration for async LLM serving deployments that call external APIs. Each async request to an OpenAI or Anthropic endpoint consumes a TCP connection from the aiohttp or httpx connection pool. Without explicit pool size configuration, the default pool may be too small for high-concurrency scenarios, causing requests to queue waiting for a connection rather than queuing on the upstream API rate limit. Configuring the connector pool size to match the expected concurrency level — typically 50–200 connections for production services — ensures that connection availability does not become the binding constraint on throughput.
Graceful shutdown handling in async LLM services prevents request truncation during deployment updates. When a Kubernetes pod receives a SIGTERM signal, in-flight LLM requests may be mid-generation with tokens actively streaming to clients. A well-implemented shutdown handler stops accepting new requests immediately but allows in-flight requests to complete up to a configurable drain timeout (typically 30–60 seconds). Without this graceful drain, rolling deployments consistently produce a burst of truncated streaming responses for users whose requests were in-flight during the pod shutdown, degrading user experience during routine deploys.
Backpressure implementation prevents async LLM services from accepting more requests than the inference backend can handle, which leads to unbounded queue growth and eventual out-of-memory crashes. A semaphore limiting the number of concurrent inference calls (typically set to the batch size capacity of the GPU) causes excess requests to receive 503 responses immediately rather than queuing indefinitely. Clients implementing exponential backoff retry logic handle 503 responses gracefully, distributing load more evenly over time. This backpressure approach degrades gracefully under traffic spikes rather than accumulating a backlog that degrades latency for all users until the backlog drains.
Request timeout configuration requires separate tuning for the connection timeout, the read timeout, and the total request timeout. LLM API connections typically establish quickly (100–500ms) but the first token may take 500ms–3s depending on server load and prompt length, and the full streaming response for a long generation may take 30–60 seconds. Setting a tight total timeout without distinguishing time-to-first-token from streaming duration causes legitimate long-generation requests to time out unnecessarily. Provider-specific timeout guidance and empirical P99 latency measurements on the target workload should inform timeout configuration rather than generic defaults.
Testing async LLM service endpoints requires async test clients and careful handling of streaming response assertions. The httpx.AsyncClient in combination with pytest-asyncio enables writing async test cases that send requests and validate streaming responses without blocking the event loop. Load testing async LLM endpoints with tools like locust or k6 reveals whether the async serving implementation actually achieves the expected concurrency benefits, as subtle blocking calls — synchronous file I/O, blocking third-party libraries, or CPU-bound processing in the event loop — can serialize requests despite async syntax.