FastAPI

Async Python REST framework for wrapping LLMs in production endpoints. Built-in OpenAPI docs, Pydantic validation, and asyncio support make it the standard for LLM microservices.

Async native: uvicorn + asyncio
Auto docs: OpenAPI + Swagger
Pydantic: request/response validation

SECTION 01

Why FastAPI for LLM APIs

FastAPI (Sebastián Ramírez, 2018) has become the dominant framework for building LLM-backed APIs in Python. Its key advantages for LLM workloads: (1) Async-native — built on Starlette/asyncio, handles many concurrent streaming connections efficiently; (2) Pydantic integration — automatic request/response validation and serialisation with clear error messages; (3) Auto-generated docs — OpenAPI + Swagger UI out of the box, essential when your LLM API is consumed by other teams; (4) Type hints — makes the codebase maintainable as it grows.

SECTION 02

Basic LLM endpoint

from fastapi import FastAPI
from pydantic import BaseModel, Field
import openai
import uvicorn

app = FastAPI(title="LLM API", version="1.0")
client = openai.AsyncOpenAI()  # async client

class GenerateRequest(BaseModel):
    prompt: str = Field(..., min_length=1, max_length=4000, description="User prompt")
    model: str = Field(default="gpt-4o-mini", description="Model to use")
    max_tokens: int = Field(default=512, ge=1, le=4096)
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)

class GenerateResponse(BaseModel):
    text: str
    model: str
    prompt_tokens: int
    completion_tokens: int

@app.post("/generate", response_model=GenerateResponse)
async def generate(req: GenerateRequest):
    resp = await client.chat.completions.create(
        model=req.model,
        messages=[{"role": "user", "content": req.prompt}],
        max_tokens=req.max_tokens,
        temperature=req.temperature,
    )
    return GenerateResponse(
        text=resp.choices[0].message.content,
        model=resp.model,
        prompt_tokens=resp.usage.prompt_tokens,
        completion_tokens=resp.usage.completion_tokens,
    )

@app.get("/health")
async def health():
    return {"status": "ok"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
# Auto-generated docs: http://localhost:8000/docs

SECTION 03

Streaming responses

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import openai

app = FastAPI()
client = openai.AsyncOpenAI()

async def token_generator(prompt: str, model: str):
    stream = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        stream=True,
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield f"data: {delta}\n\n"
    yield "data: [DONE]\n\n"

@app.get("/stream")
async def stream_generate(prompt: str, model: str = "gpt-4o-mini"):
    return StreamingResponse(
        token_generator(prompt, model),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",      # disable Nginx buffering
            "Connection": "keep-alive",
        },
    )

SECTION 04

Request validation with Pydantic

from fastapi import FastAPI
from pydantic import BaseModel, Field, field_validator
from typing import Literal

app = FastAPI()

class ChatRequest(BaseModel):
    messages: list[dict] = Field(..., min_length=1)
    model: Literal["gpt-4o-mini", "gpt-4o", "claude-3-haiku-20240307"]
    max_tokens: int = Field(default=256, ge=1, le=4096)
    temperature: float = 0.7
    user_id: str = Field(..., pattern=r"^[a-zA-Z0-9_-]{3,64}$")

    @field_validator("messages")
    @classmethod
    def validate_messages(cls, messages):
        for msg in messages:
            if "role" not in msg or "content" not in msg:
                raise ValueError("Each message must have 'role' and 'content'")
            if msg["role"] not in ("user", "assistant", "system"):
                raise ValueError(f"Invalid role: {msg['role']}")
        return messages

# FastAPI automatically returns 422 with detailed errors if validation fails
@app.post("/chat")
async def chat(req: ChatRequest):
    # req is guaranteed to be valid here
    ...
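The validator can also be exercised directly, outside the request cycle. A minimal sketch assuming Pydantic v2, with the model trimmed to just the messages field for brevity:

```python
from pydantic import BaseModel, Field, ValidationError, field_validator

class ChatRequest(BaseModel):
    messages: list[dict] = Field(..., min_length=1)

    @field_validator("messages")
    @classmethod
    def validate_messages(cls, messages):
        for msg in messages:
            if "role" not in msg or "content" not in msg:
                raise ValueError("Each message must have 'role' and 'content'")
            if msg["role"] not in ("user", "assistant", "system"):
                raise ValueError(f"Invalid role: {msg['role']}")
        return messages

try:
    ChatRequest(messages=[{"role": "tool", "content": "hi"}])
except ValidationError as err:
    # FastAPI turns exactly this error into the 422 response body.
    print(err.errors()[0]["msg"])
```

The same ValidationError that prints here is what FastAPI serialises into the 422 payload, so unit-testing validators this way covers the API's error behaviour too.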

SECTION 05

Background tasks and queues

from fastapi import FastAPI, BackgroundTasks
import openai
import uuid

app = FastAPI()
job_results = {}  # In production: use Redis

async def run_expensive_generation(job_id: str, prompt: str):
    client = openai.AsyncOpenAI()
    resp = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=2048,
    )
    job_results[job_id] = resp.choices[0].message.content

@app.post("/generate/async")
async def submit_job(prompt: str, background_tasks: BackgroundTasks):
    job_id = str(uuid.uuid4())
    job_results[job_id] = None  # pending
    background_tasks.add_task(run_expensive_generation, job_id, prompt)
    return {"job_id": job_id, "status": "queued"}

@app.get("/generate/async/{job_id}")
async def get_result(job_id: str):
    if job_id not in job_results:
        return {"status": "not_found"}
    result = job_results[job_id]
    if result is None:
        return {"status": "pending"}
    return {"status": "complete", "text": result}

SECTION 06

Middleware and authentication

from fastapi import FastAPI, Depends, HTTPException, status
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://yourdomain.com"],
    allow_methods=["*"],
    allow_headers=["*"],
)

bearer = HTTPBearer()
VALID_TOKENS = {"your-secret-token-here"}

async def verify_token(creds: HTTPAuthorizationCredentials = Depends(bearer)):
    if creds.credentials not in VALID_TOKENS:
        raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED,
                            detail="Invalid token")
    return creds.credentials

@app.post("/generate", dependencies=[Depends(verify_token)])
async def generate(req: GenerateRequest):
    # Only reached if token is valid
    ...

SECTION 07

FastAPI LLM service patterns

FastAPI is the most popular Python framework for building LLM API services due to its automatic OpenAPI documentation, Pydantic validation, async support, and high performance. Combining FastAPI with async LLM clients enables non-blocking request handling that maximizes throughput when serving concurrent users.

Pattern | Implementation | Throughput | Use case
--- | --- | --- | ---
Sync endpoint | def endpoint() | Low (blocks) | Simple prototypes
Async endpoint | async def endpoint() | High | Production services
Streaming response | StreamingResponse + generator | High + low TTFT | Chat UIs
Background tasks | BackgroundTasks | High | Async processing
WebSocket | websocket endpoint | Highest | Bidirectional streaming

Streaming LLM responses from FastAPI uses StreamingResponse with an async generator that yields token chunks as they arrive from the LLM API. The generator pattern enables the server to forward tokens to the client as soon as they arrive without buffering the full response, minimizing the client-perceived time to first token. Server-sent event (SSE) format — prefixing each chunk with "data: " and ending with double newline — is compatible with browser EventSource APIs and most streaming LLM client libraries.

Rate limiting and request queuing are critical for production FastAPI LLM services. Without rate limiting, a burst of concurrent requests can overwhelm LLM API rate limits and cause 429 errors that cascade into client-visible failures. Implementing token bucket rate limiting at the application layer (using libraries like slowapi) controls request admission. A background task queue (using Redis + ARQ or Celery) decouples request acceptance from LLM processing, allowing the API to accept requests at burst speed while processing them at a sustainable rate.

Pydantic models in FastAPI serve as both request validation schemas and API documentation, automatically generating the OpenAPI spec that powers the /docs UI. For LLM service endpoints, defining a strict Pydantic model for the request body — with field validation for model name, temperature range, max tokens, and message structure — prevents malformed requests from reaching the LLM API and produces clear error messages for API consumers. Response models similarly validate that the LLM service's output conforms to the expected schema before returning it to the client.

Dependency injection in FastAPI enables clean separation of LLM client initialization from endpoint logic. A dependency function that initializes and caches the LLM client (OpenAI, Anthropic, or a custom wrapper) is injected into endpoint functions via FastAPI's Depends mechanism. This pattern makes unit testing straightforward — tests can inject mock LLM clients instead of real ones — and ensures that configuration changes (API key rotation, model version updates, timeout adjustments) are made in one place rather than scattered across multiple endpoint functions.

Middleware in FastAPI LLM services handles cross-cutting concerns like request logging, authentication, and usage tracking that should apply to all endpoints uniformly. Custom middleware intercepts every request and response, logging the request ID, latency, token counts, and user identifier to a structured logging system. This observability data is essential for cost allocation, capacity planning, and detecting abuse patterns — users making unexpectedly large requests or hitting endpoints at rates inconsistent with normal usage.

Health check and readiness probe endpoints are essential for production FastAPI LLM services deployed in Kubernetes. A /health endpoint returns 200 when the service is running; a /ready endpoint performs a lightweight LLM call (often with a minimal test prompt) to confirm the LLM client is initialized and the API key is valid before signaling readiness to receive production traffic. Kubernetes uses these probes to route traffic only to ready instances and to restart unhealthy instances, preventing cold-start latency spikes from reaching users.