Execution Models

Streaming Architecture

Delivering LLM tokens to users as they're generated via SSE or WebSockets — the pattern that turns a 10-second blank wait into a fast-feeling experience. Covers FastAPI, async generators, and client handling.

Transport: SSE / WS
First token: <1s perceived
Async: Required

SECTION 01

Why streaming transforms UX

A standard LLM response takes 5–30 seconds for a long output. Showing nothing for 20 seconds, then dumping a wall of text, feels broken — users assume the app has stalled. Streaming sends tokens to the client as they're generated, so users see the first word in under a second.

The psychological effect is significant. A response that takes 15 seconds to complete feels "fast" when you can see it being written in real time, because you have immediate feedback that something is happening. This is the same reason progress bars feel faster than blank waiting screens.

Beyond UX, streaming enables two practical patterns: early stopping (user reads enough and interrupts before the full response completes), and progressive rendering (parse and display structured output like markdown incrementally, rather than waiting for the full JSON).

SECTION 02

SSE vs WebSockets

Server-Sent Events (SSE): unidirectional, server → client. Uses regular HTTP. Auto-reconnects on disconnect. Supported natively in browsers via EventSource API. Perfect for streaming LLM responses — you only need server-to-client data flow. Simpler to implement, works through proxies and load balancers that support HTTP.

WebSockets: bidirectional. Separate protocol (WS/WSS). More complex setup. Use WebSockets when you need client-to-server messages mid-stream — e.g., user interrupts the generation, sends a follow-up while the model is still generating, or you need real-time bidirectional communication (like voice/audio streaming).

For most LLM chat applications: use SSE. The OpenAI and Anthropic APIs both use SSE for their streaming endpoints. WebSockets add complexity without benefit unless you have genuine bidirectional requirements.
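For context, the SSE wire format itself is minimal: a long-lived HTTP response whose body is a series of `data:` lines, with each event terminated by a blank line. A sketch of a raw token stream (payload contents are illustrative):

```
HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache

data: {"text": "Hello"}

data: {"text": " world"}

data: {"done": true}
```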

SECTION 03

FastAPI SSE endpoint

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import anthropic
import json

app = FastAPI()
client = anthropic.AsyncAnthropic()

async def stream_claude(prompt: str):
    # Open a streaming connection to Anthropic; the async client keeps
    # the event loop free while waiting for tokens
    async with client.messages.stream(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        async for text_chunk in stream.text_stream:
            # SSE format: each event is a "data: {json}" line plus a blank line
            yield f"data: {json.dumps({'text': text_chunk})}\n\n"
        # Signal completion
        yield f"data: {json.dumps({'done': True})}\n\n"

@app.get("/stream")
async def stream_endpoint(prompt: str):
    return StreamingResponse(
        stream_claude(prompt),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # disable nginx buffering
            "Connection": "keep-alive",
        }
    )

# Test: curl -N "http://localhost:8000/stream?prompt=Hello"
# Output: data: {"text": "Hello"}
#         data: {"text": "!"}
#         data: {"done": true}
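One refinement worth adding to an endpoint like this: stop generating when the client goes away, so you stop paying for tokens nobody will read. A minimal sketch of the wrapping logic, kept framework-agnostic; in FastAPI you would pass the Starlette method `request.is_disconnected` as the check (all other names here are illustrative):

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

async def guard_stream(
    tokens: AsyncIterator[str],
    is_disconnected: Callable[[], Awaitable[bool]],
) -> AsyncIterator[str]:
    """Yield chunks from `tokens` until the disconnect check fires."""
    async for chunk in tokens:
        if await is_disconnected():
            break  # client is gone: stop consuming the upstream stream
        yield chunk

# Demo with a fake source and a fake disconnect after two chunks
async def demo() -> list[str]:
    async def source():
        for t in ["a", "b", "c", "d"]:
            yield t

    seen = 0
    async def disconnected() -> bool:
        nonlocal seen
        seen += 1
        return seen > 2  # simulate the client dropping mid-stream

    return [c async for c in guard_stream(source(), disconnected)]

result = asyncio.run(demo())
print(result)  # ['a', 'b']
```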
SECTION 04

Streaming from the Anthropic API

import anthropic
import asyncio

client = anthropic.Anthropic()
async_client = anthropic.AsyncAnthropic()

# Synchronous streaming
def stream_sync(prompt: str) -> str:
    full_text = ""
    with client.messages.stream(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
            full_text += text
        # Access final message after stream completes
        final_msg = stream.get_final_message()
        print(f"\nStop reason: {final_msg.stop_reason}")
        print(f"Tokens: {final_msg.usage.input_tokens}+{final_msg.usage.output_tokens}")
    return full_text

# Async streaming (for FastAPI/async applications)
async def stream_async(prompt: str):
    async with async_client.messages.stream(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        async for text in stream.text_stream:
            yield text   # yields tokens as they arrive

# Use in FastAPI:
async def stream_generator(prompt: str):
    async for token in stream_async(prompt):
        # JSON-encode tokens in production; a raw token containing a
        # newline would break SSE framing
        yield f"data: {token}\n\n"
SECTION 05

Client-side JavaScript

// EventSource — simplest SSE client
const source = new EventSource(`/stream?prompt=Explain+transformers`);
const output = document.getElementById('output');

source.onmessage = (event) => {
    const data = JSON.parse(event.data);
    if (data.done) {
        source.close();
        return;
    }
    output.textContent += data.text;
};
source.onerror = () => source.close();

// Fetch API with ReadableStream — more control, supports POST
async function streamFetch(prompt) {
    const response = await fetch('/stream', {
        method: 'POST',
        headers: {'Content-Type': 'application/json'},
        body: JSON.stringify({prompt}),
    });

    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    let buffer = '';  // an SSE event can be split across network reads

    while (true) {
        const {done, value} = await reader.read();
        if (done) break;

        buffer += decoder.decode(value, {stream: true});
        // Parse complete lines; keep any partial line for the next read
        const lines = buffer.split('\n');
        buffer = lines.pop();
        for (const line of lines) {
            if (line.startsWith('data: ')) {
                const data = JSON.parse(line.slice(6));
                if (data.done) return;
                document.getElementById('output').textContent += data.text;
            }
        }
    }
}
SECTION 06

Streaming with tool use

import anthropic

client = anthropic.Anthropic()

tools = [{"name": "calculator", "description": "Compute math",
           "input_schema": {"type": "object",
                            "properties": {"expression": {"type": "string"}},
                            "required": ["expression"]}}]

# Stream events have different types during tool use
with client.messages.stream(
    model="claude-haiku-4-5-20251001",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What is 2847 * 3921?"}],
) as stream:
    for event in stream:
        event_type = type(event).__name__

        if event_type == "RawContentBlockDeltaEvent":
            delta = event.delta
            if hasattr(delta, "text"):
                # Regular text token
                print(delta.text, end="", flush=True)
            elif hasattr(delta, "partial_json"):
                # Tool input being streamed as partial JSON
                print(f"[tool_input+={delta.partial_json}]", end="", flush=True)

        elif event_type == "RawContentBlockStartEvent":
            if hasattr(event.content_block, "name"):
                print(f"\n[calling tool: {event.content_block.name}]")

    # Handle tool call after stream completes
    final = stream.get_final_message()
    for block in final.content:
        if block.type == "tool_use":
            # eval() is for demonstration only; use a safe expression
            # parser in production
            result = eval(block.input["expression"])
            print(f"\nCalculator result: {result}")
SECTION 07

Gotchas

Nginx and proxies buffer SSE by default. Most reverse proxies buffer responses before forwarding — this breaks streaming, showing nothing until the buffer fills or the connection closes. Fix: add X-Accel-Buffering: no response header (Nginx) or configure proxy_buffering off in your Nginx config. AWS ALB and Cloudflare also require explicit configuration to pass through SSE without buffering.
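As a sketch, the relevant Nginx location block might look like this (the path and upstream name are illustrative):

```nginx
location /stream {
    proxy_pass http://app_upstream;
    proxy_buffering off;          # forward bytes as they arrive
    proxy_cache off;
    proxy_read_timeout 300s;      # long-lived streaming connections
    proxy_http_version 1.1;       # required for chunked streaming
    proxy_set_header Connection '';
}
```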

EventSource only supports GET requests. The browser's native EventSource API doesn't support POST or custom headers. If you need to send a long prompt (too long for URL params) or authentication headers, use the Fetch API with ReadableStream instead of EventSource. Most production streaming implementations use fetch + ReadableStream for this reason.

Count tokens before streaming starts for accurate cost estimates. Once you start streaming, you can't stop it mid-way without closing the connection — you'll be charged for all tokens generated. For expensive operations, count input tokens with client.messages.count_tokens() before opening the stream, and warn users or require confirmation if the estimated cost exceeds a threshold.
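A sketch of that pre-flight check, using the SDK's `client.messages.count_tokens` endpoint; the per-token prices and the threshold logic are placeholders, so substitute your model's actual rates:

```python
PRICE_PER_INPUT_TOKEN = 1.00 / 1_000_000   # placeholder $/token
PRICE_PER_OUTPUT_TOKEN = 5.00 / 1_000_000  # placeholder $/token

def estimate_cost(input_tokens: int, max_output_tokens: int) -> float:
    """Worst-case cost if the model uses its full output budget."""
    return (input_tokens * PRICE_PER_INPUT_TOKEN
            + max_output_tokens * PRICE_PER_OUTPUT_TOKEN)

def ok_to_stream(prompt: str, max_tokens: int, threshold: float) -> bool:
    """Count input tokens server-side, then gate on estimated cost."""
    import anthropic
    client = anthropic.Anthropic()
    count = client.messages.count_tokens(
        model="claude-haiku-4-5-20251001",
        messages=[{"role": "user", "content": prompt}],
    )
    cost = estimate_cost(count.input_tokens, max_tokens)
    return cost <= threshold  # False: warn the user or require confirmation
```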

SECTION 08

Streaming Architecture Decision Guide

Scenario                               Recommended Architecture          Key Consideration
Single-turn chat, web browser client   FastAPI SSE + EventSource/fetch   Simple, works through CDNs
Multi-turn chat, mobile app            WebSocket with heartbeat          Stateful; route to same server
High-volume API, B2B                   gRPC server streaming             Efficient framing, strong typing
Streaming through a cache layer        SSE with Redis pub/sub fanout     Multiple subscribers per stream
Serverless (AWS Lambda, Vercel)        Chunked HTTP response             Function timeout limits apply

Token-by-token streaming creates a perception of responsiveness that strongly influences user satisfaction scores, even when total latency is identical to buffered responses. The key metric is time-to-first-token (TTFT): getting the first token to the client within 300-500ms makes the response feel immediate. Optimise TTFT first by minimising prompt processing overhead (use prompt caching for static prefixes), then optimise throughput tokens-per-second for long completions.
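TTFT is easy to instrument yourself. A minimal sketch that works on any iterable of text chunks (for example, the SDK's `stream.text_stream`); the function name is illustrative:

```python
import time
from typing import Iterable

def measure_stream(chunks: Iterable[str]) -> tuple[float, int, float]:
    """Return (time-to-first-token, total chars, total time) in seconds."""
    start = time.perf_counter()
    ttft = 0.0
    n_chars = 0
    for chunk in chunks:
        if n_chars == 0:
            ttft = time.perf_counter() - start  # first chunk arrived
        n_chars += len(chunk)
    total = time.perf_counter() - start
    return ttft, n_chars, total

# Demo with an in-memory stand-in for a token stream
ttft, n_chars, total = measure_stream(iter(["Hello", " world"]))
```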

Handle mid-stream errors gracefully. If the LLM connection drops after 200 tokens have been streamed, the client has already displayed partial output. Implement a resume protocol: send a unique stream ID at the start, allow clients to reconnect with that ID to receive remaining tokens from a server-side buffer. For most use cases, a simpler approach is to display an error indicator and allow the user to retry from the partial output.
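A sketch of the server-side half of that resume protocol, with all names illustrative: chunks are buffered under a stream ID, and a reconnecting client sends the count of chunks it already displayed:

```python
import uuid

class StreamBuffer:
    """In-memory chunk buffer keyed by stream ID (use Redis in production)."""

    def __init__(self) -> None:
        self._streams: dict[str, list[str]] = {}

    def create(self) -> str:
        stream_id = str(uuid.uuid4())
        self._streams[stream_id] = []
        return stream_id

    def append(self, stream_id: str, chunk: str) -> None:
        self._streams[stream_id].append(chunk)

    def resume_from(self, stream_id: str, next_index: int) -> list[str]:
        """Chunks the client has not yet seen (client sends its count)."""
        return self._streams.get(stream_id, [])[next_index:]

buf = StreamBuffer()
sid = buf.create()
for tok in ["Hel", "lo ", "wor", "ld"]:
    buf.append(sid, tok)
# Client displayed 2 chunks before dropping; it reconnects with next_index=2
print(buf.resume_from(sid, 2))  # ['wor', 'ld']
```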

Streaming architectures must handle backpressure carefully. When a downstream consumer processes tokens more slowly than the LLM generates them — for instance, a speech synthesis engine converting text to audio — the system needs a buffer to absorb the speed difference. Unbounded buffers risk memory exhaustion during long generations; bounded buffers with blocking semantics can cause the LLM API connection to time out. Token-based chunking with adaptive buffer sizing is a practical middle ground.
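The bounded-buffer pattern can be sketched with an `asyncio.Queue`: `maxsize` caps memory, and `put()` suspends the producer when the consumer falls behind, which is exactly the backpressure described above (the timing values are illustrative):

```python
import asyncio

async def producer(queue: asyncio.Queue, tokens: list[str]) -> None:
    for tok in tokens:
        await queue.put(tok)   # suspends when the queue is full (backpressure)
    await queue.put(None)      # sentinel: end of stream

async def consumer(queue: asyncio.Queue, delay: float) -> list[str]:
    out: list[str] = []
    while (tok := await queue.get()) is not None:
        await asyncio.sleep(delay)  # simulate slow downstream work (e.g. TTS)
        out.append(tok)
    return out

async def main() -> list[str]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=8)  # bounded buffer
    tokens = [f"t{i}" for i in range(20)]
    _, received = await asyncio.gather(
        producer(queue, tokens),
        consumer(queue, 0.001),
    )
    return received

received = asyncio.run(main())
print(received)  # all 20 tokens, in order, despite the slow consumer
```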

Server-sent events (SSE) remain the dominant streaming transport for web applications because they work over standard HTTP without requiring WebSocket infrastructure. SSE supports automatic reconnection on connection drop, making it resilient to transient network failures. For high-frequency token streams, the overhead of SSE framing is negligible, but for low-latency applications where every millisecond counts, raw WebSocket connections eliminate that overhead entirely.