Delivering LLM tokens to users as they're generated via SSE or WebSockets — the pattern that turns a 10-second blank wait into a fast-feeling experience. Covers FastAPI, async generators, and client handling.
A standard LLM response takes 5–30 seconds for a long output. Showing nothing for 20 seconds, then dumping a wall of text, feels broken — users assume the app has stalled. Streaming sends tokens to the client as they're generated, so users see the first word in under a second.
The psychological effect is significant. A response that takes 15 seconds to complete feels "fast" when you can see it being written in real time, because you have immediate feedback that something is happening. This is the same reason progress bars feel faster than blank waiting screens.
Beyond UX, streaming enables two practical patterns: early stopping (user reads enough and interrupts before the full response completes), and progressive rendering (parse and display structured output like markdown incrementally, rather than waiting for the full JSON).
Server-Sent Events (SSE): unidirectional, server → client. Uses regular HTTP. Auto-reconnects on disconnect. Supported natively in browsers via EventSource API. Perfect for streaming LLM responses — you only need server-to-client data flow. Simpler to implement, works through proxies and load balancers that support HTTP.
WebSockets: bidirectional. Separate protocol (WS/WSS). More complex setup. Use WebSockets when you need client-to-server messages mid-stream — e.g., user interrupts the generation, sends a follow-up while the model is still generating, or you need real-time bidirectional communication (like voice/audio streaming).
For most LLM chat applications: use SSE. The OpenAI and Anthropic APIs both use SSE for their streaming endpoints. WebSockets add complexity without benefit unless you have genuine bidirectional requirements.
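The SSE wire format itself is minimal: each event is a `data:` line terminated by a blank line. A sketch of the framing in Python (the `sse_event` helper name is illustrative, not from any library):

```python
import json

def sse_event(payload: dict) -> str:
    # One SSE event: a "data:" line carrying JSON, then a blank line.
    # The blank line (the second \n) is what marks the end of the event.
    return f"data: {json.dumps(payload)}\n\n"

print(repr(sse_event({"text": "Hi"})))
```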
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import anthropic
import json
app = FastAPI()
client = anthropic.AsyncAnthropic()

async def stream_claude(prompt: str):
    # Open a streaming connection to Anthropic. Use the async client so
    # the event loop isn't blocked while waiting for tokens.
    async with client.messages.stream(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        async for text_chunk in stream.text_stream:
            # SSE format: each event is "data: {json}\n\n"
            yield f"data: {json.dumps({'text': text_chunk})}\n\n"
        # Signal completion
        yield f"data: {json.dumps({'done': True})}\n\n"
@app.get("/stream")
async def stream_endpoint(prompt: str):
    return StreamingResponse(
        stream_claude(prompt),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # disable nginx buffering
            "Connection": "keep-alive",
        },
    )
# Test: curl -N "http://localhost:8000/stream?prompt=Hello"
# Output: data: {"text": "Hello"}
# data: {"text": "!"}
# data: {"done": true}
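For integration tests, the curl output above can be decoded back into events with a small helper (a sketch; `parse_sse` is not part of any library and ignores SSE fields other than `data:`):

```python
import json

def parse_sse(raw: str) -> list:
    # Collect every "data: {...}" line from a raw SSE stream into dicts
    events = []
    for line in raw.splitlines():
        if line.startswith("data: "):
            events.append(json.loads(line[len("data: "):]))
    return events

chunks = 'data: {"text": "Hello"}\n\ndata: {"done": true}\n\n'
print(parse_sse(chunks))  # [{'text': 'Hello'}, {'done': True}]
```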
import anthropic
import json

client = anthropic.Anthropic()
async_client = anthropic.AsyncAnthropic()

# Synchronous streaming
def stream_sync(prompt: str) -> str:
    full_text = ""
    with client.messages.stream(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
            full_text += text
        # Access final message after stream completes
        final_msg = stream.get_final_message()
    print(f"\nStop reason: {final_msg.stop_reason}")
    print(f"Tokens: {final_msg.usage.input_tokens}+{final_msg.usage.output_tokens}")
    return full_text

# Async streaming (for FastAPI/async applications)
async def stream_async(prompt: str):
    async with async_client.messages.stream(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        async for text in stream.text_stream:
            yield text  # yields tokens as they arrive

# Use in FastAPI:
async def stream_generator(prompt: str):
    async for token in stream_async(prompt):
        # JSON-encode each token: a raw token containing a newline would
        # break the "data: ...\n\n" SSE framing
        yield f"data: {json.dumps({'text': token})}\n\n"
// EventSource — simplest SSE client
const source = new EventSource(`/stream?prompt=Explain+transformers`);
const output = document.getElementById('output');

source.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.done) {
    source.close();
    return;
  }
  output.textContent += data.text;
};
source.onerror = () => source.close();
// Fetch API with ReadableStream — more control, supports POST
async function streamFetch(prompt) {
  const response = await fetch('/stream', {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({prompt}),
  });
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';  // holds a partial SSE line split across chunk boundaries
  while (true) {
    const {done, value} = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, {stream: true});
    // Parse SSE format; keep the last (possibly incomplete) line buffered
    const lines = buffer.split('\n');
    buffer = lines.pop();
    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = JSON.parse(line.slice(6));
        if (data.done) return;
        document.getElementById('output').textContent += data.text;
      }
    }
  }
}
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "calculator",
    "description": "Compute math",
    "input_schema": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}]

# Stream events have different types during tool use
with client.messages.stream(
    model="claude-haiku-4-5-20251001",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What is 2847 * 3921?"}],
) as stream:
    for event in stream:
        event_type = type(event).__name__
        if event_type == "RawContentBlockDeltaEvent":
            delta = event.delta
            if hasattr(delta, "text"):
                # Regular text token
                print(delta.text, end="", flush=True)
            elif hasattr(delta, "partial_json"):
                # Tool input being streamed as partial JSON
                print(f"[tool_input+={delta.partial_json}]", end="", flush=True)
        elif event_type == "RawContentBlockStartEvent":
            if hasattr(event.content_block, "name"):
                print(f"\n[calling tool: {event.content_block.name}]")

    # Handle tool call after stream completes
    final = stream.get_final_message()
    for block in final.content:
        if block.type == "tool_use":
            # Demo only: eval() on model output is unsafe in production;
            # use a real expression parser instead
            result = eval(block.input["expression"])
            print(f"\nCalculator result: {result}")
Nginx and proxies buffer SSE by default. Most reverse proxies buffer responses before forwarding — this breaks streaming, showing nothing until the buffer fills or the connection closes. Fix: add X-Accel-Buffering: no response header (Nginx) or configure proxy_buffering off in your Nginx config. AWS ALB and Cloudflare also require explicit configuration to pass through SSE without buffering.
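A minimal Nginx location block with buffering disabled might look like this (the path and upstream name are illustrative):

```nginx
location /stream {
    proxy_pass http://app_upstream;
    proxy_buffering off;          # forward each SSE chunk immediately
    proxy_cache off;
    proxy_http_version 1.1;       # required for chunked streaming upstream
    proxy_set_header Connection '';
    proxy_read_timeout 1h;        # don't cut off long-lived streams
}
```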
EventSource only supports GET requests. The browser's native EventSource API doesn't support POST or custom headers. If you need to send a long prompt (too long for URL params) or authentication headers, use the Fetch API with ReadableStream instead of EventSource. Most production streaming implementations use fetch + ReadableStream for this reason.
Count tokens before streaming starts for accurate cost estimates. Once you start streaming, you can't stop it mid-way without closing the connection — you'll be charged for all tokens generated. For expensive operations, count input tokens with client.messages.count_tokens() before opening the stream, and warn users or require confirmation if the estimated cost exceeds a threshold.
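A sketch of that gate as a pure function (the per-million-token price here is illustrative, not real pricing; in practice `input_tokens` would come from the `count_tokens` call above):

```python
def needs_confirmation(input_tokens: int, price_per_mtok: float, budget_usd: float) -> bool:
    # Rough input-side cost estimate. Output tokens are unknown up front,
    # so treat this as a lower bound on the total cost.
    estimated = input_tokens / 1_000_000 * price_per_mtok
    return estimated > budget_usd

print(needs_confirmation(500_000, 1.0, 0.25))  # True: ~$0.50 exceeds $0.25
```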
| Scenario | Recommended Architecture | Key Consideration |
|---|---|---|
| Single-turn chat, web browser client | FastAPI SSE + EventSource or fetch client | Simple, works through CDNs |
| Multi-turn chat, mobile app | WebSocket with heartbeat | Stateful -- route to same server |
| High-volume API, B2B | gRPC server streaming | Efficient framing, strong typing |
| Streaming through a cache layer | SSE with Redis pub/sub fanout | Multiple subscribers per stream |
| Serverless (AWS Lambda, Vercel) | Chunked HTTP response | Function timeout limits apply |
Token-by-token streaming creates a perception of responsiveness that strongly influences user satisfaction scores, even when total latency is identical to buffered responses. The key metric is time-to-first-token (TTFT): getting the first token to the client within 300-500ms makes the response feel immediate. Optimise TTFT first by minimising prompt processing overhead (use prompt caching for static prefixes), then optimise throughput tokens-per-second for long completions.
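TTFT is easy to measure around any token iterator; a minimal sketch (`measure_ttft` is a hypothetical helper, not a library function):

```python
import time

def measure_ttft(token_iter):
    # Returns (seconds until first token, full text) for any iterable of chunks
    start = time.monotonic()
    ttft = None
    parts = []
    for tok in token_iter:
        if ttft is None:
            ttft = time.monotonic() - start
        parts.append(tok)
    return ttft, "".join(parts)

ttft, text = measure_ttft(iter(["Hello", ", ", "world"]))
print(f"TTFT: {ttft * 1000:.1f}ms, text: {text!r}")
```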
Handle mid-stream errors gracefully. If the LLM connection drops after 200 tokens have been streamed, the client has already displayed partial output. Implement a resume protocol: send a unique stream ID at the start, allow clients to reconnect with that ID to receive remaining tokens from a server-side buffer. For most use cases, a simpler approach is to display an error indicator and allow the user to retry from the partial output.
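A sketch of the server-side piece of that resume protocol, assuming an in-memory buffer keyed by stream ID (a real deployment would use Redis with a TTL so buffers expire; all names here are illustrative):

```python
import uuid
from collections import defaultdict

_buffers = defaultdict(list)  # stream_id -> tokens sent so far

def new_stream_id() -> str:
    return uuid.uuid4().hex

def record_token(stream_id: str, token: str) -> None:
    # Append every token as it's sent, so a reconnecting client can catch up
    _buffers[stream_id].append(token)

def resume(stream_id: str, tokens_received: int) -> list:
    # The client reports how many tokens it already displayed;
    # replay only the remainder
    return _buffers[stream_id][tokens_received:]

sid = new_stream_id()
for t in ["The", " answer", " is", " 42"]:
    record_token(sid, t)
print(resume(sid, 2))  # [' is', ' 42']
```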
Streaming architectures must handle backpressure carefully. When a downstream consumer processes tokens more slowly than the LLM generates them — for instance, a speech synthesis engine converting text to audio — the system needs a buffer to absorb the speed difference. Unbounded buffers risk memory exhaustion during long generations; bounded buffers with blocking semantics can cause the LLM API connection to time out. Token-based chunking with adaptive buffer sizing is a practical middle ground.
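The bounded-buffer pattern looks like this in asyncio, with a deliberately slow consumer standing in for something like a TTS engine (a sketch; the function names and token values are illustrative):

```python
import asyncio

async def produce(queue: asyncio.Queue) -> None:
    # Fast producer (stand-in for the LLM token stream). await queue.put()
    # blocks while the bounded queue is full: that is the backpressure.
    for tok in ["one ", "two ", "three ", "four "]:
        await queue.put(tok)
    await queue.put(None)  # sentinel: stream finished

async def consume(queue: asyncio.Queue, out: list) -> None:
    # Slow consumer (stand-in for speech synthesis)
    while (tok := await queue.get()) is not None:
        await asyncio.sleep(0.01)
        out.append(tok)

async def main() -> list:
    queue = asyncio.Queue(maxsize=2)  # bounded: at most 2 tokens in flight
    out = []
    await asyncio.gather(produce(queue), consume(queue, out))
    return out

print(asyncio.run(main()))  # ['one ', 'two ', 'three ', 'four ']
```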
Server-sent events (SSE) remain the dominant streaming transport for web applications because they work over standard HTTP without requiring WebSocket infrastructure. SSE supports automatic reconnection on connection drop, making it resilient to transient network failures. For high-frequency token streams, the overhead of SSE framing is negligible, but for low-latency applications where every millisecond counts, raw WebSocket connections eliminate that overhead entirely.