Break down end-to-end response time into components: embedding, retrieval, LLM TTFT, generation, and post-processing. Set budgets per component, identify bottlenecks, and apply targeted optimisations, from caching to speculative decoding.
LLM latency has two distinct components that require different optimisations:
TTFT (Time to First Token): The time from when the request is sent until the first output token is received. It is dominated by network round-trip time plus prefill computation (processing the input prompt). For a 1,000-token prompt, prefill takes 0.5–2s on typical API infrastructure. This is what the user experiences as "the model thinking".
TPOT (Time Per Output Token): The time to generate each subsequent token after the first. Typically 20–80ms/token for frontier models, giving 12–50 tokens/second. For a 500-token response, this adds 10–40 seconds of total generation time.
For interactive applications, optimising TTFT improves perceived responsiveness more than optimising TPOT: users tolerate slow token generation far better than a long initial pause. Streaming output makes TPOT less important for user experience.
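These two numbers compose into total response time: roughly TTFT plus one TPOT interval per remaining output token. A quick back-of-envelope sketch (illustrative numbers, not measurements):

```python
def total_latency_s(ttft_s: float, tpot_ms: float, output_tokens: int) -> float:
    """End-to-end latency: first token, then one TPOT per remaining token."""
    return ttft_s + (output_tokens - 1) * tpot_ms / 1000

# 1.0s TTFT, 40ms/token, 500-token response:
print(round(total_latency_s(1.0, 40, 500), 2))  # 20.96
```

Even with a fast TTFT, a long response spends most of its wall time in generation, which is why streaming matters so much.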
A typical RAG application has these latency components. Set a budget for each based on your overall SLA:
| Component | Typical range | Budget (5s SLA) |
|---|---|---|
| Query embedding | 20–100ms | 100ms |
| Vector search | 10–50ms | 50ms |
| Reranking (if used) | 100–500ms | 200ms |
| LLM TTFT | 200ms–2s | 1,500ms |
| LLM generation | 1–10s | 2,500ms |
| Post-processing | 10–50ms | 50ms |
| Network overhead | 50–200ms | 100ms |
If any component consistently exceeds its budget, focus optimisation effort there before tuning anything else.
A decorator makes this instrumentation reusable across components:

```python
import time
import logging
from functools import wraps

logger = logging.getLogger(__name__)

# Per-component budgets in ms (from the table above)
BUDGET = {
    "embedding": 100,
    "vector_search": 50,
    "llm_generation": 2500,
}

def latency_tracker(component_name: str):
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = await func(*args, **kwargs)
            elapsed_ms = (time.perf_counter() - start) * 1000
            # Emit raw observations; compute p50/p99 in your metrics backend
            metrics.histogram(f"latency.{component_name}", elapsed_ms)
            if elapsed_ms > BUDGET[component_name]:
                logger.warning(
                    f"{component_name} exceeded budget: {elapsed_ms:.0f}ms"
                )
            return result
        return wrapper
    return decorator

@latency_tracker("embedding")
async def embed_query(text: str) -> list:
    return await embed_client.create(text)

@latency_tracker("vector_search")
async def search(embedding: list, k: int = 5) -> list:
    return await vector_store.query(embedding, top_k=k)

@latency_tracker("llm_generation")
async def generate(prompt: str) -> str:
    start = time.perf_counter()  # local start for TTFT measurement
    chunks = []
    async for chunk in llm_client.stream(prompt):
        if not chunks:  # first token received
            metrics.record("ttft_ms", (time.perf_counter() - start) * 1000)
        chunks.append(chunk)
    return "".join(chunks)
```
Embedding latency: Cache query embeddings (same query = same vector). Use a local embedding model for queries (no network round-trip); embedding models are small (100–400MB). Batch embedding requests at ingestion time.
Vector search: Approximate Nearest Neighbor (ANN) indices (HNSW, IVF) trade a slight recall loss for 10–100× faster search vs exact search. Pre-filter by metadata before vector search to reduce the search space.
LLM TTFT: Reduce prompt length (shorter system prompts, fewer retrieved chunks). Use a smaller model if quality allows. Cache common prompt prefixes (KV-cache warming in TGI/vLLM).
LLM generation: Use streaming rather than waiting for the full generation. Limit max_tokens aggressively. Use speculative decoding where supported (it speeds up decoding, not prefill). Use faster/smaller models for tasks with lower quality requirements.
System-level: Use async throughout (no blocking I/O). Parallelize embedding + metadata lookup. Keep your application server co-located with the LLM API region.
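The "parallelize embedding + metadata lookup" tip can be sketched with `asyncio.gather`. The stub coroutines below are hypothetical stand-ins for your own embedding and metadata calls:

```python
import asyncio

async def embed_stub(text: str) -> list:
    await asyncio.sleep(0.05)  # stand-in for an embedding API call
    return [0.1, 0.2]

async def metadata_stub(user_id: str) -> dict:
    await asyncio.sleep(0.05)  # stand-in for a metadata/DB lookup
    return {"tier": "pro"}

async def prepare(query: str, user_id: str):
    # The two awaits are independent, so run them concurrently:
    # wall time is ~max(50ms, 50ms) instead of the 100ms sum.
    embedding, meta = await asyncio.gather(
        embed_stub(query), metadata_stub(user_id)
    )
    return embedding, meta

embedding, meta = asyncio.run(prepare("how do refunds work?", "u123"))
print(meta)  # {'tier': 'pro'}
```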
```python
import hashlib
import json

import redis

r = redis.Redis()

def cache_key(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:16]

async def embed_with_cache(text: str) -> list:
    key = f"emb:{cache_key(text)}"
    cached = r.get(key)
    if cached:
        return json.loads(cached)
    embedding = await embed_client.create(text)
    r.setex(key, 3600, json.dumps(embedding))  # 1-hour TTL
    return embedding

# Semantic cache for LLM responses
async def query_with_semantic_cache(query: str, threshold: float = 0.95) -> str:
    query_emb = await embed_with_cache(query)
    # Search for similar past queries
    similar = await semantic_cache_store.search(query_emb, top_k=1)
    if similar and similar[0].similarity > threshold:
        return similar[0].cached_response  # reuse cached answer
    # Cache miss: run the full RAG pipeline
    response = await full_rag_pipeline(query)
    await semantic_cache_store.store(query_emb, query, response)
    return response
```
Always track P99 (99th-percentile) latency, not the mean. The mean hides the outliers that cause user-visible timeout errors. In LLM systems, P99 is often 3–5× the mean, driven by occasional long-context requests, rate-limit retries, cold model starts after scaling events, and intermittently slow responses from LLM provider infrastructure.
Set SLAs in terms of P95 and P99: "95% of requests complete in <3s, 99% in <8s." Mean-based SLAs are meaningless for user-facing applications.
Tools: add latency histograms to your metrics (Prometheus histogram, Datadog distribution). Set alerting thresholds at P99, not mean. Log the full latency breakdown for any request exceeding the P99 budget to identify which component caused the outlier.
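To see why mean-based SLAs mislead, compare the mean and P99 on a synthetic skewed sample (nearest-rank percentile, pure Python; the distributions are illustrative):

```python
import math
import random

random.seed(0)
# 985 fast requests (~800ms) plus 15 slow outliers (~9s)
samples = ([random.gauss(800, 100) for _ in range(985)]
           + [random.gauss(9000, 500) for _ in range(15)])

def percentile(values, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

mean = sum(samples) / len(samples)
print(f"mean={mean:.0f}ms  p95={percentile(samples, 95):.0f}ms  "
      f"p99={percentile(samples, 99):.0f}ms")
# The mean sits near the fast cluster; P99 lands in the outlier region.
```

A mean-based SLA would pass here while 1% of users wait roughly ten times longer than average.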
Async doesn't always help: Running embedding and LLM calls in parallel only helps if they're independent. In a standard RAG pipeline, you need the embedding before searching, and the search results before calling the LLM β these are sequential by nature. Parallelise what you can (e.g., retrieve while streaming previous tokens).
Context window affects TTFT nonlinearly: LLM prefill cost scales roughly quadratically with input length (due to attention). A 4K-token prompt takes ~4× longer to prefill than a 2K-token prompt. Reducing prompt length has outsized latency benefits for long-context applications.
Cold starts in serverless: If deploying on serverless (Lambda, Cloud Run), cold start times (200ms–2s) can dominate latency for infrequent users. Keep instances warm with periodic ping requests or use always-on minimum instances.
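The quadratic prefill effect is easy to model. This toy estimator captures only the attention term (real prefill also has a linear component, and the constant is illustrative, not a provider measurement):

```python
def prefill_estimate_ms(tokens: int, ms_per_1k_tokens_sq: float = 125.0) -> float:
    """Toy model: attention cost grows with the square of prompt length."""
    return ms_per_1k_tokens_sq * (tokens / 1000) ** 2

for n in (1000, 2000, 4000, 8000):
    print(f"{n:>5} tokens: ~{prefill_estimate_ms(n):.0f}ms")
# Doubling the prompt roughly quadruples estimated prefill time.
```

Under this model, trimming an 8K prompt to 4K saves far more TTFT than trimming 2K to 1K, which is the "outsized benefit" for long-context applications.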
A latency budget is a formal decomposition of the total allowable end-to-end response time across all components of an LLM pipeline. By assigning a maximum time to each stage β retrieval, model inference, post-processing, network round-trip β teams can make explicit trade-offs about where to invest optimization effort and identify which component is the binding constraint for their latency SLO.
| Pipeline Component | Typical Range | Optimization Lever | Hard Floor |
|---|---|---|---|
| Network (client–server) | 10–100ms | CDN, edge inference | ~10ms |
| Embedding lookup | 5–50ms | Caching, smaller model | ~2ms |
| Vector retrieval | 10–100ms | Index tuning, caching | ~5ms |
| LLM TTFT | 200ms–2s | Smaller model, GPU tier | ~100ms |
| LLM generation | 1–10s | Streaming, early stop | ~500ms |
| Post-processing | 5–50ms | Async, lighter logic | ~1ms |
Time-to-first-token (TTFT) and time-per-output-token (TPOT) are the two most important sub-metrics within the LLM generation component. TTFT dominates perceived responsiveness for chat interfaces because users judge whether a system feels fast by how quickly it starts responding. TPOT determines total generation time and matters more for batch workflows or long-form content. Streaming output to the client as tokens are generated hides generation latency from the user, making TTFT the primary metric to optimize for interactive applications.
Caching is the highest-leverage optimization for latency budgets in production RAG systems. Semantic caches that return cached responses for semantically similar queries can serve a significant fraction of production traffic at sub-millisecond latency. Embedding caches that store pre-computed embeddings for frequently queried documents eliminate the embedding lookup cost entirely. Prompt prefix caches, supported by providers like Anthropic and OpenAI, reuse KV cache entries for system prompts and common prefixes, reducing TTFT by 20–60% for requests with long shared prefixes.
```python
# Latency budget profiler: instrument each pipeline stage
import time

class LatencyBudget:
    def __init__(self, budget_ms):
        self.budget = budget_ms / 1000
        self.stages = {}
        self.start = time.perf_counter()

    def checkpoint(self, stage):
        # Record cumulative elapsed time at the end of this stage
        elapsed = time.perf_counter() - self.start
        self.stages[stage] = elapsed
        return elapsed

    def report(self):
        # Print per-stage deltas, then check the total against the budget
        prev = 0
        for stage, t in self.stages.items():
            print(f"{stage}: {(t - prev) * 1000:.1f}ms")
            prev = t
        total = list(self.stages.values())[-1] * 1000
        print(f"Total: {total:.1f}ms / budget {self.budget * 1000:.0f}ms")
        return total <= self.budget * 1000

# Usage: checkpoint after each stage, then verify the budget held
budget = LatencyBudget(budget_ms=5000)
# ... embed query ...
budget.checkpoint("embedding")
# ... vector search ...
budget.checkpoint("retrieval")
# ... LLM call ...
budget.checkpoint("generation")
within_budget = budget.report()
```
Adaptive latency budgets adjust targets based on real-time system load. When GPU utilization is below 60%, the system can afford to process more tokens and return higher-quality long-form responses; when a traffic spike pushes utilization above 90%, the system dynamically reduces max output token limits and disables optional post-processing steps to protect the P99 latency SLO for all concurrent users. Load-shedding policies that gracefully degrade quality under pressure are preferable to latency SLO violations that cause user-facing errors.
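A minimal sketch of such a policy, assuming a hypothetical `gpu_util` signal (0.0–1.0) sampled from your serving metrics; the thresholds and limits mirror the numbers above but are illustrative:

```python
def adaptive_limits(gpu_util: float) -> dict:
    """Map current GPU utilisation to per-request limits (load shedding)."""
    if gpu_util < 0.60:   # headroom: full quality
        return {"max_tokens": 1024, "rerank": True}
    if gpu_util < 0.90:   # busy: trim output length
        return {"max_tokens": 512, "rerank": True}
    # overloaded: drop optional post-processing to protect the P99 SLO
    return {"max_tokens": 256, "rerank": False}

print(adaptive_limits(0.45))  # {'max_tokens': 1024, 'rerank': True}
print(adaptive_limits(0.95))  # {'max_tokens': 256, 'rerank': False}
```

The key design choice is that degradation is monotonic and bounded: quality drops in known steps rather than latency blowing past the SLO.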