Performance Design

Latency Budget

Break down end-to-end response time into components: embedding, retrieval, LLM TTFT, generation, and post-processing. Set budgets per component, identify bottlenecks, and apply targeted optimisations, from caching to speculative decoding.

Table of Contents

1. Two latency metrics: TTFT and TPOT
2. Decomposing the latency budget
3. Measuring each component
4. Optimisation levers
5. Caching strategies
6. P99 vs mean
7. Gotchas

SECTION 01

Two latency metrics: TTFT and TPOT

LLM latency has two distinct components that require different optimisations:

TTFT (Time to First Token): The time from when the request is sent until the first output token is received. It is dominated by the network round-trip plus prefill computation (processing the input prompt). For a 1000-token prompt, prefill takes 0.5–2s on typical API infrastructure. This is what the user experiences as "the model thinking".

TPOT (Time Per Output Token): The time to generate each subsequent token after the first. Typically 20–80ms/token for frontier models, giving 12–50 tokens/second. For a 500-token response, this adds 10–40 seconds of total generation time.

For interactive applications, optimising TTFT improves perceived responsiveness more than optimising TPOT: users tolerate slow token generation better than a long initial pause. Streaming output makes TPOT less important for user experience.
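As a back-of-envelope model, total generation latency is roughly TTFT plus TPOT times the number of output tokens. A minimal sketch using the illustrative ranges above (the function and the specific values are not from any particular provider):

```python
# Back-of-envelope latency model: total ≈ TTFT + TPOT * output_tokens.
# The numbers below are illustrative, taken from the ranges quoted above.

def total_latency_s(ttft_s: float, tpot_ms: float, output_tokens: int) -> float:
    """Estimate end-to-end generation latency in seconds."""
    return ttft_s + (tpot_ms / 1000) * output_tokens

# 1s TTFT, 40 ms/token, 500-token response -> 21 s total,
# but a streaming user sees the first token after just 1 s.
print(total_latency_s(1.0, 40, 500))  # 21.0
```

This is why streaming changes the picture: the 20 s of generation time is hidden behind incremental output, leaving TTFT as the dominant perceived cost.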

SECTION 02

Decomposing the latency budget

A typical RAG application has these latency components. Set a budget for each based on your overall SLA:

Component           | Typical range | Budget (5s SLA)
--------------------|---------------|----------------
Query embedding     | 20–100ms      | 100ms
Vector search       | 10–50ms       | 50ms
Reranking (if used) | 100–500ms     | 200ms
LLM TTFT            | 200ms–2s      | 1,500ms
LLM generation      | 1–10s         | 2,500ms
Post-processing     | 10–50ms       | 50ms
Network overhead    | 50–200ms      | 100ms

If any component consistently exceeds its budget, focus optimisation there before elsewhere.
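The budget table can be encoded directly as a sanity check in code. A minimal sketch, with component names and values mirroring the table above (the dict keys themselves are illustrative):

```python
# Hypothetical per-component budget table (values in ms),
# mirroring the 5s-SLA column above.
BUDGET_MS = {
    "query_embedding": 100,
    "vector_search": 50,
    "reranking": 200,
    "llm_ttft": 1500,
    "llm_generation": 2500,
    "post_processing": 50,
    "network_overhead": 100,
}

SLA_MS = 5000

# Sanity-check that the per-component budgets fit inside the SLA,
# leaving some headroom for variance.
assert sum(BUDGET_MS.values()) <= SLA_MS, "component budgets exceed the SLA"
print(sum(BUDGET_MS.values()))  # 4500
```

Keeping the budget sum below the SLA (here 4,500ms of 5,000ms) leaves headroom so that one component running slightly hot does not immediately blow the end-to-end target.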

SECTION 03

Measuring each component

import time
from functools import wraps

# Illustrative: `metrics`, `logger`, and the client objects below are
# assumed to be provided by your application. BUDGET holds the
# per-component limits in milliseconds.
BUDGET = {"embedding": 100, "vector_search": 50, "llm_generation": 2500}

def latency_tracker(component_name: str):
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = await func(*args, **kwargs)
            elapsed_ms = (time.perf_counter() - start) * 1000
            # Emit raw samples; percentiles (p50, p99) are computed by
            # the metrics backend from the histogram, not passed here.
            metrics.histogram(f"latency.{component_name}", elapsed_ms)
            if elapsed_ms > BUDGET[component_name]:
                logger.warning(f"{component_name} exceeded budget: {elapsed_ms:.0f}ms")
            return result
        return wrapper
    return decorator

@latency_tracker("embedding")
async def embed_query(text: str) -> list:
    return await embed_client.create(text)

@latency_tracker("vector_search")
async def search(embedding: list, k: int = 5) -> list:
    return await vector_store.query(embedding, top_k=k)

@latency_tracker("llm_generation")
async def generate(prompt: str) -> str:
    start = time.perf_counter()  # local start for the TTFT measurement
    chunks = []
    async for chunk in llm_client.stream(prompt):
        chunks.append(chunk)
        if len(chunks) == 1:
            metrics.record("ttft_ms", (time.perf_counter() - start) * 1000)
    return "".join(chunks)

SECTION 04

Optimisation levers

Embedding latency: Cache query embeddings (same query = same vector). Use a local embedding model for queries to avoid the network round-trip; embedding models are small (100–400MB). Batch embedding requests at ingestion time.

Vector search: Approximate Nearest Neighbor (ANN) indices (HNSW, IVF) trade slight recall for 10–100× faster search vs exact search. Pre-filter by metadata before vector search to reduce the search space.

LLM TTFT: Reduce prompt length (shorter system prompts, fewer retrieved chunks). Use speculative decoding. Use a smaller model if quality allows. Cache common prompt prefixes (KV-cache warming in TGI/vLLM).

LLM generation: Use streaming rather than waiting for the full generation. Limit max_tokens aggressively. Use faster/smaller models for tasks where quality requirements are lower.

System-level: Use async throughout (no blocking I/O). Parallelize embedding + metadata lookup. Keep your application server co-located with the LLM API region.
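The "parallelise independent steps" advice can be sketched with asyncio.gather. The functions below are hypothetical stand-ins with simulated delays (asyncio.sleep); only the structure is the point:

```python
import asyncio

# Sketch: embedding and metadata lookup are independent, so they can
# run concurrently. embed_query and lookup_metadata are illustrative
# stand-ins for real pipeline calls.

async def embed_query(text: str) -> list:
    await asyncio.sleep(0.05)  # simulate a 50ms embedding call
    return [0.1, 0.2]

async def lookup_metadata(user_id: str) -> dict:
    await asyncio.sleep(0.05)  # simulate a 50ms metadata fetch
    return {"tier": "pro"}

async def prepare(text: str, user_id: str):
    # Concurrent: wall time ≈ max(50, 50)ms rather than the 100ms sum.
    return await asyncio.gather(embed_query(text), lookup_metadata(user_id))

embedding, meta = asyncio.run(prepare("example query", "u1"))
```

By contrast, embedding → search → LLM cannot be parallelised this way, because each step consumes the previous step's output (see the gotchas section).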

SECTION 05

Caching strategies

import hashlib
import json

import redis

# embed_client, semantic_cache_store, and full_rag_pipeline are
# assumed to be provided by your application.
r = redis.Redis()

def cache_key(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:16]

async def embed_with_cache(text: str) -> list:
    key = f"emb:{cache_key(text)}"
    cached = r.get(key)
    if cached:
        return json.loads(cached)
    embedding = await embed_client.create(text)
    r.setex(key, 3600, json.dumps(embedding))  # 1 hour TTL
    return embedding

# Semantic cache for LLM responses
async def query_with_semantic_cache(query: str, threshold: float = 0.95) -> str:
    query_emb = await embed_with_cache(query)
    # Search for similar past queries
    similar = await semantic_cache_store.search(query_emb, top_k=1)
    if similar and similar[0].similarity > threshold:
        return similar[0].cached_response  # reuse cached answer
    # Cache miss: run full RAG pipeline
    response = await full_rag_pipeline(query)
    await semantic_cache_store.store(query_emb, query, response)
    return response

SECTION 06

P99 vs mean

Always track P99 (99th percentile) latency, not the mean. The mean hides outliers that cause user-visible timeout errors. In LLM systems, P99 is often 3–5× the mean, due to occasional long-context requests hitting rate limits, cold model starts after scaling events, and occasional slow responses from LLM provider infrastructure.

Set SLAs in terms of P95 and P99: "95% of requests complete in <3s, 99% in <8s." Mean-based SLAs are meaningless for user-facing applications.

Tools: add latency histograms to your metrics (Prometheus histogram, Datadog distribution). Set alerting thresholds at P99, not mean. Log the full latency breakdown for any request exceeding the P99 budget to identify which component caused the outlier.
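To see how the mean hides outliers, here is a stdlib-only sketch on simulated data (the outlier fraction and magnitudes are illustrative, not measurements):

```python
import random
import statistics

# Simulated latencies: 99% fast requests, 1% slow outliers,
# roughly mimicking the failure modes listed above.
random.seed(0)
samples_ms = ([random.gauss(800, 150) for _ in range(990)]
              + [random.gauss(6000, 1000) for _ in range(10)])

# statistics.quantiles with n=100 returns the 99 percentile cut points.
q = statistics.quantiles(samples_ms, n=100)
p50, p95, p99 = q[49], q[94], q[98]
mean = statistics.mean(samples_ms)
print(f"mean={mean:.0f}ms p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
```

With only 1% outliers, the mean barely moves while P99 lands several multiples above it, which is exactly the signal a mean-based SLA would miss.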

SECTION 07

Gotchas

Async doesn't always help: Running embedding and LLM calls in parallel only helps if they're independent. In a standard RAG pipeline, you need the embedding before searching, and the search results before calling the LLM; these steps are sequential by nature. Parallelise what you can (e.g., retrieve while streaming previous tokens).

Context window affects TTFT nonlinearly: LLM prefill cost scales roughly quadratically with input length (due to attention). A 4K-token prompt takes ~4× longer to prefill than a 2K-token prompt. Reducing prompt length has outsized latency benefits for long-context applications.
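The scaling can be made concrete with a toy cost model. Real prefill time mixes linear (MLP) and quadratic (attention) terms; this sketch keeps only the quadratic part quoted above, so treat it as an upper-bound intuition rather than a prediction:

```python
# Toy model: prefill cost proportional to (prompt length)^2,
# normalised to a hypothetical 2K-token baseline.

def relative_prefill_cost(tokens: int, baseline_tokens: int = 2000) -> float:
    return (tokens / baseline_tokens) ** 2

print(relative_prefill_cost(4000))  # 4.0: a 4K prompt vs the 2K baseline
print(relative_prefill_cost(1000))  # 0.25: halving the prompt quarters the cost
```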

Cold starts in serverless: If deploying on serverless (Lambda, Cloud Run), cold start times (200ms–2s) can dominate latency for infrequent users. Keep instances warm with periodic ping requests or use always-on minimum instances.

Latency Budget Allocation by Component

A latency budget is a formal decomposition of the total allowable end-to-end response time across all components of an LLM pipeline. By assigning a maximum time to each stage (retrieval, model inference, post-processing, network round-trip), teams can make explicit trade-offs about where to invest optimization effort and identify which component is the binding constraint for their latency SLO.

Pipeline Component      | Typical Range | Optimization Lever      | Hard Floor
------------------------|---------------|-------------------------|-----------
Network (client→server) | 10–100ms      | CDN, edge inference     | ~10ms
Embedding lookup        | 5–50ms        | Caching, smaller model  | ~2ms
Vector retrieval        | 10–100ms      | Index tuning, caching   | ~5ms
LLM TTFT                | 200ms–2s      | Smaller model, GPU tier | ~100ms
LLM generation          | 1–10s         | Streaming, early stop   | ~500ms
Post-processing         | 5–50ms        | Async, lighter logic    | ~1ms

Time-to-first-token (TTFT) and time-per-output-token (TPOT) are the two most important sub-metrics within the LLM generation component. TTFT dominates perceived responsiveness for chat interfaces because users judge whether a system feels fast by how quickly it starts responding. TPOT determines total generation time and matters more for batch workflows or long-form content. Streaming output to the client as tokens are generated hides generation latency from the user, making TTFT the primary metric to optimize for interactive applications.

Caching is the highest-leverage optimization for latency budgets in production RAG systems. Semantic caches that return cached responses for semantically similar queries can serve a significant fraction of production traffic at sub-millisecond latency. Embedding caches that store pre-computed embeddings for frequently queried documents eliminate the embedding lookup cost entirely. Prompt prefix caches, supported by providers like Anthropic and OpenAI, reuse KV cache entries for system prompts and common prefixes, reducing TTFT by 20–60% for requests with long shared prefixes.

# Latency budget profiler: instrument each pipeline stage
import time

class LatencyBudget:
    def __init__(self, budget_ms: float):
        self.budget_ms = budget_ms
        self.stages: dict[str, float] = {}
        self.start = time.perf_counter()

    def checkpoint(self, stage: str) -> float:
        """Record cumulative elapsed time (seconds) at the end of a stage."""
        elapsed = time.perf_counter() - self.start
        self.stages[stage] = elapsed
        return elapsed

    def report(self) -> bool:
        """Print per-stage deltas and return True if within budget."""
        prev = 0.0
        for stage, t in self.stages.items():
            print(f"{stage}: {(t - prev) * 1000:.1f}ms")
            prev = t
        total_ms = prev * 1000
        print(f"Total: {total_ms:.1f}ms / budget {self.budget_ms:.0f}ms")
        return total_ms <= self.budget_ms

Adaptive latency budgets adjust targets based on real-time system load. When GPU utilization is below 60%, the system can afford to process more tokens and return higher-quality long-form responses; when a traffic spike pushes utilization above 90%, the system dynamically reduces max output token limits and disables optional post-processing steps to protect the P99 latency SLO for all concurrent users. Load-shedding policies that gracefully degrade quality under pressure are preferable to latency SLO violations that cause user-facing errors.
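A load-aware policy like the one described can be sketched as a simple threshold function. The 60%/90% cut-offs follow the text; the function name and the specific token limits are illustrative, not from any particular serving stack:

```python
# Sketch of an adaptive max_tokens cap driven by GPU utilisation.
# Thresholds follow the text above; limits are hypothetical.

def adaptive_max_tokens(gpu_utilisation: float,
                        low: int = 256, normal: int = 512,
                        high: int = 1024) -> int:
    """Pick a max output token limit based on current load."""
    if gpu_utilisation < 0.60:
        return high    # headroom: allow long-form responses
    if gpu_utilisation > 0.90:
        return low     # shed load: protect the P99 latency SLO
    return normal

print(adaptive_max_tokens(0.45))  # 1024
print(adaptive_max_tokens(0.95))  # 256
```

The same pattern extends to disabling optional post-processing steps under load: degrade quality gracefully rather than let the whole request miss its SLO.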