Prompt Engineering

Prompt Caching

A technique to cache repeated large prompt prefixes on the provider side, dramatically reducing latency and cost for multi-turn and RAG workloads.

- 90% cost reduction (cached tokens)
- ~85% latency reduction
- 5-minute cache TTL (Anthropic)


SECTION 01

The repeated-context problem

Picture a legal-research assistant. Every query starts: "You are a legal analyst. Here are the 200 pages of the contract…" followed by the user's actual question. Without caching, those 200 pages are tokenised, processed through all attention layers, and billed — on every single request. With 100 users sending 10 queries each, you pay for that document 1,000 times.

Prompt caching solves this by storing the KV-cache (the internal attention state) for the stable prefix on the provider's servers. Subsequent requests that share the same prefix skip the expensive prefill computation and go straight to generation — faster and cheaper.

SECTION 02

How prompt caching works

LLM inference has two phases: prefill (processing the prompt tokens, building the KV-cache) and generation (autoregressively producing the output tokens). Prefill is the expensive part for long contexts.

When you mark a prompt prefix for caching, the provider saves the KV-cache state at that breakpoint. If the next request's prefix matches exactly up to that breakpoint, prefill is skipped — the saved state is loaded instead. You pay a lower "cache read" price rather than full input-token price.

The cache is keyed on the exact byte sequence up to the breakpoint. A single character difference = cache miss.
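The byte-exact matching rule can be illustrated locally. This hashing is only an analogy for how a key derived from the exact prefix bytes behaves, not the provider's actual implementation:

```python
import hashlib

def cache_key(prefix: str) -> str:
    # Illustrative only: the key depends on the exact byte sequence of the
    # prefix, so any single-byte change produces a different key (a miss).
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

stable = "You are a legal analyst. Contract text: ..."
assert cache_key(stable) == cache_key(stable)        # identical prefix: hit
assert cache_key(stable) != cache_key(stable + " ")  # one trailing space: miss
```

A dynamic timestamp or user ID rendered into the "stable" part of a template has exactly this effect: every request produces a new key and the cache never hits.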

SECTION 03

Anthropic API: cache_control

import anthropic

client = anthropic.Anthropic()

# A long, stable system prompt (e.g., 10k tokens of reference material)
SYSTEM_DOCS = '''... your 10,000-token reference document ...'''

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_DOCS,
            "cache_control": {"type": "ephemeral"}   # ← mark for caching
        }
    ],
    messages=[
        {"role": "user", "content": "What does section 4.2 say about liability?"}
    ]
)

# Inspect cache usage in the response
usage = response.usage
print(f"Input tokens:        {usage.input_tokens}")
print(f"Cache creation:      {usage.cache_creation_input_tokens}")  # first call: non-zero
print(f"Cache read:          {usage.cache_read_input_tokens}")      # subsequent: non-zero

First call: cache_creation_input_tokens is non-zero — you paid to build and store the cache (1.25× normal rate).

Subsequent calls: cache_read_input_tokens is non-zero — you pay only 10% of the normal rate for those tokens.

SECTION 04

What gets cached and what doesn't

The cache checkpoint is placed at the end of the content block marked with cache_control. Everything before and including that block is potentially cached. Everything after is re-processed on every call.

System message (cacheable)
  └─ [cache_control: ephemeral]  ← checkpoint here

User turn 1 (NOT cached — changes every call)
Assistant turn 1 (NOT cached)
User turn 2 (NOT cached)

You can place multiple cache breakpoints (up to 4 on Claude): for example, cache the system prompt at position 1 and a large tools definition block at position 2.

Cache TTL on Anthropic is 5 minutes for ephemeral caches. If your application has bursts of requests around a stable document, keep the requests coming within 5-minute windows to maximise hit rate.
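A sketch of the two-breakpoint layout described above, built as a request payload rather than sent (the tool name, schema, and document text are made-up placeholders). A cache_control marker on the last tool definition caches the whole tools block; the marker on the system block extends the cached prefix through the system prompt:

```python
# Hypothetical placeholders for illustration.
SYSTEM_DOCS = "... your 10,000-token reference document ..."

tools = [
    {
        "name": "search_contracts",  # made-up tool for the example
        "description": "Search the contract database by keyword.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
        # Breakpoint 1: a marker on the LAST tool caches all tool definitions.
        "cache_control": {"type": "ephemeral"},
    },
]

request = dict(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=tools,
    system=[
        {
            "type": "text",
            "text": SYSTEM_DOCS,
            # Breakpoint 2: extends the cached prefix through the system prompt.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Which clauses mention liability?"}],
)
# client.messages.create(**request) would send this (requires an API key).
```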

SECTION 05

Cost and latency numbers

Token type                  Price relative to normal
Normal input tokens         1×
Cache write (first call)    1.25×
Cache read (hit)            0.10×
Output tokens               unchanged

Break-even: The cache write costs 25% extra, and each cache read saves 90% on the cached tokens. A single cache hit therefore already pays for the premium: one write plus one read costs 1.35× the prefix price, versus 2× uncached.

Latency: Cache hits skip the prefill computation for the cached prefix. On a 10,000-token prefix, expect ~80–90% reduction in time-to-first-token.
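The break-even arithmetic above, spelled out with the relative prices from the table:

```python
WRITE = 1.25  # cache write, relative to normal input-token price
READ = 0.10   # cache read (hit)

def cached_cost(n_calls: int) -> float:
    # One cache write on the first call, cache reads on the rest
    # (counting only the cached prefix tokens).
    return WRITE + READ * (n_calls - 1)

def uncached_cost(n_calls: int) -> float:
    return 1.0 * n_calls

# A single cache hit already beats paying full price twice: 1.35x vs 2x.
assert cached_cost(2) < uncached_cost(2)
# Savings grow with every additional hit: 10 calls cost 2.15x vs 10x.
assert cached_cost(10) < uncached_cost(10)
```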

SECTION 06

Patterns: RAG, multi-turn, tool definitions

Pattern 1 β€” Long document Q&A (RAG replacement for stable docs):

# Cache the entire document; only the question changes
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system=[{"type":"text","text": full_document, "cache_control":{"type":"ephemeral"}}],
    messages=[{"role":"user","content": user_question}]
)

Pattern 2 β€” Multi-turn conversation with stable context:

# Mark the system prompt for caching; conversation history is NOT cached
system = [{"type":"text","text": PERSONA_PROMPT, "cache_control":{"type":"ephemeral"}}]

messages = []  # grows with each turn
for user_input in conversation:
    messages.append({"role":"user","content": user_input})
    resp = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=system,
        messages=messages
    )
    messages.append({"role":"assistant","content": resp.content[0].text})

Pattern 3 β€” Tool definitions: If you have 50 tool definitions, cache them as the last block before the conversation. Tool definitions are verbose but rarely change.

SECTION 07

Gotchas

Exact prefix matching. A single byte difference = full cache miss. Make sure your system prompt template renders identically across calls (watch out for dynamic timestamps, user IDs, or random seeds injected into the "stable" prefix).

Minimum cacheable length. Anthropic requires the cacheable content to be at least 1,024 tokens (Sonnet and Opus models) or 2,048 tokens (Haiku models). Shorter prefixes are silently not cached — you'll see cache_creation_input_tokens: 0.

Cache doesn't persist across model upgrades. Switching from claude-3-5-sonnet-20241022 to a newer version invalidates all cached prefixes — plan major model upgrades during low-traffic windows.

Context window still counts. Cached tokens still count toward the model's context window limit. Caching reduces cost and latency, not context consumption.

SECTION 08

Cache hit rate optimization

Achieving high prompt cache hit rates requires structuring prompts so that the stable portion appears first and the variable portion appears last. The cache key is computed from the exact byte sequence up to the cache boundary marker, meaning any change in the prefix — including whitespace, punctuation, or encoding differences — invalidates the cache entry. Applications that dynamically construct system prompts from templates must ensure that the stable portions of the template are rendered identically across requests to benefit from caching.

Multi-turn conversation caching requires careful management of which turns are included in the cached prefix versus the live context. Caching the full conversation history up to the last assistant turn and marking the boundary before the most recent user message amortizes the cost of the growing conversation history across subsequent turns. Without caching, a 20-turn conversation with a 2,000-token system prompt incurs full input token charges for all 20 turns plus system prompt on every call; with caching, only the new user message tokens are charged at full input price after the cache warms up.
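One way to sketch the boundary placement for a growing conversation. The helper below is an assumption, not part of the Anthropic SDK: it converts the final message's content to block form and attaches the cache_control marker there, so the whole history up to that turn becomes the cached prefix on the next request:

```python
def mark_cache_boundary(messages: list[dict]) -> list[dict]:
    """Illustrative helper: mark the last content block of the final
    message for caching, leaving the input list unmodified."""
    marked = [dict(m) for m in messages]
    last = marked[-1]
    content = last["content"]
    if isinstance(content, str):
        # Plain-string content must become a content-block list to carry
        # a cache_control field.
        content = [{"type": "text", "text": content}]
    else:
        content = [dict(block) for block in content]
    content[-1] = {**content[-1], "cache_control": {"type": "ephemeral"}}
    last["content"] = content
    return marked

history = [
    {"role": "user", "content": "Summarise clause 4.2."},
    {"role": "assistant", "content": "Clause 4.2 limits liability to ..."},
]
# The next request would append the new user message after this prefix.
cached_prefix = mark_cache_boundary(history)
```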

Monitoring cache hit rates in production identifies whether prompt structure changes have inadvertently broken caching. A sudden drop in cache hit rate — visible in provider usage dashboards or through input token cost per request — often indicates a code change that altered the prompt prefix. Logging the rendered prompt prefix hash alongside each request enables correlation of cache hit rate changes with specific deployment events, making diagnosis straightforward when hit rates unexpectedly decline.
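A minimal sketch of the prefix-hash logging idea; the logger name and log fields are assumptions:

```python
import hashlib
import json
import logging

logger = logging.getLogger("prompt_cache")  # assumed logger name

def log_prefix_hash(request_id: str, system_blocks: list[dict]) -> str:
    # Hash the rendered stable prefix so any deployment that changes it
    # shows up as a new hash, correlating with a cache-hit-rate drop.
    rendered = json.dumps(system_blocks, sort_keys=True, ensure_ascii=False)
    digest = hashlib.sha256(rendered.encode("utf-8")).hexdigest()[:16]
    logger.info("request=%s prefix_hash=%s", request_id, digest)
    return digest
```

Identical prefixes produce identical hashes across requests; any rendering difference, however small, produces a new hash.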

Prompt cache warming strategies reduce cold-start latency for newly deployed application instances. When a new pod starts, its prompt cache is empty and the first requests incur full input token processing latency. For applications with large stable system prompts, sending a synthetic warmup request immediately after startup populates the cache before real traffic arrives. Kubernetes readiness probes that check for cache warmup completion — verified by measuring the latency of a test request against a threshold — ensure that new instances only receive traffic after their caches are warmed, preventing cold-start latency spikes from affecting end users during rolling deployments.
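One way to sketch the warmup check: time a test request and treat the instance as ready only when latency falls under a threshold. The function names and threshold are assumptions, and `send_test_request` stands in for a real API call sharing the production prefix:

```python
import time

WARM_LATENCY_THRESHOLD_S = 2.0  # assumed cutoff for a cache-hit response

def is_cache_warm(send_test_request, threshold_s=WARM_LATENCY_THRESHOLD_S):
    # A cache hit skips prefill, so a warm request should come back fast.
    start = time.monotonic()
    send_test_request()
    return (time.monotonic() - start) < threshold_s

def warm_up(send_test_request, attempts=3):
    # The first call populates the cache (slow, billed at the write rate);
    # retry until a fast response indicates the cache is warm.
    for _ in range(attempts):
        if is_cache_warm(send_test_request):
            return True
    return False
```

A readiness probe would call `warm_up` once at startup and report ready only on success.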

Tool definition caching deserves special attention in agentic applications where large tool schemas are repeated in every request. A single tool-augmented agent with 20 tools defined as JSON schemas may include 2,000–5,000 tokens of tool definitions in every request. Placing the tool definitions in the cached prefix and the user message in the non-cached suffix amortizes this cost entirely, as tool schemas change rarely. The savings are most pronounced for agentic loops with many short user turns, where tool schema tokens might represent 80–90% of total input tokens without caching.
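The amortization arithmetic for the numbers above, as a quick sanity check (the per-tool token counts are illustrative assumptions):

```python
TOOL_TOKENS = 3000  # e.g. 20 tools at ~150 schema tokens each (assumed)
TURN_TOKENS = 50    # a short user message in an agentic loop (assumed)

# On an uncached call, tool schemas dominate the input tokens.
tool_share = TOOL_TOKENS / (TOOL_TOKENS + TURN_TOKENS)
assert tool_share > 0.9

# With the schemas cached at the 0.10x read rate, per-call input cost
# drops to a small fraction of the uncached cost.
uncached = TOOL_TOKENS + TURN_TOKENS
cached = 0.10 * TOOL_TOKENS + TURN_TOKENS
assert cached / uncached < 0.15
```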

Cross-session cache persistence behavior varies by provider and should be verified empirically rather than assumed. Anthropic's prompt caching stores cache entries for 5 minutes with a sliding window that resets each time the cached prefix is reused. For applications with periodic request patterns — a daily report generation task, a nightly batch evaluation — requests may be separated by longer than the cache TTL, making caching ineffective. Restructuring periodic tasks to send requests in tight batches rather than spaced intervals, or verifying that the application's request rate naturally keeps the cache alive, ensures that caching provides the expected cost reduction in production.

Provider-specific cache invalidation rules require application-level handling to avoid unexpected cache misses. On the Anthropic API, inserting a new message anywhere before the cache_control boundary invalidates the cache for all content after the insertion point. This means that multi-turn applications must reconstruct the full conversation history in the same order on every request, with the cache boundary marker placed after the last stable turn. Applications that re-order, edit, or summarize earlier turns to manage context window limits must be designed to keep the cached prefix stable, or the caching benefit disappears as conversation length grows.

Cache efficiency reporting through provider usage APIs or dashboards provides the operational data needed to verify that caching is working as intended. Anthropic's usage API returns cache_read_input_tokens and cache_creation_input_tokens alongside regular input_tokens in each API response, enabling precise tracking of cache hit rates and cost savings at the request level. Aggregating these metrics over time and correlating drops in cache hit rate with deployment events or code changes is the primary diagnostic tool for identifying when caching configuration regressions have been introduced.