SECTION 01
What Is Context Compaction?
Context Compaction is a server-side feature of the Anthropic API (launched February 2026) that allows agent conversations to continue effectively indefinitely. When a conversation approaches the model's context window limit, the API automatically summarises older turns into a compact representation, then continues the conversation using that summary plus the recent uncompressed turns.
The result is that long-running agents (research agents, coding agents working on large tasks, document processing pipelines) no longer need to implement context management logic client-side. Compaction happens transparently: the agent simply keeps sending messages, and the API handles the window overflow.
This differs fundamentally from prior client-side context management techniques, because the summarisation is performed by the same model family that will use the summary. The model is optimised to preserve the information most relevant to coherent continuation of the specific task, rather than producing a generic summary of what was said.
Why this matters for production agents: Before compaction, building a long-running agent required non-trivial engineering: detecting when the window was full, deciding what to drop, and writing a summarisation prompt that preserved task state correctly. Compaction eliminates this entire problem class for most agent use cases.
SECTION 02
How Compaction Works
The compaction process operates on the server side, invisibly to the client, using the following flow:
Long agent conversation (100+ turns)
              │
┌─────────────▼─────────────────────────┐
│ Approaching context limit             │
└─────────────┬─────────────────────────┘
              │
┌─────────────▼─────────────────────────┐
│ Compaction API (server-side)          │
│                                       │
│  Summarise turns 1 .. N-K             │
│  Preserve: decisions, facts,          │
│  constraints, current task state,     │
│  tool call outcomes                   │
│                                       │
│  Output: compact summary + turns      │
│  N-K+1 .. N (recent context raw)      │
└─────────────┬─────────────────────────┘
              │
┌─────────────▼─────────────────────────┐
│ Model continues with:                 │
│ [compact summary] + [recent turns]    │
│ Effectively unlimited conversation    │
└───────────────────────────────────────┘
The key design choice is that recent turns are always kept uncompressed. Compaction summarises older history, not the most recent context. This preserves the full fidelity of what is happening right now, while compressing older context that is less immediately relevant.
Compaction can trigger multiple times over the course of a very long run, progressively compressing older summaries. The model always has access to a full semantic representation of the conversation history; only the token volume is reduced.
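A client-side sketch of the same split-and-summarise flow can make the mechanics concrete. The names here (`split_for_compaction`, `compact`, the `summarise` callback) are illustrative, not the API's actual internals; server-side, the summarisation step is a model call rather than a local function:

```python
# Illustrative sketch of the compaction flow (NOT the API's internals):
# split history into older turns to summarise and recent turns kept raw.

def split_for_compaction(messages: list, keep_recent: int = 10) -> tuple:
    """Return (turns_to_summarise, turns_kept_verbatim)."""
    if len(messages) <= keep_recent:
        return [], messages
    return messages[:-keep_recent], messages[-keep_recent:]

def compact(messages: list, summarise, keep_recent: int = 10) -> list:
    """Replace older turns with a summary; keep recent turns uncompressed."""
    old, recent = split_for_compaction(messages, keep_recent)
    if not old:
        return messages  # nothing to compact yet
    summary = summarise(old)  # server-side, this is itself a model call
    return [{"role": "user", "content": f"[Conversation summary]\n{summary}"}] + recent
```

Running `compact` repeatedly over a growing history mirrors the progressive compression described above: each pass folds the previous summary into the new one, so token volume stays bounded while recent turns remain verbatim.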
SECTION 03
Compaction vs. Client-Side Strategies
Prior to compaction, engineers used several client-side strategies to handle context limits in long-running agents. Each has significant limitations:
- Sliding window (drop old turns): The simplest option is to drop turns beyond a fixed window size. Problem: you lose information permanently and abruptly. The agent forgets everything before the window boundary, which often includes the original task specification or key decisions made early in the run.
- Client-side summarisation: Detect when the window is getting full, call the model to summarise older turns, then replace those turns with the summary. Works better than dropping, but requires careful prompt engineering, introduces latency, and the summarisation quality depends on how well you wrote the summarisation prompt, not on the agent's task-specific understanding of what matters.
- Selective retention: Manually tag "important" turns to preserve and drop others. Requires domain-specific logic about what matters. Brittle as task patterns change.
- Context compaction (server-side): The summarisation is done by the same model family that generated the content, optimised to preserve semantic continuity for the specific task. No client-side implementation required. The model's task context informs what gets preserved.
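A minimal sketch of the sliding-window strategy makes its failure mode easy to see: once the window slides past the opening turns, the original task specification is gone for good. `sliding_window` is an illustrative helper, not a library function:

```python
# Sliding-window strategy: keep only the most recent `window` messages.
# Everything older is dropped permanently, including the original task.

def sliding_window(messages: list, window: int) -> list:
    """Return the last `window` messages; older turns are discarded."""
    return messages[-window:]
```

After enough turns, the message that defined the task falls outside the window, and no amount of prompting can recover it, which is exactly the gap that summarisation-based approaches (client-side or server-side) exist to close.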
When not to use compaction: Compaction is not the right tool when you need exact verbatim preservation of specific turns, for example when your agent later needs to quote or verify exact wording from an earlier message. For those cases, maintain a separate structured log alongside the conversation and read from it explicitly.
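One way to keep such a structured log, sketched here with a hypothetical `VerbatimLog` class: record exact wording as turns happen, and have the agent quote from the log rather than from the (possibly compacted) conversation history:

```python
# Sketch of a verbatim log kept alongside the compactable conversation.
# `VerbatimLog` is an illustrative name, not part of any SDK.

class VerbatimLog:
    def __init__(self):
        self._entries = []

    def record(self, turn: int, role: str, text: str) -> None:
        """Store the exact wording of a turn for later quotation."""
        self._entries.append({"turn": turn, "role": role, "text": text})

    def quote(self, turn: int) -> str:
        """Return the exact original wording of a turn."""
        for entry in self._entries:
            if entry["turn"] == turn:
                return entry["text"]
        raise KeyError(f"no entry for turn {turn}")
```

Because the log lives outside the conversation, compaction can compress freely: the agent retrieves exact wording through an explicit lookup instead of relying on the summary.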
SECTION 04
Implementation
Enabling compaction requires only a single addition to your existing API calls: include "context-compaction-2026-02" in the betas array. The rest of your agent loop is unchanged.
import anthropic

client = anthropic.Anthropic()

def long_running_agent(initial_task: str, tools: list, max_turns: int = 200) -> str:
    """Agent that uses context compaction for arbitrarily long runs."""
    messages = [{"role": "user", "content": initial_task}]

    for turn in range(max_turns):
        # Enable context compaction with the beta header
        resp = client.beta.messages.create(
            model="claude-opus-4-5",
            max_tokens=4096,
            tools=tools,
            messages=messages,
            betas=["context-compaction-2026-02"],  # <-- the only change needed
        )

        # Optional: detect compaction events for observability
        if hasattr(resp, "usage") and hasattr(resp.usage, "compacted_tokens"):
            print(f"  [turn {turn}] Compaction: {resp.usage.compacted_tokens} tokens summarised")

        messages.append({"role": "assistant", "content": resp.content})

        if resp.stop_reason == "end_turn":
            for block in reversed(resp.content):
                if hasattr(block, "text"):
                    return block.text
            break

        # Tool handling is identical to a non-compacted agent
        tool_results = []
        for block in resp.content:
            if block.type == "tool_use":
                result = dispatch_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })
        if tool_results:
            messages.append({"role": "user", "content": tool_results})

    return "Agent reached max_turns limit."

def dispatch_tool(name: str, args: dict) -> str:
    # Your tool dispatch logic
    return f"[result of {name}]"
The compaction beta header can be combined with other beta features. If you are already using tool use, extended output, or other beta flags, simply add the compaction beta to the existing list.
Cost note: Compaction consumes tokens, since the summarisation step is itself a model call. For most long-running agents, this overhead is negligible compared to the cost of the full agent run. However, if you are running many short agents (under 15 turns each) that rarely hit the context limit, the compaction overhead may not be justified. Enable it selectively based on expected run length.
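A small helper can make that selective enablement explicit. This is a hedged sketch: `betas_for_run` is an illustrative name, and the 15-turn threshold simply mirrors the guidance above and should be tuned per workload:

```python
# Illustrative helper: enable the compaction beta only for runs that are
# expected to be long enough to approach the context limit.

COMPACTION_BETA = "context-compaction-2026-02"

def betas_for_run(expected_turns: int, other_betas: list = None) -> list:
    """Build the betas list, adding compaction only for long runs.

    The 15-turn threshold is an assumption taken from the cost note
    above; tune it based on your agents' actual context usage.
    """
    betas = list(other_betas or [])
    if expected_turns >= 15:
        betas.append(COMPACTION_BETA)
    return betas
```

Because the helper appends to any existing list, it also illustrates the point above about combining the compaction beta with other beta flags already in use.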
SECTION 05
What Compaction Preserves (and What It Doesn't)
Understanding the information-preservation guarantees of compaction is critical for designing reliable agents.
Compaction preserves (semantic level):
- The original task specification and high-level goal
- Key decisions made during the run ("we decided to use PostgreSQL over SQLite because...")
- Constraints and requirements established in earlier turns
- The outcomes of tool calls (what the tool returned, what the agent concluded from it)
- The current state of the task (what is done, what remains)
Compaction does not preserve (verbatim level):
- Exact wording of earlier messages
- The precise sequence of tool calls (the outcomes are preserved, not the individual call records)
- Intermediate reasoning chains that did not lead to decisions or conclusions
The practical implication: if your agent needs to know "did I already call tool X on file Y?", do not rely on compaction to preserve that. Maintain a separate structured state dictionary that the agent explicitly updates and reads, for example a set of processed file paths that gets included in the system prompt. Compaction is for semantic continuity; structured state is for exact bookkeeping.
# Pattern: structured state + compaction
# The agent maintains explicit state alongside the compacted conversation
import json
import anthropic

client = anthropic.Anthropic()

def agent_with_state(task: str, files_to_process: list) -> str:
    # Explicit state that compaction does not need to preserve
    state = {
        "processed_files": [],
        "errors": [],
        "findings": [],
    }

    system_prompt = f"""You are a code analysis agent.
Task: {task}

Current state (updated after each file):
{{state_json}}

Always read the current state before deciding what to do next.
Update the state after completing each file."""

    messages = [{"role": "user", "content": f"Process these files: {files_to_process}"}]

    for turn in range(100):
        # Inject fresh state into every system prompt
        current_system = system_prompt.replace("{state_json}", json.dumps(state, indent=2))
        resp = client.beta.messages.create(
            model="claude-opus-4-5",
            max_tokens=4096,
            system=current_system,
            messages=messages,
            betas=["context-compaction-2026-02"],
        )
        # ... handle tool calls, update state explicitly ...
        if resp.stop_reason == "end_turn":
            break

    return json.dumps(state["findings"], indent=2)
SECTION 06
Combining Compaction with Checkpointing
Compaction handles one failure mode of long-running agents (hitting the context window limit) but not the other (process failure, API timeout, cost overrun mid-run). For truly robust long-running agents, compaction and checkpointing address complementary problems and should both be used.
- Compaction: Keeps the agent running indefinitely within a single API session. Handles context overflow transparently. Does not help if the process crashes or you need to pause and resume.
- Checkpointing: Saves the agent's state periodically so the run can be resumed from the last checkpoint if interrupted. Does not help with context overflow within a single resumed session.
The combined pattern: checkpoint the structured state (processed files, decisions, results) after every N turns, and enable compaction to handle the context window. If the agent is interrupted, restore from the checkpoint and resume; compaction will handle the new session's context growth just as it handled the previous one.
The long-running agent reliability stack: structured state (explicit bookkeeping) + compaction (context continuity) + checkpointing (fault tolerance) + max_turns limit (cost control) + escalation path (human-in-the-loop on failure). Each layer addresses a distinct failure mode. Compaction is a critical new addition to this stack, but it works best alongside the others.