The simplest agent memory: keeping recent messages in the prompt window so the model remembers what was said earlier in the same session. Two strategies: sliding window (last N turns) and summary compression (LLM compresses old turns into a paragraph).
LLMs are stateless. Each API call is independent — the model has no inherent memory of what you said last turn. Every message you've exchanged must be re-sent in the messages array for the model to "remember" it. This is short-term memory: it lives in the prompt, and it's bounded by the context window.
For a simple chatbot, you just append every turn to a list and send the whole list each time. This works until the conversation grows too long and you exceed the context window (or costs become prohibitive). The conversation buffer pattern manages this problem — deciding which past messages to include and how to compress or discard the rest.
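In code, the append-everything approach is just a list that gets re-sent in full every turn. A minimal sketch (with `call_llm` as a stand-in for a real API call):

```python
# Naive full-buffer approach: append every turn and resend the whole history.
history: list[dict] = []

def call_llm(messages: list[dict]) -> str:
    # Placeholder: a real implementation would call a chat completions API
    # with the full `messages` list.
    return f"(reply to {messages[-1]['content']!r})"

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    reply = call_llm(history)  # the entire history is re-sent every turn
    history.append({"role": "assistant", "content": reply})
    return reply

chat("Hello")
chat("What did I just say?")
# history now holds all four messages; the prompt grows with every turn
```

The growth is the whole problem: each turn adds two messages, so prompt size (and cost) scales linearly with conversation length.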
The simplest solution: keep only the last N turns. Old messages are dropped entirely. This is fast, cheap, and requires no extra LLM calls.
```python
from collections import deque

import anthropic


class SlidingWindowMemory:
    def __init__(self, max_turns: int = 10):
        # * 2 because each turn is a user + assistant message pair
        self.history = deque(maxlen=max_turns * 2)
        self.system_prompt = ""

    def add(self, role: str, content: str):
        self.history.append({"role": role, "content": content})

    def get_messages(self) -> list[dict]:
        return list(self.history)


memory = SlidingWindowMemory(max_turns=5)

def chat(user_message: str) -> str:
    memory.add("user", user_message)
    response = anthropic.Anthropic().messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        system="You are a helpful assistant.",
        messages=memory.get_messages(),
    )
    reply = response.content[0].text
    memory.add("assistant", reply)
    return reply

# Test
chat("My name is Deepak and I'm building an AI mindmap.")
chat("What was the project I mentioned?")  # model still knows
```
The downside: context before the window is completely lost. If the user mentioned their name 15 turns ago and you only keep 10 turns, the model won't know their name.
When the conversation exceeds a threshold, use an LLM to compress old turns into a summary paragraph. The summary is prepended as a "system-level" context, and the window continues from there:
```
Full history (20 turns):

[Turns 1-10]  → summarised: "Earlier in this conversation: the user
                introduced themselves as Deepak, a developer building a
                GenAI mindmap. They asked about RAG architecture and we
                discussed chunking strategies and embedding models."
[Turns 11-20] → kept verbatim in the messages array
```
This preserves key facts from old turns while keeping the prompt manageable. The summary is a lossy compression — fine details are lost, but the narrative thread is maintained.
```python
import anthropic

client = anthropic.Anthropic()


class SummarisedMemory:
    def __init__(self, window_size: int = 8, compress_after: int = 16):
        self.recent: list[dict] = []
        self.summary: str = ""
        self.window_size = window_size
        self.compress_after = compress_after

    def add(self, role: str, content: str):
        self.recent.append({"role": role, "content": content})
        if len(self.recent) > self.compress_after:
            self._compress()

    def _compress(self):
        # Summarise the oldest half of the window
        to_compress = self.recent[:self.compress_after // 2]
        self.recent = self.recent[self.compress_after // 2:]
        turns_text = "\n".join(
            f"{m['role'].title()}: {m['content']}" for m in to_compress
        )
        if self.summary:
            turns_text = f"Previous summary: {self.summary}\n{turns_text}"
        resp = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": "Summarise this conversation in 2-3 sentences, "
                           f"preserving key facts and names:\n{turns_text}",
            }],
        )
        self.summary = resp.content[0].text

    def get_messages(self) -> list[dict]:
        return self.recent[-self.window_size * 2:]

    def get_system(self, base_system: str) -> str:
        if self.summary:
            return f"{base_system}\nConversation so far: {self.summary}"
        return base_system
```
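The compression trigger can be exercised without any API calls by stubbing out the summarisation step. The `StubSummarisedMemory` class below is a toy reduction for demonstration, not part of the real class:

```python
# Stripped-down summary-compression loop with the LLM call stubbed out,
# to show exactly when compression fires.
class StubSummarisedMemory:
    def __init__(self, window_size: int = 2, compress_after: int = 4):
        self.recent: list[dict] = []
        self.summary = ""
        self.window_size = window_size
        self.compress_after = compress_after

    def add(self, role: str, content: str):
        self.recent.append({"role": role, "content": content})
        if len(self.recent) > self.compress_after:
            self._compress()

    def _compress(self):
        cut = self.compress_after // 2
        to_compress, self.recent = self.recent[:cut], self.recent[cut:]
        # Stub: a real implementation sends `to_compress` to an LLM.
        self.summary = f"(summary of {len(to_compress)} messages)"


mem = StubSummarisedMemory()
for i in range(5):
    mem.add("user", f"message {i}")
# The 5th message pushed the buffer past compress_after=4,
# so the oldest 2 were folded into the summary.
```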
LangChain provides pre-built memory implementations so you don't have to write this yourself (these classes are deprecated in LangChain 0.3+ in favour of LangGraph persistence, but they remain a useful illustration of the patterns):
```python
from langchain.memory import (
    ConversationBufferMemory,         # keep all turns
    ConversationBufferWindowMemory,   # sliding window
    ConversationSummaryMemory,        # full summary compression
    ConversationSummaryBufferMemory,  # hybrid: summary + recent window
)
from langchain_anthropic import ChatAnthropic
from langchain.chains import ConversationChain

llm = ChatAnthropic(model="claude-haiku-4-5-20251001")

# Hybrid memory: summarise turns beyond max_token_limit, keep recent ones verbatim
memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=1000,
)

chain = ConversationChain(llm=llm, memory=memory, verbose=True)
chain.predict(input="Hi, I'm Deepak and I'm building a GenAI mindmap.")
chain.predict(input="What project did I mention?")  # still knows
```
ConversationSummaryBufferMemory is the most practical for production: it keeps recent turns verbatim (fast, faithful) and summarises older ones (compact). Set max_token_limit to roughly 20-30% of your model's context window.
Conversation buffer (short-term memory) is the right choice when: you're building a chatbot or assistant within a single session, the relevant context fits within a few thousand tokens, and you don't need to remember facts across separate conversations.
You need long-term memory (vector store + retrieval) when: users return across multiple sessions, you need to remember facts from weeks or months ago, or the volume of information is too large for even summary compression (e.g., an agent that processes hundreds of documents over time).
A practical upgrade path: start with ConversationSummaryBufferMemory for per-session context, and add a vector store lookup at the start of each new session to retrieve relevant facts from past conversations. The two systems are complementary, not competing.
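A sketch of that session-start lookup, with a toy keyword-overlap ranker standing in for a real embedding index (`retrieve`, `start_session`, and `past_session_facts` are all hypothetical names):

```python
import re

# Facts persisted from earlier sessions; a real system would store these
# in a vector database with embeddings rather than a plain list.
past_session_facts = [
    "User's name is Deepak.",
    "User is building a GenAI mindmap.",
    "User prefers Python examples.",
]

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z']+", text.lower()))

def retrieve(query: str, facts: list[str], k: int = 2) -> list[str]:
    # Toy relevance score: word overlap with the query. A real store
    # would rank by embedding cosine similarity instead.
    q = tokens(query)
    scored = sorted(facts, key=lambda f: -len(q & tokens(f)))
    return scored[:k]

def start_session(first_message: str, base_system: str) -> str:
    # New session starts warm: relevant past-session facts are folded
    # into the system prompt before the first model call.
    facts = retrieve(first_message, past_session_facts)
    return base_system + "\nKnown from past sessions:\n" + "\n".join(
        f"- {f}" for f in facts
    )

system = start_session("Can we continue on the mindmap?",
                       "You are a helpful assistant.")
```

The per-session summary buffer then runs unchanged on top of this enriched system prompt.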
Token counting matters. You can't just count messages; count tokens. A single long user message can cost more than five short turns. Use the Anthropic SDK's `client.messages.count_tokens()` (or `tiktoken` for OpenAI models) to measure actual token usage before deciding what fits in your memory window.
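A token-budgeted trim then drops the oldest messages until the buffer fits. The chars/4 estimate below is a rough stand-in so the sketch runs offline; production code should call the provider's real token counter:

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic (~4 chars per token for English), not a real tokeniser.
    return max(1, len(text) // 4)

def trim_to_budget(messages: list[dict], budget: int) -> list[dict]:
    kept: list[dict] = []
    used = 0
    for msg in reversed(messages):  # walk newest-first
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break  # everything older than this is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order

msgs = [
    {"role": "user", "content": "x" * 400},       # ~100 tokens
    {"role": "assistant", "content": "y" * 400},  # ~100 tokens
    {"role": "user", "content": "z" * 40},        # ~10 tokens
]
trimmed = trim_to_budget(msgs, budget=120)
# Only the last two messages fit the 120-token budget.
```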
Compression is lossy — and the model can't tell. After summarisation, the model doesn't know what was compressed. If a user references a detail from a compressed turn ("like I said about the edge case in my first question"), the model will have lost that detail. Consider including a note: "earlier turns summarised — ask me if you need specifics".
System messages are not magic memory. Prepending summaries to the system prompt works, but it can "feel" lower quality to the model than actual conversation history in the messages array. For Anthropic models, the conversation history format (alternating user/assistant messages) is the most natural way to provide context.
Conversation buffer management determines how much conversation history an LLM agent retains across turns. The choice of strategy directly affects coherence (can the agent remember what was said earlier?), cost (how many tokens does each turn consume?), and relevance (is the retained context actually useful for the current turn?).
| Strategy | What Is Retained | Token Cost | Coherence | Best For |
|---|---|---|---|---|
| Full Buffer | All messages | Grows unboundedly | Perfect | Short conversations |
| Window Buffer | Last N messages | Fixed cap | Good for recent context | General chat |
| Summary Buffer | Summary + recent N | Controlled growth | Good overall | Long conversations |
| Token Buffer | Messages within token limit | Fixed cap | Good for recent context | Token-constrained apps |
| Semantic Buffer | Most relevant messages | Controlled | Task-specific | Task-focused agents |
Summary buffer memory compresses older conversation history into a rolling summary while retaining the N most recent messages verbatim. When the buffer approaches its token limit, a summarization call is triggered that condenses the oldest messages into an updated summary paragraph. The verbatim recent messages are preserved because they contain the most immediate context and are difficult to summarize accurately without losing critical details like specific values, names, or instructions the user just provided.
Semantic buffer memory uses embedding similarity to select which past messages to include in the context window, rather than relying purely on recency. This is most useful for task-focused agents where the user's goal stated at the beginning of a long conversation is more relevant to the current turn than the ten intermediate exchanges that happened since. Hybrid approaches that combine recency and semantic relevance tend to outperform either strategy alone.
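A sketch of semantic selection, with a bag-of-words "embedding" standing in for a real embedding model (the function names are illustrative):

```python
import math
import re

def embed(text: str) -> dict[str, int]:
    # Toy bag-of-words vector; a real system would call an embedding model.
    vec: dict[str, int] = {}
    for w in re.findall(r"[a-z']+", text.lower()):
        vec[w] = vec.get(w, 0) + 1
    return vec

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_relevant(history: list[dict], query: str, k: int = 2) -> list[dict]:
    # Rank past messages by similarity to the current turn, keep top-k,
    # but preserve their original chronological order.
    q = embed(query)
    top = sorted(history, key=lambda m: -cosine(embed(m["content"]), q))[:k]
    return [m for m in history if m in top]

history = [
    {"role": "user", "content": "My goal is to build a RAG pipeline for legal documents."},
    {"role": "assistant", "content": "Great, let's start with chunking."},
    {"role": "user", "content": "What's for lunch though?"},
    {"role": "assistant", "content": "I can't help with lunch."},
]
relevant = select_relevant(history, "Which embedding model suits my RAG pipeline?")
# The goal statement from the start of the conversation ranks first,
# even though two lunch-related turns are more recent.
```

A hybrid scorer would add a recency term (e.g. a weighted sum of cosine similarity and message age) rather than ranking on similarity alone.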
Implementing summary buffer memory correctly requires careful handling of the summarization timing to avoid disrupting the conversation flow. Triggering summarization synchronously at the moment the buffer exceeds its limit introduces a latency spike that users notice. Asynchronous summarization — triggering the summarization call in a background task while the current turn proceeds using the un-summarized buffer — prevents this latency spike, but requires the buffer management layer to handle the case where summarization has not yet completed when the next turn arrives.
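One way to structure the background compression with `asyncio`: the summarisation "call" runs as a fire-and-forget task, and the next turn awaits it only if it is still in flight. The `AsyncBuffer` class and its stubbed summarisation step are illustrative assumptions, not a production design:

```python
import asyncio

class AsyncBuffer:
    def __init__(self, compress_after: int = 4):
        self.recent: list[dict] = []
        self.summary = ""
        self.compress_after = compress_after
        self._task: asyncio.Task | None = None
        self._lock = asyncio.Lock()  # guards the buffer against concurrent edits

    async def add(self, role: str, content: str):
        async with self._lock:
            self.recent.append({"role": role, "content": content})
            over_limit = len(self.recent) > self.compress_after
        # Fire-and-forget: compression runs while the current turn proceeds.
        if over_limit and (self._task is None or self._task.done()):
            self._task = asyncio.create_task(self._compress())

    async def _compress(self):
        async with self._lock:
            cut = self.compress_after // 2
            old, self.recent = self.recent[:cut], self.recent[cut:]
        await asyncio.sleep(0)  # stand-in for the real summarisation API call
        self.summary = f"(summary of {len(old)} messages)"

    async def drain(self):
        # The next turn must wait here if compression hasn't finished yet.
        if self._task and not self._task.done():
            await self._task

async def main() -> AsyncBuffer:
    buf = AsyncBuffer()
    for i in range(5):
        await buf.add("user", f"message {i}")
    await buf.drain()
    return buf

buf = asyncio.run(main())
```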
Multi-session memory persistence extends conversation buffer concepts to conversations that span multiple sessions over time. Rather than starting each session with an empty context, the agent loads a persisted summary of previous interactions along with any episodic memories flagged as important. Building this correctly requires distinguishing between working memory (intra-session context for the current conversation) and long-term memory (inter-session context persisted to a database), which have different storage, retrieval, and decay requirements.
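The working-memory / long-term split can be sketched as a persist-on-end, load-on-start cycle. The JSON file and the `save_session` / `load_session` helpers are hypothetical stand-ins for a real database:

```python
import json
import os
import tempfile

def save_session(path: str, summary: str, important: list[str]) -> None:
    # End of session: persist the rolling summary plus any episodic
    # memories flagged as important. A real system would use a database.
    with open(path, "w") as f:
        json.dump({"summary": summary, "important": important}, f)

def load_session(path: str) -> dict:
    # Start of session: load persisted long-term memory, or start cold.
    if not os.path.exists(path):
        return {"summary": "", "important": []}
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "session.json")

# Session 1 ends: working memory's summary becomes long-term memory.
save_session(path, "User is Deepak, building a GenAI mindmap.",
             ["prefers Python examples"])

# Session 2 starts warm instead of empty.
state = load_session(path)
```

From here, `state["summary"]` is injected into the system prompt exactly like an intra-session summary, while the fresh message buffer handles the new conversation.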