Agent Memory

Conversation Buffer

The simplest agent memory: keeping recent messages in the prompt window so the model remembers what was said earlier in the same session. Two strategies: sliding window (last N turns) and summary compression (LLM compresses old turns into a paragraph).

Storage: in-context (sliding window or summary) · Infrastructure needed: zero

SECTION 01

Why agents forget

LLMs are stateless. Each API call is independent — the model has no inherent memory of what you said last turn. Every message you've exchanged must be re-sent in the messages array for the model to "remember" it. This is short-term memory: it lives in the prompt, and it's bounded by the context window.

For a simple chatbot, you just append every turn to a list and send the whole list each time. This works until the conversation grows too long and you exceed the context window (or costs become prohibitive). The conversation buffer pattern manages this problem — deciding which past messages to include and how to compress or discard the rest.
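As a sketch, the append-everything baseline is only a few lines (names like `FullBuffer` are illustrative, not from any library):

```python
class FullBuffer:
    """Naive memory: every turn is kept and re-sent on every call."""

    def __init__(self):
        self.messages: list[dict] = []

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

    def get_messages(self) -> list[dict]:
        return list(self.messages)  # the whole history, every time

buf = FullBuffer()
buf.add("user", "Hi, I'm Deepak.")
buf.add("assistant", "Hello Deepak!")
buf.add("user", "What's my name?")
# Each call re-sends everything, so prompt size grows linearly with turns
print(len(buf.get_messages()))  # 3
```

Every buffer strategy below is a variation on what `get_messages()` returns.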

SECTION 02

The sliding window approach

The simplest solution: keep only the last N turns. Old messages are dropped entirely. This is fast, cheap, and requires no extra LLM calls.

from collections import deque

class SlidingWindowMemory:
    def __init__(self, max_turns: int = 10):
        # deque evicts the oldest entries automatically; * 2 for user+assistant pairs
        self.history = deque(maxlen=max_turns * 2)

    def add(self, role: str, content: str):
        self.history.append({"role": role, "content": content})

    def get_messages(self) -> list[dict]:
        return list(self.history)

memory = SlidingWindowMemory(max_turns=5)

import anthropic

client = anthropic.Anthropic()  # create the client once, not per call

def chat(user_message: str) -> str:
    memory.add("user", user_message)
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        system="You are a helpful assistant.",
        messages=memory.get_messages()
    )
    reply = response.content[0].text
    memory.add("assistant", reply)
    return reply

# Test
chat("My name is Deepak and I'm building an AI mindmap.")
chat("What was the project I mentioned?")  # Model still knows

The downside: context before the window is completely lost. If the user mentioned their name 15 turns ago and you only keep 10 turns, the model won't know their name.

SECTION 03

Summary compression

When the conversation exceeds a threshold, use an LLM to compress old turns into a summary paragraph. The summary is prepended as a "system-level" context, and the window continues from there:

Full history (20 turns):
[Turn 1-10] → "Earlier in this conversation: user introduced themselves
  as Deepak, a developer building a GenAI mindmap. They asked about RAG
  architecture and we discussed chunking strategies and embedding models."
[Turn 11-20] → kept verbatim in messages array

This preserves key facts from old turns while keeping the prompt manageable. The summary is a lossy compression — fine details are lost, but the narrative thread is maintained.
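The trigger logic can be exercised without an API key by swapping in a stand-in summariser (here just truncation); `compress_after` and the helper names are illustrative:

```python
def stub_summarise(text: str) -> str:
    # Stand-in for an LLM call: a real summariser would compress, not truncate
    return text[:100]

def compress_if_needed(history: list[dict], compress_after: int = 6) -> tuple[str, list[dict]]:
    """Fold the oldest half of an over-long history into a summary string."""
    if len(history) <= compress_after:
        return "", history
    old, recent = history[:compress_after // 2], history[compress_after // 2:]
    text = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    return stub_summarise(text), recent

history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(8)
]
summary, recent = compress_if_needed(history)
print(len(recent))  # 5: the oldest 3 turns were folded into the summary
```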

SECTION 04

Implementing both strategies

import anthropic

client = anthropic.Anthropic()

class SummarisedMemory:
    def __init__(self, window_size: int = 8, compress_after: int = 16):
        self.recent: list[dict] = []
        self.summary: str = ""
        self.window_size = window_size
        self.compress_after = compress_after

    def add(self, role: str, content: str):
        self.recent.append({"role": role, "content": content})
        if len(self.recent) > self.compress_after:
            self._compress()

    def _compress(self):
        # Summarise the oldest half of the window
        to_compress = self.recent[:self.compress_after // 2]
        self.recent = self.recent[self.compress_after // 2:]

        turns_text = "\n".join(
            f"{m['role'].title()}: {m['content']}" for m in to_compress
        )
        if self.summary:
            turns_text = f"Previous summary: {self.summary}\n\n{turns_text}"

        resp = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=256,
            messages=[{
                "role": "user",
                "content": f"Summarise this conversation in 2-3 sentences, "
                           f"preserving key facts and names:\n\n{turns_text}"
            }]
        )
        self.summary = resp.content[0].text

    def get_messages(self) -> list[dict]:
        return self.recent[-self.window_size * 2:]

    def get_system(self, base_system: str) -> str:
        if self.summary:
            return f"{base_system}\n\nConversation so far: {self.summary}"
        return base_system
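
At call time the two halves combine like this (hypothetical state after one compression cycle, mirroring what get_system() and get_messages() return):

```python
# Hypothetical state: one compression has already run
summary = "User is Deepak, a developer building a GenAI mindmap; discussed RAG chunking."
recent = [
    {"role": "user", "content": "Which embedding model should I use?"},
    {"role": "assistant", "content": "Any modern sentence-embedding model works."},
]
base_system = "You are a helpful assistant."

# The summary rides in the system prompt (as in get_system)
system = f"{base_system}\n\nConversation so far: {summary}"
# Only the verbatim recent window goes in the messages array (as in get_messages)
window_size = 8
messages = recent[-window_size * 2:]
```

The model sees compressed history as background context and recent turns as real conversation.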

SECTION 05

LangChain memory classes

LangChain provides pre-built memory implementations so you don't have to write this yourself (note: LangChain 0.3+ deprecates these classes in favour of LangGraph persistence, though they remain available and still illustrate the patterns):

from langchain.memory import (
    ConversationBufferMemory,         # keep all turns
    ConversationBufferWindowMemory,   # sliding window
    ConversationSummaryMemory,        # full summary compression
    ConversationSummaryBufferMemory,  # hybrid: summary + recent window
)
from langchain_anthropic import ChatAnthropic
from langchain.chains import ConversationChain

llm = ChatAnthropic(model="claude-haiku-4-5-20251001")

# Hybrid memory: summarise turns beyond token_limit, keep recent ones verbatim
memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=1000,
)

chain = ConversationChain(llm=llm, memory=memory, verbose=True)
chain.predict(input="Hi, I'm Deepak and I'm building a GenAI mindmap.")
chain.predict(input="What project did I mention?")  # still knows

ConversationSummaryBufferMemory is the most practical for production: it keeps recent turns verbatim (fast, faithful) and summarises older ones (compact). Set max_token_limit to roughly 20-30% of your model's context window.

SECTION 06

When to upgrade to long-term memory

Conversation buffer (short-term memory) is the right choice when: you're building a chatbot or assistant within a single session, the relevant context fits within a few thousand tokens, and you don't need to remember facts across separate conversations.

You need long-term memory (vector store + retrieval) when: users return across multiple sessions, you need to remember facts from weeks or months ago, or the volume of information is too large for even summary compression (e.g., an agent that processes hundreds of documents over time).

A practical upgrade path: start with ConversationSummaryBufferMemory for per-session context, and add a vector store lookup at the start of each new session to retrieve relevant facts from past conversations. The two systems are complementary, not competing.
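A toy version of that session-start lookup, with hand-rolled cosine similarity over made-up fact embeddings (a real system would use an embedding model and a vector store):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Facts persisted from past sessions, each with a (made-up) embedding
fact_store = [
    ("User's name is Deepak", [0.9, 0.1, 0.0]),
    ("User is building a GenAI mindmap", [0.1, 0.9, 0.1]),
    ("User prefers Python", [0.0, 0.2, 0.9]),
]

def recall(query_vec: list[float], k: int = 2) -> list[str]:
    """Return the k stored facts most similar to the new session's opening message."""
    ranked = sorted(fact_store, key=lambda f: cosine(query_vec, f[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# A new session opens with a mindmap question, so the mindmap fact ranks first
print(recall([0.2, 0.9, 0.1]))
```

The recalled facts would then be injected into the system prompt alongside the per-session summary buffer.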

SECTION 07

Gotchas

Token counting matters. You can't just count messages — count tokens. A single long user message might be worth more than five short turns. Use the Anthropic SDK's client.messages.count_tokens() or a tokenizer like tiktoken to track actual token usage before building your memory window.

Compression is lossy — and the model can't tell. After summarisation, the model doesn't know what was compressed. If a user references a detail from a compressed turn ("like I said about the edge case in my first question"), the model will have lost that detail. Consider including a note: "earlier turns summarised — ask me if you need specifics".

System messages are not magic memory. Prepending summaries to the system prompt works but can "feel" lower quality to the model than actual conversation history in the messages array. For Anthropic models, the conversation history format (user/assistant alternation) is the most natural way to provide context.

Conversation Memory Strategy Comparison

Conversation buffer management determines how much conversation history an LLM agent retains across turns. The choice of strategy directly affects coherence (can the agent remember what was said earlier?), cost (how many tokens does each turn consume?), and relevance (is the retained context actually useful for the current turn?).

| Strategy        | What is retained            | Token cost        | Coherence               | Best for               |
|-----------------|-----------------------------|-------------------|-------------------------|------------------------|
| Full buffer     | All messages                | Grows unboundedly | Perfect                 | Short conversations    |
| Window buffer   | Last N messages             | Fixed cap         | Good for recent context | General chat           |
| Summary buffer  | Summary + recent N          | Controlled growth | Good overall            | Long conversations     |
| Token buffer    | Messages within token limit | Fixed cap         | Good for recent context | Token-constrained apps |
| Semantic buffer | Most relevant messages      | Controlled        | Task-specific           | Task-focused agents    |

Summary buffer memory compresses older conversation history into a rolling summary while retaining the N most recent messages verbatim. When the buffer approaches its token limit, a summarization call is triggered that condenses the oldest messages into an updated summary paragraph. The verbatim recent messages are preserved because they contain the most immediate context and are difficult to summarize accurately without losing critical details like specific values, names, or instructions the user just provided.

Semantic buffer memory uses embedding similarity to select which past messages to include in the context window, rather than relying purely on recency. This is most useful for task-focused agents where the user's goal stated at the beginning of a long conversation is more relevant to the current turn than the ten intermediate exchanges that happened since. Hybrid approaches that combine recency and semantic relevance tend to outperform either strategy alone.
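A hybrid recency-plus-relevance selection can be sketched with made-up similarity scores standing in for embedding comparisons (a real implementation would embed each message and the current turn):

```python
def select_context(messages: list[dict], scores: list[float],
                   last_n: int = 2, top_k: int = 2) -> list[dict]:
    """Hybrid selection: always keep the last_n newest messages,
    then fill remaining slots with the highest-scoring older ones."""
    recent = messages[-last_n:]
    older = list(enumerate(messages[:-last_n]))
    older.sort(key=lambda pair: scores[pair[0]], reverse=True)
    picked = sorted(older[:top_k])  # restore chronological order
    return [m for _, m in picked] + recent

msgs = [
    {"role": "user", "content": "Goal: migrate our billing service to Go."},
    {"role": "assistant", "content": "Sure, let's plan that."},
    {"role": "user", "content": "Unrelated aside about lunch."},
    {"role": "assistant", "content": "Noted."},
    {"role": "user", "content": "What's the next migration step?"},
]
relevance = [0.95, 0.40, 0.05, 0.10, 0.90]  # made-up similarity to the current turn
ctx = select_context(msgs, relevance)
print(len(ctx))  # 4: the original goal survives, the lunch aside is dropped
```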

Implementing summary buffer memory correctly requires careful handling of the summarization timing to avoid disrupting the conversation flow. Triggering summarization synchronously at the moment the buffer exceeds its limit introduces a latency spike that users notice. Asynchronous summarization — triggering the summarization call in a background task while the current turn proceeds using the un-summarized buffer — prevents this latency spike, but requires the buffer management layer to handle the case where summarization has not yet completed when the next turn arrives.
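A minimal sketch of the asynchronous variant, using a background thread and a stand-in summariser in place of the LLM call; the class and attribute names are illustrative:

```python
import threading

class AsyncSummaryBuffer:
    """Sketch: summarise in a background thread so the current turn isn't blocked."""

    def __init__(self, compress_after: int = 6):
        self.recent: list[dict] = []
        self.summary = ""
        self.compress_after = compress_after
        self._pending: threading.Thread | None = None
        self._lock = threading.Lock()

    def _summarise(self, old: list[dict]):
        # Stand-in for the LLM call; a real implementation would hit the API here
        text = "; ".join(m["content"] for m in old)
        with self._lock:
            self.summary = text[:200]

    def add(self, role: str, content: str):
        self.recent.append({"role": role, "content": content})
        # Only start a new compression if no earlier one is still running:
        # this handles the "previous summarisation not yet complete" case
        if len(self.recent) > self.compress_after and (
            self._pending is None or not self._pending.is_alive()
        ):
            old, self.recent = self.recent[:3], self.recent[3:]
            self._pending = threading.Thread(target=self._summarise, args=(old,))
            self._pending.start()

buf = AsyncSummaryBuffer()
for i in range(8):
    buf.add("user", f"turn {i}")
if buf._pending:
    buf._pending.join()  # in production the thread keeps running in the background
print(bool(buf.summary))  # True
```

While the thread runs, the un-summarised buffer is still served, so no turn waits on the summarisation call.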

Multi-session memory persistence extends conversation buffer concepts to conversations that span multiple sessions over time. Rather than starting each session with an empty context, the agent loads a persisted summary of previous interactions along with any episodic memories flagged as important. Building this correctly requires distinguishing between working memory (intra-session context for the current conversation) and long-term memory (inter-session context persisted to a database), which have different storage, retrieval, and decay requirements.
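
A toy version of that persistence boundary, using a JSON file as the "database" (paths and field names here are illustrative; a real system would use a proper store with per-user keys):

```python
import json
import os
import tempfile

def save_session(path: str, summary: str, important_facts: list[str]):
    """Persist long-term memory at session end; working memory is simply discarded."""
    with open(path, "w") as f:
        json.dump({"summary": summary, "facts": important_facts}, f)

def load_session(path: str) -> dict:
    """Load persisted memory at session start, or begin fresh."""
    if not os.path.exists(path):
        return {"summary": "", "facts": []}
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.gettempdir(), "agent_memory_demo.json")
save_session(path, "Deepak is building a GenAI mindmap.", ["User's name is Deepak"])

# Next session: load long-term memory before the first turn
state = load_session(path)
print(state["facts"])  # ["User's name is Deepak"]
```

Working memory (the in-prompt buffer) is rebuilt each session; only the summary and flagged facts cross the session boundary.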