01 — Foundation
Why Agents Need Memory
An LLM has no persistent state between calls — every call starts from a blank context window. Agents need memory to maintain conversation history, recall facts from past interactions, learn from past task outcomes, store retrieved knowledge temporarily, and persist user preferences across sessions.
Raw pretraining teaches a model to predict plausible continuations. Without memory, an agent cannot build coherent long-term behavior. Four memory types (inspired by cognitive science) address different needs:
- Working memory: active context (the current prompt)
- Episodic memory: past events and interactions
- Semantic memory: facts and knowledge
- Procedural memory: learned skills and patterns
⚠️ The context window IS the agent's working memory. When it fills up, old information is lost. Memory systems exist to decide what to keep, what to compress, and what to offload to external storage.
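One way to see how the four memory types map onto data structures (an illustrative sketch, not a standard schema; the field names are assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    # Working memory: the messages currently in the prompt
    working: list = field(default_factory=list)
    # Episodic memory: records of past events and interactions
    episodes: list = field(default_factory=list)
    # Semantic memory: durable facts, keyed for exact lookup
    facts: dict = field(default_factory=dict)
    # Procedural memory: reusable task -> solution examples
    skills: list = field(default_factory=list)

memory = AgentMemory()
memory.working.append({"role": "user", "content": "Hi"})
memory.facts["preferred_language"] = "Python"
```

The rest of this document fills in how each field is actually stored, retrieved, and pruned.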
02 — Current Context
Working Memory: The Context Window
Everything in the current prompt is working memory. Most flexible, most expensive, limited by context length. Management strategies decide what to keep as conversational history versus what to summarize or discard.
Management Strategies
- Sliding window: drop oldest messages as new ones arrive
- Summary compression: summarize old messages to recover tokens
- Selective retention: keep only salient messages marked for importance
Example: sliding window + summary memory
from collections import deque
from openai import OpenAI

client = OpenAI()

class SlidingWindowMemory:
    def __init__(self, max_messages=20, summary_threshold=15):
        self.messages = deque(maxlen=max_messages)
        self.summary = ""
        self.summary_threshold = summary_threshold

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) >= self.summary_threshold:
            self._compress()

    def _compress(self):
        # Fold the oldest 10 messages (plus any prior summary) into a new summary
        to_summarize = list(self.messages)[:10]
        prior = f"Prior summary: {self.summary}\n" if self.summary else ""
        summary_response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "user", "content": f"Summarize this conversation concisely:\n{prior}{to_summarize}"}
            ],
        )
        self.summary = summary_response.choices[0].message.content
        # Drop the messages that are now covered by the summary
        for _ in range(10):
            if self.messages:
                self.messages.popleft()

    def get_context(self) -> list:
        prefix = [{"role": "system", "content": f"Previous context: {self.summary}"}] if self.summary else []
        return prefix + list(self.messages)
Working Memory Strategies
| Strategy | Memory preserved | Quality | Cost |
| --- | --- | --- | --- |
| Sliding window | Recent N messages | Loses early context | Fixed |
| Summarization | Compressed history | May lose details | Extra LLM call |
| Selective retention | Marked important items | Best quality | Extra LLM calls |
| Full history | Everything | Perfect recall | Grows linearly |
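Selective retention, the one strategy not shown in code above, can be sketched like this. Here the caller supplies the `important` flag directly; in a production system an LLM (or a heuristic) would score each message instead:

```python
class SelectiveMemory:
    """Keep only messages marked important, plus the most recent ones."""

    def __init__(self, keep_recent=4):
        self.messages = []  # all messages, each with an importance flag
        self.keep_recent = keep_recent

    def add(self, role, content, important=False):
        # In practice an LLM call would decide `important`;
        # here the caller supplies the flag directly.
        self.messages.append({"role": role, "content": content, "important": important})

    def get_context(self):
        recent = self.messages[-self.keep_recent:]
        # Older messages survive only if flagged important
        older = [m for m in self.messages[:-self.keep_recent] if m["important"]]
        return older + recent

mem = SelectiveMemory(keep_recent=2)
mem.add("user", "My account ID is 4481", important=True)
mem.add("assistant", "Noted.")
mem.add("user", "What's the weather?")
mem.add("assistant", "Sunny.")
# get_context() -> the flagged message + the 2 most recent
```

This trades an extra scoring step per message for precise recall of the items that actually matter.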
03 — Past Events
Episodic Memory: Vector Stores
Episodic memory records past events and interactions, retrievable by semantic similarity to current context. Implementation: embed each memory (conversation turn, task outcome, observation) → store in vector DB → at query time, retrieve top-k most similar memories → inject into context.
Example: episodic memory with ChromaDB
import chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("agent_episodes")

def store_episode(content: str, metadata: dict):
    embedding = client.embeddings.create(
        model="text-embedding-3-small", input=content).data[0].embedding
    collection.add(
        embeddings=[embedding],
        documents=[content],
        metadatas=[metadata],
        ids=[f"ep_{metadata['timestamp']}"]
    )

def recall_relevant(query: str, k: int = 5) -> list[str]:
    q_emb = client.embeddings.create(
        model="text-embedding-3-small", input=query).data[0].embedding
    results = collection.query(query_embeddings=[q_emb], n_results=k)
    return results["documents"][0]

# Store an episode after each task
store_episode("User asked about Python async; I explained asyncio and provided examples.",
              {"timestamp": "2024-01-15T10:30:00", "task_type": "coding", "success": True})

# Retrieve before responding to new task
relevant = recall_relevant("How do I handle concurrent API calls in Python?")
✓ Add metadata to every stored episode: timestamp, task type, success/failure, user ID. This enables filtered retrieval ("find similar successful coding tasks from the last week") rather than pure semantic search.
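Here is a minimal pure-Python sketch of that filtered retrieval (vector DBs such as ChromaDB expose the same idea through a `where` metadata filter on `collection.query`). The term-overlap score and the fixed `now` are stand-ins for embedding similarity and a real clock:

```python
from datetime import datetime, timedelta

def recall_filtered(episodes, query_terms, task_type=None, since_days=None, k=5):
    """Apply metadata filters first, then rank by a stand-in relevance score."""
    now = datetime(2024, 1, 22)  # fixed "now" for the example
    candidates = []
    for ep in episodes:
        if task_type and ep["task_type"] != task_type:
            continue  # metadata filter: wrong task type
        if since_days is not None:
            ts = datetime.fromisoformat(ep["timestamp"])
            if now - ts > timedelta(days=since_days):
                continue  # metadata filter: too old
        # Stand-in relevance: count query terms present in the episode
        score = sum(t in ep["content"].lower() for t in query_terms)
        candidates.append((score, ep["content"]))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [content for _, content in candidates[:k]]

episodes = [
    {"content": "Explained asyncio for concurrent calls", "task_type": "coding",
     "timestamp": "2024-01-15T10:30:00", "success": True},
    {"content": "Summarised a legal contract", "task_type": "writing",
     "timestamp": "2024-01-20T09:00:00", "success": True},
]
results = recall_filtered(episodes, ["asyncio"], task_type="coding", since_days=7)
```

Filtering before ranking keeps the semantic search honest: a highly similar but failed or stale episode never crowds out a relevant recent success.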
04 — Knowledge & Facts
Semantic Memory: Knowledge Bases
Semantic memory is factual knowledge about the world, domain, or user — not tied to specific events. Sources include product documentation, user profile data, domain knowledge bases, and learned facts. Differs from episodic: episodic = "what happened during task X"; semantic = "this user prefers Python; the API rate limit is 1000/hour".
Example: user preference semantic store
import json
from openai import OpenAI

client = OpenAI()

class UserSemanticMemory:
    def __init__(self):
        self.facts = {}        # key-value for structured facts
        self.preferences = []  # free-text preferences

    def learn_preference(self, preference: str):
        """Store a user preference extracted from conversation."""
        self.preferences.append(preference)

    def get_context_str(self) -> str:
        prefs = "\n".join(f"- {p}" for p in self.preferences[-10:])
        facts = "\n".join(f"- {k}: {v}" for k, v in self.facts.items())
        return f"User facts:\n{facts}\n\nUser preferences:\n{prefs}"

# Auto-extract preferences with LLM
def extract_preferences(conversation: str) -> list[str]:
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=[
        {"role": "user", "content": f"Extract any user preferences from: {conversation}\n"
                                    'Return a JSON object with a "preferences" array.'}
    ], response_format={"type": "json_object"})
    return json.loads(resp.choices[0].message.content).get("preferences", [])
Semantic Memory Storage Options
| Type | Storage | Retrieval | Best for |
| --- | --- | --- | --- |
| Key-value facts | Redis / dict | Exact key lookup | Structured user data |
| Vector embeddings | Vector DB | Semantic similarity | Unstructured knowledge |
| Graph (KG) | Neo4j / NetworkX | Graph traversal | Relational knowledge |
| SQL | Postgres | SQL queries | Structured, queryable |
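A toy adjacency-list version of the graph option (a stand-in for Neo4j or NetworkX, with made-up entities) shows why graphs win for relational, multi-hop questions:

```python
from collections import defaultdict

class GraphMemory:
    """Tiny adjacency-list knowledge graph: entities as nodes, relations as edges."""

    def __init__(self):
        self.edges = defaultdict(list)  # entity -> [(relation, target)]

    def add_fact(self, subject, relation, obj):
        self.edges[subject].append((relation, obj))

    def one_hop(self, entity):
        """Everything directly known about an entity."""
        return self.edges[entity]

    def two_hop(self, entity):
        """Follow each neighbour's outgoing edges too, e.g. the rate
        limit of an API the user uses."""
        results = []
        for rel1, mid in self.edges[entity]:
            for rel2, obj in self.edges[mid]:
                results.append((rel1, mid, rel2, obj))
        return results

kg = GraphMemory()
kg.add_fact("user", "uses", "acme_api")
kg.add_fact("acme_api", "rate_limit", "1000/hour")
print(kg.two_hop("user"))  # [('uses', 'acme_api', 'rate_limit', '1000/hour')]
```

A vector store could retrieve either fact separately, but only traversal connects them into "the API this user relies on is limited to 1000/hour".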
05 — Learned Skills
Procedural Memory: Skills and Patterns
Procedural memory is "how to do things" — stored as few-shot examples, tool definitions, workflow templates, or fine-tuned weights. Retrieval-augmented few-shot: store (task_type → example_pairs) in a vector DB; at query time, retrieve the examples most relevant to the current task and use them as templates.
Skills as memory: agent stores successful task solutions; retrieves similar solutions when facing new tasks; uses them as templates. This is more scalable than fine-tuning and easier to update.
Example: procedural memory with dynamic few-shot retrieval
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    return client.embeddings.create(
        model="text-embedding-3-small", input=text).data[0].embedding

def cosine_sim(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

class ProceduralMemory:
    def __init__(self):
        self.skills = []  # list of {task, solution, embedding}

    def store_skill(self, task_description: str, solution: str):
        emb = embed(task_description)
        self.skills.append({
            "task": task_description,
            "solution": solution,
            "embedding": emb,
        })

    def retrieve_examples(self, current_task: str, k=3) -> list[dict]:
        q_emb = embed(current_task)
        scored = [(cosine_sim(q_emb, s["embedding"]), s) for s in self.skills]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # sort by score only
        return [s for _, s in scored[:k]]
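The retrieved skills then become few-shot examples in the prompt. A minimal sketch (the prompt format and example data are illustrative):

```python
def build_few_shot_prompt(current_task, examples):
    """Turn retrieved skills into a few-shot prompt for the LLM."""
    parts = ["Solve the task, following the style of these past solutions:\n"]
    for ex in examples:
        parts.append(f"Task: {ex['task']}\nSolution: {ex['solution']}\n")
    parts.append(f"Task: {current_task}\nSolution:")
    return "\n".join(parts)

# Examples as they would come back from retrieve_examples()
examples = [
    {"task": "Parse a CSV file", "solution": "Used csv.DictReader with a context manager."},
    {"task": "Fetch a URL with retries", "solution": "Wrapped the request in a retry loop."},
]
prompt = build_few_shot_prompt("Parse a TSV file", examples)
```

Because the examples are retrieved per task, the same agent behaves like it was "trained" on whichever skills are closest to the job at hand, with no weight updates.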
06 — Explicit Memory
MemGPT and Long-Context Architectures
MemGPT: the model manages its own memory explicitly — it decides what to store in "external memory" (a vector DB) versus keep in the context window, using special memory functions. Letta (the successor to MemGPT) is a production-ready stateful agent framework with built-in memory management.
Memory Functions
archival_memory_insert() — Store important information for long-term recall
archival_memory_search() — Search long-term memory
core_memory_append() — Update persona/human profile
Example: MemGPT memory function calling
SYSTEM: You have the following memory tools available:
- archival_memory_insert(content): Store important information
- archival_memory_search(query): Search your long-term memory
- core_memory_append(name, content): Update your persona/human profile
Your core memory:
[PERSONA]: You are a helpful assistant who remembers past interactions.
[HUMAN]: Name: Deepak. Prefers Python. Works in AI/ML.
# Agent decides to store new information:
# archival_memory_insert("User mentioned building a RAG system for legal documents")
⚠️ MemGPT requires reliable function calling. Weak models forget to call memory functions or call them incorrectly. Use GPT-4o or Claude 3.5 class models.
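To make the pattern concrete, here is a hedged sketch of exposing memory functions as OpenAI-style tool definitions with a tiny dispatcher. The `TOOLS` schema and function bodies are illustrative, not Letta's actual implementation:

```python
import json

ARCHIVAL = []  # stand-in for a persistent vector store

def archival_memory_insert(content: str) -> str:
    ARCHIVAL.append(content)
    return "stored"

def archival_memory_search(query: str) -> list[str]:
    # Substring match as a stand-in for semantic search
    return [m for m in ARCHIVAL if query.lower() in m.lower()]

# Tool definition in OpenAI's function-calling schema
TOOLS = [{
    "type": "function",
    "function": {
        "name": "archival_memory_insert",
        "description": "Store important information for long-term recall",
        "parameters": {"type": "object",
                       "properties": {"content": {"type": "string"}},
                       "required": ["content"]},
    },
}]

def dispatch(tool_call_name: str, arguments_json: str):
    """Route a model-issued tool call to the matching memory function."""
    fns = {"archival_memory_insert": archival_memory_insert,
           "archival_memory_search": archival_memory_search}
    return fns[tool_call_name](**json.loads(arguments_json))

# The model would emit this tool call; we execute it:
dispatch("archival_memory_insert", '{"content": "User is building a RAG system"}')
```

The dispatcher is also where you add the guardrails the warning above implies: validate arguments and log every call, since weak models mangle both.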
07 — Practical Approaches
Production Memory Patterns
Core Strategies
1. Tiered Storage — hot, warm, cold
Hot tier: context window (last N messages). Warm tier: in-memory vector store (session). Cold tier: persistent vector DB (across sessions). Each tier is cheaper but slower to retrieve.
- Context window: ~128K tokens (GPT-4 Turbo)
- Session memory: retrieval in milliseconds
- Persistent DB: retrieval in seconds
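The tiered lookup itself is simple. A sketch with dicts standing in for the real tiers, including promotion of cold hits so repeated access gets cheaper:

```python
class TieredMemory:
    """Check hot (context) first, then warm (session), then cold (persistent)."""

    def __init__(self):
        self.hot = {}   # stand-in for the prompt itself
        self.warm = {}  # stand-in for an in-memory vector store
        self.cold = {}  # stand-in for a persistent vector DB

    def get(self, key):
        for tier_name, tier in (("hot", self.hot),
                                ("warm", self.warm),
                                ("cold", self.cold)):
            if key in tier:
                if tier_name == "cold":
                    # Promote cold hits into the warm tier
                    self.warm[key] = tier[key]
                return tier[key], tier_name
        return None, None

mem = TieredMemory()
mem.cold["project"] = "RAG system for legal docs"
value, tier = mem.get("project")    # found in cold, promoted to warm
value2, tier2 = mem.get("project")  # now served from warm
```

The promotion step is the same cache logic as a CPU hierarchy: pay the slow retrieval once, then serve from the faster tier.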
2. Memory Consolidation — compress & prune
Periodically (end of session, or every N turns) run consolidation: summarize episodic memories, extract semantic facts, prune redundant entries. Prevents unbounded memory growth.
- Consolidation every 50–100 messages
- Merge similar episodes into summaries
- Archive old data to cold storage
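A toy consolidation pass, with string concatenation standing in for the LLM summarisation (and fact-extraction) step:

```python
def consolidate(episodes, every=4):
    """Once `every` episodes accumulate, collapse them into one summary entry.
    A real system would summarise with an LLM and extract semantic facts."""
    if len(episodes) < every:
        return episodes
    summary = "Summary of: " + "; ".join(e["content"] for e in episodes[:every])
    # Replace the oldest `every` episodes with a single consolidated entry
    return [{"content": summary, "consolidated": True}] + episodes[every:]

eps = [{"content": f"episode {i}"} for i in range(5)]
eps = consolidate(eps, every=4)  # one summary entry + one remaining episode
```

Running this at the end of each session (or every N turns) keeps episodic storage bounded while preserving the gist.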
3. Memory Scoping — governance
Scope memories to: user (private to one user), session (cleared after the session ends), agent (shared across all users of one agent), or team (shared across a user's agents). Scoping gives you clear data governance boundaries.
- User scope: most restrictive, highest privacy
- Session scope: temporary, fast cleanup
- Agent scope: enables cross-user learning
4. Forgetting & Privacy — TTL & erasure
Implement TTL (time-to-live) on episodic memories. Provide user-facing "forget this" functionality. Comply with GDPR right-to-erasure for user memory stores.
- Default TTL: 90 days for episodic memories
- User deletion requests: immediate erasure
- Compliance audits: trace memory deletion
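TTL pruning can be as simple as filtering on timestamps. A sketch with an injectable clock for testability:

```python
from datetime import datetime, timedelta

def prune_expired(episodes, ttl_days=90, now=None):
    """Drop episodic memories older than the TTL. A user 'forget this'
    request would instead filter on the specific memory's ID."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=ttl_days)
    return [e for e in episodes
            if datetime.fromisoformat(e["timestamp"]) >= cutoff]

episodes = [
    {"timestamp": "2023-10-01T00:00:00", "content": "old"},
    {"timestamp": "2024-01-15T00:00:00", "content": "recent"},
]
kept = prune_expired(episodes, ttl_days=90, now=datetime(2024, 1, 22))
# Only the episode within the 90-day window survives
```

For GDPR erasure, run the same filter eagerly on a deletion request and log the deletion so compliance audits can trace it.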
Learning Path
Agent memory is a system design problem as much as an AI one. Here's the progression:
In-context (simplest memory) → Summarisation (compress history) → External store (vector + key-value) → Knowledge graph (structured memory)
1. Start with in-context memory
Just keep the last N turns in the prompt. This works surprisingly well for sessions up to ~20 turns. Don't over-engineer until you see the context window becoming a bottleneck.
2. Add rolling summarisation
When history approaches your context limit, summarise the oldest messages with the LLM itself. LangChain's ConversationSummaryBufferMemory does this automatically.
3. Use a vector store for long-term recall
Embed each significant fact or exchange, store in Chroma or Weaviate, retrieve the top-k relevant memories at each turn. This is "episodic memory" — the agent remembers relevant past experiences.
4. Use a knowledge graph for structured facts
Tools like Cognee or Zep build a graph of entities and relationships from conversation history. Enables multi-hop queries: "What did I say about the project after my meeting with Alice?"