01 — Foundation
Why Agents Need Memory
An LLM has no persistent state between calls — every call starts from a blank context window. Agents need memory to maintain conversation history, recall facts from past interactions, learn from past task outcomes, store retrieved knowledge temporarily, and persist user preferences across sessions.
Raw pretraining teaches a model to predict plausible continuations. Without memory, an agent cannot build coherent long-term behavior. Four memory types (inspired by cognitive science) address different needs:
- Working memory: active context (the current prompt)
- Episodic memory: past events and interactions
- Semantic memory: facts and knowledge
- Procedural memory: learned skills and patterns
⚠️ The context window IS the agent's working memory. When it fills up, old information is lost. Memory systems exist to decide what to keep, what to compress, and what to offload to external storage.
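One way to see how the four memory types map onto data structures (an illustrative sketch, not a standard schema; the field names are assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    # Working memory: the messages currently in the prompt
    working: list = field(default_factory=list)
    # Episodic memory: records of past events and interactions
    episodes: list = field(default_factory=list)
    # Semantic memory: durable facts, keyed for exact lookup
    facts: dict = field(default_factory=dict)
    # Procedural memory: reusable task -> solution examples
    skills: list = field(default_factory=list)

memory = AgentMemory()
memory.working.append({"role": "user", "content": "Hi"})
memory.facts["preferred_language"] = "Python"
```

The rest of this document fills in how each field is actually stored, retrieved, and pruned.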
02 — Current Context
Working Memory: The Context Window
Everything in the current prompt is working memory. Most flexible, most expensive, limited by context length. Management strategies decide what to keep as conversational history versus what to summarize or discard.
Management Strategies
- Sliding window: drop oldest messages as new ones arrive
- Summary compression: summarize old messages to recover tokens
- Selective retention: keep only salient messages marked for importance
Example: sliding window + summary memory
from collections import deque
from openai import OpenAI

client = OpenAI()

class SlidingWindowMemory:
    def __init__(self, max_messages=20, summary_threshold=15):
        self.messages = deque(maxlen=max_messages)
        self.summary = ""
        self.summary_threshold = summary_threshold

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) >= self.summary_threshold:
            self._compress()

    def _compress(self):
        # Fold the oldest 10 messages (plus any prior summary) into a new summary
        to_summarize = list(self.messages)[:10]
        prior = f"Prior summary: {self.summary}\n" if self.summary else ""
        summary_response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "user", "content": f"Summarize this conversation concisely:\n{prior}{to_summarize}"}
            ],
        )
        self.summary = summary_response.choices[0].message.content
        # Drop the messages that are now covered by the summary
        for _ in range(10):
            if self.messages:
                self.messages.popleft()

    def get_context(self) -> list:
        prefix = [{"role": "system", "content": f"Previous context: {self.summary}"}] if self.summary else []
        return prefix + list(self.messages)
Working Memory Strategies
| Strategy | Memory preserved | Quality | Cost |
| --- | --- | --- | --- |
| Sliding window | Recent N messages | Loses early context | Fixed |
| Summarization | Compressed history | May lose details | Extra LLM call |
| Selective retention | Marked important items | Best quality | Extra LLM calls |
| Full history | Everything | Perfect recall | Grows linearly |
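Selective retention, the one strategy not shown in code above, can be sketched like this. Here the caller supplies the `important` flag directly; in a production system an LLM (or a heuristic) would score each message instead:

```python
class SelectiveMemory:
    """Keep only messages marked important, plus the most recent ones."""

    def __init__(self, keep_recent=4):
        self.messages = []  # all messages, each with an importance flag
        self.keep_recent = keep_recent

    def add(self, role, content, important=False):
        # In practice an LLM call would decide `important`;
        # here the caller supplies the flag directly.
        self.messages.append({"role": role, "content": content, "important": important})

    def get_context(self):
        recent = self.messages[-self.keep_recent:]
        # Older messages survive only if flagged important
        older = [m for m in self.messages[:-self.keep_recent] if m["important"]]
        return older + recent

mem = SelectiveMemory(keep_recent=2)
mem.add("user", "My account ID is 4481", important=True)
mem.add("assistant", "Noted.")
mem.add("user", "What's the weather?")
mem.add("assistant", "Sunny.")
# get_context() -> the flagged message + the 2 most recent
```

This trades an extra scoring step per message for precise recall of the items that actually matter.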
03 — Past Events
Episodic Memory: Vector Stores
Episodic memory records past events and interactions, retrievable by semantic similarity to current context. Implementation: embed each memory (conversation turn, task outcome, observation) → store in vector DB → at query time, retrieve top-k most similar memories → inject into context.
Example: episodic memory with ChromaDB
import chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("agent_episodes")

def store_episode(content: str, metadata: dict):
    embedding = client.embeddings.create(
        model="text-embedding-3-small", input=content).data[0].embedding
    collection.add(
        embeddings=[embedding],
        documents=[content],
        metadatas=[metadata],
        ids=[f"ep_{metadata['timestamp']}"]
    )

def recall_relevant(query: str, k: int = 5) -> list[str]:
    q_emb = client.embeddings.create(
        model="text-embedding-3-small", input=query).data[0].embedding
    results = collection.query(query_embeddings=[q_emb], n_results=k)
    return results["documents"][0]

# Store an episode after each task
store_episode("User asked about Python async; I explained asyncio and provided examples.",
              {"timestamp": "2024-01-15T10:30:00", "task_type": "coding", "success": True})

# Retrieve before responding to new task
relevant = recall_relevant("How do I handle concurrent API calls in Python?")
✓ Add metadata to every stored episode: timestamp, task type, success/failure, user ID. This enables filtered retrieval ("find similar successful coding tasks from the last week") rather than pure semantic search.
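Here is a minimal pure-Python sketch of that filtered retrieval (vector DBs such as ChromaDB expose the same idea through a `where` metadata filter on `collection.query`). The term-overlap score and the fixed `now` are stand-ins for embedding similarity and a real clock:

```python
from datetime import datetime, timedelta

def recall_filtered(episodes, query_terms, task_type=None, since_days=None, k=5):
    """Apply metadata filters first, then rank by a stand-in relevance score."""
    now = datetime(2024, 1, 22)  # fixed "now" for the example
    candidates = []
    for ep in episodes:
        if task_type and ep["task_type"] != task_type:
            continue  # metadata filter: wrong task type
        if since_days is not None:
            ts = datetime.fromisoformat(ep["timestamp"])
            if now - ts > timedelta(days=since_days):
                continue  # metadata filter: too old
        # Stand-in relevance: count query terms present in the episode
        score = sum(t in ep["content"].lower() for t in query_terms)
        candidates.append((score, ep["content"]))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [content for _, content in candidates[:k]]

episodes = [
    {"content": "Explained asyncio for concurrent calls", "task_type": "coding",
     "timestamp": "2024-01-15T10:30:00", "success": True},
    {"content": "Summarised a legal contract", "task_type": "writing",
     "timestamp": "2024-01-20T09:00:00", "success": True},
]
results = recall_filtered(episodes, ["asyncio"], task_type="coding", since_days=7)
```

Filtering before ranking keeps the semantic search honest: a highly similar but failed or stale episode never crowds out a relevant recent success.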
04 — Knowledge & Facts
Semantic Memory: Knowledge Bases
Semantic memory is factual knowledge about the world, domain, or user — not tied to specific events. Sources include product documentation, user profile data, domain knowledge bases, and learned facts. Differs from episodic: episodic = "what happened during task X"; semantic = "this user prefers Python; the API rate limit is 1000/hour".
Example: user preference semantic store
import json
from openai import OpenAI

client = OpenAI()

class UserSemanticMemory:
    def __init__(self):
        self.facts = {}        # key-value for structured facts
        self.preferences = []  # free-text preferences

    def learn_preference(self, preference: str):
        """Store a user preference extracted from conversation."""
        self.preferences.append(preference)

    def get_context_str(self) -> str:
        prefs = "\n".join(f"- {p}" for p in self.preferences[-10:])
        facts = "\n".join(f"- {k}: {v}" for k, v in self.facts.items())
        return f"User facts:\n{facts}\n\nUser preferences:\n{prefs}"

# Auto-extract preferences with LLM
def extract_preferences(conversation: str) -> list[str]:
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=[
        {"role": "user", "content": f"Extract any user preferences from: {conversation}\n"
                                    'Return a JSON object with a "preferences" array.'}
    ], response_format={"type": "json_object"})
    return json.loads(resp.choices[0].message.content).get("preferences", [])
Semantic Memory Storage Options
| Type | Storage | Retrieval | Best for |
| --- | --- | --- | --- |
| Key-value facts | Redis / dict | Exact key lookup | Structured user data |
| Vector embeddings | Vector DB | Semantic similarity | Unstructured knowledge |
| Graph (KG) | Neo4j / NetworkX | Graph traversal | Relational knowledge |
| SQL | Postgres | SQL queries | Structured, queryable |
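A toy adjacency-list version of the graph option (a stand-in for Neo4j or NetworkX, with made-up entities) shows why graphs win for relational, multi-hop questions:

```python
from collections import defaultdict

class GraphMemory:
    """Tiny adjacency-list knowledge graph: entities as nodes, relations as edges."""

    def __init__(self):
        self.edges = defaultdict(list)  # entity -> [(relation, target)]

    def add_fact(self, subject, relation, obj):
        self.edges[subject].append((relation, obj))

    def one_hop(self, entity):
        """Everything directly known about an entity."""
        return self.edges[entity]

    def two_hop(self, entity):
        """Follow each neighbour's outgoing edges too, e.g. the rate
        limit of an API the user uses."""
        results = []
        for rel1, mid in self.edges[entity]:
            for rel2, obj in self.edges[mid]:
                results.append((rel1, mid, rel2, obj))
        return results

kg = GraphMemory()
kg.add_fact("user", "uses", "acme_api")
kg.add_fact("acme_api", "rate_limit", "1000/hour")
print(kg.two_hop("user"))  # [('uses', 'acme_api', 'rate_limit', '1000/hour')]
```

A vector store could retrieve either fact separately, but only traversal connects them into "the API this user relies on is limited to 1000/hour".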
05 — Learned Skills
Procedural Memory: Skills and Patterns
Procedural memory is "how to do things" — stored as few-shot examples, tool definitions, workflow templates, or fine-tuned weights. Retrieval-augmented few-shot: store (task_type → example_pairs) in a vector DB; at query time, retrieve the examples most relevant to the current task and use them as templates.
Skills as memory: agent stores successful task solutions; retrieves similar solutions when facing new tasks; uses them as templates. This is more scalable than fine-tuning and easier to update.
Example: procedural memory with dynamic few-shot retrieval
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    return client.embeddings.create(
        model="text-embedding-3-small", input=text).data[0].embedding

def cosine_sim(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

class ProceduralMemory:
    def __init__(self):
        self.skills = []  # list of {task, solution, embedding}

    def store_skill(self, task_description: str, solution: str):
        emb = embed(task_description)
        self.skills.append({
            "task": task_description,
            "solution": solution,
            "embedding": emb,
        })

    def retrieve_examples(self, current_task: str, k=3) -> list[dict]:
        q_emb = embed(current_task)
        scored = [(cosine_sim(q_emb, s["embedding"]), s) for s in self.skills]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # sort by score only
        return [s for _, s in scored[:k]]
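The retrieved skills then become few-shot examples in the prompt. A minimal sketch (the prompt format and example data are illustrative):

```python
def build_few_shot_prompt(current_task, examples):
    """Turn retrieved skills into a few-shot prompt for the LLM."""
    parts = ["Solve the task, following the style of these past solutions:\n"]
    for ex in examples:
        parts.append(f"Task: {ex['task']}\nSolution: {ex['solution']}\n")
    parts.append(f"Task: {current_task}\nSolution:")
    return "\n".join(parts)

# Examples as they would come back from retrieve_examples()
examples = [
    {"task": "Parse a CSV file", "solution": "Used csv.DictReader with a context manager."},
    {"task": "Fetch a URL with retries", "solution": "Wrapped the request in a retry loop."},
]
prompt = build_few_shot_prompt("Parse a TSV file", examples)
```

Because the examples are retrieved per task, the same agent behaves like it was "trained" on whichever skills are closest to the job at hand, with no weight updates.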
06 — Explicit Memory
MemGPT and Long-Context Architectures
MemGPT: the model manages its own memory explicitly — it decides what to store in "external memory" (a vector DB) versus keep in the context window, using special memory functions. Letta (the successor to MemGPT) is a production-ready stateful agent framework with built-in memory management.
Memory Functions
archival_memory_insert() — Store important information for long-term recall
archival_memory_search() — Search long-term memory
core_memory_append() — Update persona/human profile
Example: MemGPT memory function calling
SYSTEM: You have the following memory tools available:
- archival_memory_insert(content): Store important information
- archival_memory_search(query): Search your long-term memory
- core_memory_append(name, content): Update your persona/human profile
Your core memory:
[PERSONA]: You are a helpful assistant who remembers past interactions.
[HUMAN]: Name: Deepak. Prefers Python. Works in AI/ML.
# Agent decides to store new information:
# archival_memory_insert("User mentioned building a RAG system for legal documents")
⚠️ MemGPT requires reliable function calling. Weak models forget to call memory functions or call them incorrectly. Use GPT-4o or Claude 3.5 class models.
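To make the pattern concrete, here is a hedged sketch of exposing memory functions as OpenAI-style tool definitions with a tiny dispatcher. The `TOOLS` schema and function bodies are illustrative, not Letta's actual implementation:

```python
import json

ARCHIVAL = []  # stand-in for a persistent vector store

def archival_memory_insert(content: str) -> str:
    ARCHIVAL.append(content)
    return "stored"

def archival_memory_search(query: str) -> list[str]:
    # Substring match as a stand-in for semantic search
    return [m for m in ARCHIVAL if query.lower() in m.lower()]

# Tool definition in OpenAI's function-calling schema
TOOLS = [{
    "type": "function",
    "function": {
        "name": "archival_memory_insert",
        "description": "Store important information for long-term recall",
        "parameters": {"type": "object",
                       "properties": {"content": {"type": "string"}},
                       "required": ["content"]},
    },
}]

def dispatch(tool_call_name: str, arguments_json: str):
    """Route a model-issued tool call to the matching memory function."""
    fns = {"archival_memory_insert": archival_memory_insert,
           "archival_memory_search": archival_memory_search}
    return fns[tool_call_name](**json.loads(arguments_json))

# The model would emit this tool call; we execute it:
dispatch("archival_memory_insert", '{"content": "User is building a RAG system"}')
```

The dispatcher is also where you add the guardrails the warning above implies: validate arguments and log every call, since weak models mangle both.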
07 — Practical Approaches
Production Memory Patterns
Core Strategies
1. Tiered Storage — hot, warm, cold
Hot tier: context window (last N messages). Warm tier: in-memory vector store (session). Cold tier: persistent vector DB (across sessions). Each tier is cheaper but slower to retrieve.
- Context window: ~128K tokens (GPT-4 Turbo)
- Session memory: retrieval in milliseconds
- Persistent DB: retrieval in seconds
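The tiered lookup itself is simple. A sketch with dicts standing in for the real tiers, including promotion of cold hits so repeated access gets cheaper:

```python
class TieredMemory:
    """Check hot (context) first, then warm (session), then cold (persistent)."""

    def __init__(self):
        self.hot = {}   # stand-in for the prompt itself
        self.warm = {}  # stand-in for an in-memory vector store
        self.cold = {}  # stand-in for a persistent vector DB

    def get(self, key):
        for tier_name, tier in (("hot", self.hot),
                                ("warm", self.warm),
                                ("cold", self.cold)):
            if key in tier:
                if tier_name == "cold":
                    # Promote cold hits into the warm tier
                    self.warm[key] = tier[key]
                return tier[key], tier_name
        return None, None

mem = TieredMemory()
mem.cold["project"] = "RAG system for legal docs"
value, tier = mem.get("project")    # found in cold, promoted to warm
value2, tier2 = mem.get("project")  # now served from warm
```

The promotion step is the same cache logic as a CPU hierarchy: pay the slow retrieval once, then serve from the faster tier.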
2. Memory Consolidation — compress & prune
Periodically (end of session, or every N turns) run consolidation: summarize episodic memories, extract semantic facts, prune redundant entries. Prevents unbounded memory growth.
- Consolidation every 50–100 messages
- Merge similar episodes into summaries
- Archive old data to cold storage
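A toy consolidation pass, with string concatenation standing in for the LLM summarisation (and fact-extraction) step:

```python
def consolidate(episodes, every=4):
    """Once `every` episodes accumulate, collapse them into one summary entry.
    A real system would summarise with an LLM and extract semantic facts."""
    if len(episodes) < every:
        return episodes
    summary = "Summary of: " + "; ".join(e["content"] for e in episodes[:every])
    # Replace the oldest `every` episodes with a single consolidated entry
    return [{"content": summary, "consolidated": True}] + episodes[every:]

eps = [{"content": f"episode {i}"} for i in range(5)]
eps = consolidate(eps, every=4)  # one summary entry + one remaining episode
```

Running this at the end of each session (or every N turns) keeps episodic storage bounded while preserving the gist.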
3. Memory Scoping — governance
Scope memories to: user (private to one user), session (cleared after the session ends), agent (shared across all users of one agent), or team (shared across a user's agents). Scoping gives you clear data governance boundaries.
- User scope: most restrictive, highest privacy
- Session scope: temporary, fast cleanup
- Agent scope: enables cross-user learning
4. Forgetting & Privacy — TTL & erasure
Implement TTL (time-to-live) on episodic memories. Provide user-facing "forget this" functionality. Comply with GDPR right-to-erasure for user memory stores.
- Default TTL: 90 days for episodic memories
- User deletion requests: immediate erasure
- Compliance audits: trace memory deletion
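TTL pruning can be as simple as filtering on timestamps. A sketch with an injectable clock for testability:

```python
from datetime import datetime, timedelta

def prune_expired(episodes, ttl_days=90, now=None):
    """Drop episodic memories older than the TTL. A user 'forget this'
    request would instead filter on the specific memory's ID."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=ttl_days)
    return [e for e in episodes
            if datetime.fromisoformat(e["timestamp"]) >= cutoff]

episodes = [
    {"timestamp": "2023-10-01T00:00:00", "content": "old"},
    {"timestamp": "2024-01-15T00:00:00", "content": "recent"},
]
kept = prune_expired(episodes, ttl_days=90, now=datetime(2024, 1, 22))
# Only the episode within the 90-day window survives
```

For GDPR erasure, run the same filter eagerly on a deletion request and log the deletion so compliance audits can trace it.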
Learning Path
Agent memory is a system design problem as much as an AI one. Here's the progression:
In-context (simplest memory) → Summarisation (compress history) → External store (vector + key-value) → Knowledge graph (structured memory)
1. Start with in-context memory
Just keep the last N turns in the prompt. This works surprisingly well for sessions up to ~20 turns. Don't over-engineer until you see the context window becoming a bottleneck.
2. Add rolling summarisation
When history approaches your context limit, summarise the oldest messages with the LLM itself. LangChain's ConversationSummaryBufferMemory does this automatically.
3. Use a vector store for long-term recall
Embed each significant fact or exchange, store in Chroma or Weaviate, retrieve the top-k relevant memories at each turn. This is "episodic memory" — the agent remembers relevant past experiences.
4. Use a knowledge graph for structured facts
Tools like Cognee or Zep build a graph of entities and relationships from conversation history. Enables multi-hop queries: "What did I say about the project after my meeting with Alice?"