Agent Architecture

Agent Memory Systems

How agents remember — working memory, episodic recall, semantic stores, and procedural knowledge

in-context → vector → database — the storage hierarchy
short-term + long-term — the two memory regimes
read → reason → write — the memory loop
Contents
  1. Why agents need memory
  2. Working memory
  3. Episodic memory
  4. Semantic memory
  5. Procedural memory
  6. MemGPT & long-context
  7. Production patterns
01 — Foundation

Why Agents Need Memory

An LLM has no persistent state between calls — every call starts from a blank context window. Agents need memory to maintain conversation history, recall facts from past interactions, learn from past task outcomes, store retrieved knowledge temporarily, and persist user preferences across sessions.

Pretraining teaches a model to predict plausible continuations; it gives the model no mechanism to carry state across interactions. Without memory, an agent cannot build coherent long-term behavior. Four memory types (inspired by cognitive science) address different needs: working, episodic, semantic, and procedural.

⚠️ The context window IS the agent's working memory. When it fills up, old information is lost. Memory systems exist to decide what to keep, what to compress, and what to offload to external storage.
02 — Current Context

Working Memory: The Context Window

Everything in the current prompt is working memory — the most flexible and most expensive tier, limited by context length. Management strategies decide what to keep as conversational history and what to summarize or discard.

Management Strategies

Example: sliding window + summary memory

from collections import deque
from openai import OpenAI

client = OpenAI()

class SlidingWindowMemory:
    def __init__(self, max_messages=20, summary_threshold=15):
        self.messages = deque(maxlen=max_messages)
        self.summary = ""
        self.summary_threshold = summary_threshold

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) >= self.summary_threshold:
            self._compress()

    def _compress(self):
        # Summarize the oldest 10 messages, then drop them from the window
        to_summarize = list(self.messages)[:10]
        summary_response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"Summarize this conversation concisely:\n{to_summarize}",
            }],
        )
        self.summary = summary_response.choices[0].message.content
        for _ in range(10):
            if self.messages:
                self.messages.popleft()

    def get_context(self) -> list:
        prefix = ([{"role": "system", "content": f"Previous context: {self.summary}"}]
                  if self.summary else [])
        return prefix + list(self.messages)

Working Memory Strategies

Strategy            | Memory preserved       | Quality             | Cost
Sliding window      | Recent N messages      | Loses early context | Fixed
Summarization       | Compressed history     | May lose details    | Extra LLM call
Selective retention | Marked important items | Best quality        | Extra LLM calls
Full history        | Everything             | Perfect recall      | Grows linearly
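The selective-retention row can be sketched as a window that evicts unmarked messages first. This is a minimal illustration — the `important` flag and the eviction policy are assumptions for the sketch, not a specific framework's API:

```python
class SelectiveMemory:
    """Keep at most max_messages, evicting unmarked messages first."""

    def __init__(self, max_messages=6):
        self.max_messages = max_messages
        self.messages = []  # each item: {"role", "content", "important"}

    def add(self, role, content, important=False):
        self.messages.append({"role": role, "content": content, "important": important})
        while len(self.messages) > self.max_messages:
            # Evict the oldest non-important message; fall back to the oldest overall
            victim = next((m for m in self.messages if not m["important"]),
                          self.messages[0])
            self.messages.remove(victim)

mem = SelectiveMemory(max_messages=3)
mem.add("system", "You are a travel agent.", important=True)
for i in range(5):
    mem.add("user", f"message {i}")
contents = [m["content"] for m in mem.messages]
# The marked system message survives; only the freshest user turns remain
```

The important-marking itself is usually done by the LLM ("is this worth remembering?") — that call is the extra cost the table refers to.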
03 — Past Events

Episodic Memory: Vector Stores

Episodic memory records past events and interactions, retrievable by semantic similarity to current context. Implementation: embed each memory (conversation turn, task outcome, observation) → store in vector DB → at query time, retrieve top-k most similar memories → inject into context.

Example: episodic memory with ChromaDB

import chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("agent_episodes")

def store_episode(content: str, metadata: dict):
    embedding = client.embeddings.create(
        model="text-embedding-3-small", input=content
    ).data[0].embedding
    collection.add(
        embeddings=[embedding],
        documents=[content],
        metadatas=[metadata],
        ids=[f"ep_{metadata['timestamp']}"],
    )

def recall_relevant(query: str, k: int = 5) -> list[str]:
    q_emb = client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    results = collection.query(query_embeddings=[q_emb], n_results=k)
    return results["documents"][0]

# Store an episode after each task
store_episode(
    "User asked about Python async; I explained asyncio and provided examples.",
    {"timestamp": "2024-01-15T10:30:00", "task_type": "coding", "success": True},
)

# Retrieve before responding to a new task
relevant = recall_relevant("How do I handle concurrent API calls in Python?")
Add metadata to every stored episode: timestamp, task type, success/failure, user ID. This enables filtered retrieval ("find similar successful coding tasks from the last week") rather than pure semantic search.
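Filtered retrieval of this kind can be sketched in plain Python — a toy in-memory version of the metadata filter a vector DB applies before ranking by similarity (the episode fields and the `recall` helper are illustrative assumptions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 2-d "embeddings" stand in for real embedding vectors
episodes = [
    {"text": "Fixed asyncio bug", "emb": [1.0, 0.0], "task_type": "coding", "success": True},
    {"text": "Booked a flight",   "emb": [0.0, 1.0], "task_type": "travel", "success": True},
    {"text": "Broken SQL query",  "emb": [0.9, 0.1], "task_type": "coding", "success": False},
]

def recall(query_emb, k=2, **filters):
    # Filter on metadata first, then rank the survivors by similarity
    pool = [e for e in episodes if all(e.get(f) == v for f, v in filters.items())]
    pool.sort(key=lambda e: cosine(query_emb, e["emb"]), reverse=True)
    return [e["text"] for e in pool[:k]]

hits = recall([1.0, 0.1], k=2, task_type="coding", success=True)
# → only the successful coding episode survives the filter
```

In ChromaDB the same idea is expressed by passing a `where` metadata filter alongside the query embeddings.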
04 — Knowledge & Facts

Semantic Memory: Knowledge Bases

Semantic memory is factual knowledge about the world, domain, or user — not tied to specific events. Sources include product documentation, user profile data, domain knowledge bases, and learned facts. Differs from episodic: episodic = "what happened during task X"; semantic = "this user prefers Python; the API rate limit is 1000/hour".

Example: user preference semantic store

import json

from openai import OpenAI

client = OpenAI()

class UserSemanticMemory:
    def __init__(self):
        self.facts = {}        # key-value for structured facts
        self.preferences = []  # free-text preferences

    def learn_preference(self, preference: str):
        """Store a user preference extracted from conversation."""
        self.preferences.append(preference)

    def get_context_str(self) -> str:
        prefs = "\n".join(f"- {p}" for p in self.preferences[-10:])
        facts = "\n".join(f"- {k}: {v}" for k, v in self.facts.items())
        return f"User facts:\n{facts}\n\nUser preferences:\n{prefs}"

# Auto-extract preferences with an LLM
def extract_preferences(conversation: str) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Extract any user preferences from: {conversation}\n"
                       'Return a JSON object with a "preferences" list.',
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content).get("preferences", [])

Semantic Memory Storage Options

Type              | Storage          | Retrieval           | Best for
Key-value facts   | Redis / dict     | Exact key lookup    | Structured user data
Vector embeddings | Vector DB        | Semantic similarity | Unstructured knowledge
Graph (KG)        | Neo4j / NetworkX | Graph traversal     | Relational knowledge
SQL               | Postgres         | SQL queries         | Structured, queryable
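The graph row of the table can be illustrated with a toy triple store and a two-hop traversal — a sketch of the idea, not a Neo4j API (the entities and relations are invented for the example):

```python
# Minimal relational semantic memory: (subject, predicate, object) triples
triples = [
    ("deepak", "prefers", "python"),
    ("deepak", "works_on", "rag_system"),
    ("rag_system", "uses", "chromadb"),
]

def neighbors(entity):
    """One-hop: all (predicate, object) pairs leaving an entity."""
    return [(p, o) for s, p, o in triples if s == entity]

def two_hop(entity):
    """Two-hop traversal — the kind of query key-value or vector stores can't express."""
    out = []
    for p1, mid in neighbors(entity):
        for p2, o in neighbors(mid):
            out.append((p1, mid, p2, o))
    return out

facts = two_hop("deepak")
# → deepak works_on rag_system, which uses chromadb
```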
05 — Learned Skills

Procedural Memory: Skills and Patterns

Procedural memory is "how to do things" — stored as few-shot examples, tool definitions, workflow templates, or fine-tuned weights. Retrieval-augmented few-shot: store (task_type → example_pairs) in a vector DB; at query time, retrieve the examples most relevant to the current task and use them as templates.

Skills as memory: agent stores successful task solutions; retrieves similar solutions when facing new tasks; uses them as templates. This is more scalable than fine-tuning and easier to update.

Example: procedural memory with dynamic few-shot retrieval

class ProceduralMemory:
    def __init__(self):
        self.skills = []  # list of {task, solution, embedding}

    def store_skill(self, task_description: str, solution: str):
        # embed() is any text-embedding function (e.g. the OpenAI call above)
        emb = embed(task_description)
        self.skills.append({
            "task": task_description,
            "solution": solution,
            "embedding": emb,
        })

    def retrieve_examples(self, current_task: str, k=3) -> list[dict]:
        q_emb = embed(current_task)
        scored = [(cosine_sim(q_emb, s["embedding"]), s) for s in self.skills]
        # Sort on the score only — score ties would otherwise try to compare dicts
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [s for _, s in scored[:k]]
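Retrieved skills are typically injected as few-shot examples. A minimal sketch of that assembly step — the message layout here is an assumption, not a prescribed format:

```python
def build_few_shot_prompt(task, examples):
    """Turn retrieved (task, solution) skills into few-shot chat messages."""
    messages = [{"role": "system", "content": "Solve the task, following the examples."}]
    for ex in examples:
        # Each past skill becomes a user/assistant exchange the model can imitate
        messages.append({"role": "user", "content": ex["task"]})
        messages.append({"role": "assistant", "content": ex["solution"]})
    messages.append({"role": "user", "content": task})
    return messages

examples = [
    {"task": "Parse a CSV file", "solution": "Use csv.DictReader over the open file."},
]
prompt = build_few_shot_prompt("Parse a TSV file", examples)
# → system message, one example exchange, then the new task
```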
06 — Explicit Memory

MemGPT and Long-Context Architectures

MemGPT: the model manages its own memory explicitly — it decides what to store in "external memory" (a vector DB) versus what to keep in the context window, using special memory functions. Letta (the evolution of MemGPT) is a production-ready stateful agent framework with built-in memory management.

Memory Functions

Example: MemGPT memory function calling

SYSTEM: You have the following memory tools available:
- archival_memory_insert(content): Store important information
- archival_memory_search(query): Search your long-term memory
- core_memory_append(name, content): Update your persona/human profile

Your core memory:
[PERSONA]: You are a helpful assistant who remembers past interactions.
[HUMAN]: Name: Deepak. Prefers Python. Works in AI/ML.

# Agent decides to store new information:
# archival_memory_insert("User mentioned building a RAG system for legal documents")
⚠️ MemGPT requires reliable function calling. Weak models forget to call memory functions or call them incorrectly. Use GPT-4o or Claude 3.5 class models.
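On the application side, memory tool calls like the ones above need a dispatcher. A toy sketch, assuming the model returns a tool name plus arguments — the in-memory archival list is an illustrative stand-in for a vector DB:

```python
# Hypothetical dispatcher: route model-emitted tool calls to a memory store
archive = []

def archival_memory_insert(content):
    archive.append(content)
    return "stored"

def archival_memory_search(query):
    # Substring match stands in for semantic search over embeddings
    return [m for m in archive if query.lower() in m.lower()]

TOOLS = {
    "archival_memory_insert": archival_memory_insert,
    "archival_memory_search": archival_memory_search,
}

def dispatch(tool_call):
    """tool_call is the parsed {name, arguments} the model produced."""
    return TOOLS[tool_call["name"]](**tool_call["arguments"])

dispatch({"name": "archival_memory_insert",
          "arguments": {"content": "User is building a RAG system for legal documents"}})
hits = dispatch({"name": "archival_memory_search", "arguments": {"query": "legal"}})
```

The dispatch result is appended to the conversation as a tool message, so the model sees what its memory operation returned.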
07 — Practical Approaches

Production Memory Patterns

Core Strategies

1. Tiered Storage — hot, warm, cold

Hot tier: context window (last N messages). Warm tier: in-memory vector store (session). Cold tier: persistent vector DB (across sessions). Each tier is cheaper but slower to retrieve.

  • Context window: up to 128K tokens (GPT-4 Turbo)
  • Session memory: retrieval in milliseconds
  • Persistent DB: retrieval in seconds
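A read-through lookup across the three tiers might look like this sketch (the tier contents and the promote-on-access policy are assumptions for illustration):

```python
class TieredMemory:
    """Read-through lookup: hot (context) → warm (session) → cold (persistent)."""

    def __init__(self):
        self.hot, self.warm, self.cold = {}, {}, {}

    def get(self, key):
        for tier_name, tier in (("hot", self.hot),
                                ("warm", self.warm),
                                ("cold", self.cold)):
            if key in tier:
                if tier_name != "hot":
                    self.hot[key] = tier[key]  # promote on access
                return tier[key]
        return None

mem = TieredMemory()
mem.cold["project"] = "legal RAG system"   # persisted from a past session
value = mem.get("project")                 # found in cold, promoted to hot
```

In production the hot tier is the prompt itself, the warm tier an in-process vector index, and the cold tier a persistent vector DB — the lookup order is the same.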
2. Memory Consolidation — compress & prune

Periodically (end of session, or every N turns) run consolidation: summarize episodic memories, extract semantic facts, prune redundant entries. Prevents unbounded memory growth.

  • Consolidation every 50–100 messages
  • Merge similar episodes into summaries
  • Archive old data to cold storage
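A consolidation pass can be sketched as grouping episodes by type and collapsing the surplus into a summary entry — in production the merged text would come from an LLM summarization call rather than string joining:

```python
def consolidate(episodes, max_per_type=2):
    """Collapse surplus episodes of one task type into a single summary entry."""
    by_type = {}
    for ep in episodes:
        by_type.setdefault(ep["task_type"], []).append(ep)
    out = []
    for task_type, group in by_type.items():
        if len(group) <= max_per_type:
            out.extend(group)
            continue
        # In production this join would be an LLM summarization call
        merged = "; ".join(e["text"] for e in group[:-max_per_type])
        out.append({"task_type": task_type, "text": f"[summary] {merged}"})
        out.extend(group[-max_per_type:])  # most recent episodes stay verbatim
    return out

episodes = [
    {"task_type": "coding", "text": "fixed bug A"},
    {"task_type": "coding", "text": "fixed bug B"},
    {"task_type": "coding", "text": "fixed bug C"},
    {"task_type": "coding", "text": "fixed bug D"},
]
compact = consolidate(episodes)
# → one summary entry plus the two most recent episodes
```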
3. Memory Scoping — governance

Scope memories to: user (private to one user), session (cleared after the session ends), agent (shared across all users of one agent), or team (shared across a user's agents). Explicit scopes give clear data-governance boundaries.

  • User scope: most restrictive, highest privacy
  • Session scope: temporary, fast cleanup
  • Agent scope: enables cross-user learning
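Composite keys make scoping and cleanup mechanical. A minimal sketch — the scope names follow the list above, and the flat dict is an illustrative stand-in for a real store:

```python
# Scope memories with composite keys so cleanup and access control stay simple
store = {}

def remember(scope, scope_id, key, value):
    store[(scope, scope_id, key)] = value

def recall(scope, scope_id):
    """All memories visible within one scope."""
    return {k[2]: v for k, v in store.items() if k[:2] == (scope, scope_id)}

def clear_scope(scope, scope_id):
    """Delete everything in a scope — e.g. session cleanup or user erasure."""
    for k in [k for k in store if k[:2] == (scope, scope_id)]:
        del store[k]

remember("user", "u42", "preferred_language", "Python")
remember("session", "s9", "current_task", "draft report")
clear_scope("session", "s9")       # session ends: temporary memories go
prefs = recall("user", "u42")      # user-scoped memories survive
```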
4. Forgetting & Privacy — TTL & erasure

Implement TTL (time-to-live) on episodic memories. Provide user-facing "forget this" functionality. Comply with GDPR right-to-erasure for user memory stores.

  • Default TTL: 90 days for episodic memories
  • User deletion requests: immediate erasure
  • Compliance audits: trace memory deletion
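TTL pruning and right-to-erasure can be sketched together (the field names and the injectable `now` clock are assumptions, chosen so the behavior is testable):

```python
import time

class EpisodicStore:
    """Episodes carry a timestamp; prune() enforces TTL, forget_user() handles erasure."""

    def __init__(self, ttl_seconds=90 * 24 * 3600):  # default: 90 days
        self.ttl = ttl_seconds
        self.episodes = []  # {"user_id", "text", "stored_at"}

    def add(self, user_id, text, now=None):
        self.episodes.append({"user_id": user_id, "text": text,
                              "stored_at": now if now is not None else time.time()})

    def prune(self, now=None):
        # Drop anything older than the TTL
        now = now if now is not None else time.time()
        self.episodes = [e for e in self.episodes if now - e["stored_at"] < self.ttl]

    def forget_user(self, user_id):
        # Right-to-erasure: remove every memory tied to one user immediately
        self.episodes = [e for e in self.episodes if e["user_id"] != user_id]

store = EpisodicStore(ttl_seconds=100)
store.add("u1", "old episode", now=0)
store.add("u1", "fresh episode", now=950)
store.add("u2", "other user", now=960)
store.prune(now=1000)        # "old episode" is past TTL
store.forget_user("u2")      # erasure request
texts = [e["text"] for e in store.episodes]
```

For a compliance audit trail, each deletion would additionally be logged with a reason (TTL expiry vs. user request) before the entry is dropped.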

Memory Infrastructure Tools

Category      | Tool           | Purpose
Framework     | LangGraph      | Checkpointer for multi-turn memory persistence
Agent         | Letta / MemGPT | Stateful agent with explicit memory management
Memory Store  | mem0           | User memory platform for agents and AI apps
Memory Store  | Zep            | Long-term memory for LLM applications
Observability | LangSmith      | Debugging & tracing for memory & LLM chains
Vector DB     | ChromaDB       | Embedded vector store for episodes
Vector DB     | Qdrant         | Scalable vector DB for production
Cache         | Redis          | Fast key-value store for semantic facts
LEARNING PATH

Learning Path

Agent memory is a system design problem as much as an AI one. Here's the progression:

In-context — simplest memory
Summarisation — compress history
External store — vector + key-value
Knowledge graph — structured memory
1. Start with in-context memory

Just keep the last N turns in the prompt. This works surprisingly well for sessions up to ~20 turns. Don't over-engineer until you see the context window becoming a bottleneck.

2. Add rolling summarisation

When history approaches your context limit, summarise the oldest messages with the LLM itself. LangChain's ConversationSummaryBufferMemory does this automatically.

3. Use a vector store for long-term recall

Embed each significant fact or exchange, store in Chroma or Weaviate, retrieve the top-k relevant memories at each turn. This is "episodic memory" — the agent remembers relevant past experiences.

4. Use a knowledge graph for structured facts

Tools like Cognee or Zep build a graph of entities and relationships from conversation history. Enables multi-hop queries: "What did I say about the project after my meeting with Alice?"
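The multi-hop query above can be mimicked with a toy timestamped event store — a sketch of the idea, not Cognee's or Zep's API (the event schema is invented for the example):

```python
# Toy entity-relationship memory with timestamps, enough for a multi-hop,
# time-filtered query like "project mentions after the meeting with Alice"
events = [
    {"t": 1, "kind": "meeting", "with": "Alice"},
    {"t": 2, "kind": "statement", "about": "project", "text": "Deadline moved to Friday"},
    {"t": 0, "kind": "statement", "about": "project", "text": "Kickoff went well"},
]

def statements_about_after(topic, kind, counterpart):
    """Hop 1: find the anchor event; hop 2: filter statements that came after it."""
    anchor = min(e["t"] for e in events
                 if e["kind"] == kind and e.get("with") == counterpart)
    return [e["text"] for e in events
            if e["kind"] == "statement" and e.get("about") == topic and e["t"] > anchor]

answers = statements_about_after("project", "meeting", "Alice")
# → only the statement made after the Alice meeting
```

A real knowledge-graph memory extracts these entities and edges from conversation automatically; the query pattern — resolve an anchor node, then traverse with constraints — is the same.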