External vector store that lets agents persist and retrieve facts across sessions. The agent writes memories explicitly and retrieves relevant ones via semantic search before each response.
Short-term (conversation buffer) memory lives in the prompt window and resets when the session ends. Long-term memory persists across sessions in an external store and is retrieved on demand.
Think of it like human memory: short-term is what you're actively thinking about right now; long-term is everything you've ever learned, retrieved via association ("that reminds me of..."). Agents need both: short-term for the current task, long-term for facts, preferences, and history that span many sessions.
The core mechanism: when the agent generates something worth remembering, it writes a memory entry to a vector store (embedding the text). On subsequent turns, it retrieves semantically similar memories before generating a response. This gives the model access to a personalised, growing knowledge base.
New conversation turn:
1. Retrieve: embed(current_message) → top-K similar memories from store
2. Inject: prepend retrieved memories to the context
3. Respond: LLM generates response with memory context
4. Write: LLM decides what (if anything) to store as a new memory
Memory write decisions:
- User preferences: "I prefer bullet points over prose"
- Key facts: "User's project is a GenAI mindmap with 337 nodes"
- Action history: "Successfully deployed to GitHub Pages on 2024-03-15"
- Corrections: "User clarified that embeddings should use cosine distance"
The agent can decide what to write explicitly (via a "save_memory" tool) or you can add an automatic memory extraction step after each turn.
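As a sketch, an explicit memory-write tool can be declared in the Anthropic tool-use format; the description wording and schema here are illustrative, not a fixed API:

```python
# Hypothetical tool declaration for explicit memory writes (Anthropic tool-use format)
SAVE_MEMORY_TOOL = {
    "name": "save_memory",
    "description": (
        "Persist a short, atomic fact about the user or task for future "
        "sessions, e.g. a preference, a correction, or a key project detail."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "content": {
                "type": "string",
                "description": "The fact to remember, phrased as one sentence.",
            }
        },
        "required": ["content"],
    },
}
```

Pass this via `tools=[SAVE_MEMORY_TOOL]` in `client.messages.create(...)`; when the model emits a `tool_use` block named `save_memory`, call your store's write function with the `content` input.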
```python
import anthropic
import chromadb  # pip install chromadb
from datetime import datetime
from functools import lru_cache

client = anthropic.Anthropic()
chroma = chromadb.Client()
collection = chroma.get_or_create_collection(
    "agent_memory",
    metadata={"hnsw:space": "cosine"},  # cosine distance, so similarity = 1 - distance
)

@lru_cache(maxsize=1)
def _embedder():
    # Lightweight local model for embeddings; load once and reuse
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer("all-MiniLM-L6-v2")

def embed(text: str) -> list[float]:
    """Embed with a local sentence-transformers model (swap in any embedding API)."""
    return _embedder().encode(text).tolist()

def save_memory(content: str, user_id: str = "default") -> str:
    memory_id = f"mem_{datetime.now().timestamp()}"
    collection.add(
        ids=[memory_id],
        embeddings=[embed(content)],
        documents=[content],
        metadatas=[{"user_id": user_id, "timestamp": datetime.now().isoformat()}],
    )
    return memory_id

def retrieve_memories(query: str, user_id: str = "default", k: int = 5) -> list[str]:
    results = collection.query(
        query_embeddings=[embed(query)],
        n_results=k,
        where={"user_id": user_id},
    )
    return results["documents"][0] if results["documents"] else []

def chat_with_memory(user_message: str, user_id: str = "default") -> str:
    # 1. Retrieve relevant memories
    memories = retrieve_memories(user_message, user_id)
    memory_context = ""
    if memories:
        memory_context = "Relevant memories:\n" + "\n".join(f"- {m}" for m in memories)

    # 2. Build messages
    messages = [{"role": "user", "content": user_message}]

    # 3. Respond with memory context in the system prompt
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        system=f"You are a helpful assistant.\n{memory_context}",
        messages=messages,
    )
    reply = response.content[0].text

    # 4. Extract and save memories (simple keyword heuristic)
    if any(kw in user_message.lower() for kw in ["my name", "i am", "i prefer", "i work", "remember"]):
        save_memory(user_message, user_id)
    return reply
```
Semantic search (cosine similarity on embeddings) is the default: retrieve memories whose meaning is close to the current query. Works well for factual retrieval ("what did the user say about X?").
Recency weighting: blend similarity score with recency, so recent memories rank higher. Prevents old, contradicted memories from surfacing over newer ones.
```python
from datetime import datetime

def retrieve_with_recency(query: str, k: int = 5, recency_weight: float = 0.3):
    # Over-fetch, then re-rank by a blend of similarity and recency
    results = collection.query(
        query_embeddings=[embed(query)],
        n_results=k * 3,
        include=["documents", "metadatas", "distances"],
    )
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    distances = results["distances"][0]  # lower = more similar

    # Normalise similarity (1 - distance, assuming a cosine-space collection)
    sim_scores = [1 - d for d in distances]

    # Recency score: 1.0 for today, decaying linearly to 0 over 90 days
    now = datetime.now()
    rec_scores = []
    for m in metas:
        mem_time = datetime.fromisoformat(m["timestamp"])
        age_days = (now - mem_time).days
        rec_scores.append(max(0, 1 - age_days / 90))

    # Weighted blend of the two scores
    combined = [(1 - recency_weight) * s + recency_weight * r
                for s, r in zip(sim_scores, rec_scores)]
    ranked = sorted(zip(docs, combined), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:k]]
```
Mem0 (`pip install mem0ai`) is a managed memory layer purpose-built for agents. It handles extraction, storage, deduplication, and retrieval automatically:
```python
from mem0 import Memory
import os

os.environ["OPENAI_API_KEY"] = "..."  # or configure an Anthropic-backed LLM

memory = Memory()

# Add a memory
memory.add("The user prefers Python code examples with type hints.", user_id="deepak")
memory.add("The project is a GenAI mindmap with 337 nodes and 174 concept pages.", user_id="deepak")

# Search memories
results = memory.search("what project is the user working on?", user_id="deepak")
for r in results:
    print(r["memory"])

# Get all memories for a user
all_memories = memory.get_all(user_id="deepak")
```
Mem0 uses an LLM to extract structured facts from conversations (rather than just embedding raw messages), deduplicates memories when new information contradicts old, and supports multiple backends (Qdrant, Pinecone, Chroma, etc.).
Without management, memory grows unbounded and retrieval quality degrades as noise accumulates. Three management strategies:
Time-based expiry: delete memories older than N days. Simple, but removes potentially still-valid facts.
Contradiction detection: when writing a new memory, search for semantically similar existing ones and resolve conflicts. "User prefers dark mode" supersedes "User prefers light mode".
```python
def smart_save(content: str, user_id: str):
    # Check for contradictions before saving
    similar = retrieve_memories(content, user_id, k=3)
    if similar:
        resolution_prompt = f"""New fact: {content}
Existing memories:
{chr(10).join(f"- {m}" for m in similar)}
Do any existing memories contradict the new fact? If yes,
list the IDs to delete. If not, say "no conflict"."""
        # (simplified: production code would send resolution_prompt to the
        # model, parse its reply, and delete superseded memory IDs)
    save_memory(content, user_id)
```
Importance scoring: have the model rate memory importance (1-10) at write time. Only persist high-importance memories (≥7), use lower-importance ones only within the session.
Stale memories cause wrong answers. If you saved "User's project has 100 nodes" six months ago and it now has 337, the agent might confidently state the old number. Add timestamps to all memories and include the age in the retrieval context: "Memory from 6 months ago: ...". Let the model factor in staleness.
Memory injection bloats the prompt. Injecting 10 memories at 200 tokens each adds 2,000 tokens to every request. Be selective: retrieve only the top 3-5 most relevant memories, and only inject them if their similarity score exceeds a threshold.
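One way to enforce both limits, assuming cosine distances as returned by the earlier retrieval calls; the 0.35 cutoff is an assumed starting point to tune per embedding model:

```python
def select_memories(docs: list[str], distances: list[float],
                    max_k: int = 5, max_distance: float = 0.35) -> list[str]:
    """Keep at most max_k memories, and only those close enough to the query."""
    pairs = sorted(zip(docs, distances), key=lambda p: p[1])  # most similar first
    return [doc for doc, d in pairs[:max_k] if d <= max_distance]
```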
Keep the embedding model consistent across the store. If you change the embedding model (e.g., from all-MiniLM to text-embedding-3-small), all existing embeddings become incompatible with new queries. Store the embedding model name with each memory and rebuild the index when you upgrade models.
| Memory Type | Storage | Retrieval | Best For | Limitation |
|---|---|---|---|---|
| Episodic (raw logs) | Vector DB | Semantic search | Conversation history, past interactions | Noisy; needs summarisation |
| Semantic (facts) | Key-value or graph | Exact or fuzzy lookup | User preferences, domain knowledge | Manual extraction needed |
| Procedural (skills) | Prompt templates | Task-type routing | Learned workflows, successful patterns | Hard to update incrementally |
| Working (in-context) | Context window | Immediate | Current task state | Lost at session end |
Memory write quality determines retrieval quality. Raw conversation turns stored as memories are too verbose and noisy, so always extract atomic facts before storing. A reliable extraction prompt: "Extract 3–5 specific, reusable facts from this exchange that would help an assistant in a future conversation with this user. Return as a JSON list of strings." Run this extraction at the end of each session, not inline during the conversation, to keep latency spikes out of the hot path.
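Model replies rarely come back as bare JSON, so the extraction step needs a tolerant parser. This sketch pulls the first JSON list out of the reply; the LLM call that produces `reply` is elided:

```python
import json
import re

EXTRACTION_PROMPT = (
    "Extract 3-5 specific, reusable facts from this exchange that would help "
    "an assistant in a future conversation with this user. "
    "Return as a JSON list of strings."
)

def parse_fact_list(reply: str) -> list[str]:
    """Extract a JSON list of strings from a model reply, tolerating extra prose."""
    match = re.search(r"\[.*?\]", reply, re.DOTALL)
    if not match:
        return []
    try:
        facts = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    return [f for f in facts if isinstance(f, str)]
```

Each returned fact can then go through `save_memory` (or the importance gate above it) individually.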
Memory privacy is a critical concern for production systems. Personal facts extracted and stored in a long-term memory system may be subject to GDPR right-to-erasure requirements. Design your memory store with user-scoped deletion from the start: every memory entry should carry a user ID and a created timestamp, and a delete-user endpoint should purge all entries for that user atomically. Test deletion coverage in staging — memory systems with denormalised storage often retain references even after a nominal delete operation.
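A store-agnostic sketch of user-scoped erasure; `collection` stands for any backend exposing delete/get by metadata filter (Chroma's collection API has this shape), and the verification helper is the staging check described above:

```python
def erase_user(collection, user_id: str) -> None:
    """Purge every memory entry belonging to one user (GDPR right to erasure)."""
    collection.delete(where={"user_id": user_id})

def verify_erasure(collection, user_id: str) -> bool:
    """Staging check: confirm no entries survive a nominal delete."""
    remaining = collection.get(where={"user_id": user_id})
    return len(remaining["ids"]) == 0
```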
For multi-user deployments, namespace all memory queries by user ID at the infrastructure level rather than filtering by a metadata field at query time. Use separate namespaces or collections per user (Pinecone, Weaviate, and Chroma all support this). Namespace-level isolation prevents information leakage from incorrect metadata filters and improves retrieval performance since the per-user search space is much smaller than the full corpus.
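A small helper for per-user collection names; the sanitisation follows Chroma's collection-name constraints (limited character set, max 63 chars), and the `mem` prefix is an assumption:

```python
import re

def user_collection_name(user_id: str, prefix: str = "mem") -> str:
    """Map an arbitrary user ID onto a valid per-user collection name."""
    safe = re.sub(r"[^a-zA-Z0-9_-]", "_", user_id)
    return f"{prefix}_{safe}"[:63]

# Usage (Chroma): chroma.get_or_create_collection(user_collection_name("deepak"))
```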
Run periodic memory consolidation jobs on long-running agent deployments. After 30 days of operation, a user's memory store may contain hundreds of redundant or outdated facts (old preferences that have changed, superseded information). A weekly consolidation job that deduplicates similar memories and marks stale entries for expiry keeps retrieval precision high and prevents the memory store from becoming a source of outdated context that misleads the agent.
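The deduplication half of such a consolidation job can be sketched as pairwise cosine similarity over stored embeddings; the 0.9 threshold is an assumption to tune per embedding model:

```python
import numpy as np

def near_duplicate_pairs(embeddings: list[list[float]],
                         threshold: float = 0.9) -> list[tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose cosine similarity >= threshold."""
    mat = np.asarray(embeddings, dtype=float)
    # Normalise rows so the dot product becomes cosine similarity
    unit = mat / np.clip(np.linalg.norm(mat, axis=1, keepdims=True), 1e-12, None)
    sims = unit @ unit.T
    n = len(embeddings)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if sims[i, j] >= threshold]
```

For each flagged pair, keep the newer memory (or merge the two with an LLM) and mark the other for expiry.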