External vector store that lets agents persist and retrieve facts across sessions. The agent writes memories explicitly and retrieves relevant ones via semantic search before each response.
Short-term (conversation buffer) memory lives in the prompt window and resets when the session ends. Long-term memory persists across sessions in an external store and is retrieved on demand.
Think of it like human memory: short-term is what you're actively thinking about right now; long-term is everything you've ever learned, retrieved via association ("that reminds me of..."). Agents need both: short-term for the current task, long-term for facts, preferences, and history that span many sessions.
The core mechanism: when the agent generates something worth remembering, it writes a memory entry to a vector store (embedding the text). On subsequent turns, it retrieves semantically similar memories before generating a response. This gives the model access to a personalised, growing knowledge base.
New conversation turn:
1. Retrieve: embed(current_message) → top-K similar memories from store
2. Inject: prepend retrieved memories to the context
3. Respond: LLM generates response with memory context
4. Write: LLM decides what (if anything) to store as a new memory
Memory write decisions:
- User preferences: "I prefer bullet points over prose"
- Key facts: "User's project is a GenAI mindmap with 337 nodes"
- Action history: "Successfully deployed to GitHub Pages on 2024-03-15"
- Corrections: "User clarified that embeddings should use cosine distance"
The agent can decide what to write explicitly (via a "save_memory" tool) or you can add an automatic memory extraction step after each turn.
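As a sketch, an explicit memory-write tool can be declared in the Anthropic tool-use format; the description wording and schema here are illustrative, not a fixed API:

```python
# Hypothetical tool declaration for explicit memory writes (Anthropic tool-use format)
SAVE_MEMORY_TOOL = {
    "name": "save_memory",
    "description": (
        "Persist a short, atomic fact about the user or task for future "
        "sessions, e.g. a preference, a correction, or a key project detail."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "content": {
                "type": "string",
                "description": "The fact to remember, phrased as one sentence.",
            }
        },
        "required": ["content"],
    },
}
```

Pass this via `tools=[SAVE_MEMORY_TOOL]` in `client.messages.create(...)`; when the model emits a `tool_use` block named `save_memory`, call your store's write function with the `content` input.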
```python
import anthropic
import chromadb  # pip install chromadb
from datetime import datetime
from functools import lru_cache

client = anthropic.Anthropic()
chroma = chromadb.Client()
collection = chroma.get_or_create_collection(
    "agent_memory",
    metadata={"hnsw:space": "cosine"},  # cosine distance, so similarity = 1 - distance
)

@lru_cache(maxsize=1)
def _embedder():
    # Lightweight local model for embeddings; load once and reuse
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer("all-MiniLM-L6-v2")

def embed(text: str) -> list[float]:
    """Embed with a local sentence-transformers model (swap in any embedding API)."""
    return _embedder().encode(text).tolist()

def save_memory(content: str, user_id: str = "default") -> str:
    memory_id = f"mem_{datetime.now().timestamp()}"
    collection.add(
        ids=[memory_id],
        embeddings=[embed(content)],
        documents=[content],
        metadatas=[{"user_id": user_id, "timestamp": datetime.now().isoformat()}],
    )
    return memory_id

def retrieve_memories(query: str, user_id: str = "default", k: int = 5) -> list[str]:
    results = collection.query(
        query_embeddings=[embed(query)],
        n_results=k,
        where={"user_id": user_id},
    )
    return results["documents"][0] if results["documents"] else []

def chat_with_memory(user_message: str, user_id: str = "default") -> str:
    # 1. Retrieve relevant memories
    memories = retrieve_memories(user_message, user_id)
    memory_context = ""
    if memories:
        memory_context = "Relevant memories:\n" + "\n".join(f"- {m}" for m in memories)

    # 2. Build messages
    messages = [{"role": "user", "content": user_message}]

    # 3. Respond with memory context in the system prompt
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        system=f"You are a helpful assistant.\n{memory_context}",
        messages=messages,
    )
    reply = response.content[0].text

    # 4. Extract and save memories (simple keyword heuristic)
    if any(kw in user_message.lower() for kw in ["my name", "i am", "i prefer", "i work", "remember"]):
        save_memory(user_message, user_id)
    return reply
```
Semantic search (cosine similarity on embeddings) is the default: retrieve memories whose meaning is close to the current query. Works well for factual retrieval ("what did the user say about X?").
Recency weighting: blend similarity score with recency, so recent memories rank higher. Prevents old, contradicted memories from surfacing over newer ones.
```python
from datetime import datetime

def retrieve_with_recency(query: str, k: int = 5, recency_weight: float = 0.3):
    # Over-fetch, then re-rank by a blend of similarity and recency
    results = collection.query(
        query_embeddings=[embed(query)],
        n_results=k * 3,
        include=["documents", "metadatas", "distances"],
    )
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    distances = results["distances"][0]  # lower = more similar

    # Normalise similarity (1 - distance, assuming a cosine-space collection)
    sim_scores = [1 - d for d in distances]

    # Recency score: 1.0 for today, decaying linearly to 0 over 90 days
    now = datetime.now()
    rec_scores = []
    for m in metas:
        mem_time = datetime.fromisoformat(m["timestamp"])
        age_days = (now - mem_time).days
        rec_scores.append(max(0, 1 - age_days / 90))

    # Weighted blend of the two scores
    combined = [(1 - recency_weight) * s + recency_weight * r
                for s, r in zip(sim_scores, rec_scores)]
    ranked = sorted(zip(docs, combined), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:k]]
```
Mem0 (`pip install mem0ai`) is a managed memory layer purpose-built for agents. It handles extraction, storage, deduplication, and retrieval automatically:
```python
from mem0 import Memory
import os

os.environ["OPENAI_API_KEY"] = "..."  # or configure an Anthropic-backed LLM

memory = Memory()

# Add a memory
memory.add("The user prefers Python code examples with type hints.", user_id="deepak")
memory.add("The project is a GenAI mindmap with 337 nodes and 174 concept pages.", user_id="deepak")

# Search memories
results = memory.search("what project is the user working on?", user_id="deepak")
for r in results:
    print(r["memory"])

# Get all memories for a user
all_memories = memory.get_all(user_id="deepak")
```
Mem0 uses an LLM to extract structured facts from conversations (rather than just embedding raw messages), deduplicates memories when new information contradicts old, and supports multiple backends (Qdrant, Pinecone, Chroma, etc.).
Without management, memory grows unbounded and retrieval quality degrades as noise accumulates. Three management strategies:
Time-based expiry: delete memories older than N days. Simple, but removes potentially still-valid facts.
Contradiction detection: when writing a new memory, search for semantically similar existing ones and resolve conflicts. "User prefers dark mode" supersedes "User prefers light mode".
```python
def smart_save(content: str, user_id: str):
    # Check for contradictions before saving
    similar = retrieve_memories(content, user_id, k=3)
    if similar:
        resolution_prompt = f"""New fact: {content}
Existing memories:
{chr(10).join(f"- {m}" for m in similar)}
Do any existing memories contradict the new fact? If yes,
list the IDs to delete. If not, say "no conflict"."""
        # (simplified: production code would send resolution_prompt to the
        # model, parse its reply, and delete superseded memory IDs)
    save_memory(content, user_id)
```
Importance scoring: have the model rate memory importance (1-10) at write time. Only persist high-importance memories (≥7), use lower-importance ones only within the session.
Stale memories cause wrong answers. If you saved "User's project has 100 nodes" six months ago and it now has 337, the agent might confidently state the old number. Add timestamps to all memories and include the age in the retrieval context: "Memory from 6 months ago: ...". Let the model factor in staleness.
Memory injection bloats the prompt. Injecting 10 memories at 200 tokens each adds 2,000 tokens to every request. Be selective: retrieve only the top 3-5 most relevant memories, and only inject them if their similarity score exceeds a threshold.
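One way to enforce both limits, assuming cosine distances as returned by the earlier retrieval calls; the 0.35 cutoff is an assumed starting point to tune per embedding model:

```python
def select_memories(docs: list[str], distances: list[float],
                    max_k: int = 5, max_distance: float = 0.35) -> list[str]:
    """Keep at most max_k memories, and only those close enough to the query."""
    pairs = sorted(zip(docs, distances), key=lambda p: p[1])  # most similar first
    return [doc for doc, d in pairs[:max_k] if d <= max_distance]
```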
Keep the embedding model consistent across the store. If you change the embedding model (e.g., from all-MiniLM to text-embedding-3-small), all existing embeddings become incompatible with new queries. Store the embedding model name with each memory and rebuild the index when you upgrade models.
| Memory Type | Storage | Retrieval | Best For | Limitation |
|---|---|---|---|---|
| Episodic (raw logs) | Vector DB | Semantic search | Conversation history, past interactions | Noisy; needs summarisation |
| Semantic (facts) | Key-value or graph | Exact or fuzzy lookup | User preferences, domain knowledge | Manual extraction needed |
| Procedural (skills) | Prompt templates | Task-type routing | Learned workflows, successful patterns | Hard to update incrementally |
| Working (in-context) | Context window | Immediate | Current task state | Lost at session end |
Memory write quality determines retrieval quality. Raw conversation turns stored as memories are too verbose and noisy, so always extract atomic facts before storing. A reliable extraction prompt: "Extract 3–5 specific, reusable facts from this exchange that would help an assistant in a future conversation with this user. Return as a JSON list of strings." Run this extraction at the end of each session, not inline during the conversation, to keep latency spikes out of the hot path.
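Model replies rarely come back as bare JSON, so the extraction step needs a tolerant parser. This sketch pulls the first JSON list out of the reply; the LLM call that produces `reply` is elided:

```python
import json
import re

EXTRACTION_PROMPT = (
    "Extract 3-5 specific, reusable facts from this exchange that would help "
    "an assistant in a future conversation with this user. "
    "Return as a JSON list of strings."
)

def parse_fact_list(reply: str) -> list[str]:
    """Extract a JSON list of strings from a model reply, tolerating extra prose."""
    match = re.search(r"\[.*?\]", reply, re.DOTALL)
    if not match:
        return []
    try:
        facts = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    return [f for f in facts if isinstance(f, str)]
```

Each returned fact can then go through `save_memory` (or the importance gate above it) individually.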
Memory privacy is a critical concern for production systems. Personal facts extracted and stored in a long-term memory system may be subject to GDPR right-to-erasure requirements. Design your memory store with user-scoped deletion from the start: every memory entry should carry a user ID and a created timestamp, and a delete-user endpoint should purge all entries for that user atomically. Test deletion coverage in staging — memory systems with denormalised storage often retain references even after a nominal delete operation.
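A store-agnostic sketch of user-scoped erasure; `collection` stands for any backend exposing delete/get by metadata filter (Chroma's collection API has this shape), and the verification helper is the staging check described above:

```python
def erase_user(collection, user_id: str) -> None:
    """Purge every memory entry belonging to one user (GDPR right to erasure)."""
    collection.delete(where={"user_id": user_id})

def verify_erasure(collection, user_id: str) -> bool:
    """Staging check: confirm no entries survive a nominal delete."""
    remaining = collection.get(where={"user_id": user_id})
    return len(remaining["ids"]) == 0
```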
For multi-user deployments, namespace all memory queries by user ID at the infrastructure level rather than filtering by a metadata field at query time. Use separate namespaces or collections per user (Pinecone, Weaviate, and Chroma all support this). Namespace-level isolation prevents information leakage from incorrect metadata filters and improves retrieval performance since the per-user search space is much smaller than the full corpus.
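A small helper for per-user collection names; the sanitisation follows Chroma's collection-name constraints (limited character set, max 63 chars), and the `mem` prefix is an assumption:

```python
import re

def user_collection_name(user_id: str, prefix: str = "mem") -> str:
    """Map an arbitrary user ID onto a valid per-user collection name."""
    safe = re.sub(r"[^a-zA-Z0-9_-]", "_", user_id)
    return f"{prefix}_{safe}"[:63]

# Usage (Chroma): chroma.get_or_create_collection(user_collection_name("deepak"))
```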
Run periodic memory consolidation jobs on long-running agent deployments. After 30 days of operation, a user's memory store may contain hundreds of redundant or outdated facts (old preferences that have changed, superseded information). A weekly consolidation job that deduplicates similar memories and marks stale entries for expiry keeps retrieval precision high and prevents the memory store from becoming a source of outdated context that misleads the agent.
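The deduplication half of such a consolidation job can be sketched as pairwise cosine similarity over stored embeddings; the 0.9 threshold is an assumption to tune per embedding model:

```python
import numpy as np

def near_duplicate_pairs(embeddings: list[list[float]],
                         threshold: float = 0.9) -> list[tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose cosine similarity >= threshold."""
    mat = np.asarray(embeddings, dtype=float)
    # Normalise rows so the dot product becomes cosine similarity
    unit = mat / np.clip(np.linalg.norm(mat, axis=1, keepdims=True), 1e-12, None)
    sims = unit @ unit.T
    n = len(embeddings)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if sims[i, j] >= threshold]
```

For each flagged pair, keep the newer memory (or merge the two with an LLM) and mark the other for expiry.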