<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Persistent-Memory on RockB</title><link>https://baeseokjae.github.io/tags/persistent-memory/</link><description>Recent content in Persistent-Memory on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Thu, 07 May 2026 12:00:00 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/persistent-memory/index.xml" rel="self" type="application/rss+xml"/><item><title>Mem0 Guide 2026: Add Persistent Memory to Your AI Agents</title><link>https://baeseokjae.github.io/posts/mem0-agent-memory-guide-2026/</link><pubDate>Thu, 07 May 2026 12:00:00 +0000</pubDate><guid>https://baeseokjae.github.io/posts/mem0-agent-memory-guide-2026/</guid><description>Add persistent memory to AI agents with Mem0: architecture, quick start, memory scoping, LangChain integration, pricing, and production best practices.</description><content:encoded><![CDATA[<p>AI agents without persistent memory lose 80% of context between interactions — every session starts cold, the agent has no recollection of user preferences, past decisions, or accumulated knowledge, and users pay both in frustration and in token costs. Mem0 solves this with a managed memory layer that combines vector search, knowledge graph storage, and key-value caching into a single API. With ~48,000 GitHub stars, a $24M Series A closed in October 2025, and YC backing, Mem0 has become the default choice for teams that want to bolt production-grade memory onto an existing agent in under a day. This guide covers everything you need to go from zero to a memory-enabled agent: architecture internals, quick start code, memory scoping patterns, integration with LangChain and AutoGen, pricing tiers, and how Mem0 compares to Zep and LangGraph Store.</p>
<h2 id="what-is-mem0-and-why-ai-agents-need-persistent-memory">What Is Mem0 and Why AI Agents Need Persistent Memory</h2>
<p>Mem0 is an open-source, managed memory layer for AI agents that persists information across sessions, retrieves relevant context on demand, and updates stored facts adaptively without duplicating entries. AI agents without memory are stateless by design — each LLM call is independent, and anything learned in session one is gone by session two unless the developer manually reconstructs context. At small scale this is annoying; at production scale it is expensive. Persistent memory reduces token usage by 30–60% for repeated tasks by replacing verbose context reconstruction with targeted memory retrieval. The practical impact is measurable: a customer support bot that remembers a user&rsquo;s past tickets cuts average handle time, reduces escalations, and eliminates the user&rsquo;s need to repeat themselves. A coding agent that persists codebase conventions and architectural decisions stops recommending patterns the team has already explicitly rejected. Memory is not a nice-to-have — it is the difference between an agent that improves with use and one that stays perpetually ignorant of everything it has encountered before.</p>
<h2 id="mem0-architecture-vector-graph-and-key-value-combined">Mem0 Architecture: Vector, Graph, and Key-Value Combined</h2>
<p>Mem0&rsquo;s storage architecture is a deliberate hybrid that matches retrieval strategy to information type. Vector storage handles semantic memories — free-form facts, preferences, and conversational context — using embedding similarity to surface relevant entries when a query arrives. Graph storage handles structured entity relationships: &ldquo;Alice works at Acme Corp&rdquo; and &ldquo;Acme Corp uses AWS&rdquo; are two separate facts linked through a shared entity node, making multi-hop retrieval possible without complex query engineering. Key-value storage handles exact-match lookups — flags, configurations, and short factual fields that should be retrieved deterministically rather than semantically. Most memory systems force engineers to pick one of these strategies. Mem0 runs all three simultaneously and applies a routing layer that selects the appropriate backend per retrieval request. This architecture is why Mem0 handles the personalization memory problem well: it can retrieve &ldquo;Alice prefers concise Python answers&rdquo; semantically, look up &ldquo;Alice&rsquo;s subscription tier&rdquo; exactly, and traverse &ldquo;Alice&rsquo;s relationship to her team&rsquo;s codebase conventions&rdquo; through the graph — all in a single memory call. The storage backends are pluggable; the hosted Mem0 Cloud service uses Qdrant for vectors, Neo4j for graph, and Redis for key-value, but the open-source library supports alternative backends including Chroma, Weaviate, Pinecone, and PostgreSQL with pgvector.</p>
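<p>For teams running the open-source library, swapping backends is a config change rather than a code change. The sketch below shows the shape of that config, assuming mem0&rsquo;s <code>Memory.from_config()</code> entry point; the provider names and connection fields follow the documented config schema but can shift between versions, so verify the keys against the release you install:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from mem0 import Memory

# Illustrative config: pgvector for vector search, Anthropic for the
# fact-extraction LLM. Verify field names against your mem0 version.
config = {
    "vector_store": {
        "provider": "pgvector",
        "config": {
            "dbname": "mem0",
            "user": "postgres",
            "password": "postgres",  # example credentials only
            "host": "localhost",
            "port": 5432,
        },
    },
    "llm": {
        "provider": "anthropic",
        "config": {"model": "claude-3-5-sonnet-20241022"},
    },
}

m = Memory.from_config(config)
m.add("Alice prefers concise answers", user_id="alice")
</code></pre></div>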
<h2 id="quick-start-add-mem0-to-your-ai-agent-in-10-minutes">Quick Start: Add Mem0 to Your AI Agent in 10 Minutes</h2>
<p>Install Mem0 with a single pip command and you have a working memory store in under ten minutes — the hosted cloud tier requires only an API key, no infrastructure setup. The open-source library runs fully locally with default backends configured out of the box. Here is the minimal pattern for adding, searching, and using memory in an agent:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>pip install mem0ai
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> mem0 <span style="color:#f92672">import</span> Memory
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>m <span style="color:#f92672">=</span> Memory()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Store a memory for a specific user</span>
</span></span><span style="display:flex;"><span>m<span style="color:#f92672">.</span>add(<span style="color:#e6db74">&#34;User prefers Python and concise answers&#34;</span>, user_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;alice&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Store multiple facts in one call (list input)</span>
</span></span><span style="display:flex;"><span>m<span style="color:#f92672">.</span>add([
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Alice is building a FastAPI backend&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Alice&#39;s team uses PostgreSQL, not MySQL&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Alice prefers async/await over threading&#34;</span>
</span></span><span style="display:flex;"><span>], user_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;alice&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Retrieve relevant memories via semantic search</span>
</span></span><span style="display:flex;"><span>results <span style="color:#f92672">=</span> m<span style="color:#f92672">.</span>search(<span style="color:#e6db74">&#34;user coding preferences&#34;</span>, user_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;alice&#34;</span>)
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> r <span style="color:#f92672">in</span> results:
</span></span><span style="display:flex;"><span>    print(r[<span style="color:#e6db74">&#34;memory&#34;</span>], r[<span style="color:#e6db74">&#34;score&#34;</span>])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Retrieve all memories for a user</span>
</span></span><span style="display:flex;"><span>memories <span style="color:#f92672">=</span> m<span style="color:#f92672">.</span>get_all(user_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;alice&#34;</span>)
</span></span><span style="display:flex;"><span>context <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">&#34;</span><span style="color:#f92672">.</span>join([mem[<span style="color:#e6db74">&#34;memory&#34;</span>] <span style="color:#66d9ef">for</span> mem <span style="color:#f92672">in</span> memories])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Inject memory context into an LLM prompt</span>
</span></span><span style="display:flex;"><span>system_prompt <span style="color:#f92672">=</span> <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;&#34;&#34;You are a coding assistant.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">Known facts about this user:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74"></span><span style="color:#e6db74">{</span>context<span style="color:#e6db74">}</span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">Answer concisely and use Python unless the user asks otherwise.&#34;&#34;&#34;</span>
</span></span></code></pre></div><p>For the Mem0 Cloud API (required for graph memory and production rate limits), swap in the <code>MemoryClient</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> mem0 <span style="color:#f92672">import</span> MemoryClient
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> MemoryClient(api_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;your-api-key&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client<span style="color:#f92672">.</span>add(<span style="color:#e6db74">&#34;User is migrating from Flask to FastAPI&#34;</span>, user_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;alice&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>results <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>search(<span style="color:#e6db74">&#34;framework preferences&#34;</span>, user_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;alice&#34;</span>, limit<span style="color:#f92672">=</span><span style="color:#ae81ff">5</span>)
</span></span></code></pre></div><p>The <code>add()</code> call is where Mem0 does its real work: it runs the input through an LLM extraction pipeline to identify discrete facts, embeds them, checks for conflicts with existing memories, and either inserts new entries or updates existing ones. This extraction step is why Mem0 can accept raw conversation turns rather than pre-formatted facts.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Add a raw conversation turn — Mem0 extracts facts automatically</span>
</span></span><span style="display:flex;"><span>messages <span style="color:#f92672">=</span> [
</span></span><span style="display:flex;"><span>    {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">&#34;I always use black for Python formatting and isort for imports&#34;</span>},
</span></span><span style="display:flex;"><span>    {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;assistant&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">&#34;Got it, I&#39;ll follow black + isort conventions in all code I write for you.&#34;</span>}
</span></span><span style="display:flex;"><span>]
</span></span><span style="display:flex;"><span>m<span style="color:#f92672">.</span>add(messages, user_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;alice&#34;</span>)
</span></span></code></pre></div><h2 id="memory-scoping-user-session-and-agent-level-memory">Memory Scoping: User, Session, and Agent-Level Memory</h2>
<p>Mem0 supports three memory scopes — user-level, session-level, and agent-level — and the scope you choose determines both what gets shared and what gets isolated. User-level memory persists across all sessions for a given user ID: preferences, biographical facts, and long-term behavioral patterns belong here. Session-level memory is scoped to a single interaction window and is garbage-collected (or archived) when the session ends: transient context, in-progress task state, and decisions made during a specific conversation belong here. Agent-level memory is scoped to an agent ID independent of any user: shared knowledge about tools, external APIs, codebase conventions, and domain facts that all users of a given agent should benefit from belong here. Getting scoping wrong is a common source of bugs. If you store architectural decisions at session scope, the agent forgets them the next time it runs. If you store sensitive user data at agent scope, you create unintended data sharing across users. The right pattern for most production agents is a three-layer architecture: agent-level facts as a global knowledge base, user-level facts for personalization, and session-level facts for short-term task state.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> mem0 <span style="color:#f92672">import</span> Memory
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>m <span style="color:#f92672">=</span> Memory()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Agent-level: shared codebase knowledge</span>
</span></span><span style="display:flex;"><span>m<span style="color:#f92672">.</span>add(
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;This codebase uses domain-driven design with hexagonal architecture&#34;</span>,
</span></span><span style="display:flex;"><span>    agent_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;coding-assistant-v2&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>m<span style="color:#f92672">.</span>add(
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;All database access must go through repository classes, never direct ORM queries in handlers&#34;</span>,
</span></span><span style="display:flex;"><span>    agent_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;coding-assistant-v2&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># User-level: personal preferences</span>
</span></span><span style="display:flex;"><span>m<span style="color:#f92672">.</span>add(
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Alice prefers verbose docstrings with type hints&#34;</span>,
</span></span><span style="display:flex;"><span>    user_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;alice&#34;</span>,
</span></span><span style="display:flex;"><span>    agent_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;coding-assistant-v2&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Session-level: transient task context</span>
</span></span><span style="display:flex;"><span>m<span style="color:#f92672">.</span>add(
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Currently refactoring the payment module — do not suggest changes to auth&#34;</span>,
</span></span><span style="display:flex;"><span>    user_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;alice&#34;</span>,
</span></span><span style="display:flex;"><span>    agent_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;coding-assistant-v2&#34;</span>,
</span></span><span style="display:flex;"><span>    run_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;session-20260507-payment-refactor&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Retrieve combining user + agent context (scopes merge automatically)</span>
</span></span><span style="display:flex;"><span>results <span style="color:#f92672">=</span> m<span style="color:#f92672">.</span>search(
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;how should I structure this service class?&#34;</span>,
</span></span><span style="display:flex;"><span>    user_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;alice&#34;</span>,
</span></span><span style="display:flex;"><span>    agent_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;coding-assistant-v2&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><p>Session IDs can be any string; UUIDs, timestamps, and task-descriptive slugs all work. The <code>run_id</code> parameter maps to session scope internally.</p>
<h2 id="adaptive-memory-updates-how-mem0-avoids-memory-bloat">Adaptive Memory Updates: How Mem0 Avoids Memory Bloat</h2>
<p>Adaptive memory updates are Mem0&rsquo;s answer to the memory bloat problem that plagues naive vector-store approaches. When you call <code>m.add()</code>, Mem0 does not blindly append a new embedding to the store. Instead, it runs a three-step pipeline: extract discrete facts from the input text, check each fact against existing memories for semantic overlap, and then decide whether to insert a new entry, update an existing entry, or discard the input as a duplicate. This behavior is controlled by the underlying LLM (configurable — OpenAI, Anthropic, or local models all work) and produces a resolution event for each fact: <code>ADD</code>, <code>UPDATE</code>, <code>DELETE</code>, or <code>NONE</code>. The result is a memory store that stays compact and accurate rather than growing unbounded with redundant entries. A user who says &ldquo;I prefer Python&rdquo; in session one and &ldquo;use Python, not JavaScript&rdquo; in session five doesn&rsquo;t accumulate two conflicting entries — Mem0 detects the semantic overlap and updates the single canonical preference. This adaptive behavior is particularly important for long-running agents in customer support and personal assistant contexts, where the same user may interact hundreds of times. Without update semantics, memory stores grow linearly with sessions and retrieval quality degrades as noise accumulates. Mem0&rsquo;s adaptive pipeline keeps the store pruned and coherent.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Demonstrate adaptive update behavior</span>
</span></span><span style="display:flex;"><span>m <span style="color:#f92672">=</span> Memory()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>m<span style="color:#f92672">.</span>add(<span style="color:#e6db74">&#34;Alice prefers Python&#34;</span>, user_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;alice&#34;</span>)
</span></span><span style="display:flex;"><span>memories_before <span style="color:#f92672">=</span> m<span style="color:#f92672">.</span>get_all(user_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;alice&#34;</span>)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Memories before: </span><span style="color:#e6db74">{</span>len(memories_before)<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)  <span style="color:#75715e"># 1</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Adding a related fact triggers UPDATE, not INSERT</span>
</span></span><span style="display:flex;"><span>m<span style="color:#f92672">.</span>add(<span style="color:#e6db74">&#34;Alice only uses Python, never JavaScript or TypeScript&#34;</span>, user_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;alice&#34;</span>)
</span></span><span style="display:flex;"><span>memories_after <span style="color:#f92672">=</span> m<span style="color:#f92672">.</span>get_all(user_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;alice&#34;</span>)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Memories after: </span><span style="color:#e6db74">{</span>len(memories_after)<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)  <span style="color:#75715e"># Still 1, updated</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Inspect the memory history for a specific memory ID</span>
</span></span><span style="display:flex;"><span>memory_id <span style="color:#f92672">=</span> memories_after[<span style="color:#ae81ff">0</span>][<span style="color:#e6db74">&#34;id&#34;</span>]
</span></span><span style="display:flex;"><span>history <span style="color:#f92672">=</span> m<span style="color:#f92672">.</span>history(memory_id)
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> h <span style="color:#f92672">in</span> history:
</span></span><span style="display:flex;"><span>    print(h[<span style="color:#e6db74">&#34;event&#34;</span>], h[<span style="color:#e6db74">&#34;old_memory&#34;</span>], <span style="color:#e6db74">&#34;-&gt;&#34;</span>, h[<span style="color:#e6db74">&#34;new_memory&#34;</span>])
</span></span></code></pre></div><p>The <code>m.history()</code> call returns the full audit trail for any memory entry, which is important for debugging unexpected agent behavior and for compliance requirements in regulated environments.</p>
<h2 id="integrating-mem0-with-langchain-autogen-and-custom-agents">Integrating Mem0 with LangChain, AutoGen, and Custom Agents</h2>
<p>Mem0 ships first-party integrations for LangChain and AutoGen, and the REST API makes it straightforward to integrate with any custom agent architecture. Rather than relying on a framework wrapper, the pattern below wires Mem0 into a LangChain pipeline directly with <code>MemoryClient</code>: retrieve relevant memories, inject them into the prompt, and persist each exchange after the model responds. This is the same retrieve-inject-persist loop that memory integrations generally implement under the hood:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> mem0.integrations.langchain <span style="color:#f92672">import</span> ZepMemory  <span style="color:#75715e"># LangChain-compatible wrapper</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> langchain_openai <span style="color:#f92672">import</span> ChatOpenAI
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> langchain.chains <span style="color:#f92672">import</span> ConversationChain
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Mem0 as a LangChain message history backend</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> mem0 <span style="color:#f92672">import</span> MemoryClient
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> langchain.memory <span style="color:#f92672">import</span> ConversationBufferMemory
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> MemoryClient(api_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;your-api-key&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">get_memory_context</span>(user_id: str, query: str) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;&#34;&#34;Retrieve relevant memories as a formatted string for LangChain prompts.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    results <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>search(query, user_id<span style="color:#f92672">=</span>user_id, limit<span style="color:#f92672">=</span><span style="color:#ae81ff">10</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> <span style="color:#f92672">not</span> results:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> <span style="color:#e6db74">&#34;No prior context.&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> <span style="color:#e6db74">&#34;</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">&#34;</span><span style="color:#f92672">.</span>join([<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;- </span><span style="color:#e6db74">{</span>r[<span style="color:#e6db74">&#39;memory&#39;</span>]<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span> <span style="color:#66d9ef">for</span> r <span style="color:#f92672">in</span> results])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>llm <span style="color:#f92672">=</span> ChatOpenAI(model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">run_agent_turn</span>(user_id: str, user_message: str) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Pull relevant memory</span>
</span></span><span style="display:flex;"><span>    memory_context <span style="color:#f92672">=</span> get_memory_context(user_id, user_message)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Build prompt with memory context injected</span>
</span></span><span style="display:flex;"><span>    messages <span style="color:#f92672">=</span> [
</span></span><span style="display:flex;"><span>        (<span style="color:#e6db74">&#34;system&#34;</span>, <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;You are a helpful assistant.</span><span style="color:#ae81ff">\n\n</span><span style="color:#e6db74">What you know about this user:</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">{</span>memory_context<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>),
</span></span><span style="display:flex;"><span>        (<span style="color:#e6db74">&#34;human&#34;</span>, user_message)
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    response <span style="color:#f92672">=</span> llm<span style="color:#f92672">.</span>invoke(messages)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Persist this interaction to memory</span>
</span></span><span style="display:flex;"><span>    client<span style="color:#f92672">.</span>add([
</span></span><span style="display:flex;"><span>        {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: user_message},
</span></span><span style="display:flex;"><span>        {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;assistant&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: response<span style="color:#f92672">.</span>content}
</span></span><span style="display:flex;"><span>    ], user_id<span style="color:#f92672">=</span>user_id)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> response<span style="color:#f92672">.</span>content
</span></span></code></pre></div><p>For AutoGen (AG2), Mem0 integrates as a custom <code>ConversableAgent</code> that intercepts message history and injects retrieved memories into the system prompt:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> autogen <span style="color:#f92672">import</span> ConversableAgent
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> mem0 <span style="color:#f92672">import</span> MemoryClient
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> MemoryClient(api_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;your-api-key&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">Mem0Agent</span>(ConversableAgent):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> __init__(self, user_id: str, <span style="color:#f92672">*</span>args, <span style="color:#f92672">**</span>kwargs):
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>mem0_user_id <span style="color:#f92672">=</span> user_id
</span></span><span style="display:flex;"><span>        super()<span style="color:#f92672">.</span>__init__(<span style="color:#f92672">*</span>args, <span style="color:#f92672">**</span>kwargs)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">generate_reply</span>(self, messages, sender<span style="color:#f92672">=</span><span style="color:#66d9ef">None</span>, <span style="color:#f92672">**</span>kwargs):
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Inject memory context before generating reply</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> messages:
</span></span><span style="display:flex;"><span>            last_user_msg <span style="color:#f92672">=</span> messages[<span style="color:#f92672">-</span><span style="color:#ae81ff">1</span>]<span style="color:#f92672">.</span>get(<span style="color:#e6db74">&#34;content&#34;</span>, <span style="color:#e6db74">&#34;&#34;</span>)
</span></span><span style="display:flex;"><span>            memories <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>search(last_user_msg, user_id<span style="color:#f92672">=</span>self<span style="color:#f92672">.</span>mem0_user_id, limit<span style="color:#f92672">=</span><span style="color:#ae81ff">8</span>)
</span></span><span style="display:flex;"><span>            memory_text <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">&#34;</span><span style="color:#f92672">.</span>join([m[<span style="color:#e6db74">&#34;memory&#34;</span>] <span style="color:#66d9ef">for</span> m <span style="color:#f92672">in</span> memories])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># Prepend memory context to the message list</span>
</span></span><span style="display:flex;"><span>            memory_message <span style="color:#f92672">=</span> {
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;system&#34;</span>,
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Relevant context from prior sessions:</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">{</span>memory_text<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>
</span></span><span style="display:flex;"><span>            }
</span></span><span style="display:flex;"><span>            messages <span style="color:#f92672">=</span> [memory_message] <span style="color:#f92672">+</span> list(messages)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        reply <span style="color:#f92672">=</span> super()<span style="color:#f92672">.</span>generate_reply(messages, sender, <span style="color:#f92672">**</span>kwargs)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Store this exchange</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> reply <span style="color:#f92672">and</span> messages:
</span></span><span style="display:flex;"><span>            client<span style="color:#f92672">.</span>add([
</span></span><span style="display:flex;"><span>                {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: messages[<span style="color:#f92672">-</span><span style="color:#ae81ff">1</span>]<span style="color:#f92672">.</span>get(<span style="color:#e6db74">&#34;content&#34;</span>, <span style="color:#e6db74">&#34;&#34;</span>)},
</span></span><span style="display:flex;"><span>                {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;assistant&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: reply}
</span></span><span style="display:flex;"><span>            ], user_id<span style="color:#f92672">=</span>self<span style="color:#f92672">.</span>mem0_user_id)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> reply
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>agent <span style="color:#f92672">=</span> Mem0Agent(
</span></span><span style="display:flex;"><span>    user_id<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;alice&#34;</span>,
</span></span><span style="display:flex;"><span>    name<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;MemoryAgent&#34;</span>,
</span></span><span style="display:flex;"><span>    system_message<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;You are a helpful coding assistant.&#34;</span>,
</span></span><span style="display:flex;"><span>    llm_config<span style="color:#f92672">=</span>{<span style="color:#e6db74">&#34;model&#34;</span>: <span style="color:#e6db74">&#34;gpt-4o&#34;</span>}
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><p>For custom agents built on raw LLM API calls, the pattern is the same: call <code>client.search()</code> before the LLM call, inject results into the system prompt, and call <code>client.add()</code> after the response is generated.</p>
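<p>A compact sketch of that loop against the raw OpenAI SDK follows. The model name and helper names are illustrative, but the sequence (search before the call, inject into the system prompt, add after the response) is the pattern described above:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from openai import OpenAI
from mem0 import MemoryClient

llm = OpenAI()
memory = MemoryClient(api_key="your-api-key")

def chat_turn(user_id: str, user_message: str) -&gt; str:
    # 1. Search memory before the LLM call
    memories = memory.search(user_message, user_id=user_id, limit=8)
    context = "\n".join(f"- {r['memory']}" for r in memories)

    # 2. Inject retrieved memories into the system prompt
    response = llm.chat.completions.create(
        model="gpt-4o",  # any chat-completion model works here
        messages=[
            {"role": "system", "content": f"Known facts about this user:\n{context}"},
            {"role": "user", "content": user_message},
        ],
    )
    answer = response.choices[0].message.content

    # 3. Persist the exchange after the response is generated
    memory.add([
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": answer},
    ], user_id=user_id)
    return answer
</code></pre></div>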
<h2 id="mem0-pricing-free-tier-pro-and-when-the-graph-features-matter">Mem0 Pricing: Free Tier, Pro, and When the Graph Features Matter</h2>
<p>Mem0&rsquo;s pricing has three tiers that map cleanly to deployment stage: Free, Developer at $19/month, and Pro at $249/month. The Free tier gives you 10,000 memories with vector search, full SDK access, and no credit card required — enough to validate a use case and run a non-trivial pilot. The Developer tier at $19/month raises the ceiling to 50,000 memories and adds priority support. The Pro tier at $249/month is where graph memory features unlock: unlimited memories, knowledge graph queries, multi-hop entity resolution, and the SOC 2 Type II and HIPAA compliance documentation required for enterprise procurement. The key decision point is whether your use case requires graph memory. If you are building a personalization layer — storing user preferences, past conversation context, behavioral patterns — the vector tier at Developer pricing handles it efficiently. If you are building an agent that needs to reason about relationships between entities (customer X has contract with vendor Y, who has a known issue with product Z), the graph features in Pro become load-bearing. HIPAA compliance is a hard requirement for any healthcare-adjacent deployment; Mem0 Cloud Pro is one of the few managed memory services with the compliance documentation to satisfy enterprise security reviews. Self-hosting the open-source library is always an option for teams that cannot send data to a third-party cloud regardless of compliance certifications.</p>
<table>
  <thead>
      <tr>
          <th>Tier</th>
          <th>Price</th>
          <th>Memories</th>
          <th>Graph Memory</th>
          <th>Compliance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Free</td>
          <td>$0</td>
          <td>10K</td>
          <td>No</td>
          <td>—</td>
      </tr>
      <tr>
          <td>Developer</td>
          <td>$19/mo</td>
          <td>50K</td>
          <td>No</td>
          <td>—</td>
      </tr>
      <tr>
          <td>Pro</td>
          <td>$249/mo</td>
          <td>Unlimited</td>
          <td>Yes</td>
          <td>SOC 2 Type II, HIPAA</td>
      </tr>
  </tbody>
</table>
<h2 id="mem0-vs-zep-vs-langgraph-store-when-to-use-each">Mem0 vs Zep vs LangGraph Store: When to Use Each</h2>
<p>Mem0, Zep, and LangGraph Store solve overlapping but distinct problems, and choosing the wrong one creates technical debt that compounds over time. Mem0 leads on developer experience and deployment speed — the API is simple, the free tier is generous, the SDK covers Python and TypeScript, and you can be in production in hours rather than days. Zep&rsquo;s Graphiti engine scores 63.8% on LongMemEval versus Mem0&rsquo;s 49.0%, a nearly 15-point gap driven largely by temporal fact tracking: Zep stores when facts became true and when they stopped being true, which is essential for any use case where the world changes and historical accuracy matters. LangGraph Store is the right answer if your team is already committed to LangGraph for orchestration — it shares state management with your graph nodes, reduces the number of moving parts, and the integration overhead is near zero. The decision framework: pick Mem0 if you need personalization memory deployed quickly across any LLM or framework. Pick Zep if your agent needs to reason about entities whose facts change over time (regulatory compliance, product catalogs, organizational structures). Pick LangGraph Store if your agent is a LangGraph graph and keeping the stack homogeneous matters more than specialized memory features.</p>
<table>
  <thead>
      <tr>
          <th>Criterion</th>
          <th>Mem0</th>
          <th>Zep</th>
          <th>LangGraph Store</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LongMemEval score</td>
          <td>49.0%</td>
          <td>63.8%</td>
          <td>Not published</td>
      </tr>
      <tr>
          <td>Temporal fact tracking</td>
          <td>Limited</td>
          <td>Native (Graphiti)</td>
          <td>Limited</td>
      </tr>
      <tr>
          <td>Developer experience</td>
          <td>Excellent</td>
          <td>Good</td>
          <td>Good (within LangGraph)</td>
      </tr>
      <tr>
          <td>Standalone deployment</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>LangGraph-only</td>
      </tr>
      <tr>
          <td>Graph memory</td>
          <td>Pro tier</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Open-source</td>
          <td>Yes (MIT)</td>
          <td>Yes (Apache 2.0)</td>
          <td>Yes (MIT)</td>
      </tr>
      <tr>
          <td>Best fit</td>
          <td>Personalization, fast deployment</td>
          <td>Entity-heavy, temporal facts</td>
          <td>LangGraph-first teams</td>
      </tr>
  </tbody>
</table>
<p>For teams evaluating both Mem0 and Zep, the practical test is to run LongMemEval on your own data. The benchmark uses multi-session conversation histories that require temporal fact retrieval — if your use case resembles the benchmark (customer support, personal assistants, research agents), Zep&rsquo;s 15-point lead will show up in production metrics. If your use case is simpler (preference retrieval, conversation summarization), Mem0&rsquo;s lower complexity and better developer experience often win.</p>
<h2 id="production-considerations-performance-privacy-and-compliance">Production Considerations: Performance, Privacy, and Compliance</h2>
<p>Running Mem0 in production surfaces three categories of concern that are easy to overlook during prototyping: latency budgets, data residency, and memory quality degradation. On latency: Mem0 Cloud&rsquo;s vector search returns in 10–50ms for typical memory stores under 100K entries; graph queries run 80–200ms for single-hop lookups and 200–500ms for multi-hop. If your agent is latency-sensitive, kick off memory retrieval as an async task so it overlaps with other per-request work (authentication, moderation, tool routing) rather than adding its full latency to a sequential pipeline. The <code>add()</code> call — which runs LLM-based fact extraction and deduplication — takes 300–800ms and should always happen asynchronously after the response is returned to the user, never in the hot path.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> asyncio
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> mem0 <span style="color:#f92672">import</span> AsyncMemoryClient
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> AsyncMemoryClient(api_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;your-api-key&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">async</span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">agent_turn</span>(user_id: str, user_message: str) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Retrieve memory and call LLM in parallel</span>
</span></span><span style="display:flex;"><span>    memory_task <span style="color:#f92672">=</span> asyncio<span style="color:#f92672">.</span>create_task(
</span></span><span style="display:flex;"><span>        client<span style="color:#f92672">.</span>search(user_message, user_id<span style="color:#f92672">=</span>user_id, limit<span style="color:#f92672">=</span><span style="color:#ae81ff">8</span>)
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Start LLM call with placeholder context, or await memory first</span>
</span></span><span style="display:flex;"><span>    memories <span style="color:#f92672">=</span> <span style="color:#66d9ef">await</span> memory_task
</span></span><span style="display:flex;"><span>    context <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">&#34;</span><span style="color:#f92672">.</span>join([m[<span style="color:#e6db74">&#34;memory&#34;</span>] <span style="color:#66d9ef">for</span> m <span style="color:#f92672">in</span> memories])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Run LLM (simplified)</span>
</span></span><span style="display:flex;"><span>    response <span style="color:#f92672">=</span> <span style="color:#66d9ef">await</span> call_llm(user_message, context)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Store the exchange asynchronously — don&#39;t await in hot path</span>
</span></span><span style="display:flex;"><span>    asyncio<span style="color:#f92672">.</span>create_task(
</span></span><span style="display:flex;"><span>        client<span style="color:#f92672">.</span>add([
</span></span><span style="display:flex;"><span>            {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: user_message},
</span></span><span style="display:flex;"><span>            {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;assistant&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: response}
</span></span><span style="display:flex;"><span>        ], user_id<span style="color:#f92672">=</span>user_id)
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> response
</span></span></code></pre></div><p>On data residency: Mem0 Cloud stores data on AWS infrastructure. If your use case involves EU user data, verify that Mem0&rsquo;s data processing agreement covers GDPR requirements for your jurisdiction. For regulated data (PHI under HIPAA, financial data under SOC 2 requirements), the Pro tier&rsquo;s compliance certifications are the starting point — not the ending point. Review the shared responsibility model, configure data deletion hooks for right-to-erasure requests, and audit which LLM provider Mem0 uses for its fact extraction pipeline (OpenAI by default; this can be configured to Anthropic or local models for data residency reasons).</p>
<p>On memory quality: memory stores degrade when agents add low-quality or contradictory entries at high volume. Three mitigations work well in practice. First, filter what you add — not every message turn deserves to be stored. Use a lightweight classifier (even a simple regex or keyword filter) to identify turns with extractable facts before calling <code>m.add()</code>. Second, set TTLs on session-scoped memories to prevent stale task context from polluting future sessions. Third, review memory stores periodically using <code>m.get_all()</code> and build a manual curation interface for production deployments where memory quality directly affects user outcomes.</p>
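<p>A minimal sketch of the first mitigation: a cheap keyword gate in front of <code>m.add()</code>. The pattern list is illustrative and would be tuned per domain, or replaced with a small classifier; the point is that the gate runs before any extraction LLM call is spent:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import re

# Naive gate: only persist turns that plausibly contain a durable fact.
# These patterns are illustrative; tune or replace them for your domain.
FACT_PATTERNS = re.compile(
    r"\b(prefer|always|never|my name|we decided|we use|our team|convention)\b",
    re.IGNORECASE,
)

def maybe_remember(m, text: str, user_id: str) -&gt; bool:
    """Store the turn only if it looks like it contains an extractable fact."""
    if not FACT_PATTERNS.search(text):
        return False  # skip acknowledgements, small talk, throwaway turns
    m.add(text, user_id=user_id)
    return True

maybe_remember(m, "thanks, looks good!", user_id="alice")        # skipped
maybe_remember(m, "we decided to use Postgres", user_id="alice") # stored
</code></pre></div>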
<hr>
<h2 id="faq">FAQ</h2>
<p><strong>Q: Does Mem0 work with local LLMs like Ollama or llama.cpp?</strong></p>
<p>Yes. Mem0&rsquo;s fact extraction and deduplication pipeline uses a configurable LLM. You can point it at any OpenAI-compatible endpoint, including Ollama running locally. Set the <code>llm</code> config block in your <code>Memory()</code> constructor to use <code>openai</code> provider with a custom <code>base_url</code> and <code>model</code> matching your local server. Note that fact extraction quality depends on the LLM&rsquo;s instruction-following capability; models under 7B parameters often produce lower-quality extractions.</p>
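<p>A sketch of that config, assuming a local Ollama server on its default port. The exact key names (<code>openai_base_url</code>, the embedder block) vary across mem0 versions, so treat this as a starting point and check the config reference for the release you run:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from mem0 import Memory

# Route fact extraction through Ollama's OpenAI-compatible endpoint.
config = {
    "llm": {
        "provider": "openai",
        "config": {
            "model": "llama3.1:8b",                         # local model tag
            "openai_base_url": "http://localhost:11434/v1",
            "api_key": "ollama",                            # dummy; Ollama ignores it
        },
    },
    # Embeddings must also be generated locally if data cannot leave the box
    "embedder": {
        "provider": "ollama",
        "config": {"model": "nomic-embed-text"},
    },
}

m = Memory.from_config(config)
m.add("User prefers fully local inference", user_id="alice")
</code></pre></div>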
<p><strong>Q: How does Mem0 handle PII in stored memories?</strong></p>
<p>Mem0 does not automatically redact PII before storage. If users say &ldquo;my social security number is X,&rdquo; that fact will be extracted and stored. For production deployments handling PII, implement a pre-processing step that runs a PII detection library (Microsoft Presidio is a common choice) on user input before calling <code>m.add()</code>. On the cloud tier, configure data deletion webhooks to handle right-to-erasure requests from users.</p>
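<p>A minimal sketch of that pre-processing step with Presidio. The entity list is illustrative; production deployments usually widen it and add custom recognizers:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def redact_then_remember(m, text: str, user_id: str) -&gt; None:
    # Detect common PII types before anything reaches the memory store
    findings = analyzer.analyze(
        text=text,
        entities=["US_SSN", "CREDIT_CARD", "EMAIL_ADDRESS", "PHONE_NUMBER"],
        language="en",
    )
    # Default anonymizer replaces each span with a placeholder like &lt;US_SSN&gt;
    redacted = anonymizer.anonymize(text=text, analyzer_results=findings).text
    m.add(redacted, user_id=user_id)

redact_then_remember(m, "My email is alice@example.com", user_id="alice")
</code></pre></div>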
<p><strong>Q: What happens when two agents access the same user&rsquo;s memories simultaneously?</strong></p>
<p>Mem0 Cloud handles concurrent reads safely — multiple agents can call <code>m.search()</code> for the same user simultaneously without conflict. Concurrent writes to the same user&rsquo;s memory store are serialized internally to prevent race conditions in the deduplication pipeline, which means two simultaneous <code>m.add()</code> calls for the same user will queue rather than execute in parallel. In high-throughput scenarios, batch your <code>add()</code> calls or use a queue to control write concurrency.</p>
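<p>One way to implement that queue: a single background writer task that drains <code>add()</code> calls in order. This is a hypothetical sketch of application-side plumbing, not a Mem0 feature:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import asyncio
from mem0 import AsyncMemoryClient

client = AsyncMemoryClient(api_key="your-api-key")
write_queue: asyncio.Queue = asyncio.Queue()

async def memory_writer():
    """Single consumer: serializes writes so handlers never block on add()."""
    while True:
        messages, user_id = await write_queue.get()
        try:
            await client.add(messages, user_id=user_id)
        finally:
            write_queue.task_done()

async def record_exchange(user_message: str, reply: str, user_id: str):
    # Enqueue instead of awaiting add() in the request path
    await write_queue.put((
        [{"role": "user", "content": user_message},
         {"role": "assistant", "content": reply}],
        user_id,
    ))

# At application startup, run the writer once:
#     asyncio.create_task(memory_writer())
</code></pre></div>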
<p><strong>Q: Can I migrate from Mem0 Cloud to self-hosted without losing memories?</strong></p>
<p>Yes. Use <code>m.get_all()</code> with pagination to export all memories to JSON, then re-import them into a self-hosted Mem0 instance using <code>m.add()</code>. The memory IDs will change (the self-hosted instance generates new IDs), so update any external references accordingly. Mem0 does not currently offer a native export/import CLI command, so the migration script needs to be written manually — it is around 30 lines of Python.</p>
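<p>A sketch of that migration script. Whether the cloud <code>get_all()</code> call needs pagination parameters for very large stores is an assumption here; check the current API reference before running it against production data:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import json
from mem0 import Memory, MemoryClient

cloud = MemoryClient(api_key="your-api-key")
local = Memory()  # self-hosted instance with your own backends configured

def migrate_user(user_id: str) -&gt; int:
    # Export every memory for this user (paginate here if the store is large;
    # the exact pagination kwargs are an assumption -- check the API docs)
    exported = cloud.get_all(user_id=user_id)
    with open(f"backup-{user_id}.json", "w") as f:
        json.dump(exported, f)  # keep a backup before re-importing

    # Re-import: the self-hosted instance generates new memory IDs,
    # so update any external references afterward
    for entry in exported:
        local.add(entry["memory"], user_id=user_id)
    return len(exported)

print(migrate_user("alice"))
</code></pre></div>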
<p><strong>Q: How does Mem0 compare to simply storing chat history in a database and summarizing it?</strong></p>
<p>Naive chat history + summarization works at small scale and fails in three ways at production scale. Summarization is lossy — facts get dropped or distorted as conversation length grows. Summaries don&rsquo;t support semantic retrieval — you can&rsquo;t efficiently answer &ldquo;what did this user say about their database preferences?&rdquo; against a summary blob. And the token cost of passing full summaries into every LLM call grows linearly with session count. Mem0&rsquo;s approach — discrete extracted facts with semantic retrieval and adaptive deduplication — solves all three problems. The tradeoff is that Mem0 adds an LLM call per <code>add()</code> operation (for fact extraction), whereas raw summary storage does not. For most use cases, the token savings from targeted retrieval more than offset the extraction overhead.</p>
]]></content:encoded></item></channel></rss>