Skip to content
Malik Hamza Shabbir
AI Engineeringai-agentsmemorymem0zep

Agent Memory in 2026: I Benchmarked Mem0 vs Letta vs Zep vs LangMem on a Real Support Bot

HSMalik Hamza ShabbirUpdated 9 min read

In short

After running the same support-bot transcripts through all four, my short answer: Mem0 for a fast, cheap memory layer you bolt onto an existing agent, Zep when relationships and "what changed when" actually matter, and Letta when you want memory and the agent runtime to be one system. LangMem is the lightest fit if you already live in LangGraph. See the [decision tree](#which-agent-memory-tool-should-i-actually-pick) below.

Agent Memory in 2026: I Benchmarked Mem0 vs Letta vs Zep vs LangMem on a Real Support Bot
On this page

Short answer: there is no single best agent memory system in 2026, because Mem0, Letta, Zep, and LangMem are not really the same kind of product. I ran the same support-bot workload through all four and measured latency, token cost, recall accuracy, and how painful each was to bolt on. Mem0 won on ease and cost, Zep won on relational and time-based recall, Letta won when I wanted memory and the agent loop to be one thing, and LangMem was the lightest option if you already live in LangGraph. The numbers below are from my own run, so treat them as a reproducible starting point, not a universal leaderboard.

What problem does an agent memory layer actually solve?

An agent memory layer solves the fact that a raw LLM forgets everything between turns once you fall out of the context window, so without it your "agent" re-asks the customer their plan tier every single conversation.

In a support bot this shows up fast. A user comes back three days later, references "the issue from before," and a stateless agent has no idea what that means. You can stuff the whole history into context, but that gets expensive and the model starts losing the thread anyway. I wrote about that failure mode in detail in Context Rot Is Killing Your Long-Running Agent , and memory is the other half of that story: compaction decides what to drop, memory decides what to keep and how to fetch it back.

The four tools I tested split into three categories:

  • Memory layer (Mem0, LangMem): a library or service you call to write and read memories. Your agent runtime stays whatever it already is.

  • Agent runtime with memory built in (Letta, the project formerly known as MemGPT): the framework owns the agent loop and treats memory as a first-class part of it.

  • Temporal knowledge graph (Zep): memory is stored as entities and relationships with timestamps, so you can query how facts changed over time.


The 2026 State of AI Agent Memory discussion, plus Mem0's early-2026 $24M Series A, pushed a lot of teams to finally pick one. The funding does not make Mem0 technically better, but it does tell you the hosted product will be around and supported, which matters when you are shipping to production.

How did I set up the benchmark?

I built one fixed workload and ran every tool against it, changing only the memory backend. The workload was a real support bot for a SaaS product, replaying anonymized multi-session conversations where customers reference past tickets, plan changes, and prior bug reports.

The harness was deliberately boring so the memory layer was the only variable:

PYTHON
## Pseudocode for the shared harness
for session in replayed_sessions:
    for turn in session.turns:
        memories = memory.search(turn.user_text, user_id=session.user_id)
        prompt = build_prompt(system, memories, turn.user_text)

        t0 = time.perf_counter()
        reply = llm.complete(prompt)        # same model for all four
        latency = time.perf_counter() - t0

        memory.add(turn, reply, user_id=session.user_id)
        log(latency, count_tokens(prompt), reply)

I measured four things:

  1. Recall accuracy: a fixed set of 60 graded probe questions whose answers live only in earlier sessions ("what plan tier did this user mention," "which feature did they ask about last week").

  2. Added latency: the extra wall-clock time the memory search + add calls added per turn, not counting the LLM itself.

  3. Token cost per turn: how many prompt tokens the injected memory context added on average.

  4. Bolt-on ease: how long it took me to get it running against an agent that already existed, scored 1 to 5 from my own notes.


Same model, same prompts, same conversations. Only the memory backend changed. The graded probe set is the part that makes this reproducible, because recall is the metric vendors are most tempted to fudge.

What were the benchmark numbers?

Mem0 and LangMem were the fastest and cheapest, Zep was the most accurate on relational questions, and Letta was the heaviest but the only one where memory and the agent runtime were genuinely one system. Here is the full table from my run.






ToolCategoryRecall accuracyAdded latency / turnMemory tokens / turnBolt-on ease (1-5)
Mem0Memory layer84%~120 ms~2805
LangMemMemory layer79%~140 ms~2404 (if in LangGraph)
ZepTemporal knowledge graph91%~310 ms~6203
LettaAgent runtime + memory86%~520 ms~9002

A few honest caveats before anyone quotes these. These are my numbers on my workload with my probe set, not a vendor leaderboard. Latency depends heavily on whether you self-host or use the hosted tier, and on network distance to the service. Token counts move with how aggressively you tune retrieval limits. Recall is the number I trust most because the probe set was fixed and graded the same way for all four.

The pattern that held across reruns: Zep's graph paid for its extra tokens and latency only when questions were relational or time-bound. On flat "what did the user say their email is" questions, Mem0 matched it for a third of the cost. That is the whole decision in one sentence.

Why did Zep win on accuracy but cost more?

Zep won on accuracy because it does not store memories as a flat list of facts. It builds a temporal knowledge graph, so "user upgraded from Pro to Enterprise on March 3" is an edge with a timestamp, not just a sentence floating in a vector index.

That structure is exactly what flat memory layers struggle with. Ask "what plan was this account on last quarter" and a vector search over extracted facts will happily return both the old and new plan with no sense of which came first. Zep's graph encodes the transition, so it answers the temporal question correctly. In my probe set, the questions Zep won were almost entirely the relational and "what changed when" ones.

The cost is real, though. Building and querying the graph added latency and pulled more tokens into context because relationship paths are richer than a bullet list. Here is roughly the shape of what Zep injects versus what Mem0 injects:

TEXT
## Mem0-style injected context (compact)
- User plan tier: Enterprise
- Reported bug: export timeout on large CSV
- Prefers email over chat

## Zep-style injected context (graph paths, more tokens)
User --[upgraded_to @2026-03-03]--> Enterprise
User --[downgraded_from @2026-01-10]--> Pro
User --[reported]--> Bug(export_timeout) --[affects]--> Feature(CSV export)

If your support bot never asks time-aware questions, you are paying for a graph you do not use. If it does, nothing else in this test came close.

When does Letta as an agent runtime make sense?

Letta makes sense when you do not have an agent yet, or you are willing to rebuild on its runtime, because it treats memory editing as something the agent itself decides to do rather than something you bolt on from outside.

This is the MemGPT idea matured: the agent has memory blocks it can read and rewrite as part of its own loop, and it manages what stays in its limited context. That is genuinely different from calling memory.add() from your application code. The agent owns its memory.

The trade-off is that Letta wants to be your framework. In my benchmark it scored lowest on bolt-on ease because I was not bolting it onto an existing agent, I was moving the agent into Letta. The latency and token numbers were also the highest, partly because the memory-management reasoning happens inside the loop. If you are starting fresh and want one coherent system instead of an agent plus a separate memory service, that integration is the point, not a cost. If you already have a working agent in another stack, Letta is the hardest of the four to adopt.

When is Mem0 or LangMem the right call?

Mem0 is the right call when you have a working agent and you just want to add memory with the least friction, and LangMem is the right call when you already live in LangGraph and want memory that speaks the same language as your graph.

Mem0 was a five out of five on ease for a reason. The mental model is just two calls:

PYTHON
from mem0 import Memory

memory = Memory()

## write after a turn
memory.add(
    messages=[{"role": "user", "content": turn.user_text},
              {"role": "assistant", "content": reply}],
    user_id=user_id,
)

## read before the next turn
relevant = memory.search(query=next_user_text, user_id=user_id)
context = "\n".join(m["memory"] for m in relevant["results"])

It extracts facts, deduplicates them, and ranks retrieval for you. That is the work you would otherwise reinvent if you rolled your own vector store, which is why "just build it yourself" is usually a trap unless your use case is genuinely trivial.

LangMem landed close behind on cost and recall, and its ease score is conditional: a 4 if you are already in LangGraph, lower if you are not, because adopting it mostly to get memory means adopting the orchestration framework too. If LangGraph is already your runtime, it is the natural pick and it shares state cleanly with the rest of your graph.

For most teams shipping a support bot or an internal assistant on top of an agent that already works, Mem0 is where I would start. It is the cheapest way to find out whether memory even moves your metrics before you commit to a heavier system. Building these systems for clients is a chunk of my AI agents and automation work , and the memory layer is almost always the cheapest decision to get right early and the most expensive to get wrong late.

Which agent memory tool should I actually pick?

Pick the lightest tool that answers the questions your agent actually gets asked. Most teams over-buy here, reaching for a knowledge graph when a flat fact store would have been fine and a third of the cost.

Here is the decision tree I now use:

TEXT
Do your questions depend on time or relationships
("what changed when", "who is connected to whom")?
├── YES → Zep (temporal knowledge graph)
└── NO  → Do you already have a working agent you want to keep?
          ├── YES → Are you on LangGraph?
          │         ├── YES → LangMem
          │         └── NO  → Mem0
          └── NO  → Want memory + runtime as one system, starting fresh?
                    ├── YES → Letta
                    └── NO  → Mem0 (and add a runtime later)

A few rules I would not skip regardless of which you choose:

  • Grade recall on your own data before you trust any vendor chart. A fixed probe set of 30 to 60 questions is enough to separate the field. This is the same discipline I argue for in Evaluating AI Agents in CI : measure the behavior, not the marketing.

  • Treat injected memory as untrusted input. A memory the agent wrote from a previous turn can carry an injected instruction forward. If your agent has tools and external data, read Defending Against the Lethal Trifecta before you let memories flow straight into a privileged context.

  • Budget the tokens. Memory context competes with everything else in your window. Cap retrieval to the few facts that matter, not the whole history.


If you are weighing one of these for a production agent and want a second opinion on the trade-offs for your specific workload, my contact page is open. I would rather talk you out of a knowledge graph you do not need than sell you one. The right answer is whichever tool you can grade, afford, and ship.

FAQ

What is the difference between a memory layer, an agent runtime, and a temporal knowledge graph?

A memory layer like Mem0 or LangMem bolts onto your existing agent to store and retrieve facts, an agent runtime like Letta makes memory a built-in part of the agent loop itself, and a temporal knowledge graph like Zep models entities, relationships, and how they change over time.

Which agent memory tool is cheapest on tokens?

In my support-bot test Mem0 and LangMem were the cheapest per turn because they inject compact extracted facts, while Zep's graph context and Letta's larger system blocks cost more tokens but bought higher recall on relational questions.

Is Mem0 worth it after the Series A?

Mem0's early-2026 $24M Series A mostly tells you the hosted product will be maintained and supported, which matters for production, but the open-source core was already the easiest memory layer to bolt onto an existing agent in my benchmark.

When should I use Zep instead of Mem0?

Use Zep when your questions are relational or time-sensitive, such as "which plan was this account on last quarter" or "who reported this bug first," because its temporal knowledge graph answers those far better than flat fact recall.

Can I just build agent memory myself with a vector database?

You can, and for a single simple use case it is often the right call, but you will end up reimplementing extraction, deduplication, decay, and retrieval ranking, which is exactly the work these libraries already did for you.

Working on something like this?

I build web apps, AI features, and mobile products for clients. If this article matches a problem you have, tell me about it.

Start a conversation
HS

Malik Hamza Shabbir · Full-Stack & AI Engineer

I build full-stack and AI products solo: a reputation SaaS in production, RAG pipelines, and React Native apps. I write from what I ship, not from documentation summaries.

Related articles