Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Abstract

Large Language Models (LLMs) have demonstrated remarkable prowess in generating contextually coherent responses, yet their fixed context windows pose fundamental challenges for maintaining consistency over prolonged multi-session dialogues. We introduce Mem0, a scalable memory-centric architecture that addresses this issue by dynamically extracting, consolidating, and retrieving salient information from ongoing conversations. Building on this foundation, we further propose an enhanced variant that leverages graph-based memory representations to capture complex relational structures among conversational elements. Through comprehensive evaluations on the LOCOMO benchmark, we systematically compare our approaches against six baseline categories: (i) established memory-augmented systems, (ii) retrieval-augmented generation (RAG) with varying chunk sizes and k-values, (iii) a full-context approach that processes the entire conversation history, (iv) an open-source memory solution, (v) a proprietary model system, and (vi) a dedicated memory management platform. Empirical results demonstrate that our methods consistently outperform all existing memory systems across four question categories: single-hop, temporal, multi-hop, and open-domain. Notably, Mem0 achieves a 26% relative improvement in the LLM-as-a-Judge metric over OpenAI's memory feature, while Mem0 with graph memory achieves an overall score around 2% higher than the base Mem0 configuration. Beyond accuracy gains, we also markedly reduce computational overhead compared to the full-context approach. In particular, Mem0 attains a 91% lower p95 latency and saves more than 90% of the token cost, thereby offering a compelling balance between advanced reasoning capabilities and practical deployment constraints. Our findings highlight the critical role of structured, persistent memory mechanisms for long-term conversational coherence, paving the way for more reliable and efficient LLM-driven AI agents.

1. Introduction

Human memory is a foundation of intelligence—it shapes our identity, guides decision-making, and enables us to learn, adapt, and form meaningful relationships. Among its many roles, memory is essential for communication: we recall past interactions, infer preferences, and construct evolving mental models of those we engage with. This ability to retain and retrieve information over extended periods enables coherent, contextually rich exchanges that span days, weeks, or even months. AI agents, powered by large language models (LLMs), have made remarkable progress in generating fluent, contextually appropriate responses. However, these systems are fundamentally limited by their reliance on fixed context windows, which severely restrict their ability to maintain coherence over extended interactions.

This limitation stems from LLMs' lack of persistent memory mechanisms that can extend beyond their finite context windows. While humans naturally accumulate and organize experiences over time, forming a continuous narrative of interactions, AI systems cannot inherently persist information across separate sessions or after context overflow. The absence of persistent memory creates a fundamental disconnect in human-AI interaction. Without memory, AI agents forget user preferences, repeat questions, and contradict previously established facts.

Consider a simple example illustrated in Figure 1, where a user mentions being vegetarian and avoiding dairy products in an initial conversation. In a subsequent session, when the user asks about dinner recommendations, a system without persistent memory might suggest chicken, completely contradicting the established dietary preferences. In contrast, a system with persistent memory would maintain this critical user information across sessions and suggest appropriate vegetarian, dairy-free options. This common scenario highlights how memory failures can fundamentally undermine user experience and trust.

Beyond conversational settings, memory mechanisms have been shown to dramatically enhance agent performance in interactive environments. Agents equipped with memory of past experiences can better anticipate user needs, learn from previous mistakes, and generalize knowledge across tasks. Research demonstrates that memory-augmented agents improve decision-making by leveraging causal relationships between actions and outcomes, leading to more effective adaptation in dynamic scenarios. Hierarchical memory architectures and agentic memory systems capable of autonomous evolution have further shown that memory enables more coherent, long-term reasoning across multiple dialogue sessions.

Unlike humans, who dynamically integrate new information and revise outdated beliefs, LLMs effectively “reset” once information falls outside their context window. Even as models like OpenAI's GPT-4 (128K tokens), o1 (200K context), Anthropic's Claude 3.7 Sonnet (200K tokens), and Google's Gemini (at least 10M tokens) push the boundaries of context length, these improvements merely delay rather than solve the fundamental limitation.

In practical applications, even these extended context windows prove insufficient for two critical reasons. First, as meaningful human-AI relationships develop over weeks or months, conversation history inevitably exceeds even the most generous context limits. Second, and perhaps more importantly, real-world conversations rarely maintain thematic continuity. A user might mention dietary preferences (being vegetarian), then engage in hours of unrelated discussion about programming tasks, before returning to food-related queries about dinner options. In such scenarios, a full-context approach would need to reason through mountains of irrelevant information, with the critical dietary preferences potentially buried among thousands of tokens of coding discussions. Moreover, simply presenting longer contexts does not ensure effective retrieval or utilization of past information, as attention mechanisms degrade over distant tokens.

This limitation is particularly problematic in high-stakes domains such as healthcare, education, and enterprise support, where maintaining continuity and trust is crucial. To address these challenges, AI agents must adopt memory systems that go beyond static context extension. A robust AI memory should selectively store important information, consolidate related concepts, and retrieve relevant details when needed—mirroring human cognitive processes. By integrating such mechanisms, we can develop AI agents that maintain consistent personas, track evolving user preferences, and build upon prior exchanges. This shift will transform AI from transient, forgetful responders into reliable, long-term collaborators, fundamentally redefining the future of conversational intelligence.

In this paper, we address a fundamental limitation in AI systems: their inability to maintain coherent reasoning over extended conversations that span multiple sessions, which severely restricts meaningful long-term interactions with users. We introduce Mem0 (pronounced mem-zero), a novel memory architecture that dynamically captures, organizes, and retrieves salient information from ongoing conversations. Building on this foundation, we develop Mem0g, which enhances the base architecture with graph-based memory representations to better model complex relationships between conversational elements.

Our experimental results on the LOCOMO benchmark demonstrate that our approaches consistently outperform existing memory systems, including memory-augmented architectures, retrieval-augmented generation (RAG) methods, and both open-source and proprietary solutions, across diverse question types, while requiring significantly lower computational resources. Latency measurements further show that Mem0 attains a 91% lower p95 latency than the full-context approach, balancing sophisticated reasoning capabilities against practical deployment constraints. These contributions represent a meaningful step toward AI systems that can maintain coherent, context-aware conversations over extended durations, mirroring human communication patterns and opening new possibilities for applications in personal tutoring, healthcare, and personalized assistance.

2. Proposed Methods

We introduce two memory architectures for AI agents. (1) Mem0 implements a novel paradigm that extracts, evaluates, and manages salient information from conversations through dedicated modules for memory extraction and update. The system processes a pair of messages between either two user participants or a user and an assistant. (2) Mem0g extends this foundation by incorporating graph-based memory representations, where memories are stored as directed labeled graphs with entities as nodes and relationships as edges. This structure enables a deeper understanding of the connections between entities. By explicitly modeling both entities and their relationships, Mem0g supports more advanced reasoning across interconnected facts, especially for queries that require navigating complex relational paths across multiple memories.

2.1. Mem0

Our architecture follows an incremental processing paradigm, enabling it to operate seamlessly within ongoing conversations. As illustrated in Figure 2, the complete pipeline consists of two phases: extraction and update.

The extraction phase initiates upon ingestion of a new message pair $(m_{t-1}, m_t)$, where $m_t$ represents the current message and $m_{t-1}$ the preceding one. This pair typically consists of a user message and an assistant response, capturing a complete interaction unit. To establish appropriate context for memory extraction, the system employs two complementary sources: (1) a conversation summary $S$ retrieved from the database that encapsulates the semantic content of the entire conversation history, and (2) a sequence of recent messages $\{m_{t-m}, m_{t-m+1}, \ldots, m_{t-2}\}$ from the conversation history, where $m$ is a hyperparameter controlling the recency window.

To support context-aware memory extraction, we implement an asynchronous summary generation module that periodically refreshes the conversation summary. This component operates independently of the main processing pipeline, ensuring that memory extraction consistently benefits from up-to-date contextual information without introducing processing delays. While S provides global thematic understanding across the entire conversation, the recent message sequence offers granular temporal context that may contain relevant details not consolidated in the summary.

This dual contextual information, combined with the new message pair, forms a comprehensive prompt $P = (S, \{m_{t-m}, \ldots, m_{t-2}\}, m_{t-1}, m_t)$ for an extraction function $\phi$ implemented via an LLM. The function $\phi(P)$ then extracts a set of salient memories $\Omega = \{\omega_1, \omega_2, \ldots, \omega_n\}$ specifically from the new exchange while maintaining awareness of the conversation's broader context, resulting in candidate facts for potential inclusion in the knowledge base.
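
As a rough illustration of how the prompt could be assembled, the sketch below selects the recency window and the new message pair from a flat message history and renders them as text. The `Message` class, the function names, and the prompt wording are hypothetical and not taken from the Mem0 implementation; the call to the LLM itself is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Message:
    role: str      # e.g. "user" or "assistant"
    content: str


def select_context(history: List[Message], m: int = 10) -> Tuple[List[Message], List[Message]]:
    """Split the history into (recent window, new message pair).

    history[-1] plays the role of m_t and history[-2] of m_{t-1}; the recent
    window covers m_{t-m} .. m_{t-2}, i.e. at most m - 1 earlier messages.
    """
    if len(history) < 2:
        raise ValueError("need at least one complete message pair")
    n = len(history)
    recent = history[max(0, n - 1 - m): n - 2]
    new_pair = history[n - 2:]
    return recent, new_pair


def render(messages: List[Message]) -> str:
    """Format messages one per line as 'role: content'."""
    return "\n".join(f"{msg.role}: {msg.content}" for msg in messages)


def build_extraction_prompt(summary: str, history: List[Message], m: int = 10) -> str:
    """Render P = (S, recent messages, m_{t-1}, m_t) as plain text for an LLM."""
    recent, new_pair = select_context(history, m)
    return (
        f"Conversation summary:\n{summary}\n\n"
        f"Recent messages:\n{render(recent)}\n\n"
        f"New exchange (extract memories from this only):\n{render(new_pair)}"
    )
```

Keeping the new exchange in its own section of the prompt mirrors the statement that memories are extracted specifically from the new pair, with the summary and recent messages serving only as context.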

Following extraction, the update phase evaluates each candidate fact against existing memories to maintain consistency and avoid redundancy. This phase determines the appropriate memory management operation for each extracted fact $\omega_i \in \Omega$. Algorithm 1 in Appendix A illustrates this process. For each fact, the system first retrieves the top $s$ semantically similar memories from the database using vector embeddings. These retrieved memories, along with the candidate fact, are then presented to the LLM through a function-calling interface we refer to as a tool call.
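
The retrieval step can be sketched as a brute-force nearest-neighbour search. This is a minimal illustration that assumes cosine similarity over in-memory vectors; the text does not specify the similarity measure or the vector database used, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_s_similar(fact_vec: Sequence[float],
                  memory_vecs: Dict[str, Sequence[float]],
                  s: int = 10) -> List[Tuple[str, float]]:
    """Return the s stored memories most similar to the candidate fact."""
    scored = [(mem_id, cosine(fact_vec, vec)) for mem_id, vec in memory_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:s]
```

A production system would delegate this search to the vector database's index rather than scoring every stored memory.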

The LLM itself determines which of four distinct operations to execute: ADD for creation of new memories when no semantically equivalent memory exists; UPDATE for augmentation of existing memories with complementary information; DELETE for removal of memories contradicted by new information; and NOOP when the candidate fact requires no modification to the knowledge base. Rather than using a separate classifier, we leverage the LLM's reasoning capabilities to directly select the appropriate operation based on the semantic relationship between the candidate fact and existing memories. Following this determination, the system executes the provided operations, thereby maintaining knowledge base coherence and temporal consistency.
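
Once the LLM has chosen an operation, applying it is straightforward. The sketch below assumes a plain dictionary as the memory store and takes the chosen operation as an argument, since the tool call that produces it is not shown here; `Op` and `apply_operation` are illustrative names, not Mem0's API.

```python
import uuid
from enum import Enum
from typing import Dict, Optional


class Op(Enum):
    ADD = "ADD"
    UPDATE = "UPDATE"
    DELETE = "DELETE"
    NOOP = "NOOP"


def apply_operation(store: Dict[str, str], op: Op, fact: str,
                    target_id: Optional[str] = None) -> Optional[str]:
    """Apply one memory operation in place and return the affected memory id."""
    if op is Op.ADD:
        new_id = uuid.uuid4().hex      # fresh id for the new memory
        store[new_id] = fact
        return new_id
    if op is Op.NOOP:
        return None                    # knowledge base is left unchanged
    if target_id is None or target_id not in store:
        raise KeyError(f"{op.value} needs the id of an existing memory")
    if op is Op.UPDATE:
        store[target_id] = fact        # replace with the augmented text
    else:                              # Op.DELETE
        del store[target_id]
    return target_id
```

For example, an initial `ADD` of "User is vegetarian" followed by an `UPDATE` of the same id with "User is vegetarian and avoids dairy" leaves a single consolidated memory rather than two overlapping ones.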

In our experimental evaluation, we configured the system with m=10 previous messages for contextual reference and s=10 similar memories for comparative analysis. All language model operations utilized GPT-4o-mini as the inference engine. The vector database employs dense embeddings to facilitate efficient similarity search during the update phase.

2.2. Mem0g

The Mem0g pipeline, illustrated in Figure 3, implements a graph-based memory approach that effectively captures, stores, and retrieves contextual information from natural language interactions. In this framework, memories are represented as a directed labeled graph $G = (V, E, L)$, where:

  • Nodes V represent entities, e.g., Alice, San Francisco.
  • Edges E represent relationships between entities, e.g., lives_in.
  • Labels L assign semantic types to nodes, e.g., Alice - Person, San Francisco - City.

Each entity node $v \in V$ contains three components: (1) an entity type classification that categorizes the entity, e.g., Person, Location, Event, (2) an embedding vector $e_v$ that captures the entity's semantic meaning, and (3) metadata including a creation timestamp $t_v$. Relationships in our system are structured as triplets in the form $(v_s, r, v_d)$, where $v_s$ and $v_d$ are source and destination entity nodes, respectively, and $r$ is the labeled edge connecting them.
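
These definitions map naturally onto two small record types. The sketch below is one possible in-memory representation; the class and field names are hypothetical, and the embedding values are placeholders rather than real model output.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Tuple


@dataclass
class EntityNode:
    name: str
    entity_type: str                   # semantic type, e.g. "Person", "City"
    embedding: List[float]             # stands in for the embedding vector e_v
    created_at: datetime = field(      # stands in for the creation timestamp t_v
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class Relationship:
    source: EntityNode                 # v_s
    relation: str                      # edge label r, e.g. "lives_in"
    destination: EntityNode            # v_d

    def as_triplet(self) -> Tuple[str, str, str]:
        """Return the (source, relation, destination) names."""
        return (self.source.name, self.relation, self.destination.name)


alice = EntityNode("Alice", "Person", [0.1, 0.9])
san_francisco = EntityNode("San Francisco", "City", [0.8, 0.2])
edge = Relationship(alice, "lives_in", san_francisco)
print(edge.as_triplet())  # ('Alice', 'lives_in', 'San Francisco')
```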

The extraction process employs a two-stage pipeline leveraging LLMs to transform unstructured text into structured graph representations. First, an entity extractor module processes the input text to identify a set of entities along with their corresponding types. In our framework, entities represent the key information elements in conversations—including people, locations, objects, concepts, events, and attributes that merit representation in the memory graph. The entity extractor identifies these diverse information units by analyzing the semantic importance, uniqueness, and persistence of elements in the conversation. For instance, in a conversation about travel plans, entities might include destinations, transportation modes, dates, activities, and participant preferences—essentially any discrete information that could be relevant for future reference or reasoning.

Next, a relationship generator component derives meaningful connections between these entities, establishing a set of relationship triplets that capture the semantic structure of the information. This LLM-based module analyzes the extracted entities and their context within the conversation to identify semantically significant connections. It works by examining linguistic patterns, contextual cues, and domain knowledge to determine how entities relate to one another. For each potential entity pair, the generator evaluates whether a meaningful relationship exists and, if so, classifies this relationship with an appropriate label, e.g., lives_in, prefers, owns, happened_on. The module employs prompt engineering techniques that guide the LLM to reason about both explicit statements and implicit information in the dialogue, resulting in relationship triplets that form the edges in our memory graph and enable complex reasoning across interconnected information.

When integrating new information, Mem0g employs a sophisticated storage and update strategy. For each new relationship triplet, we compute embeddings for both source and destination entities, then search for existing nodes with semantic similarity above a defined threshold $t$. Depending on which nodes already exist, the system may create both nodes, create only one node, or reuse existing nodes before establishing the relationship with appropriate metadata. To maintain a consistent knowledge graph, we implement a conflict detection mechanism that identifies potentially conflicting existing relationships when new information arrives. An LLM-based update resolver determines whether certain relationships have become obsolete, marking them as invalid rather than physically removing them so that temporal reasoning remains possible.
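
The node-matching and soft-invalidation logic might look like the following sketch. It assumes cosine similarity, a default threshold of 0.9 chosen purely for illustration, and that the set of obsolete edges has already been decided elsewhere (in the paper, by the LLM-based update resolver); all names are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Edge:
    source: str
    relation: str
    destination: str
    valid: bool = True                 # flipped to False instead of deleting


@dataclass
class Graph:
    nodes: Dict[str, Sequence[float]] = field(default_factory=dict)  # name -> embedding
    edges: List[Edge] = field(default_factory=list)


def resolve_node(graph: Graph, name: str, vec: Sequence[float], threshold: float) -> str:
    """Reuse the most similar node scoring above threshold, else create one."""
    best_name, best_score = None, threshold
    for existing, existing_vec in graph.nodes.items():
        score = cosine(vec, existing_vec)
        if score > best_score:
            best_name, best_score = existing, score
    if best_name is None:
        graph.nodes[name] = vec
        return name
    return best_name


def add_triplet(graph: Graph, src: str, src_vec: Sequence[float], relation: str,
                dst: str, dst_vec: Sequence[float], threshold: float = 0.9,
                obsolete: Iterable[Edge] = ()) -> Edge:
    """Insert a relationship, marking edges judged obsolete as invalid."""
    for old_edge in obsolete:
        old_edge.valid = False         # kept in the graph for temporal reasoning
    edge = Edge(resolve_node(graph, src, src_vec, threshold), relation,
                resolve_node(graph, dst, dst_vec, threshold))
    graph.edges.append(edge)
    return edge
```

Because invalidated edges stay in `graph.edges`, a later query can still answer questions about where someone used to live while current-state queries filter on `valid`.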

The memory retrieval functionality in Mem0g implements a dual-approach strategy for optimal information access. The entity-centric method first identifies key entities within a query, then leverages semantic similarity to locate corresponding nodes in the knowledge graph. It systematically explores both incoming and outgoing relationships from these anchor nodes, constructing a comprehensive subgraph that captures relevant contextual information. Complementing this, the semantic triplet approach takes a more holistic view by encoding the entire query as a dense embedding vector. This query representation is then matched against textual encodings of each relationship triplet in the knowledge graph. The system calculates fine-grained similarity scores between the query and all available triplets, returning only those that exceed a configurable relevance threshold, ranked in order of decreasing similarity. This dual retrieval mechanism enables Mem0g to handle both targeted entity-focused questions and broader conceptual queries with equal effectiveness.
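
The two retrieval paths can be illustrated with a small sketch. It assumes that the query's anchor entities and all embeddings are supplied by other components, limits the entity-centric path to a one-hop neighbourhood, and uses cosine similarity; none of these choices is confirmed by the text, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

Triplet = Tuple[str, str, str]         # (source, relation, destination)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def entity_centric(triplets: List[Triplet], anchors: Set[str]) -> List[Triplet]:
    """Collect every incoming or outgoing relationship of the anchor nodes."""
    return [t for t in triplets if t[0] in anchors or t[2] in anchors]


def semantic_triplets(query_vec: Sequence[float],
                      triplet_vecs: Dict[Triplet, Sequence[float]],
                      threshold: float) -> List[Tuple[Triplet, float]]:
    """Score all triplets against the query, keep those above threshold,
    and return them in order of decreasing similarity."""
    scored = [(t, cosine(query_vec, vec)) for t, vec in triplet_vecs.items()]
    kept = [pair for pair in scored if pair[1] > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

The first function suits questions that name a specific entity, while the second needs no entity match at all and so can serve broader conceptual queries.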

From an implementation perspective, the system utilizes [Neo4j](https://neo4j.com/) as the underlying graph database. The LLM-based extractors and update module use GPT-4o-mini with function calling capabilities, allowing for structured extraction of information from unstructured text. By combining graph-based representations with semantic embeddings and LLM-based information extraction, Mem0g achieves both the structural richness needed for complex reasoning and the semantic flexibility required for natural language understanding.

3. Experimental Setup

3.1. Dataset

The LOCOMO dataset is designed to evaluate long-term conversational memory in dialogue systems. It comprises 10 extended conversations, each containing approximately 600 dialogues and 26,000 tokens on average, distributed across multiple sessions. Each conversation captures two individuals discussing daily experiences or past events. Each conversation is accompanied by about 200 questions on average, with corresponding ground truth answers. These questions are categorized into four types: single-hop, multi-hop, temporal, and open-domain. The dataset originally included an adversarial question category, designed to test a system's ability to recognize unanswerable questions. However, this category was excluded from our evaluation because ground truth answers were unavailable, and the expected behavior for this question type is that the agent should recognize the questions as unanswerable.

3.2. Evaluation Metrics

Our evaluation framework implements a comprehensive approach to assess long-term memory capabilities in dialogue systems, considering both response quality and operational efficiency. We categorize our metrics into two distinct groups that together provide a holistic understanding of system performance.

(1) Performance Metrics. Previous research in conversational AI has predominantly relied on lexical similarity metrics such as F1 score (F1) and BLEU-1. However, these metrics exhibit significant limitations when evaluating factual accuracy in conversational contexts. Consider a scenario where the ground truth answer is "Alice was born in March" and a system generates "Alice is born in July." Despite containing a critical factual error regarding the birth month, traditional metrics would assign relatively high scores due to lexical overlap in the remaining tokens ("Alice," "born," etc.). This fundamental limitation can lead to misleading evaluations that fail to capture semantic correctness.
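
To make the example concrete, the sketch below computes a token-level F1 for the two sentences. It assumes a simple lowercase, whitespace-split tokenization with multiset overlap (the common question-answering formulation); the exact F1 variant used in the evaluation is not specified here.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("Alice is born in July", "Alice was born in March")
print(round(score, 2))  # 0.6
```

Three of the five tokens ("alice", "born", "in") match, so precision and recall are both 3/5 and the F1 is 0.6, even though the one fact the question asks about is wrong.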

To address these shortcomings, we use LLM-as-a-Judge (Judge) as a complementary evaluation metric. This approach leverages a separate, more capable LLM to assess response quality across multiple dimensions, including factual accuracy, relevance, completeness, and contextual appropriateness. The judge model analyzes the question, the ground truth answer, and the generated answer, providing a more nuanced evaluation that aligns better with human judgment. Because Judge evaluations are stochastic, we conducted 10 independent runs for each method on the entire dataset and report the mean score along with ± 1 standard deviation. More details about the Judge are given in the Appendix.
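
Aggregating repeated runs in this way takes only a few lines. The sketch below uses the sample standard deviation, which is an assumption since the text does not say whether the sample or population form was used, and the scores in the usage line are made-up placeholders rather than reported results.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over independent runs."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def format_score(scores: Sequence[float]) -> str:
    """Render scores in the 'mean ± std' form used in result tables."""
    mean, std = summarize_runs(scores)
    return f"{mean:.2f} ± {std:.2f}"


# Placeholder scores from three hypothetical runs.
print(format_score([66.0, 67.0, 68.0]))  # 67.00 ± 1.00
```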

(2) Deployment Metrics. Beyond response quality, practical deployment considerations are crucial for real-world applications of long-term memory in AI agents. We systematically track Token Consumption, measured with the cl100k_base encoding from tiktoken, as the number of tokens extracted during retrieval that serve as context for answering queries. For our memory-based models, these tokens are the memories retrieved from the knowledge base, while for RAG-based models they are the total tokens in the retrieved text chunks. This distinction matters because it directly affects operational cost and system efficiency: the system processes either concise memory facts or larger raw text segments. We further monitor Latency in two forms: (i) search latency, the total time required to search the memories (in memory-based solutions) or chunks (in RAG-based solutions), and (ii) total latency, the time to generate a response, which comprises both retrieval time (accessing memories or chunks) and answer generation time using the LLM.
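
The two latency measurements could be collected along the following lines. This is a sketch under stated assumptions: `search` and `generate` are hypothetical callables standing in for the retrieval step and the LLM call, and the nearest-rank percentile is one common definition of p50 and p95, not necessarily the one used for the reported numbers.

```python
import math
import time
from typing import Any, Callable, Sequence, Tuple


def timed(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def answer_with_timing(search: Callable[[str], Any],
                       generate: Callable[[str, Any], str],
                       question: str) -> Tuple[str, float, float]:
    """Return (answer, search latency, total latency) for one question."""
    context, search_latency = timed(search, question)
    answer, generation_latency = timed(generate, question, context)
    return answer, search_latency, search_latency + generation_latency


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not values:
        raise ValueError("no measurements")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

Reporting a high percentile such as p95 alongside the median shows tail behaviour that an average would hide, which matters for interactive use.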

The relationship between these metrics reveals important trade-offs in system design. For instance, more sophisticated memory architectures might achieve higher factual accuracy but at the cost of increased token consumption and latency. Our multi-dimensional evaluation methodology enables researchers and practitioners to make informed decisions based on their specific requirements, whether prioritizing response quality for critical applications or computational efficiency for real-time deployment scenarios.

3.3. Baselines

To comprehensively evaluate our approach, we compare against six distinct categories of baselines that represent the current state of conversational memory systems. These diverse baselines collectively provide a robust framework for evaluating the effectiveness of different memory architectures across various dimensions, including factual accuracy, computational efficiency, and scalability to extended conversations. Where applicable, unless otherwise specified, we set the temperature to 0 to ensure the runs are as reproducible as possible.

Established LOCOMO Benchmarks. We first establish a comparative foundation by evaluating previously benchmarked methods on the LOCOMO dataset. These include five established approaches: LoCoMo, ReadAgent, MemoryBank, MemGPT, and A-Mem. These established benchmarks not only provide direct comparison points with published results but also represent the evolution of conversational memory architectures across different algorithmic paradigms. For our evaluation, we select the metrics for which gpt-4o-mini was used as the evaluation model. More details about these benchmarks are given in the Appendix.

Open-Source Memory Solutions. Our second category consists of promising open-source memory architectures such as LangMem (Hot Path) that have demonstrated effectiveness in related conversational tasks but have not yet been evaluated on the LOCOMO dataset. By adapting these systems to our evaluation framework, we broaden the comparative landscape and identify potential alternative approaches that may offer competitive performance. We initialized the LLM with gpt-4o-mini and used text-embedding-3-small as the embedding model.

Retrieval-Augmented Generation (RAG). As a baseline, we treat the entire conversation history as a document collection and apply a standard RAG pipeline. We first segment each conversation into fixed-length chunks (128, 256, 512, 1024, 2048, 4096, and 8192 tokens), where 8192 is the maximum chunk size supported by our embedding model. All chunks are embedded using OpenAI's text-embedding-3-small to ensure consistent vector quality across configurations. At query time, we retrieve the top $k$ chunks by semantic similarity and concatenate them as context for answer generation. Throughout our experiments we set $k \in \{1, 2\}$: with $k = 1$ only the single most relevant chunk is used, and with $k = 2$ the two most relevant chunks (up to 16,384 tokens) are concatenated. We avoid $k > 2$ because, at the largest chunk size, the retrieved chunks would cover nearly all of an average conversation (about 26,000 tokens), negating the benefits of selective retrieval. By varying chunk size and $k$, we systematically evaluate RAG performance on long-term conversational memory tasks.
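
The chunking step and the coverage argument can be checked with a short sketch. It operates on a list of token ids that is assumed to come from a tokenizer not shown here, and the function names are hypothetical.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[int], chunk_size: int) -> List[Sequence[int]]:
    """Split a token sequence into consecutive fixed-length chunks.

    The final chunk may be shorter than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def max_context_tokens(chunk_size: int, k: int) -> int:
    """Upper bound on tokens passed to the LLM when retrieving k chunks."""
    return chunk_size * k


conversation = list(range(26000))          # stand-in for an average conversation
chunks = chunk_tokens(conversation, 8192)
print(len(chunks))                         # 4
print(max_context_tokens(8192, 2))         # 16384
print(max_context_tokens(8192, 3))         # 24576
```

With 8192-token chunks, three retrieved chunks already amount to 24,576 of roughly 26,000 tokens, which is why larger values of $k$ would approach the full-context baseline.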

Full-Context Processing We adopt a straightforward approach by passing the entire conversation history within the context window of the LLM. This method leverages the model's inherent ability to process sequential information without additional architectural components. While conceptually simple, this approach faces practical limitations as conversation length increases, eventually increasing token cost and latency. Nevertheless, it establishes an important reference point for understanding the value of more sophisticated memory mechanisms compared to direct processing of available context.

Proprietary Models We evaluate OpenAI's memory feature available in their ChatGPT interface, specifically using gpt-4o-mini for consistency. We ingest entire LOCOMO conversations with a prompt into single chat sessions, prompting memory generation with timestamps, participant names, and conversation text. These generated memories are then used as complete context for answering questions about each conversation, intentionally granting the OpenAI approach privileged access to all memories rather than only question-relevant ones. This methodology accommodates the lack of external API access for selective memory retrieval in OpenAI's system for benchmarking.

Memory Providers We incorporate Zep, a memory management platform designed for AI agents. Using their platform version, we conduct systematic evaluations across the LOCOMO dataset, maintaining temporal fidelity by preserving timestamp information alongside conversational content. This temporal anchoring ensures that time-sensitive queries can be addressed through appropriately contextualized memory retrieval, particularly important for evaluating questions that require chronological awareness. This baseline represents an important commercial implementation of memory management specifically engineered for AI agents.

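Table 1: Performance comparison of memory-enabled systems across question types. Metrics are F1, BLEU-1, and LLM-as-a-Judge (J) scores; higher is better (↑). A-Mem* denotes the LLM-as-a-Judge scores obtained by re-running A-Mem ourselves with temperature set to 0. Mem0g denotes Mem0 enhanced with graph memory. Bold marks the best result for each metric.

| Method | Single-hop F1 ↑ | Single-hop BLEU-1 ↑ | Single-hop J ↑ | Multi-hop F1 ↑ | Multi-hop BLEU-1 ↑ | Multi-hop J ↑ |
|---|---|---|---|---|---|---|
| LoCoMo | 25.02 | 19.75 | - | 12.04 | 11.16 | - |
| ReadAgent | 9.15 | 6.48 | - | 5.31 | 5.12 | - |
| MemoryBank | 5.00 | 4.77 | - | 5.56 | 5.94 | - |
| MemGPT | 26.65 | 17.72 | - | 9.15 | 7.44 | - |
| A-Mem | 27.02 | 20.09 | - | 12.14 | 12.00 | - |
| A-Mem* | 20.76 | 14.90 | 39.79 ± 0.38 | 9.22 | 8.81 | 18.85 ± 0.31 |
| LangMem | 35.51 | 26.86 | 62.23 ± 0.75 | 26.04 | **22.32** | 47.92 ± 0.47 |
| Zep | 35.74 | 23.30 | 61.70 ± 0.32 | 19.37 | 14.82 | 41.35 ± 0.48 |
| OpenAI | 34.30 | 23.72 | 63.79 ± 0.46 | 20.09 | 15.42 | 42.92 ± 0.63 |
| Mem0 | **38.72** | **27.13** | **67.13 ± 0.65** | **28.64** | 21.58 | **51.15 ± 0.31** |
| Mem0g | 38.09 | 26.03 | 65.71 ± 0.45 | 24.32 | 18.82 | 47.19 ± 0.67 |

| Method | Open-domain F1 ↑ | Open-domain BLEU-1 ↑ | Open-domain J ↑ | Temporal F1 ↑ | Temporal BLEU-1 ↑ | Temporal J ↑ |
|---|---|---|---|---|---|---|
| LoCoMo | 40.36 | 29.05 | - | 18.41 | 14.77 | - |
| ReadAgent | 9.67 | 7.66 | - | 12.60 | 8.87 | - |
| MemoryBank | 6.61 | 5.16 | - | 9.68 | 6.99 | - |
| MemGPT | 41.04 | 34.34 | - | 25.52 | 19.44 | - |
| A-Mem | 44.65 | 37.06 | - | 45.85 | 36.67 | - |
| A-Mem* | 33.34 | 27.58 | 54.05 ± 0.22 | 35.40 | 31.08 | 49.91 ± 0.31 |
| LangMem | 40.91 | 33.63 | 71.12 ± 0.20 | 30.75 | 25.84 | 23.43 ± 0.39 |
| Zep | **49.56** | 38.92 | **76.60 ± 0.13** | 42.00 | 34.53 | 49.31 ± 0.50 |
| OpenAI | 39.31 | 31.16 | 62.29 ± 0.12 | 14.04 | 11.25 | 21.71 ± 0.20 |
| Mem0 | 47.65 | 38.72 | 72.93 ± 0.11 | 48.93 | **40.51** | 55.51 ± 0.34 |
| Mem0g | 49.27 | **40.30** | 75.71 ± 0.21 | **51.55** | 40.28 | **58.13 ± 0.44** |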

4. Evaluation Results, Analysis and Discussion

4.1. Performance Comparison Across Memory-Enabled Systems

Table 1 reports F1, BLEU-1, and Judge scores for our two architectures, Mem0 and Mem0g, against a suite of competitive baselines on single-hop, multi-hop, open-domain, and temporal questions. Overall, our models set new state-of-the-art marks on all three evaluation metrics for most question types.

Single-Hop Question Performance Single-hop queries involve locating a single factual span contained within one dialogue turn. Leveraging its dense natural-language memories, Mem0 secures the strongest results: F1 = 38.72, BLEU-1 = 27.13, and Judge = 67.13. Augmenting the natural language memories with graph memory (Mem0g) yields a marginal drop relative to Mem0 (Judge = 65.71), indicating that relational structure provides limited utility when the retrieval target occupies a single turn. Among the existing baselines, the OpenAI run, which receives all of its generated memories as context, attains the next-best Judge score (63.79), reflecting the benefit of having every memory available in context, while LangMem (62.23) and Zep (61.70) score roughly 7 to 8% lower than Mem0 in relative terms. Previous LOCOMO benchmarks such as A-Mem lag by more than 25 points in Judge (39.79 for A-Mem*), underscoring the necessity of fine-grained, structured memory indexing even for simple retrieval tasks.

Multi-Hop Question Performance Multi-hop queries require synthesizing information dispersed across multiple conversation sessions, posing significant challenges in memory integration and retrieval. Mem0 clearly outperforms other methods with an F1 score of 28.64 and a Judge score of 51.15, reflecting its capability to efficiently retrieve and integrate disparate information stored across sessions. Interestingly, the addition of graph memory in Mem0g does not provide performance gains here (F1 = 24.32, Judge = 47.19), indicating potential inefficiencies or redundancies in structured graph representations for complex integrative tasks compared to dense natural language memory alone. Baselines like LangMem are competitive, and LangMem attains the highest multi-hop BLEU-1 (22.32), but its F1 (26.04) and Judge (47.92) scores trail those of Mem0, emphasizing the advantage of our refined memory indexing and retrieval mechanisms for complex query processing.

Open-Domain Performance In open-domain settings, the baseline Zep achieves the highest F1 (49.56) and Judge (76.60) scores, edging out our methods by a narrow margin. In particular, Zep's Judge score of 76.60 surpasses Mem0g's 75.71 by just 0.89 percentage points and outperforms Mem0's 72.93 by 3.67 points, highlighting a consistent, if slight, advantage in integrating conversational memory with external knowledge. Mem0g remains a strong runner-up, with a Judge of 75.71 reflecting high factual retrieval precision, while Mem0 follows with 72.93, demonstrating robust coherence. These results underscore that although structured relational memories, as in Mem0 and Mem0g, substantially improve open-domain retrieval, Zep maintains a small but meaningful lead.

Temporal Reasoning Performance Temporal reasoning tasks hinge on accurate modeling of event sequences, their relative ordering, and durations within conversational history. Our architectures demonstrate substantial improvements across all metrics, with Mem0g achieving the highest F1 (51.55) and Judge (58.13), suggesting that structured relational representations in addition to natural language memories significantly aid temporally grounded judgments. Notably, the base variant, Mem0, also provides a decent Judge score (55.51), suggesting that natural language memories alone can support such judgments. Among baselines, OpenAI notably underperforms, with F1 (14.04) and BLEU-1 (11.25) below 15 and a Judge score of 21.71, primarily because most generated memories lack timestamps despite explicit prompting in ChatGPT to extract memories with timestamps. Other baselines such as A-Mem achieve respectable results, yet our models clearly advance the state of the art, emphasizing the critical advantage of accurately leveraging both natural language contextualization and structured graph representations for temporal reasoning.

4.2. Cross-Category Analysis

The comprehensive evaluation across diverse question categories reveals that our proposed architectures, Mem0 and Mem0g, consistently achieve superior performance compared to baseline systems. For single-hop queries, Mem0 demonstrates particularly strong performance, benefiting from its efficient dense natural language memory structure. Although graph-based representations in Mem0g slightly lag behind in lexical overlap metrics for these simpler queries, they significantly enhance semantic coherence, as demonstrated by competitive Judge scores. This indicates that graph structures are more beneficial in scenarios involving nuanced relational context rather than straightforward retrieval. For multi-hop questions, Mem0 exhibits clear advantages by effectively synthesizing dispersed information across multiple sessions, confirming that natural language memories provide sufficient representational richness for these integrative tasks. Surprisingly, the expected relational advantages of Mem0g do not translate into better outcomes here, suggesting potential overhead or redundancy when navigating more intricate graph structures in multi-step reasoning scenarios.

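Table 2: Performance comparison of baselines and our methods. Latency columns report p50 (median) and p95 (95th percentile) for search time and total response time, in seconds. The overall LLM-as-a-Judge score (J) reflects answer quality over the entire LOCOMO dataset.

| Method | K | Chunk size / memory tokens | Search p50 (s) | Search p95 (s) | Total p50 (s) | Total p95 (s) | Overall J |
|---|---|---|---|---|---|---|---|
| RAG | 1 | 128 | 0.281 | 0.823 | 0.774 | 1.825 | 47.77 ± 0.23% |
| RAG | 1 | 256 | 0.251 | 0.710 | 0.745 | 1.628 | 50.15 ± 0.16% |
| RAG | 1 | 512 | 0.240 | 0.639 | 0.772 | 1.710 | 46.05 ± 0.14% |
| RAG | 1 | 1024 | 0.240 | 0.723 | 0.821 | 1.957 | 40.74 ± 0.17% |
| RAG | 1 | 2048 | 0.255 | 0.752 | 0.996 | 2.182 | 37.93 ± 0.12% |
| RAG | 1 | 4096 | 0.254 | 0.719 | 1.093 | 2.711 | 36.84 ± 0.17% |
| RAG | 1 | 8192 | 0.279 | 0.838 | 1.396 | 4.416 | 44.53 ± 0.13% |
| RAG | 2 | 128 | 0.267 | 0.624 | 0.766 | 1.829 | 59.56 ± 0.19% |
| RAG | 2 | 256 | 0.255 | 0.699 | 0.802 | 1.907 | 60.97 ± 0.20% |
| RAG | 2 | 512 | 0.247 | 0.746 | 0.829 | 1.729 | 58.19 ± 0.18% |
| RAG | 2 | 1024 | 0.238 | 0.702 | 0.860 | 1.850 | 50.68 ± 0.13% |
| RAG | 2 | 2048 | 0.261 | 0.829 | 1.101 | 2.791 | 48.57 ± 0.22% |
| RAG | 2 | 4096 | 0.266 | 0.944 | 1.451 | 4.822 | 51.79 ± 0.15% |
| RAG | 2 | 8192 | 0.288 | 1.124 | 2.312 | 9.942 | 60.53 ± 0.16% |
| Full-context | - | 26031 | - | - | 9.870 | 17.117 | 72.90 ± 0.19% |
| A-Mem | - | 2520 | 0.668 | 1.485 | 1.410 | 4.374 | 48.38 ± 0.15% |
| LangMem | - | 127 | 17.99 | 59.82 | 18.53 | 60.40 | 58.10 ± 0.21% |
| Zep | - | 3911 | 0.513 | 0.778 | 1.292 | 2.926 | 65.99 ± 0.16% |
| OpenAI | - | 4437 | - | - | 0.466 | 0.889 | 52.90 ± 0.14% |
| Mem0 | - | 1764 | 0.148 | 0.200 | 0.708 | 1.440 | 66.88 ± 0.15% |
| Mem0g | - | 3616 | 0.476 | 0.657 | 1.091 | 2.590 | 68.44 ± 0.17% |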

In temporal reasoning, Mem0g substantially outperforms other methods, validating that structured relational graphs excel in capturing chronological relationships and event sequences. The presence of explicit relational context significantly enhances Mem0g's temporal coherence, outperforming Mem0's dense memory storage and highlighting the importance of precise relational representations when tracking temporally sensitive information. Open-domain performance further reinforces the value of relational modeling. Mem0g, benefiting from the relational clarity of graph-based memory, closely competes with the top-performing baseline, Zep. This competitive result underscores Mem0g's robustness in integrating external knowledge through relational clarity, suggesting an optimal synergy between structured memory and open-domain information synthesis.

Overall, our analysis indicates complementary strengths of Mem0 and Mem0g across various task demands: dense, natural-language-based memory offers significant efficiency for simpler queries, while explicit relational modeling becomes essential for tasks demanding nuanced temporal and contextual integration. These findings reinforce the importance of adaptable memory structures tailored to specific reasoning contexts in AI agent deployments.

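![Figure 4(a): Search latency analysis](./diagrams/latency_search.png)

Figure 4(a): Search latency across memory methods. Bar height shows the Judge score, the line shows search latency, and the right axis uses a logarithmic scale.

Figure 4(b): Total response latency across memory methods. Bar height shows the Judge score, the line shows end-to-end latency, and the right axis uses a logarithmic scale.

Figure 4: Latency analysis of different memory methods. Figure 4(a) shows the search/retrieval latency before answer generation; Figure 4(b) shows the total latency including LLM inference.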

4.3. Performance Comparison of Mem0 and Mem0g Against RAG Approaches and Full-Context Model

Comparisons in Table 2, focusing on the Overall Judge column, reveal that both Mem0 and Mem0g consistently outperform all RAG configurations, which vary chunk sizes (128 to 8192 tokens) and retrieve either one (k=1) or two (k=2) chunks. Even the strongest RAG approach peaks at around 61% in the Judge metric, whereas Mem0 reaches 67%, about a 10% relative improvement, and Mem0g reaches over 68%, around a 12% relative gain. These advances underscore the advantage of capturing only the most salient facts in memory, rather than retrieving large chunks of original text. By converting the conversation history into concise, structured representations, Mem0 and Mem0g mitigate noise and surface more precise cues to the LLM, leading to better answers as evaluated by an external LLM (Judge).
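The relative gains quoted above follow directly from the Overall J column of Table 2. A short check, with `relative_gain` as a hypothetical helper name:

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


best_rag = 60.97  # strongest RAG row in Table 2 (k=2, 256-token chunks)
for name, score in {"Mem0": 66.88, "Mem0g": 68.44}.items():
    print(f"{name}: {relative_gain(score, best_rag):.1f}% over the best RAG configuration")
```

This gives about 9.7% for Mem0 and 12.3% for Mem0g, consistent with the rounded figures in the text.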

Despite these improvements, a full-context method that ingests roughly 26000 tokens still achieves the highest Judge score, approximately 73%. However, as shown in Figure 4(b), it also incurs a very high total p95 latency, around 17 seconds, since the model must read the entire conversation on every query. By contrast, Mem0 and Mem0g significantly reduce token usage and thus achieve lower p95 latencies of around 1.44 seconds (a 92% reduction) and 2.6 seconds (an 85% reduction), respectively, relative to the full-context approach. Although the full-context approach can provide a slight accuracy edge, the memory-based systems offer a more practical trade-off, maintaining near-competitive quality while imposing only a fraction of the token and latency cost.
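As a sanity check on these reductions, the sketch below recomputes them from the Table 2 rows; `percent_reduction` is a hypothetical helper, and the token figures are the "memory tokens" column of the same table.

```python
def percent_reduction(baseline, ours):
    """How much smaller `ours` is than `baseline`, in percent."""
    return (baseline - ours) / baseline * 100.0


# (total p95 latency in seconds, context tokens) taken from Table 2
full_context = (17.117, 26031)
systems = {"Mem0": (1.440, 1764), "Mem0g": (2.590, 3616)}

for name, (p95, tokens) in systems.items():
    latency_cut = percent_reduction(full_context[0], p95)
    token_cut = percent_reduction(full_context[1], tokens)
    print(f"{name}: p95 latency -{latency_cut:.1f}%, context tokens -{token_cut:.1f}%")
```

The unrounded values are about 91.6% and 84.9% for p95 latency, which the text rounds to 92% and 85%.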

As conversation length increases, full-context approaches suffer from rapidly growing computational overhead. The same effect is visible in Table 2 for the RAG configurations, where total p95 latency rises markedly with larger k values or chunk sizes (for k=2, from 1.829s with 128-token chunks to 9.942s with 8192-token chunks). Larger inputs lead to longer response times and higher token consumption costs. In contrast, memory-focused approaches like Mem0 and Mem0g maintain consistent performance regardless of conversation length, making them substantially more viable for production-scale deployments where efficiency and responsiveness are critical.

4.4. Latency Analysis

Table 2 provides a comprehensive performance comparison of various retrieval and memory methodologies, presenting median (p50) and tail (p95) latencies for both the search phase and total response generation across the LOCOMO dataset. Our analysis reveals distinct performance patterns governed by architectural choices. Memory-centric architectures differ considerably from one another. A-Mem, despite its larger memory store, incurs substantial search overhead (p50: 0.668s), resulting in a total median latency of 1.410s. LangMem exhibits even higher search latencies (p50: 17.99s, p95: 59.82s), rendering it impractical for interactive applications. Zep achieves moderate performance (p50 total: 1.292s).
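For readers unfamiliar with the p50/p95 notation, the sketch below shows one common way to compute such percentiles from per-query timings. It uses the nearest-rank definition, which is an assumption: the paper does not state which percentile method was used, and the sample latencies are made up for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-query total latencies in seconds (not data from the paper).
latencies = [0.62, 0.70, 0.66, 0.71, 0.69, 0.73, 0.68, 0.75, 1.40, 0.72]
print("p50:", percentile(latencies, 50))
print("p95:", percentile(latencies, 95))
```

The median describes a typical query, while p95 is dominated by the slowest few requests, which is why both are reported for interactive workloads.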

The full-context baseline, which processes the entire conversation history without retrieval, fundamentally differs from retrieval-based approaches. By passing the entire conversation context (26000 tokens) directly to the LLM, it eliminates search overhead but incurs extreme total latencies (p50: 9.870s, p95: 17.117s). Similarly, the OpenAI implementation does not perform memory search, as it processes manually extracted memories from their playground. While this approach achieves impressive response generation times (p50: 0.466s, p95: 0.889s), it requires pre-extraction of relevant context, which is not reflected in the reported metrics.

Our proposed Mem0 approach achieves the lowest search latency among all methods (p50: 0.148s, p95: 0.200s), as illustrated in Figure 4(a). This efficiency stems from our selective memory retrieval mechanism and infrastructure improvements, which dynamically identify and retrieve only the most salient information rather than fixed-size chunks. Consequently, Mem0 maintains the lowest total median latency among methods that perform a search step (0.708s) with remarkably contained p95 values (1.440s); only the OpenAI baseline, which skips search entirely, responds faster. This makes Mem0 particularly suitable for latency-sensitive applications such as interactive AI agents. The graph-enhanced Mem0g variant introduces additional relational modeling capabilities at a moderate latency cost, with a median search time (0.476s) that is still lower than those of the other memory systems (A-Mem, LangMem, and Zep), though higher than those of the RAG configurations. Despite this increase, Mem0g maintains competitive total latencies (p50: 1.091s, p95: 2.590s) while achieving the highest Judge score (68.44%) among all methods except the computationally prohibitive full-context approach. This performance profile demonstrates our methods' ability to balance response quality and computational efficiency, offering a compelling solution for production AI agents where both factors are critical constraints.

4.5. Memory System Overhead: Token Analysis and Construction Time

We measure the average token budget required to materialise each system's long-term memory store. Mem0 encodes complete dialogue turns in a natural language representation and therefore occupies only 7k tokens per conversation on average, whereas Mem0g roughly doubles the footprint to 14k tokens, owing to the graph memories, which include nodes and their corresponding relationships. In stark contrast, Zep's memory graph consumes in excess of 600k tokens. The inflation arises from Zep's design choice to cache a full abstractive summary at every node while also storing facts on the connecting edges, leading to extensive redundancy across the graph. For perspective, supplying the entire raw conversation context to the language model, without any memory abstraction, amounts to roughly 26k tokens on average, more than 20 times smaller than Zep's graph.
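To put these footprints side by side, the sketch below expresses each memory store as a multiple of the raw conversation size. The token counts are the rounded averages quoted above (Zep's is a lower bound), and `footprint_ratio` is a hypothetical helper.

```python
def footprint_ratio(store_tokens, raw_tokens):
    """Memory-store size as a multiple of the raw conversation size."""
    return store_tokens / raw_tokens


raw_conversation = 26_000  # average raw conversation, in tokens
stores = {"Mem0": 7_000, "Mem0g": 14_000, "Zep": 600_000}  # Zep is "in excess of" 600k

for name, tokens in stores.items():
    print(f"{name}: {footprint_ratio(tokens, raw_conversation):.2f}x the raw conversation")
```

Mem0 and Mem0g come out at roughly 0.27x and 0.54x of the raw text, while Zep's graph is at least about 23x larger than the conversation it stores.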

Beyond token inefficiency, our experiments revealed significant operational bottlenecks with Zep. After adding memories to Zep's system, we observed that immediate memory retrieval attempts often failed to answer our queries correctly. Interestingly, re-running identical searches after a delay of several hours yielded considerably better results. This latency suggests that Zep's graph construction involves multiple asynchronous LLM calls and extensive background processing, making the memory system impractical for real-time applications. In contrast, Mem0 graph construction completes in under a minute even in worst-case scenarios, allowing users to immediately leverage newly added memories for query responses.

These findings highlight that Zep not only replicates identical knowledge fragments across multiple nodes, but also introduces significant operational delays. Our architectures, Mem0 and Mem0g, preserve the same information at a fraction of the token cost and with substantially faster memory availability, offering a more memory-efficient and operationally responsive representation.

5. Conclusion and Future Work

We have introduced Mem0 and Mem0g, two complementary memory architectures that overcome the intrinsic limitations of fixed context windows in LLMs. By dynamically extracting, consolidating, and retrieving compact memory representations, Mem0 achieves state-of-the-art performance across single-hop and multi-hop reasoning, while Mem0g's graph-based extensions unlock significant gains in temporal and open-domain tasks. On the LOCOMO benchmark, our methods deliver 5%, 11%, and 7% relative improvements on single-hop, temporal, and multi-hop questions, respectively, over the best-performing baseline for each question type, and reduce p95 latency by over 91% compared to the full-context baseline, demonstrating a powerful balance between precision and responsiveness. Mem0's dense memory pipeline excels at rapid retrieval for straightforward queries, minimizing token usage and computational overhead. In contrast, Mem0g's structured graph representations provide nuanced relational clarity, enabling complex event sequencing and rich context integration without sacrificing practical efficiency. Together, they form a versatile memory toolkit that adapts to diverse conversational demands while remaining deployable at scale.

Future research directions include optimizing graph operations to reduce the latency overhead in Mem0g, exploring hierarchical memory architectures that blend efficiency with relational representation, and developing more sophisticated memory consolidation mechanisms inspired by human cognitive processes. Additionally, extending our memory frameworks to domains beyond conversational scenarios, such as procedural reasoning and multimodal interactions, would further validate their broader applicability. By addressing the fundamental limitations of fixed context windows, our work represents a significant advancement toward conversational AI systems capable of maintaining coherent, contextually rich interactions over extended periods, much like their human counterparts.
