Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Abstract

We introduce Zep, a novel memory layer service for AI agents that outperforms the current state-of-the-art system, MemGPT, in the Deep Memory Retrieval (DMR) benchmark. Additionally, Zep excels in more comprehensive and challenging evaluations than DMR that better reflect real-world enterprise use cases. While existing retrieval-augmented generation (RAG) frameworks for large language model (LLM)-based agents are limited to static document retrieval, enterprise applications demand dynamic knowledge integration from diverse sources including ongoing conversations and business data. Zep addresses this fundamental limitation through its core component Graphiti, a temporally-aware knowledge graph engine that dynamically synthesizes both unstructured conversational data and structured business data while maintaining historical relationships. In the DMR benchmark, which the MemGPT team established as their primary evaluation metric, Zep demonstrates superior performance (94.8% vs 93.4%). Beyond DMR, Zep's capabilities are further validated through the more challenging LongMemEval benchmark, which better reflects enterprise use cases through complex temporal reasoning tasks. In this evaluation, Zep achieves accuracy improvements of up to 18.5% while simultaneously reducing response latency by 90% compared to baseline implementations. These results are particularly pronounced in enterprise-critical tasks such as cross-session information synthesis and long-term context maintenance, demonstrating Zep's effectiveness for deployment in real-world applications.

1. Introduction

The impact of transformer-based large language models (LLMs) on industry and research communities has garnered significant attention in recent years. A major application of LLMs has been the development of chat-based agents. However, these agents' capabilities are limited by the LLMs' context windows, effective context utilization, and knowledge gained during pre-training. Consequently, additional context is required to provide out-of-domain (OOD) knowledge and reduce hallucinations.

Retrieval-Augmented Generation (RAG) has emerged as a key area of interest in LLM-based applications. RAG leverages Information Retrieval (IR) techniques pioneered over the last fifty years to supply necessary domain knowledge to LLMs.

Current approaches using RAG have focused on broad domain knowledge and largely static corpora—that is, document contents added to a corpus seldom change. For agents to become pervasive in our daily lives, autonomously solving problems from trivial to highly complex, they will need access to a large corpus of continuously evolving data from users' interactions with the agent, along with related business and world data. We view empowering agents with this broad and dynamic "memory" as a crucial building block to actualize this vision, and we argue that current RAG approaches are unsuitable for this future. Since entire conversation histories, business datasets, and other domain-specific content cannot fit effectively inside LLM context windows, new approaches need to be developed for agent memory. Adding memory to LLM-powered agents isn't a new idea—this concept has been explored previously in MemGPT.

Recently, Knowledge Graphs (KGs) have been employed to enhance RAG architectures to address many of the shortcomings of traditional IR techniques. In this paper, we introduce Zep, a memory layer service powered by Graphiti, a dynamic, temporally-aware knowledge graph engine. Zep ingests and synthesizes both unstructured message data and structured business data. The Graphiti KG engine dynamically updates the knowledge graph with new information in a non-lossy manner, maintaining a timeline of facts and relationships, including their periods of validity. This approach enables the knowledge graph to represent a complex, evolving world.

As Zep is a production system, we've focused heavily on the accuracy, latency, and scalability of its memory retrieval mechanisms. We evaluate these mechanisms' efficacy using two existing benchmarks: a Deep Memory Retrieval task (DMR) from MemGPT, as well as the LongMemEval benchmark.

2. Knowledge Graph Construction

In Zep, memory is powered by a temporally-aware dynamic knowledge graph $G = (N, E, \phi)$, where $N$ represents nodes, $E$ represents edges, and $\phi: E \to N \times N$ represents a formal incidence function. This graph comprises three hierarchical tiers of subgraphs: an episode subgraph, a semantic entity subgraph, and a community subgraph.

  • Episode Subgraph $G_e$: Episodic nodes (episodes), $n_i \in N_e$, contain raw input data in the form of messages, text, or JSON. Episodes serve as a non-lossy data store from which semantic entities and relations are extracted. Episodic edges, $e_i \in E_e \subseteq \phi^{*}(N_e \times N_s)$, connect episodes to their referenced semantic entities.
  • Semantic Entity Subgraph $G_s$: The semantic entity subgraph builds upon the episode subgraph. Entity nodes (entities), $n_i \in N_s$, represent entities extracted from episodes and resolved with existing graph entities. Entity edges (semantic edges), $e_i \in E_s \subseteq \phi^{*}(N_s \times N_s)$, represent relationships between entities extracted from episodes.
  • Community Subgraph $G_c$: The community subgraph forms the highest level of Zep's knowledge graph. Community nodes (communities), $n_i \in N_c$, represent clusters of strongly connected entities. Communities contain high-level summarizations of these clusters and represent a more comprehensive, interconnected view of the structure of $G_s$. Community edges, $e_i \in E_c \subseteq \phi^{*}(N_c \times N_s)$, connect communities to their entity members.
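
The three tiers can be pictured as plain data structures. The following is a minimal sketch only, not Graphiti's actual schema: the class names, field names, and the `mention` and `entities_of` helpers are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    """Raw input unit (message, text, or JSON), kept verbatim."""
    id: str
    content: str
    kind: str = "message"


@dataclass
class Entity:
    """Node in the semantic entity subgraph."""
    id: str
    name: str
    summary: str = ""


@dataclass
class Community:
    """Cluster of strongly connected entities with a high-level summary."""
    id: str
    summary: str = ""


@dataclass
class Graph:
    episodes: dict = field(default_factory=dict)
    entities: dict = field(default_factory=dict)
    communities: dict = field(default_factory=dict)
    episodic_edges: set = field(default_factory=set)    # (episode_id, entity_id)
    semantic_edges: list = field(default_factory=list)  # (src_entity_id, fact, dst_entity_id)
    community_edges: set = field(default_factory=set)   # (community_id, entity_id)

    def mention(self, episode_id, entity_id):
        # An episodic edge may only link a known episode to a known entity.
        if episode_id not in self.episodes or entity_id not in self.entities:
            raise KeyError("episodic edge must link a known episode to a known entity")
        self.episodic_edges.add((episode_id, entity_id))

    def entities_of(self, episode_id):
        # Forward traversal: from an episode to the entities it references.
        return sorted(e for (ep, e) in self.episodic_edges if ep == episode_id)
```

Keeping each edge type in its own collection mirrors the constraint that every edge type connects a fixed pair of node types.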

The dual storage of both raw episodic data and derived semantic entity information mirrors psychological models of human memory. These models distinguish between episodic memory, which represents distinct events, and semantic memory, which captures associations between concepts and their meanings. This approach enables LLM agents using Zep to develop more sophisticated and nuanced memory structures that better align with our understanding of human memory systems. Knowledge graphs provide an effective medium for representing these memory structures, and our implementation of distinct episodic and semantic subgraphs draws from similar approaches in AriGraph. Our use of community nodes to represent high-level structures and domain concepts builds upon work from GraphRAG, enabling a more comprehensive global understanding of the domain. The resulting hierarchical organization—from episodes to facts to entities to communities—extends existing hierarchical RAG strategies.

2.1. Episodes

Zep's graph construction begins with the ingestion of raw data units called Episodes. Episodes can be one of three core types: message, text, or JSON. While each type requires specific handling during graph construction, this paper focuses on the message type, as our experiments center on conversation memory. In our context, a message consists of relatively short text (several messages can fit within an LLM context window) along with the associated actor who produced the utterance.

Each message includes a reference timestamp $t_{ref}$ indicating when the message was sent. This temporal information enables Zep to accurately identify and extract relative or partial dates mentioned in the message content (e.g., "next Thursday," "in two weeks," or "last summer"). Zep implements a bi-temporal model, where timeline $T$ represents the chronological ordering of events, and timeline $T'$ represents the transactional order of Zep's data ingestion. While the $T'$ timeline serves the traditional purpose of database auditing, the $T$ timeline provides an additional dimension for modeling the dynamic nature of conversational data and memory. This bi-temporal approach represents a novel advancement in LLM-based knowledge graph construction and underlies many of Zep's unique capabilities compared to previous graph-based RAG proposals.
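
To make the role of the reference timestamp concrete, the sketch below shows only the date arithmetic that turns a relative expression into an absolute point on the event timeline. In Zep the phrase itself is interpreted by an LLM; the `resolve_offset` helper and the example values are hypothetical.

```python
from datetime import datetime, timedelta


def resolve_offset(t_ref, *, days=0, weeks=0):
    """Shift the reference timestamp; negative values point to the past."""
    return t_ref + timedelta(days=days, weeks=weeks)


t_ref = datetime(2025, 1, 15, 9, 30)           # when the message was sent
started_job = resolve_offset(t_ref, weeks=-2)  # "I started two weeks ago"
deadline = resolve_offset(t_ref, weeks=2)      # "it is due in two weeks"
```

Without the reference timestamp, neither phrase could be placed on the event timeline at all.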

The episodic edges, $E_e$, connect episodes to their extracted entity nodes. Episodes and their derived semantic edges maintain bidirectional indices that track the relationships between edges and their source episodes. This design reinforces the non-lossy nature of Graphiti's episodic subgraph by enabling both forward and backward traversal: semantic artifacts can be traced to their sources for citation or quotation, while episodes can quickly retrieve their relevant entities and facts. While these connections are not directly examined in this paper's experiments, they will be explored in future work.

2.2. Semantic Entities and Facts

2.2.1. Entities

Entity extraction represents the initial phase of episode processing. During ingestion, the system processes both the current message content and the last $n$ messages to provide context for named entity recognition. For this paper and in Zep's general implementation, $n = 4$, providing two complete conversation turns for context evaluation. Given our focus on message processing, the speaker is automatically extracted as an entity. Following initial entity extraction, we employ a reflection technique inspired by Reflexion to minimize hallucinations and enhance extraction coverage. The system also extracts an entity summary from the episode to facilitate subsequent entity resolution and retrieval operations.
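
The context window described above amounts to a simple slice over the message history. This is an illustrative sketch; the `extraction_context` function and its argument names are assumptions, not part of Zep's API.

```python
def extraction_context(messages, index, n=4):
    """Return (previous messages, current message) for entity extraction.

    The last n messages before `index` are supplied as context; fewer are
    returned near the start of a conversation.
    """
    previous = messages[max(0, index - n):index]
    return previous, messages[index]
```

With n = 4 and strictly alternating speakers, the slice covers the two preceding conversation turns.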

After extraction, the system embeds each entity name into a 1024-dimensional vector space. This embedding enables the retrieval of similar nodes through cosine similarity search across existing graph entity nodes. The system also performs a separate full-text search on existing entity names and summaries to identify additional candidate nodes. These candidate nodes, together with the episode context, are then processed through an LLM using our entity resolution prompt. When the system identifies a duplicate entity, it generates an updated name and summary.

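
The candidate search for entity resolution can be sketched as the union of a vector match and a text match. Everything here is a simplification under stated assumptions: the vectors are toy 3-dimensional stand-ins for the 1024-dimensional embeddings, the token overlap stands in for a real full-text index, and the `min_sim` threshold is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_nodes(new_name, new_vec, existing, min_sim=0.8):
    """Collect existing nodes that match by embedding OR by name/summary text.

    existing: list of dicts with 'name', 'summary', and 'vec' keys.
    """
    tokens = set(new_name.lower().split())
    hits = []
    for node in existing:
        by_vector = cosine(new_vec, node["vec"]) >= min_sim
        text_tokens = set(f"{node['name']} {node['summary']}".lower().split())
        by_text = bool(tokens & text_tokens)
        if by_vector or by_text:
            hits.append(node["name"])
    return hits
```

The union matters because each method catches duplicates the other misses: an alias with no shared words can still be close in embedding space, and vice versa. The resulting candidates would then go to the LLM resolution step described above.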

Following entity extraction and resolution, the system incorporates the data into the knowledge graph using predefined Cypher queries. We chose this approach over LLM-generated database queries to ensure consistent schema formats and reduce the potential for hallucinations. Selected prompts for graph construction are provided in the appendix.

2.2.2. Facts

Each fact contains its key predicate. Importantly, the same fact can be extracted multiple times between different entities, enabling Graphiti to model complex multi-entity facts through an implementation of hyper-edges. Following extraction, the system generates embeddings for facts in preparation for graph integration. The system performs edge deduplication through a process similar to entity resolution. The hybrid search for relevant edges is constrained to edges existing between the same entity pairs as the proposed new edge. This constraint not only prevents erroneous combinations of similar edges between different entities but also significantly reduces the computational complexity of the deduplication process by limiting the search space to a subset of edges relevant to the specific entity pair.
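
The entity-pair constraint on edge deduplication can be illustrated with an index keyed by the pair. This is a sketch under assumptions: the dictionary keys `src`, `dst`, and `fact` are hypothetical, and the pair is treated as ordered, which the text does not specify.

```python
from collections import defaultdict


def index_by_pair(edges):
    """Group existing edges by their (source, target) entity pair."""
    index = defaultdict(list)
    for edge in edges:
        index[(edge["src"], edge["dst"])].append(edge)
    return index


def dedup_candidates(new_edge, index):
    """Only edges between the same entity pair are compared with the new edge."""
    return index.get((new_edge["src"], new_edge["dst"]), [])
```

Only the edges returned by `dedup_candidates` would be passed to the hybrid search and the LLM comparison, so similar facts about other entity pairs are never considered for merging.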

2.2.3. Temporal Extraction and Edge Invalidation

A key differentiating feature of Graphiti compared to other knowledge graph engines is its capacity to manage dynamic information updates through temporal extraction and edge invalidation processes. The system extracts temporal information about facts from the episode context using $t_{ref}$. This enables accurate extraction and datetime representation of both absolute timestamps (e.g., "Alan Turing was born on June 23, 1912") and relative timestamps (e.g., "I started my new job two weeks ago"). Consistent with our bi-temporal modeling approach, the system tracks four timestamps: $t'_{created}$ and $t'_{expired} \in T'$ monitor when facts are created or invalidated in the system, while $t_{valid}$ and $t_{invalid} \in T$ track the temporal range during which facts held true. These temporal data points are stored on edges alongside other fact information.

The introduction of new edges can invalidate existing edges in the database. The system employs an LLM to compare new edges against semantically related existing edges to identify potential contradictions. When the system identifies temporally overlapping contradictions, it invalidates the affected edges by setting their $t_{invalid}$ to the $t_{valid}$ of the invalidating edge. Following the transactional timeline $T'$, Graphiti consistently prioritizes new information when determining edge invalidation. This comprehensive approach enables the dynamic addition of data to Graphiti as conversations evolve, while maintaining both current relationship states and historical records of relationship evolution over time.
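
The invalidation rule can be sketched as follows. This is not Graphiti's implementation: the `Fact` class is hypothetical, the `contradicts` callable stands in for the LLM comparison, and stamping `t_expired` at invalidation time is an assumption based on the description of the transactional timestamps.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    text: str
    t_valid: datetime                      # event timeline T
    t_invalid: Optional[datetime] = None   # event timeline T (None = still true)
    t_created: Optional[datetime] = None   # transactional timeline T'
    t_expired: Optional[datetime] = None   # transactional timeline T'


def overlaps(old, new):
    """True when the two validity intervals intersect (None means open-ended)."""
    old_end = old.t_invalid or datetime.max
    new_end = new.t_invalid or datetime.max
    return old.t_valid < new_end and new.t_valid < old_end


def invalidate(old, new, contradicts, now):
    """Close the old fact at the moment the contradicting new fact became valid."""
    if contradicts(old, new) and overlaps(old, new):
        old.t_invalid = new.t_valid
        old.t_expired = now  # assumption: record when the system invalidated it
        return True
    return False
```

The old fact is closed rather than deleted, so the graph can still answer questions about what was true before the change. The sketch does not handle a new fact whose validity starts before the old one's.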

2.3. Communities

After establishing the episodic and semantic subgraphs, the system constructs the community subgraph through community detection. While our community detection approach builds upon the technique described in GraphRAG, we employ a label propagation algorithm rather than the Leiden algorithm. This choice was influenced by label propagation's straightforward dynamic extension, which enables the system to maintain accurate community representations for longer periods as new data enters the graph, delaying the need for complete community refreshes.

The dynamic extension implements the logic of a single recursive step in label propagation. When the system adds a new entity node $n_i \in N_s$ to the graph, it surveys the communities of neighboring nodes. The system assigns the new node to the community held by the plurality of its neighbors, then updates the community summary and graph accordingly. While this dynamic updating enables efficient community extension as data flows into the system, the resulting communities gradually diverge from those that would be generated by a complete label propagation run. Therefore, periodic community refreshes remain necessary. However, this dynamic updating strategy provides a practical heuristic that significantly reduces latency and LLM inference costs.
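
The single-step update can be sketched as a plurality vote over the neighbors' community labels. The function name, the alphabetical tie-break, and the singleton fallback for a node with no labeled neighbors are assumptions made for illustration; the summary update is omitted.

```python
from collections import Counter


def assign_community(new_node, neighbors, community_of):
    """Assign new_node to the community held by the plurality of its neighbors.

    community_of maps node id -> community id and is updated in place.
    """
    labels = [community_of[n] for n in neighbors if n in community_of]
    if not labels:
        # Assumption: a node with no labeled neighbors starts its own community.
        community_of[new_node] = new_node
        return new_node
    counts = Counter(labels)
    top = max(counts.values())
    # Assumption: break ties deterministically by community id.
    winner = min(label for label, count in counts.items() if count == top)
    community_of[new_node] = winner
    return winner
```

Because only the new node is labeled and existing nodes are never revisited, the partition slowly drifts from what a full run would produce, which is why periodic refreshes remain necessary.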

Following GraphRAG, our community nodes contain summaries derived through an iterative map-reduce-style summarization of member nodes. However, our retrieval methods differ substantially from GraphRAG's map-reduce approach. To support our retrieval methodology, we generate community names containing key terms and relevant subjects from the community summaries. These names are embedded and stored to enable cosine similarity searches.

3. Memory Retrieval

The memory retrieval system in Zep provides powerful, complex, and highly configurable functionality. At a high level, the Zep graph search API implements a function $f: S \to S$ that accepts a text-string query $\alpha \in S$ as input and returns a text-string context $\beta \in S$ as output. The output $\beta$ contains formatted data from nodes and edges required for an LLM agent to generate an accurate response to query $\alpha$. The process $f(\alpha) \to \beta$ comprises three distinct steps:

  • Search ($\varphi$): The process begins by identifying candidate nodes and edges potentially containing relevant information. While Zep employs multiple distinct search methods, the overall search function can be represented as $\varphi: S \to E_s^n \times N_s^n \times N_c^n$. Thus, $\varphi$ transforms a query into a 3-tuple containing lists of semantic edges, entity nodes, and community nodes, the three graph object types that contain relevant textual information.
  • Reranker ($\rho$): The second step reorders search results. A reranker function or model accepts a list of search results and produces a reordered version of those results: $\rho: \varphi(\alpha), \ldots \to E_s^n \times N_s^n \times N_c^n$.
  • Constructor ($\chi$): The final step, the constructor, transforms the relevant nodes and edges into text context: $\chi: E_s^n \times N_s^n \times N_c^n \to S$. For each $e_i \in E_s$, $\chi$ returns the fact and the $t_{valid}$, $t_{invalid}$ fields; for each $n_i \in N_s$, the name and summary fields; and for each $n_i \in N_c$, the summary field.

With these definitions established, we can express $f$ as a composition of these three components: $f(\alpha) = \chi(\rho(\varphi(\alpha))) = \beta$. Sample context string template:

FACTS and ENTITIES represent relevant context to the current conversation.
These are the most relevant facts and their valid date ranges. If the fact is about an event, the event takes place during this time.
format: FACT (Date range: from - to)
<FACTS>
{facts}
</FACTS>
These are the most relevant entities
ENTITY_NAME: entity summary
<ENTITIES>
{entities}
</ENTITIES>
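
The composition of search, reranker, and constructor can be sketched around this template. This is a sketch only: the function names, the dictionary keys, and the use of "present" for an open-ended date range are assumptions, and the search and rerank steps are passed in as callables rather than implemented.

```python
TEMPLATE = (
    "FACTS and ENTITIES represent relevant context to the current conversation.\n"
    "These are the most relevant facts and their valid date ranges. "
    "If the fact is about an event, the event takes place during this time.\n"
    "format: FACT (Date range: from - to)\n"
    "<FACTS>\n{facts}\n</FACTS>\n"
    "These are the most relevant entities\n"
    "ENTITY_NAME: entity summary\n"
    "<ENTITIES>\n{entities}\n</ENTITIES>"
)


def construct(edges, nodes):
    """Constructor step: format ranked edges and entity nodes as one string."""
    facts = "\n".join(
        f"{e['fact']} (Date range: {e['t_valid']} - {e['t_invalid'] or 'present'})"
        for e in edges
    )
    entities = "\n".join(f"{n['name']}: {n['summary']}" for n in nodes)
    return TEMPLATE.format(facts=facts, entities=entities)


def retrieve(query, search, rerank):
    """Compose the three steps: search, then rerank, then construct."""
    edges, nodes, _communities = rerank(query, search(query))
    return construct(edges, nodes)
```

Passing `search` and `rerank` as arguments keeps the sketch independent of any particular search method or reranker.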

3.1. Search

Zep implements three search functions: cosine semantic similarity search ($\varphi_{cos}$), Okapi BM25 full-text search ($\varphi_{bm25}$), and breadth-first search ($\varphi_{bfs}$). The first two functions utilize Neo4j's implementation of Lucene. Each search function offers distinct capabilities in identifying relevant documents, and together they provide comprehensive coverage of candidate results before reranking. The search field varies across the three object types: for $E_s$, we search the fact field; for $N_s$, the entity name; and for $N_c$, the community name, which comprises relevant keywords and phrases covered in the community. While developed independently, our community search approach parallels the high-level key search methodology in LightRAG. The hybridization of LightRAG's approach with graph-based systems like Graphiti presents a promising direction for future research.

While cosine similarity and full-text search methodologies are well-established in RAG, breadth-first search over knowledge graphs has received limited attention in the RAG domain, with notable exceptions in graph-based RAG systems such as AriGraph and Distill-SynthKG. In Graphiti, the breadth-first search enhances initial search results by identifying additional nodes and edges within $n$ hops. Moreover, $\varphi_{bfs}$ can accept nodes as parameters for the search, enabling greater control over the search function. This functionality proves particularly valuable when using recent episodes as seeds for the breadth-first search, allowing the system to incorporate recently mentioned entities and relationships into the retrieved context.
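
A hop-limited breadth-first search from seed nodes can be sketched as follows. The adjacency-dictionary representation and the `bfs_within` name are assumptions for illustration; Graphiti runs its searches against the graph database rather than an in-memory dictionary.

```python
from collections import deque


def bfs_within(adjacency, seeds, max_hops):
    """Return {node: hop distance} for every node within max_hops of any seed."""
    seen = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return seen
```

Seeding the search with the entities mentioned in recent episodes pulls their graph neighborhood into the candidate set even when the query text does not mention them.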

The three search methods each target different aspects of similarity: full-text search identifies word similarities, cosine similarity captures semantic similarities, and breadth-first search reveals contextual similarities—where nodes and edges closer in the graph appear in more similar conversational contexts. This multi-faceted approach to candidate result identification maximizes the likelihood of discovering optimal context.

3.2. Reranker

While the initial search methods aim to achieve high recall, rerankers serve to increase precision by prioritizing the most relevant results. Zep supports existing reranking approaches such as Reciprocal Rank Fusion (RRF) and Maximal Marginal Relevance (MMR). Additionally, Zep implements a graph-based episode-mentions reranker that prioritizes results based on the frequency of entity or fact mentions within a conversation, enabling a system where frequently referenced information becomes more readily accessible. The system also includes a node distance reranker that reorders results based on their graph distance from a designated centroid node, providing context localized to specific areas of the knowledge graph. The system's most sophisticated reranking capability employs cross-encoders—LLMs that generate relevance scores by evaluating nodes and edges against queries using cross-attention, though this approach incurs the highest computational cost.

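
As an illustration of the simplest of these rerankers, Reciprocal Rank Fusion merges several ranked lists by summing 1 / (k + rank) for each item across the lists. The sketch below uses the conventional constant k = 60 and an alphabetical tie-break; both are assumptions rather than Zep's documented settings.

```python
def rrf(rankings, k=60):
    """Fuse ranked lists: each item scores the sum of 1 / (k + rank) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by item id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))
```

For example, fusing the lists `["f1", "f2", "f3"]`, `["f2", "f3", "f1"]`, and `["f2", "f1"]` places `f2` first, because it ranks near the top of all three lists.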

4. Experiments

This section analyzes two experiments conducted using LLM memory benchmarks. The first evaluation employs the Deep Memory Retrieval (DMR) task developed in MemGPT, which uses a 500-conversation subset of the Multi-Session Chat dataset introduced in "Beyond Goldfish Memory: Long-Term Open-Domain Conversation". The second evaluation utilizes the LongMemEval benchmark from "LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory". Specifically, we use the $\text{LongMemEval}_s$ dataset, which provides an extensive conversation context averaging 115,000 tokens.

For both experiments, we integrate the conversation history into a Zep knowledge graph through Zep's APIs. We then retrieve the 20 most relevant edges (facts) and entity nodes (entity summaries) using the techniques described in Section 3. The system reformats this data into a context string, matching the functionality provided by Zep's memory APIs. While these experiments demonstrate key retrieval capabilities of Graphiti, they represent a subset of the system's full search functionality. This focused scope enables clear comparison with existing benchmarks while reserving the exploration of additional knowledge graph capabilities for future work.

4.1. Choice of models

Our experimental implementation employs the BGE-m3 models from BAAI for both reranking and embedding tasks. We use gpt-4o-mini-2024-07-18 for graph construction, and both gpt-4o-mini-2024-07-18 and gpt-4o-2024-11-20 for the chat agent that generates responses from the provided context. To ensure direct comparability with MemGPT's DMR results, we also conducted the DMR evaluation using gpt-4-turbo-2024-04-09. The experimental notebooks will be made publicly available through our GitHub repository, and relevant experimental prompts are included in the Appendix.

4.2. Deep Memory Retrieval (DMR)

The Deep Memory Retrieval evaluation, introduced by MemGPT, comprises 500 multi-session conversations, each containing 5 chat sessions with up to 12 messages per session. Each conversation includes a question/answer pair for memory evaluation. The MemGPT framework currently leads performance metrics with 93.4% accuracy using gpt-4-turbo, a significant improvement over the 35.3% baseline achieved through recursive summarization.

To establish comparative baselines, we implemented two common LLM memory approaches: full-conversation context and session summaries. Using gpt-4-turbo, the full-conversation baseline achieved 94.4% accuracy, slightly surpassing MemGPT's reported results, while the session summary baseline achieved 78.6%. When using gpt-4o-mini, both approaches showed improved performance: 98.0% for full-conversation and 88.0% for session summaries. We were unable to reproduce MemGPT's results using gpt-4o-mini due to insufficient methodological details in their published work.

We then evaluated Zep's performance by ingesting the conversations and using its search functions to retrieve the top 10 most relevant nodes and edges. An LLM judge compared the agent's responses to the provided golden answers. Zep achieved 94.8% accuracy with gpt-4-turbo and 98.2% with gpt-4o-mini, showing marginal improvements over both MemGPT and the respective full-conversation baselines. However, these results must be contextualized: each conversation contains only 60 messages, easily fitting within current LLM context windows.

The limitations of the DMR evaluation extend beyond its small scale. Our analysis revealed significant weaknesses in the benchmark's design. The evaluation relies exclusively on single-turn, fact-retrieval questions that fail to assess complex memory understanding. Many questions contain ambiguous phrasing, referencing concepts like "favorite drink to relax with" or "weird hobby" that were not explicitly characterized as such in the conversations. Most critically, the dataset poorly represents real-world enterprise use cases for LLM agents. The high performance achieved by simple full-context approaches using modern LLMs further highlights the benchmark's inadequacy for evaluating memory systems.

This inadequacy is further emphasized by findings in LongMemEval, which demonstrate rapidly declining LLM performance on the LongMemEval benchmark as conversation length increases. The LongMemEval dataset addresses many of these shortcomings by presenting longer, more coherent conversations that better reflect enterprise scenarios, along with more diverse evaluation questions.

LongMemEval 的发现进一步强调了这种不足:随着对话长度增加,LLM 在 LongMemEval 基准上的性能会快速下降。 LongMemEval 数据集通过提供更长、更连贯、也更能反映企业场景的对话,以及更多样的评估问题,解决了其中许多缺陷。

Table 1: Deep Memory Retrieval. † indicates results reported in the MemGPT paper.

| Memory | Model | Score |
| --- | --- | --- |
| Recursive Summarization | gpt-4-turbo | 35.3% |
| Conversation Summaries | gpt-4-turbo | 78.6% |
| MemGPT | gpt-4-turbo | 93.4% |
| Full-conversation | gpt-4-turbo | 94.4% |
| Zep | gpt-4-turbo | 94.8% |
| Conversation Summaries | gpt-4o-mini | 88.0% |
| Full-conversation | gpt-4o-mini | 98.0% |
| Zep | gpt-4o-mini | 98.2% |

4.3. LongMemEval (LME)

We evaluated Zep using the LongMemEval_s dataset, which provides conversations and questions representative of real-world business applications of LLM agents. The LongMemEval_s dataset presents significant challenges to existing LLMs and commercial memory solutions, with conversations averaging approximately 115,000 tokens in length. This length, while substantial, remains within the context windows of current frontier models, enabling us to establish meaningful baselines for evaluating Zep's performance.

作者使用 LongMemEval_s 数据集评估 Zep;该数据集提供的对话和问题能够代表 LLM 智能体在真实业务应用中的场景。 LongMemEval_s 数据集对现有 LLM 和商业记忆方案构成了显著挑战,其中对话平均长度约为 115,000 tokens。 这一长度虽然很大,但仍然位于当前前沿模型的上下文窗口之内,使作者能够建立有意义的基线来评估 Zep 的性能。

The dataset incorporates six distinct question types: single-session-user, single-session-assistant, single-session-preference, multi-session, knowledge-update, and temporal-reasoning. These categories are not uniformly distributed throughout the dataset; for detailed distribution information, we refer readers to LongMemEval. We conducted all experiments between December 2024 and January 2025. We performed testing using a consumer laptop from a residential location in Boston, MA, connecting to Zep's service hosted in AWS us-west-2. This distributed architecture introduced additional network latency when evaluating Zep's performance, though this latency was not present in our baseline evaluations. For answer evaluation, we employed GPT-4o with the question-specific prompts provided in LongMemEval, which have demonstrated high correlation with human evaluators.

该数据集包含六种不同问题类型:single-session-user、single-session-assistant、single-session-preference、multi-session、knowledge-update 和 temporal-reasoning。 这些类别在数据集中并非均匀分布;关于详细分布信息,作者建议读者参考 LongMemEval。 作者在 2024 年 12 月至 2025 年 1 月之间开展了所有实验。 测试使用位于美国马萨诸塞州波士顿住宅网络中的消费级笔记本完成,并连接到托管在 AWS us-west-2 的 Zep 服务。 这种分布式架构在评估 Zep 性能时引入了额外网络延迟,不过这种延迟并不存在于基线评估中。 在答案评估方面,作者使用 GPT-4o 和 LongMemEval 提供的特定问题提示词;这些提示词已被证明与人类评估者高度相关。
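Because the Zep measurements include network latency, a robust summary of the timings matters more than a single number. Table 2 reports a latency figure together with an interquartile range (IQR). The sketch below shows one way such a summary could be computed. It assumes the reported latency is a median, which the text does not state, and the helper names `time_calls` and `latency_summary` are illustrative only.

```python
import statistics
import time
from typing import Callable


def time_calls(fn: Callable[[], object], n: int) -> list:
    """Call fn n times and return the wall-clock duration of each call in seconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def latency_summary(samples: list) -> tuple:
    """Return (median, IQR) of the samples; assumes median-based reporting."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2 (median), Q3.
    q1, median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return median, q3 - q1
```

A median and IQR are less sensitive than a mean to the occasional slow request that a residential network connection can produce.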

4.3.1. LongMemEval and MemGPT

To establish a comparative benchmark between Zep and the current state-of-the-art MemGPT system, we attempted to evaluate MemGPT using the LongMemEval dataset. Given that the current MemGPT framework does not support direct ingestion of existing message histories, we implemented a workaround by adding conversation messages to the archival history. However, we were unable to achieve successful question responses using this approach. We look forward to seeing evaluations of this benchmark by other research teams, as comparative performance data would benefit the broader development of LLM memory systems.

为了在 Zep 与当前最先进的 MemGPT 系统之间建立比较基准,作者尝试使用 LongMemEval 数据集评估 MemGPT。 鉴于当前 MemGPT 框架不支持直接摄取已有消息历史,作者采用了一种变通方法:将对话消息加入 archival history。 然而,作者未能通过这种方法获得成功的问题响应。 作者期待其他研究团队对这一基准开展评估,因为比较性性能数据将有助于更广泛地推动 LLM 记忆系统发展。

4.3.2. LongMemEval results

Zep demonstrates substantial improvements in both accuracy and latency compared to the baseline across both model variants. Using gpt-4o-mini, Zep achieved a relative accuracy improvement of 15.2% over the baseline, while gpt-4o showed a relative improvement of 18.5%. The reduced prompt size also led to significant latency and cost reductions compared to the baseline implementations.

在两个模型变体上,相较基线,Zep 在准确率和延迟方面都展现出显著提升。 使用 gpt-4o-mini 时,Zep 相较基线实现了 15.2% 的准确率提升;而使用 gpt-4o 时,提升达到 18.5%。 相比基线实现,提示词规模的减小也带来了显著的延迟成本降低。

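Table 2: LongMemEval_s overall results.

| Memory | Model | Score | Latency | Latency IQR | Avg Context Tokens |
| --- | --- | --- | --- | --- | --- |
| Full-context | gpt-4o-mini | 55.4% | 31.3 s | 8.76 s | 115k |
| Zep | gpt-4o-mini | 63.8% | 3.20 s | 1.31 s | 1.6k |
| Full-context | gpt-4o | 60.2% | 28.9 s | 6.01 s | 115k |
| Zep | gpt-4o | 71.2% | 2.58 s | 0.684 s | 1.6k |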
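The headline figures can be checked against Table 2 with a short calculation. The sketch below assumes the reported accuracy improvements are relative changes over the full-context score, which matches the 15.2% quoted for gpt-4o-mini. For gpt-4o the rounded table scores give roughly 18.3%, slightly below the reported 18.5%, presumably because the published scores are rounded. The latency reductions come out near 90% for both models.

```python
# (score %, latency s) copied from Table 2.
results = {
    "gpt-4o-mini": {"full": (55.4, 31.3), "zep": (63.8, 3.20)},
    "gpt-4o": {"full": (60.2, 28.9), "zep": (71.2, 2.58)},
}


def relative_change(baseline: float, value: float) -> float:
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


def summarize(model: str) -> dict:
    """Relative accuracy gain and latency reduction of Zep over full-context."""
    full_score, full_latency = results[model]["full"]
    zep_score, zep_latency = results[model]["zep"]
    return {
        "accuracy_gain_pct": relative_change(full_score, zep_score),
        # Negated so that a faster response is a positive reduction.
        "latency_reduction_pct": -relative_change(full_latency, zep_latency),
    }


for model in results:
    print(model, summarize(model))
```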

Analysis by question type reveals that gpt-4o-mini with Zep showed improvements in four of the six categories, with the most substantial gains in complex question types: single-session-preference, multi-session, and temporal-reasoning. When using gpt-4o, Zep further demonstrated improved performance in the knowledge-update category, highlighting its effectiveness with more capable models. However, additional development may be needed to improve less capable models' understanding of Zep's temporal data.

按问题类型分析可见,搭配 Zep 的 gpt-4o-mini 在六个类别中的四个类别上取得提升,其中最显著的增益出现在复杂问题类型:single-session-preference、multi-session 和 temporal-reasoning。 使用 gpt-4o 时,Zep 进一步在 knowledge-update 类别中展现出性能提升,凸显了它与更强模型结合时的有效性。 不过,要提升较弱模型对 Zep 时间数据的理解,可能还需要进一步开发。

Table 3: LongMemEval_s breakdown by question type. Delta is the relative change of Zep compared with Full-context.

| Question Type | Model | Full-context | Zep | Delta |
| --- | --- | --- | --- | --- |
| single-session-preference | gpt-4o-mini | 30.0% | 53.3% | 77.7%↑ |
| single-session-assistant | gpt-4o-mini | 81.8% | 75.0% | 9.06%↓ |
| temporal-reasoning | gpt-4o-mini | 36.5% | 54.1% | 48.2%↑ |
| multi-session | gpt-4o-mini | 40.6% | 47.4% | 16.7%↑ |
| knowledge-update | gpt-4o-mini | 76.9% | 74.4% | 3.36%↓ |
| single-session-user | gpt-4o-mini | 81.4% | 92.9% | 14.1%↑ |
| single-session-preference | gpt-4o | 20.0% | 56.7% | 184%↑ |
| single-session-assistant | gpt-4o | 94.6% | 80.4% | 17.7%↓ |
| temporal-reasoning | gpt-4o | 45.1% | 62.4% | 38.4%↑ |
| multi-session | gpt-4o | 44.3% | 57.9% | 30.7%↑ |
| knowledge-update | gpt-4o | 78.2% | 83.3% | 6.52%↑ |
| single-session-user | gpt-4o | 81.4% | 92.9% | 14.1%↑ |
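The Delta column in Table 3 can be reproduced from the two score columns. This is an inference from the numbers rather than a definition given in the text: the gains match a change relative to the Full-context score, while the three declines match a change measured against the lower (Zep) score. Measured against Full-context instead, the single-session-assistant declines would be about 8.3% (gpt-4o-mini) and 15.0% (gpt-4o). The `delta` helper below is an illustrative name, not part of any library.

```python
def delta(full_context: float, zep: float) -> float:
    """Signed percentage change, measured against the smaller of the two scores.

    Assumption: this reproduces the table's Delta column; the source does not
    state the formula explicitly.
    """
    return (zep - full_context) / min(full_context, zep) * 100.0


# (question type, model, Full-context %, Zep %) copied from Table 3.
rows = [
    ("single-session-preference", "gpt-4o-mini", 30.0, 53.3),
    ("single-session-assistant", "gpt-4o-mini", 81.8, 75.0),
    ("temporal-reasoning", "gpt-4o-mini", 36.5, 54.1),
    ("multi-session", "gpt-4o-mini", 40.6, 47.4),
    ("knowledge-update", "gpt-4o-mini", 76.9, 74.4),
    ("single-session-user", "gpt-4o-mini", 81.4, 92.9),
    ("single-session-preference", "gpt-4o", 20.0, 56.7),
    ("single-session-assistant", "gpt-4o", 94.6, 80.4),
    ("temporal-reasoning", "gpt-4o", 45.1, 62.4),
    ("multi-session", "gpt-4o", 44.3, 57.9),
    ("knowledge-update", "gpt-4o", 78.2, 83.3),
    ("single-session-user", "gpt-4o", 81.4, 92.9),
]

for question_type, model, full_context, zep in rows:
    change = delta(full_context, zep)
    arrow = "↑" if change >= 0 else "↓"
    print(f"{question_type:28s} {model:12s} {abs(change):6.1f}%{arrow}")
```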

These results demonstrate Zep's ability to enhance performance across model scales, with the most pronounced improvements observed in complex and nuanced question types when paired with more capable models. The latency improvements are particularly noteworthy, with Zep reducing response times by approximately 90% while maintaining higher accuracy. The decrease in performance for single-session-assistant questions (17.7% for gpt-4o and 9.06% for gpt-4o-mini) is a notable exception to Zep's otherwise consistent improvements and suggests that further research and engineering work is needed.

这些结果表明,Zep 能够跨模型规模提升性能;当与更强模型配合时,最显著的提升出现在复杂且细腻的问题类型上。 延迟改进尤其值得注意:Zep 在保持更高准确率的同时,将响应时间降低了约 90%。 single-session-assistant 问题上的性能下降——gpt-4o 下降 17.7%,gpt-4o-mini 下降 9.06%——是 Zep 一贯提升之外一个值得注意的例外,也说明仍需进一步研究和工程改进。

5. Conclusion

We have introduced Zep, a graph-based approach to LLM memory that incorporates semantic and episodic memory alongside entity and community summaries. Our evaluations demonstrate that Zep achieves state-of-the-art performance on existing memory benchmarks while reducing token costs and operating at significantly lower latencies.

本文介绍了 Zep,一种基于图的 LLM 记忆方法;它结合了语义记忆、情节记忆,以及实体和社区摘要。 评测表明,Zep 在现有记忆基准上取得了最先进性能,同时降低了 token 成本,并以显著更低的延迟运行。

The results achieved with Graphiti and Zep, while impressive, likely represent only initial advances in graph-based memory systems. Multiple research avenues could build upon these frameworks, including integration of other GraphRAG approaches into the Zep paradigm and novel extensions of our work.

Graphiti 和 Zep 取得的结果虽然令人印象深刻,但很可能只是图式记忆系统的初步进展。 多个研究方向可以建立在这些框架之上,包括将其他 GraphRAG 方法整合进 Zep 范式,以及对本文工作的创新扩展。

Research has already demonstrated the value of fine-tuned models for LLM-based entity and edge extraction within the GraphRAG paradigm, improving accuracy while reducing costs and latency. Similar models fine-tuned for Graphiti prompts may enhance knowledge extraction, particularly for complex conversations. Additionally, while current research on LLM-generated knowledge graphs has primarily operated without formal ontologies, domain-specific ontologies present significant potential. Graph ontologies, foundational in pre-LLM knowledge graph work, warrant further exploration within the Graphiti framework.

已有研究证明,在 GraphRAG 范式中,用于基于 LLM 的实体和边抽取的微调模型具有价值,能够在降低成本和延迟的同时提高准确率。 针对 Graphiti 提示词微调的类似模型可能增强知识抽取能力,尤其是在复杂对话中。 此外,虽然当前关于 LLM 生成知识图谱的研究主要在没有形式化本体的情况下进行,但领域特定本体具有显著潜力。 图本体是前 LLM 时代知识图谱工作的基础,值得在 Graphiti 框架内进一步探索。

Our search for suitable memory benchmarks revealed limited options, with existing benchmarks often lacking robustness and complexity, frequently defaulting to simple needle-in-a-haystack fact-retrieval questions. The field requires additional memory benchmarks, particularly those reflecting business applications like customer experience tasks, to effectively evaluate and differentiate memory approaches. Notably, no existing benchmarks adequately assess Zep's capability to process and synthesize conversation history with structured business data. While Zep focuses on LLM memory, its traditional RAG capabilities should be evaluated against established benchmarks such as FinanceBench, BEIR, and those used in LightRAG.

作者在寻找合适记忆基准时发现可选项有限,现有基准往往缺乏鲁棒性和复杂性,经常默认采用简单的大海捞针式事实检索问题。 该领域需要更多记忆基准,尤其是能够反映客户体验任务等业务应用的基准,以便有效评估并区分不同记忆方法。 值得注意的是,现有基准都无法充分评估 Zep 处理和综合对话历史与结构化业务数据的能力。 虽然 Zep 聚焦于 LLM 记忆,但其传统 RAG 能力也应在 FinanceBench、BEIR 以及 LightRAG 使用的既有基准上进行评估。

Current literature on LLM memory and RAG systems insufficiently addresses production system scalability in terms of cost and latency. We have included latency benchmarks for our retrieval mechanisms to begin addressing this gap, following the example set by LightRAG's authors in prioritizing these metrics.

当前关于 LLM 记忆和 RAG 系统的文献,在成本和延迟层面的生产系统可扩展性方面讨论不足。 作者纳入了检索机制的延迟基准,以开始弥补这一空白;这也延续了 LightRAG 作者优先关注这些指标的做法。