LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

MemoryBenchmarkLongMemEvalICLR 2025

Wu D, Wang H, Yu W, et al. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory[J]. arXiv preprint arXiv:2410.10813, 2024.

https://github.com/xiaowu0162/LongMemEval

LongMemEval：面向长期交互记忆的聊天助手基准评测

Abstract

Recent large language model (LLM)-driven chat assistant systems have integrated memory components to track user-assistant chat histories, enabling more accurate and personalized responses. However, their long-term memory capabilities in sustained interactions remain underexplored. We introduce LongMemEval, a comprehensive benchmark designed to evaluate five core long-term memory abilities of chat assistants: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. With 500 meticulously curated questions embedded within freely scalable user-assistant chat histories, LongMemEval presents a significant challenge to existing long-term memory systems, with commercial chat assistants and long-context LLMs showing a 30% accuracy drop on memorizing information across sustained interactions. We then present a unified framework that breaks down the long-term memory design into three stages: indexing, retrieval, and reading. Built upon key experimental insights, we propose several memory design optimizations including session decomposition for value granularity, fact-augmented key expansion for indexing, and time-aware query expansion for refining the search scope. Extensive experiments show that these optimizations greatly improve both memory recall and downstream question answering on LongMemEval. Overall, our study provides valuable resources and guidance for advancing the long-term memory capabilities of LLM-based chat assistants, paving the way toward more personalized and reliable conversational AI.

近期由大语言模型（LLM）驱动的聊天助手系统已经集成记忆组件，用来追踪用户与助手的聊天历史，从而生成更准确、更个性化的回复。然而，它们在持续交互中的长期记忆能力仍然缺乏充分研究。本文介绍 LongMemEval，这是一个综合基准，用于评估聊天助手的五种核心长期记忆能力：信息抽取、多会话推理、时间推理、知识更新和拒答。 LongMemEval 包含 500 个精心构造的问题，并将这些问题嵌入可自由扩展的用户-助手聊天历史中；它对现有长期记忆系统构成了显著挑战，商业聊天助手和长上下文 LLM 在持续交互中记忆信息时出现了 30% 的准确率下降。随后，作者提出一个统一框架，将长期记忆设计拆分为三个阶段：索引、检索和阅读。基于关键实验洞察，作者提出了若干记忆设计优化，包括用于 value 粒度的会话分解、用于索引的事实增强 key 扩展，以及用于细化搜索范围的时间感知 query 扩展。大量实验表明，这些优化显著提升了 LongMemEval 上的记忆召回和下游问答表现。总体而言，本文为推进基于 LLM 的聊天助手长期记忆能力提供了有价值的资源和指导，为更个性化、更可靠的对话式 AI 铺平道路。

1. Introduction

Large language models (LLMs) have exhibited impressive capabilities in solving diverse tasks through natural language, leading to numerous successful chat assistant applications. Nevertheless, LLMs face limitations on tasks relying heavily on personal knowledge accumulated through long-term user-AI interactions, such as psychological counseling or secretarial duties. Failing to incorporate user background and preferences into responses can diminish the response's accuracy as well as user satisfaction. To personalize LLM-based assistants, long-term memory, the ability to memorize, recall, and reason with a long interaction history, is indispensable. Recently, several commercial and open-source assistant systems with memory have been introduced. These systems leverage techniques like compressing, indexing, and retrieving from chat histories to generate more accurate and personalized responses.

大语言模型（LLM）已经展现出通过自然语言解决多样任务的强大能力，并催生了许多成功的聊天助手应用。尽管如此，在高度依赖长期用户-AI 交互中积累的个人知识的任务上，例如心理咨询或秘书工作，LLM 仍然存在局限。如果回复中未能纳入用户背景和偏好，回复准确性和用户满意度都会下降。要让基于 LLM 的助手实现个性化，长期记忆不可或缺，也就是记住、召回并基于长期交互历史进行推理的能力。最近，一些带有记忆能力的商业和开源助手系统被提出。这些系统利用聊天历史压缩、索引和检索等技术，生成更准确、更个性化的回复。

Despite these advances, there has been limited progress in holistically evaluating the memory capability in long-term interactions. While several benchmarks evaluate LLMs on understanding long chat histories, they have two major shortcomings. First, they do not accurately reflect user-AI interactions: many focus solely on human-human conversations, while others omit task-oriented dialogues, which represent a significant portion of chat assistant usage and challenge memorization with the long-context inputs and long-form responses. Their interactive histories also typically have a non-configurable length spanning only a few thousand tokens, limiting the difficulty as current systems continue to improve. Second, current benchmarks' questions only offer a limited coverage of the memory abilities required in dynamic long-term interactions. For instance, MemoryBank and PerLTQA insufficiently evaluate the ability to synthesize information across numerous sessions or to reason with temporal metadata or time references. All long-term memory benchmarks including recent ones such as LoCoMo also fail to evaluate recall of information provided by the assistant or reasoning with updated user information.

尽管已有这些进展，针对长期交互中记忆能力的整体评估仍然进展有限。虽然已有若干基准评估 LLM 对长聊天历史的理解能力，但它们存在两个主要不足。第一，它们不能准确反映用户-AI 交互：许多基准只关注人类之间的对话，另一些则忽略任务导向对话；而任务导向对话占聊天助手使用场景的重要部分，并且其长上下文输入和长格式回复会对记忆能力提出挑战。这些交互历史通常长度不可配置，而且只有几千个 token，随着当前系统不断进步，这会限制评测难度。第二，当前基准的问题对动态长期交互所需的记忆能力覆盖有限。例如，MemoryBank 和 PerLTQA 对跨大量会话综合信息的能力，以及基于时间元数据或时间引用进行推理的能力评估不足。包括 LoCoMo 等近期基准在内的所有长期记忆基准，也都未能评估助手侧提供信息的召回能力，或基于已更新用户信息进行推理的能力。

We introduce LongMemEval, a comprehensive benchmark for assessing the long-term memory capabilities of chat assistants. LongMemEval consists of 500 manually created questions to test five core memory abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. Each question requires recalling information hidden within one or more task-oriented dialogues between a user and an assistant. Inspired by the "needle-in-a-haystack" test, we design a pipeline to compile a coherent and length-configurable chat history for each question. A chat system, then, is required to parse the dynamic interactions online for memorization, and answer the question after all the interaction sessions. While the length of the history is freely extensible, we provide two standard settings for consistent comparison: LongMemEval $_{S}$ with approximately 115k tokens per problem and LongMemEval $_{M}$ with 500 sessions (around 1.5 million tokens). Preliminary evaluations highlight the difficulty of LongMemEval, as long-context LLMs show a 30% to 60% performance drop on LongMemEval $_{S}$ , and manual evaluations reveal that state-of-the-art commercial systems only achieved 30% to 70% accuracy in a setting much simpler than LongMemEval $_{S}$ (Figure 3).

作者提出 LongMemEval，这是一个用于评估聊天助手长期记忆能力的综合基准。 LongMemEval 包含 500 个人工创建的问题，用于测试五种核心记忆能力：信息抽取、多会话推理、时间推理、知识更新和拒答。每个问题都需要召回隐藏在一个或多个用户与助手之间任务导向对话中的信息。受“needle-in-a-haystack”测试启发，作者设计了一条流程，为每个问题编译连贯且长度可配置的聊天历史。随后，聊天系统需要在线解析动态交互并进行记忆，并在所有交互会话结束后回答问题。虽然历史长度可以自由扩展，作者仍提供两个标准设置以便一致比较：LongMemEval $_{S}$ 每个问题约 115k token，LongMemEval $_{M}$ 包含 500 个会话，约 150 万 token。初步评估凸显了 LongMemEval 的难度：长上下文 LLM 在 LongMemEval $_{S}$ 上出现 30% 到 60% 的性能下降，而人工评估显示，最先进的商业系统在远比 LongMemEval $_{S}$ 简单的设置中也只有 30% 到 70% 的准确率（见图3）。

Finally, we present a unified view for memory-augmented chat assistants. Leveraging LongMemEval, we comprehensively analyze memory design choices across three key execution stages--indexing, retrieval, and reading--and four control points: value, key, query, and reading strategy. Experimental insights identify several effective memory designs:

最后，作者提出了一个面向记忆增强聊天助手的统一视角。借助 LongMemEval，作者围绕三个关键执行阶段（索引、检索和阅读）以及四个控制点（value、key、query 和阅读策略），全面分析记忆设计选择。实验洞察揭示了若干有效的记忆设计：

Instead of sessions, round is the more optimal granularity for storing and utilizing the interactive history.
While further compression into individual user facts harms overall performance due to information loss, it improves the multi-session reasoning accuracy.
While using a flat index with the memory values themselves as the keys is a strong baseline, further expanding the keys with extracted user facts improves both memory recall and downstream question answering.
Naive time-agnostic memory designs perform poorly on temporal reasoning questions.
We propose a simple indexing and query expansion strategy to explicitly associate timestamps with facts and narrow down the search range, improving the memory recall for temporal reasoning by 6.8% to 11.3% when a strong LLM is employed for query expansion.
Even with perfect memory recall, accurately utilizing retrieved items is non-trivial.
Applying Chain-of-Note and structured data format improves question answering accuracy by as much as 10 absolute points across three LLMs.

相比 session，round 是存储和利用交互历史时更优的粒度。
将内容进一步压缩为单独的用户事实会因信息损失而损害整体表现，但会提升多会话推理准确率。
使用记忆 value 本身作为 key 的扁平索引是一个强基线，但进一步用抽取出的用户事实来扩展 key，可以同时提升记忆召回和下游问答表现。
朴素的时间无关记忆设计在时间推理问题上表现较差。
作者提出一种简单的索引和 query 扩展策略，用于显式关联时间戳与事实并缩小搜索范围；当使用强 LLM 进行 query 扩展时，时间推理的记忆召回提升 6.8% 到 11.3%。
即使记忆召回完美，准确利用检索到的条目也并不容易。
应用 Chain-of-Note 和结构化数据格式，在三个 LLM 上最多可将问答准确率提升 10 个绝对点。

Long-Term Dialogue Benchmarks

As the ability of dialogue systems improve, research starts to focus on long-term dialogue understanding beyond traditional dialogue modeling benchmarks. Early works focused on language modeling evaluation on generating personalized responses from human-human or human-AI chat histories. To more precisely evaluate memory accuracy, subsequent benchmarks shifted toward question answering. For example, MemoryBank features multi-day chat histories from 15 users with 194 human-written probing questions. LoCoMo includes 50 long-term chat histories and questions testing single-hop, multi-hop, temporal, commonsense, world knowledge, and adversarial reasoning. PerLTQA scales the evaluation to 3,409 dialogues and 8,593 questions, covering world knowledge, personal profiles, social relationships, events, and dialogue history. DialSim evaluates models' memory ability by roleplaying TV show characters and introduces a time constraint that penalizes slow system responses. Despite these advancements, existing QA-based benchmarks overlook several memory capabilities critical to long-term user-assistant interactions: synthesizing information across numerous sessions, recalling assistant side information, and reasoning about updated user details or complex temporal references. Additionally, the chat histories are often too brief and do not reflect the nature of task-oriented interactions. Table 1 compares between LongMemEval and previous works, highlighting its advantages in both (1) featuring a long and freely extensible iterative history and (2) holistically covering critical memory abilities in a uniquely challenging way (further examples in Figure 1).

随着对话系统能力提升，研究开始超越传统对话建模基准，转向长期对话理解。早期工作侧重于语言建模评估，即从人类-人类或人类-AI 聊天历史中生成个性化回复。为了更精确地评估记忆准确性，后续基准转向问答。例如，MemoryBank 包含来自 15 位用户的多日聊天历史，以及 194 个人类编写的探测问题。 LoCoMo 包含 50 段长期聊天历史，并用问题测试单跳、多跳、时间、常识、世界知识和对抗推理。 PerLTQA 将评估扩展到 3,409 段对话和 8,593 个问题，覆盖世界知识、个人画像、社会关系、事件和对话历史。 DialSim 通过扮演电视剧角色来评估模型记忆能力，并引入时间约束，对响应缓慢的系统进行惩罚。尽管已有这些进展，现有基于 QA 的基准仍忽视了长期用户-助手交互中若干关键记忆能力：跨大量会话综合信息、召回助手侧信息，以及基于更新后的用户细节或复杂时间引用进行推理。此外，聊天历史往往过短，也不能反映任务导向交互的性质。表1 将 LongMemEval 与已有工作进行比较，突出其两方面优势：（1）具备长且可自由扩展的迭代历史；（2）以独特且具有挑战性的方式整体覆盖关键记忆能力（更多示例见图1）。

表1：LongMemEval 与已有长期记忆基准的比较。颜色表示对话来源：青色为 human-human，紫色为 human-AI。#Sess 和 #Q 分别表示会话总数和问题总数；Context Depth 表示历史中的 token 数。IE、MR、KU、TR 和 ABS 分别表示信息抽取、多会话推理、知识更新、时间推理和拒答。

Benchmark	Domain	#Sess	#Q	Context Depth	Core Memory Abilities
					IE	MR	KU	TR	ABS
MSC	Open-Domain	5k	-	1k	✗	✗	✗	✗	✗
DuLeMon	Open-Domain	30k	-	1k	✗	✗	✗	✗	✗
MemoryBank	Personal	300	194	5k	✓	✗	✗	✓	✗
PerLTQA	Personal	4k	8593	1M*	✓	✗	✗	✗	✓
LoCoMo	Personal	1k	7512	10k	✓	✓	✗	✓	✓
DialSim	TV Shows	1k-2k	1M	350k	✓	✓**	✗	✓	✓
LongMemEval	Personal	50k	500	115k, 1.5M	✓	✓	✓	✓	✓

* 原论文未报告，基于作者近似估计。** 最多 2 个会话。

Examples of question types in LongMemEval — 图1：LongMemEval 中七种问题类型的示例。每个示例左侧展示相关证据陈述，右侧展示问题和答案。

Long-Term Memory Methods

To equip chat assistants with long-term memory capabilities, three major techniques are commonly explored. The first approach involves directly adapting LLMs to process extensive history information as long-context inputs. While this method avoids the need for complex architectures, it is inefficient and susceptible to the "lost-in-the-middle" phenomenon, where the ability of LLMs to utilize contextual information weakens as the input length grows. A second line of research integrates differentiable memory modules into language models, proposing specialized architectural designs and training strategies to enhance memory capabilities. Lastly, several studies approach long-term memory from the perspective of context compression, developing techniques to condense lengthy histories into compact representations, whether in the form of LLM internal representations, discrete tokens, or retrievable text segments via retrieval-augmented generation (RAG). Although LongMemEval can evaluate any memory system, we will take an online context compression perspective, where each history interaction session is sequentially processed, stored, and accessed on-demand through indexing and retrieval mechanisms (Figure 4). This formulation aligns with current literature and commercial systems. Its plug-and-play nature also facilitates the integration into existing chat assistant systems.

为了让聊天助手具备长期记忆能力，研究中通常探索三类主要技术。第一类方法是直接改造 LLM，使其能够将大量历史信息作为长上下文输入处理。虽然这种方法避免了复杂架构需求，但效率较低，并且容易受到“lost-in-the-middle”现象影响：随着输入长度增长，LLM 利用上下文信息的能力会减弱。第二类研究将可微记忆模块集成到语言模型中，提出专门的架构设计和训练策略来增强记忆能力。最后，一些研究从上下文压缩视角处理长期记忆，开发技术将冗长历史压缩为紧凑表示，这些表示可以是 LLM 内部表示、离散 token，或通过检索增强生成（RAG）检索的文本片段。虽然 LongMemEval 可以评估任意记忆系统，但本文采用在线上下文压缩视角：每个历史交互会话会被顺序处理、存储，并通过索引和检索机制按需访问（见图4）。这一形式化与当前文献和商业系统保持一致。其即插即用的特性也便于集成到现有聊天助手系统中。

3. LongMemEval

3.1 Problem Formulation

The evaluation of LongMemEval requires an instance of 4-tuple $(S, q, t_{q}, a)$ . $S \equiv [(t_{1}, S_{1}), (t_{2}, S_{2}), . . ., (t_{N}, S_{N})]$ is a sequence of $N$ history chat sessions ordered from the earliest to the latest, where $S_{i}$ is a multi-turn interaction between the user and a chat assistant and $t_{i}$ is the session's timestamp. Each session can be further decomposed into rounds: one user message followed by one assistant response. During test time, $S$ is provided to the system one by one. $q$ and $t_{q} > t_{N}$ represent the question from the user and its date. $a$ is a short phrase indicating the answer, or a natural language rubric describing the preferred answer in the case where $q$ is open-ended.

LongMemEval 的评估需要一个四元组实例 $(S, q, t_{q}, a)$ 。 $S \equiv [(t_{1}, S_{1}), (t_{2}, S_{2}), . . ., (t_{N}, S_{N})]$ 是由早到晚排序的 $N$ 个历史聊天会话序列，其中 $S_{i}$ 是用户和聊天助手之间的多轮交互， $t_{i}$ 是该会话的时间戳。每个会话还可以进一步拆分为 round：一条用户消息后接一条助手回复。测试时， $S$ 会逐个提供给系统。 $q$ 和 $t_{q} > t_{N}$ 分别表示用户的问题及其日期。 $a$ 是表示答案的短语；如果 $q$ 是开放式问题，则 $a$ 是描述偏好答案的自然语言评分准则。

3.2 LongMemEval: Benchmark Curation

One major challenge in building a reliable personalized assistant is performing online recording, recalling, updating, and reasoning on the dynamically evolving user information. To comprehensively reflect the challenge, LongMemEval formulates five core long-term memory abilities:

构建可靠个性化助手的一个主要挑战，是对动态演化的用户信息进行在线记录、召回、更新和推理。为了全面反映这一挑战，LongMemEval 定义了五种核心长期记忆能力：

Information Extraction (IE): Ability to recall specific information from extensive interactive histories, including the details mentioned by either the user or the assistant.
Multi-Session Reasoning (MR): Ability to synthesize the information across multiple history sessions to answer complex questions that involve aggregation and comparison.
Knowledge Updates (KU): Ability to recognize the changes in the user's personal information and update the knowledge of the user dynamically over time.
Temporal Reasoning (TR): Awareness of the temporal aspects of user information, including both explicit time mentions and timestamp metadata in the interactions.
Abstention (ABS): Ability to identify questions seeking unknown information, i.e., information not mentioned by the user in the interaction history, and answer "I don't know".

信息抽取（IE）： 从大量交互历史中召回特定信息的能力，包括用户或助手提到的细节。
多会话推理（MR）： 综合多个历史会话中的信息，以回答涉及聚合和比较的复杂问题的能力。
知识更新（KU）： 识别用户个人信息变化，并随时间动态更新用户知识的能力。
时间推理（TR）： 感知用户信息时间属性的能力，包括交互中的显式时间提及和时间戳元数据。
拒答（ABS）： 识别询问未知信息的问题，即用户在交互历史中未提及的信息，并回答“I don't know”的能力。

As shown in Table 1, this formulation represents a more comprehensive ability coverage compared to prior long-term memory benchmarks like MemoryBank and PerLTQA. To thoroughly assess these abilities, LongMemEval features seven question types. Single-session-user and single-session-assistant test memorizing the information mentioned by user or assistant within a single session. Single-session-preference tests whether the model can utilize the user information to generate a personalized response. Multi-session (MR) tests aggregating user information across two or more sessions. Knowledge-update (KU) focuses on the ability to recognize changes in the user's life states and update the memory accordingly. Temporal-reasoning (TR) tests reasoning with both the timestamp in metadata and explicit time references. Finally, we draw 30 questions from the previous question types and modify them into "false premise" questions, testing whether the model can correctly abstain from answering (ABS). Figure 1 presents an example for each question type.

如表1所示，与 MemoryBank 和 PerLTQA 等已有长期记忆基准相比，这一形式化覆盖了更全面的能力。为了全面评估这些能力，LongMemEval 包含七种问题类型。 Single-session-user 和 single-session-assistant 测试模型是否能记住单个会话中用户或助手提到的信息。 Single-session-preference 测试模型是否能够利用用户信息生成个性化回复。 Multi-session（MR）测试跨两个或更多会话聚合用户信息的能力。 Knowledge-update（KU）关注识别用户生活状态变化并相应更新记忆的能力。 Temporal-reasoning（TR）测试模型能否同时利用元数据中的时间戳和显式时间引用进行推理。最后，作者从前述问题类型中抽取 30 个问题并改写为“false premise”问题，用于测试模型是否能正确拒答（ABS）。图1 展示了每种问题类型的示例。

Question Curation

Figure 2 depicts the question curation pipeline. We define an ontology of 164 user attributes in five categories: lifestyle, belongings, life events, situations context, and demographic information. For each attribute, we leverage an LLM to generate attribute-focused user background paragraphs, each of which includes detailed discussion of the user's life experience. To create a question, we randomly sample a paragraph and use an LLM to propose several seed (question, answer) pairs. As these LLM-proposed questions often lack depth and diversity, human experts manually filter and rewrite all the questions to achieve the desired difficulty. Then, we manually decompose the answer into one or more evidence statements with optional timestamps.

图2 展示了问题构造流程。作者定义了一个包含 164 个用户属性的本体，分为五类：生活方式、物品、人生事件、情境上下文和人口统计信息。对于每个属性，作者使用 LLM 生成聚焦于该属性的用户背景段落，每个段落都包含对用户生活经历的详细讨论。在创建问题时，作者随机采样一个段落，并使用 LLM 提出若干种子（问题，答案）对。由于这些 LLM 提出的问题往往缺乏深度和多样性，人类专家会手动筛选并重写所有问题，以达到期望难度。随后，作者手动将答案分解为一个或多个证据陈述，并可选附加时间戳。

Evidence Session Construction

Each evidence statement is then separately embedded into a task-oriented evidence session created by self-chatting. The user LLM is instructed to convey the evidence statement indirectly, e.g., instead of stating "I bought a new car last month," it might instead ask for help about car insurance and reveal the information incidentally. This approach enhances the benchmark's difficulty by requiring systems to recognize and memorize user details not explicitly emphasized in conversations. We present the full details in Appendix.

随后，每条证据陈述都会被单独嵌入到一个通过 self-chatting 创建的任务导向证据会话中。用户 LLM 会被要求间接传达证据陈述；例如，它不会直接说“我上个月买了一辆新车”，而可能改为询问汽车保险相关帮助，并顺带透露该信息。这种方法提升了基准难度，因为系统必须识别并记住对话中没有被显式强调的用户细节。完整细节见附录。

To ensure the data quality, all the evidence sessions are then manually screened and edited to (1) verify evidence inclusion, (2) distribute the evidence statement across different conversation positions, and (3) rephrase statements into more natural, colloquial language, especially for time mentions, which LLMs often express too formally. We also meticulously annotate the position of the evidence statement within each evidence session. For questions involving temporal information, we then manually add timestamps to both the evidence sessions and the questions. Most questions require evidence from multiple sessions (up to six) with evidence statements positioned diversely within sessions. Appendix presents further statistics of the final constructed data.

为保证数据质量，所有证据会话都会经过人工筛查和编辑：（1）验证证据是否被包含，（2）将证据陈述分布到不同对话位置，（3）将陈述改写得更自然、更口语化，尤其是时间提及，因为 LLM 往往会把它们表达得过于正式。作者还细致标注了每条证据陈述在对应证据会话中的位置。对于涉及时间信息的问题，作者随后会人工为证据会话和问题添加时间戳。大多数问题需要来自多个会话的证据，最多六个，并且证据陈述在会话中的位置多样。附录给出了最终构造数据的更多统计。

Data creation pipeline of LongMemEval — 图2：LongMemEval 数据创建流程：（a）人类专家构造所有问题和证据陈述；（b）证据会话由 LLM 模拟并由人工编辑；（c）完整用户-AI 聊天历史在测试时构造，其长度可自由配置。

History Compilation

For each question, LongMemEval compiles a coherent user-AI chat history (Figure 2c). Our approach is analogous to the needle-in-a-haystack test, which asks a model to retrieve brief information (the "needle") embedded in a long document (the "haystack"). In comparison, LongMemEval is more challenging and realistic as it involves retrieving and synthesizing information from multiple extended evidence sessions. Specifically, we sample a number of unrelated user-AI chat sessions, randomly insert the evidence sessions in the middle, and assign a plausible timestamp to all sessions. We draw the irrelevant sessions from two sources: (1) self-chat sessions simulated based on other non-conflicting attributes and (2) publicly released user-AI style chat data including ShareGPT and UltraChat. This design creates extensible realistic chat histories with minimal conflicts. While the pipeline allows us to compile chat histories of arbitrary length, we provide two standard settings: LongMemEval $_{S}$ (about 115k tokens/question) and LongMemEval $_{M}$ (500 sessions, about 1.5M tokens).

对于每个问题，LongMemEval 都会编译一段连贯的用户-AI 聊天历史（图2c）。作者的方法类似 needle-in-a-haystack 测试：要求模型从长文档（“haystack”）中检索嵌入其中的简短信息（“needle”）。相比之下，LongMemEval 更具挑战性也更真实，因为它涉及从多个扩展证据会话中检索并综合信息。具体而言，作者采样若干无关的用户-AI 聊天会话，将证据会话随机插入中间，并为所有会话分配合理时间戳。无关会话来自两个来源：（1）基于其他不冲突属性模拟的 self-chat 会话；（2）公开发布的用户-AI 风格聊天数据，包括 ShareGPT 和 UltraChat。这一设计创建了可扩展、真实且冲突最少的聊天历史。虽然该流程允许编译任意长度的聊天历史，作者提供了两个标准设置：LongMemEval $_{S}$ （每个问题约 115k token）和 LongMemEval $_{M}$ （500 个会话，约 150 万 token）。

3.3 Evaluation Metric

Question Answering

As the correct answers can take flexible forms, an exact matching strategy as in previous works can result in inaccurate evaluations. To address this, LongMemEval employs a LLM to assess response quality. Specifically, we prompt-engineer the gpt-4o-2024-08-06 model via the OpenAI API. Our meta-evaluation study demonstrates that the evaluator achieves more than 97% agreement with human experts. The prompts for each problem type as well as the human meta-evaluation details are presented in Appendix.

由于正确答案可以有灵活形式，像以往工作那样使用精确匹配策略可能导致评估不准确。为解决这一问题，LongMemEval 使用 LLM 来评估回复质量。具体而言，作者通过 OpenAI API 对 gpt-4o-2024-08-06 模型进行 prompt engineering。作者的元评估研究表明，该评估器与人类专家的一致率超过 97%。每类问题的提示词以及人工元评估细节见附录。

Memory Recall

As LongMemEval contains human-annotated answer location labels, intermediate retrieval metrics can be easily calculated if the chat system exposes its retrieval results. We report Recall@ $k$ and NDCG@ $k$ , where $k$ is the number of top items retrieved by the system.

由于 LongMemEval 包含人工标注的答案位置标签，如果聊天系统暴露其检索结果，就可以方便地计算中间检索指标。作者报告 Recall@ $k$ 和 NDCG@ $k$ ，其中 $k$ 是系统检索出的 top items 数量。

3.4 LongMemEval represents a significant challenge

Using LongMemEval, we conduct a pilot study on commercial systems and long-context LLMs.

作者使用 LongMemEval 对商业系统和长上下文 LLM 进行了试点研究。

Commercial systems

We evaluate two commercial systems that maintain a set of memorized user facts as the user chats with the assistant: ChatGPT and Coze. Since these systems only support memory features via their web interfaces, we randomly selected 97 questions and created a short chat history of 3-6 sessions (approximately 10x shorter than LongMemEval $_{S}$ ). Human annotators interacted with the chat assistants session-by-session and turn-by-turn, and finally ask the question in a new session. In Figure 3, we compare this online memory evaluation with offline reading, where GPT-4o is prompted to answer with the complete history provided as context. Both ChatGPT and Coze exhibited significant performance drops compared to offline reading, underscoring the challenging nature of LongMemEval. We found ChatGPT tended to overwrite crucial information as the chat continues, while Coze often failed to record indirectly provided user information. We provide analyses in Appendix. Overall, this result highlights the gap between building a seemingly personalized chat assistant by recalling isolated facts and demonstrating a genuinely strong memory ability.

作者评估了两个商业系统，它们会在用户与助手聊天时维护一组已记忆的用户事实：ChatGPT 和 Coze。由于这些系统只能通过网页界面使用记忆功能，作者随机选择 97 个问题，并创建了 3 到 6 个会话的短聊天历史，大约比 LongMemEval $_{S}$ 短 10 倍。人工标注者逐会话、逐轮与聊天助手交互，最后在新会话中提出问题。在图3中，作者将这种在线记忆评估与离线阅读进行比较；离线阅读中，GPT-4o 会在给定完整历史作为上下文的情况下回答。与离线阅读相比，ChatGPT 和 Coze 都出现了显著性能下降，这凸显了 LongMemEval 的挑战性。作者发现，随着聊天继续，ChatGPT 往往会覆盖关键信息，而 Coze 常常无法记录间接提供的用户信息。相关分析见附录。总体而言，这一结果强调了通过召回孤立事实来构建看似个性化的聊天助手，与真正展示强记忆能力之间存在差距。

System	LLM	Accuracy
Offline Reading	GPT-4o	0.9184
ChatGPT	GPT-4o	0.5773
ChatGPT	GPT-4o-mini	0.7113
Coze	GPT-4o	0.3299
Coze	GPT-3.5-turbo	0.2474

Model	Size	Oracle	S	% Drop
No Chain-of-Note
GPT-4o	-	0.870	0.606	30.3%↓
Llama 3.1 Instruct	70B	0.744	0.334	55.1%↓
Llama 3.1 Instruct	8B	0.710	0.454	36.1%↓
Phi-3 128k Instruct	14B	0.702	0.380	45.9%↓
Phi-3.5 Mini Instruct	4B	0.660	0.342	48.1%↓
With Chain-of-Note
GPT-4o	-	0.924	0.640	30.7%↓
Llama 3.1 Instruct	70B	0.848	0.286	66.3%↓
Llama 3.1 Instruct	8B	0.710	0.420	40.8%↓
Phi-3 128k Instruct	14B	0.722	0.344	52.4%↓
Phi-3.5 Mini Instruct	4B	0.652	0.324	50.3%↓

图3：对商业系统和长上下文 LLM 的试点研究。（a）商业记忆增强聊天助手在 LongMemEval 上表现较弱；（b）长上下文 LLM 在 LongMemEval

_{S}

上相较 oracle setting 出现明显性能下降。

Long-Context LLMs

While LongMemEval poses a significant challenge to online memory systems, is the benchmark easily tackled with offline reading over the entire history? In Figure 3, we evaluated four advanced long-context LLMs on LongMemEval $_{S}$ (with a history length of approximately 115k tokens): GPT-4o, Llama 3.1 Instruct, and Phi-3. Compared to the oracle retrieval setting (answering with only the evidence sessions as the context), these LLMs showed a 30% to 60% performance decline when tasked with reading the entire LongMemEval $_{S}$ history, regardless of whether the chain-of-note technique was applied. As the histories in LongMemEval $_{S}$ is still short (about 50 sessions), the performance is likely to further degrade as the interaction history expands. Overall, these findings suggest that even the most capable current long-context LLMs struggle to manage an ever-growing interaction history without an effective memory mechanism.

虽然 LongMemEval 对在线记忆系统构成了显著挑战，但如果离线阅读完整历史，这个基准是否容易解决？在图3中，作者在 LongMemEval $_{S}$ （历史长度约 115k token）上评估了四个先进长上下文 LLM：GPT-4o、Llama 3.1 Instruct 和 Phi-3。与 oracle retrieval 设置（仅以证据会话作为上下文进行回答）相比，当被要求阅读完整 LongMemEval $_{S}$ 历史时，这些 LLM 出现了 30% 到 60% 的性能下降，无论是否应用 Chain-of-Note 技术。由于 LongMemEval $_{S}$ 中的历史仍然较短（约 50 个会话），随着交互历史扩展，性能很可能进一步下降。总体而言，这些发现表明：如果没有有效记忆机制，即使当前最强的长上下文 LLM 也难以管理持续增长的交互历史。

4. A Unified View of Long-Term Memory Assistants

In this section, we formulate a three-stage long-term memory model for chat assistants. Despite its simplicity, this model provides a unified view of existing long-term memory assistant works. Along each of its stages, we then investigate crucial control points and propose our optimizations.

本节中，作者为聊天助手形式化一个三阶段长期记忆模型。尽管该模型很简单，但它为现有长期记忆助手工作提供了统一视角。随后，作者沿着每个阶段考察关键控制点，并提出优化方案。

4.1 Long-Term Memory System: Formulation

We formulate long-term memory as a massive key-value datastore $[(k_{1}, v_{1}), (k_{2}, v_{2}), . . .]$ . The keys $k_{i}$ can be heterogeneous, and could be discrete or continuous. In the discrete case, the key could be a sentence, a paragraph, a fact, or an entity, etc. In the continuous case, the key could be e.g., the model's internal representation under some inputs. The values $v_{i}$ might repeat. As shown in Figure 4, we formulate three stages for a memory-augmented assistant: (1) indexing, converting each history session $(t_{i}, S_{i})$ into one or more key-value items, (2) retrieval, formulating a retrieval query and collecting $k$ most relevant items, and (3) reading, an LLM $M$ reads the retrieval result and generates a response. In Table 2, we show how nine memory-augmented chat assistant systems can be viewed as instantiations of this framework. An alternative mathematical formulation is presented in Appendix. For its conciseness, the rest of this paper follows this section's formulation.

作者将长期记忆形式化为一个大规模 key-value 数据存储 $[(k_{1}, v_{1}), (k_{2}, v_{2}), . . .]$ 。 key $k_{i}$ 可以是异构的，也可以是离散或连续的。在离散情况下，key 可以是句子、段落、事实或实体等。在连续情况下，key 可以是模型在某些输入下的内部表示。 value $v_{i}$ 可能重复。如图4所示，作者为记忆增强助手形式化三个阶段：（1）索引，将每个历史会话 $(t_{i}, S_{i})$ 转换为一个或多个 key-value 条目；（2）检索，构造检索 query 并收集 $k$ 个最相关条目；（3）阅读，LLM $M$ 阅读检索结果并生成回复。在表2中，作者展示了九种记忆增强聊天助手系统如何被视作这一框架的实例。另一种数学形式化见附录。为保持简洁，本文其余部分遵循本节的形式化。

4.2 Long-Term Memory System: Design Choices

We identify four crucial control points for long-term memory of chat assistants, as illustrated in Figure 4. We analyze design choices from existing works and their limitations, and propose our optimizations. Due to space constraints, we present these designs at a high level here, with detailed designs further described in Section 5 and Appendix.

作者识别出聊天助手长期记忆的四个关键控制点，如图4所示。作者分析现有工作的设计选择及其局限，并提出自己的优化方案。由于篇幅限制，作者在此只进行高层概述，详细设计将在第 5 节和附录中进一步描述。

Unified view of long-term memory assistants — 图4：带长期记忆的聊天助手在运行中的统一视角。作者形式化了三个阶段和四个控制点（CP）。

CP 1: Value

The value represents the format and granularity of each session stored in memory. Given that user-AI chat sessions are often lengthy and cover multiple topics, storing each session as a single item can hinder effective retrieval and reading. Conversely, compressing sessions into summaries or user-specific facts, as seen in prior work such as MemoryBank and PerLTQA, can lead to information loss, harming the performance of the system to answer detailed questions. In Section 5.2, we compare three value representation strategies: storing entire sessions, decomposing sessions into individual rounds, and further applying summary/fact extraction.

value 表示记忆中存储的每个会话的格式和粒度。由于用户-AI 聊天会话往往很长且覆盖多个话题，将每个会话作为单个条目存储会阻碍有效检索和阅读。相反，像 MemoryBank 和 PerLTQA 等已有工作那样，将会话压缩为摘要或用户特定事实，可能导致信息损失，从而损害系统回答细节问题的表现。在第 5.2 节中，作者比较三种 value 表示策略：存储完整会话、将会话分解为单独 round，以及进一步进行摘要/事实抽取。

CP 2: Key

Even when sessions are decomposed and compressed, each item still contains substantial information, with only a fraction relevant to the user's query. Therefore, using the value itself as the key, a common practice in prior works, may be suboptimal. We introduce a key expansion approach in Section 5.3, where summaries, keyphrases, user facts, and timestamped events are extracted from the values to augment the index. This optimization highlights the key information and enables effective retrieval with multiple pathways.

即使会话被分解和压缩，每个条目仍包含大量信息，其中只有一部分与用户 query 相关。因此，像已有工作中常见做法那样直接使用 value 本身作为 key，可能并非最优。作者在第 5.3 节介绍 key 扩展方法：从 value 中抽取摘要、关键词短语、用户事实和带时间戳事件，以增强索引。这一优化突出关键信息，并通过多条路径实现有效检索。

CP 3: Query

For straightforward user queries, the aforementioned key-value optimizations may yield high retrieval accuracy. However, when queries involve temporal references (e.g., "Which restaurant did you recommend last weekend?"), naive similarity search proves insufficient. We address this with a time-aware indexing and query expansion strategy, where values are indexed with timestamped events, and retrieval is restricted to items within the relevant time range.

对于直接的用户 query，上述 key-value 优化可能带来较高检索准确率。然而，当 query 涉及时间引用时，例如“Which restaurant did you recommend last weekend?”，朴素相似度搜索就不够了。作者用时间感知索引和 query 扩展策略解决这一问题：value 会以带时间戳事件进行索引，检索也被限制在相关时间范围内的条目上。

CP 4: Reading Strategy

Answering complex queries may require recalling numerous memory items. Although the retrieval accuracy can be enhanced through the preceding designs, it does not guarantee that the LLM can effectively reason over the extensive context. In Section 5.5, we explore reading strategies and demonstrate that optimizations such as extracting key information before answering (Chain-of-Note) and using structured format prompting are crucial for achieving high reading performance.

回答复杂 query 可能需要召回大量记忆条目。虽然前述设计可以提升检索准确率，但这并不保证 LLM 能够有效地在大量上下文上进行推理。在第 5.5 节中，作者探索阅读策略，并证明在回答前抽取关键信息（Chain-of-Note）以及使用结构化格式提示等优化，对获得高阅读表现至关重要。

表2：通过本文提出的统一框架视角，对九种记忆增强框架进行比较。对于 ChatGPT 和 Coze，若若干设计未知，则跳过。

Method	Value	Key	Query	Retrieval	Time-aware	Reading
In-context RAG	round/session	K = V	question	flat	No	direct
MemoryBank	summary + round	K = V	question	flat	Yes	direct
LD-Agent	summary + fact	K = V	keyphrase	flat	Yes	direct
CoN	round/session	K = V	question	flat	No	CoN
ChatGPT	fact	-	-	-	No	-
Coze	fact	-	-	-	No	-
RAPTOR	round/session	node summary	question	flat/interactive	No	-
MemWalker	round/session	node summary	question	interactive	No	interactive
HippoRAG	round/session	entity	entity	PPR	No	direct
Our Design	round	K = V + fact	question + time	flat	Yes	CoN

5. Experiment Results

5.1 Experimental Setup

We mainly study three LLMs: GPT-4o, Llama 3.1 70B Instruct, and Llama 3.1 8B Instruct. For the retriever, we choose dense retrieval with the 1.5B Stella V5 model, given its high performance on MTEB. Extensive comparisons between Stella V5 and alternative retrievers are provided in Appendix. For the indexing stage, we employ Llama 3.1 8B Instruct to extract summaries, keyphrases, user facts, and timestamped events. When sessions or rounds are used as the key, we only keep the user-side utterances. In the reading stage, the retrieved items are always sorted by their timestamp to help the reader model maintain temporal consistency. Throughout Section 5.2 to Section 5.4, we apply Chain-of-Note and json format by default.

作者主要研究三个 LLM：GPT-4o、Llama 3.1 70B Instruct 和 Llama 3.1 8B Instruct。对于检索器，作者选择使用 1.5B Stella V5 模型进行稠密检索，因为它在 MTEB 上表现很高。 Stella V5 与其他检索器的广泛比较见附录。对于索引阶段，作者使用 Llama 3.1 8B Instruct 抽取摘要、关键词短语、用户事实和带时间戳事件。当 session 或 round 被用作 key 时，作者只保留用户侧话语。在阅读阶段，检索到的条目始终按时间戳排序，以帮助 reader 模型保持时间一致性。从第 5.2 节到第 5.4 节，作者默认应用 Chain-of-Note 和 json 格式。

QA performance with different value designs — 图5：LongMemEval $_{M}$ 上不同 value 设计的 QA 表现。将 session 分解为 round 可以提升 QA 表现；对于多会话推理问题，进一步用抽取事实表示 value 可以提升 QA 准确率。

5.2 Value: Decomposition improves RAG performance

Using LongMemEval $_{M}$ , we compare different value choices in a budget-aware manner. As shown in Figure 5, decomposing sessions into rounds significantly enhances reading performance with GPT-4o as the reader, while performing similarly to non-decomposed sessions when using Llama 3.1 8B Instruct as the reader. However, despite their efficiency in token usage, replacing sessions or rounds with extracted summaries or facts negatively impacts QA performance due to information loss. The only exception is with multi-session reasoning questions, where fact decomposition consistently improves performance. We hypothesize this is because fact decomposition extracts the same type of information across all sessions in a more uniform and simplified format, aiding retrieval and reading. Finally, we observe that the optimal token budget varies by the reader's capability: while Llama 3.1 8B Instruct's performance drops sharply beyond 3k retrieved tokens, GPT-4o continues to improve even with over 20k retrieved tokens.

作者使用 LongMemEval $_{M}$ ，以预算感知方式比较不同 value 选择。如图5所示，当 GPT-4o 作为 reader 时，将 session 分解为 round 会显著增强阅读表现；而当 Llama 3.1 8B Instruct 作为 reader 时，其表现与未分解 session 相近。然而，尽管用抽取摘要或事实替代 session 或 round 在 token 使用上更高效，但由于信息损失，它会对 QA 表现产生负面影响。唯一例外是多会话推理问题，其中事实分解会持续提升表现。作者推测，这是因为事实分解以更统一、更简化的格式从所有会话中抽取同类信息，从而有助于检索和阅读。最后，作者观察到最优 token 预算会随 reader 能力而变化：Llama 3.1 8B Instruct 在检索 token 超过 3k 后表现急剧下降，而 GPT-4o 即使在超过 20k 检索 token 时仍持续提升。

5.3 Key: Multi-key indexing improves retrieval and RAG

In Table 3, we explore whether summaries, keyphrases, or user facts condensed from the value can serve as better keys than the value itself. Interestingly, despite their more focused semantics, using these condensed forms alone does not enhance the memory recall performance. We hypothesize that this is due to the retriever's ability to already effectively handle long-text semantics.

在表3中，作者探索从 value 中压缩出的摘要、关键词短语或用户事实，能否作为比 value 本身更好的 key。有趣的是，尽管这些压缩形式语义更聚焦，但单独使用它们并不会增强记忆召回表现。作者推测，这是因为检索器已经能够有效处理长文本语义。

To leverage both the highlighted information from compression and the completeness of the original value, we applied a simple document expansion technique, where the compressed information is concatenated with the original value to form the key during indexing. This approach, particularly when using user facts, yielded an average improvement of 9.4% in recall@ $k$ and 5.4% in final accuracy across all models. In Appendix, we further analyze different retrievers and find with alternative retrievers, key expansion with summary and keyphrases could improve Recall@5 when session is used as the value granularity. These results suggest that multi-pathway retrieval can significantly enhance memory recall performance. In the following section, we will investigate the time constraint as another pathway to leverage.

为了同时利用压缩突出信息和原始 value 的完整性，作者采用了一种简单的文档扩展技术：在索引阶段，将压缩信息与原始 value 拼接起来形成 key。这种方法尤其在使用用户事实时，在所有模型上平均带来 9.4% 的 recall@ $k$ 提升和 5.4% 的最终准确率提升。在附录中，作者进一步分析不同检索器，并发现使用其他检索器时，当 session 作为 value 粒度时，用摘要和关键词短语进行 key 扩展可以提升 Recall@5。这些结果表明，多路径检索可以显著增强记忆召回表现。下一节中，作者将考察时间约束这一可利用的另一条路径。

表3：LongMemEval

_{M}

上不同 key 设计的检索和端到端 QA 表现。L3.1 表示 Llama 3.1 Instruct。使用抽取出的用户事实进行文档扩展（K = V + fact）显著提升检索和下游 QA。

Value = Round

Key Design	Retrieval				End-to-End QA
	Metrics@5		Metrics@10		GPT-4o		L3.1 70B		L3.1 8B
	Recall	NDCG	Recall	NDCG	Top-5	Top-10	Top-5	Top-10	Top-5	Top-10
K = V	0.582	0.481	0.692	0.512	0.615	0.670	0.600	0.624	0.518	0.534
K = fact	0.530	0.411	0.654	0.449	0.588	0.664	0.564	0.610	0.510	0.534
K = keyphrase	0.282	0.159	0.392	0.303	0.425	0.489	0.404	0.450	0.378	0.432
K = V + fact	0.644	0.498	0.784	0.536	0.657	0.720	0.638	0.682	0.566	0.572
K = V + keyphrase	0.478	0.359	0.636	0.410	0.541	0.652	0.538	0.620	0.472	0.524

Value = Session

Key Design	Retrieval				End-to-End QA
	Metrics@5		Metrics@10		GPT-4o		L3.1 70B		L3.1 8B
	Recall	NDCG	Recall	NDCG	Top-5	Top-10	Top-5	Top-10	Top-5	Top-10
K = V	0.706	0.617	0.783	0.638	0.670	0.676	0.592	0.570	0.524	0.464
K = summary	0.572	0.448	0.648	0.468	0.554	0.252	0.498	0.512	0.444	0.216
K = fact	0.642	0.524	0.814	0.571	0.644	0.512	0.544	0.550	0.470	0.404
K = keyphrase	0.482	0.375	0.576	0.401	0.618	0.498	0.440	0.450	0.388	0.414
K = V + summary	0.689	0.608	0.749	0.624	0.658	0.666	0.568	0.560	0.518	0.494
K = V + fact	0.732	0.620	0.862	0.652	0.714	0.700	0.588	0.584	0.530	0.490
K = V + keyphrase	0.710	0.587	0.768	0.602	0.665	0.672	0.590	0.566	0.526	0.508

5.4 Query: Time-aware query expansion improves temporal reasoning

A key challenge highlighted by LongMemEval in building real-world assistant systems is the need to utilize temporal information present in both metadata and user utterances to correctly answer time-sensitive queries. To address this need, we introduce a simple yet effective time-aware indexing and query expansion scheme. Specifically, values are additionally indexed by the dates of the events they contain. During retrieval, an LLM $M_{T}$ extracts a time range for time-sensitive queries, which is used to filter out a large number of irrelevant values.

LongMemEval 凸显了构建真实助手系统中的一个关键挑战：需要利用元数据和用户话语中存在的时间信息，正确回答时间敏感 query。为满足这一需求，作者提出一种简单但有效的时间感知索引和 query 扩展方案。具体而言，value 会额外根据其包含事件的日期进行索引。在检索期间，LLM $M_{T}$ 会为时间敏感 query 抽取时间范围，并用该范围过滤掉大量无关 value。

As shown in Table 4, this simple design improves recall by an average of 11.3% when using rounds as the value and by 6.8% when using sessions as the value. This improvement remains consistent when key expansion is applied during indexing. We also find that the effectiveness of this method depends on using a strong LLM for $M_{T}$ to accurately infer time ranges from queries. Llama 8B, on the other hand, struggles to generate accurate time ranges, often hallucinating or missing temporal cues even with numerous in-context examples. Further analysis is provided in Appendix.

如表4所示，这一简单设计在使用 round 作为 value 时平均提升 11.3% 的 recall，在使用 session 作为 value 时平均提升 6.8%。当在索引阶段应用 key 扩展时，这种提升仍然一致。作者还发现，该方法的有效性依赖于使用强 LLM 作为 $M_{T}$ ，以便从 query 中准确推断时间范围。另一方面，Llama 8B 难以生成准确的时间范围，即使提供大量上下文示例，也经常产生幻觉或遗漏时间线索。进一步分析见附录。

表4：LongMemEval

_{M}

时间推理子集上的检索表现。时间感知 query 扩展通过缩小检索范围显著促进检索。

Value = Session

Key Setting	Metric@5		Metric@10
	Recall	NDCG	Recall	NDCG
(1) K = V	0.639	0.630	0.651	0.707
(1) w/ Query Expansion ( $M_{T}$ = GPT-4o)	0.654	0.660	0.707	0.679
(1) w/ Query Expansion ( $M_{T}$ = Llama 3.1 8B Instruct)	0.624	0.627	0.647	0.692
(2) K = V + fact	0.684	0.688	0.721	0.782
(2) w/ Query Expansion ( $M_{T}$ = GPT-4o)	0.722	0.732	0.797	0.758
(2) w/ Query Expansion ( $M_{T}$ = Llama 3.1 8B Instruct)	0.677	0.688	0.711	0.744

Value = Round

Key Setting	Metric@5		Metric@10
	Recall	NDCG	Recall	NDCG
(1) K = V	0.421	0.462	0.499	0.511
(1) w/ Query Expansion ( $M_{T}$ = GPT-4o)	0.451	0.565	0.495	0.538
(1) w/ Query Expansion ( $M_{T}$ = Llama 3.1 8B Instruct)	0.384	0.448	0.489	0.488
(2) K = V + fact	0.489	0.500	0.550	0.598
(2) w/ Query Expansion ( $M_{T}$ = GPT-4o)	0.526	0.548	0.722	0.669
(2) w/ Query Expansion ( $M_{T}$ = Llama 3.1 8B Instruct)	0.481	0.532	0.570	0.647

5.5 Improving reading with chain-of-note and structured format

As LongMemEval requires syntheses across multiple sessions, even an optimal memory retrieval mechanism needs to return a long context to capture all relevant information. To enhance the model's ability to handle long retrieved contexts, we apply two key optimizations. First, we present retrieved items in a structured JSON format, which helps the model clearly recognize memory items as the data for reading. Additionally, we apply the Chain-of-Note (CoN) reading approach, instructing the LLM to first extract information from each memory item and then reason based on these notes. This effectively decomposes long-context reading into two simpler subtasks: copying important details and reasoning with more concise notes.

由于 LongMemEval 需要跨多个会话进行综合，即使最优记忆检索机制也需要返回较长上下文，以捕获所有相关信息。为了增强模型处理长检索上下文的能力，作者应用两项关键优化。首先，作者以结构化 JSON 格式呈现检索条目，这有助于模型清楚识别哪些记忆条目是用于阅读的数据。此外，作者应用 Chain-of-Note（CoN）阅读方法，指示 LLM 先从每个记忆条目中抽取信息，再基于这些 notes 进行推理。这有效地将长上下文阅读分解为两个更简单的子任务：复制重要细节，以及基于更简洁的 notes 进行推理。

In Figure 6, we evaluate the reading designs under the oracle retrieval setting, where only evidence sessions are provided. Surprisingly, even with perfect retrieval, a suboptimal reading strategy results in up to a 10-point absolute performance drop compared to the best approach for GPT-4o. Notably, when CoN is not applied, JSON format does not consistently outperform the natural language format. However, with CoN, JSON format consistently benefits reader LLMs of various capabilities. Appendix further analyzes error patterns of LLMs with the enhanced reading strategy.

在图6中，作者在 oracle retrieval 设置下评估阅读设计，此时只提供证据会话。令人意外的是，即使检索完美，对于 GPT-4o 而言，次优阅读策略相较最佳方法仍会导致最高 10 个绝对点的性能下降。值得注意的是，当不应用 CoN 时，JSON 格式并不会稳定超过自然语言格式。然而，在使用 CoN 时，JSON 格式会稳定惠及不同能力水平的 reader LLM。附录进一步分析了增强阅读策略下 LLM 的错误模式。

Reading strategy comparison — 图6：Oracle retrieval 设置下的问答表现。CoN 与 JSON 格式组合明显优于其他三种参数组合。NL 表示 Natural Language。

6. Conclusion

In this paper, we introduced LongMemEval, a comprehensive and challenging benchmark designed to evaluate the long-term memory abilities of chat assistants across five core memory tasks: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. Through extensive experiments with both commercial systems and long-context LLMs, we demonstrated the significant challenges posed by LongMemEval, with current systems exhibiting substantial performance drops. By analyzing key design choices across indexing, retrieval, and reading stages, we proposed effective strategies such as session decomposition, fact-augmented key expansion, and time-aware query expansion, which collectively improved both memory recall and the question answering performance. Our findings highlight the need for more sophisticated memory mechanisms to achieve personalized and reliable conversational AI, and LongMemEval offers a valuable benchmark to drive future advancements in long-term memory capabilities for chat assistants.

本文介绍了 LongMemEval，这是一个综合且具有挑战性的基准，用于围绕五个核心记忆任务评估聊天助手的长期记忆能力：信息抽取、多会话推理、时间推理、知识更新和拒答。通过对商业系统和长上下文 LLM 的大量实验，作者证明 LongMemEval 带来了显著挑战，当前系统出现了明显性能下降。通过分析索引、检索和阅读阶段的关键设计选择，作者提出了有效策略，例如 session 分解、事实增强 key 扩展和时间感知 query 扩展，这些策略共同提升了记忆召回和问答表现。作者的发现强调了更复杂记忆机制的必要性，以实现个性化且可靠的对话式 AI；LongMemEval 也提供了一个有价值的基准，用于推动聊天助手长期记忆能力的未来发展。

🤖 Rasa

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory ​

LongMemEval：面向长期交互记忆的聊天助手基准评测 ​

Abstract ​

1. Introduction ​

2. Related Work ​

Long-Term Dialogue Benchmarks ​

Long-Term Memory Methods ​

3. LongMemEval ​

3.1 Problem Formulation ​

3.2 LongMemEval: Benchmark Curation ​

Question Curation ​

Evidence Session Construction ​

History Compilation ​

3.3 Evaluation Metric ​

Question Answering ​

Memory Recall ​

3.4 LongMemEval represents a significant challenge ​

Commercial systems ​

Long-Context LLMs ​

4. A Unified View of Long-Term Memory Assistants ​

4.1 Long-Term Memory System: Formulation ​

4.2 Long-Term Memory System: Design Choices ​

CP 1: Value ​

CP 2: Key ​

CP 3: Query ​

CP 4: Reading Strategy ​

5. Experiment Results ​

5.1 Experimental Setup ​

5.2 Value: Decomposition improves RAG performance ​

5.3 Key: Multi-key indexing improves retrieval and RAG ​

5.4 Query: Time-aware query expansion improves temporal reasoning ​

5.5 Improving reading with chain-of-note and structured format ​

6. Conclusion ​