Rethinking How to Remember: Beyond Atomic Facts in Lifelong LLM Agent Memory

Abstract

To enable reliable long-term interaction, LLM agents require a memory system that can faithfully store, efficiently retrieve, and deeply reason over accumulated dialogue history. Most existing methods adopt an extracted-fact-based paradigm: handcrafted static prompts compress raw dialogues into atomic facts, which are then stored, matched, and injected into downstream reasoning. Nevertheless, such fact-centric designs inevitably discard fine-grained details in the original dialogues and fail to support deep reasoning over scattered, isolated facts. Moreover, static prompts cannot maintain consistent extraction granularity across diverse dialogue styles. To address these limitations, we propose TriMem, which maintains three coexisting representation granularities: raw dialogue segments anchored by source identifiers for storage fidelity, extracted atomic facts for efficient memory retrieval, and synthesized profiles that aggregate dispersed facts into a holistic semantic understanding for deep reasoning. We further adopt TextGrad-based prompt optimization, which iteratively refines the extraction and profiling prompts via response quality feedback, achieving lifelong evolution without any parameter updates. Extensive experiments on LoCoMo and PerLTQA across multiple LLM backbones demonstrate that TriMem consistently outperforms strong memory baselines.

1. Introduction

The rapid development of large language models (LLMs) has driven significant breakthroughs in agent technology, which has demonstrated remarkable capabilities in a wide range of scenarios. However, in practical applications, constrained by the capacity of the context window, text overflow inevitably occurs when it comes to long-context and multi-turn interaction scenarios, leading to significant limitations in agents due to the loss of key historical interaction information and broken context logic. To enable reliable long-term interaction, LLM agents require a robust memory system to effectively manage and leverage historical experiences.

Recent research has extensively investigated the design of memory modules for LLM agents, with the overarching goal of realizing three core functional capabilities: storing information from historical interactions, retrieving relevant memories upon query, and incorporating them into prompts to support reasoning. As shown on the left side of Figure 1, existing systems typically rely on extracted factual information generated by static hand-written prompts to fulfill these three objectives. They store only the key facts produced by fixed prompts, perform retrieval by similarity matching against these facts, and use the retrieved information to support reasoning. This naturally raises two critical questions. Can extracted facts benefit all three stages of an agent memory system? And can static, fixed prompts adaptively maintain a consistent and rational extraction granularity when facing highly heterogeneous real-world dialogue scenarios?

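Figure 1: Comparison with previous memory systems. TriMem establishes a three-level architecture: it uses raw dialogues to ensure information fidelity in storage, relies on key facts for efficient retrieval, and provides an in-depth understanding of facts to ensure reliable reasoning. The construction prompts are also continuously optimized based on response feedback.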

In this paper, we start by highlighting two concerning phenomena: extracted facts suffer from information loss, and reasoning for many real-world questions relies on understanding rather than simple fact matching. As shown in Figure 2, we analyze the performance of existing memory systems in terms of storage fidelity, retrieval efficiency, and reasoning quality. The results demonstrate that although structured factual information ensures efficient retrieval, some precise details are inevitably lost, since fact extraction necessarily compresses lengthy dialogues into concise representations. Fortunately, the original dialogue preserves complete information and can serve as an auxiliary source to recover these unstored details. Furthermore, while some questions can be answered using a single retrieved fact, others require integrating scattered information to form a comprehensive understanding. This suggests that a memory system should be capable of further information integration and in-depth comprehension beyond discrete factual pieces. Such a three-level architecture, consisting of raw dialogue storage, extracted-fact-based retrieval, and understanding-aided reasoning, offers the potential to balance the three core functionalities of a memory system. We also show that the diversity of information styles leads to inconsistent extraction granularity, which significantly degrades system performance. Therefore, we require a mechanism that can continuously optimize prompts, enabling the system to adaptively balance storage depth, retrieval efficiency, and reasoning quality over time.

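Figure 2: Analysis of existing agent memory systems. Systems built entirely on extracted facts achieve efficient retrieval but suffer from lossy storage and shallow reasoning. In addition, fixed prompts do not fit all contexts well, which hurts performance.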

Based on the findings of our analysis, we propose Tri-Granularity Memory (TriMem), a memory architecture that maintains three coexisting granularities of representation: verbatim dialogues, extracted facts, and synthesized profiles. In general, it builds source dialogue identifiers that preserve pointers to the original dialogue segments for each compressed memory entry, and it employs a profile module that incrementally integrates scattered facts during ingestion to form understanding (as shown in the right part of Figure 1). On the one hand, to address the insufficient retention of detailed information, a traceable link is created for each compressed memory entry, directly associating it with the original dialogue from which the entry is derived. On the other hand, to address the lack of in-depth understanding of factual information, an incremental profile is constructed from the new information in each window, and this pre-integrated understanding is provided at reasoning time, eliminating the need to perform complex comprehensive analysis over scattered facts. Furthermore, unlike previous methods that use hand-written prompts, we employ TextGrad-based prompt optimization to iteratively refine the system prompts, thereby achieving precise and high-quality construction of the memory module. To demonstrate the effectiveness of our method, we conduct extensive experiments on commonly used benchmarks and discuss various ablations to justify the design. Our main contributions are summarized as follows:

  • Conceptually, we revisit the memory systems of LLM agents from a novel three-dimensional perspective, systematically exploring the limitations of existing paradigms spanning storage fidelity, indexing efficiency and reasoning quality (in Section 2).
  • Technically, we propose TriMem, which introduces an entity profile module to avoid shallow reasoning, preserves raw dialogue identifiers to avoid detail loss, and iteratively optimizes system prompts to avoid performance fluctuation (in Section 3).
  • Experimentally, we conduct extensive experiments to verify the effectiveness of TriMem under different scenarios, including significant improvements across various benchmarks and compatibility with diverse model architectures and model sizes (in Section 4).

2. Preliminary and Motivation

In this section, we review the preliminaries of conventional agent memory systems (in Section 2.1) and conduct an in-depth analysis of the inherent limitations of existing agent memory systems (in Section 2.2). The experimental details of our analysis can be found in the Appendix.

2.1 Conventional Agent Memory Systems

We characterize the mainstream memory architecture as three core phases. (I) Storage. Given a historical dialogue $H=\{(r_t,u_t)\}_{t=1}^{T}$ consisting of $T$ turns of interaction between the user $r$ and $u$, we first partition it into discrete windows $W=\{w_1,w_2,\ldots,w_N\}$ following a specific granularity $g \in \{\text{turn},\text{session},\text{topic}\}$. Key factual information is then extracted from each window by an LLM agent $\phi$ with a fixed extraction prompt $p_e$ to generate memory entries $E=\{e_i \mid e_i=\phi(W,p_e)\}_{i=1}^{K}$. (II) Retrieval. Given a question $q$, relevant entries are retrieved from the memory bank through similarity matching, $E(q)=\operatorname{top\text{-}K}_{e_i \in E}\ \mathrm{sim}(e_i,k_q)$, where $k_q$ is either the question itself or a query derived from the question, such as keywords. (III) Reasoning. The retrieved entries are concatenated with the question and fed into the language model to generate the final response $a=\phi(E(q),q)$.
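
The retrieval and reasoning phases above can be illustrated with a minimal sketch. The bag-of-words `embed` function, the helper names, and the example entries are assumptions made for illustration only; real systems use a neural encoder for $\mathrm{sim}(\cdot,\cdot)$ and pass the assembled prompt to an LLM.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector (an assumption; a real system would use a neural encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(entries, query, k):
    """E(q): the k stored entries most similar to the search query k_q."""
    q = embed(query)
    ranked = sorted(entries, key=lambda e: cosine(embed(e), q), reverse=True)
    return ranked[:k]

def build_prompt(retrieved, question):
    """Reasoning phase: concatenate retrieved entries with the question."""
    memory = "\n".join(f"- {e}" for e in retrieved)
    return f"Memory:\n{memory}\n\nQuestion: {question}"

# Hypothetical memory entries, invented for this example.
entries = [
    "Alice started a new job as a nurse in March",
    "Bob adopted a dog named Rex",
    "Alice enjoys hiking on weekends",
]
top = retrieve(entries, "What job does Alice have", k=2)
prompt = build_prompt(top, "What job does Alice have")
```

Here the entry about the nursing job ranks first because it shares both "alice" and "job" with the query, and the unrelated entry about Bob is excluded from the top two.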

2.2 Analysis of Existing Systems

Efficient Retrieval. As illustrated in Figure 2, we compare one representative system with a full-context-based RAG strategy in terms of the ratio of successfully retrieved question-related evidence. Retrieval over extracted facts localizes relevant memory entries more efficiently and accurately. This alleviates a critical limitation of full-context-based retrieval, which tends to introduce massive amounts of irrelevant and distracting content. For instance, when a query contains the keyword career, the full-context-based RAG strategy retrieves a large number of sentences containing lexical variants such as job and career, and this redundant noise severely degrades retrieval efficiency. In contrast, extracted facts are summarized across multi-turn dialogues, which enables precise semantic matching and substantially improves retrieval performance. However, the heavy reliance of the subsequent phases on these extracted facts introduces several limitations, which we examine next.

Lossy Storage. Although retrieval over extracted facts enables accurate memory localization, the fact extraction process inevitably compresses the raw information lossily: fine-grained details are discarded, leaving the stored memory entries inherently incomplete. As illustrated in Figure 2, we calculate the coverage rate of reference answer tokens and find that the extracted facts lose 14.5% more information than the original dialogue. For example, the modifier trans in the original dialogue is omitted during fact extraction. This directly leads to a critical issue: even if the system retrieves topically relevant entries, it still fails to provide accurate and complete answers because key contextual details from the raw dialogues are missing. This implies that completely abandoning raw dialogues and relying solely on extracted facts for storage results in a permanent loss of semantic details. Such systems cannot answer detail-dependent queries or queries with high-precision constraints, and thus suffer from inevitable performance degradation.
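
A token coverage rate of this kind can be sketched as follows. This is one plausible reading of the metric, not the paper's exact definition (which is deferred to the Appendix); the tokenizer, the function names, and the example sentences are assumptions made for illustration.

```python
import re

def tokens(text):
    """Lowercased alphanumeric tokens (a simplifying assumption about tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def coverage(reference_answer, memory_text):
    """Fraction of reference-answer tokens that appear in the stored memory text."""
    ref = tokens(reference_answer)
    if not ref:
        return 0.0
    mem = set(tokens(memory_text))
    return sum(t in mem for t in ref) / len(ref)

# Invented example mirroring the dropped modifier "trans" mentioned above.
raw = "I went to a trans support group last night and it was great"
fact = "The speaker attended a support group"
answer = "trans support group"

raw_cov = coverage(answer, raw)    # every answer token is present in the raw dialogue
fact_cov = coverage(answer, fact)  # "trans" is missing from the extracted fact
```

In this toy case the raw dialogue covers the whole answer while the extracted fact covers only two of its three tokens, which is the kind of gap the coverage comparison is meant to expose.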

Shallow Reasoning. Beyond the irreversible information loss during extraction, existing extracted-fact-based memory systems also suffer from a fundamental capability bottleneck: shallow reasoning. As illustrated in Figure 2, we test the reasoning performance on questions whose evidence was correctly retrieved. The results reveal that reasoning performance on multi-evidence questions is considerably worse than on single-evidence ones, because single-evidence questions mostly rely on explicit factual content and can be answered by directly restating content in the retrieved memory entries. In contrast, multi-evidence questions demand a deep understanding of dispersed facts, such as emotional inference and logical induction. This indicates that reasoning mechanisms relying solely on extracted facts lack deep comprehension of entity semantic portraits and behavioral tendencies, which severely hinders the overall performance of LLM agents in long-term interactions.

Suboptimal Prompt. In addition to the information loss and shallow reasoning caused by extracted facts, real-world deployments further suffer from performance degradation caused by unstable extraction granularity under fixed prompts. Realistic long-term interactions involve highly diverse information styles, expression patterns, and content categories, whereas conventional systems rely on static hand-written extraction prompts. Such fixed prompts fail to adapt to heterogeneous dialogue content, making it impossible to maintain a consistent and rational fact extraction granularity. As illustrated in Figure 2, we compare reasoning performance across dialogues between different speakers and find that performance fluctuates severely. For example, the Pomodoro technique is sometimes explicitly mentioned by name, while in other cases it is implicitly described as 25 minutes on and 5 minutes off. The fixed prompt cannot recognize such high-level semantic concepts, resulting in inconsistent extraction granularity. This discrepancy further degrades the overall performance and stability of the memory system.

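Figure 3: Overview of the TriMem pipeline. TriMem splits the historical dialogue into windows, extracts multi-dimensional facts with traceable indices, and constructs entity profiles. Relevant memories are retrieved according to the query, and prompts are continuously optimized through response feedback.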

3. Method

In this section, we present TriMem, which constructs a three-level architecture spanning raw dialogues, extracted key facts, and integrated profiles (in Section 3.1). Specifically, we construct an index between raw dialogues and extracted facts to improve storage fidelity while keeping retrieval efficient (in Section 3.2), integrate scattered facts into profiles to support understanding-based reasoning (in Section 3.3), and optimize prompts via failure case analysis to realize lifelong evolution (in Section 3.4).

3.1 Overall Pipeline

As shown in Figure 3, our framework consists of the following three core components: (1) Given a set of dialogues, we extract factual information and bind each fact to a raw dialogue identifier and a profile identifier to implement memory construction. (2) Given a question, the agent first analyzes the required information and keywords to generate a search query. It then matches the query against the extracted facts to retrieve relevant memory, and further obtains the fully detailed raw dialogues and the integrated profiles via the predefined index to complete memory retrieval. (3) Based on the reasoning performance, we invoke a powerful senior model to conduct failure analysis and provide revision suggestions for the system prompts. In the following subsections, we elaborate on these components in detail.

3.2 Dense Storage with Efficient Retrieval

As in conventional systems, we perform fact extraction on the original dialogue. We first partition the raw dialogue $H$ into a sequence of overlapping windows via a sliding window function $f_{\text{seg}}$:

$$W=f_{\text{seg}}(H;l,s)=\{w_i\}_{i=1}^{N},\qquad w_i=\{(r_t,u_t)\}_{t=(i-1)s+1}^{\min(T,\,(i-1)s+l)}.$$

where $l$ denotes the window size, $s$ denotes the stride, and $N=\lceil (T-l)/s \rceil+1$ is the total number of windows. The overlapping design ensures that contextually related utterances spanning window boundaries are not artificially severed, thereby preserving local coherence for subsequent extraction. For each window $w_i$, we employ an agent $\phi$ driven by an extraction prompt $p_e$, which is composed of a task instruction and a multi-dimensional schema:

$$p_e=\big[\,\text{instruction}\ \Vert\ \text{schema}:\{f_{\text{restate}},\,f_{\text{time}},\,f_{\text{person}},\,\ldots\}\,\big].$$

where each dimension corresponds to a dedicated extraction function capturing a distinct semantic aspect of the dialogue content: $f_{\text{restate}}$ performs lossless restatement to preserve complete and unambiguous factual content, $f_{\text{time}}$ extracts absolute temporal references, $f_{\text{person}}$ identifies all persons mentioned in the window, and additional dimensions such as location, keywords, and named entities can be flexibly incorporated into the schema. This extensible multi-dimensional design ensures that extraction covers diverse semantic granularities while remaining adaptable to domain-specific requirements. Each extracted entry is thus a structured tuple, i.e., $e_i=(f_{\text{restate}}^{(i)},f_{\text{time}}^{(i)},f_{\text{person}}^{(i)},\ldots)$. Critically, unlike previous systems, we add an additional dimension $f_{\text{src}}$ to the extraction prompt to obtain a source dialogue identifier $e_i.\mathrm{src} \subseteq \{r_t,u_t \in w_i\}$ for each entry $e_i$, which addresses the lossy storage problem. Given a question $q$, the system first extracts a set of search queries $k_q$ from the question by analyzing the required information and relevant keywords. The top-$K$ most relevant entries are then retrieved from $E$ through similarity matching over the extracted facts, $E(q)=\operatorname{top\text{-}K}_{e_i \in E}\ \mathrm{sim}(k_q,e_i)$. Each retrieved entry $e_i \in E(q)$ can serve as an anchor to recover all original details through its source identifier. In this way, our system guarantees storage fidelity while keeping retrieval efficient.
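
The idea of anchoring each entry to its source turns can be sketched with a small data structure. The class name `MemoryEntry`, its field names, the use of turn indices as source identifiers, and the example dialogue (including the speakers and the date) are all assumptions made for illustration; the text above does not specify the concrete format of the identifier.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    """One extracted entry e_i; field names are hypothetical."""
    restate: str                              # f_restate: restated fact
    time: str                                 # f_time: absolute temporal reference
    persons: list                             # f_person: persons mentioned
    src: list = field(default_factory=list)   # f_src: indices of the source turns

def recover_raw(entry, dialogue):
    """Follow the source identifier back to the verbatim dialogue turns."""
    return [dialogue[t] for t in entry.src]

# Invented dialogue and entry for the example.
dialogue = [
    "Caroline: I joined a trans support group yesterday.",
    "Melanie: That sounds wonderful, how was it?",
    "Caroline: Really empowering, I want to go again.",
]
entry = MemoryEntry(
    restate="Caroline attended a support group.",
    time="2023-05-07",
    persons=["Caroline"],
    src=[0, 2],
)
details = recover_raw(entry, dialogue)
```

Retrieval still matches against the compact `restate` field, while `recover_raw` restores wording the extraction dropped (here, the modifier "trans"), which is the trade-off between retrieval efficiency and storage fidelity that the section describes.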

3.3 Scattered Fact with Integrated Profile

To address the shallow reasoning problem, we construct structured profiles over the scattered facts. We first group the extracted facts by person $p \in V$:

$$E_p=\{e_i \in E \mid f_{\text{person}}^{(i)}=p\},\qquad p \in V.$$

where $V=\bigcup_{i=1}^{K} f_{\text{person}}^{(i)}$ denotes the set of all identified persons. Then, driven by a profile prompt $p_\psi$, the person-specific entries are integrated and synthesized into a structured profile $\mathrm{Prof}(p)=\phi(E_p,p_\psi)$, which captures a multi-faceted entity-level understanding, including identity attributes (e.g., gender, age, hometown), inferred personality traits, career and educational background, interests and hobbies, as well as interpersonal dynamics and behavioral tendencies. Finally, again anchored by the retrieved entry, the integrated profiles are obtained at reasoning time through the person identifier. In this way, we provide agents with an understanding of scattered facts and avoid shallow reasoning.
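
The grouping and profile synthesis steps can be sketched as below. The dictionary layout of an entry, the function names, and the `fake_llm` stand-in are assumptions for illustration; in particular, this sketch places an entry that mentions several persons into each of their groups, which is one possible reading of the grouping rule.

```python
from collections import defaultdict

def group_by_person(entries):
    """E_p: entries grouped by each person they mention."""
    groups = defaultdict(list)
    for e in entries:
        for p in e["persons"]:
            groups[p].append(e)
    return dict(groups)

def build_profiles(entries, llm, profile_prompt):
    """Prof(p) = llm(profile_prompt, facts about p) for every identified person."""
    return {
        p: llm(profile_prompt, [e["restate"] for e in ep])
        for p, ep in group_by_person(entries).items()
    }

def fake_llm(prompt, facts):
    """Stand-in for the LLM agent: simply joins the facts (not a real model call)."""
    return f"{prompt} " + " | ".join(facts)

# Invented entries for the example.
entries = [
    {"restate": "Sam works in 25-minute focus blocks.", "persons": ["Sam"]},
    {"restate": "Sam and Jo train for a marathon.", "persons": ["Sam", "Jo"]},
]
groups = group_by_person(entries)
profiles = build_profiles(entries, fake_llm, "Profile:")
```

At reasoning time, the `persons` field of a retrieved entry would serve as the key into `profiles`, so the model receives the pre-integrated summary instead of having to combine the scattered facts itself.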

3.4 Lifelong Evolution with Optimized Prompt

To maintain consistent extraction granularity, we use TextGrad to optimize the prompts. Given the retrieved context $R(q)$, the system generates a response via an answer generation prompt $p_a$, and the output is compared against the reference answer $a$ to obtain a quality signal. We leverage this failure signal to continuously optimize the upstream extraction prompt $p_e$ and profile prompt $p_\psi$, treating them as joint trainable parameters $\theta=(p_e,p_\psi) \in \Sigma \times \Sigma$, where $\Sigma$ denotes the space of natural language strings. The complete optimization objective is:

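$$\theta^{*}=\arg\min_{\theta}\ \mathbb{E}_{(q,a)\sim\mathcal{D}}\Big[\mathrm{Score}_{\mathrm{LLM}}\big(q,\ \mathrm{LLM}(q,R(q)),\ a\big)\Big].$$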

where $\mathrm{Score}_{\mathrm{LLM}}$ evaluates answer quality via an LLM judge and is compatible with metrics such as F1 and BLEU. Rather than using numerical gradients, TextGrad backpropagates the loss $L$ through natural language editing suggestions given by an LLM agent $\mathrm{TextGrad}_{\mathrm{LLM}}$:

$$g_e,\,g_\psi=\frac{\partial L}{\partial \theta}=\mathrm{TextGrad}_{\mathrm{LLM}}(L,\theta) \in \Sigma.$$

The prompts are then updated by applying these textual gradients as rewriting instructions: $p_e^{(t+1)}=p_e^{(t)} \oplus g_e^{(t)}$ and $p_\psi^{(t+1)}=p_\psi^{(t)} \oplus g_\psi^{(t)}$, where $\oplus$ denotes the prompt editing operator. Through iterative optimization, $p_e$ progressively refines the extraction granularity to retain semantically valuable details, while $p_\psi$ improves its capacity to capture behavioral patterns and entity-level understanding, enabling our system to evolve continuously without manual prompt engineering.
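
The update rule can be read as a simple loop over failure cases. The sketch below does not use the TextGrad library; `run_pipeline`, `score`, `critic`, and `edit` are hypothetical callables standing in for the memory pipeline, the LLM judge, the failure-analysis model, and the prompt editing operator. It also assumes a higher-is-better score in [0, 1], whereas the objective above is written as a minimization.

```python
def optimize_prompts(prompts, train_set, run_pipeline, score, critic, edit, rounds=4):
    """Iteratively rewrite prompts from textual feedback on failed questions."""
    history = []
    for _ in range(rounds):
        failures, total = [], 0.0
        for question, reference in train_set:
            answer = run_pipeline(prompts, question)
            s = score(question, answer, reference)
            total += s
            if s < 1.0:
                failures.append((question, answer, reference))
        history.append(total / len(train_set))
        if not failures:
            break
        # "Textual gradients": one natural-language suggestion per prompt.
        gradients = critic(prompts, failures)
        prompts = {name: edit(prompts[name], gradients[name]) for name in prompts}
    return prompts, history

# Toy stand-ins so the loop can run without any model (all invented).
def run_pipeline(prompts, question):
    return "pomodoro" if "name techniques" in prompts["extract"] else "unknown"

def score(question, answer, reference):
    return 1.0 if answer == reference else 0.0

def critic(prompts, failures):
    return {"extract": "Also name techniques that are described implicitly.", "profile": ""}

def edit(prompt, suggestion):
    return f"{prompt} {suggestion}" if suggestion else prompt

start = {"extract": "Extract facts.", "profile": "Summarize the person."}
final, history = optimize_prompts(
    start, [("Which technique does Sam use?", "pomodoro")],
    run_pipeline, score, critic, edit,
)
```

In this toy run the first round fails, the critic's suggestion is appended to the extraction prompt, and the second round succeeds, so the loop stops early. Only the prompt strings change between rounds; no model parameters are touched.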

4. Experiments

In this section, we provide comprehensive verification of TriMem. First, we introduce several critical parts of experimental setups (in Section 4.1). Second, we provide performance comparison and compatibility experiments of Entity Profile with different previous methods (in Section 4.2). Third, we conduct extensive ablation studies to better understand our TriMem (in Section 4.3).

4.1 Experimental Setups

Baselines and Benchmarks. We compare our method with Naive RAG and several competitive agent memory systems, including Mem0, MemoryOS, A-Mem, LightMem, SimpleMem, and xMemory. For a fair comparison, we keep the original hyperparameter settings of the compared methods. The evaluation is conducted on two commonly used benchmarks, LoCoMo and PerLTQA. More details of each system are provided in the Appendix.

Implementation Details. We set the window size to 40 and the stride to 38. The Qwen3-embedding-0.6b model is used to encode the extracted facts. During retrieval, the maximum number of relevant entries is set to 25. The number of prompt optimization rounds is set to 4 to enable in-depth reasoning. We optimize the prompts on LoCoMo with the Qwen3-8B model and then directly apply the optimized prompts to the other models and the other benchmark. The failure analysis is conducted by Claude Opus 4.6. The prompts used in our experiments are provided in the Appendix.
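
To make the window settings concrete, the sliding-window segmentation of Section 3.2 can be sketched with a window size of 40 and a stride of 38, so consecutive windows share two turns. The function name `segment` and the 100-turn toy dialogue are assumptions for illustration, and the window count is taken to be $\lceil (T-l)/s \rceil + 1$, which is an assumed reading of the formula in Section 3.2.

```python
import math

def segment(turns, l, s):
    """Split turns into overlapping windows of size l with stride s."""
    T = len(turns)
    # Assumed window count: ceil((T - l) / s) + 1, with at least one window.
    n = max(1, math.ceil((T - l) / s) + 1)
    # Window i (1-indexed) covers turns (i-1)s+1 .. min(T, (i-1)s+l).
    return [turns[(i - 1) * s : min(T, (i - 1) * s + l)] for i in range(1, n + 1)]

turns = list(range(1, 101))          # a toy dialogue with 100 numbered turns
windows = segment(turns, l=40, s=38)
sizes = [len(w) for w in windows]    # the last window is truncated at turn 100
```

With 100 turns this yields three windows covering turns 1 to 40, 39 to 78, and 77 to 100. The two-turn overlap is what keeps an exchange that straddles a boundary visible in full to at least one extraction call.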

4.2 Main Results

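Table 1: Performance of TriMem on LoCoMo with high-capability models. TriMem achieves strong performance across different models while remaining efficient.

| Model | Method | MultiHop BLEU | MultiHop F1 | Temporal BLEU | Temporal F1 | OpenDomain BLEU | OpenDomain F1 | SingleHop BLEU | SingleHop F1 | Average BLEU | Average F1 | Token Cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4.1-mini | LoCoMo | 8.00 | 17.26 | 10.17 | 14.89 | 8.29 | 16.28 | 17.43 | 19.36 | 13.62 | 17.85 | 16863 |
| GPT-4.1-mini | Naïve RAG | 11.49 | 13.24 | 20.52 | 28.80 | 11.79 | 10.75 | 22.85 | 30.29 | 19.59 | 25.64 | 1119 |
| GPT-4.1-mini | Mem0 | 28.81 | 31.44 | 35.41 | 46.24 | 18.51 | 17.93 | 31.25 | 35.34 | 30.88 | 35.81 | 1153 |
| GPT-4.1-mini | MemoryOS | 16.46 | 24.02 | 34.78 | 46.52 | 14.89 | 19.58 | 36.18 | 43.92 | 30.95 | 39.30 | 936 |
| GPT-4.1-mini | A-Mem | 15.11 | 20.66 | 41.57 | 50.94 | 11.18 | 13.20 | 38.25 | 43.72 | 33.01 | 39.10 | 1276 |
| GPT-4.1-mini | LightMem | 32.93 | 40.33 | 47.53 | 55.23 | 18.31 | 21.91 | 37.68 | 48.39 | 37.66 | 46.69 | 695 |
| GPT-4.1-mini | SimpleMem | 32.40 | 39.33 | 43.69 | 58.01 | 19.56 | 24.50 | 43.41 | 53.99 | 39.97 | 50.30 | 587 |
| GPT-4.1-mini | Ours | 35.20 | 42.59 | 49.56 | 64.72 | 36.86 | 43.88 | 45.25 | 55.36 | 43.79 | 54.26 | 1217 |
| GPT-4o | LoCoMo | 19.64 | 19.20 | 9.50 | 13.95 | 11.87 | 16.60 | 13.81 | 16.12 | 13.86 | 16.26 | 16863 |
| GPT-4o | Naïve RAG | 14.36 | 15.35 | 11.48 | 16.17 | 9.03 | 9.09 | 26.67 | 35.03 | 20.15 | 25.88 | 1119 |
| GPT-4o | Mem0 | 25.52 | 32.36 | 32.48 | 42.70 | 14.50 | 18.50 | 30.02 | 39.84 | 28.74 | 37.74 | 1195 |
| GPT-4o | MemoryOS | 22.52 | 31.76 | 38.31 | 47.08 | 12.91 | 18.06 | 38.26 | 43.67 | 33.81 | 40.60 | 944 |
| GPT-4o | A-Mem | 20.90 | 26.12 | 35.39 | 48.64 | 10.74 | 12.33 | 37.11 | 42.08 | 32.14 | 38.67 | 1152 |
| GPT-4o | LightMem | 35.30 | 45.16 | 43.60 | 58.57 | 10.56 | 23.20 | 36.72 | 46.60 | 36.26 | 47.37 | 677 |
| GPT-4o | SimpleMem | 31.34 | 35.58 | 35.78 | 46.96 | 18.96 | 17.01 | 37.11 | 43.94 | 34.64 | 41.36 | 627 |
| GPT-4o | Ours | 40.36 | 46.00 | 51.39 | 60.41 | 39.27 | 50.15 | 40.61 | 47.78 | 42.73 | 50.23 | 1272 |
| GPT-5-nano | LoCoMo | 20.45 | 19.04 | 12.69 | 16.56 | 13.83 | 20.85 | 13.50 | 15.23 | 14.62 | 16.56 | 16863 |
| GPT-5-nano | Naïve RAG | 10.13 | 13.29 | 8.78 | 13.09 | 9.25 | 12.24 | 20.29 | 28.44 | 15.34 | 21.46 | 1119 |
| GPT-5-nano | Mem0 | 22.55 | 28.58 | 35.52 | 48.82 | 18.33 | 16.75 | 28.99 | 35.65 | 28.51 | 35.92 | 1074 |
| GPT-5-nano | MemoryOS | 10.74 | 23.50 | 32.50 | 39.71 | 10.02 | 20.30 | 34.28 | 40.34 | 28.09 | 35.88 | 952 |
| GPT-5-nano | A-Mem | 15.54 | 20.11 | 27.23 | 32.43 | 10.86 | 12.55 | 27.26 | 31.91 | 24.09 | 28.65 | 1175 |
| GPT-5-nano | LightMem | 28.63 | 38.21 | 39.72 | 55.51 | 18.79 | 22.74 | 31.19 | 42.01 | 31.73 | 42.93 | 723 |
| GPT-5-nano | SimpleMem | 25.42 | 33.28 | 32.15 | 45.75 | 20.77 | 24.31 | 39.65 | 46.71 | 34.30 | 42.65 | 655 |
| GPT-5-nano | Ours | 34.86 | 45.25 | 42.45 | 57.05 | 33.55 | 40.52 | 54.26 | 62.88 | 46.96 | 57.04 | 1256 |
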
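Table 2: Performance of TriMem on LoCoMo with efficient models. TriMem still achieves a better performance-efficiency balance on representative efficient models.

| Model | Method | MultiHop BLEU | MultiHop F1 | Temporal BLEU | Temporal F1 | OpenDomain BLEU | OpenDomain F1 | SingleHop BLEU | SingleHop F1 | Average BLEU | Average F1 | Token Cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-8B | LoCoMo | 12.67 | 20.54 | 12.32 | 18.55 | 10.59 | 14.39 | 19.76 | 23.78 | 16.34 | 21.51 | 16863 |
| Qwen3-8B | Naïve RAG | 13.75 | 15.83 | 8.86 | 13.53 | 9.68 | 10.77 | 21.57 | 28.76 | 16.75 | 22.10 | 1119 |
| Qwen3-8B | Mem0 | 28.32 | 30.07 | 23.15 | 26.15 | 11.79 | 15.15 | 30.75 | 34.97 | 27.54 | 30.10 | 1140 |
| Qwen3-8B | MemoryOS | 14.38 | 22.72 | 18.67 | 22.79 | 11.06 | 13.52 | 25.65 | 33.52 | 21.22 | 28.06 | 911 |
| Qwen3-8B | A-Mem | 16.02 | 21.08 | 28.10 | 37.51 | 14.01 | 14.19 | 33.60 | 40.77 | 28.01 | 34.83 | 1180 |
| Qwen3-8B | LightMem | 22.84 | 32.54 | 37.62 | 48.37 | 18.05 | 19.02 | 23.03 | 31.37 | 25.73 | 34.36 | 740 |
| Qwen3-8B | SimpleMem | 23.39 | 30.39 | 24.66 | 34.51 | 14.04 | 15.39 | 35.73 | 41.26 | 29.81 | 36.25 | 608 |
| Qwen3-8B | xMemory | 28.44 | 39.13 | 28.65 | 35.41 | 17.76 | 21.57 | 40.66 | 50.57 | 34.49 | 43.51 | 2230 |
| Qwen3-8B | Ours | 33.09 | 41.22 | 38.71 | 53.13 | 30.59 | 37.64 | 45.10 | 52.52 | 40.66 | 49.65 | 1339 |
| Llama-3.1-8B-Ins | LoCoMo | 13.73 | 23.36 | 13.15 | 20.30 | 11.54 | 19.42 | 18.64 | 25.86 | 16.15 | 23.84 | 16863 |
| Llama-3.1-8B-Ins | Naïve RAG | 14.31 | 16.18 | 7.63 | 12.28 | 9.09 | 11.58 | 22.99 | 31.02 | 17.33 | 23.18 | 1119 |
| Llama-3.1-8B-Ins | Mem0 | 13.27 | 16.40 | 8.26 | 12.62 | 7.45 | 8.45 | 21.75 | 31.28 | 16.49 | 23.24 | 1085 |
| Llama-3.1-8B-Ins | MemoryOS | 13.57 | 22.63 | 19.18 | 23.31 | 10.59 | 13.01 | 23.46 | 31.05 | 19.95 | 26.77 | 964 |
| Llama-3.1-8B-Ins | A-Mem | 15.80 | 22.84 | 23.79 | 36.19 | 11.19 | 12.51 | 31.19 | 37.86 | 25.58 | 33.18 | 1340 |
| Llama-3.1-8B-Ins | LightMem | 13.19 | 19.64 | 16.93 | 28.06 | 17.39 | 20.68 | 27.62 | 41.06 | 22.11 | 33.16 | 758 |
| Llama-3.1-8B-Ins | SimpleMem | 18.81 | 26.22 | 21.15 | 30.44 | 15.81 | 18.77 | 26.79 | 31.23 | 23.47 | 29.37 | 674 |
| Llama-3.1-8B-Ins | xMemory | 21.89 | 31.24 | 21.78 | 26.84 | 12.37 | 16.62 | 27.75 | 41.36 | 24.47 | 34.94 | 2375 |
| Llama-3.1-8B-Ins | Ours | 25.42 | 34.56 | 25.98 | 32.36 | 28.40 | 32.71 | 35.76 | 43.20 | 31.37 | 38.70 | 1388 |
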
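Table 3: Performance comparison on the PerLTQA benchmark. All values are LLM-judged correctness scores (%), where gpt-4.1-mini judges whether the predicted answer matches the ground-truth answer. TriMem achieves the best performance across all sub-tasks.

| Method | Qwen3-8B Profile | Qwen3-8B Social Rel. | Qwen3-8B Events | Qwen3-8B Dialogues | Llama-3.1-8B-Ins Profile | Llama-3.1-8B-Ins Social Rel. | Llama-3.1-8B-Ins Events | Llama-3.1-8B-Ins Dialogues |
|---|---|---|---|---|---|---|---|---|
| Full-Context | 65.80 | 56.72 | 52.75 | 18.51 | 52.46 | 54.58 | 47.54 | 17.27 |
| Mem0 | 89.56 | 76.46 | 66.48 | 27.59 | 73.04 | 72.29 | 57.14 | 26.31 |
| LightMem | 64.93 | 78.00 | 73.03 | 47.01 | 53.85 | 74.08 | 69.21 | 44.72 |
| SimpleMem | 88.12 | 82.40 | 79.87 | 42.09 | 84.64 | 76.46 | 70.39 | 37.90 |
| TriMem | 92.46 | 83.23 | 85.72 | 55.79 | 92.17 | 82.28 | 78.17 | 45.01 |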

Performance on High-Capability Models. In Table 1, we compare representative agent memory systems to validate the effectiveness of TriMem. The results demonstrate that our method consistently outperforms prior approaches when integrated with various high-capability models, including GPT-4o, GPT-4.1-mini, and GPT-5-nano. Meanwhile, to illustrate the information density of TriMem, we report the average token consumption of the retrieved contexts. Our system consumes only around 1.2k tokens. Although retrieving raw dialogues and entity profiles slightly increases the token overhead, it brings substantial performance gains.

Compatibility with Efficient Models. To evaluate the capability of TriMem to support small-parameter efficient models, we conduct experiments on lightweight models including Qwen3-8B and Llama-3.1-8B-Instruct. As shown in Table 2, our system still achieves substantial performance improvements. Unlike xMemory, which only supports limited open-source models due to its requirement for model output logits, TriMem is compatible with models of various parameter sizes. This verifies the extensive superiority and broad applicability of our method.

Generalization on Different Datasets. PerLTQA is also a widely adopted benchmark for long-term agent QA, consisting of multi-dimensional evaluations covering personal profiles, social relationship, historical events and dialogue memories. To comprehensively validate the effectiveness of our proposed method, we additionally report experimental results on this dataset (see Table 3), which demonstrates the strong generalization ability of TriMem.

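Figure 4: Ablation of the profile and raw dialogue modules. Combining entity profiles with raw dialogues yields the best results, which supports the design.
Figure 5: Impact of varying the evolution step. A moderate number of update steps brings an overall improvement, whereas further updates cause over-refinement and hurt performance.
Figure 6: Performance with different retrieval numbers. Performance peaks when 25 entries are retrieved; retrieving too few or too many entries degrades it.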

4.3 Ablation and Further Analysis

Ablation of Profile and Raw Dialogue. To verify the necessity of entity profiles and raw dialogues for reasoning reliability and storage fidelity, we conduct ablation experiments on LoCoMo by removing these two contextual components. The results are presented in Figure 4: the agent suffers a noticeable performance drop when either component is excluded. This demonstrates that incorporating entity profiles and raw dialogues is a reasonable design that effectively boosts the overall capability of agent memory systems. More detailed results can be found in the Appendix.

画像和原始对话消融。 为了验证实体画像和原始对话对于推理可靠性与存储保真度的必要性,作者在 LoCoMo 上移除这两个上下文组件进行消融实验。 结果见 图4,可以看到任一组件被排除时,智能体都会出现明显性能下降。 这说明引入实体画像和原始对话是一种合理设计,能够有效提升智能体记忆系统的整体能力。 更详细结果见附录。

Impact of Varying Evolution Step. We evolve the prompts iteratively over multiple steps. To evaluate how the number of evolution steps affects overall system performance, we conduct an ablation study that increases the number of update steps from 1 to 5. As shown in Figure 5, performance gradually improves as the number of evolution steps increases and peaks at 4 steps. Further updates over-refine the prompt's granularity, which may exceed the model's capability boundary and degrade performance. Therefore, we set the number of evolution steps to 4. A visualization of the update process can be found in the Appendix.

不同演化步数的影响。 作者对提示进行多步迭代演化。 为了评估不同演化步数对整体系统性能的影响,作者将更新步数从 1 增加到 5 进行消融研究。 图5 所示,性能随着演化步数增加逐渐提升,并在步数设为 4 时达到最优水平。 进一步更新会导致提示粒度过度细化,可能超过模型能力边界并造成性能下降。 因此,作者将演化步数设为 4。 更新过程可视化见附录。
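The step-selection procedure described above can be sketched as a loop that refines a prompt a fixed number of times and keeps the best-scoring version. This is a minimal sketch, not the authors' implementation: `evolve_prompt`, `refine`, and `score` are hypothetical names, and in a real system `refine` would be a TextGrad-style update driven by response quality feedback while `score` would be a held-out QA metric.

```python
from typing import Callable


def evolve_prompt(
    prompt: str,
    refine: Callable[[str], str],   # hypothetical: one prompt-update step
    score: Callable[[str], float],  # hypothetical: validation quality of a prompt
    max_steps: int = 5,
) -> tuple[str, int, list[float]]:
    """Run up to max_steps refinements and keep the best-scoring prompt."""
    best_prompt, best_step = prompt, 0
    best_score = score(prompt)
    history = [best_score]          # history[i] is the score after i steps
    current = prompt
    for step in range(1, max_steps + 1):
        current = refine(current)
        step_score = score(current)
        history.append(step_score)
        if step_score > best_score:
            best_prompt, best_step, best_score = current, step, step_score
    return best_prompt, best_step, history


# Toy stand-ins (assumptions, not measured data): each step appends a marker,
# and the score is constructed to peak after four refinements.
def toy_refine(p: str) -> str:
    return p + " +detail"


def toy_score(p: str) -> float:
    return -float((p.count("+detail") - 4) ** 2)


best, step, scores = evolve_prompt("Extract atomic facts.", toy_refine, toy_score)
```

With these toy functions the loop selects the prompt from step 4 even though it runs all five steps, which mirrors the shape of the reported trend without reproducing any measured numbers.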

Performance with Different Retrieval Numbers. Figure 6 presents the performance of our method when retrieving different numbers of memory entries. When too few memory entries are retrieved, key information may be missed, resulting in suboptimal performance. When too many are retrieved, irrelevant or redundant information is likely to be introduced, which interferes with model reasoning and degrades performance. Overall, setting the number of retrieved memory entries to 25 is a reasonable choice. A visualization of the retrieved memory entries can be found in the Appendix.

不同检索数量下的性能。图6 中,作者展示了检索不同数量记忆条目时的方法表现。 可以观察到,当检索记忆条目数量过低时,关键信息可能无法被有效检索,导致次优表现。 当检索条目数量过高时,可能引入无关冗余信息,干扰模型推理并导致性能下降。 总体而言,将检索记忆条目数量设为 25 是一个合理选择。 检索记忆条目的可视化见附录。
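To make the role of the retrieval number concrete, the following sketch ranks memory entries by cosine similarity to a query vector and keeps the top k. It assumes embeddings are already available as plain lists of floats; `cosine` and `retrieve_top_k` are illustrative names, and the paper does not specify that retrieval is implemented this way.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(
    query_vec: list[float],
    entries: list[tuple[str, list[float]]],  # (memory text, embedding)
    k: int = 25,                             # default follows the setting reported above
) -> list[str]:
    """Return the texts of the k entries most similar to the query."""
    ranked = sorted(entries, key=lambda e: cosine(query_vec, e[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy two-dimensional embeddings, for illustration only.
memory = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
top2 = retrieve_top_k([1.0, 0.0], memory, k=2)
```

A small k risks cutting off the entry that holds the answer, while a large k passes low-similarity entries such as "b" into the prompt, which is the trade-off the paragraph describes.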

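Figure 7: Necessity of search query and efficiency analysis. Introducing the search query increases retrieval time but significantly improves system performance. Smaller windows increase memory construction time, so we set the window size to 40 to maintain efficiency and keep construction time competitive with existing methods.
Figure 8: Ablation of window size. The memory system also achieves good performance with smaller windows; for efficiency, we finally set the window size to 40.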

Necessity of Search Query. In the retrieval phase, instead of retrieving directly with the original question, we prompt the agent to first analyze the information and keywords the question requires, and then perform retrieval based on that analysis. To verify the necessity of this module, we compare system performance with and without the search query in Figure 7. Although generating the search query increases retrieval time, performance improves significantly once it is added, indicating that the search query helps the model retrieve more accurately and confirming the necessity of this module. We show search queries obtained for different questions in the Appendix.

搜索查询的必要性。 在检索阶段,系统不是直接依赖原始问题进行检索,而是提示智能体先分析问题所需信息和关键词,再基于分析结果执行检索。 为了验证该模块的必要性,作者在 图7 中比较有无搜索查询时的系统表现。 实验结果表明,虽然搜索查询的生成过程增加了检索时间,但加入搜索查询后性能显著提升,说明搜索查询帮助模型实现更准确的检索,从而确认该模块的必要性。 不同问题获得的若干搜索查询见附录。
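The two-stage retrieval described above (analyze the question first, then retrieve with the analysis result) can be sketched as follows. Everything here is an assumption for illustration: `stub_analyze` is a stopword filter standing in for the LLM analysis call, and `keyword_retrieve` is a simple keyword-overlap retriever, not the retriever used in the paper.

```python
from typing import Callable

# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"what", "is", "the", "of", "that", "a", "an", "to", "in", "did", "does"}


def stub_analyze(question: str) -> str:
    """Stand-in for the LLM analysis step: keep only content words."""
    words = [w.strip("?.,!").lower() for w in question.split()]
    return " ".join(w for w in words if w and w not in STOPWORDS)


def keyword_retrieve(query: str, memory: list[str], k: int) -> list[str]:
    """Rank memory entries by the number of query keywords they contain."""
    q = set(query.lower().split())
    scored = [(len(q & set(m.lower().replace(".", "").split())), m) for m in memory]
    scored = [(s, m) for s, m in scored if s > 0]   # drop entries with no overlap
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:k]]


def retrieve_with_search_query(
    question: str,
    memory: list[str],
    analyze: Callable[[str], str] = stub_analyze,
    k: int = 25,
) -> tuple[str, list[str]]:
    """Generate a search query from the question, then retrieve with it."""
    search_query = analyze(question)
    return search_query, keyword_retrieve(search_query, memory, k)


memory = ["Alice adopted a beagle named Max.", "Bob moved to Berlin in 2021."]
query, hits = retrieve_with_search_query(
    "What is the name of the dog that Alice adopted?", memory
)
```

The extra `analyze` call is where the added retrieval time comes from; in exchange, the retriever sees a query focused on the information the question actually needs.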

Comparison across Window Sizes. Figure 8 compares system performance under different window sizes. Although the model also achieves good performance with smaller windows, the number of calls to the agent increases significantly. As shown on the right side of Figure 7, inference time is not affected, but memory construction time increases substantially. Therefore, to balance system performance and efficiency, we finally set the window size to 40. Examples of information extraction under different window sizes can be found in the Appendix.

不同窗口大小比较。 为了比较不同窗口大小下的系统性能,作者在 图8 中展示了多种窗口设置的性能比较。 可以观察到,尽管模型在较小窗口下也能取得良好性能,但与智能体的通信次数显著增加。 图7 右侧所示,虽然推理时间不受影响,但记忆构建时间大幅延长。 因此,为了平衡系统性能和效率,作者最终将窗口大小设为 40。 不同窗口大小下的信息抽取示例见附录。
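The efficiency argument can be illustrated with a short sketch that splits a dialogue into windows and counts extraction calls. It assumes non-overlapping windows and one agent call per window, neither of which the text states explicitly; `split_into_windows` and `extraction_calls` are hypothetical helper names.

```python
import math


def split_into_windows(turns: list[str], window_size: int = 40) -> list[list[str]]:
    """Split dialogue turns into consecutive, non-overlapping windows."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return [turns[i:i + window_size] for i in range(0, len(turns), window_size)]


def extraction_calls(num_turns: int, window_size: int) -> int:
    """Number of agent calls if each window triggers exactly one call."""
    return math.ceil(num_turns / window_size)


# Hypothetical dialogue of 100 turns.
turns = [f"turn {i}" for i in range(100)]
windows = split_into_windows(turns, window_size=40)
calls_small = extraction_calls(len(turns), 10)   # smaller window, more calls
calls_large = extraction_calls(len(turns), 40)   # window size used in the paper
```

Under these assumptions, a 100-turn dialogue needs 10 calls with a window of 10 but only 3 with a window of 40, which is consistent with the longer construction time reported for smaller windows.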

5. Related Work

Memory Systems for LLM Agents. Long-term memory has emerged as a key capability for LLM agents engaged in multi-session interactions, with most existing systems organized around three functional stages: memory construction, memory retrieval, and memory-supported reasoning. For memory construction, Mem0, A-Mem, and MemoryOS prompt LLMs to extract atomic facts from dialogues and consolidate them via dynamic updates or hierarchical stores. For memory retrieval, methods from Naive RAG to xMemory match queries with fact embeddings through similarity search. For memory-supported reasoning, retrieved facts are concatenated into the prompt as contextual supplements. Despite their differences, these systems share a common design choice: extracted facts serve as the atomic unit across all three stages, i.e., what is stored, what is matched, and what is injected into the prompt. This fact-centric pipeline risks losing fidelity to the original dialogue and limits the agent's ability to handle queries that require holistic understanding rather than discrete fact lookup. In contrast, our work revisits the role of each stage, establishing an index between extracted facts and source dialogues to ensure storage fidelity and constructing entity profiles from these facts to support comprehension-oriented reasoning that goes beyond fact recall. This perspective also clarifies why the three stages should not be designed in isolation: the granularity of what is stored constrains what can be retrieved, and the granularity of what is retrieved in turn dictates the kind of reasoning the agent can perform downstream. Treating the three stages as a single pipeline with consistent yet multi-level representations, rather than three independently optimized modules over the same atomic unit, is therefore central to our design.

LLM 智能体的记忆系统。 长期记忆已经成为参与多会话交互的 LLM 智能体的一项关键能力,大多数现有系统围绕三个功能阶段组织:记忆构建、记忆检索和记忆支持推理。 在记忆构建方面,Mem0、A-Mem 和 MemoryOS 会提示 LLM 从对话中抽取原子事实,并通过动态更新或层次化存储进行整合。 在记忆检索方面,从 Naive RAG 到 xMemory 的方法通过相似度搜索将查询与事实嵌入匹配。 在记忆支持推理方面,检索事实会作为上下文补充拼接进提示。 尽管这些系统存在差异,它们共享一个共同设计选择:抽取事实作为三个阶段的原子单位,即被存储的内容、被匹配的内容以及被注入提示的内容。 这种以事实为中心的流程存在丧失原始对话保真度的风险,并限制智能体处理那些需要整体理解而非离散事实查找的问题。 相比之下,TriMem 重新审视每个阶段的作用,在抽取事实和来源对话之间建立索引以保证存储保真度,并基于这些事实构建实体画像,以支持超越事实回忆的理解导向推理。 这一视角也说明为什么三个阶段不应孤立设计:被存储内容的粒度会限制可检索内容,而被检索内容的粒度又会决定智能体下游能够执行何种推理。 因此,将三个阶段视为一个具有一致但多层表示的单一流程,而不是围绕同一原子单位独立优化的三个模块,是 TriMem 设计的核心。
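The relationship between the three representation levels can be sketched as a small data structure in which every fact carries the identifiers of the raw turns it came from and profiles aggregate facts per entity. This is a minimal sketch under stated assumptions: the class and method names (`Fact`, `TriGranularStore`, `context_for`) are hypothetical, the `synthesize` callable stands in for an LLM profiling call, and the paper's actual schema may differ.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Fact:
    text: str
    entity: str
    source_ids: list[str]  # identifiers of the raw turns this fact was extracted from


@dataclass
class TriGranularStore:
    raw_turns: dict[str, str] = field(default_factory=dict)  # source id -> verbatim turn
    facts: list[Fact] = field(default_factory=list)
    profiles: dict[str, str] = field(default_factory=dict)   # entity -> synthesized profile

    def add_turn(self, source_id: str, text: str) -> None:
        self.raw_turns[source_id] = text

    def add_fact(self, fact: Fact) -> None:
        # Enforce the fact-to-source index: every fact must point at stored turns.
        missing = [s for s in fact.source_ids if s not in self.raw_turns]
        if missing:
            raise KeyError(f"unknown source ids: {missing}")
        self.facts.append(fact)

    def build_profile(self, entity: str, synthesize: Callable[[list[str]], str]) -> str:
        # Aggregate all facts about one entity into a single profile string.
        texts = [f.text for f in self.facts if f.entity == entity]
        self.profiles[entity] = synthesize(texts)
        return self.profiles[entity]

    def context_for(self, fact: Fact, use_profile: bool = True, use_raw: bool = True) -> list[str]:
        # Expand a retrieved fact with its source turns and its entity profile.
        parts = [fact.text]
        if use_raw:
            parts += [self.raw_turns[s] for s in fact.source_ids]
        if use_profile and fact.entity in self.profiles:
            parts.append(self.profiles[fact.entity])
        return parts


store = TriGranularStore()
store.add_turn("D1:3", "Alice: I adopted a beagle named Max last spring.")
fact = Fact("Alice owns a beagle named Max.", "Alice", ["D1:3"])
store.add_fact(fact)
store.build_profile("Alice", lambda texts: " ".join(texts))
full_context = store.context_for(fact)
```

The `use_profile` and `use_raw` flags correspond to the kind of component removal studied in the ablation: turning both off reduces the context to the bare fact, which is the fact-centric setting the paragraph argues against.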

Lifelong Evolution Agents. Lifelong evolution has emerged as a key capability for LLM agents that need to improve from accumulated experience over long horizons. One prominent line formulates memory management as a reinforcement learning problem: MemAgent reshapes long-context LLMs into multi-conversation memory agents, MemBuilder reinforces memory construction with attributed dense rewards, AgentFold learns proactive context management for long-horizon web agents, MemGen weaves generative latent memories into self-evolving agents, and MEM1 jointly synergizes memory and reasoning for efficient execution. While effective, these methods all require RL-based parameter updates that disrupt the original pretrained weights, incur substantial training cost, and are impractical when only API-accessible models are available. Motivated by these limitations, a complementary line pursues lifelong evolution without altering model parameters: Voyager maintains a skill library of executable code, ExpeL distills cross-task experience into natural-language rules, and MemSkill learns task-agnostic skill memories that transfer to unseen tasks. In line with this paradigm, our work applies TextGrad to evolve the fact extraction and entity profile construction prompts, achieving lifelong adaptability without tuning parameters. Compared with skill- or rule-based externalizations that target the agent's action space, prompt-level evolution directly reshapes how raw experience is parsed into memory. This makes it particularly suited to memory systems whose behavior is largely determined by the extraction and synthesis prompts rather than by a fixed policy network.

终身演化智能体。 终身演化已经成为 LLM 智能体的一项关键能力,这类智能体需要从长期积累经验中持续改进。 一条重要路线将记忆管理表述为强化学习问题:MemAgent 将长上下文 LLM 重塑为多对话记忆智能体,MemBuilder 用归因密集奖励强化记忆构建,AgentFold 学习长程网页智能体的主动上下文管理,MemGen 将生成式潜在记忆编织进自演化智能体,而 MEM1 联合协同记忆和推理以实现高效执行。 这些方法虽然有效,但都需要基于 RL 的参数更新,这会扰动原始预训练权重,带来大量训练成本,并且在只能访问 API 的模型场景中并不现实。 受这些限制启发,另一条互补路线追求在不改变模型参数的情况下实现终身演化:Voyager 维护可执行代码技能库,ExpeL 将跨任务经验蒸馏为自然语言规则,MemSkill 学习可迁移到未见任务的任务无关技能记忆。 与这一范式一致,TriMem 应用 TextGrad 来演化事实抽取和实体画像构建提示,在不调参的情况下实现终身适应。 与面向智能体动作空间的技能或规则外化相比,提示级演化直接重塑原始经验被解析为记忆的方式,因此特别适合那些行为主要由抽取和合成提示,而非固定策略网络决定的记忆系统。

6. Conclusion

In this work, we revisit the design of memory systems for LLM agents and identify three concrete limitations of the prevailing extracted-fact-based paradigm: lossy storage, shallow reasoning, and suboptimal prompts under heterogeneous dialogue styles. Motivated by these findings, we propose TriMem, which maintains three coexisting representation granularities: verbatim dialogues for storage fidelity, atomic facts for retrieval efficiency, and progressively synthesized profiles for deep reasoning. TextGrad-based prompt optimization is further employed to evolve the extraction and profile construction prompts from accumulated experience, enabling lifelong adaptation without modifying the underlying model parameters and thus remaining applicable to API-only LLMs. Extensive experiments on LoCoMo and PerLTQA across various LLMs confirm the effectiveness of our method. We hope TriMem can bring new insights to future research on agent memory systems.

本文重新审视 LLM 智能体记忆系统的设计,并识别出当前基于抽取事实范式的三个具体限制:有损存储、浅层推理,以及异质对话风格下的次优提示。 受这些发现启发,作者提出 TriMem,它维护三种共存的表示粒度,包括用于存储保真度的逐字对话、用于检索效率的原子事实,以及用于深度推理的渐进合成画像。 系统进一步采用基于 TextGrad 的提示优化,从累积经验中演化抽取和画像构建提示,在不修改底层模型参数的情况下实现终身适应,因此也适用于仅 API 可访问的 LLM。 在 LoCoMo 和 PerLTQA 上跨多种 LLM 进行的大量实验确认了方法有效性。 作者希望 TriMem 能为未来智能体记忆系统研究带来新的启发。