Skip to content

AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents

MemoryAgentUser-CentricTsinghua

Yan S, Ni J, Zheng L, et al. AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents[J]. arXiv preprint arXiv:2603.16496v2, 2026.


AdaMem:面向长程对话智能体的自适应用户中心记忆

Abstract

Large language model (LLM) agents increasingly rely on external memory to support long-horizon interaction, personalized assistance, and multi-step reasoning. However, existing memory systems still face three core challenges: they often rely too heavily on semantic similarity, which can miss evidence crucial for user-centric understanding; they frequently store related experiences as isolated fragments, weakening temporal and causal coherence; and they typically use static memory granularities that do not adapt well to the requirements of different questions. We propose AdaMem, an adaptive user-centric memory framework for long-horizon dialogue agents. AdaMem organizes dialogue history into working, episodic, persona, and graph memories, enabling the system to preserve recent context, structured long-term experiences, stable user traits, and relation-aware connections within a unified framework. At inference time, AdaMem first resolves the target participant, then builds a question-conditioned retrieval route that combines semantic retrieval with relation-aware graph expansion only when needed, and finally produces the answer through a role-specialized pipeline for evidence synthesis and response generation. We evaluate AdaMem on the LoCoMo and PERSONAMEM benchmarks for long-horizon reasoning and user modeling. Experimental results show that AdaMem achieves state-of-the-art performance on both benchmarks. The code will be released upon acceptance.

大语言模型(LLM)智能体越来越依赖外部记忆来支持长程交互、个性化辅助和多步推理。 然而,现有记忆系统仍面临三项核心挑战:它们往往过度依赖语义相似度,可能错过对用户中心理解至关重要的证据;它们经常把相关经历存成孤立片段,削弱时间和因果连贯性;并且它们通常使用静态记忆粒度,难以适应不同问题的需求。 我们提出 AdaMem,这是一个面向长程对话智能体的自适应用户中心记忆框架。 AdaMem 将对话历史组织为工作记忆、情景记忆、人格记忆和图记忆,使系统能够在统一框架内保留近期上下文、结构化长期经历、稳定用户特征以及关系感知连接。 在推理时,AdaMem 首先解析目标参与者,然后构建问题条件化检索路径,该路径只在需要时结合语义检索和关系感知图扩展,最后通过角色专门化流水线进行证据综合和回答生成。 我们在 LoCoMo 和 PERSONAMEM 基准上评估 AdaMem,以考察长程推理和用户建模能力。 实验结果表明,AdaMem 在两个基准上都达到最先进性能。 代码将在接收后发布。

1. Introduction

Recent advances in large language model (LLM)-based agents have enabled increasingly capable systems for open-ended dialogue, multi-step reasoning, and interactive assistance. Yet these settings are inherently long-horizon: an agent must continually accumulate information across many turns, preserve salient details as user goals evolve, and recover the right evidence when it becomes relevant again. This makes memory a central requirement rather than a peripheral add-on. A useful memory system should not only store past interactions, but also organize them in a form that remains queryable, coherent, and robust as the conversation grows longer and more diverse. Otherwise, memory can become redundant, fragmented, or misaligned with the needs of downstream reasoning, leading to inconsistent behavior and poorly grounded responses.

近期基于大语言模型(LLM)的智能体取得进展,使面向开放式对话、多步推理和交互式辅助的系统能力越来越强。 然而,这些设置本质上都是长程的:智能体必须在许多轮次中持续积累信息,随着用户目标演化保留重要细节,并在证据再次变得相关时恢复正确证据。 这使记忆成为核心需求,而不是外围附加组件。 有用的记忆系统不仅应存储过去交互,还应把它们组织成随着对话变得更长、更复杂仍然可查询、连贯且稳健的形式。 否则,记忆可能变得冗余、碎片化,或与下游推理需求错位,从而导致不一致行为和缺乏根据的回答。

AdaMem teaser
图1:AdaMem 与先前方法的比较。传统方法依赖固定长度块或粗粒度摘要配合语义检索,而 AdaMem 强调用户中心的自适应结构化记忆和多智能体协作检索。

Many recent agent frameworks therefore augment LLMs with explicit external memory modules that support incremental writing, updating, and retrieval throughout interaction. Despite this progress, current approaches still face three limitations.

因此,许多近期智能体框架为 LLM 增加显式外部记忆模块,以便在整个交互过程中支持增量写入、更新和检索。 尽管有这些进展,当前方法仍面临三项限制。

Limitation 1: Memory systems that rely primarily on semantic retrieval may overlook evidence that is not lexically or semantically similar to the query, but is still crucial for user-centric understanding, such as stable preferences, personal attributes, or broader behavioral patterns.

限制 1: 主要依赖语义检索的记忆系统可能忽略那些在词汇或语义上与查询不相似、但对用户中心理解仍然至关重要的证据,例如稳定偏好、个人属性或更广泛的行为模式。

Limitation 2: When related experiences are stored as isolated fragments, their temporal and causal coherence can be weakened, making it difficult to reconstruct how events unfolded and how different pieces of evidence should be connected during reasoning.

限制 2: 当相关经历被存储为孤立片段时,它们的时间和因果连贯性可能被削弱,使推理时难以重建事件如何展开,以及不同证据片段应如何连接。

Limitation 3: Different questions require different memory structures and retrieval strategies. As illustrated in Figure 1, many systems construct memory entries using either fixed-length text chunks. Such static segmentation is often a poor fit for long-horizon reasoning: overly coarse memories may introduce substantial irrelevant context, while overly fine-grained fragments can obscure dependencies across events and topics.

限制 3: 不同问题需要不同的记忆结构和检索策略。 图1所示,许多系统使用固定长度文本块构建记忆条目。 这种静态切分通常并不适合长程推理:过粗的记忆可能引入大量无关上下文,而过细的片段可能掩盖事件和主题之间的依赖关系。

These limitations suggest that effective long-horizon memory should be both structured and adaptive: it should preserve information at multiple levels of abstraction while dynamically selecting retrieval routes that match each question. Motivated by this observation, we propose AdaMem, an adaptive user-centric memory framework for long-horizon dialogue agents. AdaMem maintains participant-specific working, episodic, persona, and graph memories, organizing evidence around the user and the assistant in a unified yet target-aware manner. Unlike prior systems that rely on a monolithic controller for memory writing, retrieval, verification, and response generation, AdaMem uses a role-specialized agent pipeline. The Memory Agent maintains structured memories, the Research Agent performs question-conditioned retrieval and evidence integration with reflection, and the Working Agent turns the collected evidence into a concise answer. This decomposition reduces interference between memory maintenance and answer-time reasoning, enabling finer control over retrieval, verification, and memory evolution. Together, these design choices recover user information beyond pure semantic similarity, preserve links across related events, and adapt retrieval granularity to the demands of the question.

这些限制表明,有效的长程记忆应同时是结构化自适应的:它应在多个抽象层级保存信息,同时动态选择与每个问题匹配的检索路径。 受这一观察启发,我们提出 AdaMem,这是一个面向长程对话智能体的自适应用户中心记忆框架。 AdaMem 维护特定于参与者的工作记忆、情景记忆、人格记忆和图记忆,以统一但目标感知的方式围绕用户和助手组织证据。 不同于依赖单体控制器进行记忆写入、检索、验证和回答生成的先前系统,AdaMem 使用角色专门化智能体流水线。 Memory Agent 维护结构化记忆,Research Agent 执行问题条件化检索并通过反思整合证据,Working Agent 将收集到的证据转化为简洁回答。 这种分解减少了记忆维护和回答时推理之间的干扰,使检索、验证和记忆演化能够被更细粒度地控制。 这些设计选择共同恢复超越纯语义相似度的用户信息,保留相关事件之间的链接,并让检索粒度适应问题需求。

In summary, our contributions are:

  • We introduce AdaMem, an adaptive user-centric memory framework for long-horizon dialogue agents that organizes dialogue history into complementary working, episodic, persona, and graph-based memory structures.
  • We propose a question-conditioned retrieval and response pipeline that resolves target participants, invokes relation-aware graph expansion only when needed, and uses specialized agents for evidence synthesis and answer generation.
  • We validate our approach on the LoCoMo and PERSONAMEM benchmarks, achieving state-of-the-art performance and demonstrating its strong effectiveness.

总之,我们的贡献如下:

  • 我们提出 AdaMem,这是一个面向长程对话智能体的自适应用户中心记忆框架,它把对话历史组织为互补的工作记忆、情景记忆、人格记忆和基于图的记忆结构。
  • 我们提出一个问题条件化检索与回答流水线,它解析目标参与者,只在需要时调用关系感知图扩展,并使用专门化智能体进行证据综合和答案生成。
  • 我们在 LoCoMo 和 PERSONAMEM 基准上验证我们的方法,达到最先进性能并展示其强有效性。

2.1 Agentic Memory

Recent research on memory systems for large language model agents has evolved from simple context extension toward more structured and adaptive management. Early approaches typically process long contexts by partitioning them into smaller chunks. Subsequent work introduces more advanced memory mechanisms. For instance, MemGPT manages long-term memory through paging and segmentation. Later studies explore more modular and system-level designs; Mem0 abstracts memory as an independent layer for long-term management. Meanwhile, several works investigate structured memory representations to improve organization, including graph-based memories such as A-Mem and semantic memory structures built upon events or temporal knowledge graphs, such as Zep. Despite these advances, many existing approaches still rely on static retrieval strategies or largely unstructured storage, which may lead to information fragmentation and limited coordination across different abstraction levels and task categories. Designing a unified and adaptive memory system that robustly supports long-term interactions therefore remains an important challenge.

近期关于大语言模型智能体记忆系统的研究,已经从简单上下文扩展演进到更结构化、更自适应的管理。 早期方法通常通过把长上下文划分为更小块来处理长上下文。 后续工作引入更高级的记忆机制。 例如,MemGPT 通过分页和分段管理长期记忆。 后来的研究探索更模块化和系统层面的设计;Mem0 将记忆抽象为用于长期管理的独立层。 与此同时,一些工作研究结构化记忆表示以改进组织方式,包括 A-Mem 等基于图的记忆,以及 Zep 等建立在事件或时间知识图谱之上的语义记忆结构。 尽管有这些进展,许多现有方法仍依赖静态检索策略或基本非结构化存储,这可能导致信息碎片化,以及不同抽象层级和任务类别之间的协调受限。 因此,设计一个能稳健支持长期交互的统一且自适应的记忆系统,仍然是一项重要挑战。

2.2 Multi-Agent LLMs

Recent studies have shown that multi-agent LLMs can effectively address complex tasks by enabling role specialization, collaborative problem solving, and interactive decision making. At the same time, an increasing number of works on agentic memory aim to enhance the capability of large language model agents to model and retain long-term information. Although these efforts provide useful perspectives on coordination and task decomposition, they typically do not focus on how long-horizon dialogue evidence should be organized and adaptively retrieved in a user-centric manner. A recent study, MIRIX, takes an initial step in this direction by introducing specialized agents for memory organization. However, it does not provide explicit mechanisms to ensure the consistency of long-term memory. Motivated by these observations, our work draws on role specialization from the multi-agent literature, but focuses primarily on user-centric memory construction and question-conditioned retrieval for long-horizon dialogue agents.

近期研究表明,多智能体 LLM 可以通过角色专门化、协作式问题求解和交互式决策来有效处理复杂任务。 与此同时,越来越多关于智能体记忆的工作旨在增强大语言模型智能体建模和保留长期信息的能力。 虽然这些努力为协调和任务分解提供了有用视角,但它们通常并不聚焦于长程对话证据应如何以用户中心方式组织并自适应检索。 近期研究 MIRIX 通过引入用于记忆组织的专门化智能体,在这一方向上迈出了初步一步。 然而,它没有提供显式机制来确保长期记忆的一致性。 受这些观察启发,我们的工作借鉴多智能体文献中的角色专门化,但主要聚焦于长程对话智能体的用户中心记忆构建和问题条件化检索。

3. Approach

We present AdaMem, a memory-augmented dialogue framework that continuously organizes user-centric memories at inference time and answers questions through adaptive retrieval over heterogeneous memory sources. AdaMem consists of four tightly coupled components: memory construction, question-conditioned retrieval planning, evidence fusion, and response generation. The overall pipeline is illustrated in Figure 2.

我们提出 AdaMem,这是一个记忆增强对话框架,它在推理时持续组织用户中心记忆,并通过对异质记忆源的自适应检索回答问题。 AdaMem 由四个紧密耦合的组件组成:记忆构建、问题条件化检索规划、证据融合和回答生成。 整体流水线如图2所示。

AdaMem overview
图2:模型概览。对话历史被组织为工作记忆、情景记忆、人格记忆和图记忆,问答通过目标感知、问题条件化检索和角色专门化证据综合完成。

3.1 Method Overview

Given a dialogue history D={ut}t=1T and a question q, AdaMem returns an answer a grounded in dynamic memories. AdaMem maintains participant-specific memory bundles for the user and the assistant, and each bundle contains four memory structures:

  • Working Memory Mw: a bounded FIFO buffer that preserves recent conversational context and short-term discourse states.
  • Episodic Memory Me: long-term structured records including events, facts, attributes, and topic-centric summaries.
  • Persona Memory Mp: compact user profiles distilled from episodic evidence to capture relatively stable preferences and traits.
  • Graph Memory G: a heterogeneous graph connecting messages, topics, facts, attributes, and event or persona snapshots for relation-aware retrieval.

This participant-specific organization is important because many questions in multi-party conversations implicitly target one speaker, both speakers, or an ambiguous referent. Accordingly, AdaMem follows a single end-to-end pipeline: it first writes each utterance into structured participant memories, then resolves the likely target participant for q, retrieves evidence from one or more memory channels, and finally produces the answer from the fused evidence set R(q).

给定对话历史 D={ut}t=1T 和问题 q,AdaMem 返回一个基于动态记忆的答案 a AdaMem 为用户和助手维护特定于参与者的记忆束,每个记忆束包含四种记忆结构:

  • 工作记忆 Mw:一个有界 FIFO 缓冲区,保存近期对话上下文和短期话语状态。
  • 情景记忆 Me:长期结构化记录,包括事件、事实、属性和以主题为中心的摘要。
  • 人格记忆 Mp:从情景证据中蒸馏出的紧凑用户画像,用于捕捉相对稳定的偏好和特征。
  • 图记忆 G:一个异质图,连接消息、主题、事实、属性以及事件或人格快照,用于关系感知检索。

这种特定于参与者的组织方式很重要,因为多方对话中的许多问题会隐式指向某一说话者、双方说话者或模糊指代对象。 因此,AdaMem 遵循单一端到端流水线:它首先把每个话语写入结构化参与者记忆,然后解析 q 的可能目标参与者,从一个或多个记忆通道检索证据,最后从融合证据集 R(q) 生成答案。

3.2 Memory Construction

Message understanding and normalized write. For each incoming utterance ut from participant p, the Memory Agent first produces a normalized record zt containing a short summary, topic, attitude, reason, factual snippets, attributes, timestamp, and speaker identity. All downstream memory updates operate on zt rather than on raw free-form text, so the same canonical parse is reused across working, episodic, persona, and graph memories. This design reduces prompt drift across memory modules and makes the write path explicit.

消息理解和规范化写入。 对于来自参与者 p 的每个输入话语 ut,Memory Agent 首先生成规范化记录 zt,其中包含简短摘要、主题、态度、原因、事实片段、属性、时间戳和说话者身份。 所有下游记忆更新都作用于 zt,而不是原始自由文本,因此同一个规范化解析会在工作记忆、情景记忆、人格记忆和图记忆之间复用。 这一设计减少了记忆模块之间的提示漂移,并使写入路径显式化。

Working-to-episodic consolidation. For each participant, working memory is a bounded FIFO queue with capacity Cw. When the queue becomes full, AdaMem pops the oldest contiguous segment of r messages and consolidates that segment into Me. The segment is therefore determined by recency order rather than by question-aware salience, preventing future questions from implicitly influencing write-time memory selection. Within the popped segment, three router modules independently process event-, fact-, and attribute-level evidence. Each router predicts one of ADD, UPDATE, or IGNORE, together with a target key if an existing entry should be revised. The resulting updates populate event, fact, and attribute stores, while the original messages remain available as cacheable provenance for later evidence recovery.

从工作记忆到情景记忆的整合。 对每个参与者而言,工作记忆是一个容量为 Cw 的有界 FIFO 队列。 当队列变满时,AdaMem 弹出最旧的连续 r 条消息段,并将该段整合到 Me 中。 因此,这个片段由近期顺序决定,而不是由问题感知显著性决定,从而防止未来问题隐式影响写入时的记忆选择。 在弹出的片段内,三个路由模块独立处理事件、事实和属性级证据。 每个路由器预测 ADDUPDATEIGNORE 之一,并在需要修订现有条目时给出目标键。 由此产生的更新填充事件、事实和属性存储,而原始消息仍作为可缓存来源保留,以便之后恢复证据。

Topic regrouping, persona refresh, and graph synchronization. After consolidation, AdaMem converts fine-grained episodic keys into reusable higher-level memories in two stages. First, event or attribute keys are embedded and linked by a sparse nearest-neighbor graph: each key is connected to its most similar peer, and the connected components define merge groups. Second, an LLM merge prompt rewrites each group into a topic-centric or aspect-centric summary. Topic groups are used to build topic episodic memories and preference-oriented persona descriptors, while clustered attributes are merged into aspect-based persona summaries. In parallel, both message-level and consolidated records are indexed into G, so later retrieval can move between local discourse evidence and longer-term structured abstractions.

主题重组、人格刷新和图同步。 整合之后,AdaMem 通过两个阶段把细粒度情景键转换为可复用的更高层记忆。 首先,事件或属性键被嵌入并通过稀疏最近邻图连接:每个键连接到其最相似的同伴,连通分量定义合并组。 其次,一个 LLM 合并提示将每个组重写为以主题为中心或以方面为中心的摘要。 主题组用于构建主题情景记忆和面向偏好的人格描述符,而聚类后的属性会合并为基于方面的人格摘要。 同时,消息级记录和整合后的记录都会索引到 G 中,因此之后的检索可以在局部话语证据和更长期的结构化抽象之间移动。

3.3 Question-Conditioned Retrieval

Target participant resolution. Before retrieval, AdaMem resolves whether q refers to the user, the assistant, both, or an ambiguous participant. The implementation uses a lightweight four-way resolver based on explicit participant mentions; if the referent is ambiguous, the system avoids a hard commitment and instead runs a small retrieval on both participant bundles before fusion. This makes target uncertainty visible to the retrieval stage rather than hiding it inside later answer generation.

目标参与者解析。 在检索之前,AdaMem 解析 q 是指向用户、助手、双方还是模糊参与者。 实现中使用一个基于显式参与者提及的轻量四分类解析器;如果指代对象模糊,系统会避免硬性承诺,而是在融合前对两个参与者记忆束都运行小规模检索。 这让目标不确定性在检索阶段可见,而不是隐藏在之后的答案生成中。

Route planning. Given q, AdaMem builds a question-conditioned route plan π(q) that specifies whether graph expansion should be used, how far it may propagate, how many graph seeds are activated, which edge-type priors are applied, and how baseline and graph evidence will be fused. The planner first applies deterministic cue detection over temporal, relational, attribute, and single-hop question patterns. These cues initialize a rule-based plan. When the question remains uncertain, an optional LLM refinement step can revise the plan, but only within narrow bounded ranges so that the planner remains conservative and semantic retrieval stays dominant for simple questions. As a result, single-hop factoid questions usually remain on lightweight semantic retrieval, whereas temporal or causal questions trigger broader structural exploration.

路径规划。 给定 q,AdaMem 构建问题条件化路径计划 π(q),它指定是否使用图扩展、可以传播多远、激活多少图种子、应用哪些边类型先验,以及如何融合基线证据和图证据。 规划器首先对时间、关系、属性和 single-hop 问题模式应用确定性线索检测。 这些线索初始化基于规则的计划。 当问题仍然不确定时,可选的 LLM 细化步骤可以修订计划,但只在狭窄有界范围内进行,以便规划器保持保守,并让语义检索在简单问题上保持主导。 因此,single-hop 事实问题通常停留在轻量语义检索上,而时间或因果问题会触发更广泛的结构探索。

Target-aware baseline retrieval. For a selected participant bundle, baseline retrieval aggregates semantic candidates from persona summaries, episodic facts, and topic-linked messages:

目标感知基线检索。 对于选定的参与者记忆束,基线检索会从人格摘要、情景事实和主题链接消息中聚合语义候选:

Rbase(q)=TopK(Rattr(q)Rfact(q)Rtopic(q)).

where attribute candidates come from persona summaries, fact candidates come from episodic fact memory, and topic candidates are linked back to original messages through topic-to-message maps. Beyond pure top-K aggregation, AdaMem applies two recovery mechanisms. First, high-confidence fact hits reactivate their supporting discourse messages. Instead of attaching all neighboring contexts, AdaMem uses a score-drop rule over ranked fact matches to decide how many supporting messages should be reintroduced. Second, a lightweight keyword backoff over a word-to-detail index recalls raw messages that are weakly covered by dense retrieval but lexically salient for the question.

其中,属性候选来自人格摘要,事实候选来自情景事实记忆,主题候选通过主题到消息映射链接回原始消息。 除了纯 top-K 聚合之外,AdaMem 还应用两种恢复机制。 首先,高置信度事实命中会重新激活其支持话语消息。 AdaMem 不会附加所有邻近上下文,而是对排序后的事实匹配使用分数下降规则,以决定应重新引入多少支持消息。 其次,在词到细节索引上的轻量关键词回退会召回那些被密集检索弱覆盖、但对问题在词汇上显著的原始消息。

Graph retrieval and evidence fusion. When π(q) requests relation-aware evidence, AdaMem selects top semantic seed nodes in G and performs bounded multi-hop expansion. If a seed node u reaches a neighbor v through edge type e, the propagated score is updated by a fixed multiplicative rule

图检索和证据融合。π(q) 请求关系感知证据时,AdaMem 会在 G 中选择顶层语义种子节点并执行有界多跳扩展。 如果种子节点 u 通过边类型 e 到达邻居 v,传播分数会通过固定乘法规则更新:

sd+1(v)=sd(u)weλ,

where we is the edge-type prior and λ is a hop-decay factor. Owner-aware filtering is applied only when the target participant is unambiguous. This yields Rgraph(q), which is fused with baseline evidence by

其中 we 是边类型先验,λ 是跳数衰减因子。 只有当目标参与者无歧义时,才应用所有者感知过滤。 这产生 Rgraph(q),它会通过下式与基线证据融合:

score(mq)=αsbase(m,q)+βsgraph(m,q)+γsrecency(m)+δsfact(m).

Here sbase(m,q) and sgraph(m,q) are rank-derived scores from baseline and graph retrieval, respectively; srecency(m) is a linear recency prior over the merged candidate list; and sfact(m) is a small confidence bonus for items also supported by fact retrieval. Notably, for the fusion weights, a global default prior is fixed before evaluation.

这里,sbase(m,q)sgraph(m,q) 分别是来自基线检索和图检索的排序派生分数;srecency(m) 是合并候选列表上的线性近期先验;sfact(m) 是对也由事实检索支持的条目给予的小置信度奖励。 值得注意的是,对于融合权重,全局默认先验会在评估之前固定。

3.4 Multi-Agent Collaboration

Memory Agent. The Memory Agent is responsible for online message understanding and memory updates. For each new utterance, it extracts a normalized representation, writes it into the participant-specific working memory, triggers working-to-episodic consolidation when the short-term buffer saturates, and synchronizes the resulting message and memory artifacts to graph memory. Persona descriptors are then refreshed from aggregated episodic evidence at the indexing stage, so the system maintains both up-to-date local context and compact long-term user models.

Memory Agent。 Memory Agent 负责在线消息理解和记忆更新。 对于每个新话语,它抽取规范化表示,将其写入特定于参与者的工作记忆,在短期缓冲区饱和时触发从工作记忆到情景记忆的整合,并把得到的消息和记忆工件同步到图记忆。 随后,人格描述符会在索引阶段从聚合情景证据中刷新,因此系统同时维护最新局部上下文和紧凑长期用户模型。

Research Agent. At question answering time, the Research Agent performs iterative evidence gathering over the unified retrieval interface described above. It follows a Planning Search Integrate Reflection loop: it first decomposes the information needed to answer q, then issues one or more retrieval requests through the same participant-aware retrieval API, integrates newly recovered evidence into a consolidated research summary, and decides whether additional search is still necessary. Importantly, this agent-level planning is distinct from the route plan π(q): the Research Agent decides what missing information to ask for, while the route planner described above decides how each retrieval call is executed.

Research Agent。 在问答时,Research Agent 会在上述统一检索接口上执行迭代证据收集。 它遵循 Planning Search Integrate Reflection 循环:首先分解回答 q 所需的信息,然后通过同一个参与者感知检索 API 发出一个或多个检索请求,把新恢复的证据整合为统一研究摘要,并判断是否仍需额外搜索。 重要的是,这种智能体级规划不同于路径计划 π(q):Research Agent 决定要询问什么缺失信息,而上述路径规划器决定每次检索调用应如何执行。

Working Agent. The Working Agent converts the research summary into the final concise answer. It conditions generation primarily on the integrated summary returned by the Research Agent and, when needed, supplements it with high-confidence persona attributes or factual snippets as auxiliary grounding. This separation allows evidence collection and answer realization to be optimized for different roles while preserving a single memory interface. AdaMem answers each question through a fixed collaboration order among these roles. As a result, response generation remains tightly coupled with the same user-centric memory interface and retrieval backbone introduced above, while allocating explicit deliberation to multi-step evidence synthesis before final answer generation.

Working Agent。 Working Agent 将研究摘要转换为最终简洁答案。 它主要以 Research Agent 返回的整合摘要为条件生成,并在需要时用高置信度人格属性或事实片段作为辅助依据补充。 这种分离使证据收集和答案实现可以面向不同角色优化,同时保留单一记忆接口。 AdaMem 通过这些角色之间固定的协作顺序回答每个问题。 因此,回答生成与上文介绍的同一个用户中心记忆接口和检索骨干保持紧密耦合,同时在最终答案生成之前为多步证据综合分配显式审议过程。

4. Experiments

4.1 Experimental Setting

Benchmarks. To assess the effectiveness of our approach, we conduct experiments on the LoCoMo benchmark. LoCoMo poses a challenging setting for long-context modeling, consisting of dialogue histories that span an average of 35 sessions and roughly 9,000 tokens. Following the benchmark’s standard evaluation protocol, we report quantitative results across four core capabilities: single-hop reasoning, multi-hop reasoning, temporal reasoning, and open-domain question answering. The original benchmark also includes an adversarial question category designed to test a model’s ability to identify unanswerable queries.

基准。 为评估我们方法的有效性,我们在 LoCoMo 基准上进行实验。 LoCoMo 为长上下文建模提出了具有挑战性的设置,其对话历史平均跨越 35 个会话,约 9,000 个 token。 遵循该基准的标准评估协议,我们报告四项核心能力上的定量结果:single-hop reasoning、multi-hop reasoning、temporal reasoning 和 open-domain question answering。 原始基准还包含一个 adversarial question 类别,用于测试模型识别不可回答查询的能力。

To further examine the generalization capability of our approach, we conduct additional experiments on the PERSONAMEM benchmark. It is designed to assess how well large language models maintain and update user representations and produce personalized responses over extended interactions. The benchmark comprises multi-session dialogue histories in which user attributes and preferences gradually evolve as a result of life events and changing contexts. Following its standard evaluation protocol, we report quantitative results across seven categories, each targeting different aspects of user modeling and memory utilization.

为进一步考察我们方法的泛化能力,我们在 PERSONAMEM 基准上进行额外实验。 该基准旨在评估大语言模型在扩展交互中维护和更新用户表示并生成个性化回答的能力。 该基准包含多会话对话历史,其中用户属性和偏好会随着生活事件和上下文变化逐渐演化。 遵循其标准评估协议,我们报告七个类别上的定量结果,每个类别针对用户建模和记忆利用的不同方面。

Evaluation Metrics. For the LoCoMo benchmark, we follow the standard evaluation protocol and report F1 and BLEU-1 as the primary metrics. For the PERSONAMEM benchmark, which primarily consists of multiple-choice questions, we evaluate model performance using accuracy.

评估指标。 对于 LoCoMo 基准,我们遵循标准评估协议,并报告 F1 和 BLEU-1 作为主要指标。 对于主要由多项选择题组成的 PERSONAMEM 基准,我们使用准确率评估模型性能。

4.2 Implementation Details

We conduct experiments on both closed-source APIs and open-source models. The closed-source models include GPT-4.1-mini and GPT-4o-mini. For open-source models, we evaluate Qwen3-4B-Instruct and Qwen3-30B-A3B-Instruct. Experiments are conducted on a NVIDIA RTX A800 GPU. We ensure that the AdaMem framework uses the same backbone model as the response generator. To ensure reproducibility, the temperature is fixed to 0 in all experiments. For the RAG component in our framework, the retrieval top-k is set to 10, and the maximum number of retrieval iterations Li is limited to 2 for efficiency. All memory embeddings are computed using the all-MiniLM-L6-v2 model. More implementation details and prompt templates are provided in the appendix.

我们在闭源 API 和开源模型上都进行实验。 闭源模型包括 GPT-4.1-mini 和 GPT-4o-mini。 对于开源模型,我们评估 Qwen3-4B-Instruct 和 Qwen3-30B-A3B-Instruct。 实验在 NVIDIA RTX A800 GPU 上进行。 我们确保 AdaMem 框架使用与回答生成器相同的 backbone 模型。 为确保可复现性,所有实验中的温度都固定为 0。 对于我们框架中的 RAG 组件,检索 top-k 设为 10,最大检索迭代次数 Li 限制为 2 以提高效率。 所有记忆嵌入都使用 all-MiniLM-L6-v2 模型计算。 更多实现细节和提示模板见附录。

4.3 Comparison to Competitive Approaches

Baselines. We compare AdaMem with five representative open-source memory frameworks: (1) MemGPT, a framework that addresses long-context limitations through an OS-inspired memory management mechanism; (2) A-Mem, an agentic memory system that enables dynamic organization and evolution of long-term memory; (3) Mem0, a memory-centric architecture that provides scalable long-term memory; (4) LangMem, a framework that explicitly models long-term memory; (5) Zep, a memory layer that represents conversational memory using a temporally aware knowledge graph.

基线。 我们将 AdaMem 与五个代表性开源记忆框架比较:(1) MemGPT,一个通过受操作系统启发的记忆管理机制处理长上下文限制的框架;(2) A-Mem,一个支持长期记忆动态组织和演化的智能体记忆系统;(3) Mem0,一个提供可扩展长期记忆的记忆中心架构;(4) LangMem,一个显式建模长期记忆的框架;(5) Zep,一个使用时间感知知识图谱表示对话记忆的记忆层。

Results on LoCoMo. The quantitative results on the LoCoMo benchmark using closed-source backbones are summarized in Table 1. With GPT-4.1-mini, AdaMem achieves an overall F1 score of 44.65%, corresponding to a +4.4% relative improvement over the previous state-of-the-art method. The improvement is consistent across evaluation metrics, with the largest gain observed in the temporal question category, where AdaMem improves the F1 score by up to +23.4%. Using GPT-4o-mini, AdaMem achieves an overall F1 score of 41.84%, which corresponds to a larger relative improvement of +12.8% over the previous state-of-the-art method. AdaMem does not outperform the strongest baselines in the Open Domain categories, likely because some tasks favor methods tailored to specific structural assumptions. This result suggests a trade-off between optimizing for specialized task structures and achieving consistent performance across diverse long-horizon interactions. Overall, the results demonstrate the effectiveness and robustness of our approach across different closed-source backbones. We further report results using open-source models in Table 4. Case studies are provided in the appendix.

LoCoMo 结果。 使用闭源 backbone 在 LoCoMo 基准上的定量结果汇总于表1 使用 GPT-4.1-mini 时,AdaMem 达到 44.65% 的 overall F1 分数,对应相对于先前最先进方法 +4.4% 的相对提升。 该提升在各项评估指标上保持一致,最大增益出现在 temporal question 类别,AdaMem 在该类别上将 F1 分数最高提升 +23.4%。 使用 GPT-4o-mini 时,AdaMem 达到 41.84% 的 overall F1 分数,对应相对于先前最先进方法更大的 +12.8% 相对提升。 AdaMem 在 Open Domain 类别中没有超过最强基线,这可能是因为某些任务偏好针对特定结构假设调优的方法。 这一结果表明,在为专门任务结构优化和跨多样长程交互实现一致性能之间存在权衡。 总体而言,结果展示了我们方法在不同闭源 backbone 上的有效性和稳健性。 我们进一步在表4报告使用开源模型的结果。 案例研究见附录。

表1:LoCoMo benchmark 上的性能。最佳结果以粗体表示,次优结果以下划线表示。
MethodMulti-hopTemporalOpen DomainSingle-hopOverall
F1BLEU-1F1BLEU-1F1BLEU-1F1BLEU-1F1BLEU-1
GPT-4.1-mini
MemGPT10.3510.2630.0624.0328.3523.4922.9517.1022.5017.64
A-Mem11.1611.0742.3033.7530.6925.3824.8418.5426.3720.70
Mem035.1926.5534.3829.0720.8815.5742.6636.9038.1632.04
LangMem36.4528.5342.5735.9128.8023.1344.8238.2541.7635.10
Zep26.7317.9120.9717.5521.2416.9239.9635.1032.4027.09
AdaMem37.7032.0855.9042.3725.8721.7044.8440.0244.6537.92
GPT-4o-mini
MemGPT10.089.9929.3423.4927.7222.9522.4116.7421.9617.28
A-Mem10.8910.8041.3133.0329.9724.8424.3018.0925.7422.95
Mem030.6022.5039.6033.5724.2117.2839.6930.6037.0830.51
LangMem30.2421.6028.8923.6726.5521.2435.0129.8832.3126.55
Zep24.5717.2839.9634.0220.4314.0435.4630.0633.4827.63
AdaMem35.1826.3251.4937.0625.8221.4642.2136.4341.8433.78

Results on PERSONAMEM. To evaluate the generalization capability of AdaMem across benchmarks, we further evaluate it on the PERSONAMEM benchmark. As shown in Table 2, AdaMem achieves 63.25% accuracy, outperforming all baselines with a relative improvement of 5.9%. Notably, AdaMem demonstrates a substantial advantage on the generalize to new scenarios task, achieving a relative improvement of 27.3%. These consistent gains across benchmarks with different data distributions suggest that AdaMem generalizes effectively to diverse long-term reasoning scenarios.

PERSONAMEM 结果。 为评估 AdaMem 跨基准的泛化能力,我们进一步在 PERSONAMEM 基准上评估它。 表2所示,AdaMem 达到 63.25% 准确率,以 5.9% 的相对提升超过所有基线。 值得注意的是,AdaMem 在 generalize to new scenarios 任务上展现出显著优势,达到 27.3% 的相对提升。 这些在不同数据分布基准上的一致增益表明,AdaMem 能有效泛化到多样长期推理场景。

表2:PERSONAMEM benchmark 上的性能。
MethodRecall user
shared facts
Suggest new
ideas
Track full
preference
evolution
Revisit reasons
behind
preference updates
Provide preference-
aligned
recommendations
Generalize to
new scenarios
Average
A-Mem63.0127.9654.6885.8669.0957.8959.75
Mem032.1315.0554.6880.8152.7357.8948.55
LangMem31.2924.7353.2481.8240.008.7740.64
AdaMem67.8121.5161.1589.9065.4573.6863.25

4.4 Ablation Study

Unless otherwise stated, we perform all ablations of AdaMem on the LoCoMo benchmark using the GPT-4.1-mini backbone.

除非另有说明,我们在 LoCoMo 基准上使用 GPT-4.1-mini backbone 进行 AdaMem 的所有消融。

Components. We perform component ablations to quantify the contribution of AdaMem's main design choices. As shown in Table 3, removing any one of the three components consistently reduces performance, suggesting that the improvement of AdaMem does not come from a single dominant module. Disabling graph memory produces the largest drop, reducing overall F1 from 44.65 to 42.63. This result is consistent with our motivation in Section 3: relation-aware memory is important for recovering cross-turn dependencies and temporally linked evidence that may be missed by semantic retrieval alone. Removing the fusion module also leads to clear degradation (42.77 F1), indicating that jointly combining baseline retrieval, graph evidence, and lightweight temporal/factual signals yields more reliable evidence selection than relying on a single signal source. Replacing the multi-agent response pipeline with a single-agent variant causes a smaller but still consistent decline (43.24 F1), suggesting that role specialization mainly improves evidence organization and final answer synthesis after retrieval. Overall, the three components are complementary: graph memory improves structural evidence coverage, fusion improves evidence aggregation, and multi-agent coordination improves downstream reasoning quality.

组件。 我们进行组件消融,以量化 AdaMem 主要设计选择的贡献。 表3所示,移除三个组件中的任何一个都会持续降低性能,表明 AdaMem 的改进并不来自单个主导模块。 禁用图记忆会产生最大下降,使 overall F1 从 44.65 降至 42.63。 这一结果与我们在第 3 节中的动机一致:关系感知记忆对于恢复跨轮依赖和时间关联证据很重要,而这些证据可能被单独的语义检索错过。 移除融合模块也会导致明显退化(42.77 F1),说明联合组合基线检索、图证据和轻量时间/事实信号,比依赖单一信号源能产生更可靠的证据选择。 用单智能体变体替换多智能体回答流水线会造成较小但仍一致的下降(43.24 F1),说明角色专门化主要改进检索后的证据组织和最终答案综合。 总体而言,三个组件是互补的:图记忆改进结构证据覆盖,融合改进证据聚合,多智能体协调改进下游推理质量。

表3:关键组件消融。
ConfigurationGraphFusionMulti-agentF1BLEU-1
AdaMem (full)44.6537.92
w/o graph×42.6335.85
w/o fusion×42.7736.26
w/o multi-agent×43.2436.34

Model Sizes. We further study whether AdaMem remains effective when the response backbone changes in both scale and model family. Table 4 compares two closed-source GPT-4 backbones with two open-source Qwen3 models while keeping the AdaMem framework unchanged. AdaMem generalizes well across backbone families. Even with the smaller open-source Qwen3-4B-Instruct model, AdaMem still reaches 36.78 F1 on LoCoMo, showing that the memory architecture remains useful under a constrained model budget. Scaling the backbone to Qwen3-30B-A3B-Instruct yields consistent gains on every category, improving overall performance by +6.24 F1 over Qwen3-4B. The largest gain appears on temporal reasoning (+13.17 F1), followed by multi-hop reasoning (+6.91 F1), which suggests that larger models make better use of AdaMem's structured evidence when questions require linking events over time or synthesizing multiple memory fragments. Overall, this experiment shows that AdaMem is both robust under smaller models and able to translate additional model capacity into stronger long-horizon reasoning.

模型规模。 我们进一步研究当回答 backbone 在规模和模型家族上变化时,AdaMem 是否仍然有效。 表4在保持 AdaMem 框架不变的情况下,比较两个闭源 GPT-4 backbone 和两个开源 Qwen3 模型。 AdaMem 能很好地跨 backbone 家族泛化。 即使使用较小的开源 Qwen3-4B-Instruct 模型,AdaMem 在 LoCoMo 上仍达到 36.78 F1,表明该记忆架构在受限模型预算下仍然有用。 将 backbone 扩展到 Qwen3-30B-A3B-Instruct 会在每个类别上带来一致增益,相比 Qwen3-4B 将 overall 性能提升 +6.24 F1。 最大增益出现在 temporal reasoning(+13.17 F1),其次是 multi-hop reasoning(+6.91 F1),这表明当问题需要随时间链接事件或综合多个记忆片段时,更大的模型能更好利用 AdaMem 的结构化证据。 总体而言,该实验表明 AdaMem 在较小模型下稳健,并能把额外模型容量转化为更强的长程推理能力。

表4:模型规模消融。
ModelMulti-hopTemporalOpen DomainSingle-hopOverall
F1BLEU-1F1BLEU-1F1BLEU-1F1BLEU-1F1BLEU-1
GPT-4o-mini35.1826.3251.4937.0625.8221.4642.2136.4341.8433.78
GPT-4.1-mini37.7032.0855.9042.3725.8721.7044.8440.0244.6537.92
Qwen3-4B-Instruct29.5023.3540.4529.7420.0517.6839.7434.9836.7830.68
Qwen3-30B-A3B-Instruct30.8225.1743.5130.9026.5523.3140.6835.9938.5832.16

Hyperparameters. We study the sensitivity of AdaMem to two key hyperparameters that directly control evidence breadth and deliberation depth: the retrieval top-K and the maximum number of Research Agent iterations Li. Results are shown in Figure 3. This analysis helps determine whether AdaMem's gains come from a stable operating regime or from overly aggressive evidence accumulation.

超参数。 我们研究 AdaMem 对两个关键超参数的敏感性,这两个超参数直接控制证据广度和审议深度:检索 top-K 和 Research Agent 的最大迭代次数 Li 结果如图3所示。 该分析有助于判断 AdaMem 的增益来自稳定工作区间,还是来自过于激进的证据积累。

Fixing Li=2, we vary K{5,10,15}. Performance improves substantially from K=5 to K=10, indicating that very small candidate pools often fail to cover the dispersed evidence needed for multi-session reasoning. Increasing K further to 15 yields only marginal gains over K=10, suggesting diminishing returns once the main supporting evidence has already been retrieved. We therefore use K=10 as the default setting, as it achieves nearly the best performance while avoiding the additional latency and noise introduced by a larger evidence set.

固定 Li=2,我们改变 K{5,10,15} 性能从 K=5K=10 大幅提升,表明很小的候选池往往无法覆盖多会话推理所需的分散证据。 进一步将 K 增加到 15 仅比 K=10 带来边际增益,说明一旦主要支持证据已经被检索到,收益就会递减。 因此,我们使用 K=10 作为默认设置,因为它几乎达到最佳性能,同时避免更大证据集带来的额外延迟和噪声。

Fixing K=10, we vary Li{1,2,3}. Both F1 and BLEU-1 peak at Li=2. Using only 1 iteration is often insufficient for questions that require decomposition or follow-up retrieval, whereas extending the loop to 3 iterations slightly hurts performance, likely because later rounds accumulate redundant or weakly related evidence. This trend is consistent with the design of the Research Agent: iterative retrieval is beneficial, but only under a tight budget that preserves evidence precision. Thus, we use Li=2 as the default setting. Overall, AdaMem is moderately sensitive rather than brittle, and the default configuration used in our main experiments provides a favorable performance--efficiency trade-off.

固定 K=10,我们改变 Li{1,2,3} F1 和 BLEU-1 都在 Li=2 时达到峰值。 仅使用 1 次迭代通常不足以处理需要分解或后续检索的问题,而将循环扩展到 3 次迭代会略微损害性能,这可能是因为后续轮次积累了冗余或弱相关证据。 这一趋势与 Research Agent 的设计一致:迭代检索是有益的,但只有在保持证据精度的紧预算下才有益。 因此,我们使用 Li=2 作为默认设置。 总体而言,AdaMem 是中等敏感而不是脆弱的,并且我们主实验中使用的默认配置提供了有利的性能--效率权衡。

AdaMem hyperparameter ablation
图3:关键超参数消融。

4.5 Efficiency Analysis

We analyze the performance-efficiency trade-off of AdaMem. Table 5 reports the average input token budget, end-to-end inference latency, and overall F1 and BLEU-1. AdaMem is not the cheapest system in raw computation: Mem0 uses fewer input tokens and lower latency. However, AdaMem delivers the strongest answer quality, reaching 44.65 F1, which corresponds to absolute gains of +7.57 F1 over Mem0. Compared with A-Mem and Zep, AdaMem also achieves substantially better accuracy while operating with a token budget of the same order.

我们分析 AdaMem 的性能--效率权衡。 表5报告平均输入 token 预算、端到端推理延迟,以及 overall F1 和 BLEU-1。 AdaMem 在原始计算上并不是最便宜的系统:Mem0 使用更少输入 token,并且延迟更低。 然而,AdaMem 提供最强回答质量,达到 44.65 F1,相比 Mem0 对应 +7.57 F1 的绝对增益。 与 A-Mem 和 Zep 相比,AdaMem 也在相同量级 token 预算下达到显著更好的准确率。

These results suggest that AdaMem's advantage does not come from aggressively minimizing retrieval cost, but from allocating a moderate amount of additional computation to recover higher-quality evidence. The extra latency is consistent with our design: question-conditioned route planning, graph-based expansion, and the role-specialized response loop introduce overhead, but they also improve evidence coverage and synthesis for long-horizon questions. Overall, AdaMem occupies a favorable operating point among memory-based methods, converting a moderate increase in token usage and latency into clearly stronger reasoning performance.

这些结果表明,AdaMem 的优势并不来自激进地最小化检索成本,而是来自分配适量额外计算来恢复更高质量的证据。 额外延迟与我们的设计一致:问题条件化路径规划、基于图的扩展以及角色专门化回答循环会引入开销,但它们也改善了长程问题的证据覆盖和综合。 总体而言,AdaMem 在基于记忆的方法中处于有利工作点,将 token 使用和延迟的适度增加转化为明显更强的推理性能。

表5:性能--效率权衡。
MethodF1BLEU-1TokensLatency (s)
A-Mem26.3720.7027203.227
Mem038.1632.0413403.739
Zep32.4027.0924613.255
AdaMem44.6537.9222484.722

5. Conclusion

In this paper, we presented AdaMem, an adaptive user-centric memory framework for long-horizon dialogue agents. AdaMem combines participant-specific working, episodic, persona, and graph-based memories with question-conditioned retrieval planning and unified evidence fusion, enabling the system to retrieve and integrate evidence in a more structured and target-aware manner. Experiments on long-horizon reasoning and user modeling benchmarks demonstrate the effectiveness of our framework. In particular, AdaMem achieves state-of-the-art performance on LoCoMo, showing the value of adaptive memory organization and retrieval for complex multi-session interactions. More broadly, our results suggest that memory for long-horizon dialogue agents should move beyond uniform storage and fixed retrieval heuristics toward more adaptive, question-aware, and user-centric designs.

在本文中,我们提出 AdaMem,这是一个面向长程对话智能体的自适应用户中心记忆框架。 AdaMem 将特定于参与者的工作记忆、情景记忆、人格记忆和基于图的记忆,与问题条件化检索规划和统一证据融合结合起来,使系统能够以更结构化和目标感知的方式检索并整合证据。 在长程推理和用户建模基准上的实验展示了我们框架的有效性。 特别是,AdaMem 在 LoCoMo 上达到最先进性能,显示出自适应记忆组织和检索对于复杂多会话交互的价值。 更广泛地说,我们的结果表明,长程对话智能体的记忆应超越统一存储和固定检索启发式,走向更自适应、问题感知和用户中心的设计。

Limitations

AdaMem improves answer quality through structured memories, adaptive retrieval, and role-specialized evidence synthesis, but this design also increases system complexity, token cost, and latency. In addition, the framework still depends on upstream parsing and backbone reasoning, making errors in target resolution, entity linking, and temporal normalization difficult to recover. In future work, we intend to explore solutions to these challenges in order to further enhance both the efficiency and the generality of the proposed framework.

AdaMem 通过结构化记忆、自适应检索和角色专门化证据综合提升回答质量,但这一设计也增加了系统复杂性、token 成本和延迟。 此外,该框架仍依赖上游解析和 backbone 推理,使目标解析、实体链接和时间规范化中的错误难以恢复。 在未来工作中,我们计划探索这些挑战的解决方案,以进一步增强所提出框架的效率和通用性。