AMA: Adaptive Memory via Multi-Agent Collaboration

Memory · Agent · ACL 2026 Findings · HKUST(GZ) · SDU · NTU · SUSTech

Huang W, Wang Z, Lin H, et al. AMA: Adaptive Memory via Multi-Agent Collaboration[J]. arXiv preprint arXiv:2601.20352v3, 2026.


Abstract

The rapid evolution of Large Language Model (LLM) agents has necessitated robust memory systems to support cohesive long-term interaction and complex reasoning. Benefiting from the strong capabilities of LLMs, recent research focus has shifted from simple context extension to the development of dedicated agentic memory systems. However, existing approaches typically rely on rigid retrieval granularity, accumulation-heavy maintenance strategies, and coarse-grained update mechanisms. These design choices create a persistent mismatch between stored information and task-specific reasoning demands, while leading to the unchecked accumulation of logical inconsistencies over time. To address these challenges, we propose Adaptive Memory via Multi-Agent Collaboration (AMA), a novel framework that leverages coordinated agents to manage memory across multiple granularities. AMA employs a hierarchical memory design that dynamically aligns retrieval granularity with task complexity. Specifically, the Constructor and Retriever jointly enable multi-granularity memory construction and adaptive query routing. The Judge verifies the relevance and consistency of retrieved content, triggering iterative retrieval when evidence is insufficient or invoking the Refresher upon detecting logical conflicts. The Refresher then enforces memory consistency by performing targeted updates or removing outdated entries. Extensive experiments on challenging long-context benchmarks show that AMA significantly outperforms state-of-the-art baselines while reducing token consumption by approximately 80% compared to full-context methods, demonstrating its effectiveness in maintaining retrieval precision and long-term memory consistency.

1. Introduction

Large Language Model (LLM) agents have demonstrated strong capabilities in complex reasoning, tool use, and multi-turn interaction scenarios. Supporting such behaviors requires long-term memory to preserve contextual coherence and consistency. Existing approaches to long-term memory can be broadly categorized into internal and external memory paradigms. Internal memory implicitly absorbs historical information into model parameters, but is constrained by limited capacity and incurs substantial costs for continual updates. In contrast, external memory relies on explicit storage and retrieval, providing superior scalability and editability. As a result, it has become the dominant approach, making the design of efficient and reliable external memory systems a critical foundation for sustained agent evolution.

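Figure 1: Comparison of static paradigms and the AMA framework. (a) Static methods suffer from a fixed-granularity dilemma, which leads to either noise or information loss. (b) AMA dynamically determines which memory granularity to use, aligning retrieval precision with reasoning demands.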

Building on the growing adoption of external memory, many systems support dynamic memory management through explicit Create-Read-Update-Delete operations, enabling agents to incrementally maintain memory over time. Despite these advantages, they exhibit a fundamental limitation: a mismatch between the granularity at which memories are stored and the granularity required for effective retrieval and reasoning. As illustrated in Figure 1, these approaches typically rely on static text chunking with fixed lengths or coarse-grained summaries. Such static strategies often disrupt the inherent semantic coherence of stored information, which in turn leads to suboptimal retrieval behavior: overly coarse retrieval introduces substantial irrelevant noise, while excessively fine-grained or isolated chunks fragment essential logical dependencies, ultimately leading to reasoning failures in complex tasks. These limitations highlight the necessity of an adaptive memory paradigm capable of dynamically aligning memory granularity with task-specific requirements.

To address these challenges, recent work has shifted toward agentic memory mechanisms, leveraging the generative capabilities of LLMs to mitigate the rigidity of static storage granularity. Typically, these frameworks employ LLMs to synthesize interaction history into flexible representations like summaries or vector entries, extending the effective context window. While these designs improve representation flexibility, they leave two fundamental challenges largely unaddressed. First, the absence of an explicit adaptive routing mechanism prevents agents from selecting the appropriate memory granularity at inference time, leading to persistent mismatches with task demands. Second, reliance on accumulation-heavy strategies and coarse-grained update mechanisms fails to support precise modifications, resulting in the unchecked accumulation of redundancy and errors.

To overcome the coupled challenges of adaptive retrieval control and long-term memory evolution, we propose Adaptive Memory via Multi-Agent Collaboration (AMA), as illustrated in Figures 1 and 2. Unlike prior agentic memory systems that mainly rely on a monolithic controller, AMA adopts a multi-agent design that decomposes the memory lifecycle into four functionally distinct yet interdependent roles: the Constructor, Retriever, Judge, and Refresher. Specifically, the Constructor transforms unstructured dialogue streams into hierarchical granularities, including Raw Text, Fact Knowledge, and Episode Memory, to accommodate diverse storage requirements. The Retriever acts as an adaptive gateway, dynamically routing queries to the most appropriate memory form based on current reasoning demands. To ensure consistency, the Judge serves as a logic auditor, verifying relevance to trigger feedback loops and detecting conflicts to activate the Refresher for updates. This separation of responsibilities enables fine-grained control over retrieval, verification, and memory evolution, which would be difficult to achieve within a single-agent design without entangling conflicting objectives. Extensive experiments across multiple long-term memory benchmarks demonstrate that AMA consistently outperforms strong memory baselines. By adaptively controlling retrieval granularity and explicitly maintaining memory consistency over time, AMA achieves state-of-the-art performance while reducing token consumption by up to 80% compared to using full context. Moreover, our analysis highlights the importance of the logic-driven Refresher, which plays a critical role in dynamic knowledge maintenance and enables AMA to achieve nearly 90% accuracy in knowledge update scenarios.

In summary, our main contributions are threefold:

1. We introduce a comprehensive memory paradigm featuring multi-granularity storage and adaptive routing, which incorporates logic-driven conflict detection to maintain long-term consistency and reasoning fidelity.

2. We design a unified multi-agent framework to orchestrate storage, retrieval, and maintenance, facilitating robust memory evolution in long-context applications.

3. Through extensive experiments and analysis, we demonstrate that AMA significantly outperforms state-of-the-art baselines, verifying its effectiveness and robustness in complex long-context tasks.

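Figure 2: Overview of the AMA framework. The system orchestrates four agents to achieve adaptive memory evolution. The Retriever routes inputs to the optimal granularity based on intent. The Judge audits content relevance to trigger feedback loops and detects conflicts. The Refresher performs updates or deletions to correct these inconsistencies. Finally, the Constructor synthesizes the validated context into structured memory entries.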

2.1 Memory for LLM Agents

Prior research on memory for LLM agents has explored approaches ranging from full interaction storage to system-level frameworks. These methods typically evolve from context extension to structured organization. Specifically, MemGPT focuses on context management, adopting a cache-like organization to prioritize salient information. Moving towards modularity, Mem0 abstracts memory as an independent layer dedicated to long-term management. To further enhance retrieval precision, Nemori and Zep introduce semantic structures, leveraging self-organizing events and temporal knowledge graphs, respectively. Despite their progress, these methods rely on static retrieval strategies, which limits their ability to adaptively coordinate information across different abstraction levels and task stages. Therefore, designing an adaptive memory system that can robustly support long-term interactions remains a critical challenge.

2.2 Multi-Agent System

Multi-agent systems have demonstrated clear advantages in tackling complex tasks by enabling role-based collaboration and interactive decision making. In software engineering, multi-agent approaches improve system reliability through explicit role specialization and structured workflows. In mathematical reasoning, multi-agent frameworks enhance solution accuracy via collaborative interaction and process-level verification. In parallel, a growing body of work on agentic memory focuses on improving long-term information modeling for LLM agents. While this line of research provides valuable insights into memory abstraction and maintenance, most existing approaches are built around a monolithic controller and do not explicitly leverage multi-agent collaboration. A notable recent exception is MIRIX, which explores assigning specialized agents for memory organization, but lacks dedicated mechanisms for long-term memory consistency. We did not include MIRIX as a baseline in this work because its official implementation was not publicly available during our experimental phase. Building on these complementary lines of research, our work integrates multi-agent collaboration with agentic memory design to support long-term memory for LLM agents.

3. Method

We introduce Adaptive Memory via Multi-Agent Collaboration (AMA) to address the critical challenge of aligning retrieval granularity with diverse task requirements, as well as the unchecked accumulation of logical inconsistencies. As illustrated in Figure 2, the framework operates through a coordinated multi-agent pipeline. The process begins with the Retriever, which accesses memory across multiple granularities based on the input intent. The Judge then evaluates the relevance of the retrieved content and identifies potential conflicts, triggering feedback retrieval or activating the Refresher to perform targeted memory updates when necessary. Finally, the Constructor consolidates the validated information and organizes it into memory representations at different granularities, supporting continual memory evolution. In the following sections, we present the detailed design of the Constructor, Retriever, Judge, and Refresher.

3.1 Constructor

To clearly delineate the functional roles of different memory granularities within the overall pipeline, we begin by introducing the Constructor. Its primary responsibility is to construct multi-granular memory by generating structured semantic components from the current input $u_t$, context window $W_t$, and conflict-free memory history $H_t^{*}$, conditioned on a carefully designed prompt $P_{con}$. Drawing inspiration from prior work and established linguistic theory, the Constructor decomposes natural language into stable and parsable fact templates. Specifically, it leverages five fundamental sentence patterns defined by combinations of Subject (S), Verb (V), Object (O), and Complement (C): S-V, S-V-O, S-V-C, S-V-O-O, and S-V-O-C. Through this decomposition, the Constructor simultaneously extracts a set of facts and the indices of conversation turns relevant to the current input:

$$K_t, R_t \leftarrow \mathrm{Constructor}(u_t, W_t, H_t^{*} \mid P_{con})$$

The set $K_t = \{k_{t,1}, k_{t,2}, \dots\}$ represents the structured fact knowledge parsed from the current content. We index dialogue contents by a unique identifier $D_{s:j}$, which denotes the $j$-th turn in the $s$-th session. Based on this indexing scheme, the Constructor automatically selects a subset of relevant historical turns $R_t \subseteq \{D_{s:j}\}$. In parallel, the Constructor constructs unified meta-information $\Omega_t = \{\tau_t, d_t, \mathrm{speaker}_t\}$ for the current turn $t$, where $d_t = D_{s:t}$. The timestamp $\tau_t$ encodes precise temporal information, and $\mathrm{speaker}_t \in \{\text{user}, \text{assistant}\}$. If the input contains an explicit temporal expression (e.g., dates or event times), it is directly extracted as $\tau_t$; otherwise, the current system time is assigned. This design ensures chronological consistency across multi-turn memories and facilitates time-sensitive conflict detection. Given the tuple $(u_t, K_t, R_t, \Omega_t)$, the Constructor then generates memory entries at varying granularities (Figure 3).
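
The meta-information and the timestamp fallback described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class name `MetaInfo`, the helper names, and the string form `"D{s}:{j}"` for the turn identifier are assumptions made for the example, and the extraction of an explicit temporal expression (done by the LLM in the paper) is represented by an already-parsed `explicit_time` argument.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class MetaInfo:
    """Omega_t = {tau_t, d_t, speaker_t} for one dialogue turn (names are illustrative)."""
    timestamp: datetime  # tau_t
    turn_id: str         # d_t, rendered here as "D{session}:{turn}"
    speaker: str         # "user" or "assistant"


def turn_id(session: int, turn: int) -> str:
    """Format the identifier D_{s:j}: the j-th turn of the s-th session."""
    return f"D{session}:{turn}"


def build_meta(session: int, turn: int, speaker: str,
               explicit_time: Optional[datetime] = None,
               now: Optional[datetime] = None) -> MetaInfo:
    """Prefer an explicit time from the input; otherwise fall back to system time."""
    if speaker not in ("user", "assistant"):
        raise ValueError(f"unknown speaker: {speaker!r}")
    if explicit_time is not None:
        tau = explicit_time
    else:
        # `now` is injectable only to make the sketch testable.
        tau = now if now is not None else datetime.now()
    return MetaInfo(timestamp=tau, turn_id=turn_id(session, turn), speaker=speaker)
```

Keeping the timestamp on every entry is what later allows a conflict check to decide which of two contradictory statements is the more recent one.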

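Figure 3: Memory construction stage. In this stage, the Constructor generates Raw Text and Fact Knowledge memories from the utterance, and conditionally synthesizes abstract episodes when the trigger is activated.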

Raw Text Memory. This component records the content of the current turn in its original form $u_t$, together with the reference information ($R_t$ and $\Omega_t$) generated by the Constructor. Formally, we define $m_t^{raw} = \{u_t, R_t, \Omega_t\}$. This granularity preserves the fundamental conversational trajectory, ensuring both data traceability and retrieval flexibility.

Fact Knowledge Memory. Each extracted fact is treated as an independent memory unit. Accordingly, we define Fact Knowledge Memory as $m_{t,i}^{fact} = \{k_{t,i}, R_t, \Omega_t\}$ with $k_{t,i} \in K_t$. By transforming unstructured text into structured knowledge units, Fact Knowledge Memory enables associative retrieval, facilitates conflict detection, and supports the long-term accumulation and refinement of knowledge within the AMA framework.
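
As a rough illustration of the two entry types defined so far, the sketch below builds one Raw Text entry and one Fact Knowledge entry per extracted fact, all sharing the same references and meta-information. The class and field names are hypothetical, and tagging each fact with its sentence pattern is an assumption of this example rather than something the text specifies.

```python
from dataclasses import dataclass
from typing import List, Mapping, Sequence, Tuple

# The five sentence patterns named in the text.
SENTENCE_PATTERNS = ("S-V", "S-V-O", "S-V-C", "S-V-O-O", "S-V-O-C")


@dataclass(frozen=True)
class Fact:
    """One extracted fact k_{t,i}; the `pattern` tag is an illustrative assumption."""
    pattern: str
    text: str

    def __post_init__(self) -> None:
        if self.pattern not in SENTENCE_PATTERNS:
            raise ValueError(f"unknown sentence pattern: {self.pattern!r}")


@dataclass(frozen=True)
class RawTextMemory:
    """m_t^raw = {u_t, R_t, Omega_t}."""
    utterance: str
    related_turns: Tuple[str, ...]
    meta: Mapping[str, str]


@dataclass(frozen=True)
class FactMemory:
    """m_{t,i}^fact = {k_{t,i}, R_t, Omega_t}."""
    fact: Fact
    related_turns: Tuple[str, ...]
    meta: Mapping[str, str]


def build_entries(utterance: str, facts: Sequence[Fact],
                  related_turns: Sequence[str],
                  meta: Mapping[str, str]) -> Tuple[RawTextMemory, List[FactMemory]]:
    """Create one raw entry plus one independent entry per fact."""
    refs = tuple(related_turns)
    raw = RawTextMemory(utterance, refs, meta)
    fact_entries = [FactMemory(f, refs, meta) for f in facts]
    return raw, fact_entries
```

Because every fact is its own entry, a later update or deletion can target a single fact without touching the raw utterance it came from.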

Episode Memory. It is designed to capture high-level abstractions across turns. Following a gatekeeping mechanism inspired by prior work, we introduce a trigger function with prompt $P_{tri}$ to determine a binary activation state $T_t \in \{0, 1\}$. The trigger is activated under three conditions: detection of a topic shift, an explicit user request, or saturation of the context window threshold. This generation process is formalized as $T_t = \mathrm{Constructor}(u_t, W_t \mid P_{tri})$. When activated ($T_t = 1$), the Constructor employs a dedicated prompt $P_{epi}$ to synthesize an abstract summary $E_t$, which directly constitutes the Episode Memory entry:

$$m_t^{epi} = E_t = \mathrm{Constructor}(u_t, W_t \mid P_{epi})$$
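
The three activation conditions can be read as a simple disjunction. In the paper the decision is made by the LLM under the trigger prompt; the sketch below replaces that call with precomputed signals, so `TriggerSignals` and its fields are assumptions introduced only to show the gating logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerSignals:
    """Stand-ins for what the LLM would judge under the trigger prompt (hypothetical)."""
    topic_shift: bool      # a topic shift was detected
    user_requested: bool   # the user explicitly asked for a summary
    window_tokens: int     # current size of the context window
    window_limit: int      # saturation threshold for the window


def episode_trigger(sig: TriggerSignals) -> int:
    """Return T_t in {0, 1}: 1 if any of the three conditions holds."""
    saturated = sig.window_tokens >= sig.window_limit
    return int(sig.topic_shift or sig.user_requested or saturated)
```

Only when the returned value is 1 would an episode summary be synthesized; otherwise the turn contributes only Raw Text and Fact Knowledge entries.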

Memory Encoding. To support efficient retrieval across memory granularities, we compute a dense vector representation for each memory entry based on its primary semantic content. For a memory entry $m_i$, we extract its core text $c_i$ (i.e., the raw utterance $u_t$, the fact $k_{t,i}$, or the episode summary $E_t$) and encode it into a high-dimensional embedding using a text encoder: $e_i = f_{enc}(c_i)$. These embeddings serve as keys for the granularity-specific retrieval mechanisms, which are detailed in Section 3.2.

3.2 Retriever

The Retriever functions as the memory access gateway within the AMA framework. Its primary role is to dynamically route queries to the most appropriate memory granularity. To address referential ambiguity and missing context commonly observed in raw dialogue, the Retriever first rewrites the query into a self-contained form and then performs adaptive retrieval based on multi-dimensional intent analysis.

Query Rewriting and Intent Routing. Given the current input $u_t$ and the context window $W_t$, the Retriever employs a dedicated prompt $P_{ret}$ to guide the LLM in simultaneously generating three outputs: a context-independent rewritten query $u_t'$, a four-dimensional binary intent vector $B$, and a dynamic retrieval count $K_{dyn}$:

$$u_t', B, K_{dyn} \leftarrow \mathrm{Retriever}(u_t, W_t \mid P_{ret})$$

The rewritten query $u_t'$ resolves ambiguous references and omissions, making it suitable for retrieval. The intent vector $B = [b_{fine}, b_{abs}, b_{event}, b_{atomic}]$ encodes the activation of four query dimensions: fine-grained details, abstract summaries, cross-temporal events, and atomic facts. Based on $B$, a routing function $f_M$ dynamically selects the appropriate retrieval operator $O$. The mapping $O = f_M(B)$ is determined by priority:

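$$O = f_M(B) = \begin{cases} O_{raw}, & \text{if } b_{fine} = 1, \\ O_{epi}, & \text{if } b_{abs} = 1 \lor b_{event} = 1, \\ O_{fact}, & \text{otherwise.} \end{cases}$$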

This routing strategy explicitly prioritizes specialized retrieval intents. When fine-grained detail is required ($b_{fine} = 1$), the Retriever chooses Raw Text Memory for precise phrasing. When abstract or event-level understanding is needed ($b_{abs} = 1$ or $b_{event} = 1$), Episode Memory is queried to obtain high-level semantic representations. In all other cases, the Retriever defaults to Fact Knowledge Memory to access structured information.
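
The priority rule translates directly into an ordered chain of checks. The following is a minimal sketch of that rule; the `Intent` tuple and the string labels for the three repositories are naming choices made for the example.

```python
from typing import NamedTuple


class Intent(NamedTuple):
    """The binary intent vector B = [b_fine, b_abs, b_event, b_atomic]."""
    fine: int
    abstract: int
    event: int
    atomic: int


def route(b: Intent) -> str:
    """Pick the retrieval target by priority: raw text, then episodes, then facts."""
    if b.fine == 1:
        return "raw"
    if b.abstract == 1 or b.event == 1:
        return "episode"
    # Fact Knowledge Memory is the default, whether or not b_atomic is set.
    return "fact"
```

Note that under this rule a query flagged as both fine-grained and abstract is sent to Raw Text Memory, because the fine-grained check comes first.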

Similarity-based Retrieval. Once the target memory repository $M$ is determined, we compute the cosine similarity between the embedding of the rewritten query $u_t'$ and the pre-computed embedding $e_i$ of each memory entry $m_i$ within the selected repository $M$, defined as

$$s_i = \cos\big(f_{enc}(u_t'), e_i\big) = \frac{f_{enc}(u_t') \cdot e_i}{\lVert f_{enc}(u_t') \rVert \, \lVert e_i \rVert}$$

Memory entries are ranked by their similarity scores, with the Top-$K$ entries forming the final retrieval set. To prevent the predicted $K_{dyn}$ from being too small to capture sufficient information, we enforce a minimum threshold $K_m$ and set the effective cutoff as $K = \max(K_{dyn}, K_m)$, where $K_{dyn}$ is dynamically predicted by the Retriever. Accordingly, the candidate memory set $H_t$ is obtained as $H_t = \mathrm{TopK}(\{m_i\}_{i=1}^{|M|}, \mathrm{key} = s_i)$. The resulting set $H_t$ is then passed to the Judge for verification.
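
A small, dependency-free sketch of this ranking step is shown below. It assumes embeddings are already available as plain lists of floats (the paper computes them with a text encoder); the function names and the `(entry_id, embedding)` pair layout are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_emb: Sequence[float],
          entries: Sequence[Tuple[str, Sequence[float]]],
          k_dyn: int, k_min: int) -> List[Tuple[str, float]]:
    """Rank entries by similarity and keep K = max(K_dyn, K_m) of them."""
    k = max(k_dyn, k_min)
    scored = [(entry_id, cosine(query_emb, emb)) for entry_id, emb in entries]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

For example, with `k_dyn=1` and `k_min=2` the cutoff is 2, so the floor protects against an overly small predicted count.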

3.3 Judge

While the Retriever recalls a candidate memory set $H_t$ based on vector similarity, directly injecting unverified memories may introduce noise or amplify hallucinations. To ensure robustness and reliability, the Judge acts as a dynamic filter, performing a sequential dual-verification to refine $H_t$ into a validated set $H_t^{*}$, guided by the prompt $P_{jud}$.

Relevance Assessment. The Judge first evaluates the pragmatic utility of the retrieved content with respect to the current input $u_t$. To optimize the utilization of $H_t$, we incorporate a relevance-based rejection mechanism. If the density of valid information in $H_t$ falls below a predefined threshold, the system triggers a Retry action. This issues a feedback signal to the Retriever, prompting it to traverse remaining memory granularities or perform retrieval expansion using relation indices $R_t$ to broaden the scope. This feedback loop is bounded by the retrieval round limit $K_r$. Upon successfully passing this check, the retained relevant memory set is denoted as $H_r$, which serves as the input for the conflict detection phase.
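
The bounded feedback loop can be sketched as follows. Several details are assumptions made for the example rather than statements from the paper: density is taken to be the fraction of retrieved entries judged relevant, the round limit is counted over all retrieval rounds including the first, only the "traverse remaining granularities" branch is shown (relation-index expansion is omitted), and the relevance check is passed in as a callable instead of an LLM call.

```python
from typing import Callable, List, Sequence, Tuple


def retrieve_with_retry(granularities: Sequence[str],
                        retrieve: Callable[[str], List[str]],
                        is_relevant: Callable[[str], bool],
                        min_density: float,
                        max_rounds: int) -> Tuple[List[str], int]:
    """Try granularities in order until relevance density is high enough.

    Returns the relevant subset from the last round tried and the rounds used.
    """
    relevant: List[str] = []
    rounds = 0
    for granularity in granularities[:max_rounds]:
        rounds += 1
        candidates = retrieve(granularity)
        relevant = [m for m in candidates if is_relevant(m)]
        density = len(relevant) / len(candidates) if candidates else 0.0
        if density >= min_density:
            break  # relevance check passed; no Retry needed
    return relevant, rounds
```

Capping the number of rounds keeps the cost of a poorly matched query bounded, at the price of occasionally proceeding with sparse evidence.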

Conflict Detection. Subsequently, the Judge conducts logical consistency checks to identify contradictions between the current input $u_t$ and the filtered memory $H_r$. Typical conflicts include outdated facts that contradict updated user status. When detecting such inconsistencies, the Judge isolates a conflict set $C_{err}$, which comprises the specific memory entries identified as contradictory, and triggers a Refresh action to activate the Refresher for targeted updates. In the absence of conflicts, the filtered memory $H_r$ is directly instantiated as the validated memory set $H_t^{*}$, ready for downstream utilization. The overall verification process is formalized as:

$$H_t^{*}, C_{err}, \mathrm{Action} \leftarrow \mathrm{Judge}(u_t, H_t, W_t \mid P_{jud})$$

where $\mathrm{Action} \in \{\mathrm{Pass}, \mathrm{Retry}, \mathrm{Refresh}\}$ dictates the system flow based on the verification outcome:

1. $\mathrm{Action} = \mathrm{Retry}$ is triggered when retrieval relevance is insufficient;

2. $\mathrm{Action} = \mathrm{Refresh}$ is triggered when a non-empty conflict set $C_{err}$ is detected, detailed in Section 3.4;

3. $\mathrm{Action} = \mathrm{Pass}$ occurs when memories are both relevant and consistent, forwarding the validated set $H_t^{*}$ to the Constructor for memory synthesis and to the downstream agent for generating the final response.
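
The three outcomes listed above follow the order of the two checks: relevance first, then consistency. The sketch below encodes just that decision; the inputs (`relevance_ok` and the conflict set) stand in for what the Judge would produce with an LLM call, and the names are illustrative.

```python
from enum import Enum
from typing import Sequence


class Action(Enum):
    PASS = "Pass"
    RETRY = "Retry"
    REFRESH = "Refresh"


def decide(relevance_ok: bool, conflict_set: Sequence[str]) -> Action:
    """Map the two verification results to the Judge's action."""
    if not relevance_ok:
        return Action.RETRY      # not enough relevant evidence yet
    if len(conflict_set) > 0:
        return Action.REFRESH    # non-empty C_err: hand over to the Refresher
    return Action.PASS           # relevant and consistent
```

Because conflict detection runs only on the relevance-filtered set, a failed relevance check takes precedence over any conflicts that might exist in the discarded candidates.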

3.4 Refresher

Drawing inspiration from prior studies on dynamic memory maintenance, we introduce the Refresher to ensure the logical validity of memory storage. This component is triggered exclusively when the Judge detects a conflict set $C_{err}$. Guided by a dedicated prompt $P_{ref}$, the Refresher follows a strict conditional branching strategy to resolve detected inconsistencies.

Delete. This operation is triggered only under two strict conditions: (1) in response to explicit user instructions to forget specific information, and (2) when the lifespan of a conflicting memory entry exceeds a predefined maximum retention limit. In such cases, the system permanently removes the entry, freeing storage space.

Update. For all remaining conflict scenarios, the Refresher defaults to an update operation. Specifically, it performs a state modification $m_i \leftarrow U(m_i, u_t)$, which selectively adjusts the attributes of $m_i$ to align with the latest state implied by the current input $u_t$ (e.g., updating outdated location data), thereby rectifying logical contradictions while preserving memory continuity. The process yields a consistent memory state $H_t^{*} \leftarrow \mathrm{Refresher}(C_{err}, H_t \mid P_{ref})$. This conflict-free set $H_t^{*}$ is then immediately routed to the Constructor for memory synthesis and the downstream agent to ensure reliable response generation.
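
The delete-or-update branching can be illustrated with a short sketch. It is a simplification under stated assumptions: the `Entry` class and its fields are hypothetical, the updated content is supplied directly instead of being generated by the LLM, and "preserving memory continuity" is modeled by keeping the entry's identifier and creation time unchanged.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Entry:
    """A conflicting memory entry (field names are illustrative)."""
    entry_id: str
    content: str
    created: datetime


def refresh(conflicts: List[Entry], new_content: str, now: datetime,
            max_age: timedelta, forget_requested: bool = False) -> List[Entry]:
    """Delete entries on a forget request or expiry; update all others."""
    resolved: List[Entry] = []
    for entry in conflicts:
        expired = (now - entry.created) > max_age
        if forget_requested or expired:
            continue  # Delete: the entry is dropped permanently
        # Update: keep identity and creation time, replace the stale content.
        resolved.append(replace(entry, content=new_content))
    return resolved
```

Defaulting to update rather than delete means a contradicted fact such as an old city of residence is overwritten in place, so references to the entry from other turns remain valid.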

4. Experiment

4.1 Experimental Setup

Datasets and Metrics. We evaluate long-term memory capabilities on two established benchmarks: LoCoMo and LongMemEval_s. Detailed statistics for both datasets are provided in Appendix A.1. For LoCoMo, we report F1 and BLEU-1 scores in addition to the LLM Score. For LongMemEval_s, we specifically select the more challenging Pass@1 accuracy evaluated by an LLM judge to rigorously test performance. Following Achiam et al. and Maharana et al., we employ GPT-4o-mini as the unified judge for all model-based evaluations.
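
For readers unfamiliar with the F1 metric in question answering, the sketch below shows one common token-overlap definition. Whether the evaluation here applies exactly this tokenization and normalization is not stated in this section, so the function should be read as an illustration of the general idea only.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Uses lowercasing and whitespace tokenization, which is an assumption.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers match; one empty answer scores zero.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Surface-overlap scores of this kind penalize correct paraphrases, which is one reason an LLM-judged score is reported alongside them.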

Baselines. We compare AMA with various baselines, starting with FullContext and a standard Retrieval-Augmented Generation (RAG) pipeline implemented with 2048-token chunks. We then evaluate representative memory frameworks, including LangMem, MemGPT, Zep, A-Mem, Mem0, and Nemori.

Implementation Details. We conduct experiments using both closed-source APIs (GPT-4o-mini, GPT-4.1-mini) and open-source models (Qwen3-8B-Instruct, Qwen3-30B-Instruct), ensuring that the AMA framework uses the same backbone model as the response generator. To ensure reproducibility, we fix the temperature to 0 for all experiments. For RAG, the retrieval top-k is set to 10, while for AMA the maximum number of retrieval rounds $K_r$ is set to 2. All memory embeddings are computed using OpenAI's text-embedding-3-large model. Due to the commercial nature of Zep, Mem0, and MemGPT, we exclude these frameworks from evaluations involving open-source models. Additionally, given the substantial scale of LongMemEval_s (approximately 58M tokens), we restrict its evaluation exclusively to GPT-4o-mini for computational feasibility. Prompts used in AMA and descriptions of the baselines are provided in Appendix A and Appendix B.

4.2 Main Results

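Table 1: Main results on the LoCoMo benchmark. We compare AMA with representative memory baselines across four backbone models. LLM denotes the LLM Score and B1 denotes BLEU-1.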
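
| Backbone | Method | Single-Hop LLM | Single-Hop F1 | Single-Hop B1 | Multi-Hop LLM | Multi-Hop F1 | Multi-Hop B1 | Temporal Reasoning LLM | Temporal Reasoning F1 | Temporal Reasoning B1 | Open Domain LLM | Open Domain F1 | Open Domain B1 | Overall LLM | Overall F1 | Overall B1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o-mini | FullContext | 0.823 | 0.526 | 0.443 | 0.663 | 0.351 | 0.259 | 0.557 | 0.437 | 0.358 | 0.482 | 0.243 | 0.171 | 0.717 | 0.458 | 0.375 |
| GPT-4o-mini | RAG | 0.318 | 0.220 | 0.185 | 0.311 | 0.184 | 0.116 | 0.235 | 0.194 | 0.156 | 0.323 | 0.189 | 0.134 | 0.300 | 0.206 | 0.163 |
| GPT-4o-mini | LangMem | 0.616 | 0.389 | 0.332 | 0.526 | 0.336 | 0.240 | 0.250 | 0.321 | 0.263 | 0.477 | 0.295 | 0.236 | 0.515 | 0.359 | 0.295 |
| GPT-4o-mini | MemGPT | 0.479 | 0.249 | 0.186 | 0.229 | 0.112 | 0.111 | 0.388 | 0.326 | 0.261 | 0.500 | 0.308 | 0.255 | 0.415 | 0.244 | 0.192 |
| GPT-4o-mini | Zep | 0.627 | 0.394 | 0.334 | 0.501 | 0.273 | 0.192 | 0.584 | 0.444 | 0.378 | 0.393 | 0.227 | 0.156 | 0.580 | 0.372 | 0.307 |
| GPT-4o-mini | A-Mem | 0.518 | 0.270 | 0.201 | 0.248 | 0.121 | 0.120 | 0.546 | 0.459 | 0.367 | 0.541 | 0.333 | 0.276 | 0.476 | 0.286 | 0.225 |
| GPT-4o-mini | Mem0 | 0.676 | 0.441 | 0.340 | 0.598 | 0.340 | 0.250 | 0.500 | 0.440 | 0.373 | 0.403 | 0.269 | 0.192 | 0.608 | 0.412 | 0.339 |
| GPT-4o-mini | Nemori | 0.817 | 0.541 | 0.430 | 0.650 | 0.363 | 0.255 | 0.707 | 0.564 | 0.463 | 0.446 | 0.207 | 0.150 | 0.740 | 0.492 | 0.383 |
| GPT-4o-mini | AMA | 0.849 | 0.622 | 0.548 | 0.681 | 0.423 | 0.303 | 0.748 | 0.589 | 0.461 | 0.479 | 0.283 | 0.228 | 0.774 | 0.558 | 0.465 |
| GPT-4.1-mini | FullContext | 0.848 | 0.599 | 0.521 | 0.753 | 0.431 | 0.329 | 0.723 | 0.463 | 0.390 | 0.552 | 0.277 | 0.217 | 0.786 | 0.519 | 0.439 |
| GPT-4.1-mini | RAG | 0.351 | 0.252 | 0.215 | 0.309 | 0.196 | 0.125 | 0.267 | 0.217 | 0.186 | 0.281 | 0.174 | 0.136 | 0.321 | 0.229 | 0.187 |
| GPT-4.1-mini | LangMem | 0.824 | 0.498 | 0.425 | 0.693 | 0.405 | 0.317 | 0.496 | 0.473 | 0.399 | 0.575 | 0.320 | 0.257 | 0.717 | 0.464 | 0.390 |
| GPT-4.1-mini | MemGPT | 0.490 | 0.255 | 0.190 | 0.234 | 0.115 | 0.114 | 0.397 | 0.334 | 0.267 | 0.512 | 0.315 | 0.261 | 0.425 | 0.250 | 0.196 |
| GPT-4.1-mini | Zep | 0.652 | 0.444 | 0.390 | 0.523 | 0.297 | 0.199 | 0.587 | 0.233 | 0.195 | 0.427 | 0.236 | 0.188 | 0.601 | 0.360 | 0.301 |
| GPT-4.1-mini | A-Mem | 0.530 | 0.276 | 0.206 | 0.254 | 0.124 | 0.123 | 0.559 | 0.470 | 0.375 | 0.553 | 0.341 | 0.282 | 0.487 | 0.293 | 0.230 |
| GPT-4.1-mini | Mem0 | 0.697 | 0.474 | 0.410 | 0.666 | 0.391 | 0.295 | 0.555 | 0.382 | 0.323 | 0.468 | 0.232 | 0.173 | 0.647 | 0.424 | 0.356 |
| GPT-4.1-mini | Nemori | 0.828 | 0.573 | 0.502 | 0.732 | 0.407 | 0.311 | 0.757 | 0.562 | 0.490 | 0.498 | 0.252 | 0.188 | 0.774 | 0.521 | 0.445 |
| GPT-4.1-mini | AMA | 0.888 | 0.636 | 0.570 | 0.716 | 0.453 | 0.324 | 0.764 | 0.608 | 0.484 | 0.514 | 0.333 | 0.251 | 0.805 | 0.580 | 0.492 |
| Qwen3-30B-Instruct | FullContext | 0.845 | 0.528 | 0.453 | 0.665 | 0.369 | 0.273 | 0.555 | 0.443 | 0.369 | 0.551 | 0.273 | 0.185 | 0.733 | 0.466 | 0.390 |
| Qwen3-30B-Instruct | RAG | 0.327 | 0.221 | 0.188 | 0.312 | 0.193 | 0.122 | 0.234 | 0.197 | 0.160 | 0.368 | 0.219 | 0.146 | 0.307 | 0.209 | 0.169 |
| Qwen3-30B-Instruct | LangMem | 0.633 | 0.390 | 0.337 | 0.529 | 0.353 | 0.254 | 0.253 | 0.325 | 0.270 | 0.545 | 0.341 | 0.249 | 0.525 | 0.365 | 0.309 |
| Qwen3-30B-Instruct | A-Mem | 0.518 | 0.270 | 0.201 | 0.248 | 0.121 | 0.120 | 0.546 | 0.459 | 0.367 | 0.541 | 0.333 | 0.276 | 0.476 | 0.286 | 0.225 |
| Qwen3-30B-Instruct | Nemori | 0.839 | 0.543 | 0.439 | 0.652 | 0.381 | 0.269 | 0.704 | 0.571 | 0.477 | 0.510 | 0.239 | 0.163 | 0.756 | 0.500 | 0.399 |
| Qwen3-30B-Instruct | AMA | 0.872 | 0.625 | 0.560 | 0.703 | 0.445 | 0.318 | 0.751 | 0.597 | 0.475 | 0.505 | 0.327 | 0.247 | 0.791 | 0.570 | 0.483 |
| Qwen3-8B-Instruct | FullContext | 0.803 | 0.502 | 0.430 | 0.632 | 0.351 | 0.259 | 0.527 | 0.421 | 0.351 | 0.523 | 0.259 | 0.176 | 0.696 | 0.443 | 0.371 |
| Qwen3-8B-Instruct | RAG | 0.317 | 0.214 | 0.182 | 0.303 | 0.187 | 0.118 | 0.227 | 0.191 | 0.155 | 0.357 | 0.212 | 0.142 | 0.298 | 0.203 | 0.164 |
| Qwen3-8B-Instruct | LangMem | 0.617 | 0.380 | 0.328 | 0.515 | 0.344 | 0.247 | 0.246 | 0.317 | 0.263 | 0.531 | 0.332 | 0.243 | 0.512 | 0.356 | 0.301 |
| Qwen3-8B-Instruct | A-Mem | 0.504 | 0.263 | 0.196 | 0.241 | 0.118 | 0.117 | 0.532 | 0.447 | 0.357 | 0.527 | 0.324 | 0.269 | 0.464 | 0.279 | 0.219 |
| Qwen3-8B-Instruct | Nemori | 0.762 | 0.493 | 0.399 | 0.592 | 0.346 | 0.244 | 0.639 | 0.519 | 0.433 | 0.463 | 0.217 | 0.148 | 0.686 | 0.454 | 0.362 |
| Qwen3-8B-Instruct | AMA | 0.780 | 0.559 | 0.500 | 0.628 | 0.398 | 0.284 | 0.671 | 0.533 | 0.424 | 0.482 | 0.292 | 0.221 | 0.707 | 0.510 | 0.432 |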
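Table 2: Performance breakdown on LongMemEval_s. We report per-category results for AMA and the memory baselines across six question types, evaluated with GPT-4o-mini as the backbone.

| Question Type | Full-ctx | RAG | LangMem | A-Mem | MemGPT | Mem0 | Zep | Nemori | AMA |
|---|---|---|---|---|---|---|---|---|---|
| single-session-pref. | 0.300 | 0.333 | 0.267 | 0.367 | 0.200 | 0.333 | 0.533 | 0.467 | 0.467 |
| single-session-asst. | 0.818 | 0.714 | 0.777 | 0.804 | 0.857 | 0.875 | 0.750 | 0.839 | 0.964 |
| temporal-reasoning | 0.365 | 0.280 | 0.353 | 0.398 | 0.451 | 0.399 | 0.541 | 0.617 | 0.444 |
| multi-session | 0.406 | 0.254 | 0.424 | 0.451 | 0.549 | 0.481 | 0.474 | 0.511 | 0.624 |
| knowledge-update | 0.769 | 0.385 | 0.578 | 0.602 | 0.410 | 0.654 | 0.744 | 0.615 | 0.897 |
| single-session-user | 0.814 | 0.686 | 0.750 | 0.750 | 0.814 | 0.857 | 0.929 | 0.886 | 0.986 |
| Average | 0.548 | 0.374 | 0.528 | 0.538 | 0.554 | 0.574 | 0.632 | 0.642 | 0.698 |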

We first evaluate AMA on the LoCoMo benchmark using closed-source backbones. As shown in Table 1, with GPT-4o-mini, AMA achieves an overall LLM Score of 0.774, outperforming the strongest baseline, Nemori (0.740), by 0.034 and every other memory-based method by a wider margin. When scaled to the more capable GPT-4.1-mini, AMA further improves to 0.805. Notably, under this setting, AMA is the only approach that surpasses FullContext (0.786), suggesting that AMA effectively distills raw history into critical facts and episodes, thereby filtering out noise to support reasoning beyond the raw context window.

We further assess robustness by extending the evaluation to open-source models on LoCoMo. With Qwen3-30B-Instruct, AMA attains an LLM Score of 0.791, exceeding FullContext (0.733) by 0.058 and Nemori (0.756) by 0.035. This advantage persists even with the smaller Qwen3-8B-Instruct backbone, where AMA (0.707) continues to outperform FullContext (0.696). These results indicate that AMA consistently improves complex reasoning performance across backbones of varying capacity.
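The margins quoted above can be recomputed directly from the overall LLM Score column of Table 1. The snippet below is a small illustrative check, not part of the paper's code; the dictionary layout and the `margins` helper are assumptions made for this example, while the numbers are copied from the table.

```python
# Overall LLM Scores on LoCoMo, copied from the last column group of Table 1.
overall = {
    "GPT-4o-mini": {"FullContext": 0.717, "Nemori": 0.740, "AMA": 0.774},
    "GPT-4.1-mini": {"FullContext": 0.786, "Nemori": 0.774, "AMA": 0.805},
    "Qwen3-30B-Instruct": {"FullContext": 0.733, "Nemori": 0.756, "AMA": 0.791},
    "Qwen3-8B-Instruct": {"FullContext": 0.696, "Nemori": 0.686, "AMA": 0.707},
}


def margins(scores):
    """Return AMA's lead over every other method, rounded to three decimals."""
    return {
        name: round(scores["AMA"] - value, 3)
        for name, value in scores.items()
        if name != "AMA"
    }


for backbone, scores in overall.items():
    print(backbone, margins(scores))
```

Laid out this way, the lead over FullContext is largest on GPT-4o-mini and Qwen3-30B-Instruct and narrowest on Qwen3-8B-Instruct.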

Finally, we evaluate the generalization of AMA on another benchmark, LongMemEval_s (Table 2). AMA again achieves the highest average accuracy (0.698), outperforming the two strongest baselines, Nemori and Zep, by 0.056 and 0.066, respectively. AMA leads in four of the six question types, attaining near-perfect accuracy on single-session-user tasks (0.986) and a pronounced advantage on knowledge-update tasks (0.897), where dynamic knowledge maintenance and conflict resolution are critical. The exceptions are temporal-reasoning, where several baselines, led by Nemori (0.617), score higher than AMA (0.444), and single-session-pref., where Zep (0.533) leads. The gains in average accuracy across benchmarks with different data distributions indicate that AMA generalizes to diverse long-term reasoning scenarios while supporting dynamic knowledge evolution.

4.3 Ablation Studies

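Table 3: Ablation on memory design. RT, FK, EP, and RF denote Raw Text Memory, Fact Knowledge Memory, Episode Memory, and Refresher, respectively; × marks a disabled component.

| RT / FK / EP / RF | LoCoMo LLM Score | LoCoMo F1 | LoCoMo BLEU-1 | LongMemEval_s Knowledge-update | LongMemEval_s Average |
|---|---|---|---|---|---|
| × × | 0.669 | 0.484 | 0.401 | 0.767 | 0.614 |
| × × | 0.712 | 0.502 | 0.424 | 0.804 | 0.642 |
| × × | 0.688 | 0.495 | 0.407 | 0.748 | 0.630 |
| × | 0.752 | 0.538 | 0.448 | 0.863 | 0.686 |
| × | 0.725 | 0.515 | 0.428 | 0.804 | 0.656 |
| × | 0.741 | 0.529 | 0.435 | 0.842 | 0.680 |
| | 0.774 | 0.558 | 0.465 | 0.897 | 0.710 |
| × | 0.771 | 0.556 | 0.462 | 0.568 | 0.634 |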

We conduct ablation studies on LoCoMo and LongMemEval_s, with results in Table 3.

Impact of Memory Granularities. We first analyze the contribution of individual memory granularities. Under single-granularity settings, Fact Knowledge Memory performs best, achieving an LLM Score of 0.712 on LoCoMo and the highest average accuracy of 0.642 on LongMemEval_s, indicating the effectiveness of structured factual representations for long-term retrieval and reasoning. Moreover, jointly enabling Raw Text, Fact Knowledge, and Episodic Memory yields the strongest overall performance across both benchmarks, outperforming any single-granularity configuration. This result highlights the complementary nature of different memory forms and underscores the importance of multi-granularity collaboration.

Effectiveness of the Refresher. Beyond memory representation, we evaluate the role of the Refresher in maintaining long-term consistency. Under the full multi-granularity setting, enabling the Refresher yields an accuracy of 0.897 on knowledge-update scenarios in LongMemEval_s. Removing it causes a sharp drop to 0.568, a loss of 0.329, while the LoCoMo LLM Score barely changes (0.774 vs. 0.771). This pronounced degradation on knowledge-update questions indicates that accurate long-term memory requires not only multi-granularity storage, but also explicit mechanisms for conflict resolution and memory updating.
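To make the idea of conflict resolution concrete, here is a toy sketch of a targeted update. It is not the Refresher described in the paper, which relies on an LLM agent triggered by the Judge; the `Fact` record, the `(subject, attribute)` key, and the "newer turn wins" rule are all assumptions chosen only to illustrate why stale entries hurt knowledge-update questions.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Fact:
    subject: str
    attribute: str
    value: str
    turn: int  # position in the dialogue; a larger value means more recent


def refresh(memory: Dict[Tuple[str, str], Fact], new_fact: Fact) -> str:
    """Insert `new_fact`, replacing an older conflicting entry if one exists."""
    key = (new_fact.subject, new_fact.attribute)
    old = memory.get(key)
    if old is None:
        memory[key] = new_fact
        return "added"
    if old.value == new_fact.value:
        return "unchanged"
    if new_fact.turn >= old.turn:
        # Conflict: keep only the more recent statement (hypothetical rule).
        memory[key] = new_fact
        return "updated"
    return "ignored"


memory: Dict[Tuple[str, str], Fact] = {}
print(refresh(memory, Fact("Alice", "city", "Paris", turn=3)))
print(refresh(memory, Fact("Alice", "city", "Lyon", turn=12)))
print(memory[("Alice", "city")].value)
```

Without the replacement step, both "Paris" and "Lyon" would remain retrievable, and a question about Alice's current city could be answered from the outdated entry.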

4.4 Efficiency Analysis

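Table 4: Efficiency-performance trade-off on LoCoMo. We report input token usage, inference latency, and LLM Score for each method.

| Method | Tokens | Latency (s) | LLM Score |
|---|---|---|---|
| FullContext | 18625 | 7.206 | 0.717 |
| RAG | 5800 | 2.983 | 0.300 |
| Nemori | 2925 | 3.152 | 0.740 |
| Zep | 2461 | 3.255 | 0.580 |
| Mem0 | 1340 | 3.739 | 0.608 |
| A-Mem | 2720 | 3.227 | 0.476 |
| AMA (Kr = 1) | 2491 | 3.124 | 0.723 |
| AMA (Kr = 2) | 3613 | 3.910 | 0.774 |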

We evaluate the efficiency-performance trade-off of AMA on the LoCoMo benchmark, with results reported in Table 4. Compared to FullContext, which processes 18625 input tokens with a latency of 7.21 seconds, AMA substantially reduces input length while maintaining strong performance. With the default setting Kr=2, AMA requires only 3613 tokens, approximately 19% of FullContext, with a latency of 3.91 seconds, while achieving the highest LLM Score (0.774) among all compared memory frameworks. This configuration represents a favorable balance between efficiency and accuracy. Even under the more efficient setting (Kr=1), AMA attains an LLM Score of 0.723, second only to Nemori (0.740), while operating within a latency range comparable to Nemori, Zep, Mem0, and A-Mem. These results show that AMA maintains competitive reasoning capability at low retrieval depth and offers a flexible trade-off between computational efficiency and performance.
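The relative costs quoted above follow from simple ratios over Table 4. The snippet below is an illustrative check; the `table4` dictionary and the `relative_cost` helper are names invented for this example, and the figures are copied from the table.

```python
# (input tokens, latency in seconds, LLM Score), copied from Table 4.
table4 = {
    "FullContext": (18625, 7.206, 0.717),
    "AMA (Kr=1)": (2491, 3.124, 0.723),
    "AMA (Kr=2)": (3613, 3.910, 0.774),
}


def relative_cost(method, baseline="FullContext"):
    """Return (token ratio, latency ratio) of `method` relative to `baseline`."""
    tokens, latency, _ = table4[method]
    base_tokens, base_latency, _ = table4[baseline]
    return tokens / base_tokens, latency / base_latency


for name in ("AMA (Kr=1)", "AMA (Kr=2)"):
    token_ratio, latency_ratio = relative_cost(name)
    print(f"{name}: {token_ratio:.1%} of tokens, {latency_ratio:.1%} of latency")
```

For Kr=2 the token ratio comes out at roughly 19.4%, which matches the "approximately 19%" figure in the text and the roughly 80% reduction stated in the abstract.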

4.5 Analysis of the Retrieval Round Limit Kr

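Figure 4: Effect of the retrieval round limit Kr. The left and middle panels (effect of the retrieval round limit on LoCoMo and on LongMemEval_s) show that increasing Kr improves performance on both benchmarks with diminishing returns; the right panel (token and latency cost of the retrieval round limit) shows the corresponding growth in token consumption and inference latency.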

Figure 4 analyzes the impact of the retrieval round limit Kr on both model performance and computational cost. Increasing Kr from 1 to 3 yields consistent performance gains on LoCoMo and LongMemEval_s, suggesting that additional retrieval rounds progressively surface useful historical information for long-term reasoning. However, the improvement exhibits clear diminishing returns, with performance largely saturating beyond Kr = 5. In contrast, both input token consumption and inference latency grow approximately linearly with Kr, reflecting the increasing overhead of deeper retrieval. Balancing these trends, we adopt Kr=2 as the default setting, which achieves near-optimal performance while keeping token usage and latency well below those of larger Kr values, offering an effective trade-off between long-term reasoning quality and computational efficiency in practice. In addition, we present a detailed case study in the appendix, which demonstrates AMA's capabilities in adaptive retrieval and conflict resolution.
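The role of Kr as a cap on iterative retrieval can be sketched as a bounded loop. This is a minimal illustration under stated assumptions, not AMA's implementation: `retrieve` and `is_sufficient` are hypothetical callables standing in for the Retriever and the Judge, and the toy data below exists only to exercise the loop.

```python
from typing import Callable, List


def retrieve_with_limit(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    k_r: int = 2,
) -> List[str]:
    """Collect evidence for `query`, running at most `k_r` retrieval rounds."""
    if k_r < 1:
        raise ValueError("k_r must be at least 1")
    evidence: List[str] = []
    for round_index in range(k_r):
        # Each extra round adds retrieved text, so input tokens grow with k_r.
        for item in retrieve(query, round_index):
            if item not in evidence:
                evidence.append(item)
        # Stop early once the (hypothetical) judge accepts the evidence.
        if is_sufficient(query, evidence):
            break
    return evidence


# Toy stand-ins: one list of memory snippets per retrieval round.
rounds = [["Alice moved in 2021."], ["Alice now lives in Lyon."], ["Alice likes tea."]]


def toy_retrieve(query: str, round_index: int) -> List[str]:
    return rounds[round_index] if round_index < len(rounds) else []


def toy_judge(query: str, evidence: List[str]) -> bool:
    return any("Lyon" in item for item in evidence)


print(retrieve_with_limit("Where does Alice live?", toy_retrieve, toy_judge, k_r=1))
print(retrieve_with_limit("Where does Alice live?", toy_retrieve, toy_judge, k_r=3))
```

With `k_r=1` the loop returns only the first snippet and the answer is missing; with `k_r=3` it stops after the second round because the judge is satisfied, which mirrors the pattern of early gains followed by saturation at higher limits.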

5. Conclusion

In this work, we introduce AMA, a multi-agent memory framework for long-term interactions that integrates multi-granularity memory, adaptive routing, and principled memory maintenance. By decomposing the memory lifecycle into coordinated agent roles, AMA dynamically aligns retrieval granularity with task demands while maintaining memory consistency over time. Extensive experiments demonstrate that AMA consistently outperforms strong baselines across challenging long-context benchmarks, validating the effectiveness of our design. Overall, this work underscores the importance of adaptive retrieval control and long-term memory management for building robust and scalable LLM agents.

Limitations

Despite the significant performance gains, the multi-agent collaboration incurs a moderate computational overhead compared to static retrieval baselines. Additionally, the reliance on the backbone model's reasoning capabilities suggests that the system's efficiency on smaller architectures has room for further optimization. We aim to address these challenges in future work to further enhance the efficiency and universality of the framework.
