MemMA: Coordinating the Memory Cycle through Multi-Agent Reasoning and In-Situ Self-Evolution

Memory, Agent, Multi-Agent, Pennsylvania State University, Amazon, Microsoft

Lin M, Zhang Z, Lu H, et al. MemMA: Coordinating the Memory Cycle through Multi-Agent Reasoning and In-Situ Self-Evolution[J]. arXiv preprint arXiv:2603.18718v1, 2026.

https://github.com/ventr1c/memma


Abstract

Memory-augmented LLM agents maintain external memory banks to support long-horizon interaction, yet most existing systems treat construction, retrieval, and utilization as isolated subroutines. This creates two coupled challenges: strategic blindness on the forward path of the memory cycle, where construction and retrieval are driven by local heuristics rather than explicit strategic reasoning, and sparse, delayed supervision on the backward path, where downstream failures rarely translate into direct repairs of the memory bank. To address these challenges, we propose MemMA, a plug-and-play multi-agent framework that coordinates the memory cycle along both the forward and backward paths. On the forward path, a Meta-Thinker produces structured guidance that steers a Memory Manager during construction and directs a Query Reasoner during iterative retrieval. On the backward path, MemMA introduces in-situ self-evolving memory construction, which synthesizes probe QA pairs, verifies the current memory, and converts failures into repair actions before the memory is finalized. Extensive experiments on LoCoMo show that MemMA consistently outperforms existing baselines across multiple LLM backbones and improves three different storage backends in a plug-and-play manner.

1. Introduction

Large language models (LLMs) are evolving from episodic chatbots into persistent agentic systems that execute complex workflows over days or weeks. In such settings, agents receive a continuous stream of observations (user constraints, tool outputs, and environmental feedback) whose consequences unfold over long horizons. This shift makes controllable, long-term memory a first-class requirement: relying solely on ephemeral context windows is insufficient, as they are computationally expensive and prone to attention dilution. To maintain coherence over time, agents must actively manage an external memory bank, deciding what to retain and how to retrieve it under uncertainty.

Figure 1: Two challenges in leveraging the memory cycle effect.

Effective memory, however, is not merely a storage utility; it is a closed-loop dynamic, conceptualized as the memory cycle effect. This cycle has three coupled phases: construction, retrieval, and utilization. Construction determines what information enters the memory bank and how it is organized; retrieval determines what stored information is surfaced as evidence; and utilization reveals whether the retrieved evidence is sufficient for downstream reasoning. This coupling implies that optimizing these stages in isolation is fundamentally suboptimal: a retrieval failure may stem from a much earlier construction error, while utilization outcomes should ideally feed back to improve future memory decisions. Despite this intrinsic dependency, most existing memory-augmented agents still treat memory operations as isolated, reactive subroutines, overlooking the coupling between stages. To leverage the memory cycle effect, two technical challenges must be addressed (Figure 1).

First, on the forward path of the memory cycle, current systems often suffer from strategic blindness: they possess the mechanisms to edit memory and issue retrieval queries, yet lack explicit meta-cognition to coordinate these actions toward downstream question answering. As our preliminary analysis shows (Section 3.3), this manifests as two pathologies: (i) Myopic Construction, where the agent accumulates or overwrites conflicting facts without resolution; and (ii) Aimless Retrieval, where the agent performs shallow or repetitive searches without narrowing the true information gap. These failures suggest that effective forward-path memory behavior requires explicit coordination between construction and retrieval, rather than isolated, short-sighted decisions.

Second, on the backward path of the memory cycle, feedback from utilization to construction is typically sparse and delayed. Whether a memory-writing decision is useful may become clear only much later, when the agent fails a downstream question. This makes credit assignment difficult: when an answer is wrong, it is hard to identify which earlier construction decision caused the failure, allowing omissions and unresolved conflicts to persist in the memory bank and affect later updates. Although recent methods use reflection or experiential learning to improve agent behavior, downstream failures are still rarely converted into direct signals for repairing the memory bank itself.

To address these challenges, we propose MemMA (Memory Cycle Multi-Agent Coordination), a plug-and-play multi-agent framework that coordinates the memory cycle along its forward and backward paths. Specifically, for the forward path, MemMA separates strategic reasoning from low-level execution through a planner-worker architecture. A Meta-Thinker produces structured guidance that steers a Memory Manager during construction (what to retain, consolidate, or resolve), thereby mitigating Myopic Construction. It also directs a Query Reasoner during retrieval by diagnosing which evidence is missing and how to retrieve it, replacing one-shot search with diagnosis-guided iterative refinement and thereby mitigating Aimless Retrieval. For the backward path, MemMA introduces in-situ self-evolving memory construction: after each session, the system synthesizes probe QA pairs, verifies the memory against them, and converts failures into repair actions on the memory bank through evidence-grounded critique and semantic consolidation, before the memory is committed for future use. This directly addresses sparse and delayed supervision by turning downstream failures into immediate, localized repair signals for the current memory state, before flawed memories propagate into future memory updates.

Our contributions are:

  • Analysis. We identify two technical challenges in leveraging the memory cycle effect: strategic blindness on the forward path and sparse, delayed feedback on the backward path, and provide empirical evidence through a controlled preliminary study (Section 3.3).
  • Framework. We propose MemMA, a plug-and-play multi-agent framework that coordinates the memory cycle along both its forward and backward paths, combining reasoning-aware coordination for construction and iterative retrieval with in-situ self-evolving memory construction for backward repair.
  • Experiments. MemMA outperforms existing baselines on LoCoMo across multiple LLM backbones, and consistently improves three storage backends as a plug-and-play module.

2. Related Work

Memory-Augmented LLM Agents. External memory has become a core component of LLM agents that operate over long horizons. Prior work improves long-term memory from several directions, including memory architecture, memory organization and consolidation, and memory retrieval. These methods substantially improve individual stages of the memory pipeline, but they primarily optimize storage, organization, or retrieval in isolation. In contrast, MemMA jointly coordinates memory construction and iterative retrieval, and converts utilization failures into direct repair signals for the memory bank. The full version is in the Appendix.

3. Preliminaries and Motivation

3.1 Problem Setting

Task Setup. We consider a long-horizon conversational setting in which an agent processes a stream of dialogue chunks $\mathcal{C}=\{c_1,\dots,c_T\}$ over time. The stream is further organized into sessions $\mathcal{S}=\{s_1,\dots,s_N\}$, where each session $s_\tau$ consists of one or more consecutive chunks corresponding to a coherent interaction episode. At each step $t$, the agent maintains an external memory bank $\mathcal{M}_t$ composed of structured entries (e.g., text, timestamp, source, and speaker metadata), which is updated as new conversational information arrives. After processing the full stream $\mathcal{C}$, the agent is evaluated on a set of questions $\mathcal{Q}$. For each query $q\in\mathcal{Q}$, it retrieves evidence $\mathcal{E}(q)$ from $\mathcal{M}_T$ and outputs an answer $\hat{y}(q)$. Our goal is to design an agent $\pi$ that maximizes answer accuracy by jointly improving memory construction and retrieval.
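
To make the setting concrete, the following minimal sketch models a dialogue chunk, a structured memory entry, and the grouping of a chunk stream into sessions. The class and field names (`Chunk`, `MemoryEntry`, `session_id`) are hypothetical illustrations and are not taken from the MemMA code base; only the entry metadata (text, timestamp, source, speaker) comes from the description above.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Chunk:
    """One dialogue chunk c_t; session_id is an assumed grouping key."""
    session_id: int
    speaker: str
    text: str


@dataclass
class MemoryEntry:
    """A structured memory entry carrying the metadata named in the text."""
    text: str
    timestamp: str
    source: str
    speaker: str


def group_into_sessions(chunks):
    """Group consecutive chunks that share a session_id into sessions."""
    return [list(group) for _, group in groupby(chunks, key=lambda c: c.session_id)]


stream = [
    Chunk(1, "Alice", "I adopted a dog named Max."),
    Chunk(1, "Bob", "Congrats! What breed?"),
    Chunk(2, "Alice", "Max turned out to be a beagle."),
]
sessions = group_into_sessions(stream)
```

Because `groupby` only merges adjacent items, this sketch relies on the stream already being in temporal order, which matches the assumption that each session is a run of consecutive chunks.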

Challenges. This setting is challenging because success depends on both memory construction and memory retrieval. During construction, the agent must decide what to write, update, merge, or discard when a new chunk arrives. During retrieval and answering, it must identify the right evidence from memory under ambiguity, temporal dependencies, and incomplete or underspecified initial queries. The challenge is therefore not merely to improve answer generation, but to maintain a useful memory bank and retrieve the right evidence under bounded memory and retrieval budgets.

3.2 Memory Cycle Effect as a Design Lens

The above challenges suggest that long-term memory should not be viewed as a linear pipeline of isolated modules. Instead, we adopt the memory cycle effect as a design lens for analyzing long-term memory systems. Under this view, memory forms a closed loop with three tightly coupled phases: construction, retrieval, and utilization. Construction determines what information enters the memory bank and how it is organized; retrieval determines what stored information is surfaced as evidence; and utilization reveals whether the retrieved evidence is sufficient for downstream answering.

This perspective highlights two dependencies. First, there is a forward dependency: construction constrains retrieval, and retrieval in turn constrains utilization. A poorly constructed memory bank may omit important details, retain redundant entries, or leave conflicts unresolved, all of which degrade downstream retrieval quality. Second, there is a backward dependency: utilization outcomes expose deficiencies in upstream memory operations, since answering failures may stem from earlier storage omissions, unresolved contradictions, or poorly targeted retrieval. As a result, the utility of memory operations is often sparse and delayed, making isolated optimization of memory modules fundamentally suboptimal.

Together, these dependencies suggest that long-term memory should be studied as a coupled cycle rather than independent storage and retrieval components. This motivates the need for mechanisms that explicitly coordinate forward memory execution and propagate utilization feedback backward to improve future memory decisions.

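Table 1: Preliminary analysis results (%) on the LoCoMo dataset, with GPT-4o-mini as the backbone LLM.

| Method | F1 | B1 | ACC |
|---|---|---|---|
| Static Baseline | 22.64 | 17.24 | 52.60 |
| Unguided Active | 23.49 | 18.36 | 54.60 |
| Strategic Active | 24.78 | 17.73 | 59.21 |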

3.3 Motivating Analysis: Strategic Blindness

The analysis above motivates coordination across the memory cycle, but do existing active memory agents achieve this in practice? Recent agents have moved beyond fully passive memory by introducing active updates or iterative retrieval. However, most still operate in a largely reactive manner: they trigger operations based on local context or immediate similarity signals rather than an explicit global strategy. We characterize this limitation as strategic blindness: the agent has the hands to edit memory and issue retrieval queries, but lacks the brain to coordinate these actions across the full memory cycle. This manifests as: (i) Myopic Construction: construction decisions are driven by local context rather than downstream utility. The agent indiscriminately appends, overwrites, or ignores information, leaving redundancy and conflicts unresolved. (ii) Aimless Retrieval: when the initial query is incomplete or semantically mismatched with stored memory, one-shot retrieval or shallow rewrites fail to surface the required evidence. Without strategic guidance, successive queries do not narrow the information gap.

Setup. To empirically validate this diagnosis, we conduct a preliminary study on a subset of LoCoMo, focusing on reasoning-intensive queries by excluding adversarial samples. We compare three progressively stronger baselines using GPT-4o-mini as the backbone: (i) Static, which performs memory construction followed by one-shot top-30 retrieval; (ii) Unguided Active, which adds iterative query rewriting without strategic guidance; and (iii) Strategic Active, which introduces a planner to guide both construction and retrieval. We report token-level F1, BLEU-1 (B1), and LLM-as-a-Judge accuracy (ACC). More evaluation details are provided in Appendix.

Empirical analysis. Table 1 reveals two findings: (i) Refinement provides capability: Unguided Active (54.60% ACC) outperforms Static (52.60% ACC), confirming that one-shot retrieval often fails to surface the required evidence when the initial query is incomplete or mismatched with memory, which directly reflects Aimless Retrieval. (ii) Reasoning provides control: Strategic Active achieves a larger leap to 59.21% ACC. Since it shares the same active operators as Unguided Active, this gap reflects the value of explicit strategic guidance in addressing both Aimless Retrieval and Myopic Construction. Case studies in the Appendix further illustrate both pathologies with concrete examples of redundant entries and retrieval drift. These findings suggest that active memory operations alone are insufficient: explicit strategic reasoning is needed to guide both construction and retrieval.

4. Methodology

Motivated by the memory cycle effect (Section 3.2) and strategic blindness (Section 3.3), we present MemMA, a plug-and-play multi-agent framework that coordinates the memory cycle along its forward and backward paths (Figure 2). Section 4.1 describes the forward path: a planner-worker architecture that separates strategic reasoning from low-level execution to address strategic blindness. Section 4.2 describes the backward path: an in-situ self-evolution mechanism that addresses sparse, delayed feedback by generating synthetic probe QA immediately after each session, providing dense, localized supervision for memory repair before the current memory is committed.

Figure 2: Overview of MemMA.

4.1 Reasoning-Aware Coordination over the Forward Path

MemMA coordinates online construction, iterative retrieval, and answer-time utilization through specialized yet tightly coupled agents. Its key design principle is to separate strategic reasoning (what to store, what is missing, and when to stop) from low-level execution (memory editing, evidence retrieval, and answer generation).

Pipeline Overview. MemMA uses a planner-worker architecture with four roles: (i) a Meta-Thinker $\pi_p$ for high-level strategic reasoning, (ii) a Memory Manager $\pi_s$ for memory editing, (iii) a Query Reasoner $\pi_r$ for iterative query refinement, and (iv) an Answer Agent $\pi_a$ for final response generation. During construction, when a new dialogue chunk $c_t$ arrives, $\pi_p$ analyzes it against the existing memory $\mathcal{M}_{t-1}$ and produces meta-guidance on what to retain, consolidate, or resolve. Conditioned on this guidance, $\pi_s$ selects an atomic edit that updates $\mathcal{M}_{t-1}$ to $\mathcal{M}_t$. During question answering, given a query $q$, $\pi_r$ retrieves candidate evidence from $\mathcal{M}_T$ and iteratively refines its search. At each step, $\pi_p$ judges whether the current evidence is sufficient; if not, it identifies the most critical gap and directs $\pi_r$ to refine the query toward complementary evidence. The loop ends when $\pi_p$ deems the evidence sufficient or a budget $H$ is reached. Then $\pi_a$ generates the final answer. We detail each component below.

Meta-Thinker $\pi_p$. $\pi_p$ is the planning layer of MemMA, responsible for both construction and retrieval guidance. It produces phase-specific guidance conditioned on the current input and a bounded memory view:

$$g_t^{S} \sim \pi_p(\cdot \mid c_t, \tilde{\mathcal{M}}_{t-1}), \qquad g_{q,h}^{R} \sim \pi_p(\cdot \mid q, \mathcal{E}_h, \mathcal{U}_h, \tilde{\mathcal{M}}_T),$$

where $g_t^{S}$ is the construction guidance at step $t$ and $g_{q,h}^{R}$ is the retrieval guidance at refinement step $h$. Here, $\mathcal{E}_h$ denotes the evidence accumulated up to step $h$, $\mathcal{U}_h=\{u_0,\dots,u_h\}$ denotes the query history, and $\tilde{\mathcal{M}}$ denotes a bounded view of the memory bank, e.g., the top-$k$ most recent or semantically related entries.

Construction. $g_t^{S}$ provides a set of focus points that flag information importance, redundancy with existing entries, and potential conflicts. These focus points steer $\pi_s$ toward globally consistent memories rather than indiscriminate accumulation.

Retrieval. $g_{q,h}^{R}$ is a critique of the current evidence $\mathcal{E}_h$. $\pi_p$ evaluates coverage, consistency, and specificity with respect to $q$. If the evidence is sufficient, it returns answerable; otherwise, it returns not-answerable together with a diagnosis of what is missing and how to retrieve it, e.g., a missing attribute or temporal scope. This encourages orthogonal evidence acquisition rather than near-duplicate searches. Full guidance templates and examples are in the Appendix.

Memory Manager $\pi_s$. $\pi_s$ performs atomic memory edits based on the current chunk, bounded context, and guidance from $\pi_p$. Given $c_t$, $\tilde{\mathcal{M}}_{t-1}$, and $g_t^{S}$, it selects an action $a_t^{S}\in\{\text{ADD},\text{UPDATE},\text{DELETE},\text{NONE}\}$:

$$a_t^{S} \sim \pi_s(\cdot \mid c_t, \tilde{\mathcal{M}}_{t-1}, g_t^{S}), \qquad \mathcal{M}_t = \mathrm{Apply}(\mathcal{M}_{t-1}, a_t^{S}).$$

The guidance signal $g_t^{S}$ helps $\pi_s$ filter noise, consolidate redundancy, and resolve conflicts at the source rather than blindly appending. $\pi_s$ is backend-agnostic and can wrap diverse memory implementations such as LightMem and A-Mem.
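
The Apply step can be pictured with a small sketch. It treats the memory bank as a dictionary from entry IDs to text and applies one atomic edit at a time. The `Edit` record, its fields, and the dictionary representation are assumptions made for illustration; the paper does not specify how edits are encoded, and the decision of which edit to emit (the role of $\pi_s$) is left out entirely.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Edit:
    """An atomic edit a_t^S; this field layout is a hypothetical encoding."""
    action: str                      # "ADD", "UPDATE", "DELETE", or "NONE"
    entry_id: Optional[str] = None   # target entry (unused for NONE)
    text: Optional[str] = None       # new content for ADD / UPDATE


def apply_edit(memory: Dict[str, str], edit: Edit) -> Dict[str, str]:
    """Return M_t = Apply(M_{t-1}, a_t^S) without mutating the input."""
    updated = dict(memory)
    if edit.action == "NONE":
        return updated
    if edit.action not in ("ADD", "UPDATE", "DELETE"):
        raise ValueError(f"unknown action: {edit.action}")
    if edit.entry_id is None:
        raise ValueError("entry_id is required for ADD, UPDATE, and DELETE")
    if edit.action == "ADD":
        if edit.entry_id in updated:
            raise KeyError(f"entry already exists: {edit.entry_id}")
        updated[edit.entry_id] = edit.text
    elif edit.action == "UPDATE":
        if edit.entry_id not in updated:
            raise KeyError(f"no such entry: {edit.entry_id}")
        updated[edit.entry_id] = edit.text
    else:  # DELETE
        del updated[edit.entry_id]
    return updated


memory = {"m1": "Alice lives in Boston."}
memory = apply_edit(memory, Edit("UPDATE", "m1", "Alice moved from Boston to Seattle."))
memory = apply_edit(memory, Edit("ADD", "m2", "Alice adopted a beagle named Max."))
memory = apply_edit(memory, Edit("NONE"))
```

Resolving the conflicting location with a single UPDATE, rather than appending a second entry, is the kind of behavior the construction guidance is meant to encourage.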

Query Reasoner $\pi_r$. $\pi_r$ implements the active retrieval policy. To overcome Aimless Retrieval (Section 3.3), it replaces one-shot search with an iterative Refine-and-Probe loop. Let $u_0=q$ be the initial query and $\mathcal{U}_h=\{u_0,\dots,u_h\}$ the query history. At step $h$, when $\pi_p$ deems the current evidence $\mathcal{E}_h$ not-answerable, it emits guidance $g_{q,h}^{R}$. $\pi_r$ then proposes the next query and retrieves additional evidence:

$$u_{h+1} \sim \pi_r(\cdot \mid \mathcal{U}_h, \mathcal{E}_h, g_{q,h}^{R}), \qquad \mathcal{E}_{h+1} = \mathcal{E}_h \cup \mathrm{Search}(\mathcal{M}_T, u_{h+1}).$$

The loop terminates when $\pi_p$ returns answerable or the budget $H$ is reached. Each refinement step targets the specific information gap diagnosed by $\pi_p$, so successive queries narrow the deficit rather than drifting across redundant rewrites. Full query rewrite prompt templates are in the Appendix.
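
The control flow of the Refine-and-Probe loop can be sketched as follows. The three callables stand in for the search backend, the Meta-Thinker critique, and the Query Reasoner rewrite; in MemMA these would be a retrieval call and LLM calls, whereas here they are toy keyword-based stubs. The function names, the stub logic, and the choice to seed $\mathcal{E}_0$ with a search on the initial query are assumptions for illustration only.

```python
from typing import Callable, List, Tuple

Critique = Tuple[bool, str]  # (answerable?, diagnosis of what is missing)


def refine_and_probe(
    query: str,
    search: Callable[[str], List[str]],
    critique: Callable[[str, List[str], List[str]], Critique],
    rewrite: Callable[[List[str], List[str], str], str],
    budget: int = 3,
) -> Tuple[List[str], List[str]]:
    """Retrieve iteratively until the critic is satisfied or budget H is used."""
    history = [query]                  # query history, starting with u_0 = q
    evidence = list(search(query))     # assumed initial evidence E_0
    for _ in range(budget):
        answerable, diagnosis = critique(query, evidence, history)
        if answerable:
            break
        next_query = rewrite(history, evidence, diagnosis)  # u_{h+1}
        history.append(next_query)
        for item in search(next_query):                     # E_{h+1} = E_h ∪ Search
            if item not in evidence:
                evidence.append(item)
    return evidence, history


# Toy stand-ins for the memory bank and the two agents (all hypothetical).
MEMORY = [
    "Alice adopted a dog named Max.",
    "Max is a beagle.",
    "Bob plays the violin.",
]
STOPWORDS = {"which", "is", "the", "a"}


def toy_search(q: str) -> List[str]:
    keywords = [w for w in q.lower().replace("?", "").split() if w not in STOPWORDS]
    return [m for m in MEMORY if any(k in m.lower() for k in keywords)]


def toy_critique(q: str, evidence: List[str], history: List[str]) -> Critique:
    if any("beagle" in e.lower() for e in evidence):
        return True, ""
    return False, "the breed of the dog named Max is missing"


def toy_rewrite(history: List[str], evidence: List[str], diagnosis: str) -> str:
    return "Max breed"


question = "Which breed is the dog Alice adopted?"
evidence, history = refine_and_probe(question, toy_search, toy_critique, toy_rewrite)
```

In this toy run the first search surfaces only the adoption fact, the critique flags the missing breed, and a single targeted rewrite retrieves the complementary entry, so the loop stops after one refinement instead of using the full budget.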

Answer Agent $\pi_a$. Once the retrieval loop terminates, $\pi_a$ generates the final answer $\hat{y}(q)$ based on the query and the final evidence set $\mathcal{E}(q)=\mathcal{E}_H$:

$$\hat{y}(q) = F_{\pi_a}(q, \mathcal{E}(q)),$$

where $F_{\pi_a}$ denotes a generation function (e.g., an LLM call). In our experiments, $\pi_a$ is kept frozen to decouple answer-generation capacity from memory quality, so that gains can be attributed to coordination over the memory cycle rather than to the parametric knowledge of $\pi_a$.

4.2 In-Situ Self-Evolving Memory Construction

A major bottleneck in the memory cycle is that feedback for construction is typically sparse and delayed. The utility of a storage decision made in session $\tau$ may become observable only much later, when the agent fails a downstream question. Optimizing construction solely from final-task outcomes makes credit assignment difficult and lets early omissions propagate uncorrected. To address this, we introduce in-situ self-evolving memory construction, which provides dense intermediate feedback for the construction stage. Instead of waiting for a future user query to expose a memory failure, MemMA synthesizes a set of probe QA pairs after each session and uses them to verify and repair the current memory before it is committed.

Probe Generation. Let $s_\tau$ denote the current session, and let $\mathcal{M}_\tau^{(0)}$ denote the provisional memory state obtained after applying the construction policy of Section 4.1 to $s_\tau$. To obtain intermediate supervision, we construct a probe set

$$\mathcal{Q}_\tau = \{(q_j, y_j)\}_{j=1}^{J},$$

where each $(q_j, y_j)$ is a synthetic question-answer pair grounded in $s_\tau$ and its relevant historical context $\tilde{\mathcal{M}}_{\tau-1}$. The questions are designed to test whether the provisional memory faithfully captures and can retrieve information introduced in the current session, covering single-session factual recall, cross-session relational reasoning, and temporal inference. This turns a delayed end-task signal into $J$ localized supervision signals immediately after construction. Design details are in the Appendix.

In-situ Verification. Given $\mathcal{Q}_\tau$, MemMA verifies the provisional memory state $\mathcal{M}_\tau^{(0)}$ immediately after the initial construction pass. For each probe $q_j$, we retrieve top-$k$ evidence from $\mathcal{M}_\tau^{(0)}$ and generate an answer with $\pi_a$:

$$\mathcal{E}_j = \mathrm{Search}(\mathcal{M}_\tau^{(0)}, q_j), \qquad \hat{y}_j = F_{\pi_a}(q_j, \mathcal{E}_j).$$

A probe is considered failed if $\hat{y}_j$ is judged incorrect with respect to $y_j$. Such failures provide localized evidence that $\mathcal{M}_\tau^{(0)}$ is insufficient for information introduced in or linked to $s_\tau$.
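
A minimal sketch of the verification pass is shown below. The `search`, `answer`, and `judge` callables are placeholders for top-$k$ retrieval, the Answer Agent, and the correctness judge; the toy versions used here (return the whole memory, concatenate the evidence, substring match) are assumptions chosen only so the example runs without an LLM.

```python
from typing import Callable, List, Tuple

Probe = Tuple[str, str]                    # (question q_j, gold answer y_j)
Failure = Tuple[str, str, str, List[str]]  # (q_j, y_j, predicted answer, E_j)


def verify_memory(
    probes: List[Probe],
    search: Callable[[str], List[str]],
    answer: Callable[[str, List[str]], str],
    judge: Callable[[str, str, str], bool],
) -> List[Failure]:
    """Answer every probe from the provisional memory and collect failures."""
    failures = []
    for question, gold in probes:
        evidence = search(question)             # E_j
        predicted = answer(question, evidence)  # predicted answer for q_j
        if not judge(question, gold, predicted):
            failures.append((question, gold, predicted, evidence))
    return failures


# Toy provisional memory and stand-in components (all hypothetical).
provisional_memory = ["Alice adopted a dog named Max."]
probes = [
    ("What is the name of Alice's dog?", "Max"),
    ("What breed is Max?", "beagle"),
]
failures = verify_memory(
    probes,
    search=lambda q: provisional_memory[:30],
    answer=lambda q, ev: " ".join(ev) if ev else "unknown",
    judge=lambda q, gold, pred: gold.lower() in pred.lower(),
)
```

Each failure keeps the question, gold answer, prediction, and retrieved evidence together, which is the information the repair step described next is conditioned on.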

Evidence-grounded Repair. For each failed probe, a reflection module converts the failure into a repair proposal. Conditioned on the question, gold answer, predicted answer, retrieved evidence, and the provisional memory state $(q_j, y_j, \hat{y}_j, \mathcal{E}_j, \mathcal{M}_\tau^{(0)})$, it diagnoses whether the failure reflects missing information or memory content that is difficult to retrieve in its current form, and then proposes a candidate repair fact. Collecting all failed probes in the current batch yields a set of repair proposals

$$\mathcal{R}_\tau = \{r_j\}_{q_j \in \mathcal{Q}_\tau^{\text{fail}}},$$

where $\mathcal{Q}_\tau^{\text{fail}} \subseteq \mathcal{Q}_\tau$ denotes the failed probes.

Semantic Consolidation. Applying all repairs in $\mathcal{R}_\tau$ directly could reintroduce redundancy or conflicts, e.g., when two probes request overlapping or inconsistent additions. We therefore consolidate the candidate repair facts against $\mathcal{M}_\tau^{(0)}$. For each candidate fact, the consolidation step assigns one of three actions with respect to the existing memory: SKIP if it is redundant, MERGE if it complements an existing entry, or INSERT if it is novel. This resolves both conflicts with the existing memory and conflicts across repair proposals before any update is written back. The refined memory is obtained as

$$\mathcal{M}_\tau = \mathrm{Refine}(\mathcal{M}_\tau^{(0)}, \mathcal{R}_\tau),$$

where $\mathrm{Refine}$ denotes consolidation followed by write-back over $\mathcal{R}_\tau$. In this way, utilization failures are detected and repaired during construction before they can propagate into later memory updates, while keeping the evolving memory compact and internally consistent.
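
The three-way SKIP / MERGE / INSERT decision can be illustrated with a toy stand-in. The sketch below replaces the semantic judgment with word-level Jaccard similarity and two thresholds, and it merges by plain concatenation; the similarity measure, the threshold values, and the merge rule are all assumptions for illustration, not the consolidation procedure used in the paper.

```python
import re
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity, a crude proxy for semantic overlap."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def consolidate(
    memory: List[str],
    candidates: List[str],
    skip_at: float = 0.9,
    merge_at: float = 0.4,
) -> Tuple[List[str], List[str]]:
    """Return (refined memory, per-candidate decisions)."""
    refined = list(memory)
    decisions = []
    for fact in candidates:
        # Compare against the evolving memory so that overlapping repair
        # proposals are also reconciled with each other.
        scores = [jaccard(fact, entry) for entry in refined]
        best = max(range(len(refined)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= skip_at:
            decisions.append("SKIP")
        elif best is not None and scores[best] >= merge_at:
            refined[best] = f"{refined[best]} {fact}"  # placeholder merge
            decisions.append("MERGE")
        else:
            refined.append(fact)
            decisions.append("INSERT")
    return refined, decisions


memory = ["Alice adopted a dog named Max."]
candidates = [
    "Alice adopted a dog named Max.",
    "Max is a beagle that Alice adopted.",
    "Bob plays the violin.",
]
refined, decisions = consolidate(memory, candidates)
```

Here the exact duplicate is skipped, the breed fact is folded into the existing entry about Max, and the unrelated fact about Bob becomes a new entry, so three proposals add only one entry to the memory.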

5. Experiments

This section presents the experimental results. We first compare MemMA with existing baselines, then evaluate its flexibility across storage backends, and finally assess the contribution of each component and key design choices.

5.1 Experimental Setup

Datasets. We evaluate MemMA on LoCoMo, a benchmark for long-horizon conversational memory. Following prior work, we exclude the adversarial subset and focus on the reasoning-intensive QA setting. More dataset details are provided in Appendix.

Baselines. We compare against two passive baselines: Full Text and Naive RAG, and four active memory systems: LangMem, Mem0, A-Mem, and LightMem. Additional baseline details are in Appendix.

Evaluation Protocol. Following prior work, we report three metrics: token-level F1 (F1), BLEU-1 (B1), and LLM-as-a-Judge accuracy (ACC). F1 and B1 measure lexical overlap with the reference answer; ACC measures semantic correctness via a judge model. GPT-4o-mini and Claude-Haiku-4.5 are used as the backbones for the Memory Manager, Meta-Thinker, and Query Reasoner. To isolate memory construction quality from answer-generation capacity, we fix GPT-4o-mini as both the Answer Agent and the LLM judge across all experiments. The retrieval budget is top-30 entries, the iterative refinement budget is H=3, and we generate J=5 probe QA pairs per session for self-evolution. Additional implementation details are in Appendix.
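
For reference, the two lexical metrics can be computed as in the sketch below. It uses a common formulation (SQuAD-style token-level F1, and BLEU-1 as clipped unigram precision with a brevity penalty against a single reference). The tokenizer and the absence of smoothing are assumptions; the exact evaluation scripts used in the experiments may differ in normalization details.

```python
import math
import re
from collections import Counter


def tokenize(text: str):
    """Lowercase and split on word characters (an assumed normalization)."""
    return re.findall(r"\w+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = tokenize(prediction), tokenize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def bleu1(prediction: str, reference: str) -> float:
    """Clipped unigram precision times the brevity penalty (single reference)."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred:
        return 0.0
    clipped = sum((Counter(pred) & Counter(ref)).values())
    precision = clipped / len(pred)
    brevity = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return brevity * precision


f1 = token_f1("Max is a beagle", "a beagle")
b1 = bleu1("Max is a beagle", "a beagle")
```

In this example the prediction contains both reference tokens plus two extra ones, so recall is 1.0 and precision is 0.5. A verbose but correct answer is therefore penalized by both lexical metrics, which is one reason a judge-based ACC is reported alongside them.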

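Table 2: Results on LoCoMo across four question categories (multi-hop, temporal, open-domain, single-hop). F1, B1, and ACC are reported (%). GPT-4o-mini and Claude-Haiku-4.5 are the backbones; GPT-4o-mini is the Answer Agent. MemMA_LM uses LightMem as the storage backend.

| Model | Method | Multi-Hop F1 | Multi-Hop B1 | Multi-Hop ACC | Temporal F1 | Temporal B1 | Temporal ACC | Open-Domain F1 | Open-Domain B1 | Open-Domain ACC | Single-Hop F1 | Single-Hop B1 | Single-Hop ACC | Overall F1 | Overall B1 | Overall ACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT | Full Text | 29.41 | 21.16 | 43.75 | 29.95 | 19.33 | 51.35 | 18.25 | 19.56 | 61.54 | 41.45 | 29.96 | 74.29 | 34.13 | 24.63 | 61.18 |
| GPT | Naive RAG | 15.84 | 9.50 | 31.25 | 17.30 | 12.36 | 35.14 | 17.40 | 16.65 | 46.15 | 39.32 | 30.35 | 58.57 | 27.14 | 20.41 | 46.05 |
| GPT | LangMem | 12.55 | 9.22 | 25.00 | 15.23 | 11.53 | 21.62 | 14.91 | 14.03 | 38.46 | 23.52 | 17.59 | 35.71 | 18.46 | 14.05 | 30.26 |
| GPT | A-Mem | 15.56 | 10.88 | 31.25 | 55.01 | 42.40 | 51.35 | 18.18 | 15.27 | 53.85 | 42.72 | 32.43 | 62.86 | 37.90 | 28.85 | 52.63 |
| GPT | LightMem | 33.74 | 29.33 | 65.62 | 59.76 | 51.12 | 78.38 | 31.85 | 24.23 | 76.92 | 43.88 | 34.68 | 78.57 | 44.58 | 36.66 | 75.66 |
| GPT | MemMA_LM | 48.15 | 39.67 | 78.12 | 57.21 | 41.94 | 83.78 | 24.58 | 22.44 | 76.92 | 50.45 | 38.66 | 82.86 | 49.40 | 38.28 | 81.58 |
| Claude-Haiku | Full Text | 29.41 | 21.16 | 43.75 | 29.95 | 19.33 | 51.35 | 18.25 | 19.56 | 61.54 | 41.45 | 29.96 | 74.29 | 34.13 | 24.63 | 61.18 |
| Claude-Haiku | Naive RAG | 15.84 | 9.50 | 31.25 | 17.30 | 12.36 | 35.14 | 17.40 | 16.65 | 46.15 | 39.32 | 30.35 | 58.57 | 27.14 | 20.41 | 46.05 |
| Claude-Haiku | LangMem | 20.05 | 14.85 | 34.38 | 34.72 | 26.33 | 37.84 | 20.01 | 20.85 | 69.23 | 22.65 | 16.19 | 48.57 | 24.81 | 18.78 | 44.74 |
| Claude-Haiku | A-Mem | 15.79 | 10.32 | 28.13 | 56.41 | 43.23 | 54.05 | 16.34 | 17.76 | 38.46 | 38.37 | 27.98 | 65.71 | 36.12 | 27.10 | 52.63 |
| Claude-Haiku | LightMem | 35.11 | 31.85 | 59.38 | 58.42 | 49.85 | 89.19 | 32.60 | 24.43 | 69.23 | 44.06 | 36.56 | 71.43 | 44.69 | 37.77 | 73.03 |
| Claude-Haiku | MemMA_LM | 35.38 | 32.48 | 65.62 | 59.25 | 44.66 | 83.78 | 28.59 | 26.86 | 84.62 | 45.31 | 35.85 | 77.14 | 45.10 | 36.53 | 76.97 |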

5.2 Main Comparison with Baselines

To evaluate MemMA, we compare it with the baselines. We use LightMem as the storage backend of MemMA, denoted MemMA_LM. GPT-4o-mini and Claude-Haiku-4.5 are the backbones. Other settings follow those in Section 5.1.

Table 2 reports the results. Three findings emerge: (i) MemMA_LM achieves the best overall performance under both backbones. Under GPT-4o-mini, it reaches 49.40 F1, 38.28 B1, and 81.58 ACC, improving over LightMem by +4.82 F1, +1.62 B1, and +5.92 ACC. Under Claude-Haiku-4.5, it again achieves the best overall ACC, improving from 73.03 to 76.97 over LightMem. (ii) The gains are strong at the category level. Under GPT-4o-mini, MemMA_LM improves most on Multi-Hop and Single-Hop, raising ACC from 65.62 to 78.12 and from 78.57 to 82.86, respectively. The Multi-Hop gains are consistent with diagnosis-guided iterative retrieval helping recover distributed evidence, while the Single-Hop gains suggest that construction guidance and self-evolution help preserve precise answer-bearing details. (iii) MemMA_LM improves an already strong baseline. LightMem is the strongest baseline, yet MemMA_LM further improves it under both backbones, suggesting that the gain comes from memory-cycle coordination rather than a stronger storage backend.
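The reported gains can be re-derived directly from Table 2. The short sketch below recomputes the per-category differences between MemMA_LM and LightMem under GPT-4o-mini; the numbers are copied from the table, while the variable and function names are illustrative.

```python
# (F1, B1, ACC) per category, copied from Table 2 (GPT-4o-mini backbone).
LIGHTMEM = {
    "Multi-Hop": (33.74, 29.33, 65.62),
    "Temporal": (59.76, 51.12, 78.38),
    "Open-Domain": (31.85, 24.23, 76.92),
    "Single-Hop": (43.88, 34.68, 78.57),
    "Overall": (44.58, 36.66, 75.66),
}
MEMMA_LM = {
    "Multi-Hop": (48.15, 39.67, 78.12),
    "Temporal": (57.21, 41.94, 83.78),
    "Open-Domain": (24.58, 22.44, 76.92),
    "Single-Hop": (50.45, 38.66, 82.86),
    "Overall": (49.40, 38.28, 81.58),
}


def deltas(ours: dict, baseline: dict) -> dict:
    """Return (dF1, dB1, dACC) per category, rounded to two decimals."""
    return {
        category: tuple(round(o - b, 2) for o, b in zip(ours[category], baseline[category]))
        for category in ours
    }


for category, (d_f1, d_b1, d_acc) in deltas(MEMMA_LM, LIGHTMEM).items():
    print(f"{category:<12} F1 {d_f1:+.2f}  B1 {d_b1:+.2f}  ACC {d_acc:+.2f}")
```

The Overall row reproduces the +4.82 F1, +1.62 B1, and +5.92 ACC quoted above. The same computation also shows that ACC does not decrease in any category, whereas F1 and B1 are lower than LightMem on Temporal and Open-Domain.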

5.3 Flexibility across Storage Backends

To assess the flexibility of MemMA across storage backends, we instantiate it on top of three memory systems: Single-Agent (MemMA_SA), A-Mem (MemMA_AM), and LightMem (MemMA_LM). All other components and settings are fixed as in Section 5.1.

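Table 3: Flexibility across storage backends on LoCoMo under GPT-4o-mini. F1, B1, and ACC are reported in %.

| Method | F1 | B1 | ACC |
|---|---|---|---|
| Single-Agent | 22.64 | 17.24 | 52.60 |
| MemMA_SA | 23.64 | 12.94 | 84.87 |
| A-Mem | 37.90 | 28.85 | 52.63 |
| MemMA_AM | 46.23 | 35.13 | 78.29 |
| LightMem | 44.58 | 36.66 | 75.66 |
| MemMA_LM | 49.40 | 38.28 | 81.58 |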

Table 3 reports results on LoCoMo under GPT-4o-mini. Two observations emerge. (i) MemMA consistently improves all backends. In terms of ACC, MemMA improves the Single-Agent backend from 52.60 to 84.87, A-Mem from 52.63 to 78.29, and LightMem from 75.66 to 81.58. For A-Mem and LightMem, the gains are also consistent in F1 and B1. For the weaker Single-Agent backend, B1 decreases (17.24 to 12.94) even though ACC rises sharply, suggesting that MemMA improves semantic correctness more than lexical overlap in this setting. These results indicate that MemMA improves long-horizon memory across diverse storage implementations. (ii) The gains of MemMA complement storage quality rather than replace it. Among the enhanced variants, MemMA_LM achieves the strongest overall F1 and B1, which is consistent with LightMem being the strongest standalone backend. This pattern suggests that MemMA improves how memory is coordinated, rather than relying on a particular storage design.
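As a quick consistency check on Table 3, the sketch below computes the change each backend sees when MemMA is added and lists any metric that regresses. The values are copied from the table; the helper names are illustrative.

```python
METRICS = ("F1", "B1", "ACC")

# backend -> (standalone scores, scores with MemMA), copied from Table 3.
TABLE3 = {
    "Single-Agent": ((22.64, 17.24, 52.60), (23.64, 12.94, 84.87)),
    "A-Mem": ((37.90, 28.85, 52.63), (46.23, 35.13, 78.29)),
    "LightMem": ((44.58, 36.66, 75.66), (49.40, 38.28, 81.58)),
}


def metric_changes(table: dict) -> dict:
    """Per-backend change in each metric after adding MemMA."""
    return {
        backend: {m: round(after - before, 2) for m, before, after in zip(METRICS, base, enhanced)}
        for backend, (base, enhanced) in table.items()
    }


def regressions(changes: dict) -> list:
    """(backend, metric) pairs whose score drops after adding MemMA."""
    return [
        (backend, metric)
        for backend, row in changes.items()
        for metric, delta in row.items()
        if delta < 0
    ]


changes = metric_changes(TABLE3)
print(changes)
print(regressions(changes))
```

The only regression is B1 on the Single-Agent backend, matching the observation that the large ACC gain there is not accompanied by higher lexical overlap.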

5.4 In-depth Dissection of MemMA

Ablation Studies. To understand the contributions of key components in MemMA, we implement three ablated variants on the Single-Agent backend: (i) MemMA/C removes Meta-Thinker guidance during construction and directly uses the Memory Manager for memory writing; (ii) MemMA/R removes iterative retrieval, reverting to one-shot retrieval based on semantic similarity; and (iii) MemMA/E removes the probe-and-repair loop of in-situ self-evolving memory construction and directly commits the provisional memory M_τ^(0) to the memory bank.
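One way to picture the three variants is as switches on a single configuration object. The sketch below is purely illustrative: `MemMAConfig` and its field names are hypothetical and are not taken from the MemMA code base; only the default values (top-30, H=3, J=5) come from the evaluation protocol in Section 5.1.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MemMAConfig:
    """Hypothetical configuration; field names are assumptions, not the real API."""
    construction_guidance: bool = True  # Meta-Thinker steers the Memory Manager
    iterative_retrieval: bool = True    # diagnosis-guided query refinement
    self_evolution: bool = True         # probe-and-repair before committing memory
    top_k: int = 30                     # retrieval budget
    refinement_budget: int = 3          # H
    probes_per_session: int = 5         # J


FULL = MemMAConfig()

# Each ablation disables exactly one component and leaves the rest unchanged.
ABLATIONS = {
    "MemMA/C": replace(FULL, construction_guidance=False),
    "MemMA/R": replace(FULL, iterative_retrieval=False, refinement_budget=0),
    "MemMA/E": replace(FULL, self_evolution=False, probes_per_session=0),
}

for name, cfg in ABLATIONS.items():
    print(name, cfg)
```

Keeping the configuration frozen and deriving variants with `replace` makes it explicit that each ablation differs from the full system in one component only.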

Figure 3: Ablation study of MemMA_SA on LoCoMo under (a) GPT-4o-mini and (b) Claude-Haiku-4.5.

Figure 3 reports the results under GPT-4o-mini and Claude-Haiku-4.5. The full MemMA_SA achieves the strongest overall performance, while the variants reveal complementary weaknesses. Specifically: (i) Iterative retrieval is the most critical forward-path component. Removing it (MemMA_SA/R) causes the largest drop under both backbones, reducing ACC from 84.87 to 70.39 with GPT-4o-mini and from 88.82 to 81.58 with Claude-Haiku-4.5. This confirms that one-shot retrieval remains a major bottleneck and that diagnosis-guided refinement is essential for narrowing the information gap. (ii) Self-evolution repairs construction omissions. Removing it (MemMA_SA/E) causes the second-largest degradation (ACC: 84.87 → 73.68 with GPT-4o-mini). The large ACC drop with only moderate F1 change suggests that self-evolution mainly improves semantic correctness by repairing missing information during construction. (iii) Construction guidance reduces upstream noise. Removing it (MemMA_SA/C) reduces ACC from 88.82 to 83.55 with Claude-Haiku-4.5. This shows that construction decisions benefit from explicit strategic guidance rather than local heuristics alone, as the Meta-Thinker helps determine what should be retained, consolidated, or resolved before information enters the memory bank. These ablations confirm that MemMA's gains come from complementary improvements on both paths of the memory cycle.

Figure 4: Impact of the retrieval budget k on (a) MemMA_LM and (b) MemMA_SA under GPT-4o-mini and Claude-Haiku-4.5.

Impact of retrieval budget k. We vary k ∈ {10, 20, 30, 40, 50} on both the Single-Agent and LightMem backends and report results in Figure 4. We observe that the optimal k depends on storage quality. For MemMA_LM, ACC peaks at k=30 to 40 (81.58) and declines at k=50 (79.61), indicating a sweet spot beyond which additional retrieval introduces noise. For MemMA_SA, ACC increases steadily from 75.66 at k=10 to 84.21 at k=50, without saturation. We attribute this contrast to storage quality: stronger backends produce higher-quality, less redundant entries, so a moderate k suffices and excess retrieval dilutes the evidence; weaker backends need a larger k to retrieve enough evidence from sparser memory banks.
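To make the role of k concrete, here is a minimal top-k retrieval sketch based on cosine similarity. It assumes memory entries already have embedding vectors; the two-dimensional toy vectors and entry texts are invented for illustration, and the actual retrievers used by the storage backends may differ.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, memory, k: int):
    """Return the k (similarity, text) pairs most similar to the query."""
    scored = [(cosine(query_vec, vec), text) for text, vec in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy memory bank: (entry text, embedding). Purely illustrative values.
MEMORY = [
    ("Alice adopted a dog in May", (1.0, 0.0)),
    ("The dog is named Rex", (0.9, 0.1)),
    ("Bob likes hiking", (0.0, 1.0)),
]
QUERY = (1.0, 0.05)

for k in (1, 2, 3):
    hits = top_k(QUERY, MEMORY, k)
    print(k, [(round(score, 3), text) for score, text in hits])
```

As k grows, the lowest similarity among the returned entries can only stay the same or fall, so a larger budget trades coverage for weaker, potentially distracting context. This is consistent with the dilution effect described above for the stronger backend.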

Figure 5: Impact of the refinement budget H on MemMA under (a) GPT-4o-mini and (b) Claude-Haiku-4.5.

Impact of retrieval refinement budget H. We vary the refinement budget H ∈ {0, 1, 2, 3, 4, 5} under both GPT-4o-mini and Claude-Haiku-4.5. The results of MemMA_SA and MemMA_LM are reported in Figure 5. We observe that ACC improves sharply from one-shot retrieval (H=0) to a small H and then declines. For example, MemMA_SA's ACC rises from 78.95 at H=0 to 85.53 at H=2, then drops back to 81.58 at H=4. This shows that diagnosis-guided refinement converges quickly: one or two additional retrieval rounds suffice to close most information gaps, while further iterations risk retrieval drift. This supports the effectiveness of the Meta-Thinker's answerability diagnosis, which directs each refinement step toward the specific missing evidence rather than redundant searches. Further analysis of the impact of the probe-generation model is in the Appendix.
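The role of the budget H can be sketched as a bounded loop around a retriever and an answerability check. This is a simplified illustration, not the MemMA implementation: `retrieve` and `diagnose` are hypothetical callables standing in for the storage backend and for the Meta-Thinker and Query Reasoner, and the toy memory is invented.

```python
from typing import Callable, List, Optional, Tuple


def iterative_retrieve(
    question: str,
    retrieve: Callable[[str], List[str]],
    diagnose: Callable[[str, List[str]], Optional[str]],
    budget_h: int = 3,
) -> Tuple[List[str], int]:
    """One-shot retrieval followed by at most budget_h refinement rounds.

    diagnose returns a follow-up query targeting the missing evidence,
    or None once the question is judged answerable.
    """
    evidence = list(retrieve(question))  # H = 0 corresponds to this step only
    rounds = 0
    while rounds < budget_h:
        follow_up = diagnose(question, evidence)
        if follow_up is None:
            break  # answerable: stop early instead of spending the budget
        for item in retrieve(follow_up):
            if item not in evidence:
                evidence.append(item)
        rounds += 1
    return evidence, rounds


# Toy stand-ins (assumptions for illustration only).
MEMORY = {"dog": ["Alice adopted a dog in May"], "name": ["The dog is named Rex"]}


def toy_retrieve(query: str) -> List[str]:
    return [e for key, entries in MEMORY.items() if key in query.lower() for e in entries]


def toy_diagnose(question: str, evidence: List[str]) -> Optional[str]:
    return None if any("Rex" in e for e in evidence) else "dog name"


print(iterative_retrieve("What did Alice call her dog?", toy_retrieve, toy_diagnose))
```

The early exit reflects the finding above that one or two rounds usually suffice: once the diagnosis reports no remaining gap, extra rounds would only add entries that are not needed.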

5.5 Case Studies

We conduct a case study to better understand why MemMA improves long-horizon QA. Our findings indicate that: (i) on the forward path, construction-time Meta-Thinker guidance determines whether answer-bearing details survive in memory, while diagnosis-guided iterative retrieval determines whether missing evidence is surfaced before the system commits to an answer. Importantly, iterative retrieval cannot compensate for details that were never preserved during construction. The cases also show that the retrieval controller and the storage backend play distinct roles: the Meta-Thinker and Query Reasoner identify the information gap, while the backend determines whether the required evidence can actually be recovered; (ii) on the backward path, in-situ self-evolution converts local probe failures into targeted memory repairs that transfer to downstream QA, for example by inserting missing named entities, sharpening vague event descriptions, and completing partial evidence clusters. Detailed examples are in the Appendix.
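The backward-path behavior described in (ii) can be sketched as a probe-and-repair pass over a provisional memory. The sketch makes several simplifying assumptions that are not taken from the paper: verification is a case-insensitive substring match, the only repair action is inserting a new entry, and `answer_fn` and `repair_fn` are hypothetical stand-ins for the Answer Agent and the repair step.

```python
from typing import Callable, List, Tuple


def self_evolve(
    memory: List[str],
    probes: List[Tuple[str, str]],
    answer_fn: Callable[[str, List[str]], str],
    repair_fn: Callable[[str, str], str],
) -> Tuple[List[str], List[str]]:
    """Verify a provisional memory against probe QA pairs and repair failures.

    Returns the repaired memory and the list of probe questions that failed.
    """
    repaired = list(memory)  # do not mutate the provisional memory
    failures = []
    for question, expected in probes:
        predicted = answer_fn(question, repaired)
        # Assumption: substring match stands in for the real verification step.
        if expected.lower() not in predicted.lower():
            failures.append(question)
            repaired.append(repair_fn(question, expected))  # insertion-only repair
    return repaired, failures


# Toy stand-ins (assumptions for illustration only).
def toy_answer(question: str, memory: List[str]) -> str:
    return " ".join(memory)  # "answers" with everything currently stored


def toy_repair(question: str, expected: str) -> str:
    return f"{question} {expected}"


provisional = ["Alice adopted a dog in May"]
probes = [("When did Alice adopt a dog?", "May"), ("What is the dog's name?", "Rex")]
print(self_evolve(provisional, probes, toy_answer, toy_repair))
```

In this toy run the first probe passes and the second fails, so the missing named entity is written into memory before it would be committed, mirroring the "inserting missing named entities" case mentioned above.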

6. Conclusion

We introduce MemMA, a plug-and-play multi-agent framework that coordinates the memory cycle along its forward and backward paths. On the forward path, a Meta-Thinker separates strategic reasoning from low-level execution, addressing strategic blindness in construction and retrieval. On the backward path, in-situ self-evolution converts probe QA failures into direct memory repair before the memory is committed. Experiments on LoCoMo show that MemMA outperforms all baselines across multiple backbones and consistently improves three different storage backends.

7. Limitations

Our evaluation focuses on a dialogue-centric long-horizon memory benchmark. While LoCoMo covers diverse question types, including single-hop, multi-hop, temporal, and open-domain reasoning, it does not capture all settings in which persistent memory may be needed.

In addition, the backward path assumes that interaction streams can be organized into sessions and that synthetic probe QA can provide useful localized supervision. These assumptions are natural for the benchmark studied here, but may require adaptation in settings with less clear session boundaries or more open-ended interaction structure.

8. Ethics Statement

This work studies long-horizon memory management for LLM agents. All experiments are conducted on the publicly available LoCoMo benchmark, which consists of synthetic conversations and does not contain real user data. No personally identifiable information is collected, stored, or processed in this work. We note that improving memory quality in agent systems may raise broader considerations for real-world deployment, including user privacy, informed consent for data retention, controllability over stored memories, and the risk of persisting incorrect information through automated repair. While these concerns are beyond the scope of the present study, we believe they should be treated as first-class design requirements in any production deployment of memory-augmented agents.
