Mem-T: Densifying Rewards for Long-Horizon Memory Agents

Abstract

Memory agents, which depart from predefined memory-processing pipelines by endogenously managing the processing, storage, and retrieval of memories, have garnered increasing attention for their autonomy and adaptability. However, existing training paradigms remain constrained: agents often traverse long-horizon sequences of memory operations before receiving sparse and delayed rewards, which hinders truly end-to-end optimization of memory management policies. To address this limitation, we introduce Mem-T, an autonomous memory agent that interfaces with a lightweight hierarchical memory database to perform dynamic updates and multi-turn retrieval over streaming inputs. To effectively train long-horizon memory management capabilities, we further propose MoT-GRPO, a tree-guided reinforcement learning framework that transforms sparse terminal feedback into dense, step-wise supervision via memory operation tree backpropagation and hindsight credit assignment, thereby enabling the joint optimization of memory construction and retrieval. Extensive experiments demonstrate that Mem-T is high-performing, surpassing frameworks such as A-Mem and Mem0 by up to 14.92%, and economical, operating on a favorable accuracy-efficiency Pareto frontier and reducing inference tokens per query by 24.45% relative to GAM without sacrificing performance.

Figure 1: The paradigm comparison between the previous trainable memory agent and Mem-T.

1. Introduction

As Large Language Models (LLMs) rapidly evolve into powerful AI agents, they have achieved significant success across various fields. However, constrained by the finite context windows of foundation models, AI agents face inherent challenges with long-term inconsistency and context forgetting during extended multi-turn interactions. As a promising frontier, memory systems dynamically construct and leverage memories from historical interactions, thereby sustaining temporal coherence and long-term intelligence beyond finite context windows, and have consequently emerged as a core component of modern agentic systems.

Tracing the evolution of memory systems, early frameworks such as MemGPT, Mem0, and A-Mem predominantly rely on hand-crafted prompts and heuristic rules to guide frozen LLMs in populating predefined memory structures. As a result, their performance is inherently bounded by the base model's instruction-following capacity and rigid human priors, often leading to suboptimal outcomes. By contrast, recent approaches such as Memory-R1, Mem-α, and MemTool employ reinforcement learning (e.g., GRPO) to train LLMs into adaptive policies for dynamic memory curation and retrieval, commonly referred to as memory agents. This shift constitutes a fundamental paradigm change, recasting memory management from static instruction adherence into a problem of adaptive policy optimization.

However, current paradigms for training memory agents remain fundamentally constrained by temporal credit assignment, the challenge of attributing sparse and delayed rewards to causative actions along long-horizon memory operation sequences. This limitation is particularly acute in memory-centric tasks, where agents may execute hundreds of memory operations across 500 turns within million-token contexts before receiving a binary 0/1 reward derived from sporadic QA accuracy signals. Existing approaches fail to bridge this gap, as they indiscriminately propagate the sparse terminal reward across all memory operations without dense supervision or process-level attribution. Consequently, this extreme sparsity impedes effective optimization of the full memory operation trajectory. To put it more formally:

Research Question. How can we implement a fully trainable memory agent framework that jointly optimizes memory construction and retrieval, supervised with dense rewards and accurate process-level attribution?

To address this challenge, we introduce Mem-T, a streamlined hierarchical memory agent optimized under a process-supervised, attribution-centric training paradigm termed Memory Operation Tree-guided GRPO (MoT-GRPO). Functionally, Mem-T integrates three core capabilities: (i) formation and (ii) evolution operations that maintain and refine the hierarchical memory database over dynamic information streams, and (iii) a retrieval operation that conducts multi-turn, autonomous search to provide accurate memory clues. To jointly optimize these components, MoT-GRPO employs a dual-track training mechanism integrating memory retrieval and construction. To refine memory retrieval, it constructs multiple Memory operation Trees (MoT) to explore diverse trajectories, leveraging the branching topology to back-propagate sparse outcome rewards to intermediate nodes, thereby generating dense process-level signals and identifying critical search paths. To refine memory construction, the utility of the MoT is explicitly attributed back to source memory items via hindsight credit assignment, supervising the corresponding formation and evolution operations. This paradigm effectively mitigates reward sparsity and attribution ambiguity, rendering memory interactions both interpretable and learnable. Our contributions can be summarized as:

  • Unified Memory Framework. We propose Mem-T, a streamlined memory management agent with a hierarchical architecture that integrates factual, experiential, and working memory, and agentically orchestrates the full lifecycle of memory operations.
  • Tree-Guided Optimization. We present MoT-GRPO, a memory operation tree-based paradigm that tackles temporal credit assignment via node-wise reward backpropagation and hindsight credit assignment. By transforming sparse terminal rewards into dense supervision for intermediate operations, it enables the joint optimization of memory formation, evolution, and retrieval.
  • Experimental Evaluation. Comprehensive evaluations on four memory benchmarks demonstrate that Mem-T achieves state-of-the-art performance while maintaining a superior Pareto frontier, delivering F1 gains of up to 14.92 points over baselines such as A-Mem and reducing inference tokens per query by 24.45% relative to GAM.

Table 1: Comparison of memory agent systems. ×: not included; ◇: included but heuristic-based; ◆: included and trainable. Abbreviations: Fact. = Factual Memory, Exp. = Experiential Memory, Work. = Working Memory, Form. = Memory Formation, Evol. = Memory Evolution, Retr. = Memory Retrieval, Proc. Attr. = Process Attribution.
Method | Functions (Fact., Exp., Work.) | Operations (Form., Evol., Retr.) | Proc. Attr.
MemAgent××××
Context-Folding××××
Memory-R1×××

Mem-α××
MemSearcher×××××
LightSearcher××××
Mem-T

2. Related Work

Memory Agent Architectures. In recent years, memory agents have advanced rapidly, evolving from heuristic-based systems such as MemoryBank and MemGPT to more agentic architectures, including Mem0, MemOS, and A-Mem. Functionally, prior work spans three categories: (I) Factual Memory, preserving declarative knowledge for long-term consistency; (II) Experiential Memory, distilling experience from trajectories to support continual self-improvement; and (III) Working Memory, managing dynamic context for ongoing tasks. Operationally, the memory lifecycle comprises (I) Formation, transforming raw context into high-value memory; (II) Evolution, integrating new insights with the existing memory store; and (III) Retrieval, performing accurate retrieval from the memory base. As shown in Table 1, our Mem-T, despite its streamlined design, spans all three functional classes and all three operational stages.

Reinforcement Learning for Memory Agents. As memory systems scale in complexity, the efficacy of foundation models in managing memory increasingly becomes the primary performance bottleneck. Consequently, reinforcement learning (RL) has emerged as a central paradigm for endowing LLMs with adaptive memory management capabilities. Current research spans a broad spectrum, from short-term working memory to long-term factual and experiential memory. Working Memory. RL has been used to enable agents to autonomously manage execution context within a single task, particularly in settings such as deep research and web browsing. Long-term Factual Memory. Prior work targets different stages of memory management: Memory-R1 emphasizes memory evolution, Mem-α addresses both formation and evolution, and MemSearcher focuses on training agents to exploit retrieval tools. Long-term Experiential Memory. Methods such as LightSearcher and MemRL improve the acquisition, refinement, and reuse of skills over time. Despite these advances, RL-based approaches remain limited by sparse rewards and temporal credit assignment in long-horizon settings, hindering effective optimization across the full memory construction and utilization pipeline, as shown in Table 1.

Figure 2: The overall framework of Mem-T.

3. Method

3.1 Mem-T Workflow

Hierarchical Memory Definition. We consider the agent interacting with a continuous information stream $X=\{x_1,x_2,\dots,x_T\}$. At each time step $t$, corresponding to the processing of the current chunk $x_t$, the system maintains a hierarchical memory state $\mathcal{M}_t$:

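$$\mathcal{M}_t = \langle \mathcal{M}_t^{\text{work}},\ \mathcal{M}_t^{\text{fact}},\ \mathcal{M}_t^{\text{exp}},\ \mathcal{M}_t^{\text{raw}} \rangle.$$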

Within this hierarchy, Working Memory ($\mathcal{M}_t^{\text{work}}$) iteratively updates a concise summary at each step, maintaining within-episode coherence. The long-term memory consists of three modules: Factual Memory ($\mathcal{M}_t^{\text{fact}}$) stores declarative knowledge, Experiential Memory ($\mathcal{M}_t^{\text{exp}}$) captures procedural knowledge, and Raw Memory ($\mathcal{M}_t^{\text{raw}}$) archives raw data across sessions. Formally, we have:

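$$
\begin{aligned}
\mathcal{M}_t^{\text{fact}} &= \{\, m_i^{\text{fact}} \mid m_i^{\text{fact}} = (f_i,\ t_i^{\text{start}},\ t_i^{\text{end}}) \,\}_{i=1}^{N_f},\\
\mathcal{M}_t^{\text{exp}} &= \{\, m_j^{\text{exp}} \mid m_j^{\text{exp}} = (e_j,\ t_j^{\text{start}},\ t_j^{\text{end}}) \,\}_{j=1}^{N_e},\\
\mathcal{M}_t^{\text{raw}} &= \{\, m_l^{\text{raw}} \mid m_l^{\text{raw}} = (x_l,\ t_l^{\text{raw}}) \,\}_{l=1}^{t}.
\end{aligned}
$$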

where each $m^{(\cdot)}$ represents an atomic memory unit. Specifically, $f_i$ and $e_j$ represent concrete facts and strategies, respectively, bound by validity time windows $[t^{\text{start}}, t^{\text{end}}]$.

Memory Operation Pipeline. Building upon this hierarchical memory, we formulate the agent's interaction as a dual-track decision process, comprising continuous memory construction and on-demand memory utilization.

Phase I: Continuous Memory Construction. As the agent processes the current chunk $x_t$ of the input stream, it proactively constructs new memory candidates via the memory formation policy $\pi_{\text{form}}$. This policy scans the raw input to identify salient information and operates on the formation action space $\mathcal{A}_{\text{form}}=\{\text{CrtFact},\text{CrtExp},\text{CrtRaw},\text{UpdWork}\}$. Here, CrtFact, CrtExp, and CrtRaw extract atomic declarative facts, procedural strategies, and raw data, respectively, while UpdWork updates the session-level working summary. Formally, the formation process is defined as:

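$$a_{\text{form}} \sim \pi_{\text{form}}(x_t,\ \mathcal{M}_t^{\text{work}}),\quad a_{\text{form}} \subseteq \mathcal{A}_{\text{form}},\quad \mathcal{M}_t^{\text{cand}} = \{\, m \mid m = \text{Execute}(op),\ op \in a_{\text{form}} \,\}.$$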

where $\mathcal{M}_t^{\text{cand}}$ denotes the set of candidate memories extracted from $x_t$. For each candidate $m \in \mathcal{M}_t^{\text{cand}}$, the memory evolution policy $\pi_{\text{evol}}$ integrates it into $\mathcal{M}_t$. Specifically, the policy considers memories in $\mathcal{M}_t$ that are relevant to $m$, and samples an evolution action $a_{\text{evol}} \sim \pi_{\text{evol}}(m, \mathcal{M}_t)$ from the action space $\mathcal{A}_{\text{evol}}=\{\text{ADD},\text{UPDATE},\text{DELETE},\text{IGNORE}\}$. Collectively, these actions define the sets of memories to be added to ($\Delta^{+}$) and removed from ($\Delta^{-}$) the memory store:

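$$
\begin{aligned}
\Delta^{+} &= \{\, m \mid a_{\text{evol}} = \text{ADD} \,\} \cup \{\, m_{\text{refined}} \mid a_{\text{evol}} = \text{UPDATE} \,\},\\
\Delta^{-} &= \{\, m_{\text{target}} \mid a_{\text{evol}} = \text{DELETE} \,\} \cup \{\, m_{\text{old}} \mid a_{\text{evol}} = \text{UPDATE} \,\}.
\end{aligned}
$$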

Consequently, the memory store is updated accordingly:

$$\mathcal{M}_{t+1} = (\mathcal{M}_t \setminus \Delta^{-}) \cup \Delta^{+}.$$
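The update rule above can be sketched with plain Python sets. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Memory` class, its field names, and the `(action, candidate, target)` decision format are hypothetical, and in the real system the decisions would come from the trained evolution policy rather than a hand-written list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    """Hypothetical atomic memory unit; the field names are illustrative."""
    key: str
    text: str


def apply_evolution(store, decisions):
    """Apply one batch of evolution decisions to a memory store.

    `decisions` holds (action, candidate, target) triples, where `target`
    is the existing entry touched by UPDATE or DELETE (None otherwise).
    """
    added, removed = set(), set()
    for action, candidate, target in decisions:
        if action == "ADD":
            added.add(candidate)
        elif action == "UPDATE":
            removed.add(target)   # m_old leaves the store
            added.add(candidate)  # m_refined replaces it
        elif action == "DELETE":
            removed.add(target)
        elif action != "IGNORE":
            raise ValueError(f"unknown evolution action: {action}")
    # Next store = (current store minus removed entries) union added entries
    return (store - removed) | added


old = Memory("home", "Alice lives in Paris")
new = Memory("home", "Alice lives in Rome")
pet = Memory("pet", "Alice adopted a cat")
store = apply_evolution({old}, [("UPDATE", new, old), ("ADD", pet, None)])
```

Collecting both sets first and applying them in a single step mirrors the equation: removals and additions for a chunk are resolved together rather than one action at a time.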

Phase II: On-Demand Memory Retrieval. Based on the constructed memory store $\mathcal{M}_t$, when a query $q$ arises, the agent employs multi-turn retrieval to respond. During this process, the memory retrieval policy $\pi_{\text{retr}}$ selects actions from the retrieval action space $\mathcal{A}_{\text{retr}}$, which includes queries for each memory module and a terminal signal:

$$\mathcal{A}_{\text{retr}} = \{\, \text{Search}(r, key, topk) \mid r \in \mathcal{M}_t \,\} \cup \{\text{Finish}\}.$$

where $r$ is the memory type to be retrieved, $key$ is the retrieval query, and $topk$ is the number of items to return. Unlike single-step retrieval, $\pi_{\text{retr}}$ operates as a sequential decision policy. At each step $k$, conditioned on the query $q$ and the history context $h_{k-1}$, which consists of the retrieved relevant memory set $\mathcal{M}_{k-1}^{\text{rel}}$ and the reasoning state $z_{k-1}$, the policy samples an action $a_k$:

$$a_k \sim \pi_{\text{retr}}(q,\ \mathcal{M}_t,\ h_{k-1}),\quad a_k \in \mathcal{A}_{\text{retr}}.$$

This iterative process accumulates the relevant memory set $\mathcal{M}^{\text{rel}}$ by aggregating the observations from each search step. Finally, the loop terminates when the policy selects the Finish action, signaling that the gathered information is sufficient to support the final answer $y \sim P_\theta(q, \mathcal{M}^{\text{rel}})$.
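The retrieval loop can be sketched as follows. This is a toy illustration under stated assumptions: `keyword_search` is a hypothetical stand-in for embedding-based search, `scripted_policy` replaces the learned policy with a fixed plan, and the step budget of 6 is taken from the inference setting reported in the implementation details.

```python
def keyword_search(entries, key, topk):
    """Toy stand-in for embedding search: rank entries by shared words."""
    words = set(key.lower().split())
    scored = [(len(words & set(e.lower().split())), e) for e in entries]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [e for _, e in ranked[:topk]]


def retrieve(query, store, policy, max_steps=6):
    """Run Search actions until the policy emits Finish or the budget ends."""
    relevant, history = [], []
    for _ in range(max_steps):
        action = policy(query, history)
        if action[0] == "Finish":
            break
        _, module, key, topk = action
        hits = keyword_search(store[module], key, topk)
        for hit in hits:  # accumulate the relevant set without duplicates
            if hit not in relevant:
                relevant.append(hit)
        history.append((action, hits))
    return relevant


def scripted_policy(query, history):
    """Hypothetical fixed plan standing in for the trained retrieval policy."""
    plan = [("Search", "fact", "Alice city", 5),
            ("Search", "raw", "Alice moved", 5)]
    return plan[len(history)] if len(history) < len(plan) else ("Finish",)


store = {
    "fact": ["Alice lives in the city of Rome", "Bob likes tea"],
    "raw": ["Turn 12: Alice moved last spring"],
}
clues = retrieve("Where does Alice live?", store, scripted_policy)
```

The accumulated `clues` list plays the role of the relevant memory set that is handed to the answer model once the loop ends.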

3.2 MoT-GRPO for Memory Retrieval

In long-horizon scenarios, memory operation chains become extremely long, making credit assignment and reward sparsity major challenges. To address these issues, we propose Memory Operation Tree GRPO (MoT-GRPO), inspired by prior RL methods.

Memory Operation Tree Construction. In the retrieval phase, to achieve efficient rollout generation while obtaining dense intermediate signals, we employ an Iterative Branching Rollout to construct the Memory Operation Tree (MoT). Formally, we define a node in the MoT as a tuple $v=\langle a_{\text{retr}}, h \rangle$, representing a specific operation $a_{\text{retr}} \in \mathcal{A}_{\text{retr}}$ and the reasoning context $h$.

For each query, we initialize an ensemble of $G$ independent MoTs $\{\mathcal{T}_0^{(i)}\}_{i=1}^{G}$. Each tree $\mathcal{T}_0^{(i)}$ initially contains a single seed trajectory $\tau^{(i)}$, obtained by a full rollout from the root state $(q, \mathcal{M}_t, h_0=\emptyset)$:

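$$\mathcal{T}_0^{(i)} = \{\tau^{(i)}\},\ i=1,\dots,G,\quad \tau^{(i)} = \big(v_1^{(i)}, v_2^{(i)}, \dots, v_{L_i}^{(i)}\big),\quad v_k^{(i)} = \langle a_k^{(i)}, h_k^{(i)} \rangle.$$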

Subsequently, we iteratively densify each $\mathcal{T}^{(i)}$ over $M$ expansion rounds. In each expansion round $j \in \{1,\dots,M\}$, we stochastically sample $N_v$ non-terminal pivot nodes $\{v_n\}_{n=1}^{N_v}$ from each tree $\mathcal{T}_{j-1}^{(i)}$. For each pivot node $v$ and its corresponding context history $h_v$, the policy executes a new rollout to generate a branch trajectory $\tau_{\text{branch}}$:

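$$\tau_{\text{branch}} \sim \pi_{\text{retr}}(q,\ \mathcal{M}_t,\ h_v),\quad \tau_{\text{new}} = \text{Path}(v) \oplus \tau_{\text{branch}}.$$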

The newly generated trajectories are then grafted onto the tree, updating its state to $\mathcal{T}_j^{(i)}$. After $M$ rounds, this process yields a final ensemble of $G$ MoTs $\{\mathcal{T}_M^{(i)}\}_{i=1}^{G}$.
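A compact sketch of the iterative branching rollout is given below. It is illustrative only: the `Node` class, the `sample_action` callback (a stand-in for the retrieval policy), and the choice to exclude the root state from pivot sampling are assumptions of this sketch, not details confirmed by the text.

```python
import random


class Node:
    """One retrieval operation; the root (action None) is the initial state."""

    def __init__(self, action, parent=None):
        self.action, self.parent, self.children = action, parent, []
        self.depth = 0 if parent is None else parent.depth + 1
        if parent is not None:
            parent.children.append(self)


def rollout(start, sample_action, max_depth):
    """Extend the tree below `start` until Finish or the depth limit."""
    node = start
    while node.depth < max_depth:
        node = Node(sample_action(node), node)
        if node.action == "Finish":
            break
    return node


def iter_nodes(root):
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(node.children)


def build_tree(sample_action, rounds, n_pivots, max_depth, rng):
    root = Node(None)
    rollout(root, sample_action, max_depth)  # seed trajectory
    for _ in range(rounds):
        # Pivots are non-terminal operation nodes that can still be extended.
        pivots = [n for n in iter_nodes(root)
                  if n is not root and n.action != "Finish"
                  and n.depth < max_depth]
        for pivot in rng.sample(pivots, min(n_pivots, len(pivots))):
            rollout(pivot, sample_action, max_depth)  # graft a new branch
    return root


def leaves(root):
    return [n for n in iter_nodes(root) if not n.children]


tree = build_tree(lambda node: "Search", rounds=1, n_pivots=3, max_depth=4,
                  rng=random.Random(0))
```

Because every branch shares the prefix `Path(v)` with its pivot, sibling subtrees differ only in what happens after the pivot, which is what later makes within-tree comparisons informative.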

Node-wise Reward Backpropagation. Instead of relying solely on sparse terminal rewards, we assign a dense reward $R(v)$ to every node $v$, synthesizing immediate retrieval quality with expected future success. Formally, for a node $v$ with retrieved memories $\mathcal{M}_v^{\text{rel}}$, we define the reward as:

$$R(v) = \mathbb{I}_{\text{fmt}}(v) \cdot \big( \alpha \cdot \text{Evid}(v) + \text{Perform}(v) \big).$$

Here, $\mathbb{I}_{\text{fmt}}(v)$ serves as a binary validity mask ensuring syntactic correctness of tool invocations; $\text{Evid}(v)$ measures the immediate evidence density, calculated as the proportion of ground-truth evidence retrieved in $\mathcal{M}_v^{\text{rel}}$, and is weighted by the coefficient $\alpha$; and $\text{Perform}(v)$ denotes the expected terminal performance of node $v$. For a leaf node, it is defined as the answer quality measured by the F1 score or accuracy. For an internal node, it is computed as the average $\text{Perform}(\cdot)$ over all its child nodes $\text{Ch}(v)$:

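$$
\text{Perform}(v) =
\begin{cases}
\text{F1}(v), & v \in V_{\text{leaf}},\\[4pt]
\dfrac{1}{|\text{Ch}(v)|} \sum_{u \in \text{Ch}(v)} \text{Perform}(u), & \text{otherwise}.
\end{cases}
$$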

This formulation ensures that a node earns a high reward only if it adheres to a valid format, retrieves relevant evidence, and leads to high-quality outcomes.
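The recursion can be made concrete with a few lines of Python. This is a sketch under stated assumptions: nodes are plain dictionaries with hypothetical keys (`fmt_ok`, `evid`, `f1`, `children`), and `alpha` is passed as a parameter because its value is not given in this section.

```python
def perform(node):
    """Expected terminal performance: leaf F1, or the mean over children."""
    children = node.get("children", [])
    if not children:
        return node["f1"]
    return sum(perform(child) for child in children) / len(children)


def reward(node, alpha):
    """Dense node reward: format mask times (alpha * evidence + performance)."""
    fmt_mask = 1.0 if node["fmt_ok"] else 0.0
    return fmt_mask * (alpha * node["evid"] + perform(node))


good_leaf = {"fmt_ok": True, "evid": 1.0, "f1": 1.0}
bad_leaf = {"fmt_ok": True, "evid": 0.0, "f1": 0.0}
pivot = {"fmt_ok": True, "evid": 0.4, "children": [good_leaf, bad_leaf]}
malformed = {"fmt_ok": False, "evid": 1.0, "f1": 1.0}
```

In this toy tree the pivot's performance is the mean of one successful and one failed leaf, so its reward reflects both its own evidence and the mixed outcomes beneath it, while the malformed node is zeroed out by the format mask regardless of its answer quality.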

Dual-Scale Advantage Estimation. To enable tree-based credit assignment, we perform grouped advantage estimation at both the intra-tree and inter-tree levels. The Intra-Tree Advantage $A_{\text{intra}}(v)$ evaluates the relative quality of nodes within the same tree. For a node $v$ in tree $\mathcal{T}^{(i)}$, we standardize $R(v)$ using the mean $\mu_{\text{intra}}^{(i)}$ and standard deviation $\sigma_{\text{intra}}^{(i)}$ derived from that specific tree:

$$A_{\text{intra}}(v) = \frac{R(v) - \mu_{\text{intra}}^{(i)}}{\sigma_{\text{intra}}^{(i)} + \epsilon}.$$

Simultaneously, to capture each node's global advantage, we compute the Inter-Tree Advantage $A_{\text{inter}}(v)$ against the global mean $\mu_{\text{global}}$ and standard deviation $\sigma_{\text{global}}$ across the entire ensemble $\{\mathcal{T}^{(i)}\}_{i=1}^{G}$:

$$A_{\text{inter}}(v) = \frac{R(v) - \mu_{\text{global}}}{\sigma_{\text{global}} + \epsilon}.$$

The final advantage $A_{\text{total}}(v)$ combines the two perspectives by summation:

$$A_{\text{total}}(v) = A_{\text{intra}}(v) + A_{\text{inter}}(v).$$

Through this dual-scale design, the intra-tree advantage supports reliable local comparisons among nodes that share similar contexts, and thus effective credit assignment that identifies the nodes critically influencing the final outcome. Meanwhile, the inter-tree advantage encourages cross-tree competition, guiding the optimization toward globally high-quality solutions.
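The two standardizations can be sketched with the standard library. This is a minimal illustration, assuming a population standard deviation and `eps = 1e-6` (neither is specified in this section), and representing each tree simply as a list of its node rewards.

```python
from statistics import mean, pstdev


def dual_scale_advantages(tree_rewards, eps=1e-6):
    """Return, per tree, the sum of intra-tree and inter-tree advantages."""
    all_rewards = [r for tree in tree_rewards for r in tree]
    mu_global, sd_global = mean(all_rewards), pstdev(all_rewards)
    advantages = []
    for tree in tree_rewards:
        mu_tree, sd_tree = mean(tree), pstdev(tree)
        advantages.append([
            (r - mu_tree) / (sd_tree + eps)          # intra-tree term
            + (r - mu_global) / (sd_global + eps)    # inter-tree term
            for r in tree
        ])
    return advantages


# Tree 0 mixes a failed and a successful node; tree 1 succeeds everywhere.
adv = dual_scale_advantages([[0.0, 1.0], [1.0, 1.0]])
```

In this example the nodes of the uniformly successful tree get no intra-tree signal (they are indistinguishable locally) but still receive a positive inter-tree term, while the successful node in the mixed tree is rewarded on both scales.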

Optimization Objective. Following the GRPO paradigm, we directly utilize the dual-scale advantage $A_{\text{total}}(v)$ to optimize the retrieval policy $\pi_\theta$ by maximizing:

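$$J(\theta) = \mathbb{E}_{q \sim \mathcal{D}} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|\mathcal{T}^{(i)}|} \sum_{v \in \mathcal{T}^{(i)}} \Big( L_{\text{clip}} - \beta\, D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}) \Big) \right].$$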

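where the KL penalty, weighted by the coefficient $\beta$, keeps $\pi_\theta$ close to the reference policy $\pi_{\text{ref}}$. The core term $L_{\text{clip}}$ applies standard PPO clipping to the probability ratio $\rho_{v,t}(\theta)=\pi_\theta(a_{v,t}\mid\cdot)/\pi_{\theta_{\text{old}}}(a_{v,t}\mid\cdot)$: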

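$$L_{\text{clip}} = \min\Big( \rho_{v,t}(\theta)\, A_{\text{total}}(v),\ \text{clip}\big(\rho_{v,t}(\theta),\ 1-\epsilon,\ 1+\epsilon\big)\, A_{\text{total}}(v) \Big).$$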
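For a single token, the clipped term behaves as in the sketch below. The clip range `clip_eps = 0.2` is an assumed default (a common PPO choice), not a value reported in this section, and the scalar `ratio` stands in for the probability ratio that a real implementation would compute from log-probabilities.

```python
def clipped_term(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one token of one tree node."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# A positive advantage stops paying off once the ratio exceeds 1 + clip_eps.
gain_capped = clipped_term(ratio=1.5, advantage=1.0)
# A negative advantage keeps the more pessimistic (clipped) value.
loss_kept = clipped_term(ratio=0.5, advantage=-1.0)
```

Taking the minimum of the clipped and unclipped products makes the surrogate a pessimistic bound, so large policy changes on a high-advantage node cannot inflate the objective.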

3.3 MoT-GRPO for Memory Construction

Unlike retrieval, memory construction spans hundreds of steps with rewards delayed until downstream queries arrive, and any single construction action is irrelevant to most queries, resulting in severe credit assignment ambiguity. To address this, we propose Hindsight Credit Assignment, which back-propagates advantage signals from downstream retrieval trajectories to upstream construction actions.

Hindsight Credit Assignment. Let $a_{\text{mem}}$ be a memory operation processing source turns $X_{\text{src}}$ to produce a memory entry $m$. For a query $q$ with ground-truth evidence $X_{\text{evi}}^{q}$, we define the hindsight score $S(a_{\text{mem}})$ by aggregating the advantages $A_{\text{total}}(v_L)$ of the terminal leaf nodes $v_L \in V_{\text{leaves}}$:

$$S(a_{\text{mem}}) = \frac{1}{|V_{\text{leaves}}|} \sum_{v_L \in V_{\text{leaves}}} A_{\text{total}}(v_L) \cdot \varrho(a_{\text{mem}}, v_L).$$

The credit coefficient $\varrho$ integrates two distinct signals:

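$$\varrho(a_{\text{mem}}, v_L) = \underbrace{\mathbb{I}\big(X_{\text{src}} \cap X_{\text{evi}}^{q} \neq \emptyset\big)}_{\text{Evidence Alignment Gate}} + \lambda \cdot \underbrace{\mathbb{I}\big(m \in \mathcal{M}_{v_L}\big)}_{\text{Retrieval Trace Gate}}.$$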

The Evidence Alignment Gate attributes credit by linking the construction quality of the ground-truth evidence turns $X_{\text{evi}}^{q}$ to answer accuracy. It posits that successful reasoning is fundamentally rooted in the effective transformation of ground-truth evidence into memory. Thus, the advantage of a final answer serves as a proxy to evaluate the construction of these pivotal source turns. In contrast, the Retrieval Trace Gate (weighted by $\lambda=0.1$) captures the empirical utility of an entry $m$ retrieved within the actual retrieval tree. It recognizes that any memory entry $m$ appearing in the memories $\mathcal{M}_{v_L}$ along the terminal path objectively modulates the model's decision-making, and it rewards the construction process for its functional contribution to a successful trajectory. Notably, in the absence of ground-truth evidence, the mechanism naturally falls back on the Retrieval Trace Gate, maintaining robust generalization across diverse datasets.
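The hindsight score can be sketched as below. Two points are assumptions of this sketch rather than statements from the text: the evidence gate is read as "the source turns overlap the ground-truth evidence turns", and each leaf is a dictionary with hypothetical keys `advantage` and `retrieved` (the ids of memory entries seen on its path). The weight `lam = 0.1` follows the value quoted above.

```python
def hindsight_score(src_turns, entry_id, evidence_turns, leaves, lam=0.1):
    """Average leaf advantage, weighted by the two credit gates."""
    # Evidence Alignment Gate: did this operation process an evidence turn?
    evidence_gate = 1.0 if set(src_turns) & set(evidence_turns) else 0.0
    total = 0.0
    for leaf in leaves:
        # Retrieval Trace Gate: was the produced entry retrieved on this path?
        trace_gate = 1.0 if entry_id in leaf["retrieved"] else 0.0
        total += leaf["advantage"] * (evidence_gate + lam * trace_gate)
    return total / len(leaves)


leaves = [
    {"advantage": 1.0, "retrieved": {"m1"}},
    {"advantage": -0.5, "retrieved": set()},
]
with_evidence = hindsight_score([3, 4], "m1", [4], leaves)
without_evidence = hindsight_score([3, 4], "m1", [], leaves)
```

When no ground-truth evidence is annotated, the first gate is always zero and only the retrieval trace contributes, which matches the fallback behavior described above.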

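Table 2: Performance comparison on the LoCoMo benchmark, using F1 and BLEU-1 (B1) as evaluation metrics. †: the GAM paper recommends gpt-4o-mini; we also reproduce it with Qwen3-4B for a fair comparison. ‡: since Memory-R1 is not open-sourced, we faithfully report the results from its original paper.

| Method | Base LLM | Single-Hop F1↑ | Single-Hop B1↑ | Multi-Hop F1↑ | Multi-Hop B1↑ | Temporal F1↑ | Temporal B1↑ | Open Domain F1↑ | Open Domain B1↑ | Overall F1↑ | Overall B1↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Training-free Methods | | | | | | | | | | | |
| VANILLA | Qwen3-4B | 40.68 | 31.54 | 23.23 | 16.76 | 18.97 | 13.42 | 13.87 | 10.70 | 31.50 | 23.94 |
| RAG | Qwen3-4B | 49.45 | 44.94 | 23.50 | 17.13 | 43.07 | 37.35 | 20.23 | 14.94 | 41.59 | 36.45 |
| MemGPT | Qwen3-4B | 14.00 | 11.77 | 16.68 | 13.99 | 12.56 | 10.94 | 11.61 | 9.16 | 14.05 | 11.84 |
| MemoryBank | Qwen3-4B | 26.65 | 17.72 | 25.52 | 19.44 | 9.15 | 7.44 | 16.42 | 12.39 | 22.34 | 15.66 |
| Mem0 | Qwen3-4B | 47.28 | 40.72 | 35.40 | 27.36 | 46.84 | 39.48 | 26.64 | 21.04 | 43.71 | 36.78 |
| MemoryOS | Qwen3-4B | 48.35 | 42.57 | 35.24 | 27.30 | 40.98 | 32.68 | 22.08 | 17.93 | 42.83 | 36.26 |
| LightMem | Qwen3-4B | 43.78 | 38.84 | 30.78 | 25.80 | 44.71 | 40.72 | 18.93 | 14.42 | 40.01 | 35.27 |
| A-Mem | Qwen3-4B | 44.62 | 38.26 | 27.24 | 21.07 | 43.85 | 35.97 | 15.40 | 12.71 | 39.43 | 33.04 |
| GAM | Qwen3-4B | 32.23 | 25.54 | 32.23 | 28.66 | 26.26 | 22.52 | 18.45 | 14.47 | 30.17 | 24.81 |
| GAM | gpt-4o-mini | 57.75 | 52.10 | 42.29 | 34.44 | 59.45 | 53.11 | 29.73 | 24.74 | 53.48 | 47.33 |
| Trained Methods | | | | | | | | | | | |
| MEM1 | MEM1-7B | 27.48 | 22.10 | 18.98 | 15.56 | 30.52 | 23.48 | 14.21 | 11.43 | 25.68 | 20.50 |
| MemAgent | MemAgent-14B | 35.86 | 29.64 | 27.86 | 22.72 | 37.93 | 31.85 | 20.31 | 16.47 | 33.82 | 27.97 |
| Memory-R1-PPO | Mem-R1-8B | 32.52 | 24.47 | 26.86 | 23.47 | 41.57 | 26.11 | 45.30 | 39.18 | 34.08 | 25.54 |
| Memory-R1-GRPO | Mem-R1-8B | 35.73 | 27.70 | 35.65 | 30.77 | 49.86 | 38.27 | 47.42 | 41.24 | 39.25 | 31.21 |
| Our Method: Mem-T | | | | | | | | | | | |
| w/o training | Qwen3-4B | 53.97 | 49.15 | 38.44 | 31.70 | 53.99 | 48.08 | 26.44 | 23.37 | 49.38 | 44.11 |
| with GRPO | Qwen3-4B | 59.43 | 54.65 | 38.40 | 30.51 | 60.78 | 56.10 | 23.46 | 20.16 | 53.56 | 48.33 |
| with MoT-GRPO | Qwen3-4B | 63.75 | 57.95 | 45.09 | 36.58 | 65.13 | 60.12 | 32.97 | 28.94 | 58.65 | 52.63 |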

Policy Refinement. To optimize memory construction policies, we employ rank-based sampling to curate a high-quality training dataset $\mathcal{D}_{\text{mem}}$. We first discard trajectories with invalid tool invocations. Subsequently, we rank all candidate actions by their hindsight score $S(a_{\text{mem}})$ and retain only the top 50% within each operation category. Finally, treating $\mathcal{D}_{\text{mem}}$ as a collection of expert demonstrations, we train the policies $\pi_\theta$ (encompassing $\pi_{\text{form}}$ and $\pi_{\text{evol}}$) to maximize the log-likelihood of these selected actions:

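$$L_{\text{Off}}(\theta) = -\,\mathbb{E}_{(x,\, a_{\text{mem}}) \sim \mathcal{D}_{\text{mem}}} \big[ \log \pi_\theta(a_{\text{mem}} \mid x,\ \mathcal{M}_t) \big].$$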

This offline optimization effectively distills the "hindsight wisdom" derived from the downstream MoT-GRPO search trees into the forward-looking memory construction policy.
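The rank-based curation step can be sketched as follows. This is a simplified illustration: validity is checked per action rather than per trajectory, the record keys (`category`, `score`, `valid`) are hypothetical, and rounding the 50% cut down while keeping at least one action per category is an assumption of this sketch.

```python
from collections import defaultdict


def select_top_half(actions):
    """Keep the higher-scoring half of valid actions in each category."""
    by_category = defaultdict(list)
    for action in actions:
        if action["valid"]:  # drop invalid tool invocations first
            by_category[action["category"]].append(action)
    kept = []
    for items in by_category.values():
        items.sort(key=lambda a: a["score"], reverse=True)
        kept.extend(items[: max(1, len(items) // 2)])
    return kept


actions = [
    {"category": "CrtFact", "score": 0.9, "valid": True},
    {"category": "CrtFact", "score": 0.1, "valid": True},
    {"category": "CrtFact", "score": 0.5, "valid": True},
    {"category": "CrtFact", "score": 0.7, "valid": False},
    {"category": "CrtFact", "score": 0.3, "valid": True},
    {"category": "ADD", "score": -0.2, "valid": True},
    {"category": "ADD", "score": 0.4, "valid": True},
]
dataset = select_top_half(actions)
```

Ranking within each operation category, rather than globally, keeps rare operation types represented in the curated set even when their scores are systematically lower.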

4. Experiments

4.1 Experimental Setup

Evaluation and Benchmarks. We evaluate the proposed framework across four challenging long-context benchmarks, including LoCoMo, LongMemEval, HotpotQA, and NarrativeQA. LoCoMo and LongMemEval focus on long-term conversational question answering. Following Memory-R1, we use the same training data configuration by splitting the LoCoMo dataset into a 1:1:8 train/validation/test split to ensure a fair comparison. The remaining three benchmarks are treated as out-of-domain datasets to evaluate the generalization ability of our method. Specifically, for HotpotQA, following prior work, we construct long-context inputs by concatenating the gold supporting documents with 400 irrelevant Wikipedia documents. More details about the dataset are in the supplementary experimental setup.

Baselines. We compare Mem-T against thirteen baselines, categorized into two groups: (I) Training-free Methods: This group includes memory-free approaches, such as the vanilla long-context LLM and retrieval-augmented generation (RAG), as well as memory-based methods, including MemGPT, MemoryBank, Mem0, MemoryOS, LightMem, A-Mem, and GAM. (II) Training-based Methods: This group includes MemAgent and MEM1, which primarily focus on working memory, and Memory-R1 and Mem-α, which are designed mainly to enhance factual memory. For all baselines, official implementations and released parameters are used when available.

Implementation Details. We select LLM backbones of varying sizes, including Qwen3-4B and Qwen3-8B. All methods use BGE-M3 as the embedding model. During training with MoT-GRPO, we generate three trees for each query ($G=3$), with a maximum tree depth of 4. In each expansion round, we select three nodes ($N_v=3$) for branch expansion. Training for memory retrieval runs for 200 steps, and training for memory construction uses a dataset containing 10k memory operations. At inference time, Mem-T is allowed up to 6 reasoning steps. All retrieval operations default to returning the top-5 most similar items. Further training settings and parameter configurations are listed in the supplementary implementation details.

4.2 Main Results

High Performance. As shown in Table 2 and Appendix Table 1, Mem-T achieves substantially better performance on the LoCoMo benchmark than both training-free and training-based baselines. When using Qwen3-4B and Qwen3-8B, Mem-T improves F1 by 14.92 (34.13%) and 14.55 (33.08%), respectively. Even without training, the hierarchical and highly agentic memory system of Mem-T achieves superior performance, improving F1 by 5.67 (12.97%) compared to other methods. Moreover, MoT-GRPO further strengthens the LLM's memory management capability compared to the training-free and the GRPO baseline, yielding additional F1 gains of 9.27 (18.77%) and 5.09 (9.50%). These results demonstrate that the joint retrieval and construction training with dense rewards in MoT-GRPO is better suited for long-horizon memory agents. Notably, GAM, the SOTA memory system, exhibits an F1 gap of 23.31 when switching its backbone from gpt-4o-mini to Qwen3-4B, highlighting the importance of systematically improving model-level memory management capabilities.

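Appendix Table 1: Performance comparison on the LoCoMo benchmark, using F1 and BLEU-1 (B1) as evaluation metrics. ‡: since Memory-R1 is not open-sourced, we faithfully report the results from its original paper.

| Method | LLM | Single-Hop F1↑ | Single-Hop B1↑ | Multi-Hop F1↑ | Multi-Hop B1↑ | Temporal F1↑ | Temporal B1↑ | Open Domain F1↑ | Open Domain B1↑ | Overall F1↑ | Overall B1↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Training-free Methods | | | | | | | | | | | |
| RAG | Qwen3-8B | 49.62 | 43.98 | 23.64 | 17.82 | 37.93 | 33.80 | 21.39 | 16.33 | 40.77 | 35.43 |
| MemGPT | Qwen3-8B | 16.23 | 13.08 | 18.13 | 13.72 | 15.87 | 11.39 | 14.18 | 10.66 | 16.38 | 12.71 |
| MemoryBank | Qwen3-8B | 26.50 | 19.48 | 26.52 | 18.93 | 15.49 | 11.36 | 15.92 | 12.09 | 23.66 | 17.31 |
| Mem0 | Qwen3-8B | 45.92 | 39.93 | 27.80 | 19.97 | 43.64 | 33.82 | 18.37 | 13.84 | 40.41 | 33.42 |
| MemoryOS | Qwen3-8B | 48.77 | 43.47 | 29.19 | 24.87 | 42.98 | 35.27 | 18.50 | 15.09 | 42.12 | 36.65 |
| LightMem | Qwen3-8B | 49.89 | 44.48 | 33.98 | 27.60 | 44.53 | 39.65 | 19.37 | 14.05 | 43.98 | 38.51 |
| A-Mem | Qwen3-8B | 47.75 | 41.36 | 32.35 | 24.82 | 36.80 | 30.71 | 18.62 | 14.98 | 40.92 | 34.56 |
| GAM | Qwen3-8B | 46.62 | 40.15 | 32.18 | 24.96 | 46.42 | 39.71 | 13.56 | 10.32 | 41.84 | 35.39 |
| Trained Methods | | | | | | | | | | | |
| MEM1 | MEM1-7B | 27.48 | 22.10 | 18.98 | 15.56 | 30.52 | 23.48 | 14.21 | 11.43 | 25.68 | 20.50 |
| MemAgent | MemAgent-14B | 35.86 | 29.64 | 27.86 | 22.72 | 37.93 | 31.85 | 20.31 | 16.47 | 33.82 | 27.97 |
| Memory-R1-PPO | Mem-R1-8B | 32.52 | 24.47 | 26.86 | 23.47 | 41.57 | 26.11 | 45.30 | 39.18 | 34.08 | 25.54 |
| Memory-R1-GRPO | Mem-R1-8B | 35.73 | 27.70 | 35.65 | 30.77 | 49.86 | 38.27 | 47.42 | 41.24 | 39.25 | 31.21 |
| Our Method | | | | | | | | | | | |
| w/o training | Qwen3-8B | 55.89 | 51.14 | 38.13 | 30.33 | 53.30 | 47.02 | 23.55 | 20.18 | 50.08 | 44.55 |
| with MoT-GRPO | Qwen3-8B | 63.65 | 57.97 | 42.38 | 34.72 | 66.85 | 62.29 | 34.33 | 31.47 | 58.53 | 52.89 |

Table 3: Evaluation results on the OOD benchmarks (HotpotQA, LongMemEval, NarrativeQA). All methods are implemented on Qwen3-4B, except MEM1, which uses the 7B model trained in its original paper.

| Method | HotpotQA F1↑ | LongMemEval Acc↑ | NarrativeQA F1↑ | Avg. |
|---|---|---|---|---|
| Training-free Methods | | | | |
| VANILLA | 21.89 | 38.80 | 18.09 | 26.26 |
| RAG | 50.13 | 56.60 | 21.17 | 42.63 |
| MemGPT | 18.24 | 23.00 | 8.39 | 16.54 |
| MemoryBank | 16.90 | 26.20 | 9.65 | 17.58 |
| A-Mem | 30.46 | 61.30 | 25.18 | 38.98 |
| Mem0 | 31.96 | 53.60 | 27.63 | 37.73 |
| MemoryOS | 26.86 | 46.80 | 23.45 | 32.37 |
| LightMem | 38.62 | 63.10 | 16.78 | 39.50 |
| GAM | 52.98 | 61.80 | 28.32 | 47.70 |
| Trained Methods | | | | |
| MEM1 | 55.36 | 19.00 | 13.49 | 29.28 |
| Mem-α | 58.80 | 52.00 | 28.56 | 46.45 |
| Mem-T | 66.35 | 65.80 | 30.29 | 54.15 |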

Cross-domain generalization. To evaluate whether the memory management capabilities learned by MoT-GRPO transfer across tasks, we assess Mem-T on three out-of-domain tasks. As shown in Table 3, baselines such as LightMem achieve second-best performance on LongMemEval but fail to generalize to the other benchmarks, trailing Mem-T by 27.73 and 13.51 on HotpotQA and NarrativeQA, respectively. The training-based MEM1 performs well on HotpotQA, a QA benchmark that is in-domain for it, outperforming training-free methods by 2.38, but degrades substantially on benchmarks that emphasize long-horizon dialogue understanding, underperforming Mem-T by 46.8 on LongMemEval and 16.8 on NarrativeQA. In contrast, Mem-T learns effective memory management strategies through training on LoCoMo and achieves SOTA performance on all three out-of-domain benchmarks, with an average improvement of 6.45 (13.52%) over the best other method. Notably, Mem-T generalizes well from long-horizon dialogue to the QA setting of HotpotQA, outperforming other approaches by 7.55.
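
The average margin quoted above follows directly from Table 3. As a small check, the sketch below recomputes the per-method average over the three benchmarks for a subset of rows and then the margin of Mem-T over the best competing average; the variable names are ours, and only values printed in Table 3 are used.

```python
# (HotpotQA F1, LongMemEval Acc, NarrativeQA F1), copied from Table 3.
TABLE3 = {
    "RAG": (50.13, 56.60, 21.17),
    "LightMem": (38.62, 63.10, 16.78),
    "GAM": (52.98, 61.80, 28.32),
    "MEM1": (55.36, 19.00, 13.49),
    "Mem-α": (58.80, 52.00, 28.56),
    "Mem-T": (66.35, 65.80, 30.29),
}

def average(scores: tuple[float, ...]) -> float:
    """Unweighted mean over the three benchmarks, rounded as in the table."""
    return round(sum(scores) / len(scores), 2)

averages = {name: average(scores) for name, scores in TABLE3.items()}

# Best competing method by average score.
best_baseline = max((n for n in averages if n != "Mem-T"), key=averages.get)
margin = round(averages["Mem-T"] - averages[best_baseline], 2)
relative = round(100 * margin / averages[best_baseline], 2)

print(best_baseline, margin, relative)  # GAM 6.45 13.52
```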

Token-economical. As illustrated in Figure 3 and Figure 4, Mem-T is highly cost-effective, lying on the Pareto front for both the LoCoMo and HotpotQA datasets. Compared to GAM, Mem-T not only improves F1 by 5.17 to 28.48 but also reduces the inference overhead per query by 19.94% to 24.45%.
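
A method lies on the Pareto front when no other method is both at least as cheap and at least as accurate, with a strict improvement on one of the two axes. The sketch below illustrates that test. The points are hypothetical placeholders chosen for illustration and are not measurements from Figure 3 or Figure 4; the names `Point`, `dominates`, and `pareto_front` are ours.

```python
from typing import NamedTuple

class Point(NamedTuple):
    name: str
    tokens: float  # inference tokens per query (lower is better)
    f1: float      # task F1 (higher is better)

def dominates(a: Point, b: Point) -> bool:
    """a dominates b if a is no worse on both axes and strictly better on one."""
    no_worse = a.tokens <= b.tokens and a.f1 >= b.f1
    strictly_better = a.tokens < b.tokens or a.f1 > b.f1
    return no_worse and strictly_better

def pareto_front(points: list[Point]) -> list[Point]:
    """Keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical placeholder values, NOT results from the paper.
methods = [
    Point("cheap_low", 2_000, 35.0),
    Point("mid", 8_000, 50.0),
    Point("costly_mid", 12_000, 48.0),   # dominated by "mid"
    Point("costly_high", 15_000, 60.0),
]

print([p.name for p in pareto_front(methods)])
# ['cheap_low', 'mid', 'costly_high']
```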

Figure 3: Comparison of performance and inference cost on the LoCoMo dataset. Different marker shapes denote different types of baselines.
Figure 4: Comparison of performance and inference cost on the HotpotQA dataset. Different marker shapes denote different types of baselines.

4.3 Framework Analysis

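Table 4: Ablation study on the LoCoMo dataset. All entries report F1.

| Method | Single | Multi | Temporal | Open | Overall |
|---|---|---|---|---|---|
| Vanilla Mem-T | 63.75 | 45.09 | 65.13 | 32.97 | 58.65 |
| Ablation of Memory Modules | | | | | |
| w/o $M_{\text{work}}$ | 63.24 | 43.42 | 63.38 | 30.90 | 57.59 |
| w/o $M_{\text{fact}}$ | 60.80 | 40.10 | 64.23 | 22.39 | 55.25 |
| w/o $M_{\text{exp}}$ | 61.94 | 43.96 | 62.64 | 27.42 | 56.60 |
| w/o $M_{\text{raw}}$ | 62.19 | 42.84 | 62.41 | 29.38 | 56.61 |
| Ablation of MoT-GRPO | | | | | |
| w/o Retr. Opt. | 57.91 | 43.85 | 56.69 | 30.73 | 53.37 |
| w/o Cons. Opt. | 61.41 | 41.15 | 61.17 | 25.24 | 55.36 |
| w/o $A_{\text{intra}}$ | 62.08 | 43.08 | 63.52 | 31.53 | 56.95 |
| w/o $A_{\text{inter}}$ | 58.33 | 43.52 | 59.58 | 30.59 | 54.09 |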

Ablation Study. We ablate both the hierarchical memory architecture and the MoT-GRPO training paradigm, with results in Table 4. (1) w/o Memory Modules removes, one at a time, the working ($M_{\text{work}}$), factual ($M_{\text{fact}}$), experiential ($M_{\text{exp}}$), and raw ($M_{\text{raw}}$) memory stores. On LoCoMo, which emphasizes information extraction from long-horizon dialogues, factual memory proves to be the most critical component: removing it lowers overall F1 by 3.40. (2) w/o Optimization Strategies replaces the MoT-GRPO-optimized policy with the base model in the memory retrieval (w/o Retr. Opt.) or construction (w/o Cons. Opt.) phase. Removing retrieval optimization causes the largest decline, 5.28, while removing construction optimization causes a 3.29 drop. These marked degradations confirm that both stages of MoT-GRPO are crucial. (3) w/o Advantage Terms ablates the intra-tree ($A_{\text{intra}}$) or inter-tree ($A_{\text{inter}}$) advantage. Removing $A_{\text{inter}}$ causes a larger drop (4.56) than removing $A_{\text{intra}}$ (1.70), indicating that cross-tree advantage estimation is critical for stable RL training, while combining both signals yields the best performance.
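
Every drop quoted above is the difference between the Overall F1 of the full model and that of the ablated variant in Table 4. The short sketch below recomputes all eight and ranks them; the dictionary keys are plain-text labels of ours for the table rows.

```python
FULL = 58.65  # Overall F1 of the full ("Vanilla") Mem-T in Table 4

# Overall F1 of each ablated variant, copied from Table 4.
ABLATIONS = {
    "w/o M_work": 57.59,
    "w/o M_fact": 55.25,
    "w/o M_exp": 56.60,
    "w/o M_raw": 56.61,
    "w/o Retr. Opt.": 53.37,
    "w/o Cons. Opt.": 55.36,
    "w/o A_intra": 56.95,
    "w/o A_inter": 54.09,
}

# Drop in Overall F1 caused by each ablation.
drops = {name: round(FULL - score, 2) for name, score in ABLATIONS.items()}

# Print the ablations from most to least harmful.
for name, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<15} -{drop:.2f}")
```

Ranked this way, retrieval optimization (5.28) and the inter-tree advantage (4.56) matter most, while removing working memory costs the least (1.06).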

Figure 5: Left: parameter sensitivity analysis of the maximum number of retrieval steps at inference on LoCoMo. Right: parameter sensitivity analysis of the number of operation trees per query (G) when training with MoT-GRPO on the LoCoMo and HotpotQA datasets.

Sensitivity Analysis. We analyze the sensitivity of Mem-T to three core parameters, with results in Figure 5 and Appendix Figure 1. For the maximum number of retrieval steps, performance improves substantially as the limit rises from 2 to 6, with the F1 score increasing from 53.45 to 58.65. Extending the limit from 6 to 10, however, yields only marginal gains (<0.5%) while linearly inflating the token consumption per query from 9k to 21k. For the number of operation trees G, increasing G from 1 to 3 yields substantial gains, boosting the F1 score on LoCoMo from 54.20 to 58.65 and on HotpotQA from 62.49 to 66.54. Further increasing G to 5 gives diminishing returns, with a marginal average improvement of only 0.35 while inflating the computational cost by approximately 67%. Thus, we set the maximum retrieval steps to 6 and G=3 to balance performance and overhead. Further analysis is given in the supplementary sensitivity analysis.
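
The trade-off behind these settings can be made explicit with a little arithmetic on the figures quoted above. The sketch assumes that training cost grows linearly with G, which is our assumption and is consistent with the reported increase of roughly 67% when moving from G=3 to G=5. The 9k and 21k token counts are the approximate values stated in the text.

```python
# Operation trees: average F1 gain across LoCoMo and HotpotQA.
locomo_gain = 58.65 - 54.20   # G: 1 -> 3 on LoCoMo
hotpot_gain = 66.54 - 62.49   # G: 1 -> 3 on HotpotQA
avg_gain_1_to_3 = (locomo_gain + hotpot_gain) / 2
avg_gain_3_to_5 = 0.35        # reported average gain for G: 3 -> 5

# Assumption: training cost is proportional to the number of trees G.
extra_cost_3_to_5 = (5 - 3) / 3

# Retrieval steps: extra tokens per added step between limits of 6 and 10.
tokens_per_extra_step = (21_000 - 9_000) / (10 - 6)

print(f"avg F1 gain, G 1->3: {avg_gain_1_to_3:.2f}")                   # 4.25
print(f"avg F1 gain, G 3->5: {avg_gain_3_to_5:.2f}")                   # 0.35
print(f"relative cost increase, G 3->5: {extra_cost_3_to_5:.0%}")      # 67%
print(f"tokens per extra retrieval step: {tokens_per_extra_step:.0f}")  # 3000
```

Going from G=1 to G=3 buys about 4.25 F1 on average, whereas the next two trees buy 0.35 at about two thirds more cost, which is the diminishing return the paragraph describes.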

Appendix Figure 1: Parameter sensitivity analysis of the number of nodes selected for branch expansion when training with MoT-GRPO on the LoCoMo and HotpotQA datasets.

4.4 Case Study

Figure 6: Case study comparing Mem-T against the baseline.

We present a case study comparing the memory processing trajectories of Mem-T against the Qwen3-4B baseline in Figure 6 to demonstrate the enhanced capabilities acquired through our training paradigm.

As illustrated, the baseline exhibits severe limitations across the entire memory lifecycle. In the formation phase, it lacks an accurate information extraction capability, failing to resolve relative timestamps (e.g., "yesterday") into specific dates. During evolution, it fails to distinguish between Update and Add operations, erroneously overwriting existing entity records with unrelated new memory. Finally, its retrieval mechanism is limited to ambiguous raw queries, lacking the logical depth to handle multi-step reasoning.

In contrast, Mem-T demonstrates superior capabilities in three aspects. (1) Accurate Information Extraction: it accurately processes raw information (e.g., converting "yesterday" into the correct specific date), ensuring that initial memory entries are temporally grounded and factually complete. (2) Rational Memory Evolution: it shows a deep understanding of when each memory evolution tool applies. By explicitly distinguishing state updates from new knowledge acquisition, it preserves memory atomicity and avoids forgetting key information. (3) Multi-step Retrieval: instead of issuing vague searches, it autonomously decomposes complex queries into sub-questions and retrieves from a suitable store for each. These step-by-step memory lookups let it synthesize the answer from distinct memory entries.
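
To make the first two behaviours concrete, the toy sketch below grounds a relative time expression against the session date and chooses between adding a new entry and updating an existing one. It is an illustration under our own assumptions, not Mem-T's implementation: the `FactStore` class, its (entity, attribute) keying, the `RELATIVE_DAYS` table, and the example facts are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# Hypothetical offsets for a few relative expressions (illustrative only).
RELATIVE_DAYS = {"today": 0, "yesterday": -1, "tomorrow": 1}

def resolve_date(expression: str, session_date: date) -> date:
    """Ground a relative time expression against the date of the dialogue session."""
    return session_date + timedelta(days=RELATIVE_DAYS[expression.lower()])

@dataclass
class FactStore:
    """Toy factual memory keyed by (entity, attribute); not Mem-T's actual schema."""
    facts: dict[tuple[str, str], str] = field(default_factory=dict)

    def write(self, entity: str, attribute: str, value: str) -> str:
        key = (entity, attribute)
        # Update only when the same attribute of the same entity changes state;
        # otherwise add a new entry so unrelated records are never overwritten.
        op = "update" if key in self.facts else "add"
        self.facts[key] = value
        return op

store = FactStore()
session = date(2023, 5, 8)  # hypothetical session date
when = resolve_date("yesterday", session).isoformat()

print(store.write("Alice", "job", "nurse"))       # add
print(store.write("Alice", "last_hike", when))    # add (new knowledge)
print(store.write("Alice", "job", "head nurse"))  # update (state change)
print(store.facts[("Alice", "last_hike")])        # 2023-05-07
```

The failure attributed to the baseline corresponds to issuing an update for the second write, which would replace the job record with an unrelated hiking fact.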

5. Conclusion

In this paper, we introduce Mem-T, a comprehensive hierarchical memory framework, and MoT-GRPO, a novel RL paradigm for memory agents. By decomposing sparse terminal rewards into dense, step-wise supervision via memory operation trees, MoT-GRPO enables the joint optimization of memory construction and retrieval policies. Extensive experiments demonstrate that Mem-T not only achieves state-of-the-art performance across in-domain and out-of-domain benchmarks but also attains a superior Pareto trade-off between task accuracy and inference overhead. We believe Mem-T represents a shift from heuristic-based storage to fully learnable, attribution-centric memory systems, paving the way for self-evolving agents capable of lifelong learning.
