Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents

Abstract

Memory-augmented LLM agents enable interactions that extend beyond finite context windows by storing, updating, and reusing information across sessions. However, training such agents with reinforcement learning in multi-session environments is challenging because memory turns the agent's past actions into part of its future environment. Once different rollouts write, update, or delete different memories, they no longer share the same intermediate memory state, making trajectory-level comparisons fundamentally unfair. This violates a key assumption behind group-relative methods such as GRPO, where rollouts are compared as if they were sampled from the same effective environment. Consequently, trajectory-level rewards provide noisy or biased credit signals for long-horizon memory operations. To address this challenge, we introduce Memory-R2, a training framework for long-horizon memory-augmented LLM agents. Its core algorithm, LoGo-GRPO, combines local and global group-relative optimization. The global objective preserves end-to-end learning from long-horizon trajectory-level rewards, while local rerollouts compare different memory-operation outcomes from the same intermediate memory state, yielding fairer group comparisons and more precise supervision for memory construction. Beyond credit assignment, Memory-R2 jointly optimizes memory formation and memory evolution with a shared-parameter co-learning design, where a fact extractor and a memory manager are instantiated from the same LLM backbone through role-specific prompts. To stabilize multi-step RL over long memory horizons, we adopt a progressive curriculum that increases the training horizon from 8 to 16 to 32 sessions. Together, these components provide an effective training paradigm for memory-augmented LLM agents in long-horizon multi-session settings.

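Figure 1: Overview of Memory-R2. (a) Memory-R2 uses a shared-backbone extractor-manager architecture for chunk-wise memory construction. (b) LoGo-GRPO differs from standard GRPO by introducing local rerollouts from shared intermediate memory states, which gives fairer credit assignment while preserving global trajectory-level optimization. (c) Memory-R2 improves accuracy and reduces inference latency across multiple backbones.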

1. Introduction

Large language models (LLMs) have rapidly evolved from standalone text generators into agentic systems that can plan, use tools, and interact over long horizons. A central requirement for such agents is the ability to accumulate, update, and reuse information across interactions. However, despite strong in-context reasoning ability, LLM agents remain fundamentally constrained by finite context windows and the lack of persistent state, making it difficult to retain salient user information, track long-term goals, or maintain consistency over extended multi-session interactions.

To address this limitation, a growing body of work augments LLM agents with explicit memory systems. Existing research broadly follows two directions. The first focuses on memory infrastructure, including graph-structured memory, structured memory schemas, and system-inspired memory organization. The second focuses on memory policy learning, where reinforcement learning (RL) is used to decide what to extract, how to update memory, and how to use retrieved memory. While these efforts have substantially improved long-horizon agent behavior, training memory agents in multi-session environments remains fundamentally challenging.

The core difficulty is that memory makes the environment non-stationary. In multi-session agent training, memory turns the agent's past actions into part of its future environment: what the agent writes, updates, or deletes in one session becomes the state inherited by subsequent sessions. This creates a fundamental challenge for trajectory-level RL, especially for group-relative methods such as GRPO, which rely on comparing rollouts sampled from the same effective environment. Once rollouts modify memory differently, they no longer share the same intermediate memory state, yet GRPO still normalizes their rewards within a single comparison group, leading to unfair comparisons and biased credit assignment. The problem is further amplified by trajectory-level rewards: when a downstream failure occurs, it is difficult to determine whether it comes from the current session's memory operation, corrupted memory inherited from earlier sessions, or later updates that overwrite useful information. This raises a simple but important question:

How can we design a training paradigm for memory-augmented agents that provides more accurate and fair credit assignment across sessions?

In this work, we present Memory-R2, a training framework for long-horizon memory-augmented LLM agents, as illustrated in Figure 1. At its core is LoGo-GRPO, a credit-assignment algorithm that combines global and local group-relative optimization. LoGo-GRPO preserves a trajectory-level global reward for end-to-end long-horizon optimization, while additionally introducing session-wise attribution signals and local rerollouts that compare trajectories starting from identical intermediate memory states. This yields fairer group comparison and cleaner supervision for memory operations.

Beyond fair credit assignment, Memory-R2 is designed to optimize the whole memory lifecycle. Recent analyses decompose agentic memory into memory formation, memory evolution, and memory retrieval, whereas prior RL-based memory work has focused primarily on evolution and retrieval. Our framework targets memory formation and evolution through two cooperative roles: a fact extractor, which identifies salient information from the interaction context, and a memory manager, which decides whether to insert, update, or delete memory entries. Inspired by shared-policy multi-agent RL, we instantiate both roles with a shared LLM backbone and role-specific prompts, enabling parameter-efficient co-learning and tighter coordination between extraction and memory editing.

We further formulate memory construction as a multi-step decision process within each session. Rather than treating a session as a single monolithic transition, we divide it into chunks and allow the fact extractor and memory manager to alternate over them, turning memory construction into a temporally extended process that can be refined as more evidence becomes available. To stabilize long-horizon optimization, we also introduce a curriculum over session horizon, progressively scaling training from 8 to 16 to 32 sessions so that the model first acquires reliable short-horizon memory behavior before adapting to more challenging long-context settings. Our contributions are summarized as follows:

  • We propose Memory-R2, a training framework for long-horizon memory-augmented LLM agents, whose core algorithm LoGo-GRPO improves fairness and session-level credit assignment through global-local group-relative optimization.
  • We introduce a shared-parameter extractor-manager architecture and formulate memory construction as a multi-step decision process over chunked sessions, enabling joint optimization of memory formation and evolution.
  • We develop a curriculum learning strategy over session horizon that stabilizes long-horizon RL training, and show that the resulting system is highly data-efficient, achieving strong gains over prior memory-agent baselines using only two training conversations while generalizing across benchmarks, model scales, and answer agents.

2.1 Memory Agent Architectures

Explicit memory has become a standard way to extend LLM agents beyond finite context windows and support long-horizon interaction. Prior work mainly differs in how memory is represented and managed. Representative examples include graph- or structure-based memory systems such as Zep, G-Memory, A-MEM, Mem0, and CAM, as well as system-inspired designs such as MemOS and MemoryOS. While these methods propose increasingly expressive memory substrates, they mostly rely on heuristic or prompt-based policies for deciding what to store, update, or discard. In contrast, our work retains a modular extractor-manager architecture but optimizes the memory lifecycle directly with reinforcement learning.

2.2 Reinforcement Learning for Memory Agents

Reinforcement learning has recently become an effective paradigm for training LLM agents in interactive settings such as tool use, web navigation, and reasoning. This is particularly suitable for memory agents, where the quality of extraction, memory editing, and retrieval decisions is only revealed through downstream task performance. Existing RL-based memory methods, such as Memory-R1 and Mem-α, demonstrate the promise of this direction. However, they rely mainly on outcome-level rewards and do not explicitly address cross-session credit assignment under diverging memory states. They also focus primarily on memory evolution and retrieval, leaving joint optimization of formation, evolution, and retrieval underexplored. Our work addresses these gaps by introducing multi-step extractor-manager training, shared-parameter co-learning, and a global-local GRPO objective for fairer credit assignment in long-horizon multi-session settings.

3. Method

3.1 Problem Formulation: Multi-step Memory Bank Construction

We study memory bank construction for long-horizon multi-session interactions. Let $\mathcal{D}=\{S_t\}_{t=1}^{T}$ denote a dialogue trajectory of $T$ sessions, where each session $S_t=\{x_{t,k}\}_{k=1}^{K}$ is divided into $K$ chunks. The agent maintains an external memory bank $\mathcal{M}$ that evolves across sessions. We formulate memory construction as a chunk-wise multi-step process, illustrated in Figure 1(a): for each chunk $x_{t,k}$, a fact extractor first proposes salient content

$$z_{t,k} \sim \pi_{\mathrm{ext}}(z \mid x_{t,k}),$$

and a memory manager then chooses an operation conditioned on the extracted content and current memory state,

$$a_{t,k} \sim \pi_{\mathrm{mgr}}(a \mid z_{t,k}, \mathcal{M}_{t,k-1}),$$

where $a_{t,k} \in \mathcal{A}$ denotes operations such as INSERT, UPDATE, and DELETE. The memory bank is updated by a deterministic transition operator

$$\mathcal{M}_{t,k} = \mathcal{T}(\mathcal{M}_{t,k-1}, z_{t,k}, a_{t,k}).$$

This yields a chunk-wise memory construction process over session t:

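$$\mathcal{M}_{t,0} \xrightarrow[\pi_{\mathrm{ext}},\,\pi_{\mathrm{mgr}}]{x_{t,1}} \mathcal{M}_{t,1} \xrightarrow[\pi_{\mathrm{ext}},\,\pi_{\mathrm{mgr}}]{x_{t,2}} \cdots \xrightarrow[\pi_{\mathrm{ext}},\,\pi_{\mathrm{mgr}}]{x_{t,K}} \mathcal{M}_{t,K}.$$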

Across the full dialogue trajectory, let $\tau=\{z_{t,k}, a_{t,k}\}_{t=1,k=1}^{T,K}$ denote a memory-construction rollout. Its probability factorizes as

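$$p_\theta(\tau \mid \mathcal{D}) = \prod_{t=1}^{T} \prod_{k=1}^{K} \pi_{\mathrm{ext}}(z_{t,k} \mid x_{t,k})\, \pi_{\mathrm{mgr}}(a_{t,k} \mid z_{t,k}, \mathcal{M}_{t,k-1}).$$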

In our framework, the extractor and the manager are implemented as two cooperative roles instantiated from a shared LLM backbone with role-specific prompts:

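$$\pi_{\mathrm{ext}}(\cdot \mid \cdot) = \pi_\theta(\cdot \mid p_{\mathrm{ext}}, \cdot), \qquad \pi_{\mathrm{mgr}}(\cdot \mid \cdot) = \pi_\theta(\cdot \mid p_{\mathrm{mgr}}, \cdot),$$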

where $\theta$ denotes the shared model parameters, and $p_{\mathrm{ext}}$ and $p_{\mathrm{mgr}}$ are role-specific prompts for fact extraction and memory management, respectively. The resulting memory-construction rollout $\tau$ is evaluated through downstream task performance, yielding a trajectory-level reward $R(\tau)$. We optimize the shared memory policy by maximizing the expected return $\mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$.
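The chunk-wise process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the memory bank is modeled as a plain dictionary, and `apply_op`, `build_session_memory`, `toy_extract`, and `toy_manage` are hypothetical names that stand in for the transition operator, the session loop, and the two LLM roles. None of them come from the paper.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Memory = Dict[str, str]   # hypothetical memory bank: entry key -> stored fact
Op = Tuple[str, str]      # hypothetical operation: (kind, entry key)


def apply_op(memory: Memory, fact: str, op: Op) -> Memory:
    """Deterministic transition T(M, z, a): returns a new memory bank."""
    kind, key = op
    updated = dict(memory)
    if kind in ("INSERT", "UPDATE"):
        updated[key] = fact
    elif kind == "DELETE":
        updated.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {kind}")
    return updated


def build_session_memory(
    memory: Memory,
    chunks: Sequence[str],
    extract: Callable[[str], str],
    manage: Callable[[str, Memory], Op],
) -> Tuple[Memory, List[Tuple[str, Op]]]:
    """Alternate extractor and manager over the chunks of one session."""
    trace: List[Tuple[str, Op]] = []
    for chunk in chunks:
        fact = extract(chunk)                 # z sampled from the extractor role
        op = manage(fact, memory)             # a sampled from the manager role
        memory = apply_op(memory, fact, op)   # M_{t,k} = T(M_{t,k-1}, z, a)
        trace.append((fact, op))
    return memory, trace


# Toy stand-ins for the two roles (assumptions, not the trained policies).
def toy_extract(chunk: str) -> str:
    return chunk.strip().lower()


def toy_manage(fact: str, memory: Memory) -> Op:
    key = fact.split()[0]
    return ("UPDATE" if key in memory else "INSERT", key)


final_memory, trace = build_session_memory(
    {}, ["Alice likes tea", "Bob owns a cat", "Alice likes coffee"],
    toy_extract, toy_manage,
)
```

In the actual framework both roles would be calls to the same LLM with different prompts; the toy functions only make the control flow executable.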

3.2 Length-Normalized Step-level RL with Shared Extractor-Manager Policy

While Section 3.1 defines memory construction as a multi-step process, optimizing it with a shared LLM policy introduces length-induced bias. We instantiate fact extraction and memory management as two roles of a shared policy with role-specific prompts. Since the two roles produce outputs of different lengths, token-level RL assigns more loss terms to longer generations, biasing the shared policy toward verbose outputs and roles with longer outputs. To address this, we use a length-normalized step-level objective, treating each extractor or manager call as one generation step. For a generation step $u$ with generated token indices $\mathcal{U}_u$, we aggregate token-level ratios and advantages as

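$$\rho_u = \exp\!\left(\frac{1}{|\mathcal{U}_u|} \sum_{\ell \in \mathcal{U}_u} \log \frac{\pi_\theta(y_\ell \mid h_\ell)}{\pi_{\theta_{\mathrm{old}}}(y_\ell \mid h_\ell)}\right), \qquad \bar{A}_u = \frac{1}{|\mathcal{U}_u|} \sum_{\ell \in \mathcal{U}_u} A_\ell,$$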

where $\rho_u$ is the step-level importance ratio, $\bar{A}_u$ is the step-level advantage, $y_\ell$ is a generated token, $h_\ell$ is its autoregressive context, $A_\ell$ is the token-level advantage, and $\pi_{\theta_{\mathrm{old}}}$ is the rollout policy. This gives each generation step comparable weight regardless of output length. The resulting $\rho_u$ and $\bar{A}_u$ are then used in the LoGo-GRPO objective.
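As a minimal sketch of this aggregation, the step-level ratio is the exponential of the mean per-token log-ratio (a geometric mean of token ratios), and the step-level advantage is the mean token advantage. The function names `step_ratio` and `step_advantage` are illustrative assumptions, and the inputs are assumed to be per-token log-probabilities of the sampled tokens under the current and rollout policies.

```python
import math
from typing import Sequence


def step_ratio(new_logps: Sequence[float], old_logps: Sequence[float]) -> float:
    """exp of the mean token log-ratio over one generation step."""
    if not new_logps or len(new_logps) != len(old_logps):
        raise ValueError("log-prob sequences must be non-empty and aligned")
    mean_log_ratio = sum(n - o for n, o in zip(new_logps, old_logps)) / len(new_logps)
    return math.exp(mean_log_ratio)


def step_advantage(token_advantages: Sequence[float]) -> float:
    """Mean of the token-level advantages within one generation step."""
    if not token_advantages:
        raise ValueError("a step needs at least one token")
    return sum(token_advantages) / len(token_advantages)


# A short and a long step with the same per-token shift get the same ratio.
short = step_ratio([-0.5] * 4, [-1.0] * 4)
long_ = step_ratio([-0.5] * 40, [-1.0] * 40)
```

Because the log-ratios are averaged rather than summed, a 40-token manager output and a 4-token one contribute one comparably scaled term each, which is the length-independence the text describes.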

3.3 LoGo-GRPO for Multi-session Credit Assignment

The formulation in Section 3.1 defines memory construction as a chunk-wise multi-step process, but learning still requires fair credit assignment across sessions. Trajectory-level GRPO is problematic in memory-augmented settings because memory turns an agent's past actions into part of its future environment. Once rollouts write, update, or delete different memories, they no longer share the same intermediate memory state, making group-relative comparisons unfair and credit signals noisy or biased. To address this, we propose LoGo-GRPO, which combines a global trajectory-level branch with a local rerollout branch. As shown in Figure 1 (b), the global branch preserves end-to-end optimization over the full multi-session trajectory, while the local branch rerolls a stochastically sampled subset of sessions from shared memory states, yielding lower-bias session-level credit assignment at manageable cost.

Reward function. Let $\mathcal{Q}$ denote the full set of question-answer pairs $(q, a)$ associated with a training conversation, and let $\mathcal{Q}_t \subseteq \mathcal{Q}$ denote the subset whose required evidence is attributed to session $t$. Given a memory bank $\mathcal{M}$ and a question $q$, an answer module retrieves relevant entries from memory and generates an answer $\hat{a}$. We measure QA quality using token-level F1:

$$\mathrm{QA}(\mathcal{M}, \mathcal{Q}_t) = \frac{1}{|\mathcal{Q}_t|} \sum_{(q,a) \in \mathcal{Q}_t} \mathrm{F1}(\hat{a}, a).$$

To discourage unbounded memory growth, we penalize memory tokens exceeding an $\alpha$ fraction of the cumulative session tokens up to session $t$, where $\mathrm{Tok}(\cdot)$ denotes token count and $\alpha$ is a fixed memory budget ratio:

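$$\mathrm{Comp}(\mathcal{M}, t) = \begin{cases} 0, & \mathrm{Tok}(\mathcal{M}) \le \alpha \sum_{s=1}^{t} \mathrm{Tok}(S_s), \\[4pt] \dfrac{\mathrm{Tok}(\mathcal{M}) - \alpha \sum_{s=1}^{t} \mathrm{Tok}(S_s)}{\sum_{s=1}^{t} \mathrm{Tok}(S_s)}, & \mathrm{Tok}(\mathcal{M}) > \alpha \sum_{s=1}^{t} \mathrm{Tok}(S_s). \end{cases}$$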

The session-level reward is

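$$R(\mathcal{M}, \mathcal{Q}_t, t) = \mathrm{QA}(\mathcal{M}, \mathcal{Q}_t) - \lambda_{\mathrm{comp}}\, \mathrm{Comp}(\mathcal{M}, t),$$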

where $\lambda_{\mathrm{comp}}$ controls the compression penalty.
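The reward can be illustrated with a short Python sketch. Several details are assumptions made for the example rather than facts from the paper: tokens are lowercase whitespace-split words, F1 is the usual multiset-overlap variant, and the function names and the numeric values of `alpha` and `lam` in the usage lines are hypothetical.

```python
from collections import Counter
from typing import Sequence


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 with whitespace tokens (assumed tokenization)."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def compression_penalty(memory_tokens: int, session_tokens: int, alpha: float) -> float:
    """Comp(M, t): excess over the alpha budget, relative to session tokens."""
    if session_tokens <= 0:
        raise ValueError("cumulative session token count must be positive")
    budget = alpha * session_tokens
    if memory_tokens <= budget:
        return 0.0
    return (memory_tokens - budget) / session_tokens


def session_reward(f1_scores: Sequence[float], memory_tokens: int,
                   session_tokens: int, alpha: float, lam: float) -> float:
    """R = mean F1 over the session's questions minus the weighted penalty."""
    qa = sum(f1_scores) / len(f1_scores)
    return qa - lam * compression_penalty(memory_tokens, session_tokens, alpha)


f1 = token_f1("the cat sat", "the cat")
reward = session_reward([1.0, 0.5], memory_tokens=300, session_tokens=1000,
                        alpha=0.2, lam=0.5)
```

With a budget of 200 tokens and 300 tokens stored, the penalty is 0.1, so the mean F1 of 0.75 is reduced by 0.05 in this toy setting.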

Global branch. For rollout $i$, let $\mathcal{M}_t^{(i)} \equiv \mathcal{M}_{t,K}^{(i)}$ denote the memory state after session $t$. The global branch evaluates the terminal memory $\mathcal{M}_T^{(i)}$ and attributes the reward to session $t$ according to the location of the required evidence:

$$r_{t,i}^{G} = R(\mathcal{M}_T^{(i)}, \mathcal{Q}_t, T).$$

Following GRPO, we compute group-relative advantages across the n global rollouts:

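$$\hat{A}_{t,i}^{G} = \frac{r_{t,i}^{G} - \mu_t^{G}}{\sigma_t^{G} + \varepsilon}, \qquad \mu_t^{G} = \frac{1}{n} \sum_{j=1}^{n} r_{t,j}^{G}, \qquad \sigma_t^{G} = \mathrm{std}_j\big(r_{t,j}^{G}\big).$$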

While this branch provides full-horizon supervision, it still suffers from reward contamination: at session t, different rollouts induce different intermediate memory states as their effective environments, yet GRPO normalizes their rewards within the same comparison group.
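The group-relative normalization used by the global branch can be sketched as follows. The helper name `group_relative_advantages` is illustrative, and two details are assumptions: the population standard deviation is used for the std term, and the default value of `eps` is arbitrary.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """(r - mean) / (std + eps) within one comparison group."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards attributed to one session across n = 3 global rollouts.
advantages = group_relative_advantages([0.2, 0.4, 0.6])
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

The same helper applies unchanged to a group of local rerollouts; what differs between the branches is which rollouts are placed in the same group, not the normalization itself.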

Local branch with stochastic rerollout. To reduce this contamination, the local branch performs rerollouts from shared intermediate memory states. After the global rollout phase, each session is independently selected with probability $p_{\mathrm{local}}$:

$$b_t \sim \mathrm{Bernoulli}(p_{\mathrm{local}}), \qquad \mathcal{B} = \{t \mid b_t = 1\}.$$

For each selected session $t \in \mathcal{B}$, we choose an anchor rollout $i_0 \in \{1, \dots, n\}$, retrieve the cached memory state immediately before session $t$, and sample $m$ local rerollouts of session $t$ only. Since these rerollouts share the same starting memory state $\mathcal{M}_{t-1}^{(i_0)}$, their comparison is not confounded by divergence from earlier sessions. Let $\mathcal{M}_t^{(i_0,j)}$ denote the memory state after the $j$-th local rerollout from this anchor state. The local reward is

$$r_{t,j}^{L} = R(\mathcal{M}_t^{(i_0,j)}, \mathcal{Q}_t, t), \qquad j = 1, \dots, m.$$

The corresponding local advantages are computed within the rerollout group:

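$$\hat{A}_{t,j}^{L} = \frac{r_{t,j}^{L} - \mu_t^{L}}{\sigma_t^{L} + \varepsilon}, \qquad \mu_t^{L} = \frac{1}{m} \sum_{j=1}^{m} r_{t,j}^{L}, \qquad \sigma_t^{L} = \mathrm{std}_j\big(r_{t,j}^{L}\big).$$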

Because local advantages are computed among rerollouts from the same anchor memory state $\mathcal{M}_{t-1}^{(i_0)}$, the comparison is fairer than global normalization across already-diverged trajectories.
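A rough sketch of the local branch's bookkeeping is given below. It assumes that the anchor rollout is drawn uniformly at random (the text does not say how the anchor is chosen) and that memory states before every session were cached during the global phase. The names `select_local_sessions`, `local_rerollout_rewards`, `cached_states`, `rerollout`, and `reward` are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple


def select_local_sessions(num_sessions: int, p_local: float,
                          rng: random.Random) -> List[int]:
    """Independently keep each session t in 1..T with probability p_local."""
    return [t for t in range(1, num_sessions + 1) if rng.random() < p_local]


def local_rerollout_rewards(
    t: int,
    cached_states: Sequence[Sequence[tuple]],
    rerollout: Callable[[tuple, int], tuple],
    reward: Callable[[tuple, int], float],
    m: int,
    rng: random.Random,
) -> Tuple[int, List[float]]:
    """Rerun session t m times from one anchor's cached pre-session memory.

    cached_states[i][t - 1] is the memory of global rollout i before session t.
    """
    anchor = rng.randrange(len(cached_states))  # assumption: uniform anchor choice
    start = cached_states[anchor][t - 1]
    rewards = [reward(rerollout(start, t), t) for _ in range(m)]
    return anchor, rewards


rng = random.Random(0)
cached = [[("m0",), ("m0", "a")], [("m0",), ("m0", "b")]]
anchor, local_rewards = local_rerollout_rewards(
    2, cached,
    rerollout=lambda start, t: start + (f"s{t}",),   # toy stand-in for the policy
    reward=lambda memory, t: float(len(memory)),     # toy stand-in for R(M, Q_t, t)
    m=3, rng=rng,
)
```

All m rewards in the returned list come from the same starting memory, so normalizing them as one group compares only what happened in session t.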

Unified training objective. We optimize the shared memory policy using both global rollouts and local rerollouts. For each generation step $u$, we assign the normalized advantage associated with its corresponding rollout: $\hat{A}_{t,i}^{G}$ for a step from global rollout $i$ at session $t$, and $\hat{A}_{t,j}^{L}$ for a step from local rerollout $j$ at session $t$. The same assigned advantage is used as the token-level advantage $A_\ell$ for all tokens in step $u$. Let $\mathcal{K}_{\mathrm{step}}$ denote the set of valid generation steps from both branches. Using the step-level ratio $\rho_u$ and advantage $\bar{A}_u$ from Section 3.2, we optimize the dual-clipped surrogate

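$$\mathcal{L}_u = \begin{cases} \min\!\Big(-c\,\bar{A}_u,\ \max\!\big(-\rho_u \bar{A}_u,\ -\mathrm{clip}(\rho_u, 1-\epsilon, 1+\epsilon)\,\bar{A}_u\big)\Big), & \bar{A}_u < 0, \\[4pt] \max\!\big(-\rho_u \bar{A}_u,\ -\mathrm{clip}(\rho_u, 1-\epsilon, 1+\epsilon)\,\bar{A}_u\big), & \bar{A}_u \ge 0, \end{cases}$$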

where $c > 1$ is the dual-clipping constant and $\epsilon$ is the clipping threshold. The final actor objective is

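$$\mathcal{L}(\theta) = \frac{1}{|\mathcal{K}_{\mathrm{step}}|} \sum_{u \in \mathcal{K}_{\mathrm{step}}} \mathcal{L}_u - \beta_{\mathrm{ent}}\, \mathcal{H}_{\mathrm{token}} + \beta_{\mathrm{kl}}\, D_{\mathrm{KL,token}},$$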

where $\mathcal{H}_{\mathrm{token}}$ and $D_{\mathrm{KL,token}}$ denote the mean token-level entropy and KL divergence, respectively. The proportion of local rerollouts controls the strength of local supervision, allowing LoGo-GRPO to balance end-to-end long-horizon learning with lower-bias session-level credit assignment.
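To make the dual-clipped surrogate concrete, here is a small scalar sketch written as a loss to be minimized, which is the sign convention assumed here. The function names and the default values `eps=0.2` and `c=3.0` are illustrative assumptions; the paper's actual hyperparameters are not given in this section.

```python
from typing import Sequence


def dual_clip_step_loss(ratio: float, advantage: float,
                        eps: float = 0.2, c: float = 3.0) -> float:
    """Per-step dual-clipped PPO-style loss on a step ratio and advantage."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    loss = max(-ratio * advantage, -clipped * advantage)  # standard clipping
    if advantage < 0:
        loss = min(-c * advantage, loss)                  # second clip bounds the loss
    return loss


def actor_loss(step_losses: Sequence[float], mean_entropy: float, mean_kl: float,
               beta_ent: float, beta_kl: float) -> float:
    """Mean step loss, minus an entropy bonus, plus a KL penalty."""
    mean_step = sum(step_losses) / len(step_losses)
    return mean_step - beta_ent * mean_entropy + beta_kl * mean_kl


# A very large ratio with a negative advantage is capped at c * |A|.
capped = dual_clip_step_loss(ratio=10.0, advantage=-1.0)
total = actor_loss([1.0, 3.0], mean_entropy=0.5, mean_kl=0.1,
                   beta_ent=0.01, beta_kl=0.1)
```

The second clip only matters for negative advantages with large ratios, where the ordinary clipped loss would otherwise grow without bound.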

3.4 Curriculum Learning for Long-Horizon Credit Assignment

Directly training on long multi-session trajectories is unstable before the model acquires reliable memory manipulation skills. Because memory operations shape the future environment, early insert, update, or delete errors can propagate across sessions and make long-horizon credit assignment increasingly noisy. We therefore adopt a curriculum over session horizon: training starts from shorter sessions, where memory effects are easier to observe and attribute, and gradually increases the horizon as the policy stabilizes. Concretely, we train in three stages with the maximum number of sessions increasing from 8 to 16 to 32. The 8-session stage learns basic memory operations under limited error propagation, the 16-session stage introduces stronger inter-session dependencies, and the 32-session stage enables full long-horizon optimization. For each stage, we select the best validation checkpoint as the initialization for the next stage, providing a stable starting point for longer-horizon training.
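The staged schedule can be sketched as a simple loop. Here `train_stage` is a hypothetical callable that trains from a given checkpoint at a given maximum session horizon and returns candidate checkpoints with their validation scores; it stands in for the full RL training run.

```python
from typing import Callable, List, Sequence, Tuple

Checkpoint = str  # hypothetical checkpoint handle, e.g. a path


def run_curriculum(
    init_checkpoint: Checkpoint,
    train_stage: Callable[[Checkpoint, int], List[Tuple[Checkpoint, float]]],
    horizons: Sequence[int] = (8, 16, 32),
) -> Checkpoint:
    """Train stage by stage, carrying the best validation checkpoint forward."""
    checkpoint = init_checkpoint
    for horizon in horizons:
        candidates = train_stage(checkpoint, horizon)
        checkpoint, _ = max(candidates, key=lambda pair: pair[1])
    return checkpoint


# Toy trainer: produces two candidates per stage with made-up scores.
def fake_train_stage(checkpoint: Checkpoint, horizon: int) -> List[Tuple[Checkpoint, float]]:
    return [(f"{checkpoint}->h{horizon}a", 0.4), (f"{checkpoint}->h{horizon}b", 0.5)]


final_checkpoint = run_curriculum("base", fake_train_stage)
```

Selecting by validation score at each stage boundary mirrors the text's choice of the best validation checkpoint as the next stage's initialization.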

4. Experiments

4.1 Experiment Setup

Datasets and Evaluation Metrics. We train on LoCoMo, a long-term persona-grounded conversation benchmark, using a 2:1:7 train/validation/test split. For out-of-distribution evaluation, we additionally test on LongMemEval, MSC-Self-Instruct, and MemBench. We report token-level F1, BLEU-1 (B1), and LLM-as-a-Judge (J) as the primary metrics, and additionally use M-Fail, the percentage of required evidence location IDs that are missing from the memory bank, as a diagnostic measure of memory-construction quality. Further details on the M-Fail metric can be found in the Appendix.
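Read literally, M-Fail is a set-difference ratio, which can be sketched as below. The exact matching rule is defined in the paper's appendix, so this is only an assumed reading: IDs are compared as exact strings, duplicates are ignored, and the ID format in the usage line is made up.

```python
from typing import Iterable


def m_fail(required_ids: Iterable[str], memory_ids: Iterable[str]) -> float:
    """Percentage of required evidence location IDs absent from the memory bank."""
    required = set(required_ids)
    if not required:
        return 0.0
    missing = required - set(memory_ids)
    return 100.0 * len(missing) / len(required)


score = m_fail(["D1:3", "D2:5", "D4:1", "D4:2"], ["D1:3", "D4:2", "D9:9"])
```

Lower is better: a value of 0 means every piece of required evidence is traceable to some memory entry.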

Baselines and Implementation Details. We compare against A-MEM, Mem0, MemoryOS, a RAG variant implemented within the Mem0 framework, MEM1, MemAgent, and Memory-R1. Our work primarily targets the memory construction stage: the memory extractor and memory manager share a Qwen2.5-7B-Instruct backbone and are jointly trained, while the answer agent is held fixed during training to provide stable reward signals. We use GPT-OSS-120B as this fixed answer agent, since a weaker answer model would yield noisy reward signals that conflate memory-construction quality with answer-generation errors. To remain consistent with this training pipeline, all reported results in our ablation and analysis experiments use the same GPT-OSS-120B answer agent. In Table 1, however, we additionally train a Qwen2.5-7B-Instruct answer agent and report a backbone-controlled variant of Memory-R2 in which all components share the same 7B backbone, enabling a fair comparison against the baselines. Unless otherwise noted, all baselines also use Qwen2.5-7B-Instruct as the backbone. Additional details are provided in the Appendix.

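Table 1: Main results on LoCoMo. We report token-level F1 (F1), BLEU-1 (B1), and LLM-as-a-Judge (J). For a fair comparison, all baselines and Memory-R2 use Qwen2.5-7B-Instruct as the base model. We additionally report Memory-R2 (GPT-OSS), which replaces the answer agent with GPT-OSS-120B. Results are averaged over three runs; standard deviations are given in Appendix Table 1. † denotes results reported in the original Memory-R1 paper.

| Model | Multi-hop F1 | Multi-hop B1 | Multi-hop J | Open Domain F1 | Open Domain B1 | Open Domain J | Single Hop F1 | Single Hop B1 | Single Hop J | Temporal F1 | Temporal B1 | Temporal J | Overall F1 | Overall B1 | Overall J |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Training-free Methods | | | | | | | | | | | | | | | |
| RAG | 9.57 | 7.00 | 15.06 | 11.84 | 10.02 | 19.28 | 8.67 | 6.52 | 12.79 | 8.35 | 8.74 | 5.43 | 8.97 | 7.27 | 12.17 |
| A-MEM | 18.92 | 12.86 | 40.78 | 14.73 | 12.66 | 31.32 | 30.58 | 26.14 | 46.90 | 23.67 | 20.67 | 28.68 | 26.08 | 21.78 | 40.78 |
| Mem0 | 24.96 | 18.05 | 61.92 | 20.31 | 15.82 | 48.19 | 32.74 | 25.27 | 65.20 | 33.16 | 26.28 | 38.76 | 30.61 | 23.55 | 53.30 |
| MemoryOS | 29.55 | 22.59 | 48.12 | 21.03 | 18.41 | 38.55 | 40.85 | 36.26 | 63.14 | 26.26 | 19.70 | 24.81 | 34.64 | 29.36 | 51.26 |
| Trained Methods | | | | | | | | | | | | | | | |
| MEM1 | 17.15 | 12.72 | 41.29 | 22.67 | 14.72 | 43.10 | 28.05 | 22.96 | 57.22 | 31.77 | 26.40 | 44.73 | 26.55 | 21.38 | 50.78 |
| MemAgent | 35.95 | 24.56 | 70.65 | 29.20 | 25.48 | 62.07 | 50.55 | 43.89 | 78.61 | 23.15 | 16.55 | 56.96 | 40.72 | 33.36 | 71.52 |
| Memory-R1 | 33.64 | 26.06 | 62.34 | 23.55 | 20.71 | 40.96 | 46.86 | 40.92 | 67.81 | 47.75 | 38.49 | 49.61 | 43.14 | 36.44 | 61.51 |
| Memory-R2 (GPT-OSS) | 36.37 | 29.79 | 85.07 | 30.22 | 24.50 | 74.14 | 50.98 | 45.79 | 90.58 | 62.48 | 55.33 | 83.33 | 49.67 | 43.77 | 87.10 |
| Memory-R2 | 38.41 | 30.90 | 80.93 | 20.76 | 16.78 | 67.53 | 54.06 | 48.73 | 86.80 | 59.65 | 50.05 | 69.90 | 50.60 | 44.01 | 80.99 |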

4.2 Main Results

Fair Comparison. Table 1 reports the main results on LoCoMo. Under the backbone-controlled setting, Memory-R2 achieves the best overall F1 and BLEU-1 among all training-free and trained baselines, including MEM1, MemAgent, and Memory-R1. Compared with the closely related RL baseline Memory-R1, Memory-R2 improves overall F1 from 43.14 to 50.60 and B1 from 36.44 to 44.01, while also reaching a strong judge score of 80.99. These gains are obtained with a simple memory-agent pipeline, suggesting that the improvement mainly comes from the proposed training algorithm rather than additional system complexity. We additionally report Memory-R2 (GPT-OSS), which uses the same memory construction module but replaces the answer agent with GPT-OSS-120B. Memory-R2 with the 7B answer agent achieves higher F1 and BLEU-1 than the GPT-OSS-120B variant, demonstrating that a task-aligned small model can rival a much larger frozen one when paired with a well-trained memory module.

Figure 2: Generalization of Memory-R2 across (a) OOD benchmarks, (b) backbone scales, and (c) answer agents.

Strong Generalization. Figure 2 further demonstrates the strong generalization ability of Memory-R2 from three complementary perspectives. Notably, these gains are achieved even though the model is trained on only two LoCoMo conversations, suggesting that the proposed training paradigm is highly data-efficient. First, Figure 2(a) shows strong transfer to out-of-distribution benchmarks. When evaluated zero-shot on LongMemEval-oracle, LongMemEval-s, MSC-Self-Instruct, and MemBench, Memory-R2 consistently improves over the base model across all reported metrics. For example, on LongMemEval-oracle, the F1 score improves from 27.88 to 50.60, and similar gains are observed on the other benchmarks, indicating that the learned memory-construction policy does not simply overfit to the training benchmark. Second, Figure 2(b) shows that the gains also transfer across model scales. The improvement is especially pronounced for Qwen2.5-3B, where F1 increases from 10.3 to 46.8, suggesting that our training paradigm is particularly beneficial for smaller-capacity models, for which effective long-horizon memory construction is otherwise difficult to learn. Third, Figure 2(c) decomposes the contribution of training the memory module versus the answer agent. The dominant gain comes from training the memory module (e.g., F1 from 26.4 to 45.2 with a 7B-Base answer agent; F1 from 30.6 to 49.7 with a GPT-OSS answer agent), while varying the answer agent at fixed RL-trained memory yields comparably high scores. This indicates that the benefits of Memory-R2 transfer across diverse downstream answer agents. Taken together, these results indicate that Memory-R2 learns a robust and transferable memory-construction policy rather than overfitting to a specific benchmark, model scale, or answer agent.

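Figure 3: LoGo-GRPO and curriculum learning are both essential. (a, b) LoGo-GRPO consistently outperforms GRPO at every curriculum stage. (c, d) Under the same compute budget, curriculum training remains stable, whereas direct 32-session training causes validation F1 to collapse from 0.47 to 0.27 and raises M-Fail to 72.1%.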
Table 2: Ablation of LoGo-GRPO components.

| Variant | F1 ↑ | B1 ↑ | J ↑ | M-Fail ↓ |
|---|---|---|---|---|
| LoGo-GRPO (full) | 49.67 | 43.77 | 87.10 | 6.72 |
| RL Algorithm Ablation | | | | |
| GRPO | 46.62 (-3.05) | 40.97 (-2.80) | 82.76 (-4.34) | 10.20 (+3.48) |
| -curriculum | 24.12 (-25.55) | 20.67 (-23.10) | 45.99 (-41.11) | 46.50 (+39.78) |
| -length norm. | 43.53 (-6.14) | 38.10 (-5.67) | 77.20 (-9.90) | 8.30 (+1.58) |
| Architecture Ablation | | | | |
| Single agent | 39.14 (-10.53) | 33.93 (-9.84) | 72.63 (-14.47) | 18.00 (+11.28) |
| Separate params | 44.31 (-5.36) | 38.59 (-5.18) | 78.89 (-8.21) | 33.00 (+26.28) |
| Chunks Ablation | | | | |
| N = 4 | 40.39 (-9.28) | 35.16 (-8.61) | 73.92 (-13.18) | 14.50 (+7.78) |
| N = 8 | 41.37 (-8.30) | 36.15 (-7.62) | 75.58 (-11.52) | 22.70 (+15.98) |
| N = 10 | 37.61 (-12.06) | 32.61 (-11.16) | 70.05 (-17.05) | 37.20 (+30.48) |
| Training Target Ablation | | | | |
| memory manager | 45.34 (-4.33) | 39.99 (-3.78) | 81.20 (-5.90) | 16.10 (+9.38) |
| fact extractor | 28.30 (-21.37) | 24.41 (-19.36) | 52.90 (-34.20) | 56.50 (+49.78) |

4.3 Ablation Studies

Table 2 summarizes ablations on the major components of our method, with M-Fail reported as a diagnostic measure of memory quality. Replacing LoGo-GRPO with standard GRPO degrades F1 from 49.67 to 46.62 and B1 from 43.77 to 40.97, confirming the benefit of global-local credit assignment. Figure 3(a, b) shows that this gap holds at every curriculum stage and across question types, indicating that local rerollouts consistently mitigate credit-assignment bias. Removing curriculum learning (-curriculum) causes a much larger drop: F1 falls to 24.12 and M-Fail rises to 46.5%. Figure 3(c, d) traces this collapse: direct 32-session training peaks at F1 = 0.47 before falling to 0.27, while M-Fail rises from below 10% to over 70%; the curriculum run instead stabilizes around F1 = 0.50 with M-Fail held under 7%. This supports the view that early errors propagate across sessions and corrupt memory, and that progressive horizon expansion is essential for stable long-horizon training. We further ablate the length normalization in our step-level objective: switching to a token-level loss (-length norm.) drops F1 to 43.53 and B1 to 38.10, confirming that length-normalized step weighting is necessary to prevent output-length bias under the shared extractor-manager policy.


For the memory-construction architecture, a single-agent variant that merges extraction and editing into one role drops to 39.14 F1, and a separate-params variant in which the extractor and manager use disjoint parameters also underperforms (44.31 F1), supporting both explicit role decomposition and parameter sharing. Alternative interaction depths likewise underperform the full multi-step design (40.39 / 41.37 / 37.61 F1 for N=4/8/10), showing that moderate iterative refinement is optimal: too few chunks limit refinement, while overly long interaction chains hurt optimization. Finally, training only the memory manager (45.34 F1) or only the fact extractor (28.30 F1) also degrades performance, especially the latter, confirming that both components benefit from joint RL training and that fact extraction is the more brittle of the two roles when left untrained. Overall, the gains of our method arise from the combination of fair credit assignment, curriculum learning, multi-step memory construction, and shared extractor-manager co-learning. Additional ablations are reported in the Appendix.


4.4 More Analysis: Latency and Compression

Latency. Figure 4(a,b) compares F1 and inference latency before and after Memory-R2 training. Memory-R2 improves F1 while reducing latency for both Qwen2.5-3B and Qwen2.5-7B under the per-conversation measurement, moving both models toward a better quality-efficiency regime. The source of this latency reduction differs across scales. For Qwen2.5-3B, the gain is mainly driven by more concise generations: the trained policy emits fewer tokens per memory-construction turn. For Qwen2.5-7B, the gain comes not only from shorter generations, but also from a more stable generation-length distribution: the untrained policy occasionally produces overly long memory-management outputs, whereas Memory-R2 suppresses these unnecessary generations, reducing decoding work and making memory construction more stable. We provide a diagnostic breakdown of these scale-dependent mechanisms in Appendix Figure 1. These results suggest that better-trained memory policies can improve answer quality without incurring additional inference overhead, and can even reduce latency by making memory construction more concise and stable.
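
The two mechanisms described above (shorter generations and a lighter length tail) can be separated with a few summary statistics. The sketch below is one possible way to compute them from generation logs; the `ConversationLog` structure, the field names, and all numbers are hypothetical and are not taken from the paper's measurement code or results.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class ConversationLog:
    seconds: float          # wall-clock time spent on one conversation
    turn_tokens: List[int]  # completion tokens emitted at each turn


def summarize(logs):
    """Per-conversation latency, per-token time, and output-length statistics."""
    all_turns = np.array([t for log in logs for t in log.turn_tokens])
    total_seconds = sum(log.seconds for log in logs)
    return {
        "sec_per_conversation": total_seconds / len(logs),
        "sec_per_token": total_seconds / float(all_turns.sum()),
        "mean_tokens_per_turn": float(all_turns.mean()),
        "p95_tokens_per_turn": float(np.percentile(all_turns, 95)),
    }


# Made-up logs: the "before" policy occasionally emits one very long turn.
before = [ConversationLog(10.0, [100, 100, 800]), ConversationLog(6.0, [100, 100, 100])]
after = [ConversationLog(4.0, [80, 80, 80]), ConversationLog(4.0, [80, 80, 80])]

if __name__ == "__main__":
    print("before:", summarize(before))
    print("after: ", summarize(after))
```

A drop in `mean_tokens_per_turn` corresponds to more concise generations, while a drop in `p95_tokens_per_turn` corresponds to fewer overly long outputs.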


Compression. Figure 4(c,d) studies the effect of the compression penalty λcomp. Across both Qwen2.5-3B and Qwen2.5-7B, λcomp=0.3 achieves the best F1 and BLEU-1, as highlighted by the yellow band. Smaller penalties may retain redundant or noisy memories, while overly strong compression can remove useful evidence. We therefore use λcomp=0.3 as the default setting.


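Figure 4: Inference efficiency and compression penalty analysis. (a,b) Accuracy-latency trade-off, measured by F1 against time per conversation and per generated token. (c,d) Effect of λcomp ∈ {0, 0.1, 0.3, 0.5} on F1 and BLEU-1; the yellow band marks λcomp=0.3, and rings mark the best values.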
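Appendix Figure 1: Latency-mechanism diagnostics. (a) Memory-R2 improves F1 while reducing per-conversation latency. (b) Memory-R2 shortens the tail of the output-length distribution, indicating fewer overly long generations. (c) Memory-R2 reduces total completion tokens, especially for Qwen2.5-3B, indicating a more concise memory-construction policy.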
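Appendix Table 1: Main results per question category, averaged over independent runs with 3 different random seeds. We report mean ± standard deviation. The J score is computed by LLM-as-a-Judge with gpt-4o-mini as the judge model.

| Model | Category | F1 ↑ | B1 ↑ | J ↑ |
| --- | --- | --- | --- | --- |
| Memory-R2 | Single-hop | 54.06 ± 2.12 | 48.73 ± 2.26 | 86.80 ± 0.48 |
| | Multi-hop | 38.41 ± 0.91 | 30.90 ± 0.46 | 80.93 ± 1.52 |
| | Temporal | 59.65 ± 1.00 | 50.05 ± 1.10 | 69.90 ± 1.34 |
| | Open-domain | 20.76 ± 1.06 | 16.78 ± 0.66 | 67.53 ± 2.04 |
| | Overall | 50.60 ± 1.34 | 44.01 ± 1.37 | 80.99 ± 0.28 |
| Memory-R2 (OSS) | Single-hop | 50.98 ± 0.45 | 45.79 ± 0.48 | 90.58 ± 1.56 |
| | Multi-hop | 36.37 ± 1.63 | 29.79 ± 1.91 | 85.07 ± 1.41 |
| | Temporal | 62.48 ± 0.52 | 55.33 ± 0.42 | 83.33 ± 1.49 |
| | Open-domain | 30.22 ± 0.71 | 24.50 ± 1.11 | 74.14 ± 0.00 |
| | Overall | 49.67 ± 0.60 | 43.77 ± 0.59 | 87.10 ± 1.43 |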

5. Conclusion

In this paper, we present Memory-R2, a training framework for long-horizon memory-augmented LLM agents that addresses a fundamental challenge in multi-session reinforcement learning: fair credit assignment under diverging memory states. Our method, LoGo-GRPO, combines global trajectory-level optimization with local rerollouts from shared intermediate memory states, enabling fairer session-level comparisons while preserving end-to-end long-horizon learning. Beyond credit assignment, Memory-R2 jointly optimizes memory formation and memory evolution through a shared extractor-manager policy, formulates memory construction as a multi-step decision process over chunked sessions, and stabilizes training with a curriculum over session horizon. Experiments show that Memory-R2 consistently outperforms prior memory-agent baselines on LoCoMo and generalizes well across out-of-distribution benchmarks, model scales, and answer agents. These results suggest that improving credit assignment is a key ingredient for training robust long-horizon memory agents, and we hope this work provides a useful foundation for future research on memory-centric RL for LLM agents.
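
As a compact reminder of the mechanism summarized here, the sketch below computes GRPO-style group-relative advantages for two kinds of groups: full trajectories (global) and rerollouts of a single session from one shared memory state (local). It shows only the standard within-group reward standardization; how LoGo-GRPO weights and combines the two signals is specified in the method section and is not reproduced here. All reward values are made up.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group (GRPO-style advantage)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# Global group: one trajectory-level reward per full multi-session rollout.
# These rollouts may have diverged into different memory states.
global_rewards = [0.42, 0.55, 0.31, 0.60]

# Local group: rerollouts of one session that all start from the SAME
# intermediate memory state, so their rewards are directly comparable.
local_rewards = [0.20, 0.80, 0.50]

if __name__ == "__main__":
    print("global advantages:", group_relative_advantages(global_rewards))
    print("local advantages: ", group_relative_advantages(local_rewards))
```

Because every member of the local group shares the same starting memory, differences in local advantages reflect only the memory operations taken in that session, whereas global advantages also absorb differences inherited from earlier sessions.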
