MemGen: Weaving Generative Latent Memory for Self-Evolving Agents

Memory · 390+ · 50+ · ICLR 2026 · CCF-A · National University of Singapore

MemGen: Weaving Generative Latent Memory for Self-Evolving Agents. arXiv preprint arXiv:2509.24704v2, 2025.

https://github.com/KANABOON1/MemGen


Abstract

Agent memory shapes how Large Language Model (LLM)-powered agents, akin to the human brain, progressively refine themselves through environment interactions. Existing paradigms remain constrained: parametric memory forcibly adjusts model parameters, and retrieval-based memory externalizes experience into structured databases, yet neither captures the fluid interweaving of reasoning and memory that underlies human cognition. To address this gap, we propose MemGen, a dynamic generative memory framework that equips agents with a human-esque cognitive faculty. It consists of a memory trigger, which monitors the agent’s reasoning state to decide explicit memory invocation, and a memory weaver, which takes the agent's current state as stimulus to construct a latent token sequence as machine-native memory to enrich its reasoning. In this way, MemGen enables agents to recall and augment latent memory throughout reasoning, producing a tightly interwoven cycle of memory and cognition. Extensive experiments across eight benchmarks show that MemGen surpasses leading external memory systems such as ExpeL and AWM by up to 38.22%, exceeds GRPO by up to 13.44%, and exhibits strong cross-domain generalization ability. More importantly, we find that without explicit supervision, MemGen spontaneously evolves distinct human-like memory faculties, including planning memory, procedural memory, and working memory, suggesting an emergent trajectory toward more naturalistic forms of machine cognition.

1. Introduction

The ascent of Large Language Model (LLM)-powered agents marks a paradigm shift across diverse domains. Pivotal to this success is the concept of agent memory, which enables LLM agents to learn progressively from environmental interactions. Crucially, this conception of agent memory extends beyond the personalized memory of conversational agents, whose primary role is to sustain coherence across long-horizon, multi-turn dialogues. Rather, this paper focuses on enabling agents to internalize experience, simulate human-like cognitive iteration, and progressively enhance problem-solving competence.

The memory serving as this self-evolving engine typically manifests in two dominant paradigms. The first is (I) parametric memory, which internalizes experiences by directly updating agents' parameters. While this approach can yield substantial performance gains, its reliance on parameter modification inevitably entails catastrophic forgetting, i.e., the erosion of general knowledge. Conversely, the second paradigm is (II) retrieval-based memory, which externalizes past experiences into a structured database, such as (i) raw trajectories, (ii) high-level experiences, and (iii) condensed skills like reusable APIs or MCP boxes. Although this non-invasive approach circumvents catastrophic forgetting, its efficacy is fundamentally tethered to context engineering. It adheres to a rigid execution pipeline, providing retrieved context to the agent without achieving the fluid, seamless integration characteristic of truly internalized memory.

Given these deficiencies, latent memory offers a compelling alternative, leveraging latent states as a machine-native, high-density medium for memory. Existing approaches either use (i) the key-value (KV) cache to maintain a dynamic memory set, which is primarily confined to addressing long-context issues, or (ii) latent token embeddings to store agent experiences, which still rely on invasive LLM parameter updates. LatentSeek and SoftCoT similarly belong to this category, utilizing latent embeddings to steer agent generation. Nevertheless, all these methods diverge from human cognition in two critical dimensions: they lack the seamless interleaving of reasoning and memory, a process where thought and memory dynamically reshape one another, and they remain largely retrieval-based, fetching memories by embedding similarity rather than generatively reconstructing them into novel, coherent insights. This leads to our pivotal research question:

How can we architect agent memory as a dynamic cognitive faculty, capable of fluid, reconstructive processes that interweave seamlessly with reasoning?

Figure 1: Comparison of parametric memory, retrieval-based memory, and MemGen. The illustration is inspired by Figure 1 in LatentSeek.

To address this challenge, we introduce MemGen, a dynamic and generative memory framework designed to endow any LLM agent with a more human-esque cognitive faculty. At its core, MemGen continuously monitors an agent's cognitive state, enabling it to dynamically invoke a generative process that synthesizes a bespoke latent memory at any critical juncture during its reasoning process. Practically, MemGen comprises two synergistic components: a reinforcement learning (RL)-trained memory trigger, which acts as a metacognitive monitor to discern the opportune moments for explicit memory invocation; and a memory weaver, which takes the agent's current state as a stimulus to draw upon relevant implicit parametric memory (potentially augmented with externally retrieved information) and then reconstructs this synthesis into a succinct, machine-native latent memory. With the reasoning core fixed, MemGen inherently mitigates catastrophic forgetting when exposed to new data, and, moving beyond the static and extractive paradigm of prior memory systems, equips agents with a fluid, generative faculty deeply integrated with reasoning cores.

Experimental Observation

Extensive experiments across nine benchmarks and four baseline categories demonstrate that MemGen delivers ① substantial performance gains, with improvements of up to 31.7% on ALFWorld and 27.1% on KodCode with Qwen3-8B, surpassing parametric memory (REINFORCE++, +5.8%) and the GRPO method (+5.32%); ② strong cross-domain generalization, where training in the math domain not only avoids degradation elsewhere but also boosts performance in science reasoning (+6.06%) and code generation (+5.1%); and ③ continual learning ability, maintaining stable performance in previously trained domains even after fine-tuning on three additional ones.

Analysis & Interpretation

Beyond quantitative evaluation, we sought to interpret the learned behavior of MemGen. Through post-hoc interventions that remove specific latent memories and measure the effect on different agent failure modes, we found that MemGen implicitly evolves a human-like memory hierarchy without any external guidance, including ① planning memory, where certain latent tokens specifically support high-level task planning, ② procedural memory, where some latent memory tokens facilitate the agent's recall of task-specific procedural skills, such as tool usage and answer formatting, and ③ working memory, where certain tokens help the agent maintain coherence and understanding over long contexts within a single task session. These specializations strongly suggest that MemGen endows the agent with precise, functionally distinct memory.

LLM & Agent Memory

As outlined in Section 1, existing memory mechanisms designed to evolve the problem-solving capacity of LLM agents can be broadly categorized into three classes: (I) parametric memory, which either integrates past experiences directly into agent parameters through finetuning, as in FireAct, AgentLumos, and others, or maintains them in external parameter modules; (II) retrieval-based memory, which abstracts prior experiences into transferable knowledge, or distills them into reusable tools and skills; and (III) latent memory, which leverages implicit representations to encode and retrieve experience. Our MemGen falls within the latent memory paradigm, yet distinguishes itself from prior approaches through its more human-esque interweaving of reasoning and memory, as well as its generative, rather than purely retrieval-based, nature.

Latent Computation

Our method is also closely related to latent computation, wherein latent states are employed to intervene in or reshape the LLM's reasoning process. Prominent paradigms include: (I) architecturally enabling native latent reasoning, exemplified by Coconut, CODI, LatentR3, and CoLaR, which render the LLM's inference process inherently latent and machine-native; and (II) employing latent computation to steer LLM generation, as in LaRS, LatentSeek, SoftCoT, and Coprocessor, which leverage latent representations to modulate the quality of generated outputs. These works have greatly inspired the latent memory design in this paper: latent memory can likewise be viewed as an instantiation of the latter paradigm, supplying essential memory context to enhance the problem-solving capacity of agents.

LLM Decoding & RL

Two additional topics that relate to our work are LLM decoding and reinforcement learning (RL). From the decoding perspective, MemGen dynamically generates and inserts latent tokens, which resembles speculative decoding, where a drafter model receives the current decoding context and produces subsequent drafted tokens. However, these methods primarily aim to accelerate LLM inference, whereas MemGen focuses on leveraging latent states as effective carriers of memory. From the RL perspective, MemGen employs rule-based RL to train the memory trigger, which is closely related to reinforcement learning with verifiable rewards (RLVR), including GRPO from DeepSeek-R1 and its various derivatives. While there exist efforts combining RL with agent memory, to our knowledge, most do not address self-improving memory; for example, MemAgent and MEM1 focus on handling long-context inputs rather than evolving memory mechanisms.

3. Preliminary

Notation

We formalize the agent's interaction within an environment $\mathcal{E}$. An agent, powered by an LLM parameterized by $\theta$, is denoted as $\pi_\theta$. For a given task $x$, the agent's interaction unfolds as a high-level trajectory $\tau = (s_0, a_0, s_1, a_1, \dots, s_T)$, where $s_t$ represents the state of the environment and $a_t$ is the high-level action taken by the agent. Internally, each action $a_t$ is a sequence of tokens, $a_t = (z_{t,1}, z_{t,2}, \dots, z_{t,L_t})$, generated autoregressively by the LLM. The generation of the $j$-th token is conditioned on the current state $s_t$ and all previously generated tokens within that action:

$$z_{t,j} \sim \pi_\theta(\cdot \mid s_t, z_{t,<j}).$$

After an entire action sequence $a_t$ is generated, it is executed in the environment, which transitions the state from $s_t$ to $s_{t+1}$. The success of the trajectory $\tau$ is evaluated by a reward function $R(\tau)$.
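To make the notation concrete, the following minimal sketch rolls out a trajectory in the form $(s_0, a_0, s_1, \dots, s_T)$, sampling each action token by token. The names `sample_action`, `rollout`, `toy_policy`, and `toy_env_step` are hypothetical stand-ins introduced only for illustration; they are not part of MemGen's code, and the toy policy is deterministic rather than a real LLM.

```python
from typing import Callable, List, Tuple

Action = Tuple[str, ...]


def sample_action(policy: Callable[[int, Action], str], state: int,
                  max_len: int = 8, eos: str = "<eos>") -> Action:
    """Build one action a_t token by token: z_{t,j} depends on s_t and z_{t,<j}."""
    tokens: List[str] = []
    for _ in range(max_len):
        z = policy(state, tuple(tokens))
        if z == eos:
            break
        tokens.append(z)
    return tuple(tokens)


def rollout(policy, env_step: Callable[[int, Action], int],
            s0: int, horizon: int) -> list:
    """Return the trajectory [s_0, a_0, s_1, a_1, ..., s_T]."""
    trajectory: list = [s0]
    state = s0
    for _ in range(horizon):
        action = sample_action(policy, state)
        state = env_step(state, action)  # the environment moves s_t to s_{t+1}
        trajectory += [action, state]
    return trajectory


def toy_policy(state: int, prefix: Action) -> str:
    # Hypothetical stand-in for pi_theta: emit two tokens, then stop.
    return ("move", "forward", "<eos>")[len(prefix)]


def toy_env_step(state: int, action: Action) -> int:
    # Hypothetical environment: the state is a simple step counter.
    return state + 1


trajectory = rollout(toy_policy, toy_env_step, s0=0, horizon=3)
```

A reward function $R(\tau)$ would then score the returned list as a whole, for example by inspecting the final state.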

Problem Formalization

Given a history of past experiences $\mathcal{H} = \{(x_i, \tau_i)\}_{i=1}^{N}$, the objective is to leverage this history to maximize the agent's performance on new tasks. The policy $\pi_\theta$ and a memory system $\mathcal{M}$ are thus jointly optimized to maximize the expected reward over a task distribution $\mathcal{D}$:

$$\max_{\theta,\, \mathcal{M}} \; \mathbb{E}_{x \sim \mathcal{D},\ \tau \sim \pi_{\theta, \mathcal{M}}} \big[ R(\tau) \big].$$

During this process, $\mathcal{M}$ produces a memory representation $m$ that conditions the agent's policy. The action at any timestep $t$ is thus sampled as $a_t \sim \pi_\theta(\cdot \mid s_t, m_t)$, where $m_t$ is the memory inserted at that step. Crucially, the nature and timing of memory generation, which we denote as the function $f_{\mathcal{M}}$, vary across paradigms. We express the generation of the memory $m_t$ as:

$$m_t = f_{\mathcal{M}}(s_t, \mathcal{H}, m_{<t}).$$

This formulation accommodates diverse memory invocation granularities. For task-level memory (e.g., ExpeL and G-Memory), $f_{\mathcal{M}}$ is invoked only at $t = 0$, and $m_t = m_0$ for all subsequent steps. For step-level memory (e.g., AgentKB), $f_{\mathcal{M}}$ is invoked at every step $t$ to update the memory. In parametric memory, the influence of $\mathcal{H}$ is compiled into $\theta$, rendering memory generation implicit in the model parameters. Our work, which introduces dynamic latent memory, focuses on designing a more fine-grained $f_{\mathcal{M}}$ that decides for itself the optimal moments to regenerate $m_t$ at the token level during the agent's reasoning process.
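The difference between task-level and step-level invocation can be sketched in a few lines. This is an illustrative toy under stated assumptions: `memories_for_episode` and `toy_f_M` are hypothetical names, and memories are plain strings rather than retrieved experiences or latent tokens.

```python
from typing import Callable, List, Sequence


def memories_for_episode(states: Sequence[str],
                         f_M: Callable[[str, list, List[str]], str],
                         history: list,
                         granularity: str) -> List[str]:
    """Return [m_0, ..., m_T] for one episode under a given invocation granularity."""
    memories: List[str] = []
    for t, s_t in enumerate(states):
        if granularity == "task":
            # Task-level: f_M runs once at t = 0; later steps reuse m_0.
            m_t = f_M(s_t, history, []) if t == 0 else memories[0]
        elif granularity == "step":
            # Step-level: f_M runs at every step and sees the earlier memories m_{<t}.
            m_t = f_M(s_t, history, list(memories))
        else:
            raise ValueError(f"unknown granularity: {granularity}")
        memories.append(m_t)
    return memories


def toy_f_M(s_t: str, history: list, previous: List[str]) -> str:
    # Hypothetical memory function: tags the state and counts earlier memories.
    return f"mem({s_t})|{len(previous)}"


states = ["s0", "s1", "s2"]
task_level = memories_for_episode(states, toy_f_M, history=[], granularity="task")
step_level = memories_for_episode(states, toy_f_M, history=[], granularity="step")
```

MemGen's token-level variant goes one step finer than either branch here: the decision to regenerate memory is made inside the generation of a single action rather than between actions.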

4. Methodology

Figure 2: Overview of the MemGen framework.

4.1. MemGen: Interleaving Memory and Reasoning

Just as a person is the sum of their past experiences, memory critically shapes an agent's actions. Existing agent memory systems, however, often lack the flexibility of human cognition. In the human brain, reasoning and memory form a seamless continuum: active reasoning in the frontoparietal control network and memory retrieval in the hippocampus and prefrontal cortices operate in an interleaved fashion, generating a “continuous flow of thoughts”. By contrast, many agent memory paradigms retrieve information once at task onset and append it coarsely to the query. MemGen is designed precisely to bridge this gap.

As shown in Figure 2, the reasoning process in an agent equipped with MemGen unfolds autoregressively, driven by a frozen core LLM, the reasoner $\pi_\theta$. For a given state $s_t$, $\pi_\theta$ generates the action $a_t = (z_{t,1}, \dots, z_{t,L_t})$. MemGen continuously monitors the token-by-token generation process and performs on-demand memory insertion. At each token-generation step $j$, a memory trigger $\mathcal{T}_{\text{trigger}}$ monitors the reasoner's internal cognitive state to determine whether a moment of reflection, i.e., a memory invocation, is necessary. Specifically, as the reasoner generates the token sequence $z_{t,<j}$, it produces a corresponding sequence of hidden state vectors, $\mathbf{H}_{t,<j} = (h_{t,1}, \dots, h_{t,j-1})$, where each $h_{t,k} \in \mathbb{R}^{d_{\text{model}}}$. The trigger takes the current hidden states $\mathbf{H}_{t,<j}$ as a representation of the reasoner's current metacognitive state and computes an invocation probability:

$$p_j = \sigma\big(\mathcal{T}_{\text{trigger}}(h_{t,1}, \dots, h_{t,j-1})\big).$$

From $p_j$, a binary decision $d_j \sim \mathrm{Bernoulli}(p_j)$, $d_j \in \{\texttt{INVOKE}, \texttt{SKIP}\}$, is sampled. If the decision is SKIP, $\pi_\theta$ proceeds with its standard autoregressive generation, i.e., $z_{t,j} \sim \pi_\theta(\cdot \mid s_t, z_{t,<j})$. However, if the decision is INVOKE, the reasoning process is momentarily paused. This summons the second core component of our framework: the memory weaver $\mathcal{W}_{\text{weaver}}$, which takes the same cognitive state $\mathbf{H}_{t,<j}$ as a stimulus to perform a generative act of recollection. It synthesizes a bespoke, machine-native latent memory, formalized as $\mathbf{M}_t \in \mathbb{R}^{K \times d_{\text{model}}}$ with fixed length $K$:

$$\mathbf{M}_t := [m_{t,1}, m_{t,2}, \dots, m_{t,K}] = \mathcal{W}_{\text{weaver}}(\mathbf{H}_{t,<j}).$$

Here, the memory is generated not merely from the parametric knowledge encoded within $\mathcal{W}_{\text{weaver}}$ but may also incorporate cues retrieved from external memory databases. Crucially, $\mathbf{M}_t$ is not a verbatim restatement of prior content but a selective reconstruction, filtered and integrated through $\mathcal{W}_{\text{weaver}}$, akin to the hippocampus consolidating fragments of recollection into human memory. Once formed, the latent memory is woven seamlessly into the reasoner's ongoing dynamics: its hidden states are prepended to $\mathbf{H}_{t,<j}$, upon which the reasoner resumes generation conditioned on this enriched context:

$$z_{t,j} \sim \pi_\theta(\cdot \mid s_t, z_{t,<j}, \mathbf{M}_t).$$

This iterative cycle of generation, monitoring, invocation, weaving, and reintegration elevates reasoning from a linear unfolding to a recursive dialogue with memory, all without altering the frozen reasoner $\pi_\theta$, thereby preserving its general capabilities. In the following sections, we detail the implementations of the memory trigger (Section 4.2) and the memory weaver (Section 4.3).
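The generate, monitor, invoke, weave, and reintegrate cycle of Section 4.1 can be sketched as a plain loop. This is a toy illustration under stated assumptions, not the authors' implementation: hidden states are one-dimensional lists, $\sigma$ is assumed to be the logistic sigmoid, and `generate_with_memory`, `toy_reasoner_step`, `toy_trigger`, and `toy_weaver` are hypothetical names standing in for the frozen reasoner, the trigger, and the weaver.

```python
import math
import random
from typing import Callable, List, Tuple

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def generate_with_memory(
    reasoner_step: Callable[[List[Vector], List[str]], Tuple[str, Vector]],
    trigger: Callable[[List[Vector]], float],
    weaver: Callable[[List[Vector]], List[Vector]],
    num_tokens: int,
    rng: random.Random,
) -> Tuple[List[str], List[Vector], int]:
    context: List[Vector] = []  # hidden states the reasoner conditions on
    tokens: List[str] = []
    invocations = 0
    for _ in range(num_tokens):
        if context:
            p_j = sigmoid(trigger(context))       # invocation probability
            if rng.random() < p_j:                # d_j ~ Bernoulli(p_j); True means INVOKE
                latent_memory = weaver(context)   # K latent vectors (M_t)
                context = latent_memory + context # prepend memory to the hidden states
                invocations += 1
        token, hidden = reasoner_step(context, tokens)  # the reasoner itself is never updated
        tokens.append(token)
        context.append(hidden)
    return tokens, context, invocations


def toy_reasoner_step(context: List[Vector], tokens: List[str]) -> Tuple[str, Vector]:
    # Hypothetical frozen reasoner: names the token and encodes the context size.
    return f"tok{len(tokens)}", [float(len(context))]


def toy_trigger(context: List[Vector]) -> float:
    # Hypothetical trigger logit: strongly favors INVOKE only when two states exist.
    return 50.0 if len(context) == 2 else -50.0


def toy_weaver(context: List[Vector], k: int = 2) -> List[Vector]:
    # Hypothetical weaver: K copies of the mean hidden state.
    mean = sum(h[0] for h in context) / len(context)
    return [[mean] for _ in range(k)]


tokens, context, invocations = generate_with_memory(
    toy_reasoner_step, toy_trigger, toy_weaver, num_tokens=4, rng=random.Random(0)
)
```

In this toy run the trigger fires once, so the final context holds the four token states plus the two inserted latent vectors.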

4.2. Memory Trigger: Learning to Invoke Memory

In this section, we describe the concrete instantiation and training recipe of the memory trigger $\mathcal{T}_{\text{trigger}}$. Recall from Section 4.1 that $\mathcal{T}_{\text{trigger}}$ serves as a metacognitive monitor, observing the evolving reasoning state of the frozen reasoner $\pi_\theta$ and producing a binary decision $d_j \in \{\texttt{INVOKE}, \texttt{SKIP}\}$ (encoded as 1 for INVOKE and 0 for SKIP) that determines whether the memory weaver should be invoked at token $j$.

Instantiation

We instantiate $\mathcal{T}_{\text{trigger}}$ as a lightweight LoRA adapter attached to the reasoner $\pi_\theta$. At decoding step $j$ of timestep $t$, it receives the sequence of all hidden states, $\mathbf{H}_{t,<j} \in \mathbb{R}^{(j-1) \times d_{\text{model}}}$; conditioned on this context, $\mathcal{T}_{\text{trigger}}$ outputs the action probability $P(d_j = \texttt{INVOKE}) \in [0, 1]$. To avoid excessive computational overhead, we adopt a sentence-granularity activation strategy, inspired by recent studies on LLM interpretability, which find that interventions between sentences can more effectively guide LLMs' reasoning path. Specifically, we define a delimiter token set $D$ (e.g., commas, periods) and let the trigger act only when the current token falls in $D$. The invocation decision is computed as:

$$d_j \sim \mathrm{Bernoulli}(p_j), \qquad p_j = \begin{cases} 0 & \text{if } z_j \notin D, \\ \mathcal{T}_{\text{trigger}}(\mathbf{H}_{t,<j}) & \text{if } z_j \in D. \end{cases}$$

This ensures that $\mathcal{T}_{\text{trigger}}$ is invoked only at semantically significant boundaries, preserving decoding efficiency. We validate in the Appendix that MemGen does not incur excessive inference delay.
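The delimiter gate amounts to a guard in front of the trigger call. The sketch below assumes a small hypothetical delimiter set and a constant `toy_trigger`; the actual delimiter tokens and trigger network are not specified here beyond "commas, periods".

```python
from typing import Callable, List, Set

# Assumed delimiter set for illustration; the source only names commas and periods.
DELIMITERS: Set[str] = {",", ".", ";", "?", "!"}


def invocation_probability(current_token: str,
                           hidden_states: list,
                           trigger: Callable[[list], float],
                           delimiters: Set[str] = DELIMITERS) -> float:
    """p_j = 0 off delimiters; otherwise the trigger's probability on H_{t,<j}."""
    if current_token not in delimiters:
        return 0.0  # the trigger is not evaluated at all, which saves compute
    return trigger(hidden_states)


trigger_calls: List[int] = []


def toy_trigger(hidden_states: list) -> float:
    # Hypothetical trigger: records how many states it saw, returns a fixed probability.
    trigger_calls.append(len(hidden_states))
    return 0.5


tokens = ["First", ",", "open", "the", "drawer", "."]
# list(range(j)) is a placeholder for the j hidden states produced before position j.
probabilities = [
    invocation_probability(tok, list(range(j)), toy_trigger)
    for j, tok in enumerate(tokens)
]
```

Of the six positions, only the two delimiter positions reach the trigger, which is where the decoding-time savings come from.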

Training Recipe

The memory trigger is trained via reinforcement learning, motivated by the need to balance two competing desiderata: ensuring that critical latent memories are invoked to improve task performance, while avoiding unnecessary or spurious invocations that could disrupt reasoning or incur computational overhead. Given a batch of seen tasks $\mathcal{H} = \{(x_i, \tau_i)\}_{i=1}^{N}$, the frozen reasoner $\pi_\theta$ generates candidate trajectories while the memory weaver $\mathcal{W}_{\text{weaver}}$ remains fixed. At each activated step, the trigger selects an action $\tilde{d}_j \in \{\texttt{INVOKE}, \texttt{SKIP}\}$ and receives a reward $r(\tau_i)$ reflecting the quality of the resulting trajectory with respect to the task objective. To encourage sparse yet strategically critical memory invocation, we introduce a reward-adaptive penalty, which discourages unnecessary activations while preserving essential ones, into the objective:

$$\max_{\phi} \; \mathbb{E}_{\tau_i \sim \pi_\theta,\ \tilde{d} \sim \mathcal{T}^{\phi}_{\text{trigger}}} \Big[ R(\tau_i) - \lambda \sum_{i,j} \max\big(0,\ \tilde{d}_{i,j} - \bar{p}\big) \Big].$$

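Here, $\bar{p}$ is computed as the mean activation probability across high-reward trajectories, i.e., those with reward exceeding the batch median:

$$\bar{p} = \frac{1}{|\mathcal{H}_{\text{high}}|} \sum_{i \in \mathcal{H}_{\text{high}}} \frac{1}{|\tau_i|} \sum_{j} \tilde{d}_{i,j}, \qquad \mathcal{H}_{\text{high}} = \big\{ i : R(\tau_i) > \operatorname{median}_k\big(R(\tau_k)\big) \big\}.$$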

This ensures that $\mathcal{T}_{\text{trigger}}$ learns to invoke memory selectively at key decision points, maximizing task reward while maintaining computational efficiency.
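A small numeric sketch may help show how the reward-adaptive penalty behaves. It assumes binary decisions (1 for INVOKE, 0 for SKIP), a strict "greater than the median" rule for the high-reward set, and a per-trajectory reading of the penalized reward; the function names and the fallback for an empty high-reward set are hypothetical choices made for illustration.

```python
from statistics import median
from typing import List, Sequence


def high_reward_activation_rate(rewards: Sequence[float],
                                decisions: Sequence[Sequence[int]]) -> float:
    """p_bar: mean invocation rate over trajectories whose reward exceeds the median."""
    threshold = median(rewards)
    high = [i for i, r in enumerate(rewards) if r > threshold]
    if not high:
        # Assumption: if no reward exceeds the median, fall back to all trajectories.
        high = list(range(len(rewards)))
    rates = [sum(decisions[i]) / len(decisions[i]) for i in high]
    return sum(rates) / len(rates)


def penalized_rewards(rewards: Sequence[float],
                      decisions: Sequence[Sequence[int]],
                      lam: float) -> List[float]:
    """Per-trajectory R(tau_i) - lam * sum_j max(0, d_ij - p_bar)."""
    p_bar = high_reward_activation_rate(rewards, decisions)
    return [
        r - lam * sum(max(0.0, d - p_bar) for d in traj)
        for r, traj in zip(rewards, decisions)
    ]


rewards = [1.0, 0.0, 1.0, 0.0]
decisions = [
    [1, 0, 0, 0],  # successful, sparse invocation
    [1, 1, 1, 1],  # failed, invokes at every activated step
    [0, 1, 0, 0],  # successful, sparse invocation
    [0, 0, 0, 0],  # failed, never invokes
]
p_bar = high_reward_activation_rate(rewards, decisions)
shaped = penalized_rewards(rewards, decisions, lam=0.1)
```

In this toy batch the always-invoking trajectory receives the largest penalty, while the successful sparse ones lose only a small amount, which matches the stated goal of sparse but useful invocation.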

4.3. Memory Weaver: Synthesizing and Inserting Latent Memory

In this section, we elaborate on the weaver $\mathcal{W}_{\text{weaver}}$, the memory carrier within the MemGen framework. When the agent assimilates new experiences, this information is exclusively internalized into the parameters of $\mathcal{W}_{\text{weaver}}$, leaving the core reasoner $\pi_\theta$ entirely unmodified. At junctures where the reasoner requires experiential support, a context-dependent hook activates the weaver to synthesize and externalize pertinent knowledge as a usable memory. More specifically, recall from Section 4.1 that after $\mathcal{T}_{\text{trigger}}$ signals the need for memory at step $j$, $\mathcal{W}_{\text{weaver}}$ accepts $\mathbf{H}_{t,<j}$ (as the hook) and generates a latent token sequence $\mathbf{M}_t$ (as the memory) for $\pi_\theta$.

Instantiation

We instantiate $\mathcal{W}_{\text{weaver}}$ using another LoRA adapter attached to $\pi_\theta$. Formally, given the incoming hook $\mathbf{H}_{t,<j} \in \mathbb{R}^{(j-1) \times d_{\text{model}}}$, the weaver outputs a latent memory matrix $\mathbf{M}_t = \mathcal{W}^{\theta_{\text{lora}}}_{\text{weaver}}(\mathbf{H}_{t,<j}) \in \mathbb{R}^{K \times d_{\text{model}}}$, where $K$ denotes the fixed length of the latent memory sequence and $\theta_{\text{lora}}$ denotes the trainable LoRA parameters, kept separate from the frozen reasoner parameters $\theta$. The synthesized $\mathbf{M}_t$ is then prepended to the current hidden states of $\pi_\theta$ to guide subsequent token generation, as described in Section 4.1.

Training Recipe

The training of $\mathcal{W}_{\text{weaver}}$ proceeds over a batch of past trajectories $\mathcal{H} = \{(x_i, \tau_i)\}_{i=1}^{N}$. Distinct from conventional agent tuning, which directly integrates experiential data into the parameters of $\pi_\theta$, MemGen internalizes experiential knowledge solely into $\mathcal{W}_{\text{weaver}}$, which ensures that the general capabilities of $\pi_\theta$ remain intact. Crucially, this separation makes MemGen agnostic to optimization strategies and compatible with diverse LLM backbones. Whether employing supervised fine-tuning (SFT) or RL-based objectives such as GRPO or DAPO, the weaver can be updated under a unified goal: optimizing the generation process of latent memory so as to maximize downstream reward. Formally, let $\Pi_{\theta}^{\mathcal{W}_{\theta_{\text{lora}}},\, \mathcal{T}}(x)$ denote the process of rolling out a trajectory for a task $x$ by $\pi_\theta$ in conjunction with the weaver $\mathcal{W}_{\theta_{\text{lora}}}$ and the trigger $\mathcal{T}$. Given a reward functional $R$, the objective updates only $\theta_{\text{lora}}$ by maximizing the expected reward:

$$\max_{\theta_{\text{lora}}} \; \mathbb{E}_{(x_i, \tau_i) \sim \mathcal{H}} \; \mathbb{E}_{\tau \sim \Pi_{\theta}^{\mathcal{W}_{\theta_{\text{lora}}},\, \mathcal{T}}(x_i)} \big[ R(x_i, \tau) \big].$$

Here, the gradients from $R$ are propagated solely to $\theta_{\text{lora}}$, thereby equipping the weaver to supply precisely the memories that improve end-to-end performance without altering $\pi_\theta$. This objective enables $\mathcal{W}_{\text{weaver}}$ to absorb diverse experiential signals and externalize them as dynamic, context-sensitive latent memories, independent of the architectural or training paradigm of the base reasoner.
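The key property of this recipe is that only the weaver's parameters move while the reasoner stays frozen. The scalar toy below illustrates that property only; it swaps in finite-difference gradient ascent on a squared-error reward in place of SFT or GRPO, and every name in it (`reasoner`, `weaver`, `train_weaver`, the scalar parameters) is a hypothetical stand-in rather than the authors' training code.

```python
from typing import List, Tuple


def reasoner(theta: float, memory: float, x: float) -> float:
    # Frozen reasoner: its output depends on the input and on the inserted memory.
    return theta * x + memory


def weaver(w: float, x: float) -> float:
    # Trainable weaver with a single scalar parameter w.
    return w * x


def reward(theta: float, w: float, x: float, target: float) -> float:
    # Higher is better: negative squared error of the memory-augmented output.
    return -(reasoner(theta, weaver(w, x), x) - target) ** 2


def train_weaver(theta: float, w: float, data: List[Tuple[float, float]],
                 lr: float = 0.05, steps: int = 200, eps: float = 1e-4) -> float:
    """Gradient ascent on the mean reward with respect to w only; theta is never updated."""
    def mean_reward(w_value: float) -> float:
        return sum(reward(theta, w_value, x, y) for x, y in data) / len(data)

    for _ in range(steps):
        grad = (mean_reward(w + eps) - mean_reward(w - eps)) / (2 * eps)
        w += lr * grad
    return w


theta_frozen = 1.0
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # targets follow y = 3x
w_trained = train_weaver(theta_frozen, w=0.0, data=data)
```

Because the frozen reasoner alone produces `x`, the weaver has to learn a memory of roughly `2 * x` to reach the targets, and it does so without touching `theta_frozen`.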

Integration with Retrieval-based Memory

Although the memory generation above primarily draws on the weaver's parametric knowledge, it can be combined with external memory sources. When triggered, any retrieval-based system (e.g., MemoryBank, ExpeL) can provide textual memory, which is merged with the hook $\mathbf{H}_{t,<j}$ and fed into $\mathcal{W}_{\text{weaver}}$ to produce latent memory. This allows $\mathcal{W}_{\text{weaver}}$ to integrate internal knowledge and external information, supplying the reasoner with richer memory support. Implementation details and results are provided in the Appendix.

5. Experiments

In this section, we conduct extensive experiments to answer the following research questions:

  • (RQ1) Can MemGen surpass both parametric and retrieval-based memory?
  • (RQ2) Is the memory learnt by MemGen generalizable across task domains? And why?
  • (RQ3) Can MemGen facilitate continual learning and mitigate catastrophic forgetting?
  • (RQ4) Does MemGen implicitly evolve a human-like memory hierarchy?

5.1. Experimental Setup

Evaluation and Benchmarks

Our evaluation covers nine datasets from five domains, including ① web search: TriviaQA and PopQA; ② embodied action: ALFWorld; ③ math reasoning: AQuA, GSM8K, and MATH; ④ scientific reasoning: GPQA; and ⑤ coding: KodCode and BigCodeBench.

Baselines

We compare MemGen against twelve baselines, categorized into four groups: (I) Prompt-based methods: the vanilla model and CoT; (II) Parametric memory, where experiential knowledge directly modifies model parameters: SFT, GRPO, REINFORCE, REINFORCE++, and Agent-FLAN; (III) Retrieval-based memory, which processes tasks sequentially and stores experiences in an external database, represented by MemoryBank, ExpeL, and Agent Workflow Memory (AWM); and (IV) Latent computation, which leverages latent tokens as carriers of experiential knowledge, including SoftCoT and Co-processor.

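Table 1: Results on SmolLM3-3B and Qwen3-8B. All values are per-task performance metrics (e.g., accuracy, %). Bold marks the best result and underline the second best.

| Method | ALFWorld | TriviaQA | PopQA | KodCode | BigCodeBench | GPQA | GSM8K | MATH |
|---|---|---|---|---|---|---|---|---|
| SmolLM3-3B | | | | | | | | |
| Vanilla | 18.96 | 10.47 | 8.23 | 37.05 | 35.96 | 9.35 | 47.63 | 16.22 |
| CoT | 17.60 | 12.88 | 9.95 | 38.45 | 39.42 | 20.70 | 58.91 | 56.33 |
| SFT | 32.36 | 55.25 | 37.22 | 59.25 | 40.79 | 19.70 | 63.48 | 45.65 |
| GRPO | 55.35 | 65.88 | 45.16 | 68.48 | 72.44 | 22.73 | 80.03 | 61.23 |
| REINFORCE | 53.13 | 63.20 | 46.81 | 65.53 | 67.14 | 23.44 | 82.03 | 58.75 |
| REINFORCE++ | 53.95 | 63.20 | 44.10 | 65.90 | 68.80 | 22.73 | 81.50 | 59.89 |
| Agent-FLAN | 34.00 | 56.70 | 39.50 | 56.80 | 37.20 | 17.80 | 59.60 | 36.84 |
| ExpeL | 36.18 | 46.20 | 28.16 | 51.14 | 40.22 | 15.15 | 56.23 | 38.11 |
| MemoryBank | 32.80 | 43.30 | 25.81 | 44.50 | 31.80 | 10.20 | 58.30 | 43.53 |
| AWM | 40.50 | 49.80 | 29.60 | - | - | - | - | - |
| SoftCoT | 35.03 | 50.38 | 34.90 | 59.20 | 39.10 | 17.22 | 56.34 | 44.62 |
| Co-processor | 38.36 | 53.28 | 38.96 | 56.25 | 45.40 | 20.10 | 57.60 | 38.81 |
| MemGen-SFT | 50.60 | 68.13 | 42.34 | 62.65 | 42.99 | 26.75 | 70.42 | 57.44 |
| MemGen-GRPO | 63.60 | 79.30 | 58.60 | 72.85 | 74.24 | 25.20 | 83.47 | 63.65 |
| Qwen3-8B | | | | | | | | |
| Vanilla | 58.93 | 52.18 | 34.13 | 49.10 | 33.33 | 38.18 | 89.48 | 79.82 |
| CoT | 57.10 | 53.80 | 33.20 | 51.25 | 35.59 | 35.15 | 87.67 | 78.24 |
| SFT | 83.59 | 74.55 | 51.12 | 64.75 | 41.33 | 40.33 | 90.76 | 81.35 |
| GRPO | 85.60 | 76.15 | 58.90 | 73.35 | 70.24 | 39.54 | 92.30 | 83.54 |
| REINFORCE | 82.10 | 75.22 | 57.96 | 72.11 | 70.20 | 37.12 | 91.25 | 83.27 |
| REINFORCE++ | 84.80 | 75.90 | 58.30 | 72.90 | 71.88 | 37.68 | 91.90 | 85.24 |
| Agent-FLAN | 80.32 | 70.32 | 50.08 | 62.99 | 43.40 | 39.50 | 87.60 | 80.05 |
| ExpeL | 78.97 | 65.54 | 40.33 | 57.20 | 34.23 | 35.15 | 86.20 | 77.40 |
| MemoryBank | 70.41 | 60.56 | 41.60 | 56.39 | 40.61 | 35.66 | 90.35 | 80.35 |
| AWM | 80.33 | 69.30 | 43.69 | - | - | - | - | - |
| SoftCoT | 75.60 | 59.42 | 39.42 | 63.28 | 38.27 | 39.60 | 86.30 | 76.23 |
| Co-processor | 73.28 | 61.42 | 45.55 | 64.90 | 42.19 | 39.15 | 76.23 | 79.20 |
| MemGen-SFT | 85.82 | 77.22 | 54.65 | 66.15 | 40.35 | 43.23 | 91.25 | 83.30 |
| MemGen-GRPO | 90.60 | 80.65 | 62.30 | 76.16 | 75.56 | 40.24 | 93.20 | 88.24 |
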
Implementation Details

We select LLM backbones of varying sizes, including Qwen-2.5-1.5B, HuggingFace's SmolLM3-3B, and Qwen3-8B. The length $K$ of each latent memory sequence is chosen from $\{2, 4, 8\}$. MemGen does not rely on a specific optimization algorithm, so we implement two variants, MemGen-SFT and MemGen-GRPO, in which the weaver is updated using SFT and GRPO signals, respectively. Details on these variants, along with further training setup and parameter configurations, are provided in the Appendix.

5.2. Main Results

[For RQ1] MemGen provides high-performing memory across domains.

As shown in the Appendix and Table 1, existing baselines adapt poorly across domains. Retrieval-based memories (e.g., ExpeL, MemoryBank, AWM) occasionally surpass parametric tuning in embodied action; for instance, ExpeL reaches 36.18% on ALFWorld with SmolLM3-3B, exceeding SFT by 3.82%. Yet their effectiveness deteriorates on reasoning-intensive tasks: ExpeL achieves only 8.12% on GPQA+Qwen2.5-1.5B, and even underperforms the vanilla model by 6.9% on TriviaQA, underscoring its heavy reliance on backbone capacity. Parametric finetuning methods display the opposite tendency: they excel in structured domains such as code generation, where REINFORCE++ reaches 63.33% on KodCode with Qwen2.5-1.5B, but remain weak in knowledge-intensive reasoning, with GPQA below 14%. In contrast, MemGen consistently advances performance across all domains. For example, on ALFWorld+SmolLM3-3B, MemGen-SFT and MemGen-GRPO attain 50.60% and 63.60%, improving over vanilla by 31.64% and 44.64%, respectively. Similar gains appear with the larger Qwen3-8B, where MemGen-GRPO achieves +27.06% on KodCode and +28.17% on PopQA, surpassing GRPO by up to 3.4%. Overall, the dynamic memory insertion of MemGen delivers substantial improvements across diverse task domains.
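
The gains quoted above are plain differences between rows of Table 1. As a minimal sketch, the snippet below recomputes them for Qwen3-8B from three rows of the table. The dictionary layout and the helper name `gain` are illustrative choices, not part of MemGen, and the results are absolute percentage-point differences.

```python
# Accuracy (%) for Qwen3-8B, copied from Table 1.
# Only the three task columns whose position the text confirms are included.
QWEN3_8B = {
    "Vanilla":     {"ALFWorld": 58.93, "PopQA": 34.13, "KodCode": 49.10},
    "GRPO":        {"ALFWorld": 85.60, "PopQA": 58.90, "KodCode": 73.35},
    "MemGen-GRPO": {"ALFWorld": 90.60, "PopQA": 62.30, "KodCode": 76.16},
}


def gain(table, method, baseline):
    """Hypothetical helper: per-task difference in percentage points."""
    return {
        task: round(table[method][task] - table[baseline][task], 2)
        for task in table[method]
    }


gain_vs_vanilla = gain(QWEN3_8B, "MemGen-GRPO", "Vanilla")
gain_vs_grpo = gain(QWEN3_8B, "MemGen-GRPO", "GRPO")

print(gain_vs_vanilla)
print(gain_vs_grpo)
```

With these rows, the KodCode and PopQA gains over vanilla come out to 27.06 and 28.17 points, and the PopQA margin over GRPO to 3.4 points, matching the figures in the text.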

Figure 3: Generalization study of MemGen. MemGen-SFT is trained on one dataset (ALFWorld or TriviaQA) and evaluated on four datasets (TriviaQA, ALFWorld, ScienceWorld, and FEVER).
[For RQ2] MemGen Exhibits Strong Cross-Domain Generalization.

To evaluate whether the memory learned by MemGen transfers across tasks, we train MemGen on one dataset and test it on several others. We include two out-of-domain datasets, ScienceWorld and FEVER, to probe this further. As shown in Figure 3 and the Appendix, baselines such as SFT and MemoryBank achieve gains within their training domains (e.g., on ALFWorld, SFT +14.1% and MemoryBank +5.4% compared with vanilla), yet fail to generalize, with performance dropping sharply on FEVER by 16.2%. In contrast, MemGen not only attains substantial in-domain improvements (24.55% → 58.16% on KodCode, see the Appendix), but also transfers effectively: when trained on KodCode, performance on MATH rises from 36.6% to 54.2%. Having empirically validated MemGen's generalizability, we take a further step to investigate the mechanism that underlies this capability.

Figure 4: Memory invocation frequency across benchmarks at inference time (MemGen-SFT + Qwen3-8B, trained on GSM8K).
[For RQ2] The Memory Trigger Intelligently Determines When to Activate Memory Insertion, Mitigating Domain Conflict.

After training MemGen on GSM8K, we evaluate 150 samples each from GSM8K, KodCode, and GPQA, visualizing how often the memory trigger invokes the memory weaver at each relative position in the model output. We observe that the invocation frequency varies across domains and tracks the performance gains reported in the Appendix: GSM8K exhibits the largest improvement (+19.64%) and the most invocations, GPQA achieves moderate gains (+6.06%) with a moderate number of invocations, and KodCode shows the smallest improvement (+3.1%) with the fewest invocations. This indicates that MemGen autonomously assesses, based on task-specific context, when memory insertion will be beneficial, invoking the weaver less frequently in unfamiliar domains.
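
A relative-position frequency of this kind can be computed by normalizing each invocation index by the length of its output and binning the result. The sketch below assumes each sample is logged as an output length plus the token indices at which the trigger fired. That log format, the function name, and the bin count are assumptions for illustration, not MemGen's actual tooling.

```python
def invocation_histogram(samples, num_bins=10):
    """Mean number of trigger invocations per sample in each relative-position bin.

    samples: list of (output_length, positions), where positions are the
    0-based token indices at which the trigger invoked the weaver.
    """
    counts = [0] * num_bins
    used = 0
    for output_length, positions in samples:
        if output_length <= 0:
            continue  # skip empty outputs
        used += 1
        for pos in positions:
            relative = pos / output_length            # value in [0, 1)
            bin_idx = min(int(relative * num_bins), num_bins - 1)
            counts[bin_idx] += 1
    if used == 0:
        return [0.0] * num_bins
    return [c / used for c in counts]


# Toy log: two outputs of length 100 and 20 (hypothetical values).
samples = [(100, [5, 50, 99]), (20, [1, 10])]
hist = invocation_histogram(samples, num_bins=4)
print(hist)
```

Normalizing by output length is what makes outputs of different lengths comparable on one axis; running the function once per benchmark gives one curve per domain.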

[For RQ3] MemGen Effectively Mitigates Catastrophic Forgetting.

In the Appendix, we sequentially train on four datasets and evaluate on all benchmarks after each stage; MemGen retains knowledge better than the baseline methods. For example, unlike SFT, which primarily improves performance on the most recent task (54.10% on KodCode but only 2.53% on GPQA), MemGen demonstrates more balanced cross-task generalization, attaining 38.43% on AQuA and 21.72% on GPQA after GSM8K training. It also mitigates forgetting on earlier tasks, preserving 40.34% on AQuA following KodCode training compared to 27.14% for ExpeL and 28.61% for SFT, indicating a more stable continual learning ability. Further analysis is given in the Appendix.

5.3. Framework Analysis

Having established the expressive capabilities of MemGen, we further investigate its underlying mechanisms: what do the learned latent memories look like? Do they have specialized functions?

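Figure 5: (Left) t-SNE visualization of the latent memories generated by MemGen + Qwen3-8B across datasets; (Middle, Right) latent memories within the TriviaQA and GSM8K datasets, clustered with K-means. The text at each cluster center is a pattern shared by many memory sequences in that cluster; for example, many sequences in Cluster 0 of GSM8K end with "_check".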
[For RQ4] The Latent Memory Is Machine-Native and Human-Unreadable.

We first visualized the latent memory sequences learned by MemGen across different datasets using t-SNE in Figure 5 and Appendix. As shown in Figure 5 (Left), sequences from distinct domains form separate distributions, with related domains clustering closely (e.g., KodCode and BigCodeBench, GSM8K and MATH). Examining latent memories within the same dataset, we observed pronounced clustering patterns (as shown in Figure 5 (Middle and Right)). To explore potential commonalities within these clusters, we forcibly decoded the latent tokens. Although the decoded sequences are not human-readable, they exhibit intriguing regularities: many tokens within a cluster share structural conventions. For example, Cluster 0 in TriviaQA frequently follows the pattern "[...]SOC", whereas Cluster 3 in GSM8K often adopts the format "[...]_pick". A large corpus of latent memory tokens is provided in Appendix. Despite these sequences being machine-native and human-unreadable, we further investigate whether their underlying semantics can be interpreted.
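
The text does not specify how latent tokens are forcibly decoded. One plausible reading, assumed in the sketch below, is to map each latent vector to the vocabulary token whose embedding is most similar under cosine similarity. The toy vocabulary, the vectors, and the function names are hypothetical and serve only to illustrate the idea.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def force_decode(latents, vocab_embeddings):
    """Map each latent vector to its most similar vocabulary token (assumed scheme)."""
    return [
        max(vocab_embeddings, key=lambda tok: cosine(vec, vocab_embeddings[tok]))
        for vec in latents
    ]


# Toy 2-D embedding table; real embeddings have thousands of dimensions.
vocab_embeddings = {
    "_check": [1.0, 0.0],
    "SOC": [0.0, 1.0],
    "the": [0.7, 0.7],
}
latents = [[0.9, 0.1], [0.1, 0.8], [0.5, 0.6]]
decoded = force_decode(latents, vocab_embeddings)
print(decoded)
```

Because a latent vector need not lie near any single token embedding, the nearest token is only a rough proxy, which is consistent with the decoded sequences being regular in form but not readable.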

Figure 6: (Left) Sensitivity analysis of the latent memory length K; (Right) effect of selectively removing latent memory clusters on different agent failure modes, on the TriviaQA dataset.
[For RQ4] MemGen Implicitly Learns a Human-like Memory Hierarchy.

To uncover the functional roles of different latent memory clusters, we conducted a post-hoc intervention study. Following the taxonomy from prior work, we study eight distinct types of agent failure, including planning errors, tool response/parsing failures, and answer formatting mistakes, which provides a structured framework for assessing how memory influences performance. During evaluation, we selectively removed latent tokens close to a specific cluster while keeping the others intact, and measured the resulting changes in these failure modes. Details on (1) the visualization process, (2) failure mode annotation, and (3) token filtration are in the Appendix. As shown in Figure 6 (Right), distinct memory clusters exhibit varying influence on failure modes and can be mapped to different memory functions:

  • Planning Memory supports high-level task planning and strategic reasoning.

Removal of Cluster 2 substantially increases planning and compositional reasoning failures, indicating that this cluster is crucial for guiding the LLM agent's decision-making and sequencing of reasoning steps.

  • Procedural Memory captures task-specific operational knowledge, such as tool usage and formatting ability.

Cluster 3 corresponds to this role, as its removal leads to a marked increase in tool response errors, parsing failures, and answer formatting mistakes.

  • Working Memory manages the retention and effective use of prior context to maintain reasoning consistency.

Clusters 1 and 4 contribute to this function: for instance, removing Cluster 1’s memory tokens results in more frequent task misunderstandings and think-act inconsistency.

Nevertheless, these memory clusters are not entirely independent: for example, removing Cluster 1 also negatively affects planning ability, indicating that these memory faculties interact and jointly enable the LLM to leverage past experience effectively. This analysis reveals that MemGen spontaneously organizes latent memory into a structured, human-like hierarchy.
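
The cluster-removal intervention can be pictured as a filter over the latent sequence. The sketch below assumes that "close to a specific cluster" means "nearest to that cluster's K-means centroid in Euclidean distance"; the exact criterion is given in the Appendix and may differ. All names and the toy data are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_cluster(vec, centroids):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(vec, centroids[i]))


def ablate_cluster(latents, centroids, removed_cluster):
    """Drop latent tokens assigned to removed_cluster; keep the rest in order."""
    return [v for v in latents
            if nearest_cluster(v, centroids) != removed_cluster]


# Toy 2-D centroids and latent tokens (hypothetical values).
centroids = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
latents = [[1.0, 1.0], [9.0, 9.0], [0.0, 9.0], [10.0, 11.0]]

kept = ablate_cluster(latents, centroids, removed_cluster=1)
print(kept)
```

Re-running the evaluation with the filtered sequence and comparing per-failure-mode counts against the unfiltered run would then give the kind of per-cluster effect reported in Figure 6 (Right).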

Ablation Study & Sensitivity Analysis

We conduct a sensitivity analysis on the length K of the latent memory sequence, as shown in Figure 6 (Left). As the latent token length increases from 2 to 32, MemGen's performance improves correspondingly, likely reflecting the expanded memory capacity. We then perform an ablation study on the memory trigger module in the Appendix, demonstrating that a dedicated, trained trigger is necessary for effective memory invocation. Furthermore, we analyze different training paradigms for the memory weaver in the Appendix, which also contains additional results and discussion.

Efficiency Analysis

To confirm that the memory insertion process of MemGen does not introduce significant inference overhead, we show in Appendix that, while achieving up to 57.66% performance improvement over vanilla LLMs, the per-query inference delay remains consistently below the baseline, ranging from 24% to 94% of the vanilla LLM latency. This clearly demonstrates that MemGen delivers substantial performance gains without compromising efficiency.

6. Conclusion

In this work, we introduced MemGen, a dynamic and generative memory framework designed for LLM agents. By interleaving reasoning with memory synthesis through a reinforcement-learned memory trigger and a generative memory weaver, MemGen transcends the limitations of parametric and retrieval-based paradigms. Extensive experiments showcase substantial performance gains, robust cross-domain generalization, strong continual learning ability, and a spontaneously emerging memory hierarchy (i.e., planning, procedural, and working memory) that MemGen learns without explicit supervision. These results suggest a promising path toward self-evolving LLM agents capable of fluid and reconstructive intelligence.
