E-mem: Multi-Agent Based Episodic Context Reconstruction for LLM Agent Memory

Memory, Agent, Multi-Agent, Long-context, ICML 2026, CCF-A, SJTU, State Key Lab of Digital Finance, SJTU Suzhou

Wang K, Lin Y, Wang Z, et al. E-mem: Multi-Agent Based Episodic Context Reconstruction for LLM Agent Memory[J]. arXiv preprint arXiv:2601.21714v5, 2026.
https://github.com/dog-last/E-mem


Abstract

The evolution of Large Language Model (LLM) agents towards System 2 reasoning, characterized by deliberative, high-precision problem-solving, requires maintaining rigorous logical integrity over extended horizons. However, prevalent memory preprocessing paradigms suffer from destructive de-contextualization. By compressing complex sequential dependencies into pre-defined structures (e.g., embeddings or graphs), these methods sever the contextual integrity essential for deep reasoning. To address this, we propose E-mem, a framework shifting from memory preprocessing to episodic context reconstruction inspired by biological engrams. E-mem employs a heterogeneous hierarchical architecture where multiple assistant agents maintain uncompressed memory contexts, while a central master agent orchestrates global planning. Unlike passive retrieval, our mechanism empowers assistants to locally reason within activated segments, extracting context-aware evidence before aggregation. Evaluations on the LoCoMo benchmark demonstrate that E-mem achieves over 54% F1, surpassing the state-of-the-art GAM by 7.75%, while reducing token cost by over 70%. Our work is available at https://github.com/dog-last/E-mem.


Figure 1: Comparison between traditional memory systems and E-mem.

1. Introduction

Large Language Models (LLMs) have evolved from stochastic text generators into the central cognitive controllers of autonomous agents. Empowered by advanced planning capabilities and external tool integration, these systems are now transitioning towards System 2 reasoning—characterized by deliberative, sequential problem-solving in dynamic environments. However, supporting this shift demands rigorous adherence to causal chains. In such scenarios, maintaining extensive history becomes pivotal to preserving the logical integrity essential for deep, long-horizon planning.


However, expanding operational horizons presents significant challenges. Merely extending context windows often triggers the "Lost-in-the-Middle" phenomenon, explicitly necessitating robust memory management mechanisms. Existing approaches primarily rely on preprocessing to index memory, mapping raw, unstructured contexts into pre-defined structures (e.g., static embeddings, knowledge graphs, or hierarchical archives). While enabling efficient lookup, this strategy results in destructive de-contextualization: by compressing complex sequential dependencies into rigid representations, it disrupts the critical integrity required for deep reasoning (as shown at the top of Figure 1). Consequently, such methods struggle to reconstruct complex causal chains or comprehend memories within their original sequential contexts, ultimately yielding suboptimal performance on information-dense benchmarks like LoCoMo.


To address these limitations and strictly ensure logical deduction over memory, we introduce E-mem. This framework transitions from memory preprocessing to episodic context reconstruction (as shown at the bottom of Figure 1). (i) Inspired by biological engrams, E-mem preserves the full episodic context of original experiences, enabling the active re-experiencing of past events. (ii) To mitigate the information distortion inherent in ultra-long contexts while ensuring cost-effectiveness for scalable deployment, E-mem adopts a heterogeneous hierarchical master-assistant architecture. In this design, a central master agent orchestrates global planning, while multiple assistant agents (implemented as small language models, SLMs) serve as memory units. Each assistant agent maintains the raw memory context of a specific segment. (iii) For each query, a routing mechanism selectively activates a relevant subset of assistant agents. Crucially, instead of merely retrieving text chunks, these agents execute episodic context reconstruction—a process where they actively re-experience and reason within the restored native contexts to derive precise, local evidence for the master agent. This mechanism not only mitigates context window constraints and reduces computational costs but also ensures high-fidelity, context-preserving inference, offering a distinct advantage in complex, multi-hop reasoning tasks where traditional retrieval mechanisms fall short.


Empirical evaluations on LoCoMo and HotpotQA confirm that E-mem achieves state-of-the-art performance. The system delivers superior accuracy compared to SOTA baselines while significantly reducing token cost, validating the efficiency of the proposed episodic context reconstruction paradigm. In summary, our main contributions are as follows:


Episodic Context Reconstruction. To address the destructive de-contextualization inherent in traditional memory preprocessing, we introduce E-mem, a framework centered on episodic context reconstruction. Unlike static retrieval methods that sever sequential dependencies, our approach delegates active reasoning to assistant agents. They preserve and process the full context of memory segments locally, ensuring that only logically deduced evidence, rather than raw, noisy fragments, is surfaced to the master agent.

Heterogeneous Hierarchical Master-Assistant Architecture. We propose E-mem, a scalable framework that decouples high-level planning from memory retention. By coordinating a master agent with lightweight, SLM-based assistant agents, our design mitigates the information loss inherent in preprocessing, ensuring high-fidelity reasoning across long horizons without suffering from the "Lost-in-the-Middle" phenomenon.

SOTA Performance with Token Efficiency. Extensive evaluations on LoCoMo and HotpotQA demonstrate that E-mem outperforms strong baselines by an average of 7.75% (F1), with notable gains in complex multi-hop (+8.56%) and temporal reasoning (+8.87%). Crucially, these improvements are realized while reducing token cost by over 70%. These results confirm E-mem as a vital complement to traditional memory paradigms for System 2 reasoning.


2. Related Work

Retrieval-Augmented Generation. RAG has established itself as a fundamental paradigm to mitigate LLM hallucination and knowledge obsolescence by grounding generation in external corpora. While standard frameworks typically employ a vector-based "retrieve-then-generate" pipeline, recent Agentic RAG grants systems greater autonomy in retrieval planning and iterative context refinement. Despite these advancements in retrieval logic, the underlying storage mechanism remains predominantly reliant on preprocessing. This approach inherently compresses rich sequential contexts into fixed geometric points, risking the loss of fine-grained sequential dependencies essential for reasoning.


Memory Systems for Autonomous Agents. Beyond simple retrieval, persistent and adaptive memory systems are critical for agents to maintain long-term interaction coherence. Addressing the finite context window, MemGPT employs an operating system-inspired virtual context management technique via hierarchical paging. However, this paradigm relies on swapping fragmented chunks, necessitating redundant re-processing to restore sequential dependencies. Other works focus on optimizing memory structure and active evolution. For instance, G-Memory introduces a graph-based hierarchical structure to enable navigation between global macro-views and local micro-interactions. Similarly, A-Mem adopts a self-evolving framework based on the Zettelkasten method, while GAM and ReasoningBank leverage multi-agent deep research and reasoning trajectory storage, respectively. Although recent efforts extend to personalization (Mem0) and benchmarking (MemoryBench), these approaches remain bound to text-based preprocessing paradigms. By compressing complex contexts into rigid structures, they often disrupt the contextual integrity required for deep reasoning.


In contrast, E-mem introduces episodic context reconstruction, which ensures seamless inference integrity and serves as a critical complement to existing paradigms for high-precision, complex reasoning tasks.


3. Method

Cognitive science defines memory as the re-experience of intact episodic contexts rather than static retrieval. In contrast, prevalent preprocessing paradigms force dynamic inputs into fixed structures, resulting in destructive de-contextualization. This rigid compression disrupts sequential dependencies, severing the contextual integrity essential for deep reasoning.


We propose E-mem, a framework centered on episodic context reconstruction, designed to explicitly preserve uncompressed memory segments and their inherent sequential dependencies (as shown in Figure 2). Implemented via a heterogeneous hierarchical architecture, the system functions through a streamlined three-stage process: first, a routing mechanism performs coarse-grained localization to selectively activate relevant archived memory units; subsequently, multiple assistant agents execute parallel fine-grained reasoning within these raw contexts to derive specific evidence; finally, the central master agent aggregates these distributed insights into a coherent, synergistic response.


3.1 Architecture

We propose E-mem, a heterogeneous hierarchical architecture designed to scale long-context reasoning by decoupling high-level planning from low-level memory retention. Formally, the system is defined as a collaborative tuple:


$$ F = \langle A_{master}, \{A_{asst}^{(i)}\}_{i=1}^{N}, R \rangle . $$

where $A_{master}$ acts as the central planner, $\{A_{asst}^{(i)}\}$ represents a dynamic set of multiple assistant agents, and $R$ denotes the multi-pathway semantic routing mechanism.


Master Agent ( $A_{master}$ ): Global Planner and Synthesizer. The master agent functions as the central orchestrator, decoupled from the burden of raw memory retention. Its primary role is to execute high-level cognitive planning and synthesize distributed evidence into a coherent response. Rather than processing the extensive raw context directly—which would incur prohibitive computational costs—the master agent operates in a sparse global planning space. It delegates the specific memory activation and localization tasks to the routing mechanism and interacts exclusively with the local evidence from activated memory units. Formally, we define the master agent's operation as a mapping function:


$$ R = A_{master}(q, \{e_{i} \mid i \in A^{*}\}) . $$

where $\{e_{i}\}$ denotes the set of local evidence tuples derived by the selected assistants. This design ensures that the reasoning process remains computationally tractable and focused on logical deduction, even as the scale of the underlying memory archive expands indefinitely.


Assistant Agents ( $A_{asst}^{(i)}$ ): Episodic Context Holders. The assistant agents serve as the memory units for parallelized storage and execution. To ensure deployment feasibility and scalability, we instantiate these agents using SLMs. Uniquely, each assistant $A_{asst}^{(i)}$ employs a dual-representation strategy: it preserves the immutable, complete episodic context $E_{i}$ for fine-grained local reasoning, while also maintaining a concise semantic summary $s_{i}$ used for global routing. We formalize this composite context $S_{i}$ as a tuple:


$$ S_{i} = \langle E_{i}, s_{i} \rangle . $$

where $E_{i}$ represents the high-fidelity raw token sequence and $s_{i}$ denotes the lightweight summary. This architecture allows the system to store extensive histories hierarchically while selectively reconstructing only the most relevant contexts. Formally, only the subset of agents identified by the routing mechanism (denoted as $A^{*}$ ) is transitioned to the active inference path. This mechanism ensures high-fidelity reasoning by enabling the selected agents to re-experience the original memory context without noise.
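As a minimal sketch of the dual representation $S_{i} = \langle E_{i}, s_{i} \rangle$, the Python below keeps the raw tokens unchanged next to a short routing summary. The names `MemoryUnit`, `build_unit`, and `head_summary` are illustrative and do not come from the paper, and the toy summarizer (keeping the first few tokens) only stands in for whatever summarization model an implementation would use.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class MemoryUnit:
    """Composite context S_i = <E_i, s_i> held by one assistant agent (illustrative)."""
    episodic_context: tuple[str, ...]  # E_i: raw, uncompressed token sequence
    summary: str                       # s_i: lightweight summary used for routing


def build_unit(tokens: Sequence[str],
               summarize: Callable[[Sequence[str]], str]) -> MemoryUnit:
    # The raw tokens are copied verbatim; only the summary is a lossy view.
    return MemoryUnit(episodic_context=tuple(tokens), summary=summarize(tokens))


def head_summary(tokens: Sequence[str], n: int = 3) -> str:
    # Placeholder summarizer (assumption): keep the first n tokens.
    return " ".join(tokens[:n])


unit = build_unit("Alice moved the keys to the drawer".split(), head_summary)
```

Freezing the dataclass mirrors the statement that $E_{i}$ is immutable: routing reads only `summary`, while local reasoning reads `episodic_context`.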


Multi-Pathway Routing Mechanism ($R$). To efficiently filter the memory archive, the system employs a dedicated routing mechanism that goes beyond simple summarization. Given a query $q$, this mechanism executes a multi-pathway policy $\pi$ to generate an activation distribution:


$$ P_{act} = \pi(q \mid S, R) \in [0, 1]^{N} . $$

where $R$ synthesizes heterogeneous signals, including global narrative alignment derived from pre-computed summaries $S = \{s_{1}, \dots, s_{N}\}$, symbolic entity triggers, and latent semantic vector associations. This design enables scalable, retrieval-based memory access without incurring runtime LLM inference overhead.


3.2 Memory Building and Storing

To transform the unbounded input stream into manageable memory contexts, E-mem implements a block-wise handling strategy. This approach is designed to preserve the full episodic contexts, avoiding the information loss typical of preprocessing.


Sliding Window Segmentation with Overlap. Given an unbounded input stream $X = (x_{1}, x_{2}, \dots)$, we employ a sliding window strategy to partition it directly into a sequence of discrete episodic contexts $E = \{E_{1}, E_{2}, \dots, E_{N}\}$. With a window length $L$ and stride $S < L$, we introduce an overlap $\delta = L - S$ to preserve local sequential dependencies across boundaries. The $i$-th context is formally defined as:


$$ E_{i} = \{ x_{t} \mid (i - 1) S < t \leq (i - 1) S + L \} . $$

This overlap buffer ensures that tokens at the segment edges retain their immediate predecessor context, thereby maintaining semantic coherence during the routing and reconstruction phases.
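The segmentation rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function name `segment` is hypothetical, tokens are treated as list items, and a final window shorter than $L$ is kept when the stream does not divide evenly.

```python
def segment(tokens, window, stride):
    """Split tokens into overlapping contexts of length `window`, advancing by `stride`.

    With 1-indexed positions t, the i-th context covers (i-1)*stride < t <= (i-1)*stride + window,
    which is the 0-indexed slice [(i-1)*stride, (i-1)*stride + window).
    """
    if not 0 < stride < window:
        raise ValueError("stride must satisfy 0 < stride < window")
    contexts = []
    start = 0
    while start < len(tokens):
        contexts.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the stream
        start += stride
    return contexts


# window L = 4, stride S = 3, so consecutive contexts share delta = L - S = 1 token
contexts = segment(list(range(1, 11)), window=4, stride=3)
# contexts == [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]]
```

Each boundary token (4 and 7 in the example) appears in two neighboring contexts, which is the overlap buffer described above.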


Episodic Memory Context Retention and Isolation. Each $E_{i}$ explicitly encapsulates the original uncompressed tokens and is maintained by a dedicated assistant agent $A_{asst}^{(i)}$ as a standalone memory unit. Functioning primarily as Archived Memory, these units remain in a dormant state by default. They are selectively transitioned to an active state for immediate inference only when explicitly triggered by the routing mechanism. This retention of the full context enables the system to resume generation directly from original data, ensuring strict logical integrity during reasoning.


Incremental Updates. E-mem supports efficient $O(1)$ streaming updates. New incoming tokens $\Delta x$ are directly appended to the active agent's current episodic context $E_{active}$ via standard auto-regression, extending the memory buffer: $E' \leftarrow E_{active} \cup \{\Delta x\}$. When the capacity $L$ is reached, the context is solidified as a completed memory unit. A new agent $A_{N + 1}$ is instantiated by carrying over the overlap region to maintain context flow:


$$ E_{N + 1}^{init} = \mathrm{Extract}(E_{N}, \mathrm{overlap} = \delta) . $$

This context transfer acts as a continuity bridge, ensuring that sequential flow remains uninterrupted and logically consistent as the system scales linearly.
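A streaming version of the same idea might look as follows. The class name `StreamingMemory` and its fields are assumptions made for illustration; the sketch appends one token at a time, archives the active context once it holds $L$ tokens, and seeds the next context with the last $\delta = L - S$ tokens.

```python
class StreamingMemory:
    """Toy streaming buffer: archive full contexts, carry the overlap forward."""

    def __init__(self, window, stride):
        if not 0 < stride < window:
            raise ValueError("stride must satisfy 0 < stride < window")
        self.window = window
        self.overlap = window - stride  # delta = L - S, always >= 1 here
        self.archived = []              # completed (solidified) memory units
        self.active = []                # E_active, the context still being filled

    def append(self, token):
        self.active.append(token)
        if len(self.active) == self.window:
            # Capacity L reached: solidify the unit, then start the next
            # context from the trailing overlap region.
            self.archived.append(list(self.active))
            self.active = self.active[-self.overlap:]


memory = StreamingMemory(window=4, stride=3)
for token in range(1, 11):
    memory.append(token)
# memory.archived == [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]]
# memory.active == [10]
```

Archived units are never modified after solidification, so an update only touches the active context.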


3.3 Memory Activation via Multi-Pathway Routing

E-mem formulates activation as a Hierarchical Associative Routing process, which performs coarse-grained localization to selectively transition archived memory units to active inference. To accommodate the multifaceted nature of recall—ranging from broad narrative intents to precise entity details—we introduce a Multi-Pathway Activation framework. This mechanism routes the input query $q$ through three orthogonal signaling pathways operating in parallel:


Global Alignment ($P_{global}$): Activates memories via macroscopic narrative anchoring. By efficiently computing the dense vector similarity and sparse lexical alignment between the query and the concise semantic summaries ($s_{i}$), this pathway leverages the "frozen reasoning" of the summarization phase. It functions as a high-pass semantic filter, capturing the user's broader intent to exclude irrelevant noise and identify logically relevant segments.

Semantic Association ($P_{vec}$): Activates memories based on implicit latent alignment between the query and the full episodic context ($E_{i}$). Unlike the summary-based global path, this pathway utilizes high-dimensional vector similarity against the raw chunk embeddings. It serves as a robust failsafe to identify chunks that resonate with the query's abstract intent, specifically compensating for cases where critical semantic nuances may have been lost or distorted during the summarization process.

Symbolic Trigger ($P_{kw}$): Activates memories via explicit entity matching between the query and the original raw text content. Analogous to how a specific name triggers a flashback, this pathway employs sparse retrieval (e.g., BM25) to detect exact lexical overlaps. This ensures the high-precision recall of unique factual anchors (such as specific IDs or names) that might be omitted in the high-level summaries, guaranteeing that fine-grained details are not overlooked.


Episodic Context Reconstruction via Activation Union. To ensure comprehensive routing, we employ a Multi-Source Activation Union strategy. A memory unit $A_{asst}^{(i)}$ undergoes an episodic memory transition from archived to active (inference-ready) if it receives a valid activation signal from any pathway. Formally, the set of activated agents $A^{*}$ is defined as:


$$
\begin{aligned}
A^{*} = \{ A_{asst}^{(i)} \mid {} & A_{asst}^{(i)} \in P_{global} \lor A_{asst}^{(i)} \in P_{vec} \\
& \lor A_{asst}^{(i)} \in P_{kw} \} .
\end{aligned}
$$

Notably, the total number of activated memory chunks, denoted as $k$, can be flexibly adjusted according to the requirements of the task. This mechanism ensures that the system aggregates all potentially relevant contexts and re-integrates them for the subsequent reasoning phase, strictly preserving the integrity of the information required for complex inference.
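The activation union can be illustrated with plain Python sets. The sketch assumes each pathway has already produced a relevance score per memory unit. The helper names `top_k` and `activate`, and the choice to take the top $k$ units per pathway before the union, are assumptions for illustration, since the text above only states that the total number of activated chunks is adjustable.

```python
def top_k(scores, k):
    """Return the ids of the k highest-scoring units, ignoring non-positive scores."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return {unit_id for unit_id, score in ranked[:k] if score > 0}


def activate(global_scores, vec_scores, kw_scores, k_per_path=1):
    """Union of the units selected by any of the three pathways (A*)."""
    return (top_k(global_scores, k_per_path)
            | top_k(vec_scores, k_per_path)
            | top_k(kw_scores, k_per_path))


# Hypothetical scores for five memory units (ids 0..4)
global_scores = {0: 0.9, 1: 0.2, 2: 0.1}  # query vs. summaries s_i
vec_scores = {0: 0.3, 3: 0.8, 1: 0.1}     # query vs. raw chunk embeddings
kw_scores = {4: 1.0, 0: 0.0}              # lexical overlap, e.g. from a BM25-style scorer

active_units = activate(global_scores, vec_scores, kw_scores)
# active_units == {0, 3, 4}
```

Unit 4 is reached only through the keyword pathway and unit 3 only through the vector pathway, which shows why a union recovers units that a single pathway would miss.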


3.4 Synergistic Reasoning and Response Generation

Upon identifying the candidate set $A^{*}$ , the system transitions to the phase of parallel episodic context reconstruction. While the router performs coarse-grained localization, the master agent delegates the specific inference burden to the assistant agents for fine-grained local reasoning. In this process, each activated assistant agent $A_{asst}^{(i)}$ does not merely retrieve data but actively generates a response by attending to its preserved episodic context $E_{i}$ . Crucially, to ensure causal consistency, we enforce Temporal Anchoring in the reasoning output. The assistant scrutinizes the original sequential dependencies to derive a temporally grounded evidence tuple $e_{i}$ , modeled as:


$$ e_{i} = \langle c_{i}, \tau_{i} \rangle = \Phi_{asst}(q \mid E_{i}) . $$

where $c_{i}$ represents the deduced semantic evidence, and $τ_{i}$ denotes the absolute timestamp corresponding to the identified event or state transition. This mechanism ensures logical integrity by re-experiencing the raw memory, enabling the extraction of subtle evidence accompanied by its precise temporal occurrence. By explicitly associating facts with their time of occurrence, the system empowers the master agent to resolve conflicting states (e.g., object displacement) during the subsequent aggregation phase. We categorize this collaborative reasoning process into two modes:


Direct Inference. Tailored for queries necessitating explicit fact retrieval, this mode focuses on synthesizing the evidence set $E = \{e_{i}\}_{i \in A^{*}}$ generated by the assistants. The master agent acts as a central reasoner, aggregating these local evidence tuples to derive a final response $R$. This global synthesis is formally modeled as:


$$ R = \Psi_{master}(q, E) . $$

where $Ψ_{master}$ denotes the high-level reasoning function. Unlike simple concatenation, the master agent actively resolves semantic conflicts by enforcing chronological logic upon the given evidence. Specifically, it compares the absolute timestamps $τ$ associated with each evidence unit to reconcile state discrepancies (e.g., location changes), prioritizing the most recent information to construct a cohesive and accurate logical chain. This hierarchical decoupling ensures rigorous logical integrity while maximizing scalability through parallelized context processing.
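To make the timestamp-based reconciliation concrete, the sketch below keeps, for each conflicting state, the evidence with the latest timestamp. In E-mem this step is performed by the master agent, which is a language model, so the deterministic rule here is only a stand-in. The `Evidence` class, its `subject` field, and the `resolve` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Evidence:
    subject: str         # assumption: key naming the state that the claim describes
    claim: str           # c_i, the deduced semantic evidence
    timestamp: datetime  # tau_i, absolute time of the event or state transition


def resolve(evidence):
    """Keep the most recent evidence per subject, mimicking 'latest state wins'."""
    latest = {}
    for item in evidence:
        current = latest.get(item.subject)
        if current is None or item.timestamp > current.timestamp:
            latest[item.subject] = item
    return latest


evidence = [
    Evidence("keys", "the keys are on the table", datetime(2024, 3, 1, 9, 0)),
    Evidence("keys", "the keys are in the drawer", datetime(2024, 3, 2, 18, 30)),
    Evidence("meeting", "the meeting is on Friday", datetime(2024, 3, 1, 12, 0)),
]
state = resolve(evidence)
# state["keys"].claim == "the keys are in the drawer"
```

Without $\tau_{i}$, the two claims about the keys would be indistinguishable in priority, which is the conflict that temporal anchoring is meant to resolve.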


Iterative Reasoning. Beyond single-pass retrieval, the E-mem architecture is naturally extensible to an iterative Refine-and-Query framework for complex tasks requiring sequential deduction. In this operational protocol, the master agent maintains a dynamic reasoning trace $S^{(t)}$ . At each step, should the current information be insufficient, the master agent formulates a targeted sub-query:


$$ q^{(t)} = \pi_{plan}(q_{init}, S^{(t - 1)}) $$

to probe relevant assistant agents. These assistants execute local inference over their encapsulated episodic contexts $E_{i}$, yielding context-aware evidence $e_{i}^{(t)} = \Phi_{asst}(q^{(t)} \mid E_{i})$. The master agent then aggregates these findings to update the global trace $S^{(t)} \leftarrow \mathrm{Agg}(S^{(t - 1)}, \{e_{i}^{(t)}\})$. This cycle facilitates principled multi-step reasoning, terminating upon either trace convergence or a predefined iteration limit.
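The control flow of this Refine-and-Query cycle can be sketched as a short loop. All callables below (`plan`, `aggregate`, and the assistants) are toy stand-ins for model calls and are not part of any published E-mem API. The loop stops when the planner returns no further sub-query, when the trace stops changing, or when the iteration limit is reached.

```python
def iterative_reasoning(query, assistants, plan, aggregate, max_iters=5):
    trace = []  # S^(t): accumulated evidence so far
    for _ in range(max_iters):
        sub_query = plan(query, trace)            # q^(t) = pi_plan(q_init, S^(t-1))
        if sub_query is None:
            break                                 # planner judges the trace sufficient
        answers = [ask(sub_query) for ask in assistants]
        evidence = [a for a in answers if a is not None]
        new_trace = aggregate(trace, evidence)    # S^(t) <- Agg(S^(t-1), {e_i^(t)})
        if new_trace == trace:
            break                                 # convergence: nothing new was learned
        trace = new_trace
    return trace


def make_assistant(facts):
    # Toy assistant: answers only the sub-queries its own context covers.
    return lambda sub_query: facts.get(sub_query)


assistants = [
    make_assistant({"Who wrote the report?": "Alice wrote the report"}),
    make_assistant({"Where does Alice work?": "Alice works in Suzhou"}),
]


def plan(query, trace):
    # Toy planner: a fixed two-hop decomposition of the initial query.
    steps = ["Who wrote the report?", "Where does Alice work?"]
    return steps[len(trace)] if len(trace) < len(steps) else None


def aggregate(trace, evidence):
    return trace + [e for e in evidence if e not in trace]


trace = iterative_reasoning("Where does the report's author work?", assistants, plan, aggregate)
# trace == ["Alice wrote the report", "Alice works in Suzhou"]
```

The second hop can only be asked once the first hop has been answered, which is the sequential deduction that a single retrieval pass cannot express.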


Table 1: Results on the LoCoMo benchmark, compared with other strong baselines.

Method	Overall F1	Overall BLEU-1	Single-Hop F1	Single-Hop BLEU-1	Multi-Hop F1	Multi-Hop BLEU-1	Temporal F1	Temporal BLEU-1	Open Domain F1	Open Domain BLEU-1
GPT-4o-mini
Long-Context	37.31	29.57	46.68	37.54	29.23	22.76	25.97	19.42	16.87	13.70
RAG	44.73	39.40	52.45	47.94	27.50	20.13	46.07	40.35	23.23	17.94
A-mem	39.65	32.31	44.65	37.06	27.02	20.09	45.85	36.67	12.14	12.01
Mem0	45.10	35.08	47.65	37.82	38.72	27.13	48.93	40.15	28.64	21.58
MEMORYOS	42.84	35.54	48.62	42.99	35.27	25.22	41.15	30.76	20.02	16.52
LIGHTMEM	38.44	34.37	41.79	37.83	29.78	24.90	43.71	39.72	16.89	13.92
GAM	45.31	37.78	47.74	40.90	34.84	27.72	53.91	43.93	26.03	19.48
E-mem	54.17	44.34	59.23	50.58	42.64	34.38	59.82	44.57	24.89	18.15
Qwen2.5-14B
Long-Context	38.31	31.89	46.05	39.56	32.08	24.46	30.51	24.45	14.89	11.41
RAG	38.27	33.07	47.87	42.89	26.38	19.54	30.78	25.97	14.16	10.52
A-mem	28.98	24.47	33.75	30.04	22.09	15.28	27.19	22.05	13.49	10.74
Mem0	36.04	29.91	42.58	35.15	31.73	24.82	28.96	26.24	15.03	11.28
MEMORYOS	40.28	34.72	46.33	41.32	38.19	29.26	32.24	27.86	20.27	15.94
LIGHTMEM	31.39	27.15	34.92	31.22	25.45	19.61	32.03	27.70	15.81	11.81
GAM	50.41	43.48	56.35	51.07	38.94	28.55	53.76	50.01	20.84	15.09
E-mem	57.04	46.75	61.14	52.50	49.15	34.87	63.59	50.61	22.38	18.41

4. Experiments

We evaluate E-mem against state-of-the-art memory mechanisms for LLM agents, focusing on performance, robustness, and cost. Our analysis addresses three core questions:

RQ1: Comparative Efficacy. How does E-mem compare against state-of-the-art RAG and agentic memory systems on complex long-context reasoning?

RQ2: Component & Scaling Analysis. How do specific architectural components and backbone model sizes impact the system's overall effectiveness?

RQ3: Cost-Efficiency. Does E-mem achieve a superior trade-off between token consumption and reasoning performance compared to baselines?


4.1 Experiment Settings

Datasets. We evaluate E-mem against competitive baselines on two benchmarks:

LoCoMo: A benchmark assessing long-term memory in multi-session dialogues. It evaluates coherence across long horizons via five sub-tasks: single-hop retrieval, multi-hop reasoning, temporal understanding, open-ended generation, and adversarial tasks.

HotpotQA: A Wikipedia-based multi-hop QA dataset. To test scalability with ultra-long contexts (over 200K tokens), we adapt it into a streaming setting across three scales (400, 800, and 1600 documents), stress-testing evidence recall from extensive archives.


Table 2: F1 scores on the HotpotQA benchmark.

Method	GPT-4o-mini, 400 docs (F1)	GPT-4o-mini, 800 docs (F1)	GPT-4o-mini, 1600 docs (F1)	Qwen2.5-14B, 400 docs (F1)	Qwen2.5-14B, 800 docs (F1)	Qwen2.5-14B, 1600 docs (F1)
Long-Context	56.56	49.71	53.92	49.75	46.82	43.17
RAG	52.71	51.84	54.01	51.81	46.72	48.36
A-mem	33.90	30.22	31.37	27.04	25.65	22.92
Mem0	32.85	31.74	27.41	30.12	32.44	26.55
MEMORYOS	26.47	23.10	24.16	24.58	30.25	23.13
LIGHTMEM	40.93	35.28	30.02	37.30	27.72	28.25
GAM	54.75	52.86	53.71	48.40	41.10	44.32
E-mem	61.46	55.46	55.76	61.13	47.91	54.87

Baselines. We benchmark E-mem against two categories:

Memory-Free Baselines. Direct context processing without explicit maintenance. (1) Long-Context Windowing: Uses a sliding window to segment history into chunks processed independently, selecting the highest-confidence output as the answer. (2) Standard RAG: Retrieves the top-$k$ ($k = 20$) relevant segments via dense vector similarity to augment generation.

Memory-Based Systems. We compare against cutting-edge memory-augmented agent frameworks established in 2024–2025, including A-Mem, Mem0, MemoryOS, LightMem, and GAM. These methods construct specialized external memory architectures (e.g., hierarchy or graph) to actively curate historical information, aiming to enhance precision and efficiency in downstream tasks.

Implementation details. Experiments were conducted on four NVIDIA RTX 4090 GPUs. We instantiate E-mem with GPT-4o-mini and Qwen2.5-14B as master agents, supported by a set of Qwen3-4B assistant agents. To ensure a fair comparison, E-mem and all baselines utilize the same master LLM backbone and are restricted to the same number of retrieval rounds. Evaluation is performed using the F1 score and BLEU-1 metrics.
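For readers unfamiliar with the two metrics, the following is a minimal sketch of token-level F1 and BLEU-1 for a single prediction and reference. The lowercase whitespace tokenizer is an assumption; the paper's evaluation script may normalize text differently (for example by stripping punctuation or stemming), so absolute values need not match.

```python
import math
from collections import Counter


def tokenize(text):
    """Assumed normalization: lowercase and split on whitespace."""
    return text.lower().split()


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (QA-style F1)."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def bleu1(prediction, reference):
    """Clipped unigram precision multiplied by the BLEU brevity penalty."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred:
        return 0.0
    clipped = sum((Counter(pred) & Counter(ref)).values())
    precision = clipped / len(pred)
    brevity = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return brevity * precision


print(round(token_f1("the cat sat", "the cat"), 2))  # 0.8
print(round(bleu1("the cat", "the cat sat"), 4))
```

F1 rewards both covering the reference and avoiding extra tokens, while BLEU-1 is precision-oriented and only penalizes short answers through the brevity term.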

Table 3: Hallucination analysis of E-mem on the LoCoMo adversarial subset.

Assistant agent	F1	BLEU-1
Qwen3-0.6B	85.11	80.14
Qwen3-1.7B	87.31	83.17
Qwen3-4B	89.94	85.51
Qwen3-8B	95.74	88.09
Qwen3-14B	95.03	88.06

Master agent	F1	BLEU-1
GPT4o-mini	89.94	85.51
GPT4o	89.37	86.52
Gemini2.5-flash	93.62	90.3
DeepseekV3	75.87	73.4
Grok4-fast	77.80	75.53

Table 4: Model ablation study comparing the performance of different master and assistant models on LoCoMo conversation 1.

Model	Overall		Single Hop		Multi-Hop		Temporal		Open Domain
	F1	BLEU-1	F1	BLEU-1	F1	BLEU-1	F1	BLEU-1	F1	BLEU-1
Assistant Agent Model
Qwen3-0.6B	28.89	22.26	26.40	21.19	17.32	18.35	46.44	33.49	20.77	12.75
Qwen3-1.7B	44.78	35.10	40.82	34.30	35.64	29.53	65.03	48.63	31.02	14.66
Qwen3-4B	50.70	40.93	49.46	41.21	42.66	35.38	66.91	51.45	31.08	23.15
Qwen3-8B	49.80	40.94	45.98	37.05	52.11	47.45	62.28	49.29	29.11	22.11
Qwen3-14B	50.02	40.27	49.96	41.43	46.38	37.25	64.53	51.66	30.32	24.41
Master Agent Model
GPT-4o-mini	50.70	40.93	49.46	41.21	42.66	35.38	66.91	51.45	31.08	23.15
GPT-4o	51.73	43.70	51.04	43.04	45.88	42.37	64.77	52.83	32.74	24.53
Gemini2.5-flash	49.05	41.15	49.57	41.26	38.20	34.17	65.87	54.57	25.08	19.53
Grok4-fast	46.34	36.13	46.75	36.87	35.65	26.72	65.34	52.29	30.28	20.51
Deepseekv3	50.88	42.25	51.80	43.27	43.16	37.13	62.64	51.07	31.42	24.30

4.2 Performance Analysis

Results on LoCoMo Benchmark. Table 1 details the LoCoMo evaluation, where E-mem establishes a new state of the art across both backbones. Specifically, it surpasses the strongest baseline (GAM) by substantial margins in overall F1: +8.86 points on GPT-4o-mini and +6.63 points on Qwen2.5-14B. The advantage matters most on complex reasoning subsets such as Multi-hop and Temporal, where standard RAG often falters due to severe context fragmentation. On the Qwen backbone, E-mem raises Multi-hop F1 by over 10 points (49.15% vs. 38.94%). This supports our claim that episodic context reconstruction preserves the autoregressive dependencies typically lost to preprocessing de-contextualization, which in turn supports deep reasoning and robustness.

Figure 3: Memory routing ablation study. Left: impact of the number of routed chunks on performance. Right: impact of routing pathways on system performance.

Results on HotpotQA Benchmark. As shown in Table 2, E-mem remains stable in ultra-long context scenarios, achieving the highest F1 score in every setting. A notable observation is that RAG is competitive with, and often stronger than, the more complex memory-based baselines, a clear divergence from the trends observed in the LoCoMo results. We attribute this discrepancy to the different information density of the two benchmarks: LoCoMo features high-similarity, dense dialogues laden with adversarial noise that significantly impairs vector retrieval, whereas HotpotQA comprises distinct, low-interference evidence passages that are structurally favorable for semantic matching. Nevertheless, E-mem still outperforms RAG in all settings, by up to $+6.51$ F1 points in the 1600-doc setting on Qwen2.5-14B. This suggests that while vector retrieval suffices for locating distinct factoids, the proposed episodic context reconstruction is needed to preserve the sequential dependencies required for complex multi-hop reasoning over extensive horizons.
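The margins discussed here can be recomputed directly from Table 2. The snippet below is only a small arithmetic check; the numbers are copied from the table, while the column labels and the `margin` helper are illustrative names.

```python
# F1 values copied from Table 2, in column order.
COLUMNS = [
    "GPT-4o-mini/400", "GPT-4o-mini/800", "GPT-4o-mini/1600",
    "Qwen2.5-14B/400", "Qwen2.5-14B/800", "Qwen2.5-14B/1600",
]
F1 = {
    "Long-Context": [56.56, 49.71, 53.92, 49.75, 46.82, 43.17],
    "RAG": [52.71, 51.84, 54.01, 51.81, 46.72, 48.36],
    "GAM": [54.75, 52.86, 53.71, 48.40, 41.10, 44.32],
    "E-mem": [61.46, 55.46, 55.76, 61.13, 47.91, 54.87],
}


def margin(system, baseline):
    """Absolute F1 difference (system minus baseline) for every column."""
    return {
        col: round(a - b, 2)
        for col, a, b in zip(COLUMNS, F1[system], F1[baseline])
    }


for col, delta in margin("E-mem", "RAG").items():
    print(f"{col}: {delta:+.2f}")
```

The gap over RAG is largest in the 400-document settings and in the Qwen2.5-14B 1600-document setting, and smallest (just over one point) in the Qwen2.5-14B 800-document setting.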

4.3 Robustness Against Hallucination

To assess resilience against context noise, we conducted experiments on the LoCoMo adversarial subset by cross-evaluating varying Assistant scales and master backbones. Results (Table 3) demonstrate E-mem's robustness, with a peak F1 of 95.74% (Qwen3-8B assistant). Our analysis indicates that optimal performance relies on collaboration between both agents. Scaling the Assistant from 0.6B to 8B consistently enhances local episodic context reconstruction (F1 rises from 85.11% to 95.74%, with 14B roughly level at 95.03%), but the master agent's reasoning capability proves equally decisive. We observe significant performance variance across master backbones (F1 from 75.87% with DeepseekV3 to 93.62% with Gemini2.5-flash), indicating that strong global orchestration is indispensable for final reasoning and for filtering hallucinations.

Table 5: Impact of memory chunk size granularity on E-mem performance.

Chunk Size	Overall		Single Hop		Multi-Hop		Temporal		Open Domain
	F1	BLEU-1	F1	BLEU-1	F1	BLEU-1	F1	BLEU-1	F1	BLEU-1
4K	45.89	37.35	47.05	38.76	34.67	28.62	59.37	47.56	28.93	22.19
8K	50.70	40.93	49.46	41.21	42.66	35.38	66.91	51.45	31.08	23.15
12K	47.97	37.85	47.63	38.24	38.04	29.67	65.44	49.89	24.51	13.43
16K	47.08	37.19	44.81	36.25	37.18	32.83	66.24	50.79	29.11	14.30
32K	43.00	33.92	44.50	36.70	33.79	28.52	55.30	42.47	22.64	7.92

4.4 Ablation Study

Backbone Sensitivity Analysis. We evaluate model capacity by cross-evaluating master backbones and assistant scales (0.6B--14B) on LoCoMo conversation 1. As shown in Table 4, overall performance plateaus around the 4B mark (F1 $\approx$ 50.7%), indicating that small models suffice for basic episodic context reconstruction. However, detailed analysis reveals divergence across tasks. Scaling assistants beyond 4B boosts Multi-hop reasoning, most clearly at 8B (52.11 vs. 42.66 F1, +9.45 points) and more modestly at 14B (46.38, +3.72 points). We attribute this to larger models' ability to discern implicit dependencies within memory contexts, bridging disparate evidence. Conversely, larger models show no consistent gain on Single-hop tasks: the 8B assistant drops to 45.98 F1 from 49.46 at 4B, while 14B is roughly level at 49.96. We hypothesize that the degradation stems from over-reasoning, where models hallucinate complex correlations that distract from simple pattern matching, whereas smaller models rely on surface features, which keeps factual retrieval robust. Finally, master agent variations cause comparatively small fluctuations (overall F1 ranges from 46.34 to 51.73), suggesting that E-mem's efficacy depends primarily on the Assistant's local reasoning.

Impact of Memory Chunk Granularity. We further investigate the sensitivity of E-mem to memory chunk size ($S_{chunk}$) under a constrained total memory budget ($\approx$ 32K tokens) on LoCoMo conversation 1. As shown in Table 5, performance follows an inverted U-shaped trajectory, peaking at $S_{chunk} = 8\text{K}$ (F1 = 50.70%). When chunks are overly granular (e.g., 4K), the system retrieves an excessive number of fragmented contexts to fill the budget. This introduces significant semantic noise and irrelevant distractors, which overwhelm the reasoning process and lead to retrieval failure. Conversely, overly large chunks (e.g., 32K) force the agent to process extended contiguous sequences, causing attention dilution or the "lost-in-the-middle" phenomenon, where critical details are overshadowed by the massive context. The 8K configuration strikes the best balance among the tested sizes, preserving local context coherence while allowing for diverse evidence aggregation.
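The trade-off can be made concrete with a small sketch of how chunk size interacts with a fixed budget. It assumes that the number of activated chunks is simply the budget divided by the chunk size (rounded down) and that 1K means 1,000 tokens; both are illustrative assumptions, and the helper names are hypothetical.

```python
def split_into_chunks(tokens, chunk_size):
    """Split a token sequence into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def chunks_within_budget(chunk_size, budget=32_000):
    """Whole chunks that fit in the budget; at least one chunk is always activated."""
    return max(budget // chunk_size, 1)


for size in (4_000, 8_000, 12_000, 16_000, 32_000):
    print(f"{size // 1000}K chunks -> {chunks_within_budget(size)} activated")
```

Under these assumptions, 4K chunks fill the budget with eight separate fragments, while a 32K chunk leaves room for a single contiguous span. The 8K setting sits between the two with four chunks.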

Router Ablation Study. We evaluate the scalability of the multi-pathway routing mechanism on the HotpotQA-1600 dataset. Empirical results (Figure 3) indicate that activating a minimal subset of memory chunks (e.g., $k = 8$ ) is sufficient to surpass strong baselines. Crucially, we observe information saturation: increasing $k$ from 8 to 20 yields marginal gains despite the significantly larger retrieval volume. This phenomenon suggests that our router successfully filters noise, concentrating the vast majority of requisite semantic signals within the top-ranked candidates. It also serves as strong evidence for the system's scalability, since the routing mechanism can effectively isolate the most critical episodic dependencies within a compact window. By achieving near-optimal reasoning fidelity at low activation densities, E-mem avoids the computational overhead of processing extensive but redundant contexts, making it highly adaptable to large-scale deployment scenarios.

We further conduct an ablation study to evaluate the contribution of individual routing pathways. The results (Figure 3) show that the full model achieves a peak F1 of 55.76. Removing Global Alignment ($P_{global}$) precipitates the most significant drop, to 45.30 ($\Delta = -10.46$), underscoring macroscopic narrative anchoring as the primary driver preventing context fragmentation. Semantic Association ($P_{vec}$) proves to be the second most critical component (47.90 without it), demonstrating that high-dimensional vector alignment is essential for capturing implicit latent intents. Finally, excluding Symbolic Triggers ($P_{kw}$) yields a moderate decline to 52.42. While less dominant, this pathway remains necessary for ensuring precise lexical grounding of specific entities. Overall, the consistent performance gaps confirm that these orthogonal signals are complementary, validating the efficacy of our Multi-Source Activation Union strategy.
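A minimal sketch of a union over the three pathways follows, together with the per-pathway drops implied by the reported scores. The round-robin merge order, the optional cap `k`, and the function name `activation_union` are assumptions made for illustration; the paper's actual merging rule is defined in its method section, not here.

```python
def activation_union(pathways, k=None):
    """Merge ranked chunk ids from several pathways into one de-duplicated list.

    Candidates are taken round-robin by rank so that no single pathway
    dominates when the result is capped at k (an assumed policy).
    """
    seen, merged = set(), []
    depth = max((len(ranked) for ranked in pathways.values()), default=0)
    for rank in range(depth):
        for ranked in pathways.values():
            if rank < len(ranked) and ranked[rank] not in seen:
                seen.add(ranked[rank])
                merged.append(ranked[rank])
    return merged if k is None else merged[:k]


# Hypothetical chunk ids proposed by each pathway for one query.
candidates = {"P_global": [3, 1], "P_vec": [1, 7, 2], "P_kw": [7]}
print(activation_union(candidates))       # [3, 1, 7, 2]
print(activation_union(candidates, k=3))  # [3, 1, 7]

# F1 on HotpotQA-1600 as reported in the text: full model vs. one pathway removed.
FULL_F1 = 55.76
WITHOUT = {"P_global": 45.30, "P_vec": 47.90, "P_kw": 52.42}
drops = {name: round(f1 - FULL_F1, 2) for name, f1 in WITHOUT.items()}
print(drops)
```

The computed drops are -10.46, -7.86, and -3.34 points, which gives the ordering $P_{global}$, then $P_{vec}$, then $P_{kw}$ described in the paragraph.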

Multi-Agent Architecture. To investigate whether a simpler pipeline, in which retrieved text chunks are fed directly into a single LLM, is sufficient, we conducted an ablation study on the LoCoMo benchmark by removing the Assistant agents. In this setting, the Master LLM directly processes the concatenated raw episodic contexts.

Table 6: Ablation study on the LoCoMo benchmark comparing the proposed multi-agent system (MAS) with a simpler "direct-read" pipeline.

Architecture	Overall F1	Single-Hop	Multi-Hop	Temporal	Open Domain
Simpler Pipeline (Direct Read)	38.27	48.83	30.62	31.87	18.55
E-mem (MAS)	54.17	59.23	42.64	59.82	24.89

As shown in Table 6, the drastic performance collapse in the simpler pipeline confirms that our multi-agent design is fundamentally essential rather than an optional add-on. We attribute this to the following factors:

Cognitive Decoupling: Forcing a single LLM to process concatenated raw text chunks requires it to simultaneously execute low-level detail extraction and high-level logical synthesis, leading to severe "attention dilution." Our MAS strictly decouples these tasks: Assistant agents specialize in local, noise-resistant extraction, while the Master agent focuses purely on global aggregation and planning.

Overcoming "Lost-in-the-Middle": Concatenating multiple raw chunks often exceeds the effective reasoning window of standard models. Assistant agents process their assigned episodic contexts in complete isolation, performing parallel local distillation to shield the Master agent from raw distractor noise.

Cost Efficiency: By delegating the heavy, complex reasoning of raw texts to multiple cost-effective Small Language Models (SLMs) and reserving the expensive Master LLM strictly for final aggregation, E-mem avoids context window explosion. This heterogeneous compute strategy significantly reduces token consumption and overall deployment costs.
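The difference between the two pipelines compared in Table 6 can be sketched structurally as follows. This is not the authors' implementation: `assistant` and `master` are placeholder callables standing in for the small and large models, and the toy versions below use keyword matching only so that the example runs without any model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional

Assistant = Callable[[str, str], Optional[str]]  # (question, context) -> evidence
Master = Callable[[str, List[str]], str]         # (question, evidence) -> answer


def answer_with_assistants(question, contexts, assistant: Assistant,
                           master: Master, max_workers=4):
    """MAS flow: each context is distilled in isolation, then the master aggregates."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda ctx: assistant(question, ctx), contexts))
    evidence = [r for r in results if r]  # drop contexts with nothing relevant
    return master(question, evidence)


def answer_direct_read(question, contexts, master: Master):
    """Direct-read flow: the master sees all raw contexts concatenated."""
    return master(question, ["\n".join(contexts)])


def toy_assistant(question, context):
    """Stand-in for a small model: keep sentences sharing a keyword with the question."""
    keywords = {w.strip("?.,").lower() for w in question.split() if len(w) > 3}
    hits = [s.strip() for s in context.split(".")
            if keywords & {w.lower() for w in s.split()}]
    return ". ".join(hits) or None


def toy_master(question, evidence):
    """Stand-in for the large model: join whatever evidence it receives."""
    return " | ".join(evidence) if evidence else "unknown"


contexts = ["Alice moved to Paris in 2021. She likes tea.", "Bob bought a bike."]
print(answer_with_assistants("Where did Alice move?", contexts, toy_assistant, toy_master))
print(answer_direct_read("Where did Alice move?", contexts, toy_master))
```

With the toy stand-ins, the assistant path hands the master a single relevant sentence, whereas the direct-read path passes every raw sentence through, which mirrors the distractor-noise argument above on a very small scale.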

4.5 Cost Effectiveness Analysis

To evaluate deployment costs, we report the average number of tokens required per query on LoCoMo. We categorize the computational overhead into Large Model Tokens ($T_{L}$, e.g., GPT-4o) and Small Model Tokens ($T_{S}$, e.g., Qwen3-4B). To quantify the overall cost, we introduce a Normalized Cost metric. Given the significant discrepancy in pricing and computational complexity between the two models, we adopt a highly conservative cost ratio of 1:10 ($c_{S} : c_{L}$), assuming that one large model token costs as much as ten small model tokens. The normalized cost of a query is therefore $c_{S} T_{S} + c_{L} T_{L}$ with $c_{S} = 1$ and $c_{L} = 10$, expressed in small-model-token units.

As shown in Table 7, full-context methods like Long-Context incur a prohibitive normalized cost (about 169k units), driven entirely by expensive large model processing. While retrieval baselines like RAG reduce this, they still depend on the large model for reasoning. In contrast, E-mem strategically offloads episodic context processing to small models. Even under our conservative 1:10 assumption, E-mem reduces the normalized cost to approximately 3.6k units, a roughly 47 $\times$ reduction compared to Long-Context (169,100 vs. 3,621 units), while achieving superior F1 performance. This demonstrates that E-mem is not only more accurate but also economically viable for large-scale applications.

Table 7: Cost-performance analysis on the LoCoMo dataset. We report the average token consumption per query alongside the accuracy metrics (F1 and BLEU-1).

Method	F1	BLEU-1	T_S	T_L	Total Cost
Long-Context	37.31	29.57	none	16910	169100
RAG	44.73	39.40	none	643	6430
A-mem	39.65	32.31	none	2520	25200
Mem0	45.1	34.92	none	973	9730
MEMORYOS	42.84	35.54	none	3874	38740
LIGHTMEM	38.44	34.37	none	612	6120
GAM	45.31	37.78	none	1254	12540
E-mem	54.17	44.34	2271	135	3621
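The Total Cost column can be reproduced from the token counts with the 1:10 cost ratio. The sketch below treats the "none" entries in the $T_S$ column as zero small-model tokens, which is an assumption about how the table was filled in; the token counts themselves are copied from Table 7.

```python
COST_SMALL = 1   # c_S: cost of one small-model token (unit of account)
COST_LARGE = 10  # c_L: cost of one large-model token under the 1:10 ratio


def normalized_cost(t_small, t_large):
    """Normalized cost c_S * T_S + c_L * T_L in small-model-token units."""
    return COST_SMALL * t_small + COST_LARGE * t_large


# (T_S, T_L) per query from Table 7; "none" is read as 0 (assumption).
TOKENS = {
    "Long-Context": (0, 16910),
    "RAG": (0, 643),
    "GAM": (0, 1254),
    "E-mem": (2271, 135),
}

costs = {name: normalized_cost(ts, tl) for name, (ts, tl) in TOKENS.items()}
print(costs)

ratio_vs_long_context = costs["Long-Context"] / costs["E-mem"]
saving_vs_gam = 1 - costs["E-mem"] / costs["GAM"]
print(f"Long-Context / E-mem: {ratio_vs_long_context:.1f}x")
print(f"Saving vs. GAM: {saving_vs_gam:.1%}")
```

With these inputs E-mem's cost is 2271 + 10 × 135 = 3621 units. The ratio to Long-Context comes out at about 46.7, and the saving relative to GAM at about 71%.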

5. Conclusion

In this paper, we presented E-mem, a framework that shifts from destructive de-contextualization to episodic context reconstruction, which effectively preserves the contextual integrity essential for deep reasoning. Technically, we encapsulate episodic contexts within a heterogeneous hierarchical master-assistant architecture, which enables the precise reactivation of archived memory units within the active reasoning context. Extensive evaluations on both the LoCoMo and HotpotQA benchmarks show that our approach establishes a comprehensive performance lead, significantly outperforming existing preprocessing paradigms (particularly on complex multi-hop tasks) while maintaining low token cost via heterogeneous collaboration. We believe that E-mem serves as a vital complement to existing memory paradigms, offering a robust solution for high-precision, complex System 2 reasoning.

Impact Statement

This paper introduces E-mem, a framework designed to enhance the long-context reasoning capabilities of LLM agents via episodic context reconstruction. Our work primarily aims to advance the reliability and precision of autonomous systems in complex domains such as legal forensics, scientific discovery, and medical diagnosis.

However, we acknowledge potential societal implications associated with memory-augmented agents. First, the persistence of episodic memory raises privacy concerns regarding the storage and retrieval of sensitive user data. While our hierarchical architecture separates low-level storage from high-level planning, rigorous data governance and access control mechanisms are essential for real-world deployment. Second, as agents gain deeper reasoning capabilities (System 2), ensuring their alignment with human values and preventing the hallucination of false memories remains critical to mitigating safety risks. We believe this work encourages further research into robust, accountable, and transparent memory systems for AI agents.
