Beyond Semantic Organization: Memory as Execution State Management for Long-Horizon Agents

Memory | Microsoft | USTC

Chen Y, Lai H, Feng Y, et al. Beyond Semantic Organization: Memory as Execution State Management for Long-Horizon Agents[J]. arXiv preprint arXiv:2606.06090v1, 2026.


Abstract

LLM-based agents increasingly tackle long-horizon tasks with interdependent decisions, where each action reshapes future constraints and intermediate errors can cascade. Existing RAG and agent memory systems organize histories by semantic similarity, retrieving content-relevant entries at decision time. We argue that this design mismatches execution-state dependencies: it fragments decision trajectories and mixes valid and erroneous traces, hindering coherent state reconstruction and error isolation. We propose Mage (Memory as Agent-Guided Exploration), an active execution-state manager that stores interactions in a hierarchical state tree. The agent derives its state from the active root-to-current path, combining subgoal summaries, recent traces, and hints from prior branches. Four coupled operations maintain the tree: Grow records new traces, Compress summarizes completed subgoals, Maintain validates summaries, and Revise restores a target boundary and resumes on a new branch. This design bounds context growth while preserving state integrity and isolating flawed segments from the active path. Experiments on MemoryArena show that Mage improves the average task success rate by 7.8–20.4 pp over baselines, while reducing token consumption by 55.1%.

1. Introduction

With the growing ability of large language models (LLMs) to interact with complex environments through tool use and multi-step reasoning, LLM-based agents are increasingly deployed for long-horizon tasks with interdependent decisions. These tasks involve hundreds of steps where each action reshapes future choices, and intermediate errors can cascade to invalidate subsequent progress. Unlike recall-oriented memory benchmarks, which pose questions over past conversations or agentic traces, the interdependent long-horizon agent tasks we study require maintaining a coherent, evolving execution state, as each decision depends on the cumulative outcome of prior steps.

This requirement becomes harder as exploration history grows beyond the model's effective context window. To address this, recent works introduce memory systems that record past information as compact entries and retrieve relevant ones on demand. Yet recent benchmarks reveal a counter-intuitive pattern: these systems often fail to improve long-horizon agent performance and sometimes underperform approaches that simply retain the full history in context. As shown in Figure 1, many such systems consume substantial tokens while still trailing the long-context approach.

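Figure 1: Paradigm comparison on long-horizon agent tasks (MemoryArena). The long-context approach achieves strong task performance but at a high token cost; baseline methods reduce context but risk losing state dependencies, which lowers performance. By managing memory as an execution state tree, Mage reaches the highest task performance with fewer tokens than the long-context approach, placing it in the ideal upper-left region.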

We argue that a key cause lies in the shared design philosophy. Although these systems vary in their data structures, ranging from flat vector stores to entity-relation graphs to hierarchical architectures, they generally rely on semantic relationships to organize and retrieve information, surfacing entries by their content relevance to the current query rather than their role in the execution trajectory.

Such similarity-driven organization leads to two recurring problems when handling interdependent long-horizon tasks. First, it causes state fragmentation that weakens execution state integrity. The agent's execution state is built up through a chain of dependent decisions where each step is conditioned on the context established by prior steps. Existing systems, even those with graph structures, organize this state as entries linked by semantic or topical relationships rather than state dependencies, discarding critical execution context that binds them together. As a result, the system may fail to reconstruct a complete, coherent execution state, leading to erroneous actions based on incomplete information (Figure 2(a)–(b)).

Second, it hinders effective error isolation. Similarity-based memory mixes entries from different trajectories or exploration attempts in the same relevance space, so erroneous and valid traces can be surfaced together and contaminate subsequent reasoning (Figure 2(d)). Without explicit path structure and revision boundaries, it is also difficult to trace an error back to its origin or isolate the affected segment, allowing errors to propagate and accumulate over the course of execution.

These observations suggest that memory for interdependent long-horizon agents should shift from a similarity-driven archive to an execution-state manager. To this end, we propose Mage (Memory as Agent-Guided Exploration), which treats memory as an execution state structure rather than a pool of retrievable facts. Mage organizes the agent's history as a persistent two-layer hierarchical state tree. The bottom layer records the step-by-step action-observation trace, while the top layer stores summaries generated at subgoal or decision boundaries. This boundary-aware compression reduces context without interrupting an active trace or breaking execution-state integrity. The current execution state is read from the active tree path instead of being assembled from semantically similar entries, combining compressed state, recent raw state, and execution hints from sibling branches. This path-based representation addresses state fragmentation by keeping the agent-facing state coherent while still bounding the context size.

Building on this tree, Mage further supports error isolation by making memory an agent-manipulable object rather than a shared pool of mixed entries. Through a closed-loop execution cycle, Grow extends the raw trace and Compress summarizes the accumulated trace at subgoal or decision boundaries. Before a new summary becomes trusted memory, Maintain validates the summary and its underlying trace against the task, catching missing information or execution errors before they propagate. If an error is detected, Revise restores the execution state to the target boundary and resumes execution as a new branch. The erroneous segment is therefore excluded from the active path, while the valid progress before the target boundary is preserved, isolating the error from subsequent decisions. As shown in Figure 1, Mage occupies the optimal upper-left quadrant, achieving stronger task progress with lower token consumption.

Our contributions are as follows. (1) We propose Mage, which organizes agentic memory as a two-layer hierarchical tree whose root-to-current path provides a complete execution state by construction, shifting memory from similarity-driven retrieval to compact execution-state management. (2) We design four coupled operations that make this tree an agent-manipulable object, forming a closed-loop state-management cycle that isolates errors into separate branches and keeps the active execution state free from erroneous traces. (3) On MemoryArena, Mage improves the task success rate by 7.8–20.4 percentage points over baselines on average, while reducing token consumption by 55.1% compared with the long-context approach.

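Figure 2: Case study of baseline failures on MemoryArena shopping tasks. In this task, each purchase must satisfy constraints introduced by previously purchased items. Cases (a)–(c) show state fragmentation: HippoRAG retrieves only the Product 2 information, while MemoryOS retrieves the Product 4 exploration trace but not the final purchased item; both miss that Product 4 contains gold and choose the incompatible unicorn-themed option. In contrast, Mage preserves the execution state and chooses the compatible wedding option. Case (d) shows error contamination: SimpleMem retrieves mixed evidence from correct and erroneous traces and purchases a berry-flavored item that violates the current vegetable-related avoidance rule.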

2.1. Problem Setting

Long-horizon agent tasks with interdependent decisions can be formulated as a Markov decision process (MDP). At step $t$, the environment state $s_t \in \mathcal{S}$ evolves deterministically as $s_{t+1} = T(s_t, a_t)$ after action $a_t \in \mathcal{A}$, and the agent receives an observation $o_t$ describing the resulting state. The interaction history is therefore $h_t = (a_1, o_1, \dots, a_t, o_t)$; as $t$ grows, this history can exceed the model's effective context window, so the central challenge is to organize $h_t$ compactly while still supporting complete state reconstruction.

This formulation highlights two requirements. First, since $s_t$ is determined by previous actions $(a_1, \dots, a_t)$, a sufficient memory representation must preserve the decision chain on which each step depends rather than only relevant entries. Second, if an action $a_k$ is erroneous, downstream states $s_{k+1}, \dots, s_t$ may become invalid; recovery therefore requires identifying the error origin, reverting to $s_k$, and re-executing from that point. This distinguishes our setting from traditional memory benchmarks, which mainly test recall of facts, preferences, or events from past conversations or traces. Since answers in these benchmarks do not alter the environment or invalidate future states, they measure retrieval fidelity rather than dynamic execution-state management.
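
The second requirement can be made concrete with a toy deterministic transition. The sketch below is only an illustration under assumed names: the state type, `transition`, `rollout`, and `recover` are hypothetical and do not come from the paper. It shows that once $a_k$ is judged erroneous, the states up to $s_k$ remain valid while everything after them must be re-executed.

```python
from typing import List, Tuple

State = Tuple[str, ...]  # hypothetical state: the items purchased so far


def transition(state: State, action: str) -> State:
    """Deterministic T(s_t, a_t): purchasing an item appends it to the state."""
    return state + (action,)


def rollout(s1: State, actions: List[str]) -> List[State]:
    """Return [s_1, ..., s_{t+1}] obtained by applying a_1..a_t in order."""
    states = [s1]
    for action in actions:
        states.append(transition(states[-1], action))
    return states


def recover(states: List[State], k: int, fixed_action: str) -> List[State]:
    """Treat a_k (1-indexed) as erroneous: keep s_1..s_k, drop the downstream
    states, and re-execute from s_k with a corrected action."""
    valid_prefix = states[:k]  # s_1 .. s_k do not depend on a_k
    return valid_prefix + [transition(valid_prefix[-1], fixed_action)]


states = rollout((), ["gold ring", "unicorn card", "ribbon"])
repaired = recover(states, k=2, fixed_action="wedding card")
```

Here the third action ("ribbon") is discarded along with the faulty second one, because it was chosen in a state that no longer exists after the correction.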

2.2. Memory and Retrieval Systems for Agents

A natural approach to managing long histories is retrieval-augmented generation (RAG), which augments the LLM context with information retrieved from an external store. Existing RAG methods include direct retrieval with sparse or dense matching, iterative retrieval with query refinement, graph-structured RAG for multi-hop reasoning, and memory-augmented RAG that uses a lightweight model to form global memory or retrieval clues. These methods are effective for grounding generation in external knowledge, but the retrieved corpus is typically static and independent of the agent's action-conditioned state.

Agent memory systems instead store the agent's evolving history. They differ in storage design: flat systems keep independent records retrieved by embedding similarity; graph-based systems organize memories through entity or event relations; hierarchical systems maintain multiple granularities to balance detail and compression; and hybrid systems combine granularities or narrative structures for compact coverage. Other work improves retrieval with prospective indexing or retrospective reflection.

Despite this diversity, these systems commonly expose memory through similarity-driven update and retrieval: they maintain a store $M$ via $M \leftarrow \mathrm{Update}(M, a_t, o_t)$ and retrieve entries $\mathrm{Retrieve}(M, q) \rightarrow \{e_1, \dots, e_k\}$ by semantic relevance to query $q$. This design reduces context length but does not preserve the path structure needed for long-horizon tasks with interdependent decisions. It therefore causes state fragmentation, where the decision chain defining $s_t$ is scattered across semantic fragments, and insufficient error isolation, where valid and erroneous trajectories coexist in the same memory pool without structural boundaries for rollback. Figure 2 illustrates both failures on shopping tasks in MemoryArena.
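
To make the Update/Retrieve interface tangible, here is a deliberately simple sketch. It assumes a flat store and word-overlap (Jaccard) scoring as a stand-in for embedding similarity; `FlatMemory` and the example entries are hypothetical and do not reproduce any baseline's actual implementation. The point is only that the ranking sees content overlap, not which attempt an entry belongs to.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FlatMemory:
    """Flat store M with similarity-driven retrieval (toy word-overlap score)."""
    entries: List[str] = field(default_factory=list)

    def update(self, action: str, observation: str) -> None:
        # M <- Update(M, a_t, o_t): each step becomes an independent entry.
        self.entries.append(f"{action} -> {observation}")

    def retrieve(self, query: str, k: int = 2) -> List[str]:
        # Retrieve(M, q): rank entries by Jaccard overlap with the query.
        q = set(query.lower().split())

        def score(entry: str) -> float:
            words = set(entry.lower().split())
            return len(q & words) / len(q | words)

        return sorted(self.entries, key=score, reverse=True)[:k]


memory = FlatMemory()
memory.update("buy carrot chips", "accepted")                         # valid step
memory.update("buy berry snack", "rejected violates avoidance rule")  # failed attempt
memory.update("view cart", "two items")
hits = memory.retrieve("which snack to buy", k=2)
```

Both the accepted purchase and the rejected attempt come back for the same query, with nothing in the store that marks one of them as belonging to an abandoned trajectory.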

3. Method

To address the issues inherent in similarity-driven memory systems, we propose Mage, which shifts agentic memory from passive semantic storage and retrieval to active execution state management. Figure 3 illustrates the overall design.

3.1. Overview

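Figure 3: Overview of Mage. Mage maintains a two-layer execution state tree: raw action-observation nodes grow in the bottom layer, while completed subgoals are compressed into the top layer. When an error is detected, Revise restores the target boundary and continues exploration along a new branch while preserving unaffected progress.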

Cognitive science suggests that humans performing complex sequential tasks rely on coordinated neural mechanisms. The prefrontal cortex organizes behavior into hierarchical subgoals and chunks completed segments to free working memory for subsequent planning. The anterior cingulate cortex monitors execution and signals failures at subgoal boundaries before they propagate to downstream decisions. After detecting errors, executive control selectively backtracks to the relevant boundary and repairs the affected segment while preserving unaffected goal structure. This cycle of chunking, monitoring, and correction motivates a memory system that manages execution state actively rather than merely storing past information.

Motivated by this architecture, we propose Mage (Memory as Agent-Guided Exploration), which represents the agent's execution history as a two-layer hierarchical state tree. The bottom layer records raw action-observation nodes in execution order, preserving fine-grained state dependencies. The top layer stores summary nodes that cover completed bottom-layer segments, progressively chunking long local traces into compact subgoal-level states. Together, these two layers keep the root-to-current path complete rather than fragmented, while bounding the context and retaining the boundaries needed for future revision. Based on this tree, Mage constructs the agent-facing execution state S, consisting of compressed summaries C, recent raw trace R, and execution hints H from previously explored branches and diagnostic notes.

Given this state representation, we design four operations to maintain the tree and refresh S in a closed loop, mirroring the cognitive cycle summarized in Table 1. During execution, Grow appends each new action-observation pair to the bottom-layer tree, extending the recent raw trace R. Compress moves completed raw segments from R into top-layer summaries in C, freeing context while preserving subgoal boundaries. Maintain acts as a boundary-level error monitor, validating each new summary before it becomes trusted memory and recording diagnostic notes. Upon detecting an error, Revise provides selective correction by restoring C and R to the relevant boundary, injecting diagnostic feedback into H, and resuming execution as a new branch from that point.

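Table 1: Mage operations and the cognitive mechanisms underlying human execution of complex tasks that they correspond to.

| Cognitive Mechanism | Function | Operation |
|---|---|---|
| Hierarchical chunking | Organize subgoals; free working memory | Grow + Compress |
| Error monitoring | Detect errors at subgoal boundaries | Maintain |
| Selective correction | Backtrack to error origin; correct affected branch | Revise |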

This closed-loop design directly addresses state fragmentation and ineffective error isolation. First, Mage constructs the current execution state from one path of the tree, combining compressed summaries for completed segments with raw traces for recent steps instead of retrieving disconnected memory entries based on similarity. Second, boundary-level maintenance and revision prevent erroneous segments from entering or contaminating the active execution state. When an error is detected, Revise branches from the target boundary, isolating the affected segment while preserving valid progress elsewhere. We next present the hierarchical execution state tree and the four state-transition operations.

3.2. Hierarchical Execution State Tree

Mage organizes the agent's execution history as a two-layer hierarchical tree with a unified node structure (Table 2). The bottom layer records every raw action-observation pair as a node, preserving the fine-grained state of the execution. The root-to-current path through this layer yields the complete execution trajectory, while children of each node expose previously explored alternatives that can help the agent avoid repeating errors. When the agent revises a decision, new actions branch as siblings of the failed path, structurally isolating erroneous traces from valid ones. The top layer compresses contiguous bottom-layer segments into summary nodes, bounding the context as the task progresses. Each top-layer node corresponds to a completed subgoal, and we apply Maintain and Revise at these subgoal boundaries, the same locus where the prefrontal cortex chunks completed segments and the anterior cingulate cortex monitors for errors before they propagate (Table 1).

Table 2: Node structure of the execution state tree.

| Field | Description |
|---|---|
| id | Unique identifier |
| content | Action-observation pair (bottom layer) or compressed summary (top layer) |
| parent | Pointer to parent node |
| children | Set of child node pointers |
| cover_nodes | Ordered bottom-node pointers covered by this summary (top layer only) |
| note | Diagnostic feedback (top layer only) |
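
One way to picture the unified node structure is as a small Python dataclass. This is a sketch rather than the authors' code: the field names follow Table 2, while the `link` and `path_from_root` helpers are hypothetical additions, and `children` is held in a list (not a set) so that exploration order is kept.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union

# Bottom layer: (action, observation) pair. Top layer: summary text.
Content = Union[Tuple[str, str], str]


@dataclass(eq=False)  # identity-based equality, since nodes act as pointers
class Node:
    id: int
    content: Content
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    cover_nodes: List["Node"] = field(default_factory=list)  # top layer only
    note: Optional[str] = None                               # top layer only


def link(parent: Node, child: Node) -> Node:
    """Attach child under parent (hypothetical helper)."""
    child.parent = parent
    parent.children.append(child)
    return child


def path_from_root(node: Node) -> List[Node]:
    """Follow parent pointers and return the root-to-node path."""
    path = []
    current: Optional[Node] = node
    while current is not None:
        path.append(current)
        current = current.parent
    return path[::-1]


root = Node(0, ("<start>", ""))
first = link(root, Node(1, ("search rings", "3 results")))
second = link(first, Node(2, ("buy gold ring", "ok")))
```

Using `eq=False` avoids the recursive field-by-field comparison a default dataclass would perform through `parent` and `children`.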

At runtime, Mage navigates this two-layer structure with pointers $p_b$ and $p_t$, which track the agent's current positions in the bottom and top layers, respectively, and a global step counter $\mathit{step}$ that assigns monotonically increasing ids to newly created nodes.

Based on the hierarchical tree, Mage derives the agent's execution state $S$ by composing three parts: (1) compressed state $C$, consisting of top-layer summaries along the root-to-$p_t$ path, each annotated by its step id, allowing the agent to revise failed subgoals from the corresponding boundary upon detecting errors; (2) raw state $R$, the bottom-layer nodes accumulated since the last compression that provide fine-grained recent context; and (3) execution hint $H$, which surfaces children of $p_b$ and $p_t$ to reveal previously explored alternatives, along with diagnostic feedback from prior failed attempts. This representation equips the agent with a complete execution state and corrective guidance, sustaining coherent decision-making over long-horizon tasks with interdependent steps.
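
The composition of $S = (C, R, H)$ can be sketched as a read along the active path. The code below is an assumption-laden illustration: `compose_state`, `add_child`, the dictionary layout, and the convention that the top-layer root carries no summary are all hypothetical choices, not details given in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    id: int
    content: object
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    note: Optional[str] = None


def add_child(parent: Node, node: Node) -> Node:
    node.parent = parent
    parent.children.append(node)
    return node


def compose_state(p_t: Node, p_b: Node, raw: list) -> dict:
    """Build S = (C, R, H) from the current pointers and the raw buffer."""
    # C: summaries on the root-to-p_t path, each tagged with its step id.
    compressed = []
    node = p_t
    while node.parent is not None:  # assumed: the top root holds no summary
        compressed.append((node.id, node.content))
        node = node.parent
    compressed.reverse()
    # H: alternatives already explored from the current positions,
    # plus any diagnostic notes recorded on them.
    hints = []
    for child in p_b.children + p_t.children:
        hints.append(child.content)
        if child.note:
            hints.append(child.note)
    return {"C": compressed, "R": list(raw), "H": hints}


top_root = Node(-1, "<root>")
s1 = add_child(top_root, Node(0, "bought gold ring"))
add_child(s1, Node(1, "bought unicorn card", note="clashes with gold"))
bottom = Node(1, ("buy gold ring", "ok"))
state = compose_state(p_t=s1, p_b=bottom, raw=[])
```

In this example the pointer sits at the first summary after a failed second subgoal, so the hint list carries both the abandoned summary and its diagnostic note.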

3.3. State-Transition Operations

The hierarchical tree becomes an active execution manager through four operations that transition the execution state as the agent progresses. Algorithm 1 provides the pseudocode of these operations.

Grow. When the agent executes an action and receives an observation, Mage automatically invokes Grow to update the bottom layer. If $p_b$ already has a child with identical content from a prior exploration, the pointer advances to that child, merging back into the explored path without duplication; otherwise, a new node is created and linked as a child of $p_b$:

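$$p_b = \begin{cases} c, & \exists\, c \in p_b.\text{children} : c.\text{content} = (a, o), \\ \mathrm{NewNode}(\mathit{step}, (a, o)),\ \ \mathit{step} \leftarrow \mathit{step} + 1, & \text{otherwise.} \end{cases}$$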

The raw state R is then extended with the new action-observation pair, and the execution hint H is updated with children of the current node:

$$R = R \oplus (a, o), \qquad H = \mathrm{Update}(H,\ p_b.\text{children}).$$

This informs the agent of continuations attempted in prior explorations and helps it avoid repeating failed strategies.
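
A minimal sketch of Grow, under stated assumptions, might look as follows. The `BottomLayer` class is hypothetical, the hint update is modeled as a simple overwrite with the new node's children (the text does not specify how `Update` merges hints), and the manual pointer reset in the usage lines merely simulates a rollback that Revise would normally perform.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)
class Node:
    id: int
    content: Tuple[str, str]
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


class BottomLayer:
    def __init__(self) -> None:
        self.root = Node(0, ("<start>", ""))
        self.step = 1                       # global step counter for new ids
        self.p_b = self.root                # current bottom-layer pointer
        self.R: List[Tuple[str, str]] = []  # raw state
        self.H: List[Tuple[str, str]] = []  # execution hints

    def grow(self, action: str, observation: str) -> Node:
        content = (action, observation)
        for child in self.p_b.children:
            if child.content == content:    # merge into an explored path
                self.p_b = child
                break
        else:                               # no match: create a new node
            node = Node(self.step, content, parent=self.p_b)
            self.step += 1
            self.p_b.children.append(node)
            self.p_b = node
        self.R.append(content)
        # Assumed Update: show continuations already tried from this node.
        self.H = [c.content for c in self.p_b.children]
        return self.p_b


layer = BottomLayer()
a = layer.grow("open cart", "empty")
b = layer.grow("buy item A", "rejected")
layer.p_b = layer.root                      # simulated rollback to the start
again = layer.grow("open cart", "empty")    # merges instead of duplicating
```

After the merge, the hints expose the previously rejected purchase, which is the signal the agent can use to avoid repeating it.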

Compress. Compress bounds context growth by replacing a completed bottom-layer segment with a top-layer summary node, freeing space while preserving the decision boundary needed for later recovery. It is invoked when the agent marks a subgoal complete with summary content provided as an argument, or by Mage as a fallback when the raw state R exceeds a length threshold. Through this boundary-aware compression, Mage avoids interrupting unfinished subgoals and keeps the state compact without discarding dependencies needed by future decisions.

Operationally, Compress traces the bottom-layer tree from $p_b$ back to the last compressed boundary (recorded as $p_t.\text{cover\_nodes}[-1]$), and uses the traversed nodes in execution order as the new summary node's cover_nodes. If a child of $p_t$ already covers the same bottom nodes, it is reused; otherwise, a new summary node is created and inserted into the top-layer tree:

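$$\begin{aligned} C_b &= \mathrm{Trace}(p_t.\text{cover\_nodes}[-1],\ p_b), \qquad b = p_t.\text{cover\_nodes}[-1].\text{id}, \\ p_t &= \begin{cases} c, & \exists\, c \in p_t.\text{children} : c.\text{cover\_nodes} = C_b, \\ \mathrm{NewNode}(b,\ \text{sum\_content},\ C_b), & \text{otherwise.} \end{cases} \end{aligned}$$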

Then, Compress clears the current raw state $R$, appends the summary content to the compressed state $C$, and updates the execution hint $H$ with children of $p_t$:

$$R = \emptyset, \qquad C = C \oplus (p_t.\text{content}), \qquad H = \mathrm{Update}(H,\ p_t.\text{children}).$$

This exposes previously attempted subgoals from the new boundary while keeping the compressed state compact.

Maintain. Immediately after compression, Maintain validates the just-completed subgoal before the new summary becomes a trusted part of memory. This check protects the execution state from incorrect memory writes, allowing Mage to detect missing information, unsatisfied task requirements, or broken dependencies before such errors accumulate. An LLM examines the compressed subtree together with the summary content and task instruction:

$$f = \mathrm{LLM}(\text{task\_inst},\ p_t.\text{cover\_nodes},\ p_t.\text{content}).$$

If validation passes, execution continues. Otherwise, Maintain records the diagnostic feedback $f$ in $p_t.\text{note}$ and returns a failure signal with the revision target $p_t.\text{id}$.
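
The control flow of Maintain can be sketched with a pluggable judge. Everything named here is hypothetical: `SummaryNode` stores action-observation pairs directly instead of node pointers, and `make_banned_word_judge` is a toy rule standing in for the LLM call, which in Mage would receive the same three inputs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Trace = List[Tuple[str, str]]                       # (action, observation) pairs
Judge = Callable[[str, str, str], Tuple[bool, str]]  # -> (passed, feedback)


@dataclass
class SummaryNode:
    id: int
    content: str                # compressed summary text
    cover_nodes: Trace          # simplified: pairs rather than node pointers
    note: Optional[str] = None  # diagnostic feedback, set on failure


def maintain(task_inst: str, node: SummaryNode, judge: Judge):
    """Validate a fresh summary; on failure record feedback and return the target."""
    flat = "\n".join(f"{a} -> {o}" for a, o in node.cover_nodes)  # flatten traces
    passed, feedback = judge(task_inst, flat, node.content)
    if passed:
        return ("pass", None, None)
    node.note = feedback
    return ("fail", feedback, node.id)


def make_banned_word_judge(banned: str) -> Judge:
    """Toy stand-in for the LLM judge: fail when the trace mentions `banned`."""
    def judge(task_inst: str, trace: str, summary: str) -> Tuple[bool, str]:
        if banned in trace:
            return False, f"trace contains '{banned}', which the task forbids"
        return True, ""
    return judge


judge = make_banned_word_judge("berry")
bad = SummaryNode(3, "bought berry jam", [("buy berry jam", "ok")])
good = SummaryNode(4, "bought carrot chips", [("buy carrot chips", "ok")])
bad_result = maintain("avoid berry flavors", bad, judge)
good_result = maintain("avoid berry flavors", good, judge)
```

Keeping the judge behind a plain callable means the same `maintain` function works whether the check is rule-based, as here, or a real model call.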

Revise. Triggered by a Maintain failure or invoked proactively by the agent upon detecting an error, Revise restores the active path to the target step t, which is either returned by Maintain or selected from exposed compressed-state boundaries. Mage rolls both pointers backward until the target is reached, where re-exploration begins:

$$(p_t, p_b) = \mathrm{Restore}(p_t, p_b, t).$$

The compressed state C and raw state R are reverted to these positions, while the execution hint H is updated with diagnostic feedback and alternatives from the restored nodes, providing extra guidance that helps avoid repeating the same error:

$$C = \mathrm{RestoreState}(p_t), \qquad R = \emptyset, \qquad H = \mathrm{Update}(H,\ p_t.\text{children} \cup p_b.\text{children} \cup \{f\}).$$

Subsequent actions branch from this restored point as sibling paths, achieving error isolation without discarding valid progress on other branches.
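
The interplay of Compress and Revise is easier to follow end to end. The sketch below is one possible reading of the operations, not the authors' implementation: `MageSketch` is a hypothetical class, the top-layer root is given id -1 and covers a dummy start node, the last covered node is taken as the compression boundary, and the hint refresh inside `grow` is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity-based equality: nodes are compared as pointers
class Node:
    id: int
    content: object
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    cover_nodes: List["Node"] = field(default_factory=list)


class MageSketch:
    def __init__(self) -> None:
        start = Node(0, ("<start>", ""))                    # bottom-layer root
        self.p_b = start
        self.p_t = Node(-1, "<root>", cover_nodes=[start])  # top-layer root
        self.step = 1
        self.C: list = []  # (id, summary) pairs on the active top-layer path
        self.R: list = []  # raw trace since the last compression
        self.H: list = []  # execution hints

    def grow(self, action: str, observation: str) -> None:
        content = (action, observation)
        node = next((c for c in self.p_b.children if c.content == content), None)
        if node is None:
            node = Node(self.step, content, parent=self.p_b)
            self.step += 1
            self.p_b.children.append(node)
        self.p_b = node
        self.R.append(content)  # hint refresh omitted in this sketch

    def compress(self, summary: str) -> Node:
        boundary = self.p_t.cover_nodes[-1]   # last compressed bottom node
        covered, node = [], self.p_b
        while node is not boundary:           # trace back to the boundary
            covered.append(node)
            node = node.parent
        covered.reverse()                     # execution order
        if not covered:
            raise ValueError("nothing to compress")
        top = next((c for c in self.p_t.children if c.cover_nodes == covered), None)
        if top is None:
            top = Node(boundary.id, summary, parent=self.p_t, cover_nodes=covered)
            self.p_t.children.append(top)
        else:
            top.content = summary             # reuse the existing summary node
        self.p_t = top
        self.R = []
        self.C.append((top.id, top.content))
        self.H = [c.content for c in top.children]
        return top

    def revise(self, feedback: str, target_id: int) -> None:
        while self.p_t.id != target_id:       # roll back to the target summary
            if self.p_t.parent is None:
                raise ValueError("unknown revision target")
            self.C.pop()
            self.p_t = self.p_t.parent
        if self.p_t.parent is None:
            raise ValueError("cannot revise the root")
        self.C.pop()                          # drop the failed summary itself
        self.p_t = self.p_t.parent
        self.p_b = self.p_t.cover_nodes[-1]
        self.R = []
        self.H = ([c.content for c in self.p_t.children]
                  + [c.content for c in self.p_b.children] + [feedback])


m = MageSketch()
m.grow("buy ring", "gold ring bought")
m.compress("P1: gold ring")
m.grow("buy unicorn card", "bought")
failed = m.compress("P2: unicorn card")
m.revise("unicorn theme clashes with gold", failed.id)
hints_after_revise = list(m.H)
m.grow("buy wedding card", "bought")
m.compress("P2: wedding card")
```

After the final call, the failed summary still exists in the tree as a sibling of the new one, but it no longer appears in `m.C`, which is how the sketch mirrors the idea of isolating an error without deleting history.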

Algorithm 1: Mage state transition operations.

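Global variables: pointers $p_b$ (bottom-layer tree) and $p_t$ (top-layer tree); global step counter $\mathit{step}$.
Execution state: $S = (C, R, H)$ for compressed state, raw state, and execution hint.

  function Grow($a, o$)
    for all $c \in p_b.\text{children}$ do
      if $c.\text{content} = (a, o)$ then ▷ merge node
        $p_b \leftarrow c$
        $R \leftarrow R \oplus (a, o)$
        $H.\mathrm{Update}(p_b.\text{children})$
        return
      end if
    end for
    $v \leftarrow \mathrm{NewNode}(\mathit{step}, (a, o))$; $\mathit{step} \leftarrow \mathit{step} + 1$
    $v.\text{parent} \leftarrow p_b$; $p_b.\text{children} \leftarrow p_b.\text{children} \cup \{v\}$
    $p_b \leftarrow v$; $R \leftarrow R \oplus (a, o)$
    $H.\mathrm{Update}(p_b.\text{children})$
  end function

  function Compress($m$) ▷ input summary content
    $b \leftarrow p_t.\text{cover\_nodes}[-1]$ ▷ compressed boundary
    $C_b \leftarrow \mathrm{Trace}(b, p_b)$ ▷ track nodes in execution order
    for all $c \in p_t.\text{children}$ do ▷ merge node
      if $c.\text{cover\_nodes} = C_b$ then
        $c.\text{content} \leftarrow m$; $p_t \leftarrow c$
        $R \leftarrow \emptyset$; $C \leftarrow C \oplus (p_t.\text{content}, p_t.\text{id})$
        $H.\mathrm{Update}(p_t.\text{children})$
        return
      end if
    end for
    $v \leftarrow \mathrm{NewNode}(b.\text{id}, m, C_b)$
    $v.\text{parent} \leftarrow p_t$; $p_t.\text{children} \leftarrow p_t.\text{children} \cup \{v\}$
    $p_t \leftarrow v$
    $R \leftarrow \emptyset$; $C \leftarrow C \oplus (p_t.\text{content}, p_t.\text{id})$
    $H.\mathrm{Update}(p_t.\text{children})$
  end function

  function Maintain($\tau$) ▷ input task instruction
    $T \leftarrow \mathrm{Flatten}(p_t.\text{cover\_nodes})$ ▷ flatten traces
    $(q, f) \leftarrow \mathrm{LLM}(\tau, T, p_t.\text{content})$ ▷ LLM judge
    if $q$ then
      return Pass
    else
      $p_t.\text{note} \leftarrow f$
      return (Fail, $f$, $p_t.\text{id}$)
    end if
  end function

  function Revise($f, b$) ▷ input feedback and target step
    while $p_t.\text{id} \neq b$ do
      $C.\mathrm{Delete}(p_t)$; $p_t \leftarrow p_t.\text{parent}$
    end while
    $C.\mathrm{Delete}(p_t)$
    $p_t \leftarrow p_t.\text{parent}$ ▷ skip the failed compressed node
    $p_b \leftarrow p_t.\text{cover\_nodes}[-1]$; $R \leftarrow \emptyset$
    $H \leftarrow p_t.\text{children} \cup p_b.\text{children} \cup \{f\}$
  end function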

4. Experiments

4.1. Experimental Setup

Benchmark. We evaluate on MemoryArena, an interdependent long-horizon benchmark where agents operate in a continuous Memory-Agent-Environment loop for up to hundreds of steps. Unlike conventional benchmarks that test static fact retrieval or question answering over past dialogues and traces, MemoryArena follows action-conditioned MDPs: each action can reshape future constraints, so success requires tracking the evolving execution state rather than recalling facts.

MemoryArena spans four domains with long dependency structures: Bundled Web Shopping, where the agent purchases a bundle of related products and later choices depend on earlier items; Group Travel Planning, in which the agent coordinates multi-person itineraries to satisfy interdependent preferences; Progressive Web Search, where the model answers complex queries progressively using information gathered from previous sub-queries; and Formal Reasoning, in which the agent proves complex claims through sequential derivations that build on previously established results.

Baselines. We compare Mage against representative methods across three paradigms. Long Context retains full interaction history. RAG systems include HippoRAG2, which builds a knowledge graph and applies Personalized PageRank for multi-hop retrieval, and MemoRAG, which uses a lightweight memory model to generate retrieval clues. Memory systems include Mem0, which extracts and consolidates facts into graph-based memory, ReasoningBank, which distills reusable reasoning strategies from past experiences, MemoryOS, which maintains hierarchical storage layers with dynamic cross-level updating, and SimpleMem, which performs semantic compression and recursive consolidation for efficient memory management. All methods use the default hyperparameters from their original papers.

Model. All methods use Qwen3.6-27B as the backbone LLM with ReAct for agent exploration. Baselines requiring embeddings use Qwen3-8B-Embedding. Inference runs on NVIDIA A100 GPUs with vLLM 0.20.0 under Python 3.12. Results with additional backend models are reported in the Appendix.

Metrics. Following MemoryArena, we report average Task Success Rate (SR, %), Task Progress Score (PS, %), and total token consumption. SR measures full task completion: all subtasks must be correct in Shopping and Travel Planning, while the final subtask determines success in the other two domains. PS measures completed-subtask fraction, and token consumption includes both prompt and generation tokens.
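
As a rough illustration of how the two scores differ, the sketch below computes them from per-subtask correctness flags. The function names and the domain labels (`"shopping"`, `"travel"`, and so on) are hypothetical; only the rules themselves (all subtasks for Shopping and Travel Planning, final subtask otherwise, and completed fraction for PS) come from the description above.

```python
from typing import List

# Domains where every subtask must be correct for the task to count as solved.
ALL_SUBTASK_DOMAINS = {"shopping", "travel"}


def task_success(subtasks: List[bool], domain: str) -> bool:
    """SR criterion for a single task."""
    if domain in ALL_SUBTASK_DOMAINS:
        return all(subtasks)
    return subtasks[-1]  # other domains: the final subtask decides


def progress_score(subtasks: List[bool]) -> float:
    """PS for a single task: fraction of subtasks completed."""
    return sum(subtasks) / len(subtasks)


def success_rate(tasks: List[List[bool]], domain: str) -> float:
    """Average SR over a set of tasks in one domain."""
    return sum(task_success(t, domain) for t in tasks) / len(tasks)


shopping_tasks = [[True, True, True], [True, True, False]]
search_tasks = [[False, True], [True, False]]
```

The second shopping task shows why PS can stay high while SR is low: two of three subtasks are complete, yet the task as a whole does not count as a success.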

4.2. Main Results

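Table 3: Main results on MemoryArena. SR = Task Success Rate; PS = Task Progress Score (both defined as percentages, listed here as fractions); #tokens = average token consumption per task. In the source table, the best result is shown in bold and the second-best is underlined. Mage achieves the best task accuracy while reducing token consumption.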
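Column groups: Shopping = Bundled Web Shopping; Travel = Group Travel Planning; Search = Progressive Web Search; Math, Physics, and Formal #tokens = Formal Reasoning.

| Method | Shopping SR | Shopping PS | Shopping #tokens | Travel SR | Travel PS | Travel #tokens | Search SR | Search PS | Search #tokens | Math SR | Math PS | Physics SR | Physics PS | Formal #tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Long Context | 0.3333 | 0.7578 | 1528K | 0.0519 | 0.4070 | 3211K | 0.4842 | 0.2803 | 6045K | 0.4000 | 0.4124 | 0.6500 | 0.6977 | 1782K |
| HippoRAG2 | 0.2067 | 0.7189 | 1720K | 0.0963 | 0.5569 | 3153K | 0.3620 | 0.2133 | 2865K | 0.4500 | 0.4153 | 0.6000 | 0.7093 | 2268K |
| MemoRAG | 0.2067 | 0.7200 | 1251K | 0.0481 | 0.4731 | 2829K | 0.4208 | 0.2535 | 3535K | 0.4500 | 0.4237 | 0.6500 | 0.7209 | 1314K |
| Mem0 | 0.1933 | 0.6822 | 1753K | 0.0259 | 0.3498 | 2786K | 0.3529 | 0.2206 | 2715K | 0.4000 | 0.4040 | 0.5500 | 0.6977 | 1853K |
| ReasoningBank | 0.1133 | 0.6033 | 868K | 0.0000 | 0.2346 | 1463K | 0.3032 | 0.1993 | 1923K | 0.4250 | 0.4266 | 0.5500 | 0.6744 | 1187K |
| MemoryOS | 0.2000 | 0.7044 | 1448K | 0.0259 | 0.4405 | 2984K | 0.3303 | 0.2084 | 2973K | 0.4500 | 0.4209 | 0.5500 | 0.6977 | 1452K |
| SimpleMem | 0.1600 | 0.6767 | 2939K | 0.0222 | 0.3246 | 4616K | 0.3575 | 0.2310 | 3715K | 0.4000 | 0.4294 | 0.5500 | 0.6628 | 1656K |
| Mage (Ours) | 0.3933 | 0.7778 | 1015K | 0.1519 | 0.5351 | 1978K | 0.5656 | 0.3790 | 1727K | 0.4250 | 0.4492 | 0.6500 | 0.6977 | 1195K |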

Table 3 presents task performance across four domains. On tasks with complex state dependencies, RAG and memory-based baselines underperform the long-context approach, with SR drops of 12.7--22.0 pp on Web Shopping and 6.3--18.1 pp on Web Search. This gap suggests that similarity-driven retrieval fragments the execution state into isolated entries, discarding structural dependencies retained in full interaction history. Partial exceptions occur when dependencies can be recovered as sparse local evidence. In Formal Reasoning, baselines perform comparably or slightly better than the long-context approach because mathematical dependencies are explicit and sparse, enabling precise lemma retrieval. Similarly, HippoRAG2 achieves good results on Travel Planning because local constraints are anchored by stable entities and attributes (e.g., hotel--city, restaurant--cuisine), making graph retrieval effective for recovering candidate facts. Full travel plans, however, require cross-traveler constraints from prior decisions, and the fragmented retrieved triples do not preserve this evolving execution state, so HippoRAG2's PS advantage does not translate into higher SR than Mage.
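
The quoted SR gaps are differences in percentage points against the Long Context row. A short check, using SR values copied from Table 3 (the helper name `sr_drop_range_pp` is introduced here for illustration only):

```python
# SR values (as fractions) copied from Table 3.
SHOPPING_SR = {
    "Long Context": 0.3333, "HippoRAG2": 0.2067, "MemoRAG": 0.2067,
    "Mem0": 0.1933, "ReasoningBank": 0.1133, "MemoryOS": 0.2000,
    "SimpleMem": 0.1600,
}
SEARCH_SR = {
    "Long Context": 0.4842, "HippoRAG2": 0.3620, "MemoRAG": 0.4208,
    "Mem0": 0.3529, "ReasoningBank": 0.3032, "MemoryOS": 0.3303,
    "SimpleMem": 0.3575,
}


def sr_drop_range_pp(sr_by_method, reference="Long Context"):
    """Smallest and largest SR drop versus the reference, in percentage points."""
    ref = sr_by_method[reference]
    drops = [(ref - sr) * 100
             for name, sr in sr_by_method.items() if name != reference]
    return round(min(drops), 1), round(max(drops), 1)


print(sr_drop_range_pp(SHOPPING_SR))  # (12.7, 22.0)
print(sr_drop_range_pp(SEARCH_SR))    # (6.3, 18.1)
```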

Conversely, Mage outperforms the long-context approach by average margins of 7.8 pp in SR and 8.7 pp in PS. This gain stems from treating memory as an active execution-state manager, not a passive archive. Modeling the trajectory as an active path within a two-layer hierarchical tree, Mage maintains a coherent current state, reuses historical exploration to avoid recurring errors, and quarantines flawed segments into inactive branches to prevent contamination of the active path.
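
To make the idea of an active path concrete, here is a minimal sketch of deriving the agent's state from the root-to-current path while leaving abandoned branches out of the context. It is not the authors' implementation: the `Node` class, its fields, the `derive_state` function, and the choice to model subgoal nodes that each hold a list of raw traces are all assumptions about how the two-layer tree might be laid out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """A subgoal node; raw traces hang off it as the lower layer (assumed layout)."""
    name: str
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)
    traces: List[str] = field(default_factory=list)
    summary: Optional[str] = None  # filled in once the subgoal is compressed
    hint: Optional[str] = None     # lesson left behind by an abandoned branch

    def add_child(self, name: str) -> "Node":
        child = Node(name, parent=self)
        self.children.append(child)
        return child


def active_path(current: Node) -> List[Node]:
    """Nodes from the root down to the current node."""
    path = []
    node: Optional[Node] = current
    while node is not None:
        path.append(node)
        node = node.parent
    return path[::-1]


def derive_state(current: Node, recent: int = 3) -> List[str]:
    """Summaries and recent traces on the active path, plus hints from side branches."""
    path = active_path(current)
    on_path = set(path)  # identity-based hashing because eq=False
    lines: List[str] = []
    for node in path:
        if node.parent is not None:
            for sibling in node.parent.children:
                if sibling not in on_path and sibling.hint:
                    lines.append(f"hint({sibling.name}): {sibling.hint}")
        if node.summary is not None:
            lines.append(f"summary({node.name}): {node.summary}")
        else:
            lines.extend(f"trace({node.name}): {t}" for t in node.traces[-recent:])
    return lines


root = Node("task")
failed = root.add_child("laptop, attempt 1")
failed.traces.append("added 13-inch laptop to cart")
failed.hint = "13-inch model does not fit the chosen dock"
done = root.add_child("laptop, attempt 2")
done.summary = "bought 15-inch laptop that fits the dock"
current = done.add_child("sleeve")
current.traces.extend(["search 15-inch sleeve", "open product page"])

state = derive_state(current)
for line in state:
    print(line)
```

In this toy run the failed attempt contributes only its hint; its raw trace never reaches the derived state, which is the isolation property the paragraph above describes.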

4.3. Token Efficiency

Regarding efficiency, Table 3 also shows token consumption per task. While RAG and memory baselines theoretically reduce context length, this benefit primarily materializes in document-heavy environments like Web Search, where they reduce token usage by 38.5--68.2% compared with the long-context approach. In domains with shorter observations and frequent state updates, the overhead of memory maintenance (e.g., extraction, query rewriting), compounded by the additional reasoning and execution steps required to synthesize fragmented retrieved states, frequently eclipses the compression savings. Consequently, systems like HippoRAG2 and Mem0 end up consuming 12.6--14.7% more tokens than the long-context approach on Web Shopping. In contrast, Mage consistently reduces token usage by 32.9--71.4% across all domains. Unlike traditional memory systems that require continuous, token-heavy auxiliary LLM calls to extract entities or generate queries, Mage maintains the tree structure deterministically and invokes auxiliary LLMs only during Compress and Maintain at natural subgoal boundaries. This design minimizes maintenance overhead without compromising state integrity.
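
The token percentages are relative changes against the Long Context row, not percentage-point differences. A short check with values copied from Table 3 (in thousands of tokens); the helper name `relative_change_pct` is introduced here for illustration:

```python
# Average tokens per task, in thousands, copied from Table 3.
LONG_CONTEXT_K = {"Web Shopping": 1528, "Travel Planning": 3211,
                  "Web Search": 6045, "Formal Reasoning": 1782}
MAGE_K = {"Web Shopping": 1015, "Travel Planning": 1978,
          "Web Search": 1727, "Formal Reasoning": 1195}


def relative_change_pct(method_tokens: float, reference_tokens: float) -> float:
    """Signed change versus the reference, in percent (negative means fewer tokens)."""
    return round((method_tokens - reference_tokens) / reference_tokens * 100, 1)


for domain, reference in LONG_CONTEXT_K.items():
    print(domain, relative_change_pct(MAGE_K[domain], reference))
# Web Shopping -33.6, Travel Planning -38.4, Web Search -71.4, Formal Reasoning -32.9

# HippoRAG2 and Mem0 on Web Shopping use more tokens than Long Context.
print(relative_change_pct(1720, 1528))  # 12.6
print(relative_change_pct(1753, 1528))  # 14.7
```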

4.4. Ablation Study

Table 4: Ablation results. Each row removes one mechanism from the full Mage.
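| Variant | Web Shopping SR | Web Shopping PS | Web Shopping #tokens | Travel Planning SR | Travel Planning PS | Travel Planning #tokens |
|---|---|---|---|---|---|---|
| Mage (Full) | 0.3933 | 0.7778 | 1015K | 0.1519 | 0.5351 | 1978K |
| w/o Compress | 0.3200 | 0.7233 | 2469K | 0.0852 | 0.4496 | 3539K |
| w/o Maintain | 0.3267 | 0.7389 | 887K | 0.1000 | 0.4683 | 1256K |
| w/o Revise | 0.3533 | 0.7378 | 1157K | 0.1000 | 0.4708 | 2551K |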

To validate the contribution of the main state-management mechanisms, we evaluate Mage variants that remove one mechanism at a time. Table 4 shows that all three mechanisms contribute to reliable execution-state management.

Removing Compress consistently hurts task completion, reducing SR by 7.3 pp on Web Shopping and 6.7 pp on Travel Planning, while increasing token consumption to 2.4× and 1.8× that of the full model. This confirms that simply retaining the raw action-observation stream is not a sufficient substitute for memory: without boundary-aware compression, the active state becomes diluted by low-level traces, and Maintain must verify an increasingly long trajectory.
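
As a worked check against Table 4, the quoted SR drops and token ratios for the variant without Compress follow directly from the table entries:

```latex
\begin{aligned}
\Delta\mathrm{SR}_{\text{Shopping}} &= (0.3933 - 0.3200)\times 100 \approx 7.3\ \text{pp},
&\quad \frac{2469\text{K}}{1015\text{K}} &\approx 2.43,\\
\Delta\mathrm{SR}_{\text{Travel}} &= (0.1519 - 0.0852)\times 100 \approx 6.7\ \text{pp},
&\quad \frac{3539\text{K}}{1978\text{K}} &\approx 1.79.
\end{aligned}
```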

Removing Maintain leads to a different failure mode. Although it reduces token usage by avoiding boundary-level verification, SR drops by 5.2--6.7 pp on two domains. This result demonstrates that memory writes should be validated before they become trusted state. Otherwise, incomplete or erroneous subgoal summaries are committed to the active path and later decisions are conditioned on corrupted execution state.
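
The principle of validating a memory write before trusting it can be sketched as a small gate. In the paper this check is performed by an auxiliary LLM; the deterministic toy validator below, along with the names `commit_summary` and `mentions_all_purchases` and the `buy <item>` trace format, are stand-ins assumed for illustration.

```python
from typing import Callable, List, Optional

# A validator receives the raw traces and the candidate summary.
Validator = Callable[[List[str], str], bool]


def commit_summary(traces: List[str], summary: str,
                   validate: Validator) -> Optional[str]:
    """Return the summary only if it passes validation; None means do not trust it."""
    return summary if validate(traces, summary) else None


def mentions_all_purchases(traces: List[str], summary: str) -> bool:
    """Toy check: every item bought in the traces must be named in the summary."""
    bought = [t.split("buy ", 1)[1] for t in traces if t.startswith("buy ")]
    return all(item in summary for item in bought)


traces = ["search usb-c dock", "buy dock-A12", "buy cable-C3"]
complete = commit_summary(traces, "Bought dock-A12 and cable-C3.", mentions_all_purchases)
incomplete = commit_summary(traces, "Bought dock-A12.", mentions_all_purchases)
print(complete)    # Bought dock-A12 and cable-C3.
print(incomplete)  # None
```

The incomplete summary is rejected instead of being committed, so later decisions are never conditioned on a state that silently dropped a purchase.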

Finally, removing Revise lowers SR by 4.0--5.2 pp on two domains. This shows that error detection alone is insufficient: after a failed boundary is identified, the agent must restore the corresponding state, branch away from the flawed segment, and continue from the preserved valid prefix; otherwise, the error remains on the active path and contaminates later decisions.
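
The restore-and-branch step can be sketched on a flattened view of the active path. This is a simplified illustration under stated assumptions, not the paper's data structure: `Segment`, `PathState`, and the convention that `boundary` is the number of leading segments to keep are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    subgoal: str
    summary: str


@dataclass
class PathState:
    active: List[Segment] = field(default_factory=list)          # trusted prefix
    inactive: List[List[Segment]] = field(default_factory=list)  # quarantined branches
    hints: List[str] = field(default_factory=list)               # lessons for the new branch

    def revise(self, boundary: int, hint: str) -> None:
        """Keep active[:boundary], quarantine the rest, and record a hint."""
        if not 0 <= boundary < len(self.active):
            raise ValueError("boundary must point inside the active path")
        self.inactive.append(self.active[boundary:])
        self.active = self.active[:boundary]
        self.hints.append(hint)


state = PathState(active=[
    Segment("flights", "booked flights to Lisbon"),
    Segment("hotel", "booked hotel in Porto"),  # flawed: wrong city
    Segment("dinner", "reserved restaurant near the Porto hotel"),
])

# Restore the state after the first segment and resume on a new branch.
state.revise(boundary=1, hint="hotel must be in Lisbon, the flight destination")
state.active.append(Segment("hotel", "booked hotel in Lisbon"))

print([s.summary for s in state.active])
print(state.hints)
```

Note that the dinner reservation is quarantined together with the hotel booking it depended on: everything after the failed boundary leaves the active path, while the valid prefix is preserved.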

Notably, these weakened variants remain competitive with or superior to the baselines in Table 3. This suggests that organizing memory around the execution path mitigates the state fragmentation caused by semantic retrieval, while the full operation loop further prevents corrupted or failed segments from contaminating the active state.

5. Conclusion

We presented Mage, a memory framework that reframes long-horizon agent memory as active execution-state management rather than similarity-driven retrieval. By organizing interaction history as a two-layer hierarchical state tree, Mage preserves the active root-to-current execution path while compressing completed subgoals and exposing execution hints from previously explored branches. Its four coupled operations allow agents to extend traces, bound context growth, validate newly compressed states, and isolate erroneous segments through branching. Experiments on MemoryArena show that this design improves task success across diverse long-horizon domains while substantially reducing token consumption. These results suggest that preserving execution-state structure is a key principle for building reliable and efficient memory systems for real-world LLM agents.
