ENGRAM: Effective, Lightweight Memory Orchestration for Conversational Agents

Abstract

Large language models (LLMs) deployed in user-facing applications require long-horizon consistency: the capacity to remember prior interactions, respect user preferences, and ground reasoning in past events. However, contemporary memory systems often adopt complex architectures such as knowledge graphs, multi-stage retrieval, and operating-system–style schedulers, which introduce engineering complexity and reproducibility challenges. We present ENGRAM, a lightweight state-of-the-art memory system that organizes conversation into three canonical memory types—episodic, semantic, and procedural—through a single router and retriever. Each user turn is converted into typed memory records with normalized schemas and embeddings and persisted in a database. At query time, the system retrieves top-k dense neighbors per type, merges results with simple set operations, and provides relevant evidence as context to the model. ENGRAM attains state-of-the-art results on the LoCoMo benchmark, a realistic multi-session conversational question-answering (QA) suite for long-horizon memory, and exceeds the full-context baseline by 15 absolute points on LongMemEval, an extended-horizon conversational benchmark, while using only 1% of the tokens. Our results suggest that careful memory typing and straightforward dense retrieval enable effective long-term memory management in language models, challenging the trend toward architectural complexity in this domain.

1. Introduction

LLMs are now embedded in personal assistants, tutoring systems, productivity tools and many other user-facing applications. These deployments demand long-horizon consistency, meaning remembering prior interactions, respecting user preferences, and grounding reasoning in past events. However, unlike humans, LLMs reset once input falls outside the context window. This leads to brittle behaviors such as forgetting, contradictions, or reliance on pre-training. Prior efforts extended the Transformer architecture to increase effective context, but they do not obviate the need for external memory.

Contemporary memory systems have converged on increasingly elaborate architectures, often involving knowledge graphs, multi-stage retrieval pipelines, or operating-system style schedulers. These designs introduce engineering complexity and many degrees of freedom, making reproducibility and analysis difficult. In contrast, we argue for a different point on the design spectrum: a compact memory layer that is intentionally simple yet sufficient to deliver state-of-the-art accuracy and reliable performance.

We introduce ENGRAM, a lightweight memory system that separates three memory types (episodic, semantic, and procedural) and combines them through a single router and a single retriever. Each user message is converted into typed memory records with normalized JSON schemas and embeddings. Records are persisted in a local SQLite store, and at query time the system retrieves top-k neighbors from each memory type, merges results, and provides the evidence set as context to the answering prompt.

Our central claim is that careful memory typing, minimal routing, and straightforward retrieval suffice to achieve state-of-the-art performance on LoCoMo, and to surpass full-context baselines on LongMemEval. We demonstrate consistent gains across single-hop, multi-hop, open-domain, and temporal categories. Beyond headline metrics, the simplicity of ENGRAM makes it an attractive foundation for principled experimentation: each component is small, interpretable, and easy to swap out. Finally, we provide a complete implementation and evaluation harness to support rigorous comparison and foster adoption by the community.

2. Related Work

Existing work on long-term memory for language models spans non-parametric retrieval methods, graph-based structures, and system-level abstractions. Non-parametric approaches augment a model with an external store accessed through dense or lexical retrieval. Nearest-neighbor LMs and large-scale retrieval-pretraining extend this line. They are attractive for mutability, yet often depend on retriever calibration and heuristic chunking. Graph-based methods organize memories as nodes and relations to support structured traversal and multi-hop reasoning. These designs capture compositional structure but introduce orchestration complexity and latency overhead at inference time. System-level approaches treat memory as a schedulable resource, adding lifecycle management and governance primitives.

Our work is closest to recent memory modules aimed at practical deployment. In contrast to multi-layer schedulers and heavy graph construction, ENGRAM retains the benefits of typed memory and semantic retrieval while minimizing moving parts, yielding a system that is straightforward to operate in both research and production at scale.

3. The ENGRAM Architecture

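Figure 1: System overview of ENGRAM. Turns are routed into typed stores, embedded, and persisted, then retrieved through semantic search before being passed to the answering model as context. The figure highlights both the memory-creation stage (routing and extraction) and the retrieval stage (top-k selection, aggregation, and prompt injection). The numbers (1)-(5) mark the main components referenced in the text below.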

A router (1) determines which memory buckets apply to an incoming utterance. For each selected bucket, a lightweight extractor converts the utterance into a normalized record and requests an embedding (2). Records are persisted in SQLite together with their embeddings. At query time, the system retrieves the top-k items by cosine similarity from all buckets (3), merges and deduplicates the results (4), and supplies the selected snippets as context to the answering model (5). The stages (1)-(5), together with the overall architecture, are shown in Figure 1. We formalize each stage in the subsections that follow, and provide an end-to-end QA walkthrough in the appendix.

3.1 Problem Setup

We model a dialogue as a sequence of turns

$$C=\{x_t\}_{t=1}^{T},\qquad x_t=(s_t,u_t,\tau_t),$$

where $s_t\in\{A,B\}$ denotes the speaker identity, $u_t\in\mathcal{U}$ is the turn text drawn from the space of natural language, and $\tau_t\in\mathbb{R}^{+}$ is a temporal marker. ENGRAM learns a mapping $f:C\to M$ that transforms a dialogue $C$ into a durable memory state $M$ capable of supporting answers to questions $q\in\mathcal{U}$ posed after the conversation has unfolded.

3.2 Routing and Storage

Every turn must be mapped into one or more memory types. ENGRAM employs a router (1) that decides which of the three stores a turn $u_t$ should update, represented as a compact three-bit mask $b_t$:

$$r(u_t)\in\{0,1\}^{3},\qquad b_t=\bigl(b_t^{\mathrm{epi}},\,b_t^{\mathrm{sem}},\,b_t^{\mathrm{pro}}\bigr).$$

When the router outputs a one for a given type $k\in\{\mathrm{epi},\mathrm{sem},\mathrm{pro}\}$, the turn is written to that store. For each selected type $k$, the system creates a structured record from the turn and pairs it with an embedding vector $e\in\mathbb{R}^{d}$ produced by the encoder $g:\mathcal{U}\to\mathbb{R}^{d}$. Each memory record therefore has two parts: an interpretable set of fields (e.g., text and timestamp) and a dense vector representation suitable for similarity search. Because the router produces a compact three-bit mask, the routing process remains both interpretable and easy to ablate.
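
As a minimal sketch of the write path described above, the following Python routes a turn with a three-bit mask and persists one row per selected type in SQLite. The keyword router `toy_router`, the character-count encoder `toy_embed`, and the table layout are all hypothetical stand-ins chosen for illustration; the source does not specify how the router or encoder are implemented.

```python
import json
import sqlite3
from typing import List, Tuple

TYPES = ("epi", "sem", "pro")  # order of the bits in the mask b_t

def toy_router(text: str) -> Tuple[int, int, int]:
    """Hypothetical keyword router returning the mask (epi, sem, pro)."""
    lowered = text.lower()
    epi = int(any(w in lowered for w in ("yesterday", "last week", "visited")))
    sem = int(any(w in lowered for w in ("i like", "i prefer", "my favorite")))
    pro = int(any(w in lowered for w in ("first,", "step", "how to")))
    return (epi, sem, pro)

def toy_embed(text: str, dim: int = 8) -> List[float]:
    """Stand-in for the encoder g: a deterministic bag-of-characters vector."""
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    return vec

def write_turn(conn: sqlite3.Connection, speaker: str, text: str, ts: str) -> Tuple[int, int, int]:
    """Route one turn and store a record plus embedding for each selected type."""
    mask = toy_router(text)
    for mem_type, bit in zip(TYPES, mask):
        if bit:
            conn.execute(
                "INSERT INTO memory (mem_type, speaker, ts, text, embedding) "
                "VALUES (?, ?, ?, ?, ?)",
                (mem_type, speaker, ts, text, json.dumps(toy_embed(text))),
            )
    conn.commit()
    return mask

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memory (id INTEGER PRIMARY KEY, mem_type TEXT, "
    "speaker TEXT, ts TEXT, text TEXT, embedding TEXT)"
)
mask = write_turn(conn, "A", "Yesterday I visited Kyoto, and I prefer trains.", "2024-05-01")
stored_types = [row[0] for row in conn.execute("SELECT mem_type FROM memory ORDER BY id")]
print(mask, stored_types)
```

Because the mask is just three bits, ablating a store in this sketch amounts to forcing the corresponding bit to zero.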

3.3 Typed Memory Stores

Once turns are routed and stored, ENGRAM organizes them into three typed stores (2): episodic $m^{\mathrm{epi}}$, semantic $m^{\mathrm{sem}}$, and procedural $m^{\mathrm{pro}}$. Episodic memory encodes events that unfold in time, semantic memory preserves stable facts or preferences, and procedural memory retains instructions or workflows. Formally, we represent records in each memory store as

$$m^{\mathrm{epi}}=(t,\sigma,\delta,e),\qquad m^{\mathrm{sem}}=(f,\delta,e),\qquad m^{\mathrm{pro}}=(t,c,\delta,e),$$

where $t$ is a concise title, $\sigma$ a short summary, $\delta$ a temporal anchor, and $e\in\mathbb{R}^{d}$ an embedding vector from $g$. For semantic records, $f$ is a fact string, and for procedural records, $c$ is normalized content that may correspond to multi-step instructions. These typed records populate the memory state for a user

$$M=(M_{\mathrm{epi}},\,M_{\mathrm{sem}},\,M_{\mathrm{pro}}),$$

where each $M_k$ is a finite sequence of records of the corresponding schema: episodic $(t,\sigma,\delta,e)$, semantic $(f,\delta,e)$, and procedural $(t,c,\delta,e)$. Here, the typed separation constrains extraction, reduces competition at retrieval, and exposes structure that can be directly inspected by researchers or downstream models.
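
The three schemas can be written down directly as data classes. The sketch below is one possible encoding, assuming string-valued temporal anchors and list-valued embeddings; the class and field names are illustrative choices, not identifiers from the ENGRAM implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class EpisodicRecord:
    title: str              # t: concise title
    summary: str            # sigma: short summary
    anchor: str             # delta: temporal anchor
    embedding: List[float]  # e: vector from the encoder g

@dataclass
class SemanticRecord:
    fact: str               # f: fact string
    anchor: str             # delta
    embedding: List[float]  # e

@dataclass
class ProceduralRecord:
    title: str              # t
    content: str            # c: normalized, possibly multi-step content
    anchor: str             # delta
    embedding: List[float]  # e

@dataclass
class MemoryState:
    """M = (M_epi, M_sem, M_pro): one finite sequence of records per type."""
    epi: List[EpisodicRecord] = field(default_factory=list)
    sem: List[SemanticRecord] = field(default_factory=list)
    pro: List[ProceduralRecord] = field(default_factory=list)

    def counts(self) -> Dict[str, int]:
        return {"epi": len(self.epi), "sem": len(self.sem), "pro": len(self.pro)}

state = MemoryState()
state.epi.append(EpisodicRecord("Kyoto trip", "Visited Kyoto by train.", "2024-05-01", [0.1, 0.9]))
state.sem.append(SemanticRecord("Prefers trains over flights.", "2024-05-01", [0.8, 0.2]))
print(state.counts())
```

Keeping the schemas distinct means an extractor for one type cannot silently emit fields that belong to another, which is one way to read the claim that typing "constrains extraction".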

3.4 Dense Retrieval

Given the memory state $M$, the system must retrieve relevant records at query time. A query $q$ is embedded as $e_q=g(q)$. For each store $k\in\{\mathrm{epi},\mathrm{sem},\mathrm{pro}\}$ and record $m\in M_k$, ENGRAM computes cosine similarity and selects the top-k items within that store (3). These selected memories represent the most relevant items per type (e.g., events from episodic, facts from semantic, or instructions from procedural memory). The retrieved records are then merged and deduplicated across stores. Finally, the combined results are truncated (4) to a fixed budget of $K=25$, an intentional choice motivated by our ablation analysis (see Appendix Figure 1).

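$$R_k(q)=\operatorname{TopK}\{\,\operatorname{score}(q,m)\mid m\in M_k\,\},\qquad \tilde{R}(q)=\operatorname{Truncate}_{K}\Bigl(\operatorname{Dedup}\Bigl(\bigcup_{k}R_k(q)\Bigr)\Bigr).$$

$R_k(q)$ denotes the top-k records retrieved from store $k$, and $\tilde{R}(q)$ is the final set after merging, deduplication, and truncation. The result $\tilde{R}(q)$ is the set of memories passed forward to the answer generation stage.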
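
The retrieval stage can be sketched in a few lines of plain Python: per-store top-k by cosine similarity, a merge across stores, deduplication, and truncation to a budget. Two details here are assumptions rather than facts from the source: duplicates are detected by identical text, and the merged list is ordered by similarity before truncation. The toy two-dimensional embeddings are likewise illustrative.

```python
import math
from typing import Dict, List, Tuple

Record = Tuple[str, List[float]]  # (text, embedding)

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query_vec: List[float], stores: Dict[str, List[Record]], k: int, budget: int) -> List[str]:
    scored: List[Tuple[float, str]] = []
    for records in stores.values():
        # R_k(q): top-k records within a single store.
        ranked = sorted(records, key=lambda r: cosine(query_vec, r[1]), reverse=True)
        scored.extend((cosine(query_vec, vec), text) for text, vec in ranked[:k])
    # Merge across stores (assumed: best score first), then deduplicate by text.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    seen, merged = set(), []
    for _, text in scored:
        if text not in seen:
            seen.add(text)
            merged.append(text)
    return merged[:budget]  # Truncate_K

stores = {
    "epi": [("visited Kyoto in May", [1.0, 0.0]), ("moved to Berlin", [0.0, 1.0])],
    "sem": [("prefers trains", [0.9, 0.1]), ("visited Kyoto in May", [1.0, 0.0])],
    "pro": [("how to book tickets", [0.5, 0.5])],
}
result = retrieve([1.0, 0.0], stores, k=1, budget=2)
print(result)  # ['visited Kyoto in May', 'how to book tickets']
```

In the example the same text is retrieved from two stores and survives only once, and the procedural store still contributes an item because each store is ranked separately before the merge.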

3.5 Answer Generation

At this stage, the retrieved memories are organized into speaker-specific banks to handle multi-speaker settings. Given a query $q$ about a dialogue between speakers $A$ and $B$, the system produces $\tilde{R}(q,A)$ and $\tilde{R}(q,B)$. Each record $m$ is serialized as

$$\phi(m)=\delta(m):\operatorname{text}(m),$$

which produces a compact representation $\phi(m)$ that aligns the temporal anchor $\delta(m)$ with the corresponding textual content $\operatorname{text}(m)$. To construct the final input for the model, we define Template as a fixed, non-learned formatting function that deterministically combines the query and serialized records into a natural-language prompt. The final prompt is then assembled (5) to include memory records from both speakers:

$$P(q)=\operatorname{Template}\bigl(q,\{\phi(m)\}_{m\in\tilde{R}(q,A)},\{\phi(m)\}_{m\in\tilde{R}(q,B)}\bigr).$$

This prompt is passed to the language model to produce the answer $\hat{a}=\operatorname{LLM}(P(q))$. Separating speaker-specific banks ensures that evidence remains properly attributed, avoids conflating voices, and allows disambiguation when multiple interlocutors are present. This completes the end-to-end pipeline described in Section 3.
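
A deterministic template of this kind is easy to sketch. The example below serializes each record as "anchor: text" and places the two speaker banks in separate sections; the section headings and the exact prompt wording are hypothetical, since the source only states that the template is fixed and non-learned.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (temporal anchor delta(m), text(m))

def serialize(record: Record) -> str:
    """Serialize one record as 'anchor: text'."""
    anchor, text = record
    return f"{anchor}: {text}"

def template(query: str, bank_a: List[Record], bank_b: List[Record]) -> str:
    """Fixed, non-learned formatting of the query plus two speaker banks."""
    lines = ["Memories for speaker A:"]
    lines += [f"- {serialize(m)}" for m in bank_a]
    lines.append("Memories for speaker B:")
    lines += [f"- {serialize(m)}" for m in bank_b]
    lines.append(f"Question: {query}")
    return "\n".join(lines)

bank_a = [("2024-05-01", "visited Kyoto by train")]
bank_b = [("2024-05-03", "recommended a ramen shop in Kyoto")]
prompt = template("Who went to Kyoto?", bank_a, bank_b)
print(prompt)
```

Because the function is pure string formatting, the same query and retrieved set always yield the same prompt, which keeps the answering step reproducible up to the model's own sampling.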

4. Evaluation

We evaluate ENGRAM on two complementary long-horizon conversational benchmarks. LoCoMo compresses realistic two-speaker dialogues into long, multi-session conversations that probe diverse reasoning categories. LongMemEval instead embeds QA tasks in extended user–assistant histories, stressing durability, updates, and abstention. Our evaluation covers dataset-specific preprocessing, answer-quality and retrieval metrics, a principled baseline suite, and latency analysis. Numerical results appear in the next section.

4.1 Benchmarks

LoCoMo. LoCoMo comprises long-term multi-session dialogues constructed via a human-machine pipeline grounded in personas and event graphs, followed by human edits for long-range consistency. The released benchmark contains 10 dialogues, each averaging 600 turns and 16K tokens across up to 32 sessions. The QA split labels questions into five categories: single-hop, multi-hop, temporal, commonsense/world knowledge, and adversarial/unanswerable. Following common practice, we exclude adversarial/unanswerable items when reporting QA metrics and provide category-wise breakdowns for the remaining four types.

LongMemEval. LongMemEval targets interactive memory in user-assistant settings and evaluates five core abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention (i.e., declining to answer when evidence is insufficient). The benchmark provides 500 curated questions embedded in length-configurable chat histories. We evaluate on LongMemEval$_S$ (115K tokens per problem) and report QA metrics.

4.2 Metrics

We report a suite of metrics that capture semantic correctness, lexical fidelity, retrieval quality, and efficiency.

F1 and B1 (lexical fidelity). We report token-level F1 and BLEU-1/2 (B1/B2) with smoothing as conventional indicators of surface-level agreement. Prior to scoring, predictions and references undergo deterministic normalization (Unicode NFKC, lower-casing, removal of articles and punctuation, whitespace compaction, light numeric canonicalization, and resolution of relative temporal forms to ISO-8601). These metrics quantify wording overlap and enable comparability with prior work, but they are insensitive to factual inversions (e.g., "Fiona was born in March" vs. "Fiona is born in February" yields high overlap despite a critical error in semantic meaning). Consequently, F1/B1 are treated as complementary diagnostics rather than primary measures of correctness.
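
The insensitivity to factual inversions is easy to reproduce. The sketch below implements token-level F1 with a subset of the normalization steps listed above (NFKC, lower-casing, punctuation and article removal, whitespace compaction); numeric canonicalization and ISO-8601 resolution are omitted, and the function names are illustrative rather than taken from the evaluation harness.

```python
import re
import string
import unicodedata
from collections import Counter
from typing import List

def normalize(text: str) -> List[str]:
    """NFKC, lower-case, strip punctuation and articles, split on whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

score = token_f1("Fiona is born in February", "Fiona was born in March")
print(round(score, 2))  # 0.6
```

Three of five tokens match on each side, so the contradictory answer still scores 0.6, which illustrates why lexical overlap is reported only as a secondary diagnostic.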

LLM-as-a-Judge (semantic correctness). To capture factual accuracy beyond lexical overlap, we employ an independent LLM that, given the question, gold answer(s), and prediction, renders a binary semantic-correctness decision based on factual fidelity, relevance, completeness, and contextual appropriateness. We use GPT-4o-mini as the judging model. Because judge inferences are stochastic, each method is evaluated three times over the full test set and we report the mean ± one standard deviation. Judge scores serve as our principal measure of correctness, with F1/B1 reported alongside to contextualize lexical fidelity.
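
Aggregating repeated judge runs is a one-line computation. The sketch below formats a mean ± standard deviation from three per-run accuracies; the run values are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) rather than the population form is an assumption, since the source does not say which is reported.

```python
import statistics
from typing import List

def summarize(run_accuracies: List[float]) -> str:
    """Format mean ± sample standard deviation over repeated judge runs."""
    mean = statistics.mean(run_accuracies)
    std = statistics.stdev(run_accuracies)  # assumed: sample (n-1) form
    return f"{mean:.2f} ± {std:.2f}"

runs = [70.0, 71.0, 72.0]  # hypothetical per-run accuracies (%)
summary = summarize(runs)
print(summary)  # 71.00 ± 1.00
```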

Latency (efficiency). We measure per-question retrieval latency. This involves the search process for memories (e.g., similarity computation, ranking). Additionally, we sum the retrieval latency time and the answer generation time in order to report a full end-to-end latency metric.
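
The latency summary can be sketched as follows: total latency is the per-question sum of search and generation time, and p50/p95 are read off the sorted samples. The nearest-rank percentile rule and the timing values are assumptions for illustration; the source does not state which percentile estimator is used.

```python
from typing import List

def percentile(values: List[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]

# Hypothetical per-question timings in seconds.
search_s = [0.20, 0.30, 0.25, 0.90, 0.22]
generate_s = [0.50, 0.60, 0.55, 0.70, 0.50]
total_s = [s + g for s, g in zip(search_s, generate_s)]  # end-to-end latency

print("search p50/p95:", percentile(search_s, 50), percentile(search_s, 95))
print("total  p50/p95:", percentile(total_s, 50), percentile(total_s, 95))
```

With only five samples the p95 coincides with the slowest question, which is why tail percentiles are usually reported over a full evaluation set.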

4.3 Baselines

Our comparison strategy separates breadth and depth to provide clear, high-signal conclusions.

LoCoMo (breadth). LoCoMo’s realistic two-speaker dialogues and rich category labels make it well-suited for a broad baseline matrix that teases apart design choices. We therefore compare against a diverse set of memory systems, including Mem0 (API-based memory), MemOS (operating-system-style scheduler with MemCubes), LangMem and Zep (commercial or open-source APIs), RAG (retrieval-augmented generation without persistent stores), and a full-context control (upper bound with the entire conversation in context). This suite isolates the contributions of typed memory, minimal routing, and dense-only retrieval against widely used alternatives, enabling category-wise attribution of gains.

LongMemEval (depth). LongMemEval serves as a generalization stress test rather than a leaderboard battleground. The guiding research question is: How does ENGRAM behave when conversational horizons expand by orders of magnitude? To answer this cleanly, we freeze the LoCoMo-validated configuration and compare only against a strong, architecture-agnostic, full-context control. This isolates horizon generalization effects while avoiding confounds from retriever engineering that a broad baseline panel would reintroduce. It also respects reproducibility constraints on a benchmark whose histories are already very long and heavy.

5. Results

5.1 Performance on LoCoMo

To contextualize ENGRAM’s performance, we evaluate against a diverse set of strong baselines spanning the principal design axes of previous state-of-the-art long-horizon memory systems (described in Section 4.3) while holding the LLM backbone fixed (gpt-4o-mini) to ensure consistency on the LoCoMo benchmark.

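Table 1: Comparison of ENGRAM with prior memory systems on the LoCoMo benchmark. Reported metrics are the LLM-as-Judge score, token-level F1, and BLEU-1 (B1) for each main QA category and for the overall aggregate.

| Category | Method | Chunk/Mem Tok | Top-K | LLM-as-Judge Score | F1 | B1 |
|---|---|---|---|---|---|---|
| single hop | langmem | 167 | -- | 67.32 ± 0.10 | 41.74 | 34.82 |
| single hop | mem0 | 1182 | 20 | 73.65 ± 0.12 | 46.26 | 40.54 |
| single hop | memOS | 1596 | 20 | 78.32 ± 0.16 | 45.35 | 38.31 |
| single hop | openai | 4104 | -- | 61.73 ± 0.24 | 36.78 | 30.45 |
| single hop | zep | 2301 | 20 | 51.42 ± 0.17 | 32.44 | 27.37 |
| single hop | ENGRAM | 919 | 20 | 79.90 ± 0.12 | 23.13 | 13.68 |
| multi hop | langmem | 188 | -- | 56.71 ± 0.12 | 36.02 | 27.23 |
| multi hop | mem0 | 1160 | 20 | 57.85 ± 0.20 | 35.43 | 25.87 |
| multi hop | memOS | 1534 | 20 | 63.70 ± 0.01 | 35.37 | 26.56 |
| multi hop | openai | 3967 | -- | 59.82 ± 0.04 | 33.02 | 23.18 |
| multi hop | zep | 2343 | 20 | 42.10 ± 0.06 | 23.11 | 14.69 |
| multi hop | ENGRAM | 919 | 20 | 79.79 ± 0.06 | 18.32 | 13.23 |
| open domain | langmem | 211 | -- | 49.56 ± 0.16 | 29.63 | 23.12 |
| open domain | mem0 | 1151 | 20 | 44.93 ± 0.02 | 27.67 | 19.97 |
| open domain | memOS | 1504 | 20 | 54.56 ± 0.24 | 29.46 | 22.32 |
| open domain | openai | 4080 | -- | 32.87 ± 0.00 | 17.17 | 11.01 |
| open domain | zep | 2284 | 20 | 39.12 ± 0.14 | 19.98 | 13.67 |
| open domain | ENGRAM | 895 | 20 | 72.92 ± 0.17 | 8.56 | 5.47 |
| temporal reasoning | langmem | 138 | -- | 24.21 ± 0.21 | 38.30 | 32.21 |
| temporal reasoning | mem0 | 1185 | 20 | 53.34 ± 0.33 | 45.20 | 38.03 |
| temporal reasoning | memOS | 1662 | 20 | 72.68 ± 0.16 | 53.34 | 45.95 |
| temporal reasoning | openai | 4042 | -- | 29.26 ± 0.02 | 23.40 | 18.36 |
| temporal reasoning | zep | 2302 | 20 | 19.47 ± 0.31 | 18.62 | 14.43 |
| temporal reasoning | ENGRAM | 911 | 20 | 70.79 ± 0.19 | 21.90 | 14.74 |
| overall | langmem | 168 | -- | 55.28 ± 0.13 | 39.22 | 32.16 |
| overall | mem0 | 1177 | 20 | 64.73 ± 0.17 | 42.90 | 36.05 |
| overall | memOS | 1593 | 20 | 72.99 ± 0.14 | 44.20 | 36.75 |
| overall | openai | 4064 | -- | 52.81 ± 0.14 | 32.08 | 25.39 |
| overall | zep | 2308 | 20 | 42.29 ± 0.18 | 27.07 | 21.50 |
| overall | ENGRAM | 916 | 20 | 77.55 ± 0.13 | 21.08 | 13.31 |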

Table 1 reports category-wise and overall performance on LoCoMo. ENGRAM achieves the highest overall semantic correctness, with an LLM-as-Judge score of 77.55 under a shared backbone and prompt. By category, the largest margins appear on multi-hop (79.79) and open-domain (72.92), and ENGRAM also leads on single-hop reasoning (79.90). On temporal-reasoning questions, ENGRAM (70.79) outperforms every baseline except memOS (72.68).

Importantly, these accuracy gains come with a small evidence budget (reported as “Chunk / Mem Tok”), which averages 916 tokens for ENGRAM. This is a reduction of more than 40% relative to memOS (1,593), zep (2,308), and openai (4,064), and of about 22% relative to mem0 (1,177); only langmem (168) uses fewer tokens. The comparison indicates that typed dense retrieval concentrates support into a compact context while preserving (and often improving) correctness. As anticipated in Section 4.2, F1/B1 reward only surface-level overlap, not semantic accuracy. Although ENGRAM leads on LLM-as-Judge overall and in three of the four categories, its lexical scores are lower because some responses are longer. We therefore include F1/B1 only as complementary diagnostics and use the LLM-as-Judge score as the principal correctness signal.

To better understand the impact of the specific ENGRAM architecture, we conduct an ablation that removes typed routing, collapsing all utterances into one undifferentiated store. As reported in Appendix Table 1, this configuration produces a noticeable decline in semantic correctness, with overall performance dropping to 46.56%. These results confirm that typed separation is not merely an architectural convenience but a key factor in concentrating relevant evidence and sustaining accuracy at long horizons.

5.2 Latency Analysis

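Table 2: Latency comparison between the baselines and ENGRAM on the LoCoMo dataset. Latency is reported in seconds as the median (p50) and 95th percentile (p95) of search time and total response time. Overall LLM-as-a-Judge (J) is the mean ± standard deviation of accuracy over the full evaluation set.

| Method | K / chunk size | Search p50 (s) | Search p95 (s) | Total p50 (s) | Total p95 (s) | Overall J |
|---|---|---|---|---|---|---|
| RAG, K=1 | 128 | 0.285 | 0.825 | 0.776 | 1.828 | 47.78 ± 0.05 |
| RAG, K=1 | 256 | 0.251 | 0.713 | 0.748 | 1.631 | 50.23 ± 0.12 |
| RAG, K=1 | 512 | 0.242 | 0.641 | 0.775 | 1.723 | 46.23 ± 0.17 |
| RAG, K=1 | 1024 | 0.242 | 0.721 | 0.823 | 1.961 | 41.02 ± 0.06 |
| RAG, K=1 | 2048 | 0.256 | 0.754 | 0.998 | 2.184 | 38.02 ± 0.08 |
| RAG, K=1 | 4096 | 0.258 | 0.719 | 1.096 | 2.714 | 36.09 ± 0.11 |
| RAG, K=1 | 8192 | 0.275 | 0.841 | 1.402 | 4.482 | 43.57 ± 0.16 |
| RAG, K=2 | 128 | 0.265 | 0.768 | 0.774 | 1.845 | 60.03 ± 0.07 |
| RAG, K=2 | 256 | 0.257 | 0.804 | 0.821 | 1.909 | 60.45 ± 0.24 |
| RAG, K=2 | 512 | 0.245 | 0.833 | 0.832 | 1.745 | 58.27 ± 0.02 |
| RAG, K=2 | 1024 | 0.234 | 0.862 | 0.861 | 1.880 | 50.34 ± 0.19 |
| RAG, K=2 | 2048 | 0.265 | 1.106 | 1.104 | 2.794 | 49.16 ± 0.08 |
| RAG, K=2 | 4096 | 0.271 | 1.461 | 1.461 | 4.832 | 51.82 ± 0.13 |
| RAG, K=2 | 8192 | 0.292 | 2.367 | 2.347 | 9.949 | 61.26 ± 0.06 |
| full-context | -- | -- | -- | 9.940 | 17.832 | 72.60 ± 0.07 |
| langMem | -- | 16.36 | 54.34 | 18.43 | 61.22 | 55.28 ± 0.13 |
| mem0 | 20 | 0.154 | 0.210 | 0.718 | 1.630 | 64.73 ± 0.17 |
| memOS | 20 | 1.806 | 1.983 | 4.965 | 7.957 | 72.99 ± 0.14 |
| openAI | -- | -- | -- | 0.524 | 0.912 | 52.81 ± 0.14 |
| zep | 20 | 0.554 | 0.812 | 1.347 | 3.031 | 42.29 ± 0.18 |
| ENGRAM | 20 | 0.603 | 0.806 | 1.487 | 1.819 | 77.55 ± 0.13 |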

Table 2 shows that ENGRAM achieves both low latency and high semantic correctness on LoCoMo. Its median search and total times are 0.603 s and 1.487 s, alongside an LLM-as-Judge score of 77.55. Relative to full-context, which reports a median total of 9.940 s and a Judge score (J) of 72.60, ENGRAM’s median total latency is roughly 85% lower, and its accuracy is also higher. Because LoCoMo’s dialogues generally fit within modern context windows, full-context is a strong, model-specific reference for “best case” access to history (though not a strict upper bound due to distraction and lost-in-the-middle effects), which makes ENGRAM’s gains especially noteworthy.
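
The 85% figure follows directly from the two median totals in Table 2. The short calculation below makes the arithmetic explicit; the variable names are illustrative.

```python
# Median total latencies (seconds) taken from Table 2.
full_context_p50 = 9.940
engram_p50 = 1.487

reduction = (full_context_p50 - engram_p50) / full_context_p50  # relative decrease
speedup = full_context_p50 / engram_p50                         # ratio of medians

print(f"latency reduction: {reduction:.1%}")
print(f"speedup: {speedup:.2f}x")
```

Expressed as a ratio, the same gap corresponds to full-context taking between six and seven times as long as ENGRAM at the median.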

5.3 Testing with Scale

We probe scaling behavior on LongMemEval$_S$ using the identical configuration validated on LoCoMo. As shown in Table 3, ENGRAM attains an overall LLM-as-Judge of 71.40%, surpassing the full-context control (56.20%) while consuming only 1.0–1.2K tokens per query versus 101K. This is approximately a 99% token reduction in input length. Similar compression-based strategies also preserve performance. Given that full-context affords the backbone maximal direct access to the dialogue, this comparison isolates the benefit of selective retrieval over indiscriminate inclusion: ENGRAM filters the history into a compact evidence set that the model can reliably act on, rather than relying on the model to sift through a prompt roughly two orders of magnitude larger at inference time.
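
The token and accuracy figures quoted above can be checked with a few lines of arithmetic using the values from Table 3. The variable names are illustrative.

```python
# Values taken from Table 3.
full_context_tokens = 101_000
engram_tokens_low, engram_tokens_high = 1_000, 1_200
full_context_j, engram_j = 56.20, 71.40

reduction_best = 1 - engram_tokens_low / full_context_tokens    # with 1.0K tokens
reduction_worst = 1 - engram_tokens_high / full_context_tokens  # with 1.2K tokens
gain_points = round(engram_j - full_context_j, 2)               # absolute points

print(f"token reduction: {reduction_worst:.1%} to {reduction_best:.1%}")
print(f"accuracy gain: {gain_points} points")
```

Both ends of the ENGRAM token range round to a 99% reduction, and the accuracy difference of about 15 points matches the figure stated in the abstract.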

Table 3: Performance comparison on the LongMemEval benchmark. We report GPT-4o-mini accuracy by question type for the full-context baseline and ENGRAM. ENGRAM uses roughly 99% fewer tokens while achieving higher overall accuracy (71.40%).

| Question Type | Full-context (101K tokens) | ENGRAM (1.0K-1.2K tokens) |
|---|---|---|
| single-session-preference | 23.33% | 93.33% |
| single-session-assistant | 92.86% | 87.50% |
| temporal-reasoning | 37.59% | 55.64% |
| multi-session | 39.10% | 60.15% |
| knowledge-update | 79.49% | 74.36% |
| single-session-user | 82.86% | 97.14% |
| Overall J | 56.20% | 71.40% |

The qualitative implication is that ENGRAM’s architectural bias, typed writes and dense set aggregation, acts as an effective information bottleneck at extreme horizons. Instead of amplifying “lost-in-the-middle” effects, the system consistently surfaces high-signal events, facts, and procedures that are causally relevant to the query. We view this as evidence that typed dense memory constitutes a scalable prior for long-horizon reasoning: the same design that produces competitive accuracy and latency on realistic, category-rich conversations also transfers to histories that are orders of magnitude longer, maintaining correctness while drastically reducing the contextual burden placed on the base model. This suggests that ENGRAM’s architectural bias is not only effective for benchmarks, but also promising for deployment in real-world systems where long histories and strict efficiency constraints are the norm.

6. Discussion

Why a simple memory layer works. Across both LoCoMo and LongMemEval$_S$, the ENGRAM rows in Tables 1, 2, and 3 consistently show that a compact design (typed memories, minimal routing, and dense-only retrieval with set aggregation) outperforms substantially more elaborate systems in both accuracy and efficiency. These results challenge the assumption that long-term memory requires increasingly complex schedulers or heavy graph construction, and instead show that typed separation with straightforward retrieval is sufficient for state-of-the-art accuracy.

Typed separation reduces competition at retrieval. A key ingredient is the typed partitioning of memory into episodic, semantic, and procedural stores. By performing per-type top-k retrieval and then merging with deduplication (Figure 1; Section 3), ENGRAM limits cross-type competition and avoids pitting unrelated items against each other in a single global ranking. The results show especially strong gains on multi-hop and open-domain questions, where reasoning benefits from heterogeneous evidence: event timelines (episodic), stable facts (semantic), and instructions or protocols (procedural). Typed retrieval ensures that each evidence mode is represented before the final truncation step, rather than being washed out by a monolithic scorer.

Accuracy-efficiency frontier. The latency comparison in Table 2 shows that ENGRAM shifts the accuracy-efficiency frontier favorably. It achieves high semantic correctness while keeping response times low. This is not only a systems win; it is a modeling win. By constraining the prompt to a compact, high-signal subset, retrieval frees capacity for the answering model to reason, rather than forcing it to sift through long contexts that are prone to distraction and “lost-in-the-middle” effects.

Token economy without accuracy loss. On LoCoMo, ENGRAM operates with a smaller evidence budget than most baselines while still maintaining the strongest semantic correctness. On LongMemEval$_S$ it preserves accuracy while reducing tokens by roughly two orders of magnitude. This reduction improves throughput, lowers inference cost, and reduces variance from prompt-length interactions.

The results suggest several promising directions. First, learning to route could be guided by weak supervision from Judge-derived gradients or distillation from stronger backbones, complementing reinforcement-learning approaches to memory management. Second, dynamic k and per-type budgets may be tuned to query uncertainty. Third, lightweight cross-type re-ranking could preserve the set-aggregation discipline while capturing dependencies across stores. Fourth, editable memory governance opens space for user-facing controls such as redaction and temporal decay. Finally, domain transfer to areas like tutoring, customer support, and on-device assistants would stress efficiency under strict token and latency constraints. Across these directions, the guiding principle remains the same: preserve the simplicity that makes ENGRAM fast, reproducible, and easy to adopt.

7. Conclusion

In this work, we introduced ENGRAM, a compact memory architecture that couples typed extraction with minimal routing and dense-only retrieval via set aggregation. Despite its simplicity, ENGRAM delivers strong empirical performance: it achieves state-of-the-art semantic correctness on LoCoMo (LLM-as-a-Judge 77.55%) under a shared backbone and prompt (Table 1), pairs this accuracy with low latency (median total 1.487 s; Table 2), and generalizes to LongMemEval_S where it surpasses a strong full-context control while using 99% fewer tokens (Table 3). These results indicate that careful memory typing and straightforward retrieval are sufficient to sustain long-horizon reasoning while improving efficiency.
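The retrieval path summarized above (top-k dense neighbors per memory type, merged by set aggregation) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: the `MemoryRecord` fields, the `retrieve` function, the toy 2-D embeddings, and the choice to deduplicate by record id and truncate by cosine similarity are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRecord:
    record_id: str
    memory_type: str  # "episodic", "semantic", or "procedural"
    text: str
    embedding: tuple[float, ...]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_embedding, stores, k_per_type=3, final_k=25):
    """Take top-k per type, union by record id, then keep the final_k best."""
    merged = {}
    for records in stores.values():
        scored = sorted(
            ((cosine(query_embedding, r.embedding), r) for r in records),
            key=lambda pair: pair[0],
            reverse=True,
        )
        for score, rec in scored[:k_per_type]:
            # Set aggregation: a record id appears at most once in the result.
            if rec.record_id not in merged or score > merged[rec.record_id][0]:
                merged[rec.record_id] = (score, rec)
    ranked = sorted(merged.values(), key=lambda pair: pair[0], reverse=True)
    return [rec for _, rec in ranked[:final_k]]


# Toy stores with made-up records and 2-D embeddings.
stores = {
    "episodic": [
        MemoryRecord("e1", "episodic", "Visited Kyoto in April", (1.0, 0.0)),
        MemoryRecord("e2", "episodic", "Started a new job", (0.0, 1.0)),
    ],
    "semantic": [
        MemoryRecord("s1", "semantic", "Prefers vegetarian food", (0.6, 0.8)),
    ],
    "procedural": [
        MemoryRecord("p1", "procedural", "Book trains before hotels", (0.8, 0.6)),
    ],
}
evidence = retrieve((1.0, 0.0), stores, k_per_type=1)
print([r.record_id for r in evidence])
```

Because each store is queried independently and results are merged with a plain union, no cross-store scheduler or graph traversal is needed, which is the simplicity the conclusion emphasizes.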

Limitations and future work. Like any system, ENGRAM has constraints. Its effectiveness depends on dense retrieval quality, and catastrophic misses (e.g., paraphrased facts outside the embedder’s neighborhood) can propagate directly to answers. The router is intentionally minimal, so complex utterances spanning multiple categories may require soft routing or overlapping writes. Our evaluation employs GPT-4o-mini as the judging model, and while we report mean ± standard deviation across runs, judge bias remains a limitation. The current formulation is also text-only and English-centric; extending to multilingual and multimodal settings will require type-aware encoders and language-specific extractors. Finally, typed separation improves interpretability but may under-represent cross-type interactions that benefit from joint modeling (e.g., procedural steps conditioned on evolving episodic context). Approaches that fine-tune explicit memory interfaces offer a complementary path. We see promising directions in learned routing, adaptive per-type budgets, lightweight cross-type re-ranking, and extending memory governance to support editing, temporal decay, and privacy constraints.
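To illustrate what soft routing with overlapping writes might look like, the sketch below writes an utterance to every memory type whose router score clears a threshold, falling back to the single best type otherwise. The router scores, the threshold value, and the function names `soft_route` and `write_overlapping` are hypothetical; the paper's router is intentionally minimal and does not do this.

```python
def soft_route(type_scores, threshold=0.3):
    """Return all memory types scoring at or above the threshold.

    Falls back to the single highest-scoring type so every utterance
    is written somewhere.
    """
    selected = [t for t, s in type_scores.items() if s >= threshold]
    if not selected:
        selected = [max(type_scores, key=type_scores.get)]
    return sorted(selected)


def write_overlapping(utterance, type_scores, stores, threshold=0.3):
    """Append the utterance to every selected store (overlapping writes)."""
    for memory_type in soft_route(type_scores, threshold):
        stores.setdefault(memory_type, []).append(utterance)
    return stores


# Hypothetical router scores for an utterance that mixes an event with a how-to.
scores = {"episodic": 0.50, "procedural": 0.45, "semantic": 0.05}
stores = write_overlapping(
    "Yesterday I learned to reset the router by holding the button for 10 seconds.",
    scores,
    {},
)
print(sorted(stores))
```

The trade-off is storage and retrieval redundancy: an utterance written to two stores can surface twice, so the merge step would need to deduplicate by record identity.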

Taken together, the ENGRAM rows across Tables 1, 2, and 3 tell a coherent story: a simple, typed memory layer can be both accurate and efficient at long horizons. In doing so, we directly challenge the prevailing trend toward increasingly complex graph schedulers and multi-stage pipelines, showing instead that careful memory typing and straightforward dense retrieval suffice to deliver state-of-the-art long-term consistency. We hope ENGRAM and its accompanying artifacts serve the community as a transparent, reproducible baseline and a principled foundation for advancing long-term memory in language models.

Reproducibility Statement

We provide an anonymized repository at https://anonymous.4open.science/r/engram-68FF/ containing code for the ENGRAM system and its evaluation, experiment scripts to reproduce all main results, and detailed setup files that specify datasets and parameters. Implementation specifics and additional analyses are documented in the appendix. Together, these materials are intended to enable an independent reader to reproduce our results with ease and to encourage open discussion.

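Appendix Table 1: Single-store ablation variants. Results on LoCoMo when ENGRAM is restricted to a single memory store, testing each store's contribution to the system independently.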

Episodic-Only Store
Category              LLM Score
single hop            72.06%
multi hop             61.70%
open domain           45.83%
temporal reasoning    67.60%
overall               66.60%

Semantic-Only Store
Category              LLM Score
single hop            65.40%
multi hop             58.87%
open domain           41.67%
temporal reasoning    59.81%
overall               61.56%

Procedural-Only Store
Category              LLM Score
single hop            65.28%
multi hop             53.19%
open domain           47.92%
temporal reasoning    32.09%
overall               55.06%

Single Memory Store
Category              LLM Score
single hop            46.10%
multi hop             35.51%
open domain           33.33%
temporal reasoning    52.44%
overall               46.56%

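Appendix Figure 1: K-value scaling versus accuracy and memory token usage. As the final retrieval budget hyperparameter K increases, the LLM-as-a-Judge score improves, but the retrieved-evidence token budget also grows quickly; K = 25 strikes the strongest balance between accuracy and context cost.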