
Omni-SimpleMem: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory

Memory, Agent, Autoresearch, UNC, UPenn, UCSC, UC Berkeley, Cisco

Liu J, Ling Z, Qiu S, et al. Omni-SimpleMem: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory[J]. arXiv preprint arXiv:2604.01007v2, 2026.

https://github.com/aiming-lab/SimpleMem



Abstract

AI agents increasingly operate over extended time horizons, yet their ability to retain, organize, and recall multimodal experiences remains a critical bottleneck. Building effective lifelong memory requires navigating a vast design space spanning architecture, retrieval strategies, prompt engineering, and data pipelines; this space is too large and interconnected for manual exploration or traditional AutoML to cover effectively. We deploy an autonomous research pipeline to discover Omni-SimpleMem, a unified multimodal memory framework for lifelong AI agents. Starting from a naive baseline (F1 = 0.117 on LoCoMo), the pipeline autonomously executes roughly 50 experiments across two benchmarks, diagnosing failure modes, proposing architectural modifications, and repairing data pipeline bugs, all without human intervention in the inner loop. The resulting system achieves state-of-the-art results on both benchmarks, improving F1 by +411% on LoCoMo and +214% on Mem-Gallery relative to the initial configurations. Critically, the most impactful discoveries are not hyperparameter adjustments: bug fixes (+175%), architectural changes (+44%), and prompt engineering (+188% on specific categories) each individually exceed the cumulative contribution of all hyperparameter tuning, demonstrating capabilities fundamentally beyond the reach of traditional AutoML. We provide a taxonomy of six discovery types and identify four properties that make multimodal memory particularly suited for autoresearch, offering guidance for applying autonomous research pipelines to other AI system domains.


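Figure 1: Overview of the discovery process of Omni-SimpleMem. (a) The discovered architecture: multimodal inputs pass through novelty filtering, are compressed into MAUs, and are retrieved via hybrid dense-sparse-graph search with pyramid expansion. (b) The autonomous optimization trajectory on Mem-Gallery: 39 experiments raise F1 from 0.254 to 0.793 (+214%).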

1. Introduction

Recent advances in large language models have given rise to AI agents capable of tool use, multi-step reasoning, and cross-modal comprehension. These agents interact with users over extended time horizons, accumulating diverse streams of text, images, audio, and video throughout their operation. However, their ability to retain, organize, and recall past experiences remains a critical bottleneck. Building effective lifelong multimodal memory requires navigating a vast design space spanning architectural choices (how to structure storage), retrieval strategies (how to find relevant information), prompt engineering (how to present context to the LLM), and data pipeline configurations (how to ingest and process heterogeneous inputs).


Existing approaches to agent memory fall into two broad categories, each with notable limitations. The first stores raw inputs and retrieves them via embedding similarity, suffering from storage bloat and retrieval noise as the memory grows. The second introduces structured memory management with explicit operations, but typically operates on text alone, discarding rich visual and auditory signals. Crucially, both categories are products of manual research cycles: a human researcher hypothesizes an improvement, implements it, evaluates on a benchmark, and iterates. A single researcher may explore only a handful of configurations per day, and important interactions between tightly coupled components are easily missed. Traditional AutoML methods can search over predefined numerical hyperparameter spaces, but cannot perform the code comprehension, bug diagnosis, architectural redesign, and cross-component reasoning that account for the largest performance gains in complex systems. As a result, existing memory systems inherit the blind spots of their designers, limitations that a more systematic search could avoid.


Recent work on autonomous scientific discovery has shown that LLM agents can discover novel algorithms that outperform human-designed baselines, provided the target domain admits well-defined, quantitative evaluation signals. We ask whether this paradigm extends to complex, multi-component AI systems and answer affirmatively. We deploy AutoResearchClaw, a 23-stage autonomous research pipeline, to discover Omni-SimpleMem, a unified multimodal memory framework for lifelong AI agents. Starting from a naive baseline (F1 = 0.117 on LoCoMo), the pipeline autonomously executes roughly 50 experiments across two benchmarks, iteratively diagnosing failure modes, proposing architectural modifications, repairing data pipeline bugs, and validating improvements, all without human intervention in the inner loop. The resulting system achieves state-of-the-art results on both benchmarks, improving F1 by +411% on LoCoMo (0.117 → 0.598) and +214% on Mem-Gallery (0.254 → 0.797) relative to the initial configurations. Critically, the most impactful discoveries are not hyperparameter adjustments: bug fixes (+175%), architectural changes (+44%), and prompt engineering (+188% on specific categories) each individually exceed the cumulative contribution of all hyperparameter tuning, demonstrating capabilities fundamentally beyond the reach of traditional AutoML.


Among the pipeline's most consequential discoveries are three architectural principles that define Omni-SimpleMem. First, selective ingestion: lightweight perceptual encoders measure the information novelty of each incoming signal and discard redundant content before storage, significantly reducing storage requirements. Second, unified representation: all memories, regardless of modality, are represented as Multimodal Atomic Units (MAUs) that separate lightweight metadata from heavy raw data, enabling fast search over compact metadata while preserving full-content access on demand. Third, progressive retrieval: a pyramid mechanism expands information in three stages (summaries, details, raw evidence), each gated by a token budget, backed by a hybrid search strategy combining dense vector retrieval with sparse keyword matching via set-union merging, a strategy autonomously discovered by the pipeline. Our key observation is that multimodal memory is particularly well-suited for autonomous research pipelines due to four properties: immediate scalar evaluation metrics enabling tight optimization loops, modular architecture allowing isolated component modification, fast iteration cycles (1-2 hours per experiment) supporting dozens of hypotheses within days, and version-controlled code modifications allowing failed experiments to be cleanly reverted.


In summary, our primary contribution is Omni-SimpleMem, a unified multimodal memory framework whose architecture and configuration are discovered through AutoResearchClaw and that achieves state-of-the-art results on both evaluated benchmarks. Beyond the system itself, we provide a comprehensive taxonomy of autonomous discoveries across 50 experiments, revealing that the highest-impact improvements lie beyond the reach of traditional AutoML, and we characterize the convergence behavior, failure modes, and automatic recovery patterns of the pipeline. Our analysis further identifies four properties that make multimodal memory a particularly suitable domain for autoresearch, providing guidance for future applications to other AI system domains.


2. Related Work

Autonomous Scientific Discovery. The vision of AI-driven research has advanced rapidly. The AI Scientist demonstrated end-to-end paper generation at $15 per paper across three ML domains, with its successor AI Scientist v2 eliminating human-authored templates through agentic tree search. FunSearch pairs LLM creativity with programmatic evaluation to discover novel mathematical constructions. AutoResearchClaw introduces a 23-stage autonomous research pipeline with multi-agent debate and self-healing execution. AI-Researcher introduces collaborative multi-agent frameworks with structured iterative refinement, and Bilevel Autoresearch meta-optimizes the search strategy itself. A comprehensive survey by Tie et al. maps the evolution from foundational modules (2022-2023) through closed-loop systems (2024) to scalable human-AI collaboration (2025+). We apply the autoresearch paradigm to multi-component AI system optimization, where the challenge shifts from discovering isolated artifacts to diagnosing and improving interactions across tightly coupled modules.


Multimodal Memory Systems. Memory-augmented LLM agents have evolved from text-only systems, including MemGPT with OS-inspired memory hierarchies, Generative Agents with recency-importance-relevance scoring, SimpleMem with efficient lifelong memory, and A-Mem with LLM-directed reorganization, to multimodal architectures. MemVerse combines episodic-semantic memory with multimodal knowledge graphs but requires three LLM calls per ingested item. Mem0 offers dynamic fact extraction with optional graph memory. VisRAG indexes visual pages directly, avoiding text extraction losses. Claude-Mem provides commercial embedding-based dialogue memory. These systems all require extensive manual tuning of retrieval strategies, ingestion pipelines, and prompt configurations, which is precisely the kind of optimization that autonomous research pipelines can accelerate.


Automated Machine Learning. Neural Architecture Search automates model design but operates primarily on well-defined architectural search spaces with differentiable or reinforcement-learning-based objectives. Hyperparameter optimization methods efficiently navigate continuous and categorical spaces, while systems like Auto-sklearn 2.0 automate full ML pipelines including preprocessing and model selection via meta-learning. More recently, LLM-based agents have been applied to ML tasks: MLAgentBench benchmarks LLM agents on ML research tasks involving code modification, demonstrating the potential of language-guided optimization. Our setting differs fundamentally: the "search space" includes not only hyperparameters and architectural choices, but also prompt engineering, data pipeline bug detection and repair, evaluation format alignment, and cross-component interaction diagnosis, all of which require natural language understanding and code modification capabilities beyond traditional AutoML.


3. Autoresearch-Guided Discovery of Omni-SimpleMem

In this section, we describe the autonomous optimization process and the system it produces. We first overview the pipeline (Section 3.1), then present the discovered Omni-SimpleMem architecture (Section 3.2), followed by the benchmark-specific optimization strategy (Section 3.3).


3.1 Pipeline Overview

As discussed in Section 1, the design space of multimodal memory systems is too large and interconnected for manual exploration to cover effectively. To address this, we deploy AutoResearchClaw, a 23-stage autonomous research pipeline, to systematically optimize Omni-SimpleMem. The pipeline receives three inputs: (1) the SimpleMem codebase, a unimodal text-only lifelong memory framework, as a starting point, (2) two benchmark evaluation harnesses with quantitative metrics (F1), and (3) API access to LLM providers. It then enters an iterative loop: at each step, the pipeline analyzes prior results, generates a hypothesis for improvement, implements the change in code, evaluates on a benchmark, and decides whether to proceed (metric improved by 0.5%), iterate (ambiguous result; refine the current hypothesis), or pivot (two consecutive degradations; revert and try a new direction). Of 50 total experiments, most resulted in a proceed decision, with the remainder split between iterate and pivot. The full pipeline phases (scoping, literature discovery, multi-agent debate, experiment design, sandboxed execution, analysis, documentation, and finalization) are described in Liu et al.; here we focus on the discoveries it makes and the mechanisms that enable them.
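
The proceed/iterate/pivot rule described above can be sketched in a few lines. This is a minimal illustration, not the AutoResearchClaw implementation: the function name `decide`, the reading of "0.5%" as a relative gain over the last accepted score, and the way the degradation streak is tracked are all assumptions made for the example.

```python
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    ITERATE = "iterate"
    PIVOT = "pivot"


def decide(prev_f1, new_f1, prior_degradations, min_rel_gain=0.005):
    """Return (decision, updated degradation streak).

    Assumption: the 0.5% threshold is a relative gain over the last
    accepted F1; the source does not say relative vs. absolute.
    """
    if prev_f1 > 0:
        rel_gain = (new_f1 - prev_f1) / prev_f1
    else:
        rel_gain = float("inf") if new_f1 > 0 else 0.0

    if rel_gain >= min_rel_gain:
        return Decision.PROCEED, 0          # accept the change
    if new_f1 < prev_f1:
        streak = prior_degradations + 1
        if streak >= 2:
            return Decision.PIVOT, 0        # revert, try a new direction
        return Decision.ITERATE, streak     # first degradation: refine
    return Decision.ITERATE, 0              # ambiguous result: refine


if __name__ == "__main__":
    print(decide(0.500, 0.510, 0))  # clear gain
    print(decide(0.500, 0.490, 0))  # first degradation
    print(decide(0.500, 0.480, 1))  # second consecutive degradation
```

Keeping the rule deterministic makes each decision auditable from the experiment log alone.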


3.2 The Discovered Architecture

The pipeline takes SimpleMem, a unimodal text-only lifelong memory framework, as its starting point. We provide AutoResearchClaw with the SimpleMem codebase and instruct it to extend the system from text-only memory to full multimodal support, autonomously designing the necessary architectural components for ingesting, storing, and retrieving heterogeneous signals (text, images, audio, video). Through iterative experimentation, the pipeline converges to an architecture organized around three principles: selective ingestion, progressive retrieval, and structured knowledge (Figure 2).


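Figure 2: Omni-SimpleMem architecture overview. Left: selective ingestion filters multimodal inputs (text, images, audio, video) through modality-specific novelty detectors and creates MAUs with LLM-generated summaries and embeddings. Middle: MAUs are stored in hot storage (summaries, embeddings, metadata) and cold storage (raw content), while entity extraction builds a knowledge graph of typed entities and relations. Right: retrieval combines dense search (FAISS), sparse search (BM25), and graph h-hop search via set-union merging, then progressively expands the results through the pyramid mechanism (summary → full text → raw content) under a token budget B.

3.2.1 Selective Ingestion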

The first principle is selective ingestion: the system first filters redundant inputs, then encapsulates the retained signals into a unified multimodal representation.


Novelty-Based Filtering. Before any data enters the memory store, lightweight perceptual encoders assess the novelty of incoming information and discard redundant content. For vision, CLIP embeddings are compared across consecutive frames to detect scene changes; for audio, VAD speech probability gates retention to reject silence; for text, Jaccard overlap with recent summaries filters near-duplicates. This filtering significantly reduces storage requirements without losing semantic content.
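
The three modality-specific gates can be illustrated with a small sketch. The thresholds (`max_overlap`, `max_sim`, `min_prob`) and all function names are assumptions chosen for the example, since the source does not state them, and the embeddings and speech probabilities are taken as given inputs rather than computed by CLIP or a VAD model here.

```python
import math


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the whitespace-token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def cosine(u, v) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def is_novel_text(text, recent_summaries, max_overlap=0.8):
    """Keep text only if it is not a near-duplicate of any recent summary."""
    return all(jaccard(text, s) < max_overlap for s in recent_summaries)


def is_scene_change(prev_frame_emb, cur_frame_emb, max_sim=0.9):
    """Keep a frame when its embedding drifts away from the previous frame."""
    return cosine(prev_frame_emb, cur_frame_emb) < max_sim


def keep_audio(speech_prob, min_prob=0.5):
    """Keep an audio chunk only if the VAD speech probability is high enough."""
    return speech_prob >= min_prob


if __name__ == "__main__":
    recent = ["alice booked a flight to paris"]
    print(is_novel_text("alice booked a flight to paris", recent))
    print(is_novel_text("bob adopted a puppy last week", recent))
    print(is_scene_change([1.0, 0.0], [0.0, 1.0]))
```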


Multimodal Atomic Units. Signals that pass the novelty filter are encapsulated as Multimodal Atomic Units (MAUs), $M = \langle s, e, p, \tau, m, \text{links} \rangle$, which decouple compact searchable metadata from heavy raw content. Here $s$ is a text summary, $e \in \mathbb{R}^d$ is its embedding, $p$ points to raw content in cold storage, $\tau$ is the timestamp, $m$ is the modality, and the final field stores structural links to other MAUs. This yields a two-tier design: hot storage keeps summaries, embeddings, and temporal/graph metadata for fast retrieval, while cold storage keeps large assets (images, audio, video) and is accessed lazily through $p$.
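
A minimal sketch of the two-tier layout follows, assuming in-memory dictionaries stand in for both tiers. The class and field names (`MAU`, `MemoryStore`, `links`, `cold_reads`) are hypothetical and chosen for the example; the point is only that search touches the hot tier, and raw bytes are fetched through the pointer on demand.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MAU:
    summary: str                  # s: text summary
    embedding: List[float]        # e: summary embedding
    pointer: str                  # p: key of the raw content in cold storage
    timestamp: float              # tau: event time
    modality: str                 # m: "text", "image", "audio", or "video"
    links: List[str] = field(default_factory=list)  # ids of related MAUs


class MemoryStore:
    """Hot tier holds MAU metadata; cold tier holds raw assets."""

    def __init__(self) -> None:
        self.hot: Dict[str, MAU] = {}
        self.cold: Dict[str, bytes] = {}
        self.cold_reads = 0       # counts lazy loads, for illustration

    def add(self, mau_id: str, mau: MAU, raw: bytes) -> None:
        self.hot[mau_id] = mau
        self.cold[mau.pointer] = raw

    def load_raw(self, mau_id: str) -> bytes:
        """Dereference the pointer only when raw content is needed."""
        self.cold_reads += 1
        return self.cold[self.hot[mau_id].pointer]


if __name__ == "__main__":
    store = MemoryStore()
    store.add("m1", MAU("photo of a beach", [0.1, 0.9], "blob/m1", 0.0, "image"), b"\x89PNG")
    print(store.hot["m1"].summary, store.cold_reads)   # metadata access, no cold read
    print(len(store.load_raw("m1")), store.cold_reads)
```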


3.2.2 Progressive Retrieval

Once memories are ingested and stored as MAUs, the next challenge is how to retrieve them efficiently at query time. The second principle is progressive retrieval: rather than loading all retrieved content into the LLM context at once, Omni-SimpleMem expands information in stages under explicit token budgets.


Hybrid Dense-Sparse Search. Given a user query q, dense retrieval via FAISS, a library for efficient similarity search over high-dimensional vectors, yields semantically similar candidates D(q) by inner-product search over L2-normalized MAU embeddings. In parallel, BM25 scoring over MAU summaries yields keyword-matched candidates K(q). A key discovery of the autonomous pipeline is set-union merging: empirically, score-based re-ranking (the standard approach) disrupts semantic ordering and degrades performance. Instead, dense results retain their original ranking and BM25-only results are appended:


$$R(q) = D(q) \cup \bigl(K(q) \setminus D(q)\bigr).$$
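
The merge itself is simple enough to show directly. The sketch below assumes both retrievers return ranked lists of MAU ids without internal duplicates; the function name `set_union_merge` is hypothetical.

```python
def set_union_merge(dense_ids, sparse_ids):
    """Keep the dense ranking intact and append BM25-only hits in BM25 order.

    No scores are compared across the two retrievers, so the semantic
    ordering produced by dense retrieval is never disturbed.
    """
    seen = set(dense_ids)
    merged = list(dense_ids)
    for mau_id in sparse_ids:
        if mau_id not in seen:
            seen.add(mau_id)
            merged.append(mau_id)
    return merged


if __name__ == "__main__":
    dense = ["a", "b", "c"]          # ranked by embedding similarity
    sparse = ["c", "d", "a", "e"]    # ranked by BM25
    print(set_union_merge(dense, sparse))
```

In contrast to score fusion, this needs no normalization between inner-product scores and BM25 scores, which live on different scales.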

Pyramid Retrieval. The hybrid search above produces a candidate set $R(q)$, each candidate scored once by cosine similarity $\mathrm{sim}(q, M_i) = e_q^\top e_i$ during dense retrieval. The pyramid mechanism then progressively expands the content of these candidates across three levels, reusing this score to gate each transition: Level 1 returns only summaries (roughly 10 tokens each) for the top-$k$ candidates by similarity; Level 2 loads full text or detailed captions for candidates whose similarity exceeds a threshold $\theta$; Level 3 loads raw content (images, audio) from cold storage under an explicit token budget $B$, expanding items greedily in decreasing similarity-per-token order. All transitions are governed by deterministic rules rather than LLM judgment, avoiding additional latency while adapting context depth to each query's complexity.
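
The three levels can be sketched as one deterministic function. This is an illustration under stated assumptions rather than the released code: the `Candidate` fields and the function name are hypothetical, the budget is applied to Level 3 only, and Level 3 is restricted to candidates that already passed the Level 2 threshold. The defaults mirror the configuration reported in Section 4.1 (top-k = 20, θ = 0.4, B = 6,000).

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    mau_id: str
    sim: float         # similarity score from dense retrieval
    summary: str
    full_text: str
    raw_tokens: int    # token cost of loading the raw content


def pyramid_expand(candidates, top_k=20, theta=0.4, budget=6000):
    # Level 1: summaries for the top-k candidates by similarity.
    ranked = sorted(candidates, key=lambda c: c.sim, reverse=True)[:top_k]
    level1 = [c.summary for c in ranked]

    # Level 2: full text for candidates above the threshold.
    passed = [c for c in ranked if c.sim > theta]
    level2 = [c.full_text for c in passed]

    # Level 3: raw content, greedy by similarity per token under the budget.
    eligible = [c for c in passed if c.raw_tokens > 0]
    eligible.sort(key=lambda c: c.sim / c.raw_tokens, reverse=True)
    level3, used = [], 0
    for c in eligible:
        if used + c.raw_tokens <= budget:
            level3.append(c.mau_id)
            used += c.raw_tokens
    return level1, level2, level3


if __name__ == "__main__":
    cands = [
        Candidate("A", 0.9, "sum A", "text A", 4000),
        Candidate("B", 0.8, "sum B", "text B", 1000),
        Candidate("C", 0.5, "sum C", "text C", 3000),
        Candidate("D", 0.3, "sum D", "text D", 100),
    ]
    print(pyramid_expand(cands))
```

In the usage example, D never leaves Level 1 because its similarity is below θ, and C is skipped at Level 3 because loading it would exceed the budget after B and A.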


3.2.3 Knowledge Graph-Augmented Retrieval

While hybrid search and pyramid retrieval handle queries that can be answered from individual MAUs, many real-world queries require reasoning over multiple connected facts (e.g., "What gift did I give to the person I met at the conference in March?"). The third principle is therefore structured knowledge: Omni-SimpleMem maintains a knowledge graph G=(V,E) that captures entities and relationships across all MAUs.


During MAU creation, an LLM extracts entities and directed relations from each summary, producing entity-relation triples. Each entity carries a type label from 7 categories (Person, Location, Event, Concept, Time, Organization, Object) and is linked back to its source MAU. As new MAUs are ingested, the same real-world entity may appear under different surface forms (e.g., "Dr. Smith" vs. "John Smith"). To prevent node fragmentation, entity resolution merges entities whose hybrid similarity, combining cosine similarity over name embeddings with Jaro-Winkler string similarity, exceeds a threshold.
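
The merge test for entity resolution can be sketched as below. The Jaro-Winkler routine follows the standard definition (prefix scale 0.1, prefix capped at 4 characters). The equal weighting `alpha = 0.5`, the threshold `0.85`, and the function name `should_merge` are assumptions for illustration, since the source does not specify how the two similarities are combined; name embeddings are passed in rather than computed.

```python
import math


def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    n1, n2 = len(s1), len(s2)
    if n1 == 0 or n2 == 0:
        return 0.0
    window = max(max(n1, n2) // 2 - 1, 0)
    m1, m2 = [False] * n1, [False] * n2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(n2, i + window + 1)):
            if not m2[j] and s2[j] == ch:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(n1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / n1 + matches / n2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def cosine(u, v) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def should_merge(name_a, name_b, emb_a, emb_b, alpha=0.5, threshold=0.85):
    """Hybrid similarity: weighted mix of embedding cosine and Jaro-Winkler."""
    score = alpha * cosine(emb_a, emb_b) + (1 - alpha) * jaro_winkler(
        name_a.lower(), name_b.lower()
    )
    return score >= threshold


if __name__ == "__main__":
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
    print(should_merge("Alice", "Zurich", [1.0, 0.0], [0.0, 1.0]))
```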


At query time, the system identifies seed entities $V_q \subseteq V$ mentioned in the query and performs bounded neighborhood expansion within $h$ hops. Each reached entity is scored with distance-decayed relevance $r_G(v) = \beta^{d(v, V_q)} \cdot \mathrm{conf}(v)$, where $d(v, V_q)$ is the shortest path distance to any seed entity and $\beta \in (0,1)$ is a decay factor. MAUs linked to high-scoring graph entities are merged with the hybrid search results from $R(q)$, providing both direct content matches and relationally connected evidence for answer generation.
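
The bounded expansion and distance-decayed score can be sketched with a multi-source breadth-first search. The adjacency-dict representation, the default `h = 2` and `beta = 0.5`, the fallback confidence of 1.0, and the function name are assumptions for the example; the caller is assumed to have symmetrized the adjacency lists if edges should be traversed in both directions.

```python
from collections import deque


def graph_scores(adj, seeds, conf, h=2, beta=0.5):
    """Score every entity within h hops of any seed as beta**distance * conf.

    BFS from all seeds at once yields, for each reached node, the
    shortest-path distance to the nearest seed.
    """
    dist = {s: 0 for s in seeds}
    queue = deque(dist)
    while queue:
        v = queue.popleft()
        if dist[v] == h:
            continue                      # do not expand past h hops
        for u in adj.get(v, ()):
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return {v: (beta ** d) * conf.get(v, 1.0) for v, d in dist.items()}


if __name__ == "__main__":
    adj = {
        "me": ["conference"],
        "conference": ["me", "alice"],
        "alice": ["conference", "gift"],
        "gift": ["alice"],
    }
    print(graph_scores(adj, ["me"], {"alice": 0.8}))
```

With `h = 2` the `gift` node (three hops from the seed) is not reached; raising `h` trades recall on multi-hop questions against noisier context.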


3.3 Benchmark-Specific Optimization

Having described the multimodal architecture that the pipeline discovers from SimpleMem, we now turn to how it optimizes this architecture for each target benchmark. The pipeline employs a two-phase strategy: rapid iteration on a small training subset, followed by evaluation on the held-out test set.


Development subset for fast iteration. For each benchmark, the pipeline selects a small representative subset for rapid experimentation during the optimization loop. On LoCoMo, a small subset of conversations is used for iterative development, enabling each experiment to complete in under 2 hours. On Mem-Gallery, a small subset of datasets is used, with each experiment completing in minutes. This design enables the pipeline to explore dozens of hypotheses within days. After the optimization trajectory converges, the final configuration is evaluated on the complete benchmark to ensure generalization and to maintain consistency with the evaluation protocols used by prior memory systems.


Iterative diagnosis and repair. During each optimization cycle, the pipeline autonomously diagnoses and repairs failures at two levels. At the execution level, when an experiment fails or produces unexpected outputs, a self-healing module classifies the error (API error, dependency error, runtime exception, output format mismatch) and generates a targeted fix. For example, when the embedding service returned 403 errors due to an expired API key, the module detected the authentication failure pattern and switched to a local sentence-transformer backend without manual intervention. At the semantic level, when experiments succeed but produce unexpectedly poor metrics, the pipeline performs deeper analysis.
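
Execution-level triage of this kind can be approximated with a rule table. The sketch below is purely illustrative: the regular expressions, the function names, and the `"remote"`/`"local"` backend labels are hypothetical and are not taken from the pipeline; only the four error categories and the fall-back-to-local behaviour come from the description above.

```python
import re

# Hypothetical patterns; checked in order, first match wins.
RULES = [
    ("api_error", re.compile(r"\b(401|403|429)\b|authentication|rate limit", re.I)),
    ("dependency_error", re.compile(r"ModuleNotFoundError|ImportError|No module named", re.I)),
    ("output_format_mismatch", re.compile(r"JSONDecodeError|expected .* format|schema", re.I)),
]


def classify_failure(log: str) -> str:
    """Map a failure log to one of the four error categories."""
    for label, pattern in RULES:
        if pattern.search(log):
            return label
    return "runtime_exception"   # default bucket for everything else


def choose_embedding_backend(log: str, current: str = "remote") -> str:
    """Fall back to a local embedding backend when the remote API is failing."""
    if current == "remote" and classify_failure(log) == "api_error":
        return "local"
    return current


if __name__ == "__main__":
    print(classify_failure("HTTP 403 Forbidden: API key expired"))
    print(choose_embedding_backend("HTTP 403 Forbidden: API key expired"))
    print(classify_failure("ZeroDivisionError: division by zero"))
```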


4. Experiments

We evaluate Omni-SimpleMem along two dimensions: (1) the autonomous optimization process, specifically whether the pipeline discovers meaningful improvements across diverse benchmarks, and (2) final system quality, examining whether the discovered architecture achieves state-of-the-art results and whether individual components contribute meaningfully.


4.1 Experimental Setup

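Figure 3: Optimization trajectories on LoCoMo (top, 9 iterations) and Mem-Gallery (bottom, 39 experiments across 7 phases). Solid lines mark accepted iterations; failed or reverted experiments are marked with ×. The red dashed line marks the prior SOTA. Key discoveries are annotated at each phase.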

Benchmarks. We evaluate on two benchmarks spanning complementary types of memory-dependent reasoning: LoCoMo (1,986 QA pairs across multi-session dialogues, token-level F1) and Mem-Gallery (1,711 QA pairs from 240 multimodal dialogues with 1,003 grounded images, F1). Detailed benchmark descriptions are provided in the appendix.


Baselines. We compare against six memory systems representing diverse design philosophies: MemVerse (hierarchical episodic-semantic memory with multimodal knowledge graph), Mem0 (dynamic fact extraction with optional graph memory), Claude-Mem (commercial embedding-based dialogue memory), A-MEM (LLM-directed memory reorganization), MemGPT (OS-inspired memory hierarchies), and SimpleMem (efficient lifelong memory with atomization and adaptive pruning). All are evaluated under identical splits and protocols.


Implementation. Dense retrieval uses FAISS with all-MiniLM-L6-v2 embeddings (384d); sparse retrieval uses BM25; visual novelty filtering uses frozen CLIP ViT-B/32. Knowledge graph extraction uses GPT-4o in JSON mode. Default configuration: top-k = 20, θ = 0.4, B = 6,000 tokens. Full per-benchmark configuration details are provided in the appendix.


4.2 Optimization Trajectories

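Table 1: Comparison on LoCoMo (first six metric columns) and Mem-Gallery (last five metric columns) across five LLM backbones. LoCoMo columns: MH = multi-hop, SH = single-hop, Tmp = temporal, Open = open-domain, Adv = adversarial, All = overall F1. Mem-Gallery columns: F1, EM = exact match, B/B-1/B-2 = BLEU/BLEU-1/BLEU-2.

| Backbone | Method | MH | SH | Tmp | Open | Adv | All | F1 | EM | B | B-1 | B-2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | MemVerse | 0.260 | 0.157 | 0.196 | 0.192 | 0.944 | 0.365 | 0.505 | 0.330 | 0.270 | 0.440 | 0.355 |
| GPT-4o | Mem0 | 0.309 | 0.156 | 0.217 | 0.295 | 0.857 | 0.397 | 0.298 | 0.192 | 0.182 | 0.268 | 0.224 |
| GPT-4o | Claude-Mem | 0.294 | 0.153 | 0.167 | 0.243 | 0.915 | 0.383 | 0.210 | 0.148 | 0.148 | 0.194 | 0.170 |
| GPT-4o | A-MEM | 0.295 | 0.174 | 0.200 | 0.266 | 0.898 | 0.394 | 0.370 | 0.252 | 0.240 | 0.332 | 0.285 |
| GPT-4o | MemGPT | 0.305 | 0.188 | 0.246 | 0.305 | 0.843 | 0.404 | 0.435 | 0.298 | 0.275 | 0.390 | 0.335 |
| GPT-4o | SimpleMem | 0.318 | 0.195 | 0.235 | 0.308 | 0.802 | 0.432 | 0.535 | 0.348 | 0.310 | 0.468 | 0.390 |
| GPT-4o | Omni-SimpleMem | 0.556 | 0.365 | 0.255 | 0.641 | 0.835 | 0.598 | 0.797 | 0.449 | 0.366 | 0.627 | 0.505 |
| GPT-4o-mini | MemVerse | 0.147 | 0.074 | 0.106 | 0.093 | 0.747 | 0.290 | 0.450 | 0.295 | 0.248 | 0.395 | 0.330 |
| GPT-4o-mini | Mem0 | 0.285 | 0.112 | 0.179 | 0.297 | 0.761 | 0.364 | 0.291 | 0.188 | 0.185 | 0.265 | 0.223 |
| GPT-4o-mini | Claude-Mem | 0.245 | 0.102 | 0.122 | 0.215 | 0.845 | 0.338 | 0.272 | 0.175 | 0.172 | 0.245 | 0.210 |
| GPT-4o-mini | A-MEM | 0.278 | 0.091 | 0.163 | 0.260 | 0.823 | 0.357 | 0.330 | 0.222 | 0.205 | 0.298 | 0.252 |
| GPT-4o-mini | MemGPT | 0.283 | 0.113 | 0.182 | 0.289 | 0.776 | 0.364 | 0.398 | 0.262 | 0.242 | 0.355 | 0.298 |
| GPT-4o-mini | SimpleMem | 0.300 | 0.128 | 0.178 | 0.312 | 0.891 | 0.404 | 0.498 | 0.318 | 0.290 | 0.435 | 0.368 |
| GPT-4o-mini | Omni-SimpleMem | 0.544 | 0.196 | 0.177 | 0.588 | 0.779 | 0.519 | 0.749 | 0.403 | 0.334 | 0.583 | 0.465 |
| GPT-4.1-nano | MemVerse | 0.146 | 0.061 | 0.169 | 0.115 | 0.711 | 0.256 | 0.470 | 0.308 | 0.255 | 0.410 | 0.340 |
| GPT-4.1-nano | Mem0 | 0.290 | 0.134 | 0.194 | 0.277 | 0.537 | 0.310 | 0.268 | 0.176 | 0.156 | 0.238 | 0.199 |
| GPT-4.1-nano | Claude-Mem | 0.087 | 0.029 | 0.119 | 0.047 | 0.705 | 0.246 | 0.303 | 0.194 | 0.172 | 0.268 | 0.223 |
| GPT-4.1-nano | A-MEM | 0.045 | 0.016 | 0.142 | 0.050 | 0.747 | 0.216 | 0.365 | 0.242 | 0.225 | 0.325 | 0.275 |
| GPT-4.1-nano | MemGPT | 0.287 | 0.130 | 0.234 | 0.279 | 0.556 | 0.316 | 0.360 | 0.238 | 0.218 | 0.318 | 0.268 |
| GPT-4.1-nano | SimpleMem | 0.298 | 0.145 | 0.210 | 0.285 | 0.648 | 0.342 | 0.518 | 0.338 | 0.300 | 0.452 | 0.380 |
| GPT-4.1-nano | Omni-SimpleMem | 0.477 | 0.216 | 0.244 | 0.583 | 0.722 | 0.492 | 0.780 | 0.430 | 0.353 | 0.610 | 0.488 |
| GPT-5.1 | MemVerse | 0.287 | 0.173 | 0.277 | 0.297 | 0.780 | 0.383 | 0.478 | 0.312 | 0.262 | 0.418 | 0.345 |
| GPT-5.1 | Mem0 | 0.292 | 0.160 | 0.261 | 0.298 | 0.819 | 0.390 | 0.270 | 0.175 | 0.157 | 0.240 | 0.200 |
| GPT-5.1 | Claude-Mem | 0.289 | 0.171 | 0.264 | 0.292 | 0.814 | 0.388 | 0.305 | 0.203 | 0.188 | 0.279 | 0.230 |
| GPT-5.1 | A-MEM | 0.287 | 0.164 | 0.246 | 0.284 | 0.826 | 0.385 | 0.408 | 0.268 | 0.242 | 0.365 | 0.302 |
| GPT-5.1 | MemGPT | 0.288 | 0.165 | 0.249 | 0.294 | 0.806 | 0.385 | 0.425 | 0.275 | 0.250 | 0.378 | 0.315 |
| GPT-5.1 | SimpleMem | 0.305 | 0.178 | 0.272 | 0.305 | 0.807 | 0.418 | 0.538 | 0.350 | 0.312 | 0.470 | 0.395 |
| GPT-5.1 | Omni-SimpleMem | 0.598 | 0.367 | 0.307 | 0.676 | 0.747 | 0.613 | 0.810 | 0.460 | 0.374 | 0.639 | 0.515 |
| GPT-5-nano | MemVerse | 0.208 | 0.203 | 0.168 | 0.252 | 0.741 | 0.366 | 0.478 | 0.315 | 0.262 | 0.420 | 0.345 |
| GPT-5-nano | Mem0 | 0.264 | 0.143 | 0.237 | 0.270 | 0.737 | 0.352 | 0.283 | 0.176 | 0.165 | 0.250 | 0.210 |
| GPT-5-nano | Claude-Mem | 0.091 | 0.063 | 0.092 | 0.088 | 0.736 | 0.275 | 0.350 | 0.249 | 0.217 | 0.315 | 0.264 |
| GPT-5-nano | A-MEM | 0.260 | 0.149 | 0.223 | 0.257 | 0.745 | 0.348 | 0.505 | 0.332 | 0.290 | 0.445 | 0.368 |
| GPT-5-nano | MemGPT | 0.267 | 0.151 | 0.226 | 0.271 | 0.744 | 0.355 | 0.388 | 0.255 | 0.230 | 0.345 | 0.288 |
| GPT-5-nano | SimpleMem | 0.278 | 0.200 | 0.245 | 0.282 | 0.824 | 0.388 | 0.522 | 0.340 | 0.302 | 0.458 | 0.385 |
| GPT-5-nano | Omni-SimpleMem | 0.357 | 0.371 | 0.253 | 0.561 | 0.719 | 0.522 | 0.787 | 0.437 | 0.357 | 0.617 | 0.494 |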

Figure 3 visualizes the optimization trajectories. The pipeline completed roughly 50 experiments across both benchmarks in about 72 hours of wall-clock time, coverage that would take a human researcher approximately 4 weeks at about 3 experiments per day. We highlight the most impactful discoveries; detailed per-iteration tables are provided in the appendix.


LoCoMo (9 iterations, F1: 0.117 → 0.598). The pipeline executed 9 successful iterations over 48 hours, with 2 additional experiments automatically reverted (full trajectory in Appendix Table). The most impactful discovery (Iter 1, +175%) was identifying that the API call lacked a response_format parameter, a one-line bug causing 9× verbosity that destroyed F1 precision. In Iter 5, the pipeline discovered that all 4,277 MAU timestamps had been corrupted to the ingestion date and autonomously generated a keyword-matching script that corrected 99.98% of them without re-ingestion. The pipeline also discovered that set-union merging of FAISS and BM25 results (Iter 2) significantly outperforms score-based fusion, a finding confirmed by ablation (Section 4.4).
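
Why verbosity is so costly under token-level F1 can be seen in a toy calculation. The sketch below uses a plain whitespace tokenizer and lowercasing; the benchmark's exact answer normalization is not described here, so treat that part as an assumption. The example strings are invented for illustration.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and the gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = "7 may 2023"
    concise = "7 may 2023"
    # A 9x longer answer that still contains every gold token.
    verbose = gold + " " + " ".join(["filler"] * 24)
    print(token_f1(concise, gold))
    print(token_f1(verbose, gold))
```

The verbose answer keeps recall at 1 but drops precision to 3/27, so a response nine times longer than the gold answer caps F1 at 0.2 even when it is fully correct.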


Mem-Gallery (39 experiments, F1: 0.254 → 0.797). Optimization spanned 7 phases (full trajectory in Appendix Table). The single largest improvement (+53%) came from discovering that returning full original dialogue text instead of LLM-generated summaries dramatically improves token-overlap F1, a non-obvious finding, since summaries are traditionally preferred for efficiency. The pipeline also found that prompt constraint positioning (before vs. after the question) matters more than constraint content, with one category improving +188% from this change alone. After Phase 7, four independent runs yielded F1 in [0.791, 0.797], confirming the performance ceiling and triggering the pipeline's decision to stop.


4.3 Main Results

To contextualize these results against existing memory systems, we conduct a controlled comparison of Omni-SimpleMem against six baselines across five LLM backbones (GPT-4o, GPT-4o-mini, GPT-4.1-nano, GPT-5.1, and GPT-5-nano). Table 1 reports per-category F1 on LoCoMo and five evaluation metrics on Mem-Gallery across all backbones.


On LoCoMo, Omni-SimpleMem achieves the highest overall F1 across all backbones, ranging from 0.492 (GPT-4.1-nano) to 0.613 (GPT-5.1), substantially outperforming SimpleMem (0.342–0.432), the current state-of-the-art on LoCoMo. Omni-SimpleMem dominates on multi-hop, single-hop, and open-domain categories, with particularly large margins on open-domain questions.


On Mem-Gallery, Omni-SimpleMem achieves F1 ranging from 0.749 to 0.810, consistently outperforming all memory baselines by a wide margin. SimpleMem is again the strongest baseline (F1 up to 0.538 with GPT-5.1), but still trails Omni-SimpleMem by over 25 percentage points. These patterns confirm that Omni-SimpleMem's gains come from its architectural design (hybrid search, pyramid retrieval, knowledge graph augmentation) rather than from a single dominant component.


4.4 Analysis

4.4.1 Ablation Studies
Table 2: Component ablation on LoCoMo (mean ΔF1 × 100 across 4 backbones).

| Component Removed | ΔF1 | Rel. |
|---|---|---|
| w/o Pyramid Expansion | −10.2 | −17% |
| w/o BM25 Hybrid | −8.5 | −14% |
| w/o LLM Summarization | −7.3 | −12% |
| Reduced top-k (5 vs 20) | −4.2 | −7% |
| w/o Metadata Context | −1.4 | −2% |

Table 2 presents an ablation study on LoCoMo that validates key design choices discovered by the pipeline. We remove one component at a time and report the mean ΔF1 across 4 LLM backbones. Pyramid expansion (−17%) and BM25 hybrid search (−14%) are the most critical, and the autonomous pipeline refined both substantially during optimization. Removing LLM summarization costs 12%, confirming that compact MAU summaries are essential for retrieval quality. Reducing top-k from 20 to 5 costs 7%, while removing metadata context has a modest effect (−2%). Notably, the two most impactful components (pyramid expansion and hybrid search) are precisely those that received the most optimization iterations, suggesting the pipeline allocated its search budget well.


4.4.2 Efficiency
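Figure 4: Throughput versus F1. Omni-SimpleMem achieves 3.5× higher throughput with 8 workers.

Table 3: Latency breakdown. Ret. and Gen. are in milliseconds.

| Method | q/s | Ret. | Gen. |
|---|---|---|---|
| SimpleMem | 1.68 | 45 | 550 |
| MemVerse | 1.48 | 70 | 596 |
| Mem0 | 1.46 | 18 | 665 |
| Omni-SimpleMem (w=1) | 1.05 | 118 | 846 |
| Omni-SimpleMem (w=8) | 5.81 | 461 | 821 |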

Omni-SimpleMem achieves 5.81 queries/sec with 8 parallel workers (3.5× faster than the fastest baseline), enabled by read-only FAISS and BM25 indices supporting concurrent lookup (Figure 4, Table 3). All baselines are bottlenecked by sequential LLM generation (85–97% of per-query time), while Omni-SimpleMem parallelizes the retrieval-generation pipeline via thread-safe read-only indices.

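
The parallelization idea can be sketched with a thread pool over a shared read-only index. This is a toy illustration under stated assumptions, not the paper's implementation: `INDEX` is a hypothetical in-memory keyword index standing in for the FAISS and BM25 indices, `generate` simulates the LLM call with a short sleep, and all names and latencies are invented for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from types import MappingProxyType

# Hypothetical read-only index: a mapping proxy with tuple values cannot be mutated by workers.
INDEX = MappingProxyType({
    "paint": ("mau_12", "mau_40"),
    "sunset": ("mau_12", "mau_40"),
    "tree": ("mau_40",),
})


def retrieve(query: str) -> list[str]:
    """Look up each query token; only reads shared state, so no lock is needed."""
    hits: list[str] = []
    for token in query.lower().split():
        for mau_id in INDEX.get(token, ()):
            if mau_id not in hits:
                hits.append(mau_id)
    return hits


def generate(query: str, hits: list[str], latency_s: float = 0.02) -> str:
    """Stand-in for an LLM call; the sleep models network-bound generation time."""
    time.sleep(latency_s)
    return f"{query} -> {len(hits)} MAUs"


def answer(query: str) -> str:
    return generate(query, retrieve(query))


def run(queries: list[str], workers: int) -> tuple[list[str], float]:
    """Answer all queries with `workers` threads; return answers and queries/sec."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        answers = list(pool.map(answer, queries))  # map preserves input order
    return answers, len(queries) / (time.perf_counter() - start)


queries = ["paint sunset", "tree", "sunset tree", "horse"] * 4
sequential_answers, sequential_qps = run(queries, workers=1)
parallel_answers, parallel_qps = run(queries, workers=8)
print(f"w=1: {sequential_qps:.1f} q/s, w=8: {parallel_qps:.1f} q/s")
```

Because the simulated generation step is I/O-bound, threads overlap their waiting time, and because the index is never written, every worker sees the same results as a sequential run.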

4.4.3 Case Study: Multi-Hop Retrieval

We illustrate Omni-SimpleMem's retrieval pipeline on a real multi-hop query from LoCoMo that requires synthesizing facts across separate conversation sessions. The query asks: "What subject have Caroline and Melanie both painted?" The gold answer is "sunsets," but answering correctly requires retrieving each person's painting history from different sessions and identifying the overlap.


Hybrid search. Dense retrieval returns MAUs mentioning Caroline's paintings (e.g., "Caroline painted a sunset") and Melanie's art projects (e.g., "Mel and her kids painted a sunset with a tree"), but these appear in separate sessions with different surrounding context. BM25 recovers additional MAUs containing the keyword "paint" that rank lower in dense results. Set-union merging preserves the dense ordering and appends BM25-only matches.

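
The set-union merge described above can be sketched as follows: keep the dense ranking as is, then append BM25 hits that dense retrieval did not return. The function name, the optional `top_k` cut-off, and the MAU identifiers are assumptions made for illustration.

```python
from typing import Optional, Sequence


def set_union_merge(
    dense_ids: Sequence[str],
    bm25_ids: Sequence[str],
    top_k: Optional[int] = None,
) -> list[str]:
    """Union of two ranked lists: dense order first, then BM25-only hits in BM25 order."""
    seen: set[str] = set()
    merged: list[str] = []
    for mau_id in list(dense_ids) + list(bm25_ids):
        if mau_id not in seen:  # drop duplicates, keeping the first (dense) position
            seen.add(mau_id)
            merged.append(mau_id)
    return merged if top_k is None else merged[:top_k]


# Hypothetical ranked results for the query about Caroline's and Melanie's paintings.
dense = ["mau_12", "mau_40", "mau_07"]
bm25 = ["mau_40", "mau_88", "mau_12", "mau_63"]

print(set_union_merge(dense, bm25))
# ['mau_12', 'mau_40', 'mau_07', 'mau_88', 'mau_63']
```

Unlike score-based fusion, this merge never compares dense similarity scores with BM25 scores, so it does not need the two scales to be calibrated against each other.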

Knowledge graph expansion. The query processor extracts seed entities Caroline (Person) and Melanie (Person). Neighborhood expansion links both entities to painting (Concept) and sunset (Concept) through separate relation paths, surfacing MAUs that mention each person's painting activities even when the surface text does not co-mention both names.

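
A minimal sketch of neighborhood expansion, assuming the knowledge graph is stored as (head, relation, tail, MAU id) tuples. The toy edges, relation names, and MAU identifiers are hypothetical; the paper does not specify the graph schema at this level of detail.

```python
from collections import defaultdict

# Hypothetical toy graph: each edge records the MAU that supports it.
EDGES = [
    ("Caroline", "painted", "sunset", "mau_12"),
    ("Caroline", "engages_in", "painting", "mau_03"),
    ("Melanie", "painted", "sunset", "mau_40"),
    ("Melanie", "engages_in", "painting", "mau_21"),
    ("Melanie", "has", "kids", "mau_40"),
]


def expand_neighborhood(edges, seeds, hops=1):
    """Breadth-first expansion from seed entities; returns (new entities, supporting MAUs)."""
    adjacency = defaultdict(list)
    for head, _relation, tail, mau_id in edges:
        adjacency[head].append((tail, mau_id))
        adjacency[tail].append((head, mau_id))  # treat edges as undirected for lookup
    visited = set(seeds)
    frontier = set(seeds)
    maus = set()
    for _ in range(hops):
        next_frontier = set()
        for node in frontier:
            for neighbor, mau_id in adjacency[node]:
                maus.add(mau_id)
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return visited - set(seeds), maus


def shared_neighbors(edges, seeds):
    """Entities reachable in one hop from every seed."""
    per_seed = [expand_neighborhood(edges, [seed])[0] for seed in seeds]
    return set.intersection(*per_seed)


entities, maus = expand_neighborhood(EDGES, ["Caroline", "Melanie"])
print(sorted(shared_neighbors(EDGES, ["Caroline", "Melanie"])))  # ['painting', 'sunset']
print(sorted(maus))  # ['mau_03', 'mau_12', 'mau_21', 'mau_40']
```

The surfaced MAUs come from different sessions and never mention both names together, which is the case that surface-text retrieval alone tends to miss.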

Pyramid retrieval and answer. Level 1 summaries from both relation paths are loaded; their similarity scores exceed θ, triggering Level 2 expansion of the full conversation text. The LLM identifies "sunsets" as the common subject and produces the correct answer (F1 = 1.0). In contrast, MemGPT, which lacks cross-session entity linking, hallucinates "Horses" (F1 = 0.0).

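
The two-level gating can be sketched as follows: every retrieved MAU contributes its Level 1 summary, and only those whose similarity exceeds θ are expanded to the Level 2 full text. The `MAU` dataclass, the scores, the placeholder texts, and the value θ = 0.5 are assumptions for illustration, not values reported in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MAU:
    mau_id: str
    summary: str    # Level 1: compact summary
    full_text: str  # Level 2: original conversation text


def pyramid_retrieve(scored_maus, theta):
    """Return Level 2 text when the similarity exceeds theta, else the Level 1 summary."""
    context = []
    for mau, score in scored_maus:
        context.append(mau.full_text if score > theta else mau.summary)
    return context


# Hypothetical retrieval results with similarity scores.
scored = [
    (MAU("mau_12", "Caroline painted a sunset.", "<full session text A>"), 0.82),
    (MAU("mau_40", "Mel and her kids painted a sunset with a tree.", "<full session text B>"), 0.77),
    (MAU("mau_55", "Unrelated summary.", "<full session text C>"), 0.31),
]

for piece in pyramid_retrieve(scored, theta=0.5):
    print(piece)
```

Gating on θ keeps the prompt short for weak matches while giving the LLM the original wording for strong ones.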

5. Conclusion

We have presented Omni-SimpleMem, a unified multimodal memory framework whose architecture and configuration are discovered through AutoResearchClaw, an autonomous research pipeline. Starting from a naive baseline, the pipeline autonomously executed 50 experiments in 72 hours, achieving state-of-the-art on both LoCoMo and Mem-Gallery. The highest-impact discoveries, including bug fixes, architectural changes, and prompt engineering, require code comprehension and cross-component reasoning beyond the reach of traditional AutoML. Our taxonomy of six discovery types, together with the observation that multimodal memory is well-suited for autoresearch due to its scalar metrics, modular architecture, and fast iteration cycles, provides a roadmap for applying autonomous research pipelines to other complex AI system domains.
