
Abstract

Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems which provide the illusion of an extended virtual memory via paging between physical memory and disk. Using this technique, we introduce MemGPT (MemoryGPT), a system that intelligently manages different storage tiers in order to effectively provide extended context within the LLM's limited context window. We evaluate our OS-inspired design in two domains where the limited context windows of modern LLMs severely handicap their performance: document analysis, where MemGPT is able to analyze large documents that far exceed the underlying LLM's context window, and multi-session chat, where MemGPT can create conversational agents that remember, reflect, and evolve dynamically through long-term interactions with their users.

Introduction

In recent years, large language models (LLMs) and their underlying transformer architecture (Vaswani et al., 2017; Devlin et al., 2018; Brown et al., 2020; Ouyang et al., 2022) have become the cornerstone of conversational AI and have led to a wide array of consumer and enterprise applications. Despite these advances, the limited fixed-length context windows used by LLMs significantly hinder their applicability to long conversations or reasoning about long documents. For example, the most widely used open-source LLMs can only support a few dozen back-and-forth messages or reason about a short document before exceeding their maximum input length (Touvron et al., 2023).

Directly extending the context length of transformers incurs a quadratic increase in computational time and memory cost due to the transformer architecture's self-attention mechanism, making the design of new long-context architectures a pressing research challenge (Dai et al., 2019; Kitaev et al., 2020; Beltagy et al., 2020). While developing longer-context models is an active area of research (Dong et al., 2023), even if we could overcome the computational challenges of context scaling, recent research shows that long-context models struggle to utilize additional context effectively (Liu et al., 2023a). As a consequence, given the considerable resources needed to train state-of-the-art LLMs and the diminishing returns of context scaling, there is a critical need for alternative techniques to support long context.

In this paper, we study how to provide the illusion of an infinite context while continuing to use fixed-context models. Our approach borrows from the idea of virtual memory paging that was developed to enable applications to work on datasets that far exceed the available memory by paging data between main memory and disk. We leverage the recent progress in the function calling abilities of LLM agents (Schick et al., 2023; Liu et al., 2023b) to design MemGPT, an OS-inspired LLM system for virtual context management. Using function calls, LLM agents can read and write to external data sources, modify their own context, and choose when to return responses to the user. These capabilities allow LLMs to effectively "page" information between the context window (analogous to "main memory" in operating systems) and external storage, similar to hierarchical memory in traditional OSes. In addition, function calls can be leveraged to manage control flow between context management, response generation, and user interactions. This allows an agent to iteratively modify what is in its context for a single task, thereby more effectively utilizing its limited context.

In MemGPT, we treat context windows as a constrained memory resource, and design a memory hierarchy for LLMs analogous to memory tiers used in traditional OSes (Patterson et al., 1988). Applications in traditional OSes interact with virtual memory, which provides an illusion of there being more memory resources than are actually available in physical (i.e., main) memory by the OS paging overflow data to disk and retrieving data (via a page fault) back into memory when accessed by applications. To provide a similar illusion of longer context length (analogous to virtual memory), we allow the LLM to manage what is placed in its own context (analogous to physical memory) via an 'LLM OS', which we call MemGPT. MemGPT enables the LLM to retrieve relevant historical data missing from what is placed in-context, and also evict less relevant data from context and into external storage systems. Figure 3 illustrates the components of MemGPT.

The combined use of a memory hierarchy, OS functions, and event-based control flow allows MemGPT to handle unbounded context using LLMs that have finite context windows. To demonstrate the utility of our new OS-inspired LLM system, we evaluate MemGPT on two domains where the performance of existing LLMs is severely limited by finite context: document analysis, where the length of standard text files can quickly exceed the input capacity of modern LLMs, and conversational agents, where LLMs bound by limited conversation windows lack context awareness, persona consistency, and long-term memory during extended conversations. In both settings, MemGPT is able to overcome the limitations of finite context to outperform existing LLM-based approaches.

MemGPT

MemGPT's OS-inspired multi-level memory architecture distinguishes between two primary memory types: main context (analogous to main memory/physical memory/RAM) and external context (analogous to disk memory/disk storage). Main context consists of the LLM prompt tokens—anything in main context is considered in-context and can be accessed by the LLM processor during inference. External context refers to any information that is held outside of the LLM's fixed context window. This out-of-context data must always be explicitly moved into main context in order for it to be passed to the LLM processor during inference. MemGPT provides function calls that allow the LLM processor to manage its own memory without any user intervention.

2.1. Main context (prompt tokens)

The prompt tokens in MemGPT are split into three contiguous sections: the system instructions, working context, and FIFO queue. The system instructions are read-only (static) and contain information on the MemGPT control flow, the intended usage of the different memory levels, and instructions on how to use the MemGPT functions (e.g. how to retrieve out-of-context data). Working context is a fixed-size read/write block of unstructured text, writeable only via MemGPT function calls. In conversational settings, working context is intended to be used to store key facts, preferences, and other important information about the user and the persona the agent is adopting, allowing the agent to converse fluently with the user. The FIFO queue stores a rolling history of messages, including messages between the agent and user, as well as system messages (e.g. memory warnings) and function call inputs and outputs. The first index in the FIFO queue stores a system message containing a recursive summary of messages that have been evicted from the queue.
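The three-segment layout above can be sketched as follows; the class and method names (`MainContext`, `compile_prompt`) are illustrative, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class MainContext:
    """Illustrative three-part layout of MemGPT's prompt tokens."""
    system_instructions: str                        # read-only (static)
    working_context: str = ""                       # fixed-size read/write block, edited only via function calls
    fifo_queue: list = field(default_factory=list)  # rolling message history; index 0 holds the recursive summary

    def compile_prompt(self) -> str:
        # The three contiguous sections are concatenated into the single
        # prompt string passed to the LLM processor at inference time.
        return "\n".join([self.system_instructions, self.working_context, *self.fifo_queue])

ctx = MainContext(system_instructions="You are MemGPT. Manage your memory via function calls.")
ctx.fifo_queue.append("[recursive summary of evicted messages]")
ctx.fifo_queue.append("user: Hi, I just moved to Berkeley.")
prompt = ctx.compile_prompt()
```

Keeping the recursive summary at index 0 of the queue means the oldest (evicted) history is always summarized at the front of the rolling window.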

2.2. Queue Manager

The queue manager manages messages in recall storage and the FIFO queue. When a new message is received by the system, the queue manager appends the incoming message to the FIFO queue, concatenates the prompt tokens, and triggers LLM inference to generate the LLM output (the completion tokens). The queue manager writes both the incoming message and the generated LLM output to recall storage (the MemGPT message database). When messages in recall storage are retrieved via a MemGPT function call, the queue manager appends them to the back of the queue to reinsert them into the LLM's context window.
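This message path can be sketched minimally as follows, with recall storage modeled as a plain list standing in for the message database and `llm` as any prompt-to-completion callable (all names are illustrative assumptions):

```python
class QueueManager:
    """Sketch of the MemGPT queue manager; recall storage is a list here."""
    def __init__(self, llm, fifo_queue):
        self.llm = llm                # callable: prompt string -> completion string
        self.fifo_queue = fifo_queue  # in-context rolling history
        self.recall_storage = []      # out-of-context message database

    def process_incoming(self, message: str) -> str:
        # 1. Append the incoming message to the FIFO queue.
        self.fifo_queue.append(message)
        # 2. Concatenate the prompt tokens and trigger LLM inference.
        completion = self.llm("\n".join(self.fifo_queue))
        # 3. Persist both the incoming message and the output to recall storage.
        self.recall_storage.extend([message, completion])
        self.fifo_queue.append(completion)
        return completion

    def retrieve(self, query: str):
        # Retrieved recall-storage hits are appended to the back of the
        # queue, reinserting them into the LLM's context window.
        hits = [m for m in self.recall_storage if query in m]
        self.fifo_queue.extend(hits)
        return hits

qm = QueueManager(llm=lambda prompt: "assistant: hello!", fifo_queue=[])
qm.process_incoming("user: hi")
```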

The queue manager is also responsible for controlling context overflow via a queue eviction policy. When the prompt tokens exceed the 'warning token count' of the underlying LLM's context window (e.g. 70% of the context window), the queue manager inserts a system message into the queue warning the LLM of an impending queue eviction (a 'memory pressure' warning), allowing the LLM to use MemGPT functions to store important information contained in the FIFO queue to working context or archival storage (a read/write database storing arbitrary-length text objects). When the prompt tokens exceed the 'flush token count' (e.g. 100% of the context window), the queue manager flushes the queue to free up space in the context window: it evicts a specific count of messages (e.g. 50% of the context window) and generates a new recursive summary from the existing recursive summary and the evicted messages. Once the queue is flushed, the evicted messages are no longer in-context and thus no longer immediately viewable to the LLM; however, they are stored indefinitely in recall storage and remain readable via MemGPT function calls.
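A sketch of this two-threshold eviction policy, with a toy token budget and a placeholder string standing in for the LLM-generated recursive summary (thresholds and helper names are illustrative):

```python
WARNING_TOKENS = 70    # 'warning token count' (e.g. 70% of the window)
FLUSH_TOKENS = 100     # 'flush token count' (e.g. 100% of the window)
EVICT_TOKENS = 50      # a flush evicts messages worth ~50% of the window

def check_pressure(queue, summary, count_tokens):
    """Sketch of the queue eviction policy; the summary update stands in
    for an LLM call that folds evicted messages into a recursive summary."""
    used = sum(count_tokens(m) for m in queue)
    if used > FLUSH_TOKENS:
        # Flush: evict the oldest messages until ~EVICT_TOKENS are freed,
        # then generate a new recursive summary from the old one plus them.
        evicted, freed = [], 0
        while queue and freed < EVICT_TOKENS:
            msg = queue.pop(0)
            evicted.append(msg)
            freed += count_tokens(msg)
        summary = f"[summary of prior summary + {len(evicted)} evicted messages]"
        return queue, summary, evicted  # evicted messages remain in recall storage
    if used > WARNING_TOKENS:
        # Memory-pressure warning: prompt the LLM to save key facts to
        # working context or archival storage before the coming flush.
        queue.append("system: memory pressure warning")
    return queue, summary, []
```

The warning threshold deliberately fires well before the flush so the LLM has at least one inference cycle to rescue important facts from the queue.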

2.3. Function executor (handling of completion tokens)

MemGPT orchestrates data movement between main context and external context via function calls that are generated by the LLM processor. Memory edits and retrieval are entirely self-directed: MemGPT autonomously updates and searches through its own memory based on the current context. For instance, it can decide when to move items between contexts (e.g. when the conversation history is becoming too long, as shown in Figure 1) and modify its main context to better reflect its evolving understanding of its current objectives and responsibilities (as shown in Figure 3). We implement self-directed editing and retrieval by providing explicit instructions within the system instructions that guide the LLM on how to interact with the MemGPT memory systems. These instructions comprise two main components: (1) a detailed description of the memory hierarchy and the respective utilities of each level, and (2) a function schema (complete with natural language descriptions of each function) that the system can call to access or modify its memory.
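For concreteness, one entry in such a schema might look like the following; the function name `working_context_append` and field layout are assumptions modeled on the common JSON function-calling format, not necessarily MemGPT's exact schema:

```python
# One illustrative entry from the function schema placed in the system
# instructions, in the common JSON function-calling style.
memory_function_schema = {
    "name": "working_context_append",
    "description": "Append an important fact about the user or the agent's "
                   "persona to working context so it survives queue flushes.",
    "parameters": {
        "type": "object",
        "properties": {
            "content": {
                "type": "string",
                "description": "The fact to store, phrased as a short note.",
            }
        },
        "required": ["content"],
    },
}
```

The natural language `description` fields are what let the LLM choose the right memory function from the schema during inference.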

During each inference cycle, the LLM processor takes main context (concatenated into a single string) as input and generates an output string. This output string is parsed by MemGPT to ensure correctness, and if the parser validates the function arguments, the function is executed. The results, including any runtime errors that occur (e.g. trying to add to main context when it is already at maximum capacity), are then fed back to the processor by MemGPT. This feedback loop enables the system to learn from its actions and adjust its behavior accordingly. Awareness of context limits is key to making the self-editing mechanism work effectively; to this end, MemGPT prompts the processor with warnings regarding token limitations to guide its memory management decisions. Additionally, our memory retrieval mechanisms are designed to be cognizant of these token constraints and implement pagination to prevent retrieval calls from overflowing the context window.
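The parse-validate-execute-feedback loop can be sketched as follows, assuming (purely for illustration) that the completion tokens are a JSON-encoded function call:

```python
import json

def inference_cycle(llm, main_context, functions):
    """Sketch of one cycle: parse the completion, execute the call, and
    feed the result (or the runtime error) back into main context."""
    output = llm("\n".join(main_context))         # completion tokens
    try:
        call = json.loads(output)                 # parse for correctness
        fn = functions[call["function"]]          # validate the function name
        result = fn(**call.get("arguments", {}))  # execute with validated args
        feedback = f"function returned: {result}"
    except Exception as err:
        # Runtime errors (e.g. main context already at capacity) are fed
        # back to the processor so it can adjust its next action.
        feedback = f"function failed: {err}"
    main_context.append(feedback)
    return feedback
```

Because errors are appended to main context rather than raised, the next inference cycle sees the failure and can retry differently.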

2.4. Control flow and function chaining

In MemGPT, events trigger LLM inference: events are generalized inputs to MemGPT and can consist of user messages (in chat applications), system messages (e.g. main context capacity warnings), user interactions (e.g. an alert that a user just logged in, or an alert that they finished uploading a document), and timed events that are run on a regular schedule (allowing MemGPT to run 'unprompted' without user intervention). MemGPT processes events with a parser to convert them into plain text messages that can be appended to main context and eventually be fed as input into the LLM processor.
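A minimal parser over these event kinds might look like this (the event kinds and message formats are illustrative assumptions):

```python
def parse_event(event: dict) -> str:
    """Sketch: convert a generalized MemGPT event into the plain-text
    message that gets appended to main context."""
    kind = event["kind"]
    if kind == "user_message":
        return f"user: {event['text']}"
    if kind == "system_alert":         # e.g. main context capacity warning
        return f"system: {event['text']}"
    if kind == "user_interaction":     # e.g. login or document-upload alerts
        return f"system: user event: {event['text']}"
    if kind == "timer":                # scheduled events let MemGPT run 'unprompted'
        return "system: timed heartbeat (no user input)"
    raise ValueError(f"unknown event kind: {kind}")
```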

Many practical tasks require calling multiple functions in sequence, for example, navigating through multiple pages of results from a single query or collating data into main context from documents retrieved by separate queries. Function chaining allows MemGPT to execute multiple function calls sequentially before returning control to the user. In MemGPT, functions can be called with a special flag that requests control be immediately returned to the processor after the requested function completes execution. If this flag is present, MemGPT will add the function output to main context and immediately continue processor execution (as opposed to pausing it). If this flag is not present (a yield), MemGPT will not run the LLM processor until the next external event trigger (e.g. a user message or scheduled interrupt).
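The chaining loop can be sketched as follows; the `request_heartbeat` flag name and step cap are illustrative assumptions, and the LLM output is assumed already parsed into a call dict:

```python
def run_chain(llm, main_context, functions, max_steps=10):
    """Sketch of function chaining: a heartbeat-style flag hands control
    straight back to the processor; omitting it yields until the next
    external event (user message or scheduled interrupt)."""
    for _ in range(max_steps):
        call = llm(main_context)  # assumed to return a parsed function call
        result = functions[call["function"]](**call.get("arguments", {}))
        main_context.append(f"function returned: {result}")
        if not call.get("request_heartbeat", False):
            break  # yield: wait for the next external event trigger
    return main_context
```

The `max_steps` cap is a safety guard so a chain of heartbeat-flagged calls cannot loop indefinitely without yielding.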

Related Work

Long-context LLMs. Several lines of work have improved the context length of LLMs, for instance via more efficient transformer architectures that sparsify attention (Child et al., 2019; Beltagy et al., 2020), low-rank approximations (Wang et al., 2020), and neural memory (Lee et al., 2019). Another line of work aims to extend context windows beyond the length models were originally trained for, such as Press et al. (2021) and Chen et al. (2023). MemGPT builds upon these improvements in context length, as they increase the size of the main memory in MemGPT. Our main contribution is a hierarchical tiered memory that uses a long-context LLM as the implementation of main memory.

Retrieval-Augmented Models. The design of the external memory of MemGPT builds upon much prior work augmenting LLMs with relevant inputs from external retrievers (Ram et al., 2023; Borgeaud et al., 2022; Karpukhin et al., 2020; Lewis et al., 2020; Guu et al., 2020; Lin et al., 2023). In particular, Jiang et al. (2023) propose FLARE, a method that allows the LLM to actively decide when and what to retrieve during the course of generation. Trivedi et al. (2022) interleave retrieval with Chain-of-Thoughts reasoning to improve multi-step question answering.

LLMs as agents. Recent work has explored augmenting LLMs with additional capabilities to act as agents in interactive environments. Park et al. (2023) propose adding memory to LLMs and using the LLM as a planner, and observe emergent social behaviors in a multi-agent sandbox environment (inspired by The Sims video game) where agents can perform basic activities such as doing chores/hobbies, going to work, and conversing with other agents. Nakano et al. (2021) train models to search the web before answering questions, and use pagination concepts similar to MemGPT's to control the underlying context size in their web-browsing environment. Yao et al. (2022) showed that interleaving chain-of-thought reasoning (Wei et al., 2022) with actions can further improve the planning ability of interactive LLM-based agents; similarly, in MemGPT, the LLM is able to 'plan out loud' when executing functions. Liu et al. (2023b) introduced a suite of LLM-as-an-agent benchmarks to evaluate LLMs in interactive environments, including video games, thinking puzzles, and web shopping. In contrast, our work focuses on tackling the problem of equipping agents with long-term memory of user inputs.

Conclusion

In this paper, we introduced MemGPT, a novel LLM system inspired by operating systems to manage the limited context windows of large language models. By designing a memory hierarchy and control flow analogous to traditional OSes, MemGPT provides the illusion of larger context resources for LLMs. This OS-inspired approach was evaluated in two domains where existing LLM performance is constrained by finite context lengths: document analysis and conversational agents. For document analysis, MemGPT could process lengthy texts well beyond the context limits of current LLMs by effectively paging relevant context in and out of memory. For conversational agents, MemGPT enabled maintaining long-term memory, consistency, and evolvability over extended dialogues. Overall, MemGPT demonstrates that operating system techniques like hierarchical memory management and interrupts can unlock the potential of LLMs even when constrained by fixed context lengths. This work opens numerous avenues for future exploration, including applying MemGPT to other domains with massive or unbounded contexts, integrating different memory tier technologies like databases or caches, and further improving control flow and memory management policies. By bridging concepts from OS architecture into AI systems, MemGPT represents a promising new direction for maximizing the capabilities of LLMs within their fundamental limits.
