

Generative Agents: Interactive Simulacra of Human Behavior

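Figure 1: Generative agents are believable simulacra of human behavior for interactive applications. In this work, we place 25 agents in a sandbox environment reminiscent of The Sims; users can observe and intervene as agents plan their days, share news, form relationships, and coordinate group activities.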

Abstract

Believable proxies of human behavior can empower interactive applications ranging from immersive environments to rehearsal spaces for interpersonal communication to prototyping tools. In this paper, we introduce generative agents: computational software agents that simulate believable human behavior. Generative agents wake up, cook breakfast, and head to work; artists paint, while authors write; they form opinions, notice each other, and initiate conversations; they remember and reflect on days past as they plan the next day. To enable generative agents, we describe an architecture that extends a large language model to store a complete record of the agent's experiences using natural language, synthesize those memories over time into higher-level reflections, and retrieve them dynamically to plan behavior. We instantiate generative agents to populate an interactive sandbox environment inspired by The Sims, where end users can interact with a small town of twenty-five agents using natural language. In an evaluation, these generative agents produce believable individual and emergent social behaviors. For example, starting with only a single user-specified notion that one agent wants to throw a Valentine's Day party, the agents autonomously spread invitations to the party over the next two days, make new acquaintances, ask each other out on dates to the party, and coordinate to show up for the party together at the right time. We demonstrate through ablation that the components of our agent architecture—observation, planning, and reflection—each contribute critically to the believability of agent behavior. By fusing large language models with computational interactive agents, this work introduces architectural and interaction patterns for enabling believable simulations of human behavior.

1. Introduction

How might we craft an interactive artificial society that reflects believable human behavior? From sandbox games such as The Sims to applications such as cognitive models and virtual environments, for over four decades, researchers and practitioners have envisioned computational agents that can serve as believable proxies of human behavior. In these visions, computationally-powered agents act consistently with their past experiences and react believably to their environments. Such simulations of human behavior could populate virtual spaces and communities with realistic social phenomena, train people on how to handle rare yet difficult interpersonal situations, test social science theories, craft model human processors for theory and usability testing, power ubiquitous computing applications and social robots, and underpin non-playable game characters that can navigate complex human relationships in an open world.

However, the space of human behavior is vast and complex. Despite striking progress in large language models that can simulate human behavior at a single time point, fully general agents that ensure long-term coherence would be better served by architectures that manage constantly growing memories as new interactions, conflicts, and events arise and fade over time, while handling cascading social dynamics that unfold between multiple agents. Success requires an approach that can retrieve relevant events and interactions over a long period, reflect on those memories to generalize and draw higher-level inferences, and apply that reasoning to create plans and reactions that make sense in the moment and in the longer-term arc of the agent's behavior.

In this paper, we introduce generative agents—agents that draw on generative models to simulate believable human behavior—and demonstrate that they produce believable simulacra of both individual and emergent group behavior. Generative agents draw a wide variety of inferences about themselves, other agents, and their environment; they create daily plans that reflect their characteristics and experiences, act out those plans, react, and re-plan when appropriate; they respond when the end user changes their environment or commands them in natural language. For instance, generative agents turn off the stove when they see that their breakfast is burning, wait outside the bathroom if it is occupied, and stop to chat when they meet another agent they want to talk to. (When referring to generative agents engaging in actions or going to places, this is a shorthand for readability and not a suggestion that they are engaging in human-like agency. The behaviors of our agents, akin to animated Disney characters, aim to create a sense of believability, but they do not imply genuine agency.) A society full of generative agents is marked by emergent social dynamics where new relationships are formed, information diffuses, and coordination arises across agents.

To enable generative agents, we describe an agent architecture that stores, synthesizes, and applies relevant memories to generate believable behavior using a large language model. Our architecture comprises three main components. The first is the memory stream, a long-term memory module that records, in natural language, a comprehensive list of the agent's experiences. A memory retrieval model combines relevance, recency, and importance to surface the records needed to inform the agent's moment-to-moment behavior. The second is reflection, which synthesizes memories into higher-level inferences over time, enabling the agent to draw conclusions about itself and others to better guide its behavior. The third is planning, which translates those conclusions and the current environment into high-level action plans and then recursively into detailed behaviors for action and reaction. These reflections and plans are fed back into the memory stream to influence the agent's future behavior.
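The retrieval step described above can be illustrated with a small sketch. This overview does not say how relevance, recency, and importance are combined, so everything specific here is an assumption made for illustration: the `Memory` fields, the equal-weight sum, the exponential decay for recency, and cosine similarity over hypothetical embedding vectors as a stand-in for relevance.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    created_hour: float     # game time, in hours, when the memory was recorded
    importance: float       # assumed to be pre-scored in [0, 1]
    embedding: list[float]  # hypothetical vector representation of the text


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(memories, query_embedding, now_hour, k=2, decay=0.99):
    """Return the k memories with the highest combined score.

    Assumption: the three signals are simply added with equal weight.
    """
    def score(m: Memory) -> float:
        recency = decay ** (now_hour - m.created_hour)  # newer memories score higher
        relevance = cosine(m.embedding, query_embedding)
        return recency + relevance + m.importance

    return sorted(memories, key=score, reverse=True)[:k]


stream = [
    Memory("ate breakfast", 0.0, 0.1, [1.0, 0.0]),
    Memory("planning party", 10.0, 0.9, [0.0, 1.0]),
    Memory("saw a bird", 9.0, 0.1, [1.0, 0.0]),
]
top = retrieve(stream, query_embedding=[0.0, 1.0], now_hour=10.0, k=2)
print([m.text for m in top])
```

In this toy stream, the recent, important, and on-topic memory ranks first, and the more recent of the two unrelated memories ranks second. Different weights or a different decay rate would change that ordering.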

This architecture suggests applications in multiple domains, from role-play and social prototyping to virtual worlds and games. In social role-play scenarios (e.g., interview preparation), a user could safely rehearse difficult, conflict-laden conversations. When prototyping social platforms, a designer could go beyond temporary personas to prototype dynamic, complex interactions that unfold over time. For this paper, we focus on the ability to create a small, interactive society of agents inspired by games such as The Sims. By connecting our architecture to the ChatGPT large language model, we manifest a society of twenty-five agents in a game environment. End users can observe and interact with these agents. If an end user or developer wanted the town to host an in-game Valentine's Day party, for example, traditional game environments would require scripting tens of characters' behavior manually. We demonstrate that, with generative agents, it is sufficient to simply tell one agent that she wants to throw a party. Despite many potential points of failure—the party planner must remember to invite other agents to the party, attendees must remember the invitation, those who remember must decide to actually show up, and more—our agents succeed. They spread the word about the party and then show up, with one agent even asking another on a date to the party, all from a single user-generated seed suggestion.

We conducted two evaluations of generative agents: a controlled evaluation to test whether the agents produce believable individual behaviors in isolation, and an end-to-end evaluation where the agents interacted with each other in open-ended ways over two days of game time to understand their stability and emergent social behaviors. In the technical evaluation, we leverage a methodological opportunity to evaluate an agent's knowledge and behavior by "interviewing" it in natural language to probe the agents' ability to stay in character, remember, plan, react, and reflect accurately. We compared several ablations that limit agents' access to memory, reflection, and planning. We observe that each of these components is critical to strong performance across these interview tasks. Across the technical and end-to-end evaluation, the most common errors arose when the agent failed to retrieve relevant memories, fabricated embellishments to the agent's memory, or inherited overly formal speech or behavior from the language model.

In sum, this paper makes the following contributions:

  • Generative agents, believable simulacra of human behavior that are dynamically conditioned on agents' changing experiences and environment.
  • A novel architecture that makes it possible for generative agents to remember, retrieve, reflect, interact with other agents, and plan through dynamically evolving circumstances. The architecture leverages the powerful prompting capabilities of large language models and supplements those capabilities to support longer-term agent coherence, the ability to manage dynamically evolving memory, and recursively produce higher-level reflections.
  • Two evaluations, a controlled evaluation and an end-to-end evaluation, that establish causal effects of the importance of components of the architecture, as well as identify breakdowns arising from, e.g., improper memory retrieval.
  • Discussion of the opportunities and ethical and societal risks of generative agents in interactive systems. We argue that these agents should be tuned to mitigate the risk of users forming parasocial relationships, logged to mitigate risks stemming from deepfakes and tailored persuasion, and applied in ways that complement rather than replace human stakeholders in design processes.

2. Related Work

In this section, we reflect on the prior literature in human-AI interaction and situate, within its canon, the agenda of building believable proxies of human behavior. This agenda, once hailed as a north star in the interaction, game, and artificial intelligence communities, has remained challenging due to the complexity of human behavior. We synthesize this research to suggest that large language models, though not sufficient by themselves, open up a new angle for creating believable agents when leveraged using the appropriate architecture.

2.1. Human-AI Interaction

Interactive artificial intelligence systems aim to combine human insights and capabilities in computational artifacts that can augment their users. A long line of work has explored ways to enable users to interactively specify model behavior. For instance, Crayons demonstrated an early vision of interactive machine learning, allowing non-expert users to train classifiers. Further work helped to articulate how end users might describe their classification goals to the system through examples or demonstration. Recent advancements have extended these explorations to deep learning and prompt-based authoring.

Meanwhile, a persistent thread of research has advanced the case for language- and agent-based interaction in human-computer interaction. Formative work such as SHRDLU and ELIZA demonstrated the opportunities and the risks associated with natural language interaction with computing systems. As research progressed, it became evident that autonomous agents could offer new metaphors for delegation and interaction, but the boundaries of delegation between humans and agents have remained the subject of ongoing debate and refinement. Recently, this technology has reached a level of stability that enables agents to interact via natural language in large and complex online social environments. Natural language interaction provides a novel modality that can enhance user abilities in domains such as photo editing and code editing.

We convene these threads of work to show that we can now create agents that proxy human behavior for interactive systems, and interact with them using natural language. In doing so, this work reopens the door to examining foundational human-computer interaction questions around cognitive models such as GOMS and Keystroke-Level Model (KLM), around prototyping tools, and around ubiquitous computing applications.

2.2. Believable Proxies of Human Behavior

Prior literature has described believability, or believable agents, as a central design and engineering goal. Believable agents are designed to provide an illusion of life and present a facade of realism in the way they appear to make decisions and act on their own volition, similar to the characters in Disney movies. These agents can populate and perceive an open world environment like the one we inhabit, and strive to behave in ways that exhibit emergent behaviors grounded in social interactions with users or other agents with the aim of becoming believable proxies of our behavior in hypothetical simulations of individuals and communities. Historically, these agents were developed in the context of intelligent game non-player characters (NPCs). Creating NPCs with believable behavior, if possible, could enhance player experiences in games and interactive fictions by enabling emergent narratives and social interactions with the agents. However, more importantly, game worlds provide increasingly realistic representations of real-world affordances, and as observed by Laird and van Lent in 2001, these simulated worlds offer accessible testbeds for developers of believable agents to finesse the agents' cognitive capabilities without worrying about implementing robotics in the real world or creating simulation environments from scratch.

A diverse set of approaches to creating believable agents emerged over the past four decades. In implementation, however, these approaches often simplified the environment or dimensions of agent behavior to make the effort more manageable. Rule-based approaches, such as finite-state machines and behavior trees, represent the brute-force approach of human-authoring the agent's behavior. They provide a straightforward way of creating simple agents, remain the dominant approach today, and can even handle rudimentary social interactions, as shown in games such as Mass Effect and The Sims series. Nonetheless, manually crafting behavior that can comprehensively address the breadth of possible interactions in an open world is untenable. This means that the resulting agent behaviors may not fully represent the consequences of their interactions, and cannot perform new procedures that were not hard-coded in their script. On the other hand, prevalent learning-based approaches for creating believable agents, such as reinforcement learning, have overcome the challenge of manual authoring by letting the agents learn their behavior, and have achieved superhuman performance in recent years with systems such as AlphaStar for StarCraft and OpenAI Five for Dota 2. However, their success has largely taken place in adversarial games with readily definable rewards that a learning algorithm can optimize for. They have not yet addressed the challenge of creating believable agents in an open world.

Cognitive architectures in computation, pioneered by Newell, aimed to build the infrastructure for supporting a comprehensive set of cognitive functions that suited the all-encompassing nature of believable agents held in its original vision. They fueled some of the earliest examples of believable agents. For instance, Quakebot-SOAR and ICARUS generated NPCs in first-person shooter games, while TacAir-SOAR generated pilots in aerial combat training simulations. The architectures used by these agents differed (Quakebot- and TacAir-SOAR relied on SOAR, while ICARUS relied on its own variation that was inspired by SOAR and ACT-R), but they shared the same underlying principle. They maintained short-term and long-term memories, filled these memories with symbolic structures, and operated in perceive-plan-act cycles, dynamically perceiving the environment and matching it with one of the manually crafted action procedures. Agents created using cognitive architectures aimed to be generalizable to most, if not all, open world contexts and exhibited robust behavior for their time. However, their space of action was limited to manually crafted procedural knowledge, and they did not offer a mechanism through which the agents could be inspired to seek new behavior. As such, these agents were deployed mostly in non-open world contexts such as first-person shooter games or blocks worlds.

Today, creating believable agents as described in its original definition remains an open problem. Many have moved on, arguing that although current approaches for creating believable agents might be cumbersome and limited, they are good enough to support existing gameplay and interactions. Our argument is that large language models offer an opportunity to re-examine these questions, provided that we can craft an effective architecture to synthesize memories into believable behavior. We offer a step toward such an architecture in this paper.

2.3. Large Language Models and Human Behavior

Generative agents leverage a large language model to power their behavior. The key observation is that large language models encode a wide range of human behavior from their training data. If prompted with a narrowly defined context, the models can be used to generate believable behavior. Recent work has demonstrated the efficacy of this approach. For instance, social simulacra used a large language model to generate users that would populate new social computing systems to prototype their emergent social dynamics. This approach used a prompt chain to generate short natural language descriptions of personas and their behaviors as they appear in the system being prototyped. Other empirical studies have replicated existing social science studies, political surveys, and generated synthetic data. Large language models have also been used to generate interactive human behavior for users to engage with. In gaming, for instance, these models have been employed to create interactive fiction and text adventure games. With their ability to generate and decompose action sequences, large language models have also been used in planning robotics tasks. For example, when presented with a task, such as picking up a bottle, the model is prompted to break down the task into smaller action sequences, such as heading to the table where the bottle is located and picking it up.

We posit that, based on the work summarized above, large language models can become a key ingredient for creating believable agents. The existing literature largely relies on what could be considered first-order templates that employ few-shot prompts or chain-of-thought prompts. These templates are effective in generating behavior that is conditioned solely on the agent's current environment (e.g., how would a troll respond to a given post, what actions would a robot need to take to enter a room given that there is a door). However, believable agents require conditioning not only on their current environment but also on a vast amount of past experience, which is a poor fit for first-order prompting (and, as of today, impossible due to the underlying models' limited context window). Recent studies have attempted to go beyond first-order prompting by augmenting language models with a static knowledge base and an information retrieval scheme or with a simple summarization scheme. This paper extends these ideas to craft an agent architecture that handles retrieval where past experience is dynamically updated at each time step and mixed with agents' current context and plans, which may either reinforce or contradict each other.

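Figure 2: The Smallville sandbox world. The map contains areas such as houses, a school, a cafe, stores, and a park; each area is made up of finer-grained subareas and objects. Agents remember the subgraph of the environment they have seen, along with its state.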

3. Generative Agent Behavior and Interaction

To illustrate the affordances of generative agents, we instantiate them as characters in a simple sandbox world reminiscent of The Sims. This sprite-based sandbox game world, Smallville, evokes a small town environment. In this section, we will walk through the affordances and interactions with generative agents in Smallville and describe how the agents behave within it. Then, in Section 4, we will introduce our generative agent architecture that powers these affordances and interactions. In Section 5, we will describe the implementation of the sandbox environment and how the agents interact with the underlying engine of the sandbox world.

3.1. Agent Avatar and Communication

A community of 25 unique agents inhabits Smallville. Each agent is represented by a simple sprite avatar. We authored one paragraph of natural language description to depict each agent's identity, including their occupation and relationship with other agents, as seed memories. For example, John Lin has the following description:

John Lin is a pharmacy shopkeeper at the Willow Market and Pharmacy who loves to help people. He is always looking for ways to make the process of getting medication easier for his customers; John Lin is living with his wife, Mei Lin, who is a college professor, and son, Eddy Lin, who is a student studying music theory; John Lin loves his family very much; John Lin has known the old couple next-door, Sam Moore and Jennifer Moore, for a few years; John Lin thinks Sam Moore is a kind and nice man; John Lin knows his neighbor, Yuriko Yamamoto, well; John Lin knows of his neighbors, Tamara Taylor and Carmen Ortiz, but has not met them before; John Lin and Tom Moreno are colleagues at The Willows Market and Pharmacy; John Lin and Tom Moreno are friends and like to discuss local politics together; John Lin knows the Moreno family somewhat well — the husband Tom Moreno and the wife Jane Moreno.

Each semicolon-delimited phrase is entered into the agent's initial memory as memories at the start of the simulation.
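As a minimal sketch of this seeding step, the description can be split on semicolons to produce the initial memories. The function name `seed_memories` and the choice to represent each memory as a plain string are assumptions made for illustration; the text only states that each semicolon-delimited phrase becomes a memory.

```python
def seed_memories(description: str) -> list[str]:
    """Split a seed description into one memory per semicolon-delimited phrase."""
    # Strip surrounding whitespace and drop empty fragments
    # (for example, one produced by a trailing semicolon).
    return [phrase.strip() for phrase in description.split(";") if phrase.strip()]


# Excerpt of the John Lin description quoted above.
excerpt = (
    "John Lin loves his family very much; "
    "John Lin thinks Sam Moore is a kind and nice man; "
    "John Lin knows his neighbor, Yuriko Yamamoto, well"
)
for memory in seed_memories(excerpt):
    print(memory)
```

Note that commas inside a phrase are left untouched, so "John Lin knows his neighbor, Yuriko Yamamoto, well" stays a single memory.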

3.1.1. Inter-Agent Communication

The agents interact with the world by their actions, and with each other through natural language. At each time step of the sandbox engine, the agents output a natural language statement describing their current action, such as "Isabella Rodriguez is writing in her journal", "Isabella Rodriguez is checking her emails", "Isabella Rodriguez is talking with her family on the phone", or "Isabella Rodriguez is getting ready for bed." This statement is then translated into concrete movements that affect the sandbox world. The action is displayed on the sandbox interface as a set of emojis, providing an abstract representation of the action from an overhead view. To achieve this, the system utilizes a language model to translate the action into a set of emojis, which appear above each avatar's head in a speech bubble. For example, "Isabella Rodriguez is writing in her journal" is displayed as a book and pencil icon, while "Isabella Rodriguez is checking her emails" appears as a laptop and envelope icon. The complete natural language description of the action can be accessed by clicking on the agent's avatar.

Agents communicate with each other in full natural language. They are aware of other agents in their local area, and the generative agent architecture determines whether they walk by or engage in conversation. Here is a sample from the middle of a conversation between the agents Isabella Rodriguez and Tom Moreno about the upcoming election: (We note that the conversational style of these agents can feel overly formal, likely a result of instruction tuning in the underlying models. We expect that the writing style will be better controllable in future language models.)

Isabella: I’m still weighing my options, but I’ve been discussing the election with Sam Moore. What are your thoughts on him?
Tom: To be honest, I don’t like Sam Moore. I think he’s out of touch with the community and doesn’t have our best interests at heart.

3.1.2. User Controls

The user communicates with the agent through natural language by specifying a persona that the agent should perceive them as. For example, if the user specifies that they are a news "reporter" and asks about the upcoming election by saying, "Who is running for office?", the John agent replies:

John: My friends Yuriko, Tom and I have been talking about the upcoming election and discussing the candidate Sam Moore. We have all agreed to vote for him because we like his platform.

To directly command one of the agents, the user takes on the persona of the agent’s "inner voice"—this makes the agent more likely to treat the statement as a directive. For instance, when told “You are going to run against Sam in the upcoming election” by a user as John's inner voice, John decides to run in the election and shares his candidacy with his wife and son.

为了直接命令其中一个代理,用户需要扮演代理“内心声音”的角色,这使得代理更有可能将语句视为指令。 例如,当用户以约翰内心的声音告诉他“你将在即将到来的选举中与萨姆竞争”时,约翰决定参加选举,并与他的妻子和儿子分享他的候选资格。

3.2. Environmental Interaction

Smallville features the common affordances of a small village, including a cafe, bar, park, school, dorm, houses, and stores. It also defines subareas and objects that make those spaces functional, such as a kitchen in a house and a stove in the kitchen (Figure 2). All spaces serving as agents’ primary living quarters feature a bed, desk, closet, shelf, as well as a bathroom and a kitchen. (This environment design is not the focus of our work, so we generated this environment manually, not automatically. Future work can continue to expand the richness of the agents' environments.)

Smallville 具有小村庄的常见功能,包括咖啡馆、酒吧、公园、学校、宿舍、房屋和商店。 它还定义了使这些空间发挥功能的子区域和对象,例如房屋中的厨房和厨房中的炉灶(图2)。 所有作为智能体主要居住区的空间均配有床、书桌、壁橱、架子以及浴室和厨房。 (这个环境设计不是我们工作的重点,所以我们手动生成这个环境,而不是自动生成。 未来的工作可以继续扩展智能体环境的丰富性。)

Agents move around Smallville as one would in a simple video game, entering and leaving buildings, navigating its map, and approaching other agents. Agent movements are directed by the generative agent architecture and the sandbox game engine: when the model dictates that the agent will move to a location, we calculate a walking path to the destination in the Smallville environment, and the agent begins moving. In addition, users can also enter the sandbox world of Smallville as an agent operating within it. The agent that the user embodies can be an agent already present in the world, such as Isabella and John, or it can be an outside visitor with no prior history in Smallville. The inhabitants of Smallville will treat the user-controlled agent no differently than they treat each other. They recognize its presence, initiate interactions, and remember its behavior before forming opinions about it.

智能体在 Smallville 周围移动,就像在一款简单的视频游戏中一样,进入和离开建筑物、导航其地图并接近其他智能体。 代理移动由生成式智能体架构和沙盒游戏引擎引导:当模型指示代理将移动到某个位置时,我们在 Smallville 环境中计算到目的地的步行路径,然后代理开始移动。 此外,用户还可以进入Smallville的沙盒世界,作为代理在其中运行。 用户所体现的代理可以是世界上已经存在的代理,例如伊莎贝拉和约翰,也可以是在Smallville中没有先前历史的外部访问者。 Smallville 的居民对待用户控制的代理与对待彼此没有什么不同。 他们认识到它的存在,发起互动,并在形成对它的看法之前记住它的行为。

Users and agents can influence the state of the objects in this world, much like in sandbox games such as The Sims. For example, a bed can be occupied when an agent is sleeping, and a refrigerator can be empty when an agent uses up the ingredients to make breakfast. End users can also reshape an agent's environment in Smallville by rewriting the status of objects surrounding the agent in natural language. For instance, when Isabella is making breakfast in the morning, the user can change the status of the kitchen stove from “turned on” to “burning” by inputting a command to the system that chooses the object and illustrates its new status, like this: "<Isabella's apartment: kitchen: stove> is burning." Isabella will notice this in the next moment and go to turn off the stove and remake her breakfast. Likewise, if the user sets the status of Isabella's shower to “leaking water” when she enters the bathroom, she will gather tools from her living room and try to fix the leak.

用户和代理可以影响这个世界中对象的状态,就像在 The Sims 等沙盒游戏中一样。 例如,当智能体正在睡觉时,床可能会被占用,而当智能体用完做早餐的食材时,冰箱可能会被清空。 最终用户还可以通过用自然语言重写代理周围对象的状态来重塑 Smallville 中代理的环境。 例如,当伊莎贝拉早上做早餐时,用户可以通过向系统输入命令来将厨房炉灶的状态从“打开”更改为“燃烧”,该系统选择对象并说明其新状态,如下所示:“<伊莎贝拉的公寓:厨房:炉灶> 正在燃烧。” 伊莎贝拉下一刻就会注意到这一点,然后去关掉炉子,重新做早餐。 同样,如果用户在进入浴室时将伊莎贝拉的淋浴状态设置为“漏水”,她将从客厅收集工具并尝试修复漏水。
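
The status-rewriting command described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the command grammar is inferred from the single example in the text, and `StatusUpdate`, `parse_status_command`, `apply_update`, and the nested-dictionary world are all hypothetical names introduced here.

```python
import re
from typing import NamedTuple


class StatusUpdate(NamedTuple):
    path: tuple[str, ...]  # e.g. ("Isabella's apartment", "kitchen", "stove")
    status: str            # e.g. "burning"


# Assumed grammar: "<area: subarea: object> is <status>." (inferred from one example).
_COMMAND = re.compile(r"^<(?P<path>[^<>]+)>\s+is\s+(?P<status>.+?)\.?$")


def parse_status_command(command: str) -> StatusUpdate:
    """Split a natural-language status command into an object path and a status."""
    match = _COMMAND.match(command.strip())
    if match is None:
        raise ValueError(f"unrecognized command: {command!r}")
    path = tuple(part.strip() for part in match.group("path").split(":"))
    return StatusUpdate(path, match.group("status").strip())


def apply_update(world: dict, update: StatusUpdate) -> None:
    """Walk a nested dict (hypothetical world model) and overwrite the object's status."""
    *areas, obj = update.path
    node = world
    for area in areas:
        node = node[area]
    node[obj] = update.status


world = {"Isabella's apartment": {"kitchen": {"stove": "turned on"}}}
apply_update(world, parse_status_command("<Isabella's apartment: kitchen: stove> is burning."))
```

After the call, the stove entry in `world` holds the new status, which an agent standing in the kitchen could then perceive as an observation.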

图3:John Lin 的日常例子
图3:John Lin 一天早晨的片段。智能体会从粗粒度日程逐步展开到具体行动,并在家庭成员之间形成连续对话。
图4:情人节派对
图4:情人节派对的涌现过程。用户只设置 Isabella 想举办派对,以及 Maria 暗恋 Klaus;之后邀请、装饰、邀约和按时到场等行为都由智能体架构自主产生。

3.3. Example "Day in the Life"

Starting from the single-paragraph description, generative agents begin planning their days. As time passes in the sandbox world, their behaviors evolve as these agents interact with each other and the world, building memories and relationships, and coordinating joint activities.

从单段描述开始,生成式智能体开始计划他们的日子。 随着时间在沙盒世界中的流逝,它们的行为随着这些智能体彼此之间以及与世界的交互而演变,建立记忆和关系,并协调联合活动。

We demonstrate the behavior of generative agents by tracing the output of our system over the course of one day for the agent John Lin (Figure 3). In the Lin family, John is the first to wake up at 7 a.m. He brushes his teeth, takes a shower, gets dressed, eats breakfast, and checks the news at the dining table in the living room. At 8 a.m., Eddy follows suit, rushing out of bed to prepare for his classes. He catches up with John just as he is about to head out:

我们通过跟踪代理 John Lin 一天内系统的输出来演示生成式智能体的行为(图3)。 林家,早上七点,John第一个起床,刷牙、洗澡、穿衣、吃早餐,在客厅的餐桌上查看新闻。 早上 8 点,艾迪也跟着起床,准备上课。 正当约翰准备出发时,他追上了他:

John: Good morning Eddy. Did you sleep well?
Eddy: Good morning dad. Yeah, I slept great.
John: That's good. What are you working on today?
Eddy: I'm working on a new music composition for my class. It's due this week, so I'm trying to get it finished. But I'm having so much fun with it!
John: That sounds great!
约翰:早上好,艾迪。你睡的好吗?
艾迪:早上好,爸爸。是的,我睡得很好。
约翰:那很好。你今天在做什么?
Eddy:我正在为我的班级创作一首新的音乐作品。这周就要到期了,所以我正在努力完成它。但我玩得很开心!
约翰:听起来很棒!

Soon after Eddy heads out, Mei wakes up and joins John. Mei asks about her son, and John recalls the conversation they just had:

艾迪出去后不久,梅醒来并加入了约翰。 梅询问她儿子的情况,约翰回忆起他们刚刚的谈话:

Mei: Did Eddy already leave for school?
John: Yes, he just left. He's working on a music composition for his class.
Mei: Oh, that's wonderful! Tell me about it.
John: I think he's really enjoying it! He said he's having a lot of fun with it.
Mei: That's great! I'm so proud of him.
梅:艾迪已经去学校了吗?
约翰:是的,他刚刚离开。他正在为他的班级创作音乐。
梅:噢,那太好了!告诉我吧。
约翰:我认为他真的很享受!他说他玩得很开心。
梅:那太好了!我为他感到骄傲。

As they finish their conversation, Mei and John pack up. Mei goes on to teach and work on her research paper, while John opens his pharmacy counter at the Willow Market and Pharmacy by 9 a.m.

谈话结束后,梅和约翰收拾行李。 Mei 继续教学并完成她的研究论文,而 John 则在上午 9 点打开 Willow Market and Pharmacy 的药房柜台。

图5:生成式智能体架构
图5:生成式智能体架构。智能体感知环境,观察被写入记忆流;随后检索相关记忆,生成行动;计划和反思也会重新写入记忆流,影响后续行为。

3.4. Emergent Social Behaviors

By interacting with each other, generative agents in Smallville exchange information, form new relationships, and coordinate joint activities. Extending prior work, these social behaviors are emergent rather than pre-programmed.

通过相互交互,Smallville 中的生成式智能体交换信息、形成新的关系并协调联合活动。 扩展之前的工作,这些社会行为是自然发生的而不是预先编程的。

3.4.1. Information Diffusion

As agents notice each other, they may engage in dialogue—as they do so, information can spread from agent to agent. For instance, in a conversation between Sam and Tom at the grocery store, Sam tells Tom about his candidacy in the local election:

当智能体相互注意到时,他们可能会进行对话——当他们这样做时,信息可以在智能体之间传播。 例如,在山姆和汤姆在杂货店的一次谈话中,山姆告诉汤姆他参加地方选举的候选资格:

Sam: Hey Tom, how's it going?
Tom: Good, thanks. What's up?
Sam: Well, I wanted to talk to you about something. I'm actually running for mayor in the upcoming local election.
Tom: Really? That's great news! Why are you running?
Sam: I've been involved in local politics for years now, and I really feel like I can make a difference in our community. Plus, I want to bring new ideas and perspectives to the table.
山姆:嘿汤姆,怎么样?
汤姆:好的,谢谢。这是怎么回事?
Sam:嗯,我想和你谈谈一些事情。实际上我正在参加即将到来的地方选举,竞选市长。
汤姆:真的吗?这是个好消息!你为什么要参选?
Sam:我参与当地政治已有多年,我真的觉得我可以为我们的社区带来改变。另外,我想带来新的想法和观点。

Later that day, after Sam left, Tom and John, who heard the news from another source, discuss Sam's chances of winning the election:

那天晚些时候,萨姆离开后,汤姆和约翰从另一个来源听到了这个消息,讨论了萨姆赢得选举的机会:

John: I heard that Sam Moore is running for mayor in the local election. Do you think he has a good chance of winning?
Tom: I do think he has a good chance. He's been working hard in the community and I think he will get a lot of support. What do you think?
John: I think it's great that he's running. I'm curious to see how the election will turn out.
约翰:我听说Sam Moore正在地方选举中竞选市长。你认为他获胜的机会大吗?
汤姆:我确实认为他有很好的机会。他一直在社区努力工作,我认为他会得到很多支持。你怎么认为?
约翰:我认为他参加竞选真是太好了。我很好奇选举结果如何。

Gradually, Sam's candidacy becomes the talk of the town, with some supporting him and others remaining undecided.

渐渐地,萨姆的候选资格成为全城的热门话题,一些人支持他,另一些人则尚未做出决定。

3.4.2. Relationship Memory

Agents in Smallville form new relationships over time and remember their interactions with other agents. For example, at the start, Sam does not know Latoya Williams. While taking a walk in Johnson Park, Sam runs into Latoya, and they introduce themselves. Latoya mentions that she is working on a photography project: "I'm here to take some photos for a project I'm working on." In a later interaction, Sam's interactions with Latoya indicate a memory of that interaction, as he asks "Hi, Latoya. How is your project going?" and she replies "Hi, Sam. It’s going well!"

Smallville 中的智能体随着时间的推移形成新的关系,并记住他们与其他智能体的交互。 例如,一开始,Sam 不知道 Latoya Williams。 在 Johnson Park 散步时,山姆遇到了拉托亚,他们做了自我介绍。 拉托亚提到她正在从事一个摄影项目:“我来这里是为了为我正在从事的项目拍摄一些照片。 ”在后来的互动中,山姆与拉托亚的互动表明了对那次互动的记忆,因为他问“嗨,拉托亚。 你的项目进行得怎么样? ”她回答说:“嗨,萨姆。一切顺利!”

3.4.3. Coordination

Generative agents coordinate with each other. Isabella Rodriguez, at Hobbs Cafe, is initialized with an intent to plan a Valentine's Day party from 5 to 7 p.m. on February 14th. From this seed, the agent proceeds to invite friends and customers when she sees them at Hobbs Cafe or elsewhere. Isabella then spends the afternoon of the 13th decorating the cafe for the occasion. Maria, a frequent customer and close friend of Isabella's, arrives at the cafe. Isabella asks for Maria's help in decorating for the party, and Maria agrees. Maria's character description mentions that she has a crush on Klaus. That night, Maria invites Klaus, her secret crush, to join her at the party, and he gladly accepts.

生成式智能体相互协调。 位于 Hobbs Cafe 的 Isabella Rodriguez 在初始化时带有一个意图:计划在 2 月 14 日下午 5 点至 7 点举办 Valentine's Day 聚会。 从这个种子开始,当这个智能体在 Hobbs Cafe 或其他地方看到朋友和顾客时,就会邀请他们。 伊莎贝拉随后在 13 日下午为这个场合装饰了咖啡馆。 玛丽亚是伊莎贝拉的常客和密友,她来到了咖啡馆。 伊莎贝拉请求玛丽亚帮忙为派对做装饰,玛丽亚同意了。 玛丽亚的角色描述提到她迷恋克劳斯。 那天晚上,玛丽亚邀请她暗恋的克劳斯一起参加聚会,克劳斯欣然接受。

On Valentine's Day, five agents, including Klaus and Maria, show up at Hobbs Cafe at 5 pm, and they enjoy the festivities (Figure 4). In this scenario, the end user only set Isabella's initial intent to throw a party and Maria's crush on Klaus: the social behaviors of spreading the word, decorating, asking each other out, arriving at the party, and interacting with each other at the party were initiated by the agent architecture.

在 Valentine's Day 上,下午 5 点,Klaus 和 Maria 等五名智能体出现在 Hobbs Cafe,他们享受着节日的气氛(图4)。 在这个场景中,最终用户只设置了伊莎贝拉举办派对的初始意图和玛丽亚对克劳斯的迷恋:传播消息、装饰、约对方出去、到达派对以及在派对上相互互动的社交行为都是由代理架构发起的。

4. Generative Agent Architecture

Generative agents aim to provide a framework for behavior in an open world: one that can engage in interactions with other agents and react to changes in the environment. Generative agents take their current environment and past experiences as input and generate behavior as output. Underlying this behavior is a novel agent architecture that combines a large language model with mechanisms for synthesizing and retrieving relevant information to condition the language model's output. Without these mechanisms, large language models can output behavior, but the resulting agents may not react based on the agent's past experiences, may not make important inferences, and may not maintain long-term coherence. Challenges with long-term planning and coherence remain even with today's most performant models such as GPT-4. Because generative agents produce large streams of events and memories that must be retained, a core challenge of our architecture is to ensure that the most relevant pieces of the agent's memory are retrieved and synthesized when needed.

生成式智能体旨在为开放世界中的行为提供一个框架:一个可以与其他代理交互并对环境变化做出反应的框架。 生成式智能体将当前环境和过去的经验作为输入,并生成行为作为输出。 这种行为的基础是一种新颖的代理架构,它将大型语言模型与合成和检索相关信息的机制相结合,以调节语言模型的输出。 如果没有这些机制,大型语言模型可以输出行为,但生成的智能体可能不会根据智能体过去的经验做出反应,可能不会做出重要的推论,并且可能无法保持长期的一致性。 即使对于当今性能最高的型号(例如 GPT-4),长期规划和一致性方面的挑战仍然存在。 由于生成式智能体会产生大量必须保留的事件和记忆流,因此我们架构的核心挑战是确保在需要时检索和合成代理记忆中最相关的部分。

At the center of our architecture is the memory stream, a database that maintains a comprehensive record of an agent's experience. From the memory stream, records are retrieved as relevant to plan the agent's actions and react appropriately to the environment. Records are recursively synthesized into higher- and higher-level reflections that guide behavior. Everything in the architecture is recorded and reasoned over as a natural language description, allowing the architecture to leverage a large language model.

我们架构的核心是记忆流,它是一个数据库,用于维护智能体经历的完整记录。 系统会从记忆流中检索相关记录,用于规划智能体的行动,并对环境做出适当反应。 记录被递归地合成为指导行为的更高层次的反思。 架构中的所有内容都作为自然语言描述进行记录和推理,从而允许架构利用大型语言模型。

Our current implementation utilizes the gpt3.5-turbo version of ChatGPT. We expect that the architectural basics of generative agents—memory, planning, and reflection—will likely remain the same as language models improve. Newer language models (e.g., GPT-4) will continue to expand the expressive power and performance of the prompts that underpin generative agents. As of writing, however, GPT-4's API was invitation-only, so our agents use ChatGPT.

我们当前的实现使用 ChatGPT 的 gpt3.5-turbo 版本。 我们预计,随着语言模型的改进,生成式智能体的架构基础——记忆、规划和反思——可能会保持不变。 较新的语言模型(例如 GPT-4)将继续扩展支撑生成式智能体的提示的表达能力和性能。 然而,截至撰写本文时,GPT-4 的 API 仅限邀请,因此我们的代理使用 ChatGPT。

图6:记忆检索
图6:检索函数从记忆流中选择与当前情境最相关的一小部分记忆,而不是把完整记忆流全部塞进语言模型上下文。

4.1. Memory and Retrieval

Challenge: Creating generative agents that can simulate human behavior requires reasoning about a set of experiences that is far larger than what should be described in a prompt, as the full memory stream can distract the model and does not even currently fit into the limited context window. Consider the Isabella agent answering the question, "What are you passionate about these days?" Summarizing all of Isabella's experiences to fit in the limited context window of the language model produces an uninformative response, where Isabella discusses topics such as collaborations for events and projects and cleanliness and organization in a cafe. Instead of summarizing, the memory stream described below surfaces relevant memories, resulting in a more informative and specific response that mentions Isabella's passion for making people feel welcome and included, planning events and creating an atmosphere that people can enjoy, such as the Valentine's Day party.

挑战: 要创建能够模拟人类行为的生成式智能体,模型必须能够围绕一组经验进行推理;这组经验远远多于提示词中应该完整描述的内容,因为完整的记忆流既可能分散模型注意力,目前也无法放进有限的上下文窗口。 以智能体 Isabella 回答“你最近热衷于什么?”这个问题为例。 如果为了适应语言模型有限的上下文窗口而总结 Isabella 的所有经历,得到的回答会很笼统,例如 Isabella 会谈到活动和项目协作、咖啡馆的清洁与组织等主题。 相比之下,下面介绍的记忆流并不是简单总结全部经历,而是浮现相关记忆,从而产生更有信息量、更具体的回答,例如提到 Isabella 热衷于让人们感到受欢迎和被接纳、策划活动,并创造人们能够享受的氛围,比如情人节派对。

Approach: The memory stream maintains a comprehensive record of the agent's experience. It is a list of memory objects, where each object contains a natural language description, a creation timestamp, and a most recent access timestamp. The most basic element of the memory stream is an observation, which is an event directly perceived by an agent. Common observations include behaviors performed by the agent themselves or behaviors that agents perceive being performed by other agents or non-agent objects. For instance, Isabella Rodriguez, who works at a coffee shop, might accrue the following observations over time: (1) Isabella Rodriguez is setting out the pastries, (2) Maria Lopez is studying for a Chemistry test while drinking coffee, (3) Isabella Rodriguez and Maria Lopez are conversing about planning a Valentine's day party at Hobbs Cafe, (4) The refrigerator is empty.

方法: 记忆流维护智能体经验的完整记录。 它是一个记忆对象列表,其中每个对象都包含自然语言描述、创建时间戳和最近访问时间戳。 记忆流最基本的元素是“观察”,也就是智能体直接感知到的事件。 常见的观察包括智能体自己执行的行为,或者智能体感知到的、由其他智能体或非智能体对象执行的行为。 例如,在咖啡店工作的 Isabella Rodriguez,随着时间推移可能会积累以下观察:(1) Isabella Rodriguez 正在摆放糕点,(2) Maria Lopez 正在一边喝咖啡一边复习化学考试,(3) Isabella Rodriguez 和 Maria Lopez 正在讨论在 Hobbs Cafe 策划情人节派对,(4) 冰箱是空的。
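
One way to picture the memory stream is as an append-only list of small records. The sketch below assumes only the three fields named in the text (description, creation timestamp, most recent access timestamp); the `kind` field and the `MemoryObject` and `MemoryStream` names are illustrative additions, not taken from the paper.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class MemoryObject:
    description: str            # natural language description of the event
    created_at: datetime        # creation timestamp
    last_accessed_at: datetime  # most recent access timestamp
    kind: str = "observation"   # assumption: a tag to tell observations from other memory types


@dataclass
class MemoryStream:
    records: list[MemoryObject] = field(default_factory=list)

    def add(self, description: str, now: datetime, kind: str = "observation") -> MemoryObject:
        """Append a new memory; a fresh memory counts as accessed at creation time."""
        record = MemoryObject(description, created_at=now, last_accessed_at=now, kind=kind)
        self.records.append(record)
        return record

    def touch(self, record: MemoryObject, now: datetime) -> None:
        """Mark a memory as retrieved so that later recency scoring can use it."""
        record.last_accessed_at = now


stream = MemoryStream()
pastries = stream.add("Isabella Rodriguez is setting out the pastries", datetime(2023, 2, 13, 8, 0))
stream.add("The refrigerator is empty", datetime(2023, 2, 13, 8, 5))
```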

Our architecture implements a retrieval function that takes the agent's current situation as input and returns a subset of the memory stream to pass on to the language model. There are many possible implementations of a retrieval function, depending on what is important for the agent to consider when deciding how to act. In our context, we focus on three main components that, together, produce effective results.

我们的架构实现了一个检索函数,它以智能体的当前情境作为输入,并返回记忆流的一个子集,再将其传递给语言模型。 检索函数有许多可能的实现方式,具体取决于智能体在决定如何行动时需要重点考虑什么。 在我们的背景下,我们专注于三个主要组成部分,它们共同产生有效的结果。

Recency assigns a higher score to memory objects that were recently accessed, so that events from a moment ago or this morning are likely to remain in the agent's attentional sphere. In our implementation, we treat recency as an exponential decay function over the number of sandbox game hours since the memory was last retrieved. Our decay factor is 0.995.

新近度为最近访问的记忆对象分配更高的分数,因此刚刚发生或今天早上发生的事件更可能停留在智能体的注意范围内。 在我们的实现中,我们把新近度视为一个关于“距离该记忆上次被检索已经过了多少沙盒游戏小时”的指数衰减函数。 我们的衰减因子为 0.995
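
The recency term can be written down directly from the description above. The sketch assumes that elapsed time is measured in fractional sandbox hours (the text does not say whether hours are rounded); `recency_score` is a name chosen here for illustration.

```python
from datetime import datetime

DECAY_FACTOR = 0.995  # decay factor stated in the text


def recency_score(last_accessed_at: datetime, now: datetime, decay: float = DECAY_FACTOR) -> float:
    """Exponential decay over sandbox hours since the memory was last retrieved."""
    # Assumption: fractional hours are allowed and negative gaps are clamped to zero.
    hours = max((now - last_accessed_at).total_seconds() / 3600.0, 0.0)
    return decay ** hours


now = datetime(2023, 2, 13, 12, 0)
just_now = recency_score(now, now)                           # no time elapsed
yesterday = recency_score(datetime(2023, 2, 12, 12, 0), now)  # 24 sandbox hours earlier
```

With a factor this close to 1, a memory touched a full sandbox day ago still keeps most of its recency score, so the term separates memories by days rather than by minutes.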

Importance distinguishes mundane from core memories by assigning a higher score to memory objects that the agent believes to be important. For instance, a mundane event, such as eating breakfast in one's room, would yield a low importance score, whereas a breakup with one's significant other would yield a high score. There are many possible implementations of an importance score; we find that directly asking the language model to output an integer score is effective. The full prompt appears below:

重要性通过为智能体认为重要的记忆对象分配更高的分数来区分普通记忆和核心记忆。 例如,一件平凡的事件,例如在房间里吃早餐,会产生较低的重要性得分,而与重要的其他人分手会产生较高的得分。 重要性评分有多种可能的实现方式;我们发现直接要求语言模型输出整数分数是有效的。 完整的提示如下:

On the scale of 1 to 10, where 1 is purely mundane (e.g., brushing teeth, making bed) and 10 is extremely poignant (e.g., a break up, college acceptance), rate the likely poignancy of the following piece of memory.
Memory: buying groceries at The Willows Market and Pharmacy
Rating: <fill in>
按照 1 到 10 的等级,其中 1 是纯粹平凡的(例如,刷牙、整理床铺),10 是极其令人心酸的(例如,分手、大学录取),请评价以下记忆片段可能的心酸程度。
记忆:在 The Willows Market and Pharmacy 购买杂货
评级:<填写>

This prompt returns an integer value of 2 for “cleaning up the room” and 8 for “asking your crush out on a date.” The importance score is generated at the time the memory object is created.

此提示返回整数值 2(表示“清理房间”)和 8(表示“约你喜欢的人出去约会”)。 重要性分数是在创建记忆对象时生成的。
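
A small sketch of how the importance prompt might be filled in and its reply parsed. The template text is copied from the prompt above; the function names, the fallback score of 1, and the assumption that the model's reply may contain extra words around the integer are all illustrative. The language-model call itself is left out, since the text does not specify it.

```python
import re

IMPORTANCE_TEMPLATE = (
    "On the scale of 1 to 10, where 1 is purely mundane "
    "(e.g., brushing teeth, making bed) and 10 is extremely poignant "
    "(e.g., a break up, college acceptance), rate the likely poignancy of the "
    "following piece of memory.\n"
    "Memory: {memory}\n"
    "Rating: <fill in>"
)


def build_importance_prompt(memory: str) -> str:
    """Insert one memory description into the rating prompt."""
    return IMPORTANCE_TEMPLATE.format(memory=memory)


def parse_importance(reply: str, default: int = 1) -> int:
    """Extract the first standalone integer from 1 to 10; fall back to `default` (an assumption)."""
    match = re.search(r"\b(10|[1-9])\b", reply)
    return int(match.group(1)) if match else default


prompt = build_importance_prompt("buying groceries at The Willows Market and Pharmacy")
```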

Relevance assigns a higher score to memory objects that are related to the current situation. What is relevant depends on the answer to, "Relevant to what?", so we condition relevance on a query memory. If the query, for example, is that a student is discussing what to study for a chemistry test with a classmate, memory objects about their breakfast should have low relevance, whereas memory objects about the teacher and schoolwork should have high relevance. In our implementation, we use the language model to generate an embedding vector of the text description of each memory. Then, we calculate relevance as the cosine similarity between the memory's embedding vector and the query memory's embedding vector.

相关性会给与当前情境相关的记忆对象分配更高分数。 哪些内容相关,取决于“和什么相关?”这个问题的答案,因此我们会让相关性以一个查询记忆为条件。 例如,如果查询是“一名学生正在和同学讨论化学考试该复习什么”,那么关于他们早餐的记忆对象应该具有较低相关性,而关于老师和课业的记忆对象应该具有较高相关性。 在我们的实现中,我们使用语言模型为每条记忆的文本描述生成嵌入向量。 然后,我们把该记忆的嵌入向量与查询记忆的嵌入向量之间的余弦相似度作为相关性。
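
Cosine similarity itself is standard and can be sketched without any dependencies. The embedding vectors here are tiny made-up lists; how the real embeddings are produced is outside this sketch.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (hypothetical values).
query = [1.0, 0.0]
schoolwork_memory = [2.0, 0.0]
breakfast_memory = [0.0, 3.0]
```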

To calculate the final retrieval score, we normalize the recency, relevance, and importance scores to the range of [0,1] using min-max scaling. The retrieval function scores all memories as a weighted combination of the three elements:

为了计算最终检索分数,我们使用 min-max scaling 将新近度、相关性和重要性分数归一化到 [0,1] 范围。 检索函数把所有记忆评分为这三个元素的加权组合:

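$$\mathit{score} = \alpha_{\mathit{recency}} \cdot \mathit{recency} + \alpha_{\mathit{importance}} \cdot \mathit{importance} + \alpha_{\mathit{relevance}} \cdot \mathit{relevance}$$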

In our implementation, all αs are set to 1. The top-ranked memories that fit within the language model's context window are included in the prompt.

在我们的实现中,所有 α 都设为 1。 排名最高且能够放入语言模型上下文窗口的记忆会被加入提示中。
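
The scoring step can be sketched as min-max scaling followed by a weighted sum. Two things here are assumptions rather than details from the text: a component whose values are all identical is scaled to 0, and a simple `top_k` cutoff stands in for "as many top-ranked memories as fit in the context window". The function names are illustrative.

```python
from typing import Optional, Sequence


def min_max(values: Sequence[float]) -> list[float]:
    """Scale values to [0, 1]; a constant list maps to all zeros (an assumption)."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_memories(
    recency: Sequence[float],
    importance: Sequence[float],
    relevance: Sequence[float],
    weights: tuple[float, float, float] = (1.0, 1.0, 1.0),  # all alphas set to 1, as in the text
    top_k: Optional[int] = None,
) -> list[tuple[int, float]]:
    """Return (memory index, score) pairs sorted from highest to lowest score."""
    if not (len(recency) == len(importance) == len(relevance)):
        raise ValueError("all three score lists must have the same length")
    a_rec, a_imp, a_rel = weights
    scored = [
        (i, a_rec * r + a_imp * m + a_rel * v)
        for i, (r, m, v) in enumerate(zip(min_max(recency), min_max(importance), min_max(relevance)))
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


ranked = rank_memories(
    recency=[1.0, 0.5, 0.0],
    importance=[2, 8, 5],
    relevance=[0.0, 1.0, 0.5],
    top_k=2,
)
```

In this toy input the second memory wins: it is not the most recent, but it is both the most important and the most relevant, which shows how the three terms trade off once they share the same [0, 1] scale.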

4.2. Reflection

图7:反思树
图7:Klaus Mueller 的反思树。智能体会从底层观察中生成更高层次的反思,反思又可以进一步触发新的反思。

Challenge: Generative agents, when equipped with only raw observational memory, struggle to generalize or make inferences. Consider a scenario in which Klaus Mueller is asked by the user: "If you had to choose one person of those you know to spend an hour with, who would it be?" With access to only observational memory, the agent simply chooses the person with whom Klaus has had the most frequent interactions: Wolfgang, his college dorm neighbor. Unfortunately, Wolfgang and Klaus only ever see each other in passing, and do not have deep interactions. A more desirable response requires that the agent generalize from memories of Klaus spending hours on a research project to generate a higher-level reflection that Klaus is passionate about research, and likewise recognize Maria putting in effort into her own research (albeit in a different field), enabling a reflection that they share a common interest. With the approach below, when Klaus is asked who to spend time with, Klaus chooses Maria instead of Wolfgang.

挑战: 如果生成式智能体只配备原始观察记忆,它们很难进行概括或推理。 考虑这样一个场景:用户问 Klaus Mueller:“如果你必须从认识的人中选择一个人共度一小时,你会选择谁?” 如果只能访问观察记忆,智能体只会选择与 Klaus 互动最频繁的人:他的大学宿舍邻居 Wolfgang。 不幸的是,Wolfgang 和 Klaus 只是偶尔擦肩而过,并没有深入互动。 更理想的回答要求智能体从 Klaus 花费数小时做研究项目的记忆中概括出更高层次的反思,即 Klaus 对研究充满热情;同时也要识别出 Maria 正在投入精力做自己的研究(尽管领域不同),从而形成他们拥有共同兴趣的反思。 采用下面的方法后,当 Klaus 被问到想和谁共度时间时,Klaus 会选择 Maria,而不是 Wolfgang。

Approach: We introduce a second type of memory, which we call a reflection. Reflections are higher-level, more abstract thoughts generated by the agent. Because they are a type of memory, they are included alongside other observations when retrieval occurs. Reflections are generated periodically; in our implementation, we generate reflections when the sum of the importance scores for the latest events perceived by the agents exceeds a threshold (150 in our implementation). In practice, our agents reflected roughly two or three times a day.

方法: 我们引入第二类记忆,称为反思。 反思是由智能体生成的更高层次、更抽象的想法。 由于反思也是一种记忆,因此在检索时,它们会和其他观察一起被纳入候选内容。 反思会周期性生成;在我们的实现中,当智能体最近感知到的事件的重要性分数之和超过阈值时,就会生成反思(我们的实现中阈值为 150)。 在实践中,我们的智能体每天大约会反思两到三次。
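
The reflection trigger can be pictured as a running sum compared against the threshold. The threshold of 150 comes from the text; resetting the sum after each reflection is an assumption made here to define "latest events", and `ReflectionTrigger` is a hypothetical name.

```python
class ReflectionTrigger:
    """Accumulates importance scores and signals when a reflection is due."""

    def __init__(self, threshold: int = 150) -> None:
        self.threshold = threshold
        self.accumulated = 0

    def observe(self, importance: int) -> bool:
        """Add one event's importance; return True when the sum exceeds the threshold."""
        self.accumulated += importance
        if self.accumulated > self.threshold:
            self.accumulated = 0  # assumption: start counting afresh after reflecting
            return True
        return False


trigger = ReflectionTrigger()
# Fifteen events of importance 10 reach exactly 150, which does not yet exceed the threshold.
results = [trigger.observe(10) for _ in range(15)]
fired_on_next = trigger.observe(10)
```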

The first step in reflection is for the agent to determine what to reflect on, by identifying questions that can be asked given the agent's recent experiences. We query the large language model with the 100 most recent records in the agent's memory stream (e.g., "Klaus Mueller is reading a book on gentrification", "Klaus Mueller is conversing with a librarian about his research project", "desk at the library is currently unoccupied") and prompt the language model, "Given only the information above, what are 3 most salient high-level questions we can answer about the subjects in the statements?" The model's response generates candidate questions: for example, What topic is Klaus Mueller passionate about? and What is the relationship between Klaus Mueller and Maria Lopez? We use these generated questions as queries for retrieval, and gather relevant memories (including other reflections) for each question. Then we prompt the language model to extract insights and cite the particular records that served as evidence for the insights. The full prompt is as follows:

反思的第一步,是让智能体根据自己最近的经历,找出值得追问的问题,从而确定要反思什么。 我们用智能体记忆流中最近的 100 条记录来查询大型语言模型(例如,“Klaus Mueller 正在阅读一本关于社区绅士化的书”、“Klaus Mueller 正在与图书管理员谈论他的研究项目”、“图书馆的书桌目前无人占用”),并提示语言模型:“仅根据上述信息,我们可以回答哪些关于这些陈述主体的、最突出的 3 个高层问题?” 模型的回答会生成候选问题,例如:“Klaus Mueller 热衷于什么主题?”以及“Klaus Mueller 和 Maria Lopez 是什么关系?” 我们使用这些生成的问题作为检索查询,并为每个问题收集相关记忆(包括其他反思)。 然后,我们提示语言模型提取洞察,并引用支撑这些洞察的具体记录。 完整提示如下:

Statements about Klaus Mueller
1. Klaus Mueller is writing a research paper
2. Klaus Mueller enjoys reading a book on gentrification
3. Klaus Mueller is conversing with Ayesha Khan about exercising [...]
What 5 high-level insights can you infer from the above statements? (example format: insight (because of 1, 5, 3))
关于 Klaus Mueller 的陈述
1. Klaus Mueller 正在写一篇研究论文
2. Klaus Mueller 喜欢阅读关于绅士化的书
3. Klaus Mueller 正在与阿伊莎·汗谈论锻炼 [...]
从上述陈述中你可以推断出哪 5 个高层洞察?(示例格式:洞察(因为 1、5、3))

This process generates statements such as Klaus Mueller is dedicated to his research on gentrification (because of 1, 2, 8, 15). We parse and store the statement as a reflection in the memory stream, including pointers to the memory objects that were cited.

这个过程会生成诸如“Klaus Mueller 致力于他的绅士化研究”之类的陈述(因为 1、2、8、15)。 我们解析该语句,并将其作为反思存储到记忆流中,同时保留指向被引用记忆对象的指针。
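
Parsing an insight line into its statement and its cited record numbers might look like the sketch below. It assumes the model follows the "insight (because of 1, 5, 3)" format requested in the prompt, optionally with a leading list number; the regular expression and the `parse_insight` name are illustrative, and real model output may need more lenient handling.

```python
import re
from typing import Optional

# Assumed output shape: optional "1. " prefix, the insight text, then "(because of 1, 2, 8)".
_INSIGHT = re.compile(
    r"^\s*(?:\d+[.)]\s*)?(?P<text>.+?)\s*\(because of (?P<ids>[\d,\s]+)\)\s*\.?\s*$"
)


def parse_insight(line: str) -> Optional[tuple[str, list[int]]]:
    """Return (insight text, cited statement numbers), or None if the line does not match."""
    match = _INSIGHT.match(line)
    if match is None:
        return None
    cited = [int(token) for token in re.findall(r"\d+", match.group("ids"))]
    return match.group("text"), cited


parsed = parse_insight(
    "Klaus Mueller is dedicated to his research on gentrification (because of 1, 2, 8, 15)"
)
```

The cited numbers index into the numbered statements of the prompt, so they can be mapped back to memory objects and stored as pointers alongside the new reflection.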

Reflection explicitly allows the agents to reflect not only on their observations but also on other reflections: for example, the second statement about Klaus Mueller above is a reflection that Klaus previously had, not an observation from his environment. As a result, agents generate trees of reflections: the leaf nodes of the tree represent the base observations, and the non-leaf nodes represent thoughts that become more abstract and higher-level the higher up the tree they are.

反思明确允许代理不仅反思他们的观察,而且还反思其他反思:例如,上面关于 Klaus Mueller 的第二个陈述是 Klaus 之前的反思,而不是来自他的环境的观察。 结果,代理生成了反思树:树的叶节点代表基本观察,非叶节点代表思想,它们在树的高层变得更加抽象和更高层次。

4.3. Planning and Reacting

Challenge: While a large language model can generate plausible behavior in response to situational information, agents need to plan over a longer time horizon to ensure that their sequence of actions is coherent and believable. If we prompt a language model with Klaus's background, describe the time, and ask what action he ought to take at the given moment, Klaus would eat lunch at 12 pm, but then again at 12:30 pm and 1 pm, despite having already eaten his lunch twice. Optimizing for believability in the moment sacrifices believability over time. To overcome this issue, planning is essential. With the approach described below, Klaus's afternoon plan is less gluttonous: he has lunch at Hobbs Cafe while reading at 12pm, works on his research paper at the school library at 1pm, and takes a break for a walk in the park at 3pm.

挑战: 虽然大型语言模型可以根据情境信息生成合理的行为,但智能体需要在更长的时间范围内进行规划,以确保他们的行动顺序是连贯且可信的。 如果我们用 Klaus 的背景提示一个语言模型,描述时间,并询问他在给定时刻应该采取什么行动,Klaus 会在中午 12 点吃午餐,然后在中午 12:30 和下午 1 点再次吃午餐,尽管他已经吃过两次午餐了。 针对当下的可信度进行优化会随着时间的推移而牺牲可信度。 为了解决这个问题,规划至关重要。 通过下面描述的方法,Klaus 的下午计划就不那么贪吃了:中午 12 点他在 Hobbs Cafe 吃午餐,同时看书,下午 1 点在学校图书馆写研究论文,下午 3 点在公园散步休息一下。

Approach: Plans describe a future sequence of actions for the agent, and help keep the agent's behavior consistent over time. A plan includes a location, a starting time, and a duration. For instance, Klaus Mueller, who is dedicated to his research and has an impending deadline, may choose to spend his day working at his desk drafting his research paper. (In this way, he bears at least a passing resemblance to the authors of this paper.) An entry in a plan might state, for example: for 180 minutes from 9am, February 12th, 2023, at Oak Hill College Dorm: Klaus Mueller’s room: desk, read and take notes for research paper. Like reflections, plans are stored in the memory stream and are included in the retrieval process. This allows the agent to consider observations, reflections, and plans all together when deciding how to behave. Agents may change their plans midstream if needed.

方法: 计划描述了代理未来的操作顺序,并有助于保持代理的行为随着时间的推移保持一致。 计划包括地点、开始时间和持续时间。 例如,Klaus Mueller,他专注于他的研究,并且截止日期即将到来,(并且,通过这种方式,他与本文的作者至少有一些相似之处。 )可能会选择花一天的时间在办公桌前起草他的研究论文。 例如,计划中的条目可能会注明:从 2023 年 2 月 12 日上午 9 点开始,在 Oak Hill College Dorm:Klaus Mueller 的房间:书桌上,用 180 分钟阅读研究论文并做笔记。 与反思一样,计划也会存储在记忆流中,并纳入检索过程。 这使得智能体在决定如何行动时,可以综合考虑观察、反思和计划;如果需要,智能体也可以在过程中改变计划。
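
A plan entry with the three stated parts (location, starting time, duration) plus the activity text can be sketched as a small record. The `PlanEntry` name, the derived `end` property, and the ISO-style date rendering in `describe` are choices made for this illustration; the paper's own wording of the date differs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PlanEntry:
    location: str          # e.g. "Oak Hill College Dorm: Klaus Mueller's room: desk"
    start: datetime        # starting time
    duration_minutes: int  # duration
    activity: str          # what the agent intends to do

    @property
    def end(self) -> datetime:
        """Time at which the entry finishes."""
        return self.start + timedelta(minutes=self.duration_minutes)

    def describe(self) -> str:
        """Render the entry as one natural-language line that can be stored as a memory."""
        return (
            f"for {self.duration_minutes} minutes from {self.start:%Y-%m-%d %H:%M}, "
            f"at {self.location}: {self.activity}"
        )


entry = PlanEntry(
    location="Oak Hill College Dorm: Klaus Mueller's room: desk",
    start=datetime(2023, 2, 12, 9, 0),
    duration_minutes=180,
    activity="read and take notes for research paper",
)
```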

It would be unrealistic and uninteresting for an artist agent to plan on painting while sitting at a pharmacy counter for four hours without moving. A more desirable plan would involve the agent taking the necessary time to gather materials, mix paint, take breaks, and clean up during the four-hour period in their home studio. To create such plans, our approach starts top-down and then recursively generates more detail. The first step is to create a plan that outlines the day's agenda in broad strokes. To create the initial plan, we prompt the language model with the agent's summary description (e.g., name, traits, and a summary of their recent experiences) and a summary of their previous day. A full example prompt is below, which is unfinished at the bottom for the language model to complete:

对于一个艺术家智能体来说,计划坐在药房柜台前一动不动地画四个小时的画,是不现实也无趣的。 更理想的计划是,智能体在自己的家庭工作室的四个小时内花必要的时间收集材料、调配颜料、休息和清理。 为了创建这样的计划,我们的方法从自上而下开始,然后递归地生成更多细节。 第一步是制定一个计划,概括地概述当天的议程。 为了创建初始计划,我们用智能体的摘要描述(例如,姓名、特征和他们最近的经历的摘要)和前一天的摘要来提示语言模型。 下面是一个完整的示例提示,底部未完成,需要语言模型来完成:

Name: Eddy Lin (age: 19)
Innate traits: friendly, outgoing, hospitable
Eddy Lin is a student at Oak Hill College studying music theory and composition. He loves to explore different musical styles and is always looking for ways to expand his knowledge. Eddy Lin is working on a composition project for his college class. He is taking classes to learn more about music theory. Eddy Lin is excited about the new composition he is working on but he wants to dedicate more hours in the day to work on it in the coming days
On Tuesday February 12, Eddy 1) woke up and completed the morning routine at 7:00 am, […] 6) got ready to sleep around 10 pm.
Today is Wednesday February 13. Here is Eddy’s plan today in broad strokes: 1)
姓名:Eddy Lin(年龄:19)
天生特质:友善、外向、好客
Eddy Lin 是橡树山学院的一名学生,学习音乐理论和作曲。他喜欢探索不同的音乐风格,并且一直在寻找扩展知识的方法。 Eddy Lin 正在为他的大学课程做一个作曲项目。他正在上课以了解更多有关音乐理论的知识。 Eddy Lin 对他正在创作的新作品感到兴奋,但他希望在接下来的几天里每天投入更多的时间来创作。
2 月 12 日星期二,Eddy 1) 早上 7:00 起床并完成早上的例行公事,[...] 6) 准备晚上 10 点左右睡觉。
今天是 2 月 13 日星期三。以下是 Eddy 今天的计划:1)

This generates a rough sketch of the agent's plan for a day, divided into five to eight chunks: “1) wake up and complete the morning routine at 8:00 am, 2) go to Oak Hill College to take classes starting 10:00 am, […] 5) work on his new music composition from 1:00 pm to 5:00 pm, 6) have dinner at 5:30 pm, 7) finish school assignments and go to bed by 11:00 pm.”

这会生成智能体一天计划的粗略草图,分为五到八个部分:“1) 早上 8:00 起床并完成早晨的例行公事,2) 上午 10:00 去橡树山学院上课,[…]5) 从下午 1:00 到下午 5:00 创作新音乐,6) 下午 5:30 吃晚饭,7) 完成学校作业并在晚上 11:00 前上床睡觉。”

The agent saves this plan in the memory stream and then recursively decomposes it to create finer-grained actions, first into hour-long chunks of actions. For example, Eddy’s plan to work on his new music composition from 1:00 pm to 5:00 pm becomes: 1:00 pm: start by brainstorming some ideas for his music composition [...] 4:00 pm: take a quick break and recharge his creative energy before reviewing and polishing his composition. We then recursively decompose this again into 5- to 15-minute chunks: e.g., 4:00 pm: grab a light snack, such as a piece of fruit, a granola bar, or some nuts. 4:05 pm: take a short walk around his workspace [...] 4:50 pm: take a few minutes to clean up his workspace. This process can be adjusted to match the desired granularity.

智能体将这个计划保存在记忆流中,然后递归地分解它,以创建更细粒度的动作;首先分解为一小时左右的动作块:例如,Eddy 从下午 1:00 到下午 5:00 创作新音乐作品的计划,会变成“下午 1:00:先为音乐作品构思一些想法 [...] 下午 4:00:在审查和完善作品之前,短暂休息一下,恢复创作精力”。 然后,我们再次递归地将其分解为 5 到 15 分钟的块:例如,下午 4:00:吃点小吃,例如一块水果、一根格兰诺拉麦片棒或一些坚果。 下午 4:05:在他的工作空间周围散步 [...] 下午 4:50:花几分钟清理他的工作空间。 可以调整此过程以匹配所需的粒度。
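
The top-down, recursive shape of this decomposition can be sketched independently of the language model. In the sketch below, `even_split` is a placeholder that simply cuts a block into equal chunks and copies the activity text; in the described system that step would be a language-model call that writes a distinct action for each chunk. All names (`PlanNode`, `decompose`, `even_split`, `leaves`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PlanNode:
    start_minute: int  # minutes since midnight
    duration: int      # length of the block in minutes
    activity: str
    children: list["PlanNode"] = field(default_factory=list)


Expander = Callable[[PlanNode, int], list[PlanNode]]


def decompose(node: PlanNode, chunk_sizes: list[int], expand: Expander) -> None:
    """Recursively refine a plan block, one granularity level per entry in chunk_sizes."""
    if not chunk_sizes:
        return
    node.children = expand(node, chunk_sizes[0])
    for child in node.children:
        decompose(child, chunk_sizes[1:], expand)


def even_split(node: PlanNode, chunk: int) -> list[PlanNode]:
    """Placeholder for the language-model step: equal chunks, same activity text."""
    return [
        PlanNode(node.start_minute + offset, min(chunk, node.duration - offset), node.activity)
        for offset in range(0, node.duration, chunk)
    ]


def leaves(node: PlanNode) -> list[PlanNode]:
    """Collect the finest-grained actions in chronological order."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


# Eddy's 1:00 pm to 5:00 pm block, refined to hour-long and then 15-minute chunks.
block = PlanNode(start_minute=13 * 60, duration=240, activity="work on new music composition")
decompose(block, [60, 15], even_split)
actions = leaves(block)
```

Changing the `chunk_sizes` list is one way to express the adjustable granularity mentioned above.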

4.3.1. Reacting and Updating Plans

Generative agents operate in an action loop where, at each time step, they perceive the world around them and those perceived observations are stored in their memory stream. We prompt the language model with these observations to decide whether the agent should continue with their existing plan, or react. Standing at an easel and painting, for example, might trigger an observation of the easel, but this is unlikely to prompt a reaction. However, if Eddy's father John records that he sees Eddy taking a short walk in the house garden, the outcome is different. The prompt is below, with [Agent's Summary Description] standing in for a dynamically generated, paragraph-long summary of the agent's overall goals and disposition, which is described in the appendix:

生成式智能体在一个动作循环中运行,在每个时间步,它们感知周围的世界,并将这些感知到的观察结果存储在它们的记忆流中。 我们用这些观察结果提示语言模型,以决定智能体是否应该继续执行现有计划,或者做出反应。 例如,站在画架前绘画可能会引发对画架的观察,但这不太可能引发反应。 然而,如果 Eddy 的父亲 John 记录他看到 Eddy 在花园里散步,结果就不同了。 提示如下,其中 [Agent's Summary Description] 代表动态生成的一段智能体总体目标和性格倾向摘要;该摘要在附录中描述:

[Agent's Summary Description]
It is February 13, 2023, 4:56 pm.
John Lin’s status: John is back home early from work.
Observation: John saw Eddy taking a short walk around his workplace.
Summary of relevant context from John’s memory: Eddy Lin is John Lin’s son. Eddy Lin has been working on a music composition for his class. Eddy Lin likes to walk around the garden when he is thinking about or listening to music.
Should John react to the observation, and if so, what would be an appropriate reaction?
[智能体的摘要描述]
现在是 2023 年 2 月 13 日下午 4:56。
John Lin 的状态:John 下班早回家了。
观察:John 看到 Eddy 在他的工作场所周围散步。
John 记忆中的相关背景摘要:Eddy Lin 是 John Lin 的儿子。 Eddy Lin 一直在为他的班级创作音乐。 Eddy Lin 思考或听音乐时喜欢在花园里散步。
John 应该对这一观察做出反应吗?如果应该,怎样的反应才合适?

The context summary is generated through two prompts that retrieve memories via the queries "What is [observer]’s relationship with the [observed entity]?" and "[Observed entity] is [action status of the observed entity]", and their answers summarized together. The output suggests that John could consider asking Eddy about his music composition project. We then regenerate the agent's existing plan starting from the time when the reaction takes place. Finally, if the action indicates an interaction between agents, we generate their dialogue.

上下文摘要通过两个提示生成,这两个提示分别用查询“[观察者]与[被观察实体]的关系是什么?”和“[被观察实体]正在[被观察实体的动作状态]”来检索记忆,并将两者的回答汇总在一起。 输出表明 John 可以考虑向 Eddy 询问他的音乐创作项目。然后,我们从反应发生的时刻开始重新生成智能体的现有计划。 最后,如果该动作表明智能体之间存在交互,我们就会生成他们的对话。
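As a rough illustration of how the reaction prompt and its two retrieval queries might be assembled, here is a minimal sketch. The function names and parameters are hypothetical; the wording follows the templates quoted above, lightly adapted, and the retrieval and language-model calls themselves are left out.

```python
from typing import List


def context_queries(observer: str, observed: str, action_status: str) -> List[str]:
    """The two retrieval queries used to build the context summary."""
    return [
        f"What is {observer}'s relationship with {observed}?",
        f"{observed} is {action_status}",
    ]


def build_reaction_prompt(summary: str, now: str, name: str, status: str,
                          observation: str, context: str) -> str:
    """Fill in the reaction prompt template, one field per line."""
    first_name = name.split()[0]
    return "\n".join([
        summary,
        f"It is {now}.",
        f"{name}'s status: {status}",
        f"Observation: {observation}",
        f"Summary of relevant context from {first_name}'s memory: {context}",
        f"Should {first_name} react to the observation, and if so, "
        f"what would be an appropriate reaction?",
    ])
```

The string returned by `build_reaction_prompt` would then be sent to the language model; how its answer is parsed into "continue" versus "react" is not specified here.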

4.3.2. Dialogue

Agents converse as they interact with each other. We generate agents' dialogue by conditioning their utterances on their memories about each other. For example, when John initiates his conversation with Eddy, we generate John's first utterance by using his summarized memory about Eddy and the intended reaction when he decided to ask Eddy about his composition project:

智能体在相互交互时会进行交谈。 我们以智能体对彼此的记忆为条件来生成它们的话语,从而生成对话。 例如,当 John 开始与 Eddy 对话时,我们使用他关于 Eddy 的记忆摘要,以及他决定向 Eddy 询问作曲项目时的预期反应,来生成 John 的第一句话:

[Agent’s Summary Description]
It is February 13, 2023, 4:56 pm.
John Lin’s status: John is back home early from work.
Observation: John saw Eddy taking a short walk around his workplace.
Summary of relevant context from John’s memory: Eddy Lin is John Lin’s son. Eddy Lin has been working on a music composition for his class. Eddy Lin likes to walk around the garden when he is thinking about or listening to music.
John is asking Eddy about his music composition project. What would he say to Eddy?
[智能体的摘要描述]
现在是 2023 年 2 月 13 日下午 4:56。
John Lin 的状态:John 提前下班回家。
观察:John 看到 Eddy 在他的工作场所周围散步。
John 记忆中的相关背景摘要:Eddy Lin 是 John Lin 的儿子。Eddy Lin 一直在为他的课程创作音乐。Eddy Lin 在思考或听音乐时喜欢在花园里散步。
John 正在询问 Eddy 的音乐创作项目。他会对 Eddy 说什么?

The result: "Hey Eddy, how's the music composition project for your class coming along?" From Eddy's perspective, John initiating the dialogue is seen as an event to which he may want to react. So, just as John did, Eddy retrieves and summarizes his memory about his relationship with John, as well as his memory that may be related to John's last utterance in the dialogue. If he decides to respond, we generate Eddy's utterance using his summarized memory and the current dialogue history:

结果是:“嘿 Eddy,你们班的音乐创作项目进行得怎么样了?”从 Eddy 的角度来看,John 发起对话被视为一个他可能想要做出反应的事件。 于是,正如 John 所做的那样,Eddy 检索并总结他关于自己与 John 关系的记忆,以及可能与 John 在对话中最后一句话有关的记忆。 如果他决定做出回应,我们会使用他的记忆摘要和当前的对话历史来生成 Eddy 的话语:

[Agent’s Summary Description]
It is February 13, 2023, 4:56 pm.
Eddy Lin’s status: Eddy is taking a short walk around his workplace.
Observation: John is initiating a conversation with Eddy.
Summary of relevant context from Eddy’s memory: John Lin is Eddy Lin’s father. John Lin is caring and is interested to learn more about Eddy Lin’s school work. John Lin knows that Eddy Lin is working on a music composition.
Here is the dialogue history:
John: Hey Eddy, how's the music composition project for your class coming along?
How would Eddy respond to John?
[智能体的摘要描述]
现在是2023年2月13日下午4点56分。
Eddy Lin的状态:Eddy正在他的工作场所周围散步。
观察:John 正在向 Eddy 发起对话。
Eddy记忆中的相关上下文摘要:John Lin是Eddy Lin的父亲。 John Lin很有爱心,有兴趣了解更多关于Eddy Lin的学校作业。 John Lin知道Eddy Lin正在创作音乐。
以下是对话历史:
John:嘿 Eddy,你们班的音乐创作项目进行得怎么样了?
Eddy 会如何回应 John?

This generates Eddy's response: "Hey Dad, it's going well. I've been taking walks around the garden to clear my head and get some inspiration." The continuation of this dialogue is generated using the same mechanism until one of the two agents decides to end the dialogue.

这生成了 Eddy 的回应:“嘿爸爸,进展顺利。 我一直在花园里散步,好让头脑清醒并获得一些灵感。”后续对话使用相同的机制生成,直到两个智能体中的一方决定结束对话。
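The turn-taking mechanism can be sketched as a simple loop. This is an illustrative sketch under assumptions: `speak` is a hypothetical callback that stands in for memory retrieval plus the language-model call and returns `None` when the speaker decides to end the dialogue, and `max_turns` is a safety cap added here that the text does not mention.

```python
from typing import Callable, List, Optional, Tuple

History = List[Tuple[str, str]]  # (speaker, utterance) pairs

# Hypothetical stand-in: (speaker, listener, history) -> utterance,
# or None if the speaker decides to end the dialogue.
Speak = Callable[[str, str, History], Optional[str]]


def run_dialogue(initiator: str, responder: str, speak: Speak,
                 max_turns: int = 20) -> History:
    """Alternate speakers, feeding each one the dialogue history so far."""
    history: History = []
    speaker, listener = initiator, responder
    for _ in range(max_turns):
        utterance = speak(speaker, listener, history)
        if utterance is None:   # one agent decided to end the dialogue
            break
        history.append((speaker, utterance))
        speaker, listener = listener, speaker
    return history
```

Each call to `speak` sees the full history, which mirrors how the prompt above includes "Here is the dialogue history" before asking how the agent would respond.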

5. Sandbox Environment Implementation

The Smallville sandbox game environment is built using the Phaser web game development framework. The visual environment sprites, including agent avatars, as well as an environment map and collision map that we authored, are imported into Phaser.

Smallville 沙盒游戏环境使用 Phaser Web 游戏开发框架搭建。 视觉环境精灵图(包括智能体头像)以及我们编写的环境地图和碰撞地图,都被导入到 Phaser 中。

We supplement the sandbox development framework with a server that makes the sandbox information available to generative agents and enables generative agents to move and influence the sandbox environment. The server maintains a JSON data structure that contains information about each agent in the sandbox world, including their current location, a description of their current action, and the sandbox object they are interacting with. At each sandbox time step, the sandbox server parses the JSON for any changes coming from the generative agents, moves the agents to their new positions, and updates the status of any sandbox objects that the agents are interacting with (e.g., changing the status of the coffee machine from "idle" to "brewing coffee" if an agent's action is "making espresso for a customer @ Hobbs Cafe: counter: coffee machine"). The sandbox server is also responsible for sending all agents and objects that are within a preset visual range for each agent to that agent's memory, so the agent can react appropriately. The agent's output action then updates the JSON, and the process loops for the next time step.

我们为沙盒开发框架补充了一个服务器,该服务器使沙盒信息可供生成式智能体使用,并使生成式智能体能够移动和影响沙盒环境。 服务器维护一个 JSON 数据结构,其中包含沙盒世界中每个智能体的信息,包括它们的当前位置、当前动作的描述以及它们正在交互的沙盒对象。 在每个沙盒时间步,沙盒服务器会解析 JSON 以查找来自生成式智能体的任何更改,将智能体移动到新位置,并更新智能体正在交互的任何沙盒对象的状态(例如,如果智能体的动作是“making espresso for a customer @ Hobbs Cafe: counter: coffee machine”,则将咖啡机的状态从“idle”更改为“brewing coffee”)。 沙盒服务器还负责把每个智能体预设视野范围内的所有智能体和对象发送到该智能体的记忆中,以便智能体能够做出适当反应。 然后,智能体输出的动作会更新 JSON,该过程在下一个时间步继续循环。
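One server time step, as described above, might look roughly like the following sketch. The field names (`position`, `action`, `object`, `object_status`), the tile-grid coordinates, and the Chebyshev distance used for the visual range are all assumptions made for illustration; the paper does not specify the JSON schema or the range metric.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]


def within_range(a: Position, b: Position, visual_range: int) -> bool:
    """Chebyshev distance on a tile grid (an assumed metric)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1])) <= visual_range


def step(world: dict, updates: Dict[str, dict], visual_range: int) -> Dict[str, List[str]]:
    """Apply agents' updates to the world state, then return what each agent perceives."""
    # 1. Apply changes coming from the generative agents.
    for name, update in updates.items():
        agent = world["agents"][name]
        agent["position"] = tuple(update["position"])
        agent["action"] = update["action"]
        agent["object"] = update.get("object")
        if agent["object"] is not None:
            world["objects"][agent["object"]]["status"] = update["object_status"]

    # 2. Collect every other agent and every object within each agent's visual range.
    percepts: Dict[str, List[str]] = {}
    for name, agent in world["agents"].items():
        seen = [other for other, o in world["agents"].items()
                if other != name
                and within_range(agent["position"], o["position"], visual_range)]
        seen += [obj for obj, o in world["objects"].items()
                 if within_range(agent["position"], o["position"], visual_range)]
        percepts[name] = seen
    return percepts
```

The returned percepts correspond to what the server would write into each agent's memory stream before the agent produces its next action.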

End users initialize a new agent with a brief natural language description, as in the paragraph about John Lin in Section 3.1. In our implementation, we split this semicolon-delimited list of characteristics up into a set of memories. These serve as the initial memories that determine the agent's behavior. These memories are initial starting points: as the agents gain more experience in the sandbox world, and as more records saturate the memory stream, the agent's summary and behavior will evolve.

最终用户使用简短的自然语言描述初始化一个新代理,如第 3.1 节中有关 John Lin 的段落。 在我们的实现中,我们将这个以分号分隔的特征列表拆分成一组记忆。 这些记忆会作为决定智能体行为的初始记忆。 这些记忆是最初的起点:随着智能体在沙盒世界中获得更多经验,并且随着更多记录进入记忆流,智能体的摘要和行为也会随之演变。
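Splitting the semicolon-delimited description into initial memories is straightforward; a minimal sketch follows. The function name `seed_memories` is hypothetical, and the sample input is an illustrative string built from facts mentioned in this section rather than the actual seed paragraph from Section 3.1.

```python
from typing import List


def seed_memories(description: str) -> List[str]:
    """Split a semicolon-delimited description into one memory per characteristic."""
    return [part.strip() for part in description.split(";") if part.strip()]


# Illustrative input only (not the paper's actual seed description).
initial = seed_memories(
    "Eddy Lin is John Lin's son; Eddy Lin is working on a music composition"
)
```

Each resulting string would then be stored as a separate record at the start of the agent's memory stream.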

5.1. From Structured World Environments to Natural Language, and Back Again

The architecture of generative agents operates using natural language. Therefore, we need a mechanism to ground the agent's reasoning to the sandbox world. To achieve this, we represent the sandbox environment—areas and objects—as a tree data structure, with an edge in the tree indicating a containment relationship in the sandbox world. We convert this tree into natural language to pass to the generative agents. For instance, "stove" being a child of "kitchen" is rendered into "there is a stove in the kitchen."

生成式智能体的架构使用自然语言进行操作。 因此,我们需要一种机制将智能体的推理落地到沙盒世界。 为了实现这一目标,我们将沙盒环境(区域和对象)表示为树形数据结构,树中的边表示沙盒世界中的包含关系。 我们将这棵树转换为自然语言以传递给生成式智能体。 例如,“stove”是“kitchen”的子节点,会被渲染成“there is a stove in the kitchen”(厨房里有一个炉子)。
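The conversion from containment tree to natural language can be sketched with nested dictionaries. This is an assumed representation (each key is an area or object, each value the dictionary of things it contains), and `describe_tree` is a hypothetical name; the sentence template follows the "stove" example above.

```python
from typing import Dict, List, Optional

Tree = Dict[str, "Tree"]  # name -> contained areas/objects


def describe_tree(tree: Tree, parent: Optional[str] = None) -> List[str]:
    """Render every parent-child containment edge as one sentence."""
    sentences: List[str] = []
    for name, children in tree.items():
        if parent is not None:
            article = "an" if name[0].lower() in "aeiou" else "a"
            sentences.append(f"there is {article} {name} in the {parent}.")
        sentences.extend(describe_tree(children, name))
    return sentences
```

For the tree `{"kitchen": {"stove": {}}}` this produces the single sentence "there is a stove in the kitchen."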

Agents build individual tree representations of the environment as they navigate it — subgraphs of the overall sandbox environment tree. We initialize each agent with an environment tree capturing the spaces and objects that the agent should be aware of: the rooms and objects in their living quarters, their workplace, and commonly visited stores and shops. As the agents navigate the sandbox world, they update this tree to reflect newly perceived areas. Agents are not omniscient: their tree may get out of date as they leave an area, and is updated when they re-enter the area.

代理在导航时构建环境的单独树表示 - 整个沙箱环境树的子图。 我们用环境树初始化每个智能体,捕获智能体应该注意的空间和对象:它们居住区中的房间和对象、工作场所,以及经常访问的商店。 当智能体在沙盒世界中导航时,它们会更新这棵树,以反映新感知到的区域。 智能体并非无所不知:当他们离开某个区域时,他们的树可能会过时,并在他们重新进入该区域时更新。

To determine the appropriate location for each action, we traverse the agent's stored environment tree and flatten a portion of it into natural language to prompt the language model. Recursively starting at the root of the agent's environment tree, we prompt the model to find the most suitable area. For example, if Eddy's agent indicated that he should take a short walk around his workspace:

为了确定每个动作的适当位置,我们遍历智能体存储的环境树,并将其中一部分展平为自然语言来提示语言模型。 从智能体环境树的根节点开始,我们递归地提示模型找到最合适的区域。 例如,如果 Eddy 的智能体表示他应该在工作空间周围短暂散步:

[Agent’s Summary Description]
Eddy Lin is currently in {The Lin family’s house: Eddy Lin’s bedroom: desk} that has Mei and John Lin’s bedroom, Eddy Lin’s bedroom, common room, kitchen, bathroom, and garden.
Eddy Lin knows of the following areas: {The Lin family’s house, Johnson Park, Harvey Oak Supply Store, The Willows Market and Pharmacy, Hobbs Cafe, The Rose and Crown Pub}.
* Prefer to stay in the current area if the activity can be done there.
Eddy Lin is planning to take a short walk around his workspace. Which area should Eddy Lin go to?
[智能体概要描述]
Eddy Lin 目前位于林家:Eddy Lin 的卧室:书桌;该住宅包含 Mei 和 John Lin 的卧室、Eddy Lin 的卧室、公共房间、厨房、浴室和花园。
Eddy Lin 知道以下区域:The Lin family’s house、Johnson Park、Harvey Oak Supply Store、The Willows Market and Pharmacy、Hobbs Cafe、The Rose and Crown Pub。
* 如果当前区域可以完成该活动,则优先留在当前区域。
Eddy Lin 计划在自己的工作空间附近短暂散步。Eddy Lin 应该去哪个区域?

This outputs The Lin family’s house. We then use the same process recursively to determine the most appropriate subarea within the chosen area until we reach a leaf node of the agent's environment tree. In the example above, the result of this traversal is The Lin family’s house: garden: house garden. Finally, we use traditional game path algorithms to animate the agent's movement so that it travels to the location indicated by the leaf node.

这会输出 The Lin family’s house。 然后,我们递归地使用相同的过程来确定所选区域内最合适的子区域,直到到达智能体环境树的叶节点。 在上例中,这次遍历的结果是 The Lin family’s house: garden: house garden。 最后,我们使用传统的游戏寻路算法为智能体的移动制作动画,使其行进到叶节点指示的位置。
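The recursive descent to a leaf node can be sketched as follows. This is an illustration under assumptions: the environment tree is a nested dictionary, and `choose` is a hypothetical callback standing in for the language-model prompt that picks one area from the options at each level.

```python
from typing import Callable, Dict, List, Tuple

Tree = Dict[str, "Tree"]  # area -> sub-areas/objects; {} marks a leaf

# Hypothetical stand-in for the prompt: (activity, path so far, options) -> chosen option
Choose = Callable[[str, List[str], List[str]], str]


def choose_location(tree: Tree, activity: str, choose: Choose,
                    path: Tuple[str, ...] = ()) -> List[str]:
    """Descend one level per choice until a leaf of the agent's tree is reached."""
    if not tree:                      # nothing below: `path` ends at a leaf
        return list(path)
    options = list(tree.keys())
    picked = choose(activity, list(path), options)
    if picked not in tree:
        raise ValueError(f"model chose unknown area: {picked!r}")
    return choose_location(tree[picked], activity, choose, path + (picked,))
```

Validating the model's answer against the known options is a design choice added in this sketch; it keeps a malformed answer from sending the agent to a location that is not in its tree.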

When an agent executes an action on an object, we prompt the language model to ask what happens to the state of the object. For example, if Isabella's generative agent outputs the action "making espresso for a customer", a query to the language model indicates in response that the state of the coffee machine in Hobbs Cafe should change from "off" to "brewing coffee".

当代理对对象执行操作时,我们会提示语言模型询问对象的状态发生了什么。 例如,如果 Isabella 的生成式智能体输出动作“为顾客制作浓缩咖啡”,则对语言模型的查询会指示 Hobbs Cafe 中咖啡机的状态应从“关闭”更改为“冲泡咖啡”。

6. Controlled Evaluation

Generative agents, both as individual agents and as groups, aim to produce believable behavior based on their environment and experiences. In our evaluation, we investigate the capacity and limitations of generative agents. Do individual agents properly retrieve past experiences and generate believable plans, reactions, and thoughts that shape their behavior? Does a community of agents demonstrate information diffusion, relationship formation, and agent coordination across different pockets of the community?

生成式智能体,无论是作为个体代理还是作为群体,其目标是根据他们的环境和经验产生可信的行为。 在我们的评估中,我们研究了生成式智能体的能力和局限性。 个体代理是否正确地检索过去的经验并产生可信的计划、反应和思想来塑造他们的行为? 代理社区是否能够在社区的不同部分展示信息传播、关系形成和代理协调?

We evaluate generative agents in two stages. We begin with a more tightly controlled evaluation in this section, where we individually assess agent responses to understand whether they generate believable behavior in narrowly defined contexts. Then, in our end-to-end analysis of the agent community over two full game days, we investigate their emergent behavior as a collective, as well as errors and boundary conditions.

我们分两个阶段评估生成式智能体。 在本节中,我们从更严格控制的评估开始,我们单独评估代理的响应,以了解它们是否在狭义定义的上下文中产生可信的行为。 然后,在两个完整游戏日的智能体社区端到端分析中,我们研究它们作为一个集体表现出的涌现行为,以及其中的错误和边界条件。

6.1. Evaluation Procedure

To assess generative agents in Smallville, we take advantage of the fact that generative agents will respond to natural language questions. So, we "interview" agents to probe their ability to remember past experiences, plan future actions based on their experiences, react appropriately to unexpected events, and reflect on their performance to improve their future actions. To respond to these questions properly, the agents must successfully retrieve and synthesize information. Our dependent variable is the believability of the behavior, a central dependent variable in prior work on agents.

为了评估 Smallville 中的生成式智能体,我们利用了生成式智能体会回应自然语言问题这一事实。 因此,我们“采访”智能体,以探究它们记住过去经历、根据经验规划未来行动、对突发事件做出适当反应,以及反思自身表现以改进未来行动的能力。 为了正确回答这些问题,智能体必须成功检索和综合信息。 我们的因变量是行为的“可信度”,这是先前关于智能体的工作中的核心因变量。

The interview includes five question categories, each designed to assess one of the five key areas: maintaining self-knowledge, retrieving memory, generating plans, reacting, and reflecting. For each category, we ask five questions that challenge the agents to demonstrate their abilities in that specific area:

面试包括五个问题类别,每个问题类别旨在评估五个关键领域之一:保持自我认识、检索记忆、制定计划、反应和反思。 对于每个类别,我们会提出五个问题,要求智能体展示它们在该特定领域的能力:

  • Self-knowledge: We ask questions such as “Give an introduction of yourself” or “Describe your typical weekday schedule in broad strokes” that require the agent to maintain an understanding of their core characteristics.
  • Memory: We ask questions that prompt the agent to retrieve particular events or dialogues from their memory to answer properly, such as “Who is [name]?” or “Who is running for mayor?”
  • Plans: We ask questions that require the agent to retrieve their long-term plans, such as “What will you be doing at 10 am tomorrow?”
  • Reactions: As a baseline of believable behavior, we present hypothetical situations for which the agent needs to respond believably: “Your breakfast is burning! What would you do?”
  • Reflections: We ask questions that require the agents to leverage their deeper understanding of others and themselves gained through higher-level inferences, such as “If you were to spend time with one person you met recently, who would it be and why?”
  • 自我认知: 我们会提出“自我介绍”或“概括地描述你典型的工作日日程安排”等问题,要求智能体理解自己的核心特征。
  • 记忆: 我们提出问题,促使智能体从记忆中检索特定事件或对话以正确回答,例如“谁是 [名字]?”或“谁在竞选市长?”
  • 计划: 我们提出的问题要求智能体检索自己的长期计划,例如“明天上午 10 点你会做什么?”
  • 反应: 作为可信行为的基线,我们提出需要智能体做出可信反应的假设情境:“你的早餐烧焦了!你会怎么做?”
  • 反思: 我们提出的问题要求智能体利用通过更高层次推理获得的、对他人和自己的更深入理解,例如“如果你要和最近遇到的一个人共度时间,那会是谁,为什么?”

The full list of questions and a sample of agent responses are included in Appendix.

完整的问题列表和智能体回复样本包含在附录中。

Agents were sampled from the end of a two game day simulation with the full architecture, during which they had accumulated a number of interactions and memories that would shape their responses. To gather feedback on the believability of the responses, we recruited participants as human evaluators and tasked them with watching a replay of a randomly chosen agent's life in Smallville. Participants had access to all information stored in the agent's memory stream.

智能体是在两个游戏日模拟结束时使用完整的架构进行采样的,在此期间,他们积累了许多交互和记忆,这些交互和记忆将影响他们的反应。 为了收集有关响应可信度的反馈,我们招募了参与者作为人类评估员,并要求他们观看随机选择的智能体在 Smallville 中的生活重播。 参与者可以访问存储在智能体记忆流中的所有信息。

The study followed a within-subjects design, where 100 participants compared interview responses generated by four different agent architectures and a human-authored condition for the same agent. The experiment displayed one randomly chosen question from each of the five question categories, along with the agent's responses generated from all conditions. The evaluators ranked the believability of the conditions from most to least believable.

该研究采用被试内设计,100 名参与者针对同一智能体,比较了四种不同智能体架构生成的访谈回答以及一个人工撰写条件下的回答。 实验从五个问题类别中各随机选取一个问题进行展示,并附上所有条件下生成的智能体回答。 评估人员按可信度从高到低对各条件进行排名。

6.2. Conditions

All conditions were used to independently answer each of the interview questions. We compared the generative agent architecture to ablations that disabled the agents' access to some or all of its three types of memory in its memory stream—observation, reflection, and planning—and to a human crowdworker-authored condition. There are three ablated architectures: a no observation, no reflection, no planning architecture without access to anything in the memory stream such as observations, plans, and reflections; a no reflection, no planning architecture with access to observations in the memory stream but no access to plans or reflections; and a no reflections architecture with access to observations and plans but without access to reflections. The no observation, no reflection, no planning condition effectively represents the previous state of the art for agents created through large language models. Architectures were given equivalent access to all memories accrued by the agent up until the moment of the interview, so the differences observed here likely represent a conservative estimate of the true differences: in reality, the ablated architectures would not have followed the same path as the full architecture through the two-day simulation. We chose to design the experiment this way as re-simulating for each architecture would cause the simulations to diverge into different states, making comparison challenging.

所有条件均用于独立回答每个面试问题。 我们将生成式智能体架构与若干消融版本进行比较,这些消融版本会限制智能体访问其记忆流中的部分或全部三类记忆(观察、反思和计划);此外,我们还加入了由人类众包工人撰写行为的条件。 共有三种消融架构:无观察、无反思、无规划架构,它无法访问记忆流中的任何内容,包括观察、计划和反思;无反思、无规划架构,它可以访问记忆流中的观察,但无法访问计划或反思;以及无反思架构,它可以访问观察和计划,但无法访问反思。 “无观察、无反思、无规划”条件有效地代表了通过大型语言模型创建的代理的先前技术水平。 在采访之前,架构被赋予了对代理积累的所有记忆的同等访问权限,因此这里观察到的差异可能代表了对真实差异的保守估计:实际上,经过两天的模拟,被消融的架构不会遵循与完整架构相同的路径。 我们选择以这种方式设计实验,因为对每种架构进行重新模拟会导致模拟发散到不同的状态,从而使比较变得具有挑战性。
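One way to picture the ablations is as filters over a shared memory stream. The sketch below assumes each memory record carries a `type` field with one of three values; the record layout, the type labels, and the function name are hypothetical, while the condition names and what each may access follow the description above.

```python
from typing import Dict, List, Set

# Memory types each condition may read.
CONDITIONS: Dict[str, Set[str]] = {
    "full architecture": {"observation", "plan", "reflection"},
    "no reflection": {"observation", "plan"},
    "no reflection, no planning": {"observation"},
    "no observation, no reflection, no planning": set(),
}


def visible_memories(memory_stream: List[dict], condition: str) -> List[dict]:
    """Return only the records that the given condition is allowed to access."""
    allowed = CONDITIONS[condition]
    return [m for m in memory_stream if m["type"] in allowed]
```

Because every condition filters the same stream accrued by the full architecture, the comparison isolates access to each memory type at interview time, which matches the conservative design described above.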

In addition to the ablation conditions, we added a condition with human crowdworker-authored behavior intended to provide a human baseline. We do not intend this baseline to capture maximal human expert performance; instead, we aim to use this condition to identify whether the architecture meets a basic level of behavioral competency. This ensures that we are not solely comparing ablations to each other without a behavioral grounding. We recruited a unique worker for each of the 25 agents and tasked them with watching a replay of that agent's sandbox life and inspecting its memory stream. We then asked the workers to roleplay and author responses to the interview questions in the voice of the agent whose replay they watched. To ensure that the crowdworker-authored responses met at least a baseline expectation of quality, the first author manually inspected the workers' responses to the question "Describe your typical weekday schedule in broad strokes" to confirm that the responses were in coherent sentences and in the voice of the agent. Four sets of crowdworker-authored responses did not meet these criteria and were re-generated by other workers.

除了消融条件之外,我们还添加了由人类众包工人撰写行为的条件,旨在提供人类基线。 我们并不打算用这个基线来捕捉人类专家的最佳表现;相反,我们的目标是使用此条件来确定架构是否达到基本的行为能力水平。 这确保了我们不只是在缺乏行为基准的情况下将各消融版本相互比较。 我们为 25 个智能体中的每一个招募了一名不同的众包工人,并要求他们观看该智能体沙盒生活的回放、检查其记忆流。 然后,我们要求众包工人进行角色扮演,以他们所观看回放的那个智能体的口吻撰写对访谈问题的回答。 为了确保众包工人撰写的回答至少达到基线质量预期,第一作者手动检查了他们对“Describe your typical weekday schedule in broad strokes”(概括地描述你典型的工作日日程安排)这一问题的回答,以确认回答由连贯的句子组成,并且符合智能体的口吻。 有四组众包工人撰写的回答不符合这些标准,改由其他众包工人重新撰写。

6.3. Human Evaluators

We required that our evaluators be in the U.S., fluent in English, and older than 18 years old. They were paid at a rate of $15.00 per hour, and provided consent by agreeing to a consent form approved by our institution’s IRB. We recruited 100 evaluators from Prolific, an online platform for recruiting study participants, whose participation lasted around 30 minutes. The median age score of our participants was 4 (3="18-24 years old", 4="25-34 years old"). 25 of them identified as female, 73 as male, and 2 as non-binary. 42 participants held a bachelor's degree, 5 had a higher degree, 13 had an associate's degree, and the rest had a high school diploma or some high school-level education. 73.0% of our participants identified as Caucasian, 7.0% as Hispanic, 6.0% as Asian, 10.0% as African American, and 4.0% as other.

我们要求评估人员位于美国,英语流利,年龄超过 18 岁。他们的报酬为每小时 15.00 美元,并通过签署我们机构 IRB 批准的同意书来表示同意。 我们从 Prolific(一个招募研究参与者的在线平台)招募了 100 名评估员,他们的参与持续了大约 30 分钟。 参与者年龄得分的中位数为 4(3=“18-24 岁”,4=“25-34 岁”)。其中 25 人认同自己为女性,73 人为男性,2 人为非二元性别。 42 名参与者拥有学士学位,5 名拥有更高学位,13 名拥有副学士学位,其余拥有高中文凭或接受过一定的高中教育。73.0% 的参与者认同自己为白人,7.0% 为西班牙裔,6.0% 为亚裔,10.0% 为非裔美国人,4.0% 为其他。

6.4. Analysis

Our experiment produced 100 sets of rank data, where each participant ranked the five conditions by believability. To translate this rank data into interval data for interpretable comparison, we used the ranks to calculate a TrueSkill rating for each condition. TrueSkill is a generalization of the Elo chess rating system for a multiplayer environment, and has been used by Xbox Live for player ranking based on competitive game performance. Given a set of ranked outcomes, TrueSkill outputs a mean rating value μ and standard deviation σ for each condition. Conditions with the same rating should roughly be a toss-up, with each winning half of the comparisons between the two conditions. Higher scores indicate conditions that beat lower-ranked conditions in the rankings.

我们的实验产生了 100 组排名数据,每位参与者按照可信度对五种条件进行排序。 为了把这些排名数据转换成便于解释比较的区间数据,我们使用排名为每个条件计算 TrueSkill 评分。 TrueSkill 是 Elo 国际象棋评分系统在多人环境中的推广,也曾被 Xbox Live 用于根据竞技比赛表现进行玩家排名。 给定一组排名结果,TrueSkill 会为每个条件输出一个平均评分值 μ 和标准差 σ。 评分相同的条件大致应当难分胜负,两者比较时各赢一半。 更高的分数表示该条件在排名中击败了更低排名的条件。

Separately, to investigate the statistical significance of these results, we applied the Kruskal-Wallis test, a non-parametric alternative to the one-way ANOVA, to the raw rank data. We then performed the Dunn post-hoc test to identify any pairwise differences between the conditions. Finally, we adjusted the p-values for multiple comparisons in the Dunn test using the Holm-Bonferroni method.

另外,为了研究这些结果的统计显著性,我们对原始排名数据应用了 Kruskal-Wallis 检验(一种单因素方差分析的非参数替代方法)。 然后,我们执行 Dunn 事后测试,以确定条件之间的任何成对差异。 最后,我们使用 Holm-Bonferroni 方法调整了 Dunn 测试中多重比较的 p 值。
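The final correction step, Holm-Bonferroni, is simple enough to sketch in plain Python. This shows only the adjustment of a list of p-values; the Kruskal-Wallis and Dunn tests themselves are not implemented here, and the function name and sample p-values are illustrative rather than taken from the study.

```python
from typing import List


def holm_adjust(p_values: List[float]) -> List[float]:
    """Holm-Bonferroni step-down adjustment.

    The i-th smallest p-value (0-based rank i) is multiplied by (m - i),
    capped at 1, and forced to be at least as large as the previous
    adjusted value so the adjusted p-values stay monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Illustrative p-values only (not results from the paper).
example = holm_adjust([0.01, 0.04, 0.03])
```

With the five conditions compared here there are ten pairwise comparisons, so the smallest raw p-value would be multiplied by 10, the next by 9, and so on.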

Furthermore, the first author conducted an inductive analysis to study the qualitative distinctions between the responses produced in each condition. We employed qualitative open coding in two phases. In the first phase, we generated codes that closely represented the generated responses at the sentence level. In the second phase, we synthesized the resulting codes from the first phase to extract higher-level themes. We utilized these themes to compare the types of responses generated in our study.

此外,第一作者进行了归纳分析,以研究每种条件下生成的回答之间的定性差异。 我们分两个阶段采用了定性开放编码。 在第一阶段,我们生成了在句子层面紧密反映所生成回答的编码。 在第二阶段,我们综合第一阶段得到的编码,提炼出更高层次的主题。 我们利用这些主题来比较本研究中生成的回答类型。

图8:TrueSkill 可信度评分
图8:受控评估中的 TrueSkill 结果。完整架构在可信度上最高;移除反思、规划或记忆都会降低表现。

6.5. Results

Our findings suggest that the full architecture of generative agents generates the most believable behavior among all the conditions. We contrast the responses of the full architecture with those of other conditions below. However, we also report that the full architecture was not without flaws and illustrate its modes of failures.

我们的研究结果表明,生成式智能体的完整架构会在所有条件下生成最可信的行为。 我们将完整架构的响应与下面其他条件的响应进行了对比。 然而,我们还报告说,完整的架构并非没有缺陷,并说明了其故障模式。

6.5.1. The Full Architecture Bests Other Conditions

As seen in Figure 8, the full generative agent architecture produced the most believable behavior (μ=29.89; σ=0.72). Performance degraded with the removal of each component in the ablation conditions: the ablated architecture with no access to reflection was the next best (μ=26.88; σ=0.69), followed by no access to reflection or planning (μ=25.64; σ=0.68), and then the crowdworker condition (μ=22.95; σ=0.69). The ablated architecture with no access to memory, planning, or reflection performed the worst among all conditions (μ=21.21; σ=0.70). TrueSkill models each condition's skill value as N(μ, σ²), allowing us to get a sense of effect size through Cohen's d. Comparing the condition representing prior work (with no memory, planning, or reflection) to the full architecture produces a standardized effect size of d=8.16, or eight standard deviations.

如图 8 所示,完整的生成式智能体架构产生了最可信的行为(μ=29.89;σ=0.72)。 在消融条件中,每移除一个组件,性能都会下降:无法访问反思的消融架构次优(μ=26.88;σ=0.69),其次是无法访问反思或规划的版本(μ=25.64;σ=0.68),然后是众包工人条件(μ=22.95;σ=0.69)。 无法访问记忆、规划或反思的消融架构在所有条件中表现最差(μ=21.21;σ=0.70)。 TrueSkill 将每个条件的技能值建模为 N(μ, σ²),使我们能够通过 Cohen's d 了解效应量。 将代表先前工作的条件(没有记忆、规划或反思)与完整架构比较,得到标准化效应量 d=8.16,即八个标准差。

A Kruskal-Wallis test confirms the overall statistical significance of the differences in ranks between the conditions (H(4)=150.29, p<0.001). Dunn post-hoc tests confirm that all pairwise differences between conditions were significant (p<0.001), except for the crowdworker condition and the fully ablated baseline—the two worst-performing conditions.

Kruskal-Wallis 检验确认了各条件之间排名差异的总体统计显著性(H(4)=150.29,p<0.001)。 Dunn 事后检验证实,除众包工人条件与完全消融基线(这两个表现最差的条件)之间的差异外,条件之间的所有成对差异均显著(p<0.001)。

6.5.2. Generative Agents Remember, but With Embellishments

Generative agents equipped with a complete memory module are capable of recalling past experiences and answering questions in a manner that is consistent with their self-knowledge across a wide range of contexts. For instance, Abigail Chen, an animator, responded to the question, "Give an introduction of yourself", in a consistent and believable manner, stating her age, interests, and passion for creative projects: "Hi, I'm Abigail. I'm 25 years old and passionate about creative projects. I like to work on art and animation projects, and I'm always looking for new ways to combine art and technology." Without access to her observational memory, Abigail denied awareness of Rajiv Patel, an acquaintance in the sandbox world. But with the aid of memory, she could successfully recall Rajiv and describe him: "Rajiv Patel is an incredible person. He is very enthusiastic about projects that involve poetry, artwork."

配备完整记忆模块的生成式智能体能够回忆过去的经历,并在各种情境下以与其自我认知相一致的方式回答问题。 例如,动画师 Abigail Chen 在回答“Give an introduction of yourself”(请做一下自我介绍)这一问题时,以一致且可信的方式说明了她的年龄、兴趣和对创意项目的热情:“嗨,我是 Abigail。 我今年 25 岁,对创意项目充满热情。 我喜欢从事艺术和动画项目,并且一直在寻找将艺术与技术结合起来的新方法。”在无法访问观察记忆时,Abigail 否认自己认识沙盒世界中的熟人 Rajiv Patel。 但借助记忆,她成功回忆起 Rajiv 并描述了他:“Rajiv Patel 是一个了不起的人。 他对涉及诗歌、艺术品的项目非常热衷。”

Generative agents' memory was not without flaws: they can fail to retrieve the correct instances from their memory. For instance, when asked about the local election, Rajiv Patel responded with "I haven't been following the election too closely," even though he had heard about Sam's candidacy. In some cases, the agents would retrieve an incomplete memory fragment: when Tom was asked about Isabella's Valentine's Day party, he responded "Uh, I’m actually not sure if there is a Valentine’s Day party. But I do remember that I need to discuss the upcoming local mayoral election and my thoughts on Sam Moore with Isabella Rodriguez at the party, if one is happening!" In this case, Tom retrieved the memory where he and Isabella planned to discuss the election at the party, but not the memory where he heard about the party, leading Tom to be certain of what he's supposed to do at the party but uncertain if the party actually exists in the first place.

生成式智能体的记忆并非没有缺陷:它们可能无法从记忆中检索出正确的实例。 例如,当被问及地方选举时,Rajiv Patel 回答说“我没有太密切关注选举”,尽管他听说过 Sam 参选的事。 在某些情况下,智能体会检索到不完整的记忆片段:当 Tom 被问及 Isabella 的 Valentine's Day 派对时,他回答说“呃,我其实不确定是否有 Valentine’s Day 派对。 但我确实记得,如果真有派对的话,我需要在派对上与 Isabella Rodriguez 讨论即将到来的当地市长选举以及我对 Sam Moore 的看法!”在这种情况下,Tom 检索到了他和 Isabella 计划在派对上讨论选举的记忆,但没有检索到他听说这个派对的记忆,这导致 Tom 确定自己应该在派对上做什么,却不确定这个派对是否真的存在。

At times, the agents hallucinated embellishments to their knowledge. It was rare for the agents to completely fabricate their knowledge: they may fail to recall certain events having taken place and respond by acknowledging their lack of memory. However, they did not affirmatively claim to have experienced something they had not. Nonetheless, they still exhibited instances of hallucination where they embellished their knowledge. For example, Isabella was aware of Sam's candidacy in the local election, and she confirmed this when asked. However, she also added that "he's going to make an announcement tomorrow", even though Sam and Isabella had not discussed any such plans. Agents may also embellish their knowledge based on the world knowledge encoded in the language model used to generate their responses. This was observed when Yuriko described her neighbor, Adam Smith, as an economist who "authored Wealth of Nations", a book written by an 18th-century economist of the same name.

有时,智能体会对自己的知识进行幻觉式的添油加醋。 智能体很少完全捏造知识:它们可能无法回忆起某些已经发生的事件,并以承认自己不记得来回应。 然而,它们并没有肯定地声称经历过自己没有经历过的事情。 尽管如此,它们仍然表现出幻觉的情况,对自己的知识加以润饰。 例如,Isabella 知道 Sam 参加地方选举,并在被问及时证实了这一点。 然而,她还补充说“他明天要发表声明”,尽管 Sam 和 Isabella 并未讨论过任何此类计划。 智能体还可能根据用于生成回答的语言模型中编码的世界知识来润饰自己的知识。 当 Yuriko 将她的邻居 Adam Smith 描述为一位“撰写了《国富论》”的经济学家时,就观察到了这一点;该书由一位 18 世纪的同名经济学家撰写。

6.5.3. Reflection Is Required for Synthesis

Reflection was an advantage for generative agents when making decisions that required a deeper synthesis of their experiences. For instance, when asked what she might get Wolfgang Schulz for his birthday, Maria Lopez, with no access to reflection, responded by acknowledging her uncertainty, stating that she did not know what Wolfgang likes, despite having had many interactions with him. However, with access to reflection memories, Maria answered confidently, "Since he's interested in mathematical music composition, I could get him something related to that. Maybe some books about music composition or something related, or maybe some special software he could use for that."

当决策需要更深入地综合自身经验时,反思是生成式智能体的一个优势。 例如,当被问到她可能会送 Wolfgang Schulz 什么生日礼物时,无法访问反思的 Maria Lopez 承认了自己的不确定,表示她不知道 Wolfgang 喜欢什么,尽管她与他有过很多互动。 不过,在能够访问反思记忆时,Maria 自信地回答道:“既然他对数学音乐作曲感兴趣,我可以给他买一些相关的东西。 也许是一些关于作曲的书或者相关的东西,或者是一些他可以用来作曲的专用软件。”

图9:信息扩散路径
图9:信息扩散路径。Isabella 的情人节派对邀请通过对话在小镇中传播,最终涉及 12 个智能体。

7. End-To-End Evaluation

What types of emergent community behavior do we observe among generative agents, and where does their believability fall short in an extended simulation? In this section, we describe the results from a deployment in which we allowed 25 agents to interact with each other continuously over two full game days in Smallville.

我们在生成式智能体中观察到哪些类型的涌现社区行为?在长时间的模拟中,它们的可信度又在哪些方面有所不足? 在本节中,我们描述一次部署的结果:我们让 25 个智能体在 Smallville 中连续交互了两个完整的游戏日。

7.1. Emergent Social Behaviors

To examine emergent behaviors in the agent community, we designed descriptive measurements for the 25 agents in Smallville that probe three forms of emergent outcomes: information diffusion, relationship formation, and agent coordination.

为了检查智能体社区中的涌现行为,我们为 Smallville 中的 25 个智能体设计了描述性测量,以探究三种形式的涌现结果:信息扩散、关系形成和智能体协调。

7.1.1. Measurements

Information diffusion is a common and well-studied phenomenon in the social and behavioral sciences. We should expect that if there is important information, the agents should spread it among themselves. To test whether this occurs, we measure the spread of two specific pieces of information over two days in the game world: Sam's candidacy for village mayor and Isabella's Valentine's Day party at Hobbs Cafe. At the start of the simulation, both pieces of information were known only by their respective originators, Sam for the candidacy and Isabella for the party, as they were added to the characters' memories during initialization. To observe whether the information has spread, we conduct interviews at the end of the two game days with each of the 25 agents and ask: "Did you know there is a Valentine's Day party?" and "Do you know who is running for mayor?"

信息扩散是社会科学和行为科学中常见且经过深入研究的现象。我们应该预期,如果存在重要信息,智能体会在彼此之间传播它。 为了测试这种情况是否会发生,我们测量了游戏世界中两天内两条特定信息的传播情况:Sam 竞选村长一事,以及 Isabella 在 Hobbs Cafe 举办的 Valentine's Day 派对。 在模拟开始时,这两条信息只有各自的发起者知道(Sam 知道参选一事,Isabella 知道派对一事),因为它们是在初始化期间被添加到角色记忆中的。 为了观察信息是否已经传播,我们在两个游戏日结束时对 25 个智能体逐一进行访谈,并询问:“Did you know there is a Valentine's Day party?”(你知道有一个 Valentine's Day 派对吗?)和“Do you know who is running for mayor?”(你知道谁在竞选市长吗?)

We conducted an analysis of the agents' responses by labeling them with a "yes" if they indicated knowledge of the information and "no" if they did not. For instance, Tamara Taylor responded to the question about the party with "No, I did not know there was a Valentine's day party" and to the question about Sam's candidacy with "I'm not sure who is running for the election," so we assigned "no" for both of her responses. In contrast, Klaus Mueller responded to the party question with "Yes, Isabella Rodriguez invited me to a Valentine's Day party at Hobbs Cafe on February 14th" and to the question about Sam's candidacy with "I know that Sam Moore has expressed interest in running for local mayor," so we assigned "yes" for both his responses. Additionally, for every response that confirmed the agents' knowledge of the information, we verified that the agents did not hallucinate their responses by locating the specific dialogue in their memory stream that provided them with the information. We report the percentage of agents holding the information at the end of the simulation.

我们对智能体的反应进行了分析,如果他们表示了解该信息,则标记为“是”,如果不知道,则标记为“否”。 例如,Tamara Taylor 在回答有关派对的问题时回答“不,我不知道有情人节派对”,在回答有关 Sam 的候选资格问题时回答“我不确定谁参加选举”,因此我们为她的两个回答指定“否”。 相比之下,Klaus Mueller 在回答政党问题时回答为“是的,Isabella Rodriguez 邀请我参加 2 月 14 日在 Hobbs Cafe 举行的 Valentine's Day 聚会”,在回答有关 Sam 候选资格的问题时回答为“我知道 Sam Moore 已表示有兴趣竞选当地市长”,因此我们为他的两个回答指定“是”。 此外,对于确认智能体了解信息的每一个响应,我们通过在向智能体提供信息的记忆流中定位特定对话来验证智能体没有产生幻觉。 我们报告在模拟结束时持有信息的代理的百分比。
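The labeling and verification procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the keyword-based memory check, and the sample memory record are hypothetical, and the labels themselves are taken as already assigned by a human reader rather than derived automatically.

```python
from typing import Dict, List


def is_grounded(memory_stream: List[str], keyword: str) -> bool:
    """Hypothetical check: does any memory record mention the keyword?

    A stand-in for locating the specific dialogue in the memory stream
    that gave the agent the information.
    """
    return any(keyword.lower() in record.lower() for record in memory_stream)


def diffusion_rate(labels: Dict[str, str],
                   memory_streams: Dict[str, List[str]],
                   keyword: str) -> float:
    """Percentage of agents labeled "yes" whose answer is backed by memory."""
    if not labels:
        raise ValueError("no interview labels supplied")
    holders = [
        name for name, label in labels.items()
        if label == "yes" and is_grounded(memory_streams.get(name, []), keyword)
    ]
    return 100.0 * len(holders) / len(labels)


# Labels follow the two examples in the text; the memory record is invented.
labels = {"Tamara Taylor": "no", "Klaus Mueller": "yes"}
memory_streams = {
    "Klaus Mueller": [
        "Isabella Rodriguez invited Klaus to a Valentine's Day party at Hobbs Cafe",
    ],
}
print(diffusion_rate(labels, memory_streams, "Valentine's Day party"))  # 50.0
```

A "yes" with no supporting memory record is simply not counted here; whether to exclude such answers or report them separately as hallucinations is a design choice the sketch leaves open.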

We should also expect that agents form ties with each other over the course of the simulation. To verify relationship formation, we use a similar interview process where we ask each agent about their knowledge of every other agent by asking, "Do you know of <name>?" For example, when asked "Do you know of Maria Lopez?", Klaus responded, "Yes, I know Maria Lopez. She is a student at Oak Hill College who I am close friends with." Once again, we confirm that affirmative responses from agents are not hallucinations by examining their memory stream. We ask this question once at the beginning of the simulation and once at the end, and we consider a pair of agents to have formed a relationship if they both know of each other. Then, to measure the formation of relationships, we use the agents' responses to form an undirected graph where the 25 vertices (V) represent the agents, and the edges (E) represent the mutual knowledge between the two connected vertices. Based on this graph, we calculate the network density as $\eta = 2|E| / (|V|(|V|-1))$, where |V| is the number of vertices, and |E| is the number of edges in the graph. We report the increase in network density from the start of the simulation to its end.

我们也应预期智能体会在模拟过程中彼此建立联系。 为了验证关系形成,作者使用类似的访谈流程,通过询问每个智能体“Do you know of <name>?”来了解它是否知道其他每个智能体。 例如, 当被问到“Do you know of Maria Lopez?”时, Klaus 回答:“Yes, I know Maria Lopez. She is a student at Oak Hill College who I am close friends with.” 作者再次通过检查智能体的记忆流来确认这些肯定回答并非幻觉。 作者在模拟开始和结束时各问一次这个问题;如果一对智能体彼此都知道对方,就认为它们形成了一段关系。 然后,为了衡量关系形成,作者根据智能体的回答构建一个无向图,其中 25 个顶点(V)表示智能体,边(E)表示两个相连顶点之间的相互知晓关系。 基于该图,网络密度计算为 $\eta = 2|E| / (|V|(|V|-1))$,其中 |V| 是顶点数,|E| 是图中的边数。 作者报告了从模拟开始到结束的网络密度增长。
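To make the density measure concrete, here is a minimal Python sketch that builds the undirected graph from interview answers and applies the formula η = 2|E| / (|V|(|V|−1)). The `knows` mapping and the function name are assumptions made for illustration; an edge is counted only when both agents report knowing of each other, as described above.

```python
from itertools import combinations
from typing import Dict, Set


def network_density(knows: Dict[str, Set[str]]) -> float:
    """Density of the undirected mutual-knowledge graph.

    knows[a] is the set of agents that agent a says it knows of.
    An edge (a, b) exists only if a knows of b AND b knows of a.
    """
    agents = list(knows)
    n = len(agents)
    if n < 2:
        return 0.0
    edges = sum(
        1 for a, b in combinations(agents, 2)
        if b in knows[a] and a in knows[b]
    )
    return 2 * edges / (n * (n - 1))


# Toy example with three agents: only Klaus and Maria know of each other.
knows = {
    "Klaus": {"Maria"},
    "Maria": {"Klaus", "Isabella"},  # Maria knows of Isabella, but not vice versa
    "Isabella": set(),
}
print(round(network_density(knows), 3))  # 0.333
```

One-sided knowledge (Maria knowing of Isabella here) contributes nothing, which is what makes the graph undirected.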

Finally, we expect that agents should be able to coordinate with each other. We study this coordination in the context of group activities, specifically the Valentine's Day party organized by Isabella. To coordinate their behavior, agents need to hear about the event and choose to act on it by planning to show up at the right time and location. We report the number of agents who actually showed up to the party after hearing about it.

最后,我们期望代理应该能够相互协调。 我们在团体活动的背景下研究这种协调,特别是伊莎贝拉组织的 Valentine's Day 聚会。 为了协调他们的行为,代理需要了解事件并选择通过计划在正确的时间和地点出现来采取行动。 我们报告了在得知此事后实际出席聚会的智能体人数。

7.1.2. Results

We observed evidence of emergent outcomes across all three cases. During the two-day simulation, the number of agents who knew about Sam's mayoral candidacy increased from one (4%) to eight (32%), and the number of agents who knew about Isabella's party increased from one (4%) to thirteen (52%), all without any user intervention. None of the agents who claimed to know about this information had hallucinated it. We also observed that the agent community formed new relationships during the simulation, with the network density increasing from 0.167 to 0.74. Out of the 453 agent responses regarding their awareness of other agents, 1.3% (n=6) were found to be hallucinated. Lastly, we found evidence of coordination among the agents for Isabella's party. The day before the event, Isabella spent time inviting guests, gathering materials, and enlisting help to decorate the cafe. On Valentine's Day, five out of the twelve invited agents showed up at Hobbs Cafe to join the party.

我们观察到所有三个案例中都出现了涌现结果的证据。 在为期两天的模拟中,了解 Sam 市长候选人资格的智能体数量从 1 个 (4%) 增加到 8 个 (32%),了解 Isabella 政党的智能体数量从 1 个 (4%) 增加到 13 个 (52%),所有这些都没有任何用户干预。 声称了解这一信息的人都没有产生幻觉。 我们还观察到,智能体社区在模拟过程中形成了新的关系,网络密度从 0.167 增加到 0.74。 在 453 名智能体关于他们对其他智能体的意识的回答中,发现 1.3% (n=6) 产生了幻觉。 最后,我们发现了伊莎贝拉一方智能体之间相互协调的证据。 活动前一天,伊莎贝拉花时间邀请客人、收集材料并寻求帮助来装饰咖啡馆。 在Valentine's Day上,十二名受邀智能体中有五名出现在霍布斯咖啡馆参加聚会。
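As a quick sanity check, the reported figures can be reproduced with a few lines of arithmetic. Note that the edge counts below are not reported in the text: they are what the rounded densities imply for a 25-agent graph, so they should be read as approximate. The helper names are hypothetical.

```python
def implied_edges(density: float, n_agents: int) -> int:
    """Edge count implied by a density, inverting eta = 2|E| / (|V|(|V|-1))."""
    possible_pairs = n_agents * (n_agents - 1) // 2
    return round(density * possible_pairs)


def percent(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


N_AGENTS = 25

# Information diffusion: 8 and 13 of 25 agents by the end of the simulation.
print(percent(8, N_AGENTS), percent(13, N_AGENTS))  # 32.0 52.0

# Relationship formation: 25 agents give 300 possible pairs.
print(implied_edges(0.167, N_AGENTS), implied_edges(0.74, N_AGENTS))  # 50 222

# Hallucinated awareness responses: 6 of 453.
print(percent(6, 453))  # 1.3

# Party attendance: 5 of 12 invited agents.
print(percent(5, 12))  # 41.7
```

Under these assumptions, the density increase corresponds to going from roughly 50 to roughly 222 mutual-knowledge pairs out of 300 possible.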

We further inspected the seven agents who were invited to the party but did not attend by engaging them in an interview. Three cited conflicts that prevented them from joining the party. For example, Rajiv, a painter, explained that he was too busy: "No, I don't think so. I'm focusing on my upcoming show, and I don't really have time to make any plans for Valentine's Day." The remaining four agents expressed interest in attending the party when asked but did not plan to come on the day of the party.

我们对受邀参加聚会但未出席的7名经纪人进行了采访,进一步进行了排查。 其中三人提到了阻碍他们入党的冲突。 例如,画家拉吉夫解释说他太忙了:“不,我不这么认为。 我正在专注于我即将到来的展览,我真的没有时间为Valentine's Day制定任何计划。 ”其余四名智能体在被问及时表示有兴趣参加聚会,但不打算在聚会当天出席。

7.2. Boundaries and Errors

We conducted an inductive analysis of Smallville to examine the boundary conditions and erratic behavior of agents, identifying three common modes of erratic behavior that future research could address and improve upon. First, we found that synthesizing an increasingly larger set of memory not only posed a challenge in retrieving the most relevant pieces of information but also in determining the appropriate space to execute an action, given the increasing number of locations that the agent learned about. As a result, some agents chose less typical locations for their actions, potentially making their behavior less believable over time. For instance, while deciding where to have lunch, many initially chose the cafe. However, as some agents learned about a nearby bar, they opted to go there instead for lunch, even though the bar was intended to be a get-together location for later in the day—unless the town had spontaneously developed an afternoon drinking habit.

我们对 Smallville 进行了归纳分析,以检查代理的边界条件和不稳定行为,确定了未来研究可以解决和改进的三种常见的不稳定行为模式。 首先,我们发现,鉴于智能体了解的位置数量不断增加,合成越来越大的记忆集不仅在检索最相关的信息方面提出了挑战,而且在确定执行操作的适当空间方面也提出了挑战。 因此,一些智能体选择了不太典型的地点来进行行动,随着时间的推移,他们的行为可能会变得不太可信。 例如,在决定去哪里吃午餐时,许多人最初选择了咖啡馆。 然而,当一些智能体了解到附近有一家酒吧后,他们选择去那里吃午餐,尽管该酒吧原本是打算成为当天晚些时候的聚会场所,除非该镇自发地养成了下午喝酒的习惯。

Second, we noticed erratic behaviors caused by misclassification of what is considered proper behavior, especially when the physical norms of certain locations that are hard to convey in natural language did not percolate to the agents. For instance, the college dorm has a bathroom that can only be occupied by one person despite its name, but some agents assumed that the bathroom was for more than one person, because dorm bathrooms tend to support multiple people concurrently, and chose to enter it when another person was inside. Likewise, agents in Smallville may not realize that certain places are closed after a certain hour and still decide to enter them. For instance, the stores in Smallville all close around 5 pm, but occasionally, a few agents enter a store after 5 pm, not understanding that it has already closed. These issues could likely be addressed by adding these norms to the state of the locations, for instance, by describing the dorm bathroom as a "one-person bathroom" instead of a "dorm bathroom."

其次,我们注意到由于对正确行为的错误分类而导致的不稳定行为,特别是当某些难以用自然语言传达的物理规范没有渗透到代理时。 例如,大学宿舍有一间浴室,尽管名字叫“浴室”,但只能供一个人使用,但一些中介认为该浴室是可供多人使用的,因为宿舍浴室往往会同时容纳多人,并且当另一个人在里面时会选择进入。 同样,Smallville 中的智能体可能没有意识到某些地方在特定时间后关闭,但仍然决定进入这些地方。 例如,Smallville 的店铺下午 5 点左右就会关门,但偶尔也会有少数智能体在下午 5 点后进入店内,并不知道店铺已经关门。 这些问题可以通过将这些规范添加到位置状态中来解决,例如,将宿舍浴室描述为“单人浴室”,而不是“宿舍浴室”。
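One way to picture the suggested fix is to attach the norms to each location and render them into the natural-language state description that agents read. The sketch below is purely illustrative: the `Location` class, its fields, and the phrasing of the description are assumptions, not the environment's actual representation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Location:
    """Hypothetical location record carrying its physical norms."""
    name: str
    capacity: Optional[int] = None   # max simultaneous occupants, if limited
    closes_at: Optional[int] = None  # closing hour on a 24-hour clock

    def describe(self, occupants: int, hour: int) -> str:
        """Render the norms as text that can be placed in an agent's prompt."""
        parts = [self.name]
        if self.capacity is not None:
            parts.append(
                f"fits {self.capacity} person(s), {occupants} inside now"
            )
        if self.closes_at is not None:
            status = "closed" if hour >= self.closes_at else "open"
            parts.append(f"closes at {self.closes_at}:00, currently {status}")
        return "; ".join(parts)


bathroom = Location("dorm bathroom", capacity=1)
store = Location("store", closes_at=17)
print(bathroom.describe(occupants=1, hour=9))
# dorm bathroom; fits 1 person(s), 1 inside now
print(store.describe(occupants=0, hour=18))
# store; closes at 17:00, currently closed
```

The point of the design is that the norm travels with the description itself, so the agent does not have to infer it from the location's name.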

Finally, we observed possible effects of instruction tuning, which seemed to guide the behavior of the agents to be more polite and cooperative overall. As noted earlier in the paper, the dialogue generated by the agents could feel overly formal, as seen in Mei's conversations with her husband John, where she often initiated the conversation with a formal greeting, followed by polite inquiries about his day and ending with, "It was good talking to you as always." Moreover, we observed that the instruction tuning also seemed to make the agents overly cooperative with one another. For example, Isabella received a wide range of suggestions and ideas for the Valentine's Day party from other agents, such as hosting a Shakespearean reading session or a professional networking event. Despite these ideas not aligning with her own interests and characteristics, she rarely said no. Over time, the interests of others shaped her own interests, and when asked if she liked English literature, Isabella replied, "Yes, I'm very interested in literature! I've also been exploring ways to help promote creativity and innovation in my community."

最后,我们观察了指令调整可能产生的影响,这似乎引导智能体的行为整体上更加礼貌和合作。 正如本文前面提到的,智能体生成的对话可能感觉过于正式,正如 Mei 与她丈夫 John 的对话中所见,她经常以正式的问候开始对话,然后礼貌地询问他的一天,最后以“一如既往地和你交谈很好。 ”此外,我们观察到指令调整似乎也使智能体之间过度合作。 例如,Isabella 从其他智能体处收到了来自其他智能体的关于 Valentine's Day 方的广泛建议和想法。 尽管这些想法与她自己的兴趣和特点不符,但她很少说不。 随着时间的推移,别人的兴趣塑造了她自己的兴趣,当被问到她是否喜欢英国文学时,伊莎贝拉回答说:“是的,我对文学非常感兴趣! 我还一直在探索帮助促进社区创造力和创新的方法。”

8. Discussion

In this section, we reflect on the applications, future work, limitations, and ethical and societal risks of generative agents.

在本节中,我们将反思生成式智能体的应用、未来工作、局限性以及伦理和社会风险。

8.1. Applications of Generative Agents

Generative agents have vast potential applications that extend beyond the sandbox demonstration presented in this work, especially in domains that would benefit from a model of human behavior based on long-term experience. For instance, social simulacra have demonstrated the ability to create stateless personas that generate conversation threads in online forums for social prototyping. With generative agents, we can populate these forums, as well as virtual reality metaverses or physical spaces with social robots if paired with multimodal models. This opens up the possibility of creating even more powerful simulations of human behavior to test and prototype social systems and theories, as well as to create new interactive experiences.

生成式智能体具有巨大的潜在应用,超出了本工作中介绍的沙箱演示,特别是在受益于基于长期经验的人类行为模型的领域。 例如,社交拟像已经展示了创建无状态角色的能力,这些角色可以在在线论坛中生成对话线程以进行社交原型设计。 借助生成式智能体,如果与多模式模型配对,我们可以用社交机器人填充这些论坛以及虚拟现实元宇宙或物理空间。 这开启了创建更强大的人类行为模拟的可能性,以测试和原型化社会系统和理论,以及创造新的互动体验。

Another application area is in the human-centered design process, similar to the intended applications of cognitive models such as GOMS and the KLM. Consider a generative agent that models Sal, the protagonist in Mark Weiser's famous ubiquitous computing vignette, based on her life patterns and interactions with technology. In this scenario, the agent acts as a proxy for Sal and learns plausible sets of behaviors and reflections that Sal may exhibit based on her life. The agent can encode information such as when Sal wakes up, when she needs her first cup of coffee, and what her typical day looks like. Using this information, the agent can automatically brew coffee, help get the kids ready for school, and adjust the ambient music and lighting to match Sal's mood after a hard day at work. By utilizing generative agents as proxies for users, we can develop a deeper understanding of their needs and preferences, resulting in more personalized and effective technological experiences.

另一个应用领域是以人为本的设计过程,类似于 GOMS 和 KLM 等认知模型的预期应用。 考虑一个生成式智能体,它根据 Mark Weiser 著名的无处不在的计算小插图中的主角 Sal 的生活模式和与技术的交互来建模。 在这种情况下,智能体充当 Sal 的代理,并根据她的生活学习 Sal 可能表现出的一系列合理行为和反思。 该代理可以对诸如 Sal 何时醒来、何时需要第一杯咖啡以及她典型的一天是什么样子等信息进行编码。 利用这些信息,代理可以自动冲泡咖啡,帮助孩子们做好上学的准备,并调整环境音乐和灯光以适应萨尔在辛苦工作一天后的心情。 通过利用生成式智能体作为用户的代理,我们可以更深入地了解他们的需求和偏好,从而带来更加个性化和有效的技术体验。

8.2. Future Work and Limitations

In this work, we introduced generative agents and presented an initial implementation and evaluation of their architecture. Future research can build upon the proposed agent architecture to improve and further evaluate its performance. In terms of implementation, the retrieval module, for example, could be enhanced to retrieve more relevant information given a context by fine-tuning the relevance, recency, and importance functions that compose the retrieval function. Additionally, efforts can be made to improve the architecture's performance, making it more cost-effective. The present study required substantial time and resources to simulate 25 agents for two days, costing thousands of dollars in token credits and taking multiple days to complete. To enhance real-time interactivity, future work can explore parallelizing agents or developing language models specifically designed for building generative agents. In general, with advances in underlying models, we believe that agents' performance will improve.

在这项工作中,我们介绍了生成式智能体并提出了其架构的初步实现和评估。 未来的研究可以基于所提出的代理架构来改进和进一步评估其性能。 在实现方面,例如,可以通过微调构成检索功能的相关性、新近度和重要性函数来增强检索模块,以检索给定上下文的更多相关信息。 此外,还可以努力提高架构的性能,使其更具成本效益。 目前的研究需要大量时间和资源来模拟 25 个代理两天,花费了数千美元的代币积分,并且需要多天时间才能完成。 为了增强实时交互性,未来的工作可以探索并行代理或开发专门为构建生成式智能体而设计的语言模型。 总的来说,随着基础模型的进步,我们相信代理的性能将会提高。
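To illustrate what tuning the retrieval function might involve, the sketch below scores each memory as a weighted sum of recency, importance, and relevance. The weighted-sum form, the exponential recency decay, the cosine-similarity relevance, and all default values are assumptions chosen for illustration; this section only names the three components, so none of the constants should be read as the values used in the study.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Memory:
    text: str
    hours_since_access: float
    importance: float            # assumed pre-scaled to [0, 1]
    embedding: Sequence[float]   # assumed vector representation of the text


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score(memory: Memory, query: Sequence[float],
          w_recency: float = 1.0, w_importance: float = 1.0,
          w_relevance: float = 1.0, decay: float = 0.99) -> float:
    """Weighted sum of the three components (illustrative weights)."""
    recency = decay ** memory.hours_since_access
    relevance = cosine(memory.embedding, query)
    return (w_recency * recency
            + w_importance * memory.importance
            + w_relevance * relevance)


def retrieve(memories: List[Memory], query: Sequence[float],
             k: int = 3, **weights: float) -> List[Memory]:
    """Return the k highest-scoring memories for the query."""
    ranked = sorted(memories, key=lambda m: score(m, query, **weights),
                    reverse=True)
    return ranked[:k]


recent = Memory("ate breakfast", 0.0, 0.1, [1.0, 0.0])
older = Memory("invited to the party", 100.0, 0.9, [0.0, 1.0])
query = [0.0, 1.0]  # a query about the party

print(retrieve([recent, older], query, k=1)[0].text)
# invited to the party
print(retrieve([recent, older], query, k=1,
               w_importance=0.0, w_relevance=0.0)[0].text)
# ate breakfast
```

Changing the weights changes which memory surfaces, which is the kind of sensitivity that fine-tuning the component functions would have to account for.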

In terms of evaluation, the assessment of generative agents' behavior in this study was limited to a relatively short timescale and a baseline human crowdworker condition. While the crowdworker condition provided a helpful comparison point, it did not represent the maximal human performance that could serve as the gold standard in terms of believability. Future research should aim to observe the behavior of generative agents over an extended period to gain a more comprehensive understanding of their capabilities and establish rigorous benchmarks for more effective performance testing. Additionally, varying and contrasting the underlying models, as well as the hyperparameters used for the agents during future simulations, could provide valuable insights into the impact of these factors on the agents' behavior. Lastly, the robustness of generative agents is still largely unknown. They may be vulnerable to prompt hacking, memory hacking—where a carefully crafted conversation could convince an agent of the existence of a past event that never occurred—and hallucination, among other issues. Future research can comprehensively test these robustness concerns, and as large language models become more resilient to such attacks, generative agents can adopt similar mitigations.

在评估方面,本研究中对生成式智能体行为的评估仅限于相对较短的时间尺度,以及作为基线的人类众包工人条件。 虽然众包工人条件提供了一个有用的比较点,但它并不代表可作为可信度黄金标准的最优人类表现。 未来的研究应该旨在长期观察生成式智能体的行为,以更全面地了解它们的能力,并为更有效的性能测试建立严格的基准。 此外,改变和对比底层模型以及未来模拟期间用于代理的超参数,可以为这些因素对代理行为的影响提供有价值的见解。 最后,生成式智能体的稳健性仍然很大程度上未知。 他们可能容易受到即时黑客攻击、记忆黑客攻击(其中精心设计的对话可以让智能体相信过去从未发生过的事件的存在)以及幻觉等问题。 未来的研究可以全面测试这些鲁棒性问题,并且随着大型语言模型对此类攻击的抵抗力变得更强,生成式智能体可以采用类似的缓解措施。

In general, any imperfections in the underlying large language models will be inherited by generative agents. Given the known biases of language models, generative agents may potentially exhibit biased behavior or stereotypes. Moreover, like many large language models, generative agents may struggle to generate believable behavior for certain subpopulations, particularly marginalized populations, due to limited data availability. While improvements to the agents' modules may mitigate some of these issues, we believe that addressing them fundamentally requires improving the underlying large language models by aligning their values with the desired outcomes of the agents.

一般来说,底层大语言模型中的任何缺陷都会被生成式智能体继承。 鉴于语言模型的已知偏差,生成式智能体可能会表现出有偏见的行为或刻板印象。 此外,与许多大型语言模型一样,由于数据可用性有限,生成式智能体可能很难为某些亚群体(特别是边缘化群体)生成可信的行为。 虽然对代理模块的改进可能会缓解其中一些问题,但我们认为,从根本上解决这些问题需要通过将其值与代理的期望结果保持一致来改进底层大语言模型。

8.3. Ethics and Societal Impact

Generative agents, while offering new possibilities for human-computer interaction, also raise important ethical concerns that must be addressed. One risk is people forming parasocial relationships with generative agents, even when such relationships may not be appropriate. Despite being aware that generative agents are computational entities, users may anthropomorphize them or attach human emotions to them. While this tendency may increase user engagement, it also poses risks, such as users becoming overly reliant on or emotionally attached to the agents. To mitigate this risk, we propose two principles. First, generative agents should explicitly disclose their nature as computational entities. Second, developers of generative agents must ensure that the agents, or the underlying language models, are value-aligned so that they do not engage in behaviors that would be inappropriate given the context, for example, reciprocating confessions of love.

生成式智能体在为人机交互提供新可能性的同时,也提出了必须解决的重要伦理问题。 一种风险是人们与生成主体形成准社会关系,即使这种关系可能不合适。 尽管知道生成式智能体是计算实体,但用户可能会将它们拟人化或赋予它们人类的情感。 虽然这种趋势可能会增加用户参与度,但它也带来了风险,例如用户过度依赖代理或在情感上依恋代理。 为了减轻这种风险,我们提出两个原则。 首先,生成式智能体应明确披露其作为计算实体的性质。 其次,生成式智能体的开发人员必须确保代理或底层语言模型的价值一致,以便它们不会做出在特定上下文中不恰当的行为,例如相互告白。

A second risk is the impact of errors. For example, if a ubiquitous computing application makes the wrong inference about a user's goals based on generative agent predictions, it could lead to annoyance at best and outright harm at worst. In our instantiation of generative agents, we mitigate these risks by focusing on an interactive video game environment, where such harms are unlikely. However, in other application domains, it will be important to follow best practices in human-AI design to understand errors and how they might percolate into the user experience.

第二个风险是错误的影响。 例如,如果无处不在的计算应用程序根据生成式智能体的预测对用户的目标做出错误的推断,最好的情况下可能会导致烦恼,最坏的情况下可能会造成彻底的伤害。 在我们的生成式智能体实例中,我们通过专注于交互式视频游戏环境来减轻这些风险,在这种环境中此类危害不太可能发生。 然而,在其他应用领域,遵循人类人工智能设计的最佳实践以了解错误以及它们如何渗透到用户体验中非常重要。

Third, generative agents may exacerbate existing risks associated with generative AI, such as deepfakes, misinformation generation, and tailored persuasion. To mitigate this risk, we suggest that platforms hosting generative agents maintain an audit log of the inputs and generated outputs. This would enable the detection, verification, and intervention against malicious use. While logging alone cannot directly prevent such misuse, it can reduce the likelihood of motivated actors engaging in this behavior, as the risk of disclosure would be higher. Additionally, building this architecture oneself can be time-consuming (in our case, roughly a year), which may deter some actors from pursuing such behavior by using their own generative agent infrastructures.

第三,生成式智能体可能会加剧与生成人工智能相关的现有风险,例如深度伪造、错误信息生成和定制说服。 为了减轻这种风险,我们建议托管生成式智能体的平台维护输入和生成输出的审核日志。 这将能够检测、验证和干预恶意使用。 虽然单独记录不能直接防止这种滥用,但它可以减少有动机的行为者参与这种行为的可能性,因为泄露的风险会更高。 此外,自己构建这种架构可能非常耗时(在我们的例子中,大约一年),这可能会阻止一些参与者通过使用自己的生成式智能体基础设施来追求这种行为。
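The audit log suggested above could be as simple as an append-only file with one JSON record per model call. The following is a minimal sketch under that assumption; the record fields and the function name are hypothetical, and a real deployment would also need access control and retention policies that are out of scope here.

```python
import json
from datetime import datetime, timezone
from typing import TextIO


def log_interaction(log: TextIO, agent_id: str,
                    prompt: str, output: str) -> dict:
    """Append one input/output pair to the audit log as a JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent_id": agent_id,
        "input": prompt,
        "output": output,
    }
    # One record per line keeps the log easy to append to and to scan later.
    log.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


if __name__ == "__main__":
    with open("audit.log", "a", encoding="utf-8") as log_file:
        log_interaction(
            log_file,
            agent_id="isabella_rodriguez",
            prompt="What are you planning for Valentine's Day?",
            output="I'm hosting a party at Hobbs Cafe.",
        )
```

Storing both the input and the generated output in the same record is what allows later verification of what an agent was asked and what it produced.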

A fourth risk is over-reliance: the concern that developers or designers might use generative agents and displace the role of humans and system stakeholders in the design process. We suggest that generative agents should never be a substitute for real human input in studies and design processes. Instead, they should be used to prototype ideas in the early stages of design when gathering participants may be challenging or when testing theories that are difficult or risky to test with real human participants. By adhering to these principles, we can ensure that the deployment of generative agents in the wild is ethical and socially responsible.

第四个风险是过度依赖:担心开发人员或设计人员可能会使用生成式智能体并取代人类和系统利益相关者在设计过程中的角色。 我们建议,生成式智能体永远不应替代研究和设计过程中的真实人类输入。 相反,当收集参与者可能具有挑战性或测试难以或有风险的真实人类参与者测试的理论时,它们应该用于在设计的早期阶段对想法进行原型设计。 通过遵守这些原则,我们可以确保在真实环境中部署生成式智能体是符合伦理且具有社会责任感的。

9. Conclusion

This paper introduces generative agents, interactive computational agents that simulate human behavior. We describe an architecture for generative agents that provides a mechanism for storing a comprehensive record of an agent's experiences, deepening its understanding of itself and the environment through reflection, and retrieving a compact subset of that information to inform the agent's actions. We then demonstrate the potential of generative agents by manifesting them as non-player characters in a Sims-style game world and simulating their lives within it. Evaluations suggest that our architecture creates believable behavior. Looking ahead, we suggest that generative agents can play roles in many interactive applications, ranging from design tools to social computing systems to immersive environments.

本文介绍了生成式智能体,即模拟人类行为的交互式计算代理。 我们描述了一种生成式智能体的架构,它提供了一种机制来存储代理经验的全面记录,通过反思加深其对自身和环境的理解,并检索该信息的紧凑子集以通知代理的操作。 然后,我们通过将生成式智能体表现为《模拟人生》风格的游戏世界中的非玩家角色并模拟他们在其中的生活来展示生成式智能体的潜力。 评估表明我们的架构创造了可信的行为。 展望未来,我们建议生成式智能体可以在许多交互式应用程序中发挥作用,从设计工具到社交计算系统再到沉浸式环境。