Abstract
Large language models (LLMs) have been increasingly used to interact with external environments (e.g., games, compilers, APIs) as goal-driven agents. However, it remains challenging for these language agents to quickly and efficiently learn from trial-and-error as traditional reinforcement learning methods require extensive training samples and expensive model fine-tuning. We propose Reflexion, a novel framework to reinforce language agents not by updating weights, but instead through linguistic feedback. Concretely, Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is flexible enough to incorporate various types (scalar values or free-form language) and sources (external or internally simulated) of feedback signals, and obtains significant improvements over a baseline agent across diverse tasks (sequential decision-making, coding, language reasoning). For example, Reflexion achieves a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4 that achieves 80%. We also conduct ablation and analysis studies using different feedback signals, feedback incorporation methods, and agent types, and provide insights into how they affect performance.
Introduction
Recent works such as ReAct, SayCan, Toolformer, HuggingGPT, generative agents, and WebGPT have demonstrated the feasibility of autonomous decision-making agents that are built on top of a large language model (LLM) core. These methods use LLMs to generate text and 'actions' that can be used in API calls and executed in an environment. Because they rely on massive models with enormous numbers of parameters, such approaches have so far been limited to teaching agents through in-context examples, since more traditional optimization schemes like reinforcement learning with gradient descent require substantial amounts of compute and time.
In this paper, we propose an alternative approach called Reflexion that uses verbal reinforcement to help agents learn from prior failings. Reflexion converts binary or scalar feedback from the environment into verbal feedback in the form of a textual summary, which is then added as additional context for the LLM agent in the next episode. This self-reflective feedback acts as a 'semantic' gradient signal by providing the agent with a concrete direction to improve upon, helping it learn from prior mistakes to perform better on the task. This is akin to how humans iteratively learn to accomplish complex tasks in a few-shot manner – by reflecting on their previous failures in order to form an improved plan of attack for the next attempt. For example, in Figure 1, a Reflexion agent learns to optimize its own behavior to solve decision-making, programming, and reasoning tasks through trial, error, and self-reflection.
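The conversion from sparse task feedback to verbal context can be sketched as follows. This is an illustrative outline, not the authors' exact prompts; `build_reflection_prompt` and `next_episode_context` are hypothetical helper names, and any real implementation would route the reflection prompt through an LLM completion call.

```python
# Sketch: turn a sparse binary outcome plus the failed trajectory into a
# textual reflection request, and prepend stored reflections to the next
# episode's context (the 'semantic' gradient signal described above).

REFLECTION_TEMPLATE = (
    "You attempted the task and {outcome}.\n"
    "Trajectory:\n{trajectory}\n"
    "Write a short reflection: what went wrong, and what should be tried next?"
)

def build_reflection_prompt(trajectory: str, success: bool) -> str:
    # This prompt would be sent to the Self-Reflection LLM (not shown here).
    outcome = "succeeded" if success else "failed"
    return REFLECTION_TEMPLATE.format(outcome=outcome, trajectory=trajectory)

def next_episode_context(task: str, reflections: list[str]) -> str:
    # Reflections from earlier trials become extra context for the next episode.
    memory = "\n".join(f"- {r}" for r in reflections)
    return f"{task}\n\nLessons from previous trials:\n{memory}"
```

The key design point is that no weights change between episodes: learning lives entirely in the text that `next_episode_context` prepends.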
Generating useful reflective feedback is challenging since it requires a good understanding of where the model made mistakes (i.e. the credit assignment problem [25]) as well as the ability to generate a summary containing actionable insights for improvement. We explore three ways of doing this – simple binary environment feedback, pre-defined heuristics for common failure cases, and self-evaluation such as binary classification using LLMs (decision-making) or self-written unit tests (programming). In all implementations, the evaluation signal is amplified into natural language experience summaries which can be stored in long-term memory.
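The self-evaluation route for programming tasks can be sketched as below: the agent's own generated unit tests are executed against its candidate implementation, and the pass/fail pattern is summarized into text. The candidate and test strings here are illustrative, not taken from the paper.

```python
# Sketch: score a candidate implementation with self-written unit tests,
# then amplify the binary results into a natural-language summary.

def run_self_tests(candidate_src: str, tests: list[str]) -> list[bool]:
    """Execute each self-written assertion against the candidate function."""
    results = []
    for test in tests:
        scope: dict = {}
        try:
            exec(candidate_src, scope)  # define the candidate function
            exec(test, scope)           # run one self-generated assertion
            results.append(True)
        except Exception:
            results.append(False)
    return results

def summarize(tests: list[str], results: list[bool]) -> str:
    """Turn pass/fail bits into a textual experience summary for memory."""
    failed = [t for t, ok in zip(tests, results) if not ok]
    if not failed:
        return "All self-written tests passed."
    return "Failing tests:\n" + "\n".join(failed)
```

Because the tests are self-generated rather than hidden ground-truth cases, this evaluation keeps the agent eligible for pass@1 scoring.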
Reflexion has several advantages compared to more traditional RL approaches like policy or value-based learning: 1) it is lightweight and doesn't require finetuning the LLM, 2) it allows for more nuanced forms of feedback (e.g. targeted changes in actions), compared to scalar or vector rewards that are challenging to perform accurate credit assignment with, 3) it allows for a more explicit and interpretable form of episodic memory over prior experiences, and 4) it provides more explicit hints for actions in future episodes. At the same time, it does have the disadvantages of relying on the power of the LLM's self-evaluation capabilities (or heuristics) and not having a formal guarantee for success. However, as LLM capabilities improve, we only expect this paradigm to get better over time.
We perform experiments on (1) decision-making tasks to test sequential action choices over long trajectories, (2) reasoning tasks to test knowledge-intensive, single-step generation improvement, and (3) programming tasks to teach the agent to effectively use external tools such as compilers and interpreters. Across all three types of tasks, we observe Reflexion agents are better decision-makers, reasoners, and programmers. More concretely, Reflexion agents improve on decision-making AlfWorld tasks over strong baseline approaches by an absolute 22% in 12 iterative learning steps, and on reasoning questions in HotPotQA by 20%, and Python programming tasks on HumanEval by as much as 11%.
To summarize, our contributions are the following:
- We propose Reflexion, a new paradigm for 'verbal' reinforcement that parameterizes a policy as an agent's memory encoding paired with a choice of LLM parameters.
- We explore this emergent property of self-reflection in LLMs and empirically show that self-reflection is extremely useful to learn complex tasks over a handful of trials.
- We introduce LeetcodeHardGym, a code-generation RL gym environment consisting of 40 challenging Leetcode questions ('hard-level') in 19 programming languages.
- We show that Reflexion achieves improvements over strong baselines across several tasks, and achieves state-of-the-art results on various code generation benchmarks.
Related work
Reasoning and decision-making Self-Refine employs an iterative framework for self-refinement to autonomously improve generation through self-evaluation. These self-evaluation and self-improvement steps are conditioned on given task constraints, such as "How can this generation be written in a more positive way". Self-Refine is effective but is limited to single-generation reasoning tasks. Pryzant et al. perform a similar semantic prompt-writing optimization, but it is also limited to single-generation tasks. Paul et al. fine-tune critic models to provide intermediate feedback within trajectories to improve reasoning responses. Xie et al. use stochastic beam search over actions to perform a more efficient decision-making search strategy, which allows the agent to exploit a foresight advantage due to its self-evaluation component. Yoran et al. and Nair et al. use decider models to reason over several generations. Kim et al. use a retry pattern over a fixed number of steps without an evaluation step. Goodman performs a qualitative evaluation step that proposes optimizations to the previous generation. In this paper, we show that several of these concepts can be enhanced with self-reflection to build a persisting memory of self-reflective experiences which allows an agent to identify its own errors and self-suggest lessons to learn from its mistakes over time.
Programming Several past and recent works employ variations of test-driven development or code debugging practices. AlphaCode evaluates a set of generations on hidden test cases. CodeT uses self-generated unit tests that are used to score generated function implementations. Self-Debugging employs a debugging component that is used to improve existing implementations given feedback from a code execution environment. CodeRL sets the problem in an RL framework using an actor-critic setup to debug programs given feedback from an execution environment. AlphaCode, Self-Debugging and CodeRL are effective in fixing less-complex program bugs, but they rely upon ground truth test cases that invalidate pass@1 eligibility, and do not use self-reflection to bridge the gap between error identification and implementation improvement. CodeT does not access hidden test cases but does not implement a self-learning step to improve code writing.
Figure 1: Reflexion works on decision-making (Section 4.1), programming (Section 4.3), and reasoning (Section 4.2) tasks.
Reflexion: reinforcement via verbal reflection
We develop a modular formulation for Reflexion, utilizing three distinct models: an Actor, denoted as M_a, which generates text and actions; an Evaluator model, represented by M_e, that scores the outputs produced by M_a; and a Self-Reflection model, denoted as M_sr, which generates verbal reinforcement cues to assist the Actor in self-improvement.
Actor The Actor is built upon a large language model (LLM) that is specifically prompted to generate the necessary text and actions conditioned on the state observations. Analogous to traditional policy-based RL setups, we sample an action or generation, a_t, from the current policy π_θ at time t and receive an observation o_t from the environment. Additionally, we add a memory component mem that provides extra context to the agent.
Evaluator The Evaluator component of the Reflexion framework plays a crucial role in assessing the quality of the generated outputs produced by the Actor. It takes as input a generated trajectory and computes a reward score that reflects its performance within the given task context. Defining effective value and reward functions that apply to semantic spaces is difficult, so we investigate several variants of the Evaluator model. For reasoning tasks, we explore reward functions based on exact match (EM) grading, ensuring that the generated output aligns closely with the expected solution. In decision-making tasks, we employ pre-defined heuristic functions that are tailored to specific evaluation criteria. Additionally, we experiment with using a different instantiation of an LLM itself as an Evaluator, generating rewards for decision-making and programming tasks. This multi-faceted approach to Evaluator design allows us to examine different strategies for scoring generated outputs, offering insights into their effectiveness and suitability across a range of tasks.
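The reasoning-task variant of the Evaluator, exact-match (EM) grading, can be sketched as follows. The normalization steps (lowercasing, stripping punctuation and articles) are a common EM convention, assumed here rather than quoted from the paper.

```python
# Sketch: a binary exact-match reward function for reasoning tasks.
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def em_reward(prediction: str, reference: str) -> int:
    # Reward is 1 if the normalized strings match exactly, else 0.
    return int(normalize(prediction) == normalize(reference))
```

The decision-making heuristics and the LLM-as-Evaluator variants would replace `em_reward` with a rule-based check or a scoring prompt, respectively, while keeping the same scalar-reward interface.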
Self-reflection The Self-Reflection model, instantiated as an LLM, plays a crucial role in the Reflexion framework by generating verbal self-reflections to provide valuable feedback for future trials. Given a sparse reward signal, such as a binary success status (success/fail), the current trajectory, and its persistent memory mem, the self-reflection model generates nuanced and specific feedback. This feedback, which is more informative than scalar rewards, is then stored in the agent's memory (mem). For instance, in a multi-step decision-making task, when the agent receives a failure signal, it can infer that a specific action a_i led to the subsequent incorrect actions a_{i+1} and a_{i+2}. The agent can then verbally state that it should have taken a different action, a'_i, which would have resulted in a'_{i+1} and a'_{i+2}, and store this experience in its memory to adapt its decision-making in subsequent trials.
Memory Core components of the Reflexion process are the notions of short-term and long-term memory. At inference time, the Actor conditions its decisions on short- and long-term memory, similar to the way that humans remember fine-grained recent details while also recalling distilled important experiences from long-term memory. In the RL setup, the trajectory history serves as the short-term memory while outputs from the Self-Reflection model are stored in long-term memory. These two memory components work together to provide context that is specific but also influenced by lessons learned over several trials, which is a key advantage of Reflexion agents over other LLM action choice works.
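The two memory components can be sketched as a small data structure: the short-term memory is the current trial's trajectory, and the long-term memory is a bounded sliding window of self-reflections (the capacity cap and the class name here are illustrative assumptions).

```python
# Sketch: short-term trajectory memory plus a bounded sliding window of
# long-term self-reflections, combined into the Actor's context.
from collections import deque

class ReflexionMemory:
    def __init__(self, max_reflections: int = 3):
        self.trajectory: list[str] = []  # short-term: current trial's steps
        # long-term: sliding window; deque evicts the oldest when full
        self.reflections: deque = deque(maxlen=max_reflections)

    def record_step(self, step: str) -> None:
        self.trajectory.append(step)

    def add_reflection(self, reflection: str) -> None:
        self.reflections.append(reflection)  # may evict the oldest reflection
        self.trajectory.clear()              # a new trial starts fresh

    def context(self) -> str:
        # Distilled long-term lessons first, then fine-grained recent steps.
        return "\n".join(list(self.reflections) + self.trajectory)
```

The `maxlen` bound mirrors the sliding-window limitation discussed later: only the most recent few reflections survive across trials.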
The Reflexion process Reflexion is formalized as an iterative optimization process. In the first trial, the Actor produces a trajectory τ_0 by interacting with the environment. The Evaluator then produces a score r_0, computed as r_t = M_e(τ_t); r_t is only a scalar reward for trial t that improves as task-specific performance increases. After the first trial, to amplify r_0 into a feedback form that an LLM can use for improvement, the Self-Reflection model analyzes the pair {τ_0, r_0} to produce a summary sr_0, which is stored in the memory mem; sr_t is verbal experience feedback for trial t. The Actor, Evaluator, and Self-Reflection models cooperate through trials in a loop until the Evaluator deems τ_t to be correct.
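The iterative process just described can be sketched as a short loop, with `actor`, `evaluator`, and `reflector` as hypothetical callables standing in for the three LLM-backed models M_a, M_e, and M_sr:

```python
# Sketch: the Reflexion trial loop. Each trial acts with the reflective
# memory as context, scores the trajectory, and on failure amplifies the
# scalar reward into a verbal reflection stored for the next trial.

def reflexion_loop(actor, evaluator, reflector, task, max_trials=5):
    mem: list[str] = []                      # long-term reflective memory
    for t in range(max_trials):
        trajectory = actor(task, mem)        # trial t: act with memory as context
        reward = evaluator(trajectory)       # scalar/binary score r_t
        if reward:                           # Evaluator deems trajectory correct
            return trajectory, t
        mem.append(reflector(trajectory, reward))  # amplify r_t into sr_t
    return trajectory, max_trials
```

A toy run illustrates the shape of the loop: an actor that only succeeds once it has at least one reflection in memory finishes on the second trial.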
Figure 2: Reflexion
Limitations
At its core, Reflexion is an optimization technique that uses natural language to do policy optimization. Policy optimization is a powerful approach to improve action choice through experience, but it may still get stuck in non-optimal local minima. In this study, we limit long-term memory to a sliding window with maximum capacity, but we encourage future work to extend the memory component of Reflexion with more advanced structures such as vector embedding databases or traditional SQL databases. Specific to code generation, there are many practical limitations to test-driven development in specifying accurate input-output mappings, such as non-deterministic generator functions, impure functions that interact with APIs, functions that vary output according to hardware specifications, or functions that invoke parallel or concurrent behavior that may be difficult to predict.
Broader impact
Large language models are increasingly used to interact with external environments (e.g., the Internet, software, robotics, etc.) and humans. Our work has the potential to reinforce and empower these agents toward greater automation and work efficiency, but it also amplifies the risks when such agents are misused. We believe that this direction of research will need more effort in safety and ethical considerations.
On the other hand, reinforcement learning has suffered from its black-box policy and optimization setups, in which interpretability and alignment have been challenging. Our proposed "verbal" reinforcement learning might address some of these issues and make autonomous agents more interpretable and diagnosable. For example, in cases of tool usage that may be too hard for humans to understand, self-reflections could be monitored to ensure proper intent before the tool is used.
Conclusion
In this work, we present Reflexion, an approach that leverages verbal reinforcement to teach agents to learn from past mistakes. We empirically show that Reflexion agents significantly outperform currently widely-used decision-making approaches by utilizing self-reflection. In future work, Reflexion could be used to employ more advanced techniques that have been thoroughly studied in traditional RL settings, such as value learning in natural language or off-policy exploration techniques.