

Abstract

While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information. We apply our approach, named ReAct, to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components. Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples.


Introduction

A unique feature of human intelligence is the ability to seamlessly combine task-oriented actions with verbal reasoning (or inner speech, Alderson-Day & Fernyhough, 2015), which has been theorized to play an important role in human cognition for enabling self-regulation or strategization (Vygotsky, 1987; Luria, 1965; Fernyhough, 2010) and maintaining a working memory (Baddeley, 1992). Consider the example of cooking up a dish in the kitchen. Between any two specific actions, we may reason in language in order to track progress (“now that everything is cut, I should heat up the pot of water”), to handle exceptions or adjust the plan according to the situation (“I don’t have salt, so let me use soy sauce and pepper instead”), and to realize when external information is needed (“how do I prepare dough? Let me search on the Internet”). We may also act (open a cookbook to read the recipe, open the fridge, check ingredients) to support the reasoning and to answer questions (“What dish can I make right now?”). This tight synergy between “acting” and “reasoning” allows humans to learn new tasks quickly and perform robust decision making or reasoning, even under previously unseen circumstances or facing information uncertainties.


Recent results have hinted at the possibility of combining verbal reasoning with interactive decision making in autonomous systems. On one hand, properly prompted large language models (LLMs) have demonstrated emergent capabilities to carry out several steps of reasoning traces to derive answers from questions in arithmetic, commonsense, and symbolic reasoning tasks (Wei et al., 2022). However, this “chain-of-thought” reasoning is a static black box, in that the model uses its own internal representations to generate thoughts and is not grounded in the external world, which limits its ability to reason reactively or update its knowledge. This can lead to issues like fact hallucination and error propagation over the reasoning process (Figure 1 (1b)). On the other hand, recent work has explored the use of pre-trained language models for planning and acting in interactive environments (Ahn et al., 2022; Nakano et al., 2021; Yao et al., 2020; Huang et al., 2022a), with a focus on predicting actions via language priors. These approaches usually convert multi-modal observations into text, use a language model to generate domain-specific actions or plans, and then use a controller to choose or execute them. However, they do not employ language models to reason abstractly about high-level goals or maintain a working memory to support acting, barring Huang et al. (2022b) who perform a limited form of verbal reasoning to reiterate spatial facts about the current state. Beyond such simple embodied tasks to interact with a few blocks, there have not been studies on how reasoning and acting can be combined in a synergistic manner for general task solving, and if such a combination can bring systematic benefits compared to reasoning or acting alone.


In this work, we present ReAct, a general paradigm to combine reasoning and acting with language models for solving diverse language reasoning and decision making tasks (Figure 1). ReAct prompts LLMs to generate both verbal reasoning traces and actions pertaining to a task in an interleaved manner, which allows the model to perform dynamic reasoning to create, maintain, and adjust high-level plans for acting (reason to act), while also interacting with external environments (e.g. Wikipedia) to incorporate additional information into reasoning (act to reason).


We conduct empirical evaluations of ReAct and state-of-the-art baselines on four diverse benchmarks: question answering (HotPotQA, Yang et al., 2018), fact verification (Fever, Thorne et al., 2018), text-based game (ALFWorld, Shridhar et al., 2020b), and webpage navigation (WebShop, Yao et al., 2022). For HotPotQA and Fever, with access to a Wikipedia API that the model can interact with, ReAct outperforms vanilla action generation models while being competitive with chain-of-thought reasoning (CoT) (Wei et al., 2022). The best approach overall is a combination of ReAct and CoT that allows for the use of both internal knowledge and externally obtained information during reasoning. On ALFWorld and WebShop, two or even one-shot ReAct prompting is able to outperform imitation or reinforcement learning methods trained with 10³–10⁵ task instances, with an absolute improvement of 34% and 10% in success rates respectively. We also demonstrate the importance of sparse, versatile reasoning in decision making by showing consistent advantages over controlled baselines with actions only. Besides general applicability and performance boost, the combination of reasoning and acting also contributes to model interpretability, trustworthiness, and diagnosability across all domains, as humans can readily distinguish information from the model’s internal knowledge versus external environments, as well as inspect reasoning traces to understand the decision basis of model actions.


To summarize, our key contributions are the following: (1) we introduce ReAct, a novel prompt-based paradigm to synergize reasoning and acting in language models for general task solving; (2) we perform extensive experiments across diverse benchmarks to showcase the advantage of ReAct in a few-shot learning setup over prior approaches that perform either reasoning or action generation in isolation; (3) we present systematic ablations and analysis to understand the importance of acting in reasoning tasks, and reasoning in interactive tasks; (4) we analyze the limitations of ReAct under the prompting setup (i.e. limited support of reasoning and acting behaviors), and perform initial finetuning experiments showing the potential of ReAct to improve with additional training data. Scaling up ReAct to train and operate on more tasks and combining it with complementary paradigms like reinforcement learning could further unlock the potential of large language models.



Figure 1: (1) Comparison of four prompting methods, (a) standard prompting, (b) chain-of-thought (reason only), (c) act-only, and (d) ReAct (reason + act), solving a HotpotQA question. (2) Comparison of (a) act-only and (b) ReAct prompting to solve an AlfWorld game task. In both domains, we omit in-context examples from the prompt and show only the task-solving trajectory of (actions, thoughts) generated by the model and (observations) returned by the environment.

ReAct: Synergizing Reasoning + Acting

Consider a general setup of an agent interacting with an environment for task solving. At time step t, an agent receives an observation o_t ∈ O from the environment and takes an action a_t ∈ A following some policy π(a_t | c_t), where c_t = (o_1, a_1, ..., o_{t-1}, a_{t-1}, o_t) is the context to the agent. Learning a policy is challenging when the mapping c_t ↦ a_t is highly implicit and requires extensive computation. For example, the agent shown in Figure 1(c) is unable to generate the correct final action (Act 4) to finish the QA task, as it requires complex reasoning over the trajectory context (Question, Act 1-3, Obs 1-3). Similarly, the agent shown in Figure 1(2a) fails to comprehend from the context that sinkbasin 1 does not contain peppershaker 1, and thus keeps producing hallucinated actions.

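The agent-environment loop above can be sketched in a few lines of Python. This is an illustrative stand-in, not the paper's code: `run_episode` maintains the growing context c_t of interleaved observations and actions, and a `policy` callable maps c_t to the next action a_t. The toy environment at the bottom is purely hypothetical, included so the sketch is self-contained.

```python
# A minimal sketch of the agent-environment loop described above (not the
# paper's implementation). The policy maps the growing context
# c_t = (o_1, a_1, ..., o_{t-1}, a_{t-1}, o_t) to the next action a_t.

def run_episode(env, policy, max_steps=10):
    context = []                              # c_t, interleaved obs/actions
    context.append(("obs", env.reset()))      # initial observation o_1
    for _ in range(max_steps):
        action = policy(context)              # a_t ~ pi(a_t | c_t)
        context.append(("act", action))
        observation, done = env.step(action)  # environment feedback o_{t+1}
        context.append(("obs", observation))
        if done:
            break
    return context

# A toy environment, for illustration only: the episode ends after two steps.
class ToyEnv:
    def __init__(self):
        self.t = 0
    def reset(self):
        return "start"
    def step(self, action):
        self.t += 1
        return f"obs{self.t}", self.t >= 2
```

The difficulty the paragraph points at is exactly the `policy(context)` call: when the mapping from context to action requires multi-step reasoning, a direct action predictor struggles, which motivates augmenting the action space with thoughts below.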

The idea of ReAct is simple: we augment the agent’s action space to Â = A ∪ L, where L is the space of language. An action â_t ∈ L in the language space, which we will refer to as a thought or a reasoning trace, does not affect the external environment, and thus leads to no observation feedback. Instead, a thought â_t aims to compose useful information by reasoning over the current context c_t, and updates the context to c_{t+1} = (c_t, â_t) to support future reasoning or acting. As shown in Figure 1, useful thoughts come in various types, e.g. decomposing task goals and creating action plans (2b, Act 1; 1d, Thought 1), injecting commonsense knowledge relevant to task solving (2b, Act 1), extracting important parts from observations (1d, Thought 2, 4), tracking progress and transitioning between action plans (2b, Act 8), handling exceptions and adjusting action plans (1d, Thought 3), and so on.


However, as the language space L is unlimited, learning in this augmented action space is difficult and requires strong language priors. In this paper, we mainly focus on the setup where a frozen large language model, PaLM-540B (Chowdhery et al., 2022), is prompted with few-shot in-context examples to generate both domain-specific actions and free-form language thoughts for task solving (Figure 1(d), (2b)). Each in-context example is a human trajectory of actions, thoughts, and environment observations to solve a task instance (see Appendix C). For the tasks where reasoning is of primary importance (Figure 1(1)), we alternate the generation of thoughts and actions so that the task-solving trajectory consists of multiple thought-action-observation steps. In contrast, for decision making tasks that potentially involve a large number of actions (Figure 1(2)), thoughts only need to appear sparsely in the most relevant positions of a trajectory, so we let the language model decide the asynchronous occurrence of thoughts and actions for itself.

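The interleaved thought-action-observation scheme can be sketched as a simple loop. Here `llm` and `env` are hypothetical stand-ins (a text-completion call and a domain-action executor such as a Wikipedia search), and the `Thought:`/`Action:`/`Observation:` line format and `finish[...]` terminal action are illustrative conventions in the spirit of the trajectories in Figure 1, not an exact reproduction of the paper's prompts:

```python
# Sketch of the ReAct loop. Assumptions: `llm(prompt)` returns the next
# "Thought: ..." or "Action: ..." line, and `env(action)` executes a domain
# action (e.g. a Wikipedia lookup) and returns an observation string.
def react(question, llm, env, max_steps=8):
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(prompt)                    # model decides: think or act
        prompt += step + "\n"
        if step.startswith("Action:"):
            action = step[len("Action:"):].strip()
            if action.startswith("finish["):  # terminal action: return answer
                return action[len("finish["):-1], prompt
            observation = env(action)         # act to reason: gather information
            prompt += f"Observation: {observation}\n"
        # Thoughts produce no observation; they only extend the context
        # (reason to act).
    return None, prompt
```

Note how the loop realizes the asymmetry defined above: a thought updates only the prompt (context), while a domain action additionally feeds an observation back in. Because the model itself emits either a `Thought:` or an `Action:` line each step, the sparse, asynchronous occurrence of thoughts described for decision making tasks falls out of the same loop.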

Since decision making and reasoning capabilities are integrated into a large language model, ReAct enjoys several unique features: A) Intuitive and easy to design: Designing ReAct prompts is straightforward as human annotators just type down their thoughts in language on top of their actions taken. No ad-hoc format choice, thought design, or example selection is used in this paper. We detail prompt design for each task in Sections 3 and 4. B) General and flexible: Due to the flexible thought space and thought-action occurrence format, ReAct works for diverse tasks with distinct action spaces and reasoning needs, including but not limited to QA, fact verification, text game, and web navigation. C) Performant and robust: ReAct shows strong generalization to new task instances while learning solely from one to six in-context examples, consistently outperforming baselines with only reasoning or acting across different domains. We also show in Section 3 additional benefits when finetuning is enabled, and in Section 4 how ReAct performance is robust to prompt selections. D) Human aligned and controllable: ReAct promises an interpretable sequential decision making and reasoning process where humans can easily inspect reasoning and factual correctness. Moreover, humans can also control or correct the agent behavior on the go by thought editing, as shown in Figure 5 in Section 4.


Language model for reasoning Perhaps the most well-known work using LLMs for reasoning is Chain-of-Thought (CoT) (Wei et al., 2022), which reveals the ability of LLMs to formulate their own "thinking procedure" for problem solving. Several follow-up works have since been performed, including least-to-most prompting for solving complicated tasks (Zhou et al., 2022), zero-shot-CoT (Kojima et al., 2022), and reasoning with self-consistency (Wang et al., 2022a). Recently, Madaan & Yazdanbakhsh (2022) systematically studied the formulation and structure of CoT, and observed that the presence of symbols, patterns, and text is crucial to the effectiveness of CoT. Other work has also extended beyond simple prompting to more sophisticated reasoning architectures. For example, Selection-Inference (Creswell et al., 2022) divides the reasoning process into two steps of "selection" and "inference". STaR (Zelikman et al., 2022) bootstraps the reasoning process by finetuning the model on correct rationales generated by the model itself. Faithful reasoning (Creswell & Shanahan, 2022) decomposes multi-step reasoning into three steps, each performed by a dedicated LM. Similar approaches like Scratchpad (Nye et al., 2021), which finetunes an LM on intermediate computation steps, also demonstrate improvement on multi-step computation problems. In contrast to these methods, ReAct performs more than just isolated, fixed reasoning: it integrates model actions and their corresponding observations into a coherent stream of inputs for the model to reason more accurately and tackle tasks beyond reasoning (e.g. interactive decision making).


Language model for decision making The strong capability of LLMs has enabled them to perform tasks beyond language generation, and it is becoming more popular to take advantage of LLMs as a policy model for decision making, especially in interactive environments. WebGPT (Nakano et al., 2021) uses an LM to interact with web browsers, navigate through web pages, and infer answers to complicated questions from ELI5 (Fan et al., 2019). In comparison to ReAct, WebGPT does not explicitly model the thinking and reasoning procedure, instead relying on expensive human feedback for reinforcement learning. In conversation modeling, chatbots like BlenderBot (Shuster et al., 2022b) and Sparrow (Glaese et al., 2022) and task-oriented dialogue systems like SimpleTOD (Hosseini-Asl et al., 2020) also train LMs to make decisions about API calls. Unlike ReAct, they do not explicitly consider the reasoning procedure either, and also rely on expensive datasets and human feedback collection for policy learning. In contrast, ReAct learns a policy in a much cheaper way, since the decision making process only requires a language description of the reasoning procedure.


LLMs have also been increasingly employed in interactive and embodied environments for planning and decision making. Perhaps most relevant to ReAct in this respect are SayCan (Ahn et al., 2022) and Inner Monologue (Huang et al., 2022b), which use LLMs for robotic action planning and decision making. In SayCan, LLMs were prompted to directly predict possible actions a robot can take, which are then reranked by an affordance model grounded in the visual environment for final prediction. Inner Monologue made further improvements by adding the eponymous "inner monologue", which is implemented as injected feedback from the environment. To our knowledge, Inner Monologue is the first work that demonstrates such a closed-loop system, which ReAct builds on. However, we argue that Inner Monologue does not truly comprise inner thoughts — this is elaborated in Section 4. We also note that leveraging language as semantically-rich inputs in the process of interactive decision making has been shown to be successful under other settings (Abramson et al., 2020; Karamcheti et al., 2021; Huang et al., 2022a; Li et al., 2022). It is becoming more evident that with the help of LLMs, language as a fundamental cognitive mechanism will play a critical role in interaction and decision making. What is more, progress in LLMs has also inspired the development of versatile and generalist agents like Reed et al. (2022).


Conclusion

We have proposed ReAct — a simple yet effective method for synergizing reasoning and acting in large language models. Through a diverse set of experiments on multi-hop question-answering, fact checking, and interactive decision-making tasks, we show that ReAct leads to superior performance with interpretable decision traces. Despite the simplicity of our method, complex tasks with large action spaces require more demonstrations to learn well, which unfortunately can easily exceed the input length limit of in-context learning. We explore the finetuning approach on HotpotQA with promising initial results, but learning from more high-quality human annotations remains the desideratum for further improving performance. Scaling up ReAct with multi-task training and combining it with complementary paradigms like reinforcement learning could result in stronger agents that further unlock the potential of LLMs for more applications.
