Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Abstract

Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, "Tree of Thoughts" (ToT), which generalizes over the popular "Chain of Thought" approach to prompting language models, and enables exploration over coherent units of text ("thoughts") that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. Our experiments show that ToT significantly enhances language models' problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%.


Introduction

Originally designed to generate text, scaled-up versions of language models (LMs) such as GPT and PaLM have been shown to be increasingly capable of performing an ever wider range of tasks requiring mathematical, symbolic, commonsense, and knowledge reasoning. It is perhaps surprising that underlying all this progress is still the original autoregressive mechanism for generating text, which makes token-level decisions one by one and in a left-to-right fashion. Is such a simple mechanism sufficient for a LM to be built toward a general problem solver? If not, what problems would challenge the current paradigm, and what should be alternative mechanisms?


The literature on human cognition provides some clues to answer these questions. Research on “dual process” models suggests that people have two modes in which they engage with decisions – a fast, automatic, unconscious mode (“System 1”) and a slow, deliberate, conscious mode (“System 2”). These two modes have previously been connected to a variety of mathematical models used in machine learning. For example, research on reinforcement learning in humans and other animals has explored the circumstances under which they engage in associative “model free” learning or more deliberative “model based” planning. The simple associative token-level choices of LMs are also reminiscent of “System 1”, and thus might benefit from augmentation by a more deliberate “System 2” planning process that (1) maintains and explores diverse alternatives for current choices instead of just picking one, and (2) evaluates its current status and actively looks ahead or backtracks to make more global decisions.


To design such a planning process, we return to the origins of artificial intelligence (and cognitive science), drawing inspiration from the planning processes explored by Newell, Shaw, and Simon starting in the 1950s. Newell and colleagues characterized problem solving as search through a combinatorial problem space, represented as a tree. We thus propose the Tree of Thoughts (ToT) framework for general problem solving with language models. As Figure 1 illustrates, while existing methods (detailed below) sample continuous language sequences for problem solving, ToT actively maintains a tree of thoughts, where each thought is a coherent language sequence that serves as an intermediate step toward problem solving (Table 1). Such a high-level semantic unit allows the LM to self-evaluate the progress different intermediate thoughts make towards solving the problem through a deliberate reasoning process that is also instantiated in language (Figures 2,4,6). This implementation of search heuristics via LM self-evaluation and deliberation is novel, as previous search heuristics are either programmed or learned. Finally, we combine this language-based capability to generate and evaluate diverse thoughts with search algorithms, such as breadth-first search (BFS) or depth-first search (DFS), which allow systematic exploration of the tree of thoughts with lookahead and backtracking.


Empirically, we propose three new problems that challenge existing LM inference methods even with the state-of-the-art language model, GPT-4: Game of 24, Creative Writing, and Crosswords (Table 1). These tasks require deductive, mathematical, commonsense, lexical reasoning abilities, and a way to incorporate systematic planning or search. We show ToT obtains superior results on all three tasks by being general and flexible enough to support different levels of thoughts, different ways to generate and evaluate thoughts, and different search algorithms that adapt to the nature of different problems. We also analyze how such choices affect model performances via systematic ablations and discuss future directions to better train and use LMs.



Figure 1: Schematic illustrating various approaches to problem solving with LLMs. Each rectangular box represents a "thought", a coherent language sequence that serves as an intermediate step toward problem solving. See Figures 2, 4, and 6 for concrete examples of how thoughts are generated, evaluated, and searched.

Background

We first formalize some existing methods that use large language models for problem-solving, which our approach is inspired by and later compared with. We use $p_\theta$ to denote a pre-trained LM with parameters $\theta$, and lowercase letters $x, y, z, s, \cdots$ to denote a language sequence, i.e. $x = (x[1], \cdots, x[n])$ where each $x[i]$ is a token, so that $p_\theta(x) = \prod_{i=1}^{n} p_\theta(x[i] \mid x[1 \cdots i])$. We use uppercase letters $S, \cdots$ to denote a collection of language sequences.


Input-output (IO) prompting is the most common way to turn a problem input $x$ into output $y$ with the LM: $y \sim p_\theta(y \mid \text{prompt}_{IO}(x))$, where $\text{prompt}_{IO}(x)$ wraps input $x$ with task instructions and/or few-shot input-output examples. For simplicity, let us denote $p_\theta^{\text{prompt}}(\text{output} \mid \text{input}) = p_\theta(\text{output} \mid \text{prompt}(\text{input}))$, so that IO prompting can be formulated as $y \sim p_\theta^{IO}(y \mid x)$.


Chain-of-thought (CoT) prompting (Wei et al., 2022) was proposed to address cases where the mapping of input $x$ to output $y$ is non-trivial (e.g. when $x$ is a math question and $y$ is the final numerical answer). The key idea is to introduce a chain of thoughts $z_1, \cdots, z_n$ to bridge $x$ and $y$, where each $z_i$ is a coherent language sequence that serves as a meaningful intermediate step toward problem solving (e.g. $z_i$ could be an intermediate equation for math QA). To solve problems with CoT, each thought $z_i \sim p_\theta^{CoT}(z_i \mid x, z_{1 \cdots i-1})$ is sampled sequentially, then the output $y \sim p_\theta^{CoT}(y \mid x, z_{1 \cdots n})$. In practice, $[z_{1 \cdots n}, y] \sim p_\theta^{CoT}(z_{1 \cdots n}, y \mid x)$ is sampled as a continuous language sequence, and the decomposition of thoughts (e.g. is each $z_i$ a phrase, a sentence, or a paragraph) is left ambiguous.


Self-consistency with CoT (CoT-SC) (Wang et al., 2022) is an ensemble approach that samples $k$ i.i.d. chains of thought: $[z_{1 \cdots n}^{(i)}, y^{(i)}] \sim p_\theta^{CoT}(z_{1 \cdots n}, y \mid x)$ $(i = 1 \cdots k)$, then returns the most frequent output: $\arg\max_y \#\{i \mid y^{(i)} = y\}$. CoT-SC improves upon CoT, because there are generally different thought processes for the same problem (e.g. different ways to prove the same theorem), and the output decision can be more faithful by exploring a richer set of thoughts. However, within each chain there is no local exploration of different thought steps, and the "most frequent" heuristic only applies when the output space is limited (e.g. multi-choice QA).

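The three baseline formulations above can be summarized in a short sketch. The following Python is illustrative only: `lm_sample` is a hypothetical stand-in for a stochastic LM call, and the canned outputs and prompt templates are assumptions, not the paper's actual prompts.

```python
from collections import Counter
from itertools import cycle

# Hypothetical stand-in for a stochastic LM call y ~ p_theta(y | prompt);
# a real system would query a model API. This stub cycles canned outputs.
_canned = cycle(["24", "23", "24"])

def lm_sample(prompt: str) -> str:
    return next(_canned)

def io_prompting(x: str) -> str:
    # IO prompting: y ~ p_theta(y | prompt_IO(x)), where prompt_IO wraps
    # the input with task instructions.
    return lm_sample(f"Answer the problem.\nInput: {x}\nOutput:")

def cot_prompting(x: str) -> str:
    # CoT prompting: thoughts and answer [z_1..n, y] are sampled as one
    # continuous sequence; only the final line is taken as the answer y.
    full = lm_sample(f"Solve step by step.\nInput: {x}\nSteps then answer:")
    return full.strip().split("\n")[-1]

def cot_sc(x: str, k: int) -> str:
    # CoT-SC: sample k i.i.d. chains, return argmax_y #{i | y^(i) = y}.
    answers = [cot_prompting(x) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```

Note how CoT-SC only aggregates final answers; as the text above observes, there is no exploration of alternative steps inside any single chain.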

Tree of Thoughts: Deliberate Problem Solving with LM

A genuine problem-solving process involves the repeated use of available information to initiate exploration, which discloses, in turn, more information until a way to attain the solution is finally discovered. — Newell et al.


Research on human problem-solving suggests that people search through a combinatorial problem-space – a tree where the nodes represent partial solutions, and the branches correspond to operators that modify them. Which branch to take is determined by heuristics that help to navigate the problem-space and guide the problem-solver towards a solution. This perspective highlights two key shortcomings of existing approaches that use LMs to solve general problems: 1) Locally, they do not explore different continuations within a thought process – the branches of the tree. 2) Globally, they do not incorporate any type of planning, lookahead, or backtracking to help evaluate these different options – the kind of heuristic-guided search that seems characteristic of human problem-solving.


To address these shortcomings, we introduce Tree of Thoughts (ToT), a paradigm that allows LMs to explore multiple reasoning paths over thoughts (Figure 1(c)). ToT frames any problem as a search over a tree, where each node is a state $s = [x, z_{1 \cdots i}]$ representing a partial solution with the input and the sequence of thoughts so far. A specific instantiation of ToT involves answering four questions:

  1. How to decompose the intermediate process into thought steps;
  2. How to generate potential thoughts from each state;
  3. How to heuristically evaluate states;
  4. What search algorithm to use.


1. Thought decomposition While CoT samples thoughts coherently without explicit decomposition, ToT leverages problem properties to design and decompose intermediate thought steps. As Table 1 shows, depending on different problems, a thought could be a couple of words (Crosswords), a line of equation (Game of 24), or a whole paragraph of writing plan (Creative Writing). In general, a thought should be "small" enough so that LMs can generate promising and diverse samples (e.g. generating a whole book is usually too "big" to be coherent), yet "big" enough so that LMs can evaluate its prospect toward problem solving (e.g. generating one token is usually too "small" to evaluate).


2. Thought generator $G(p_\theta, s, k)$ Given a tree state $s = [x, z_{1 \cdots i}]$, we consider two strategies to generate $k$ candidates for the next thought step:

  • (a) Sample i.i.d. thoughts from a CoT prompt (Creative Writing, Figure 4): $z^{(j)} \sim p_\theta^{CoT}(z_{i+1} \mid s) = p_\theta^{CoT}(z_{i+1} \mid x, z_{1 \cdots i})$ $(j = 1 \cdots k)$. This works better when the thought space is rich (e.g. each thought is a paragraph), and i.i.d. samples lead to diversity;
  • (b) Propose thoughts sequentially using a "propose prompt" (Game of 24, Figure 2; Crosswords, Figure 6): $[z^{(1)}, \cdots, z^{(k)}] \sim p_\theta^{\text{propose}}(z_{i+1}^{(1 \cdots k)} \mid s)$. This works better when the thought space is more constrained (e.g. each thought is just a word or a line), so proposing different thoughts in the same context avoids duplication.

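As a rough sketch, the two generation strategies might look like the following; `lm_sample`, `generate_thoughts`, and both prompt templates are hypothetical stand-ins, not the paper's actual prompts.

```python
# Hypothetical deterministic stand-in for an LM sample; a real call would be
# stochastic (for strategy (a)) or return k proposals in one completion
# (for strategy (b)).
def lm_sample(prompt: str, seed: int = 0) -> str:
    return f"thought-{seed}"

def generate_thoughts(state: str, k: int, strategy: str = "sample") -> list:
    if strategy == "sample":
        # (a) k i.i.d. samples from a CoT prompt: suits rich thought spaces
        # (e.g. paragraph-level thoughts), where sampling yields diversity.
        return [lm_sample(f"{state}\nNext step:", seed=j) for j in range(k)]
    else:
        # (b) one "propose prompt" emits up to k distinct thoughts in the
        # same context, avoiding duplicates in constrained thought spaces.
        proposal = lm_sample(f"{state}\nPropose {k} distinct next steps, one per line:")
        return proposal.strip().split("\n")[:k]
```

The design difference is where diversity comes from: independent sampling in (a), versus a single completion that is asked for distinct alternatives in (b).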

3. State evaluator $V(p_\theta, S)$ Given a frontier of different states, the state evaluator evaluates the progress they make towards solving the problem, serving as a heuristic for the search algorithm to determine which states to keep exploring and in which order. While heuristics are a standard approach to solving search problems, they are typically either programmed (e.g. DeepBlue) or learned (e.g. AlphaGo). We propose a third alternative, by using the LM to deliberately reason about states. When applicable, such a deliberate heuristic can be more flexible than programmed rules, and more sample-efficient than learned models. Similar to the thought generator, we consider two strategies to evaluate states either independently or together:

  • (a) Value each state independently: $V(p_\theta, S)(s) \sim p_\theta^{\text{value}}(v \mid s)\ \forall s \in S$, where a value prompt reasons about the state $s$ to generate a scalar value $v$ (e.g. 1-10) or a classification (e.g. sure/likely/impossible) that could be heuristically turned into a value. The basis of such evaluative reasoning can vary across problems and thought steps. In this work, we explore evaluation via a few lookahead simulations (e.g. quickly confirming that 5, 5, 14 can reach 24 via 5 + 5 + 14, or that "hot_l" can mean "inn" by filling "e" into "_") plus commonsense (e.g. 1, 2, 3 are too small to reach 24, or no word can start with "tzxc"). While the former might promote "good" states, the latter could help eliminate "bad" states. Such valuations do not need to be perfect, and only need to be approximately helpful for decision making.
  • (b) Vote across states: $V(p_\theta, S)(s) = \mathbb{1}[s = s^*]$, where a "good" state $s^* \sim p_\theta^{\text{vote}}(s^* \mid S)$ is voted out based on deliberately comparing different states in $S$ in a vote prompt. When problem success is harder to directly value (e.g. passage coherency), it is natural to instead compare different partial solutions and vote for the most promising one. This is similar in spirit to a "step-wise" self-consistency strategy, i.e. cast "which state to explore" as a multi-choice QA, and use LM samples to vote for it.


For both strategies, we could prompt the LM multiple times to aggregate the value or vote results to trade time/resource/cost for more faithful/robust heuristics.

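Both evaluation strategies, with multi-sample aggregation, can be sketched as follows. The sure/likely/impossible labels come from the classification example above; the numeric score mapping, `lm_classify`, and its trivial keyword rule are assumptions standing in for the actual value and vote prompts.

```python
from collections import Counter

# Heuristic mapping from the classification labels above to scalar values;
# the exact numbers are an illustrative assumption.
SCORES = {"sure": 1.0, "likely": 0.5, "impossible": 0.0}

def lm_classify(state: str) -> str:
    # Stand-in for p_theta^value(v | s): a real value prompt would reason
    # about the state; this stub applies a trivial keyword rule.
    return "likely" if "24" in state else "impossible"

def value_state(state: str, n: int = 3) -> float:
    # (a) value each state independently, averaging n LM samples to trade
    # cost for a more robust heuristic.
    return sum(SCORES[lm_classify(state)] for _ in range(n)) / n

def vote_states(states: list, n: int = 5) -> str:
    # (b) vote across states: each of n "ballots" picks one candidate and
    # the most-voted state wins (here each ballot reuses the value stub).
    ballots = [max(range(len(states)), key=lambda i: value_state(states[i]))
               for _ in range(n)]
    return states[Counter(ballots).most_common(1)[0][0]]
```

Averaging over `n` calls is exactly the time/cost-for-robustness tradeoff described above.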

4. Search algorithm Finally, within the ToT framework, one can plug and play different search algorithms depending on the tree structure. We explore two relatively simple search algorithms and leave more advanced ones (e.g. A*, MCTS) for future work:

  • (a) Breadth-first search (BFS) (Algorithm 1) maintains a set of the $b$ most promising states per step. This is used for Game of 24 and Creative Writing, where the tree depth is limited ($T \le 3$) and initial thought steps can be evaluated and pruned to a small set ($b \le 5$).
  • (b) Depth-first search (DFS) (Algorithm 2) explores the most promising state first, until the final output is reached ($t > T$), or the state evaluator deems it impossible to solve the problem from the current $s$ ($V(p_\theta, \{s\})(s) \le v_{th}$ for a value threshold $v_{th}$). In the latter case, the subtree from $s$ is pruned to trade exploration for exploitation. In both cases, DFS backtracks to the parent state of $s$ to continue exploration.


Conceptually, ToT has several benefits as a method for general problem-solving with LMs:

  • (1) Generality. IO, CoT, CoT-SC, and self-refinement can be seen as special cases of ToT (i.e., trees of limited depth and breadth; Figure 1).
  • (2) Modularity. The base LM, as well as the thought decomposition, generation, evaluation, and search procedures can all be varied independently.
  • (3) Adaptability. Different problem properties, LM capabilities, and resource constraints can be accommodated.
  • (4) Convenience. No extra training is needed, just a pre-trained LM is sufficient. The next section will show how these conceptual benefits translate to strong empirical performance in different problems.


Algorithm 1 Tree of Thoughts BFS (ToT-BFS)

Input: input $x$, LM $p_\theta$, thought generator $G$ and size limit $k$, state evaluator $V$, step limit $T$, breadth limit $b$.

  1. $S_0 \leftarrow \{x\}$
  2. for $t = 1, \cdots, T$ do
  3.     $S'_t \leftarrow \{[s, z] \mid s \in S_{t-1}, z \in G(p_\theta, s, k)\}$
  4.     $V_t \leftarrow V(p_\theta, S'_t)$
  5.     $S_t \leftarrow \arg\max_{S \subset S'_t, |S| = b} \sum_{s \in S} V_t(s)$     // keep the $b$ highest-valued states
  6. end for
  7. return $G(p_\theta, \arg\max_{s \in S_T} V_T(s), 1)$
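Algorithm 1 translates almost line-for-line into Python. This is a sketch only: `generate` and `evaluate` are caller-supplied stand-ins for $G(p_\theta, \cdot, k)$ and $V(p_\theta, \cdot)$, states are represented as plain lists $[x, z_1, \cdots]$, and $T \ge 1$ is assumed.

```python
def tot_bfs(x, generate, evaluate, T: int, b: int, k: int):
    """Breadth-first Tree of Thoughts search (sketch of Algorithm 1)."""
    frontier = [[x]]                 # S_0 = {x}; a state is [x, z_1, ...]
    values = {}
    for t in range(1, T + 1):        # assumes T >= 1
        # S'_t: extend every kept state with each of its k candidate thoughts
        candidates = [s + [z] for s in frontier for z in generate(s, k)]
        values = {tuple(s): evaluate(s) for s in candidates}       # V_t
        # S_t: keep the b highest-valued states
        candidates.sort(key=lambda s: values[tuple(s)], reverse=True)
        frontier = candidates[:b]
    # final generation from the best remaining state
    best = max(frontier, key=lambda s: values[tuple(s)])
    return generate(best, 1)[0]
```

With `b = k = 1` this degenerates to plain CoT, which is one way to see the "special cases" claim in the Generality point above.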

Algorithm 2 Tree of Thoughts DFS (ToT-DFS)

Input: current state $s$, current step $t$, LM $p_\theta$, thought generator $G$ and size limit $k$, state evaluator $V$, step limit $T$, value threshold $v_{th}$.

  1. if $t > T$ then record output $G(p_\theta, s, 1)$
  2. for each $s' \in G(p_\theta, s, k)$ (sorted candidates) do
  3.     if $V(p_\theta, \{s'\})(s') > v_{th}$ then     // otherwise, prune the subtree
  4.         ToT-DFS($s'$, $t+1$, $p_\theta$, $G$, $k$, $V$, $T$, $v_{th}$)
  5.     end if
  6. end for
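Algorithm 2 similarly admits a recursive sketch; `generate`, `evaluate`, and the `outputs` accumulator are assumed stand-ins, and candidates are value-sorted so the most promising child is explored first.

```python
def tot_dfs(s, t, generate, evaluate, T, v_th, k, outputs):
    """Depth-first Tree of Thoughts search (sketch of Algorithm 2)."""
    if t > T:
        outputs.append(generate(s, 1)[0])   # record a final output
        return
    # explore candidate thoughts best-first (sorted by value, descending)
    children = sorted((s + [z] for z in generate(s, k)),
                      key=evaluate, reverse=True)
    for child in children:
        if evaluate(child) > v_th:          # otherwise: prune the subtree
            tot_dfs(child, t + 1, generate, evaluate, T, v_th, k, outputs)
        # returning from the recursive call backtracks to the parent state
```

Backtracking needs no explicit bookkeeping here: it falls out of the recursion, matching the prose description of DFS returning to the parent state of $s$.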

Planning and decision making. Smart planning and decision making are critical to achieving predefined goals. As they are trained on vast amounts of world knowledge and human examples, LMs are known to have already absorbed rich commonsense knowledge that makes it possible to propose reasonable plans conditioned on the problem setting and environmental states. Our proposed ToT approach extends existing planning formulations by considering multiple potentially feasible plans simultaneously at each problem-solving step and proceeding with the most promising ones. The combination of thought sampling and value feedback organically integrates planning and decision-making mechanisms, enabling effective search inside a solution tree. In contrast, traditional decision-making procedures usually require training dedicated reward and policy models, as in reinforcement learning (for example CHAI), whereas we use the LM itself to provide the value estimates for decision making. RAP is a concurrent work that treats language model reasoning as planning with its internal world model, and proposes an MCTS-based method similar to ToT. However, its tasks are simpler than ours, and its framework lacks the modularity to incorporate different tree search algorithms.


Self-reflection. Using LLMs to assess the viability of their own predictions is becoming an increasingly important procedure in problem solving. [28, 20, 24] introduced the "self-reflection" mechanism, in which LMs provide feedback on their generation candidates. [4] improves LMs' code generation accuracy by injecting feedback messages generated by the LM itself based on its code execution results. Similarly, [17] also introduces "critic" or review steps over the actions and states, deciding the next action to take in solving computer operation tasks. Another recent work very relevant to ours is "self-eval guided decoding". Similar to our method, self-eval decoding also follows a tree-search procedure with leaves sampled from stochastic beam search decoding, which are then evaluated by the LLM itself with carefully prepared self-eval prompts. Their approach, however, uses the PAL formulation, which represents thoughts as code; this makes it difficult to tackle challenging tasks like the creative writing we consider in this paper. Our Tree-of-Thought formulation is thus more versatile and handles challenging tasks on which GPT-4 only achieves very low accuracy with standard prompts.


Program-guided LLM generation. Our proposal is also related to recent advancements that organize LM’s behavior with systematic procedures or symbolic program guidance. For example, Schlag et al. embeds LMs in an algorithmic search procedure to help solve problems like question answering step-by-step, in which the search trees are expanded by relevant paragraphs that might provide answers. This approach however differs from ours in that trees are expanded by sampling external paragraphs instead of the LM’s own thoughts, and there is no reflection or voting steps. Another approach, LLM+P, goes one step further and delegates the actual planning process to a classical planner.


Classical search methods. Last but not least, our approach can be treated as a modern rendition of classical search methods for problem solving. For example, it can be considered a heuristic search algorithm like A*, in which the heuristic at each search node is provided by the LM's self-assessment. From this perspective, our method is also related to NeuroLogic A*esque decoding, which is inspired by A* search but introduces lookahead heuristics that are efficient for LMs to improve beam-search or top-k sampling decoding. That method, however, is constrained to sentence generation tasks, whereas our framework is designed for complex, multi-step problem solving guarded by value feedback.


Discussion

Limitations and future directions. Deliberate search such as ToT might not be necessary for many existing tasks that GPT-4 already excels at (see Appendix B.1), and as an initial step this work only explores three relatively simple tasks that challenge GPT-4 (see Appendix B.2 for some GPT-3.5 experiment results) and calls for better search and planning abilities to be incorporated with LMs. However, as we begin to deploy LMs for more real-world decision making applications (e.g. coding, data analysis, robotics, etc.), more complex tasks could emerge and present new opportunities to study these research questions. Also, search methods like ToT require more resources (e.g. GPT-4 API cost) than sampling methods in order to improve task performance, but the modular flexibility of ToT allows users to customize such performance-cost tradeoffs, and ongoing open-source efforts should readily reduce such costs in the near future. More details about cost and efficiency are in Appendix B.3. Lastly, this work focuses on using an off-the-shelf LM, and fine-tuning LMs using ToT-style high-level counterfactual decision making (e.g. deliberating over potential choices for the next paragraph, instead of predicting the next token) might present opportunities to enhance the problem-solving capabilities of LMs.


Conclusion. The associative “System 1” of LMs can be beneficially augmented by a “System 2” based on searching a tree of possible paths to the solution to a problem. The Tree of Thoughts framework provides a way to translate classical insights about problem-solving into actionable methods for contemporary LMs. At the same time, LMs address a weakness of these classical methods, providing a way to solve complex problems that are not easily formalized, such as creative writing. We see this intersection of LMs with classical approaches to AI as an exciting direction.


Broader Impact. ToT is a framework that empowers LMs to more autonomously and intelligently make decisions and solve problems. While current tasks are limited to reasoning and search problems, future applications involving interaction with external environments or humans could bring potential danger, e.g. facilitating harmful uses of LMs. On the other hand, ToT also improves the interpretability of model decisions and the opportunity for human alignment, as the resulting representations are readable, high-level language reasoning instead of implicit, low-level token values.
