

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Abstract

We explore how generating a chain of thought—a series of intermediate reasoning steps—significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain-of-thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting.


Experiments on three large language models show that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model (PaLM 540B) with just eight chain-of-thought exemplars achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.


Figure 1: Chain-of-thought prompting enables large language models to tackle complex arithmetic, commonsense, and symbolic reasoning tasks. Chain-of-thought reasoning processes are highlighted.

Introduction

The NLP landscape has recently been revolutionized by language models. Scaling up the size of language models has been shown to confer a range of benefits, such as improved performance and sample efficiency. However, scaling up model size alone has not proved sufficient for achieving high performance on challenging tasks such as arithmetic, commonsense, and symbolic reasoning.


This work explores how the reasoning ability of large language models can be unlocked by a simple method motivated by two ideas. First, techniques for arithmetic reasoning can benefit from generating natural language rationales that lead to the final answer. Prior work has given models the ability to generate natural language intermediate steps by training from scratch or finetuning a pretrained model, in addition to neuro-symbolic methods that use formal languages instead of natural language. Second, large language models offer the exciting prospect of in-context few-shot learning via prompting. That is, instead of finetuning a separate language model checkpoint for each new task, one can simply "prompt" the model with a few input--output exemplars demonstrating the task. Remarkably, this has been successful for a range of simple question-answering tasks.


Both of the above ideas, however, have key limitations. For rationale-augmented training and finetuning methods, it is costly to create a large set of high quality rationales, which is much more complicated than simple input--output pairs used in normal machine learning. For the traditional few-shot prompting method, it works poorly on tasks that require reasoning abilities, and often does not improve substantially with increasing language model scale. In this paper, we combine the strengths of these two ideas in a way that avoids their limitations. Specifically, we explore the ability of language models to perform few-shot prompting for reasoning tasks, given a prompt that consists of triples: <input, chain of thought, output>. A chain of thought is a series of intermediate natural language reasoning steps that lead to the final output, and we refer to this approach as chain-of-thought prompting. An example prompt is shown in Figure 1.
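Concretely, the <input, chain of thought, output> triples described above can be assembled into a few-shot prompt with plain string formatting. The sketch below is illustrative rather than the paper's exact prompt: the exemplar text, the `Q:`/`A:` layout, and the helper name `build_cot_prompt` are all assumptions, and the resulting string would be passed to whatever text-completion interface is available.

```python
# Illustrative sketch of chain-of-thought prompt construction from
# <input, chain of thought, output> triples. The exemplar below is
# made up for illustration; it is not the paper's exact prompt.

EXEMPLARS = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
        "6 tennis balls. 5 + 6 = 11.",
        "The answer is 11.",
    ),
]

def build_cot_prompt(question: str) -> str:
    """Join exemplars so each chain of thought precedes its final answer,
    then append the new question with an open-ended answer slot."""
    parts = [f"Q: {q}\nA: {chain} {answer}" for q, chain, answer in EXEMPLARS]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

prompt = build_cot_prompt(
    "Jane has 12 flowers. She gives 2 to her mom and 3 to her dad. "
    "How many does she have left?"
)
print(prompt)
```

Because the prompt ends at an open `A:`, the model is expected to continue with its own intermediate reasoning steps before stating the final answer, mirroring the exemplar's format.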


We present empirical evaluations on arithmetic, commonsense, and symbolic reasoning benchmarks, showing that chain-of-thought prompting outperforms standard prompting, sometimes to a striking degree. Figure 2 illustrates one such result---on the GSM8K benchmark of math word problems, chain-of-thought prompting with PaLM 540B outperforms standard prompting by a large margin and achieves new state-of-the-art performance. A prompting-only approach is important because it does not require a large training dataset and because a single model checkpoint can perform many tasks without loss of generality. This work underscores how large language models can learn from a few examples with natural language data about the task (cf. automatically learning the patterns underlying inputs and outputs via a large training dataset).


Figure 2: PaLM 540B with chain-of-thought prompting achieves new state-of-the-art performance on the GSM8K benchmark of math word problems, compared against finetuned GPT-3 and the prior best result.

Chain-of-Thought Prompting

Consider one's own thought process when solving a complicated reasoning task such as a multi-step math word problem. It is typical to decompose the problem into intermediate steps and solve each before giving the final answer: "After Jane gives 2 flowers to her mom she has 10 ... then after she gives 3 to her dad she will have 7 ... so the answer is 7." The goal of this paper is to endow language models with the ability to generate a similar chain of thought---a coherent series of intermediate reasoning steps that lead to the final answer for a problem. We will show that sufficiently large language models can generate chains of thought if demonstrations of chain-of-thought reasoning are provided in the exemplars for few-shot prompting.


Figure 1 shows an example of a model producing a chain of thought to solve a math word problem that it would otherwise have gotten incorrect. The chain of thought in this case resembles a solution and can be interpreted as one, but we still opt to call it a chain of thought to better capture the idea that it mimics a step-by-step thought process for arriving at the answer (and also, solutions/explanations typically come after the final answer).


Chain-of-thought prompting has several attractive properties as an approach for facilitating reasoning in language models.

  1. First, chain of thought, in principle, allows models to decompose multi-step problems into intermediate steps, which means that additional computation can be allocated to problems that require more reasoning steps.
  2. Second, a chain of thought provides an interpretable window into the behavior of the model, suggesting how it might have arrived at a particular answer and providing opportunities to debug where the reasoning path went wrong (although fully characterizing a model's computations that support an answer remains an open question).
  3. Third, chain-of-thought reasoning can be used for tasks such as math word problems, commonsense reasoning, and symbolic manipulation, and is potentially applicable (at least in principle) to any task that humans can solve via language.
  4. Finally, chain-of-thought reasoning can be readily elicited in sufficiently large off-the-shelf language models simply by including examples of chain of thought sequences into the exemplars of few-shot prompting.

In empirical experiments, we will observe the utility of chain-of-thought prompting for arithmetic reasoning (Section 3), commonsense reasoning (Section 4), and symbolic reasoning (Section 5).
