Abstract

Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller specialized models excel. In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds. We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API. We incorporate a range of tools, including a calculator, a Q&A system, a search engine, a translation system, and a calendar. Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.

Introduction

Large language models achieve impressive zero and few-shot results on a variety of natural language processing tasks. However, these models have several inherent limitations that can at best be partially addressed by further scaling. These limitations include an inability to access up-to-date information on recent events and the related tendency to hallucinate facts, difficulties in understanding low-resource languages, a lack of mathematical skills to perform precise calculations and an unawareness of the progression of time.

A simple way to overcome the limitations of today's language models is to give them the ability to use external tools such as search engines, calculators, or calendars. However, existing approaches either rely on large amounts of human annotations or limit tool use to task-specific settings only, hindering a more widespread adoption of tool use in LMs. Therefore, we propose Toolformer, a model that learns to use tools in a novel way, which fulfills the following desiderata:

  • Tool use should be learned in a self-supervised way without large amounts of human annotations. This is important not only because of the costs associated with such annotations, but also because what humans find useful may be different from what a model finds useful.
  • The LM should not lose any of its generality and should be able to decide for itself when and how to use which tool. In contrast to existing approaches, this enables a much more comprehensive use of tools that is not tied to specific tasks.

Our approach for achieving these goals is based on the recent idea of using large LMs with in-context learning to generate entire datasets from scratch: Given just a handful of human-written examples of how an API can be used, we let a LM annotate a huge language modeling dataset with potential API calls. We then use a self-supervised loss to determine which of these API calls actually help the model in predicting future tokens. Finally, we finetune the LM itself on the API calls that it considers useful. As illustrated in Figure 1, through this simple approach, LMs can learn to control a variety of tools, and to choose for themselves which tool to use when and how.

As our approach is agnostic of the dataset being used, we can apply it to the exact same dataset that was used to pretrain a model in the first place. This ensures that the model does not lose any of its generality and language modeling abilities. We conduct experiments on a variety of different downstream tasks, demonstrating that after learning to use tools, Toolformer, which is based on a pretrained GPT-J model with 6.7B parameters, achieves much stronger zero-shot results, clearly outperforming a much larger GPT-3 model and several other baselines on various tasks.

Approach

Our aim is to equip a language model M with the ability to use different tools through API calls. We represent each API call as a tuple c = (a_c, i_c), where a_c is the name of the API and i_c is the corresponding input. Given an API call c with a corresponding result r, we denote the linearized sequences of the API call not including and including its result, respectively, as:

e(c) = <API> a_c(i_c) </API>
e(c, r) = <API> a_c(i_c) → r </API>

where “<API>”, “</API>” and “→” are special tokens. Some examples of linearized API calls inserted into text sequences are shown in Figure 1.

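As a concrete sketch, the linearization e(c) and e(c, r) can be written in a few lines of Python. The function name is ours, and the literal token strings are placeholders for however the special tokens are realized in the model's vocabulary:

```python
from typing import Optional

# Placeholder spellings for the special tokens "<API>", "</API>" and "→".
API_START, API_END, ARROW = "<API>", "</API>", "→"

def linearize(api_name: str, api_input: str, result: Optional[str] = None) -> str:
    """Return e(c) if no result is given, else e(c, r)."""
    call = f"{api_name}({api_input})"
    if result is None:
        return f"{API_START}{call}{API_END}"
    return f"{API_START}{call} {ARROW} {result}{API_END}"
```

For example, `linearize("Calculator", "400 / 1400", "0.29")` yields the sequence `<API>Calculator(400 / 1400) → 0.29</API>`.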
Given a dataset C = {x_1, …, x_|C|} of plain texts, we first convert this dataset into a dataset C* augmented with API calls. This is done in three steps, illustrated in Figure 2: First, we exploit the in-context learning ability of M to sample a large number of API calls. We then execute them and finally check whether the obtained responses are helpful for predicting future tokens; this is used as a filtering criterion. After filtering, we merge API calls for different tools, resulting in the augmented dataset C*, and finetune M itself on this dataset. Each step is described in more detail below.

Sampling API Calls For each API, we write a prompt P(x) that encourages the LM to annotate an example x = x_1, …, x_n with API calls. An example of such a prompt for a question answering tool is shown in Figure 3. Let p_M(z_{n+1} | z_1, …, z_n) be the probability that M assigns to token z_{n+1} as a continuation for the sequence z_1, …, z_n. We first sample up to k candidate positions for doing API calls by computing, for each i ∈ {1, …, n}, the probability p_i = p_M(<API> | P(x), x_{1:i−1}) that M assigns to starting an API call at position i. Given a sampling threshold τ_s, we keep all positions I = {i | p_i > τ_s}; if there are more than k such positions, we only keep the top k. For each position i ∈ I, we then obtain up to m API calls c_i^1, …, c_i^m by sampling from M given the sequence [P(x), x_1, …, x_{i−1}, <API>] as a prefix and </API> as an end-of-sequence token.

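The position-sampling step above can be sketched as follows. The helper name and default values are ours; `probs[i]` stands in for the model probability p_i = p_M(<API> | P(x), x_{1:i−1}):

```python
def sample_api_positions(probs, tau_s=0.05, k=5):
    """Keep every position i with p_i > tau_s; if more than k positions
    qualify, keep only the k most probable ones, as described above."""
    candidates = [i for i, p in enumerate(probs) if p > tau_s]
    if len(candidates) > k:
        candidates = sorted(candidates, key=lambda i: probs[i], reverse=True)[:k]
    return sorted(candidates)
```
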
Executing API Calls As a next step, we execute all API calls generated by M. How this is done depends entirely on the API itself – for example, it can involve calling another neural network, executing a Python script or using a retrieval system to perform search over a large corpus. The response for each API call c_i needs to be a single text sequence r_i.

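One tool-agnostic way to organize this step is a registry mapping API names to plain text-to-text functions, since every response must be a single text sequence. The sketch below is ours, with `eval` as a stand-in for a real calculator backend:

```python
import re

# Each tool is a function from a text input to a single text response r_i.
TOOLS = {
    "Calculator": lambda expr: str(round(eval(expr, {"__builtins__": {}}), 2)),
}

def execute(call: str) -> str:
    """Parse a linearized call of the form Name(input) and dispatch it."""
    m = re.match(r"(\w+)\((.*)\)$", call, flags=re.DOTALL)
    if m is None or m.group(1) not in TOOLS:
        raise ValueError(f"unknown or malformed API call: {call!r}")
    return TOOLS[m.group(1)](m.group(2))
```

Other tools would be registered analogously, each hiding whatever backend (neural network, script, retriever) it needs behind the same text-in, text-out interface.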
Filtering API Calls Let i be the position of the API call c_i in the sequence x = x_1, …, x_n, and let r_i be the response from the API. Further, given a sequence (w_i | i ∈ ℕ) of weights, let

L_i(z) = −∑_{j=i}^{n} w_{j−i} · log p_M(x_j | z, x_{1:j−1})

be the weighted cross entropy loss for M over the tokens x_i, …, x_n if the model is prefixed with some text sequence z. We compare two different instantiations of this loss:

L_i^+ = L_i(e(c_i, r_i))
L_i^− = min(L_i(ε), L_i(e(c_i, ε)))

where ε denotes an empty sequence. The former is the weighted loss over the tokens x_i, …, x_n if the API call and its result are given to M as a prefix; the latter is the minimum of the losses obtained from (i) doing no API call at all and (ii) doing an API call, but not providing the response. Intuitively, an API call is helpful to M if providing it with both the input and the output of this call makes it easier for the model to predict future tokens, compared to not receiving the API call at all, or receiving only its input. Given a filtering threshold τ_f, we thus only keep API calls for which L_i^− − L_i^+ ≥ τ_f holds, i.e., adding the API call and its result reduces the loss by at least τ_f, compared to not doing any API call or obtaining no result from it.

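Given per-token log-probabilities of the continuation under the three prefixes, the filtering criterion reduces to a few lines. The helper names are ours, the weight sequence w_{j−i} is whatever weighting scheme is chosen, and all log-probabilities would come from scoring the model:

```python
def weighted_loss(logprobs, weights):
    """L_i(z): negative weighted sum of log p_M(x_j | z, x_{1:j-1}),
    with logprobs[t] the log-probability of token x_{i+t} and weights[t] = w_t."""
    return -sum(w * lp for w, lp in zip(weights, logprobs))

def keep_api_call(lp_with_result, lp_no_call, lp_call_no_result, weights, tau_f=1.0):
    """Keep the API call iff L_i^- - L_i^+ >= tau_f."""
    L_plus = weighted_loss(lp_with_result, weights)
    L_minus = min(weighted_loss(lp_no_call, weights),
                  weighted_loss(lp_call_no_result, weights))
    return L_minus - L_plus >= tau_f
```
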
Model Finetuning After sampling and filtering calls for all APIs, we finally merge the remaining API calls and interleave them with the original inputs. That is, for an input text x = x_1, …, x_n with a corresponding API call and result (c_i, r_i) at position i, we construct the new sequence x* = x_{1:i−1}, e(c_i, r_i), x_{i:n}; we proceed analogously for texts with multiple API calls. Doing this for all x ∈ C results in the new dataset C* augmented with API calls. We use C* to finetune M, using a standard language modeling objective. Crucially, apart from inserted API calls, C* contains the exact same texts as C, the original dataset. As a consequence, finetuning M on C* exposes it to the same content as finetuning on C. Moreover, as API calls are inserted in exactly those positions and with exactly those inputs that help M predict future tokens, finetuning on C* enables the language model to decide when and how to use which tool, based purely on its own feedback.

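Constructing x* = x_{1:i−1}, e(c_i, r_i), x_{i:n} from a tokenized text and its surviving calls is then a matter of splicing. The function name is ours, and the example call string in the usage note is illustrative, not from the dataset:

```python
def insert_api_calls(tokens, calls):
    """Build x* by splicing linearized calls into the token list. `calls`
    maps a 1-indexed position i (as in the text above) to the already-
    linearized string e(c_i, r_i); inserting from the rightmost position
    first keeps the earlier indices valid."""
    out = list(tokens)
    for i, lin in sorted(calls.items(), reverse=True):
        out[i - 1:i - 1] = [lin]
    return out
```

For instance, `insert_api_calls(["a", "b", "c"], {1: "X", 3: "Y"})` places `"X"` before `"a"` and `"Y"` before `"c"`.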
Inference When generating text with M after finetuning with our approach, we perform regular decoding until M produces the “→” token, indicating that it next expects the response for an API call. At this point, we interrupt the decoding process, call the appropriate API to get a response, and continue the decoding process after inserting both the response and the </API> token.

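The modified decoding loop can be sketched with stand-ins for the model and the APIs. Here `model_step` is any callable mapping the current sequence to the next token (or None at end of generation), `execute` maps a call string to its text response, and word-level strings stand in for real subword tokens; none of these names come from the paper:

```python
def decode_with_tools(model_step, execute, prompt_tokens):
    """Decode normally; when the model emits "→", pause, run the pending
    API call, append its response followed by "</API>", then resume."""
    tokens = list(prompt_tokens)
    while True:
        tok = model_step(tokens)
        if tok is None:          # model signals end of generation
            return tokens
        tokens.append(tok)
        if tok == "→":
            # recover the call text between the most recent "<API>" and "→"
            start = len(tokens) - 1 - tokens[::-1].index("<API>")
            call = " ".join(tokens[start + 1:-1])
            tokens += [execute(call), "</API>"]
```

Note that the response tokens are supplied by the loop, not sampled from the model, which is exactly what makes the tool's output authoritative during generation.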
Tools

We explore various tools to address different shortcomings of LMs. The only constraints we impose are that (i) their inputs and outputs can be represented as texts, and (ii) we can obtain a few demonstrations of their intended use. Concretely, we explore a question answering system, a Wikipedia search engine, a calculator, a calendar, and a machine translation system. Examples for the APIs associated with each of these tools are shown in Table 1. We briefly discuss all tools below.

Question Answering Our first tool is a question answering system based on another LM that can answer simple factoid questions. Specifically, we use Atlas, a retrieval-augmented LM finetuned on Natural Questions.

Calculator As a second tool, we use a calculator that can perform simple numeric calculations; we only support the four basic arithmetic operations. Results are always rounded to two decimal places.

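A minimal calculator in this spirit, supporting the four basic operations with standard precedence for * and / (no parentheses or negative literals) and rounding to two decimal places, might look like the sketch below; it is illustrative, not the paper's implementation:

```python
import re

def calculate(expr: str) -> str:
    """Evaluate an expression built from numbers and + - * /, applying
    * and / before + and -, and round the result to two decimal places."""
    tokens = re.findall(r"\d+(?:\.\d+)?|[+\-*/]", expr)
    # First pass: fold * and / into the number to their left.
    nums, ops = [float(tokens[0])], []
    for j in range(1, len(tokens), 2):
        op, val = tokens[j], float(tokens[j + 1])
        if op == "*":
            nums[-1] *= val
        elif op == "/":
            nums[-1] /= val
        else:
            ops.append(op)
            nums.append(val)
    # Second pass: + and - left to right.
    total = nums[0]
    for op, val in zip(ops, nums[1:]):
        total = total + val if op == "+" else total - val
    return str(round(total, 2))
```
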
Wikipedia Search Our third tool is a search engine that, given a search term, returns short text snippets from Wikipedia. Compared to our question answering tool, this search enables a model to get more comprehensive information on a subject, but requires it to extract the relevant parts by itself. As our search engine, we use a BM25 retriever that indexes the Wikipedia dump from KILT.

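To illustrate the kind of retriever involved, here is a toy in-memory BM25 (Okapi weighting) ranker over a list of text snippets. The function name and parameters are ours, and this is emphatically not the indexing setup used over the KILT dump:

```python
import math
from collections import Counter

def bm25_search(query, docs, k1=1.5, b=0.75):
    """Rank `docs` against `query` with BM25 and return them best-first."""
    tokenized = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter()
    for d in tokenized:
        df.update(set(d))  # document frequency of each term

    def score(doc_tokens):
        tf = Counter(doc_tokens)
        s = 0.0
        for term in query.lower().split():
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            f = tf[term]
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_tokens) / avgdl))
        return s

    order = sorted(range(N), key=lambda i: score(tokenized[i]), reverse=True)
    return [docs[i] for i in order]
```
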
Machine Translation System Our fourth tool is a machine translation system based on a LM that can translate a phrase from any language into English. More concretely, we use the 600M parameter NLLB as our multilingual machine translation model that works for 200 languages (including low-resource ones). The source language is automatically detected using the fastText classifier, while the target language is always set to English.

Calendar Our final tool is a calendar API that, when queried, returns the current date without taking any input. This provides temporal context for predictions that require some awareness of time.

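The calendar tool is simple enough to sketch in full; the exact response phrasing below is an assumption, as the paper does not specify it:

```python
from datetime import date

def calendar_tool(_input: str = "") -> str:
    """Return the current date as a single text sequence, ignoring any input."""
    return date.today().strftime("Today is %A, %B %d, %Y.")
```
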
Related Work

Language Model Pretraining There are various approaches that augment LMs with some form of additional textual information during pretraining, including various forms of metadata, HTML tags, Wikipedia markup, or related texts obtained from an information retrieval system. For all of these approaches, additional information is always provided, regardless of whether it is helpful or not. In contrast, Toolformer learns for itself to explicitly ask for the right information.

Tool Use Several approaches aim to equip LMs with the ability to use external tools such as search engines, web browsers, calculators, translation systems and Python interpreters. The way these models learn to use tools can roughly be divided into two approaches: Either they rely on large amounts of human supervision, or they work by prompting the language model in a few-shot setup tailored towards a specific task where it is known a priori which tools need to be used. In contrast, the self-supervised nature of Toolformer enables it to learn how and when to use tools without requiring a specific prompt that shows task-specific examples of how a tool could be used. Perhaps most closely related to our work is TALM, an approach that uses a similar self-supervised objective for teaching a model to use a calculator and a search engine, but explores this only in settings where a model is finetuned for downstream tasks.

Bootstrapping The idea of using self-training and bootstrapping techniques to improve models has been investigated in various contexts, ranging from word sense disambiguation, relation extraction, parsing, sequence generation, few-shot text classification and retrieval to reasoning. In a similar spirit, Toolformer is trained on its own predictions after applying a perplexity-based filtering step.

Limitations

While our approach enables LMs to learn how to use a variety of tools in a self-supervised way, there are some clear limitations to what can be achieved with our method in its current form. One such limitation is the inability of Toolformer to use tools in a chain (i.e., using the output of one tool as an input for another tool). This is due to the fact that API calls for each tool are generated independently; as a consequence, there are no examples of chained tool use in the finetuning dataset, since this would necessitate multiple API calls per example. Our current approach also does not allow the LM to use a tool in an interactive way. Especially for tools such as search engines, which could potentially return hundreds of different results, enabling an LM to browse through these results or to refine its search query, in a similar spirit to Nakano et al. (2021), can be crucial for certain applications. Beyond this, we found models trained with Toolformer to often be sensitive to the exact wording of their input when deciding whether or not to call an API; this is perhaps unsurprising given that LMs are known to be very sensitive to the prompt they are provided with in both zero- and few-shot settings. Depending on the tool, our method is also very sample-inefficient; for example, processing more than a million documents results in only a few thousand examples of useful calls to the calculator API. A potential solution to this problem might be to iteratively apply our approach, similar to how this is done in related bootstrapping approaches. Finally, when deciding whether or not to make an API call, Toolformer currently does not take into account the tool-dependent computational cost incurred by making an API call.

Conclusion

We have introduced Toolformer, a LM that learns in a self-supervised way how to use different tools such as search engines, calculators, and translation systems via simple API calls. This is done by finetuning on sampled API calls that are filtered based on whether they reduce perplexity on future tokens. Toolformer considerably improves zero-shot performance of a 6.7B parameter GPT-J model, enabling it to even outperform a much larger GPT-3 model on a range of different downstream tasks.
