Abstract
Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller specialized models excel. In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds. We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API. We incorporate a range of tools, including a calculator, a Q&A system, a search engine, a translation system, and a calendar. Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.
Introduction
Large language models achieve impressive zero and few-shot results on a variety of natural language processing tasks. However, these models have several inherent limitations that can at best be partially addressed by further scaling. These limitations include an inability to access up-to-date information on recent events and the related tendency to hallucinate facts, difficulties in understanding low-resource languages, a lack of mathematical skills to perform precise calculations and an unawareness of the progression of time.
A simple way to overcome the limitations of today's language models is to give them the ability to use external tools such as search engines, calculators, or calendars. However, existing approaches either rely on large amounts of human annotations or limit tool use to task-specific settings only, hindering a more widespread adoption of tool use in LMs. Therefore, we propose Toolformer, a model that learns to use tools in a novel way, which fulfills the following desiderata:
- Tool use should be learned in a self-supervised way without large amounts of human annotations. This is important not only because of the costs associated with such annotations, but also because what humans find useful may be different from what a model finds useful.
- The LM should not lose any of its generality and should be able to decide for itself when and how to use which tool. In contrast to existing approaches, this enables a much more comprehensive use of tools that is not tied to specific tasks.
Our approach for achieving these goals is based on the recent idea of using large LMs with in-context learning to generate entire datasets from scratch: Given just a handful of human-written examples of how an API can be used, we let a LM annotate a huge language modeling dataset with potential API calls. We then use a self-supervised loss to determine which of these API calls actually help the model in predicting future tokens. Finally, we finetune the LM itself on the API calls that it considers useful. As illustrated in Figure 1, through this simple approach, LMs can learn to control a variety of tools, and to choose for themselves which tool to use when and how.
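The three-step recipe just described (sample candidate API calls, execute them, keep only the ones that help) can be sketched end to end. This is an illustrative sketch, not the paper's implementation: `sample_calls`, `execute`, and `loss_reduction` are hypothetical stand-ins for the components detailed in the Approach section.

```python
def build_augmented_dataset(texts, sample_calls, execute, loss_reduction, tau_f):
    """Annotate texts with API calls, keeping only those calls whose executed
    result reduces the model's loss on future tokens by at least tau_f."""
    augmented = []
    for text in texts:
        kept = []
        for pos, call in sample_calls(text):   # step 1: sample candidate calls
            result = execute(call)             # step 2: run the API
            # step 3: keep the call only if it helps predict future tokens
            if loss_reduction(text, pos, call, result) >= tau_f:
                kept.append((pos, call, result))
        augmented.append((text, kept))
    return augmented
```

The model is then finetuned on the texts with the kept calls spliced in, so the filtering threshold tau_f directly controls how aggressively unhelpful calls are discarded.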
As our approach is agnostic of the dataset being used, we can apply it to the exact same dataset that was used to pretrain a model in the first place. This ensures that the model does not lose any of its generality and language modeling abilities. We conduct experiments on a variety of different downstream tasks, demonstrating that after learning to use tools, Toolformer, which is based on a pretrained GPT-J model with 6.7B parameters, achieves much stronger zero-shot results, clearly outperforming a much larger GPT-3 model and several other baselines on various tasks.
Approach
Our aim is to equip a language model M with the ability to use different tools by means of API calls. We require that inputs and outputs for each API can be represented as text sequences; this allows API calls to be inserted seamlessly into any text, with special tokens marking the start and end of each call. We represent each API call as a tuple c = (a_c, i_c), where a_c is the name of the API and i_c is the corresponding input. Given an API call c with result r, we write

e(c) = <API> a_c(i_c) </API>
e(c, r) = <API> a_c(i_c) → r </API>

for the linearized sequences of the API call excluding and including its result, respectively,
where “<API>”, “</API>” and “→” are special tokens. Some examples of linearized API calls inserted into text sequences are shown in Figure 1.
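As a sketch, linearizing an API call into these special-token sequences might look as follows (function and variable names are our own):

```python
API_START, API_END, ARROW = "<API>", "</API>", "→"

def linearize(api_name, api_input, result=None):
    """Serialize an API call c = (api_name, api_input) into plain text.
    With a result this yields e(c, r); without one it yields e(c)."""
    call = f"{API_START} {api_name}({api_input})"
    if result is not None:
        call += f" {ARROW} {result}"
    return call + f" {API_END}"
```

Because both forms are ordinary text, they can be spliced into any training sequence without changing the model architecture.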
Given a dataset C = {x¹, …, x^|C|} of plain texts, we first convert it into a dataset C* augmented with API calls. This is done in three steps: first, we exploit the in-context learning ability of M to sample a large number of potential API calls; we then execute these API calls and check whether the obtained responses are helpful for predicting future tokens, which serves as our filtering criterion; finally, we merge the API calls that remain for the different tools, resulting in the augmented dataset C*, on which we finetune M itself.
Sampling API Calls For each API, we write a prompt P(x) that encourages the LM to annotate an example x = x₁, …, xₙ with API calls. Let p_i = p_M(<API> | P(x), x_{1:i−1}) be the probability that M assigns to starting an API call at position i. We keep up to k candidate positions for which p_i exceeds a sampling threshold τ_s; for each of these positions, we then obtain up to m API calls by sampling from M given P(x) and the prefix x_{1:i−1}, stopping when the </API> token is produced.
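A sketch of how candidate positions might be selected: given the probability the model assigns to opening an <API> call at each position, we keep up to k positions above a threshold (names and values here are illustrative; sampling the calls themselves would then be done by decoding from the model at each kept position).

```python
def candidate_positions(api_start_probs, tau_s, k):
    """Given p_i, the model's probability of emitting <API> at position i,
    keep up to k positions whose probability exceeds the threshold tau_s."""
    above = [(p, i) for i, p in enumerate(api_start_probs) if p > tau_s]
    above.sort(reverse=True)                 # most promising positions first
    return sorted(i for _, i in above[:k])   # return kept positions in text order
```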
Executing API Calls As a next step, we execute all API calls generated by M to obtain the corresponding results. How this is done depends entirely on the API itself: it may involve calling another neural network, executing a Python script, or querying a retrieval system over a large corpus. The response r for each API call needs to be a single text sequence.
Filtering API Calls Let i be the position of an API call c in the sequence x = x₁, …, xₙ, and let r be the response from the API. Further, for a sequence z used as a prefix, let

L_i(z) = − Σ_{j=i..n} w_{j−i} · log p_M(x_j | z, x_{1:j−1})

be the weighted cross entropy loss for M over the tokens x_i, …, x_n if the model is prefixed with z, where (w_t | t ∈ ℕ) is a sequence of weights that puts more emphasis on tokens close to the API call. We compare two instantiations of this loss: L_i⁺ = L_i(e(c, r)), the loss if the API call and its result are given to M as a prefix, and L_i⁻ = min(L_i(ε), L_i(e(c, ε))), the minimum of the loss with no API call at all and the loss with the API call but without its result, where ε denotes the empty sequence. An API call is kept only if providing both the call and its result reduces the loss by at least a filtering threshold τ_f, i.e., if L_i⁻ − L_i⁺ ≥ τ_f.
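The filtering criterion can be written down directly. In the sketch below, the per-token log-probabilities are assumed to be given by the model, and the helper names are our own:

```python
def weighted_ce(token_logprobs, weights):
    """Weighted cross entropy over the tokens following an API call:
    token_logprobs[t] is the model's log-probability of the t-th such
    token, and weights[t] its weight (larger near the call)."""
    return -sum(w * lp for w, lp in zip(weights, token_logprobs))

def keep_call(loss_with_call_and_result, loss_without_call,
              loss_with_call_no_result, tau_f):
    """Keep an API call iff providing the call plus its result lowers the
    loss by at least tau_f relative to the best alternative: no call at
    all, or the call without its result."""
    l_plus = loss_with_call_and_result
    l_minus = min(loss_without_call, loss_with_call_no_result)
    return l_minus - l_plus >= tau_f
```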
Model Finetuning After sampling and filtering calls for all APIs, we finally merge the remaining API calls and interleave them with the original inputs. That is, for an input text x = x₁, …, xₙ with an API call and result (c, r) at position i, we construct the new sequence x* = x_{1:i−1}, e(c, r), x_{i:n}, and proceed analogously for texts with multiple API calls. Applying this to all texts yields the augmented dataset C*, on which we finetune M with a standard language modeling objective. Importantly, apart from the inserted API calls, C* contains exactly the same texts as the original dataset.
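Splicing the filtered calls back into the original token sequence to build the augmented training example can be sketched as follows (an illustrative helper, not the paper's code):

```python
def interleave_api_calls(tokens, kept_calls):
    """Splice each filtered (position, linearized_call) pair into the
    original token sequence, producing the augmented training example."""
    out = list(tokens)
    # insert from the rightmost position first so earlier indices stay valid
    for pos, call in sorted(kept_calls, reverse=True):
        out[pos:pos] = [call]
    return out
```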
Inference When generating text with M after finetuning with our approach, we perform regular decoding until M produces the “→” token, indicating that it next expects the response for an API call. At this point, we interrupt decoding, call the appropriate API to obtain a response, and continue decoding after inserting both the response and the “</API>” token.
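The interrupted-decoding scheme can be sketched with a toy model interface; `model_step` and `call_api` are hypothetical stand-ins for the finetuned LM and the tool backend:

```python
def generate(model_step, call_api, prompt, max_tokens=50):
    """Decode normally; when the model emits "→", pause, execute the
    pending API call, append its result and "</API>", then resume."""
    text = prompt
    for _ in range(max_tokens):
        token = model_step(text)        # next token from the finetuned LM
        if token is None:               # end of sequence
            break
        text += token
        if token == "→":
            result = call_api(text)     # run the API call preceding "→"
            text += f" {result} </API>"
    return text
```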
Tools
We explore various tools to address different shortcomings of LMs. The only constraints we impose are that (i) their inputs and outputs can be represented as texts, and (ii) we can obtain a few demonstrations of their intended use. Concretely, we explore a question answering system, a Wikipedia search engine, a calculator, a calendar, and a machine translation system. Examples for the APIs associated with each of these tools are shown in Table 1. We briefly discuss all tools below.
Question Answering Our first tool is a question answering system based on another LM that can answer simple factoid questions. Specifically, we use Atlas, a retrieval-augmented LM finetuned on Natural Questions.
Calculator As a second tool, we use a calculator that can perform simple numeric calculations; we only support the four basic arithmetic operations. Results are always rounded to two decimal places.
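A minimal sketch of such a tool, matching the stated constraints (the four basic operations, results rounded to two decimal places); the actual implementation may parse expressions differently, and the parsing here handles only a single binary expression:

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def calculator(expression):
    """Evaluate one binary arithmetic expression, e.g. "400 / 1400",
    rounding the result to two decimal places."""
    match = re.fullmatch(
        r"\s*(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)\s*", expression)
    if match is None:
        return None  # unsupported expression: no result is inserted
    left, op, right = match.groups()
    try:
        return round(OPS[op](float(left), float(right)), 2)
    except ZeroDivisionError:
        return None  # e.g. division by zero: the call is simply dropped
```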
Wikipedia Search Our third tool is a search engine that, given a search term, returns short text snippets from Wikipedia. Compared to our question answering tool, this search enables a model to get more comprehensive information on a subject, but requires it to extract the relevant parts by itself. As our search engine, we use a BM25 retriever that indexes the Wikipedia dump from KILT.
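To illustrate the ranking function behind this tool, here is a textbook Okapi BM25 scorer over pre-tokenized documents (the real system indexes the full KILT Wikipedia dump; parameter values k1 and b are the common defaults, not ones reported in the paper):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized
    `query` using Okapi BM25 with a smoothed idf."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n          # average document length
    df = Counter(term for d in docs for term in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

The highest-scoring snippets would then be returned to the model as the search result.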
Machine Translation System Our fourth tool is a machine translation system based on a LM that can translate a phrase from any language into English. More concretely, we use the 600M parameter NLLB as our multilingual machine translation model that works for 200 languages (including low-resource ones). The source language is automatically detected using the fastText classifier, while the target language is always set to English.
Calendar Our final tool is a calendar API that, when queried, returns the current date without taking any input. This provides temporal context for predictions that require some awareness of time.
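A sketch of such a tool; the exact output wording is an assumption, but the interface (no input, a date rendered as text) matches the description:

```python
import datetime

def calendar():
    """Return the current date as a text sequence; takes no input."""
    return datetime.date.today().strftime("Today is %A, %B %d, %Y.")
```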
Related Work
Language Model Pretraining There are various approaches that augment LMs with some form of additional textual information during pretraining, including various forms of metadata, HTML tags, Wikipedia markup, or related texts obtained from an information retrieval system. In all of these approaches, the additional information is always provided, regardless of whether it is helpful. In contrast, Toolformer learns for itself to explicitly ask for the right information.
Tool Use Several approaches aim to equip LMs with the ability to use external tools such as search engines, web browsers, calculators, translation systems and Python interpreters. The way these models learn to use tools can roughly be divided into two approaches: either they rely on large amounts of human supervision, or they work by prompting the language model in a few-shot setup tailored towards a specific task where it is known a priori which tools need to be used. In contrast, the self-supervised nature of Toolformer enables it to learn how and when to use tools without requiring a specific prompt that shows task-specific examples of how a tool could be used. Perhaps most closely related to our work is TALM, an approach that uses a similar self-supervised objective for teaching a model to use a calculator and a search engine, but explores this only in settings where a model is finetuned for downstream tasks.
Bootstrapping The idea of using self-training and bootstrapping techniques to improve models has been investigated in various contexts, ranging from word sense disambiguation, relation extraction, parsing, sequence generation, few-shot text classification and retrieval to reasoning. In a similar spirit, Toolformer is trained on its own predictions after applying a perplexity-based filtering step.
Limitations
While our approach enables LMs to learn how to use a variety of tools in a self-supervised way, there are some clear limitations to what can be achieved with our method in its current form. One such limitation is the inability of Toolformer to use tools in a chain (i.e., using the output of one tool as an input for another tool). This is because API calls for each tool are generated independently; as a consequence, there are no examples of chained tool use in the finetuning dataset, since this would require multiple API calls per example. Our current approach also does not allow the LM to use a tool in an interactive way. Especially for tools such as search engines, which can return hundreds of different results, enabling an LM to browse through these results or to refine its search query, in a similar spirit to Nakano et al. (2021), can be crucial for certain applications. Beyond this, we found models trained with Toolformer to often be sensitive to the exact wording of their input when deciding whether or not to call an API; this is perhaps unsurprising given that LMs are known to be very sensitive to the prompts they are given in both zero- and few-shot settings. Depending on the tool, our method is also very sample-inefficient; for example, processing more than a million documents yields only a few thousand examples of useful calls to the calculator API. A potential solution might be to apply our approach iteratively, as is done in related bootstrapping approaches. Finally, when deciding whether or not to make an API call, Toolformer currently does not take into account the tool-dependent computational cost of making that call.
Conclusion
We have introduced Toolformer, a LM that learns in a self-supervised way how to use different tools such as search engines, calculators, and translation systems via simple API calls. This is done by finetuning on sampled API calls that are filtered based on whether they reduce perplexity on future tokens. Toolformer considerably improves zero-shot performance of a 6.7B parameter GPT-J model, enabling it to even outperform a much larger GPT-3 model on a range of different downstream tasks.