
Language Models are Few-Shot Learners

Abstract

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

Introduction

Recent years have featured a trend towards pre-trained language representations in NLP systems, applied in increasingly flexible and task-agnostic ways for downstream transfer. First, single-layer representations were learned using word vectors and fed to task-specific architectures, then RNNs with multiple layers of representations and contextual state were used to form stronger representations (though still applied to task-specific architectures), and more recently pre-trained recurrent or transformer language models have been directly fine-tuned, entirely removing the need for task-specific architectures.

This last paradigm has led to substantial progress on many challenging NLP tasks such as reading comprehension, question answering, textual entailment, and many others, and has continued to advance based on new architectures and algorithms. However, a major limitation to this approach is that while the architecture is task-agnostic, there is still a need for task-specific datasets and task-specific fine-tuning: to achieve strong performance on a desired task typically requires fine-tuning on a dataset of thousands to hundreds of thousands of examples specific to that task. Removing this limitation would be desirable, for several reasons.

First, from a practical perspective, the need for a large dataset of labeled examples for every new task limits the applicability of language models. There exists a very wide range of possible useful language tasks, encompassing anything from correcting grammar, to generating examples of an abstract concept, to critiquing a short story. For many of these tasks it is difficult to collect a large supervised training dataset, especially when the process must be repeated for every new task.

Second, the potential to exploit spurious correlations in training data fundamentally grows with the expressiveness of the model and the narrowness of the training distribution. This can create problems for the pre-training plus fine-tuning paradigm, where models are designed to be large to absorb information during pre-training, but are then fine-tuned on very narrow task distributions. For instance, larger models do not necessarily generalize better out-of-distribution. There is evidence that suggests that the generalization achieved under this paradigm can be poor because the model is overly specific to the training distribution and does not generalize well outside it. Thus, the performance of fine-tuned models on specific benchmarks, even when it is nominally at human-level, may exaggerate actual performance on the underlying task.

Third, humans do not require large supervised datasets to learn most language tasks – a brief directive in natural language or at most a tiny number of demonstrations is often sufficient to enable a human to perform a new task to at least a reasonable degree of competence. Aside from pointing to a conceptual limitation in our current NLP techniques, this adaptability has practical advantages – it allows humans to seamlessly mix together or switch between many tasks and skills, for example performing addition during a lengthy dialogue. To be broadly useful, we would someday like our NLP systems to have this same fluidity and generality.

One potential route towards addressing these issues is meta-learning – which in the context of language models means the model develops a broad set of skills and pattern recognition abilities at training time, and then uses those abilities at inference time to rapidly adapt to or recognize the desired task (illustrated in Figure 1.1). Recent work attempts to do this via what we call "in-context learning", using the text input of a pretrained language model as a form of task specification: the model is conditioned on a natural language instruction and/or a few demonstrations of the task and is then expected to complete further instances of the task simply by predicting what comes next.

While it has shown some initial promise, this approach still achieves results far inferior to fine-tuning – for example it achieves only 4% on Natural Questions, and even its 55 F1 CoQA result is now more than 35 points behind the state of the art. Meta-learning clearly requires substantial improvement in order to be viable as a practical method of solving language tasks.

Another recent trend in language modeling may offer a way forward. In recent years the capacity of transformer language models has increased substantially, from 100 million parameters, to 300 million parameters, to 1.5 billion parameters, to 8 billion parameters, 11 billion parameters, and finally 17 billion parameters. Each increase has brought improvements in text synthesis and/or downstream NLP tasks, and there is evidence suggesting that log loss, which correlates well with many downstream tasks, follows a smooth trend of improvement with scale. Since in-context learning involves absorbing many skills and tasks within the parameters of the model, it is plausible that in-context learning abilities might show similarly strong gains with scale.

In this paper, we test this hypothesis by training a 175 billion parameter autoregressive language model, which we call GPT-3, and measuring its in-context learning abilities. Specifically, we evaluate GPT-3 on over two dozen NLP datasets, as well as several novel tasks designed to test rapid adaptation to tasks unlikely to be directly contained in the training set. For each task, we evaluate GPT-3 under 3 conditions: (a) "few-shot learning", or in-context learning where we allow as many demonstrations as will fit into the model's context window (typically 10 to 100), (b) "one-shot learning", where we allow only one demonstration, and (c) "zero-shot" learning, where no demonstrations are allowed and only an instruction in natural language is given to the model. GPT-3 could also in principle be evaluated in the traditional fine-tuning setting, but we leave this to future work.

Figure 1.2 illustrates the conditions we study, and shows few-shot learning of a simple task requiring the model to remove extraneous symbols from a word. Model performance improves with the addition of a natural language task description, and with the number of examples in the model's context, K. Few-shot learning also improves dramatically with model size. Though the results in this case are particularly striking, the general trends with both model size and number of examples in-context hold for most tasks we study. We emphasize that these "learning" curves involve no gradient updates or fine-tuning, just increasing numbers of demonstrations given as conditioning.
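As a concrete illustration of this conditioning-only setup, the prompt for a symbol-removal task of this kind can be sketched as below. The task description and demonstration pairs are made-up examples, not the paper's exact prompts.

```python
# Sketch of a few-shot prompt for a "remove extraneous symbols" task.
# The model receives only this text as conditioning; no weights are updated.

def build_prompt(description: str, demos: list[tuple[str, str]], query: str) -> str:
    """Assemble a prompt: optional task description, K demonstrations,
    then a final context whose completion the model must predict."""
    lines = [description] if description else []
    for corrupted, clean in demos:
        lines.append(f"{corrupted} -> {clean}")
    lines.append(f"{query} ->")  # the model is expected to continue with the answer
    return "\n".join(lines)

demos = [("h!e@l#l$o", "hello"), ("w*o%r^l&d", "world")]
prompt = build_prompt("Remove the extra symbols from the word:", demos, "c(o)d(e)")
print(prompt)
```

Increasing the number of demonstration lines corresponds to increasing K; dropping them all while keeping the description corresponds to the zero-shot setting.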

Broadly, on NLP tasks GPT-3 achieves promising results in the zero-shot and one-shot settings, and in the few-shot setting is sometimes competitive with or even occasionally surpasses state-of-the-art (despite state-of-the-art being held by fine-tuned models). For example, GPT-3 achieves 81.5 F1 on CoQA in the zero-shot setting, 84.0 F1 on CoQA in the one-shot setting, 85.0 F1 in the few-shot setting. Similarly, GPT-3 achieves 64.3% accuracy on TriviaQA in the zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting, the last of which is state-of-the-art relative to fine-tuned models operating in the same closed-book setting.

GPT-3 also displays one-shot and few-shot proficiency at tasks designed to test rapid adaptation or on-the-fly reasoning, which include unscrambling words, performing arithmetic, and using novel words in a sentence after seeing them defined only once. We also show that in the few-shot setting, GPT-3 can generate synthetic news articles which human evaluators have difficulty distinguishing from human-generated articles.

At the same time, we also find some tasks on which few-shot performance struggles, even at the scale of GPT-3. This includes natural language inference tasks like the ANLI dataset, and some reading comprehension datasets like RACE or QuAC. By presenting a broad characterization of GPT-3's strengths and weaknesses, including these limitations, we hope to stimulate study of few-shot learning in language models and draw attention to where progress is most needed.

A heuristic sense of the overall results can be seen in Figure 1.3, which aggregates the various tasks (though it should not be seen as a rigorous or meaningful benchmark in itself).

We also undertake a systematic study of "data contamination" – a growing problem when training high capacity models on datasets such as Common Crawl, which can potentially include content from test datasets simply because such content often exists on the web. In this paper we develop systematic tools to measure data contamination and quantify its distorting effects. Although we find that data contamination has a minimal effect on GPT-3's performance on most datasets, we do identify a few datasets where it could be inflating results, and we either do not report results on these datasets or we note them with an asterisk, depending on the severity.

In addition to all the above, we also train a series of smaller models (ranging from 125 million parameters to 13 billion parameters) in order to compare their performance to GPT-3 in the zero, one and few-shot settings. Broadly, for most tasks we find relatively smooth scaling with model capacity in all three settings; one notable pattern is that the gap between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models are more proficient meta-learners.

Finally, given the broad spectrum of capabilities displayed by GPT-3, we discuss concerns about bias, fairness, and broader societal impacts, and attempt a preliminary analysis of GPT-3's characteristics in this regard.

The remainder of this paper is organized as follows. In Section 2, we describe our approach and methods for training GPT-3 and evaluating it. Section 3 presents results on the full range of tasks in the zero-, one- and few-shot settings. Section 4 addresses questions of data contamination (train-test overlap). Section 5 discusses limitations of GPT-3. Section 6 discusses broader impacts. Section 7 reviews related work and Section 8 concludes.

Approach

Our basic pre-training approach, including model, data, and training, is similar to the process described in [RWC+19], with relatively straightforward scaling up of the model size, dataset size and diversity, and length of training. Our use of in-context learning is also similar to [RWC+19], but in this work we systematically explore different settings for learning within the context. Therefore, we start this section by explicitly defining and contrasting the different settings that we will be evaluating GPT-3 on or could in principle evaluate GPT-3 on. These settings can be seen as lying on a spectrum of how much task-specific data they tend to rely on. Specifically, we can identify at least four points on this spectrum (see Figure 2.1 for an illustration):

  • Fine-Tuning (FT) has been the most common approach in recent years, and involves updating the weights of a pre-trained model by training on a supervised dataset specific to the desired task. Typically thousands to hundreds of thousands of labeled examples are used. The main advantage of fine-tuning is strong performance on many benchmarks. The main disadvantages are the need for a new large dataset for every task, the potential for poor generalization out-of-distribution, and the potential to exploit spurious features of the training data, potentially resulting in an unfair comparison with human performance. In this work we do not fine-tune GPT-3 because our focus is on task-agnostic performance, but GPT-3 can be fine-tuned in principle and this is a promising direction for future work.
  • Few-Shot (FS) is the term we will use in this work to refer to the setting where the model is given a few demonstrations of the task at inference time as conditioning, but no weight updates are allowed. As shown in Figure 2.1, for a typical dataset an example has a context and a desired completion (for example an English sentence and the French translation), and few-shot works by giving K examples of context and completion, and then one final example of context, with the model expected to provide the completion. We typically set K in the range of 10 to 100 as this is how many examples can fit in the model's context window (nctx=2048). The main advantages of few-shot are a major reduction in the need for task-specific data and reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset. The main disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned models. Also, a small amount of task specific data is still required. As indicated by the name, few-shot learning as described here for language models is related to few-shot learning as used in other contexts in ML [HYC01, VBL+16] – both involve learning based on a broad distribution of tasks (in this case implicit in the pre-training data) and then rapidly adapting to a new task.
  • One-Shot (1S) is the same as few-shot except that only one demonstration is allowed, in addition to a natural language description of the task, as shown in Figure 1. The reason to distinguish one-shot from few-shot and zero-shot (below) is that it most closely matches the way in which some tasks are communicated to humans. For example, when asking humans to generate a dataset on a human worker service (for example Mechanical Turk), it is common to give one demonstration of the task. By contrast it is sometimes difficult to communicate the content or format of a task if no examples are given.
  • Zero-Shot (0S) is the same as one-shot except that no demonstrations are allowed, and the model is only given a natural language instruction describing the task. This method provides maximum convenience, potential for robustness, and avoidance of spurious correlations (unless they occur very broadly across the large corpus of pre-training data), but is also the most challenging setting. In some cases it may even be difficult for humans to understand the format of the task without prior examples, so this setting is in some cases "unfairly hard". For example, if someone is asked to "make a table of world records for the 200m dash", this request can be ambiguous, as it may not be clear exactly what format the table should have or what should be included (and even with careful clarification, understanding precisely what is desired can be difficult). Nevertheless, for at least some settings zero-shot is closest to how humans perform tasks – for example, in the translation example in Figure 2.1, a human would likely know what to do from just the text instruction.
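The three settings we evaluate differ only in how many demonstrations precede the final context. Using the English-to-French example of Figure 2.1, they can be sketched as follows; the exact prompt wording is illustrative rather than the paper's verbatim format.

```python
# Zero-, one-, and few-shot prompts for the translation example of Figure 2.1.
# Only the number of demonstrations changes; no gradient updates occur.

def make_prompt(instruction: str, demos: list[tuple[str, str]], query: str) -> str:
    lines = [instruction]
    lines += [f"{en} => {fr}" for en, fr in demos]  # K demonstrations (K=0 for zero-shot)
    lines.append(f"{query} =>")                     # the model supplies the completion
    return "\n".join(lines)

instruction = "Translate English to French:"
demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]

zero_shot = make_prompt(instruction, [], "cheese")         # instruction only
one_shot = make_prompt(instruction, demos[:1], "cheese")   # a single demonstration
few_shot = make_prompt(instruction, demos, "cheese")       # K=2 here; typically 10 to 100
```

Fine-tuning has no prompt analogue in this sketch: it would instead update the model's weights on (context, completion) pairs.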

Figure 2.1 shows the four methods using the example of translating English to French. In this paper we focus on zero-shot, one-shot and few-shot, with the aim of comparing them not as competing alternatives, but as different problem settings which offer a varying trade-off between performance on specific benchmarks and sample efficiency. We especially highlight the few-shot results as many of them are only slightly behind state-of-the-art fine-tuned models. Ultimately, however, one-shot, or even sometimes zero-shot, seem like the fairest comparisons to human performance, and are important targets for future work.

Sections 2.1-2.3 below give details on our models, training data, and training process respectively. Section 2.4 discusses the details of how we do few-shot, one-shot, and zero-shot evaluations.

2.1 Model and Architectures

We use the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer. To study the dependence of ML performance on model size, we train 8 different sizes of model, ranging over three orders of magnitude from 125 million parameters to 175 billion parameters, with the last being the model we call GPT-3. Previous work suggests that with enough training data, scaling of validation loss should be approximately a smooth power law as a function of size; training models of many different sizes allows us to test this hypothesis both for validation loss and for downstream language tasks.
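The power-law hypothesis can be made concrete: if validation loss behaves roughly as L(N) = a * N^(-b) in parameter count N, two (N, loss) measurements suffice to recover the exponent. The loss values below are hypothetical placeholders, not measurements from the paper.

```python
import math

# If L(N) = a * N**(-b), then on a log-log plot loss is linear in N, and
# the slope between two measured points gives the exponent b.

def power_law_exponent(n1: float, l1: float, n2: float, l2: float) -> float:
    return -(math.log(l2) - math.log(l1)) / (math.log(n2) - math.log(n1))

# Hypothetical losses at the smallest (125M) and largest (175B) model sizes.
b = power_law_exponent(125e6, 3.0, 175e9, 1.8)
```

Training eight model sizes, rather than two, lets the paper check whether the intermediate points actually fall on such a line.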

Table 2.1 shows the sizes and architectures of our 8 models. Here nparams is the total number of trainable parameters, nlayers is the total number of layers, dmodel is the number of units in each bottleneck layer (we always have the feedforward layer four times the size of the bottleneck layer, dff=4dmodel), and dhead is the dimension of each attention head. All models use a context window of nctx=2048 tokens. We partition the model across GPUs along both the depth and width dimension in order to minimize data-transfer between nodes. The precise architectural parameters for each model are chosen based on computational efficiency and load-balancing in the layout of models across GPUs. Previous work suggests that validation loss is not strongly sensitive to these parameters within a reasonably broad range.
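Table 2.1's counts can be sanity-checked with the standard back-of-the-envelope formula for a decoder-only transformer with dff=4dmodel: each layer contributes about 4·dmodel² attention parameters (Q, K, V, and output projections) plus 8·dmodel² feedforward parameters, for roughly 12·nlayers·dmodel² overall (embeddings and biases ignored). A minimal sketch, using the published GPT-3 shape of 96 layers and dmodel=12288:

```python
# Approximate parameter count of a decoder-only transformer with d_ff = 4*d_model:
# per layer, 4*d_model^2 (attention projections) + 8*d_model^2 (feedforward),
# so ~12 * n_layers * d_model^2 total; embeddings and biases are ignored.

def approx_params(n_layers: int, d_model: int) -> int:
    return 12 * n_layers * d_model ** 2

gpt3 = approx_params(96, 12288)   # GPT-3's published depth and width
print(f"{gpt3 / 1e9:.0f}B")       # close to the quoted 175 billion
```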

2.2 Training Dataset

Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset constituting nearly a trillion words. This size of dataset is sufficient to train our largest models without ever updating on the same sequence twice. However, we have found that unfiltered or lightly filtered versions of Common Crawl tend to have lower quality than more curated datasets. Therefore, we took 3 steps to improve the average quality of our datasets: (1) we downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora, (2) we performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and (3) we also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.

Details of the first two points (processing of Common Crawl) are described in Appendix A. For the third, we added several curated high-quality datasets, including an expanded version of the WebText dataset, collected by scraping links over a longer period of time, and first described in previous work, two internet-based books corpora (Books1 and Books2) and English-language Wikipedia.

Table 2.2 shows the final mixture of datasets that we used in training. The CommonCrawl data was downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019, constituting 45TB of compressed plaintext before filtering and 570GB after filtering, roughly equivalent to 400 billion byte-pair-encoded tokens. Note that during training, datasets are not sampled in proportion to their size, but rather datasets we view as higher-quality are sampled more frequently, such that CommonCrawl and Books2 datasets are sampled less than once during training, but the other datasets are sampled 2-3 times. This essentially accepts a small amount of overfitting in exchange for higher quality training data.
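The "sampled less than once" arithmetic follows directly from the sampling weights: a dataset given weight w supplies w × (total training tokens) tokens, so it is traversed w × total / size times. A minimal sketch, using the approximate sizes and weights of Table 2.2 (token counts in billions) and the roughly 300B tokens seen during training:

```python
# Effective epochs per dataset under weighted sampling.
# Sizes and weights below are approximately those of Table 2.2.
TOTAL = 300  # ~300B tokens seen during training

datasets = {            # name: (size in billions of tokens, sampling weight)
    "CommonCrawl": (410, 0.60),
    "WebText2":    (19, 0.22),
    "Books1":      (12, 0.08),
    "Books2":      (55, 0.08),
    "Wikipedia":   (3, 0.03),
}

for name, (size, weight) in datasets.items():
    print(f"{name}: {weight * TOTAL / size:.2f} epochs seen")
```

Under these figures CommonCrawl and Books2 are each seen less than once, while the smaller curated corpora are repeated, matching the trade-off described above.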

A major methodological concern with language models pretrained on a broad swath of internet data, particularly large models with the capacity to memorize vast amounts of content, is potential contamination of downstream tasks by having their test or development sets inadvertently seen during pre-training. To reduce such contamination, we searched for and attempted to remove any overlaps with the development and test sets of all benchmarks studied in this paper. Unfortunately, a bug in the filtering caused us to ignore some overlaps, and due to the cost of training it was not feasible to retrain the model. In Section 4 we characterize the impact of the remaining overlaps, and in future work we will more aggressively remove data contamination.

2.3 Training Process

As found in previous work, larger models can typically use a larger batch size, but require a smaller learning rate. We measure the gradient noise scale during training and use it to guide our choice of batch size. Table 2.1 shows the parameter settings we used. To train the larger models without running out of memory, we use a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network. All models were trained on V100 GPUs on part of a high-bandwidth cluster provided by Microsoft. Details of the training process and hyperparameter settings are described in Appendix B.

2.4 Evaluation

For few-shot learning, we evaluate each example in the evaluation set by randomly drawing K examples from that task's training set as conditioning, delimited by 1 or 2 newlines depending on the task. For LAMBADA and Storycloze there is no supervised training set available so we draw conditioning examples from the development set and evaluate on the test set. For Winograd (the original, not SuperGLUE version) there is only one dataset, so we draw conditioning examples directly from it.

K can be any value from 0 to the maximum amount allowed by the model's context window, which is nctx=2048 for all models and typically fits 10 to 100 examples. Larger values of K are usually but not always better, so when a separate development and test set are available, we experiment with a few values of K on the development set and then run the best value on the test set. For some tasks we also use a natural language prompt in addition to (or for K=0, instead of) demonstrations.

On tasks that involve choosing one correct completion from several options (multiple choice), we provide K examples of context plus correct completion, followed by one example of context only, and compare the LM likelihood of each completion. For most tasks we compare the per-token likelihood (to normalize for length), however on a small number of datasets we gain additional benefit as measured on the development set by normalizing by the unconditional probability of each completion, by computing P(completion|context) / P(completion|answer_context), where answer_context is the string "Answer: " or "A: " and is used to prompt that the completion should be an answer but is otherwise generic.
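The two scoring rules can be sketched as below. Here `token_logprobs` is a hypothetical stub standing in for a language-model call that returns per-token log-probabilities of a completion given a context; a real implementation would query the model.

```python
# Sketch of the two multiple-choice scoring rules described above.

def token_logprobs(context: str, completion: str) -> list[float]:
    # Placeholder so the sketch runs; a real version queries the LM.
    return [-0.5 - 0.01 * len(tok) for tok in completion.split()]

def per_token_score(context: str, completion: str) -> float:
    lps = token_logprobs(context, completion)
    return sum(lps) / len(lps)  # length-normalized log-likelihood

def normalized_score(context: str, completion: str) -> float:
    # log of P(completion|context) / P(completion|answer_context),
    # with "Answer: " as the generic answer_context
    return (sum(token_logprobs(context, completion))
            - sum(token_logprobs("Answer: ", completion)))

def pick(context: str, options: list[str]) -> str:
    return max(options, key=lambda opt: per_token_score(context, opt))
```

With a real model, `normalized_score` corrects for completions that are a priori unlikely strings regardless of the question.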

On tasks that involve binary classification, we give the options more semantically meaningful names (e.g. "True" or "False" rather than 0 or 1) and then treat the task like multiple choice; we also sometimes frame the task similarly to what is done in previous work (see Appendix G for details).

对于涉及二元分类的任务,我们为选项赋予更具语义意义的名字(例如“真”或“假”而非 0 或 1),然后将任务视为多项选择;我们有时也按照先前工作的方式来构建任务,详情见附录 G。

On tasks with free-form completion, we use beam search with the same parameters as previous work: a beam width of 4 and a length penalty of α=0.6. We score the model using F1 similarity score, BLEU, or exact match, depending on what is standard for the dataset at hand.

对于自由形式补全的任务,我们使用与先前工作相同的参数进行束搜索:束宽度为 4,长度惩罚 α=0.6。我们根据所处理数据集的常用标准,使用 F1 相似度分数、BLEU 或精确匹配对模型进行评分。
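The length-penalty scoring used to rank beam hypotheses can be sketched as below; we assume here the GNMT-style penalty form used by the cited prior work, with α = 0.6:

```python
# Sketch of beam-hypothesis ranking with a length penalty
# (assumed GNMT form: lp(Y) = ((5 + |Y|) / 6) ** alpha).

def length_penalty(length, alpha=0.6):
    # Grows with hypothesis length; equals 1.0 at length 1.
    return ((5 + length) / 6) ** alpha

def beam_score(total_logprob, length, alpha=0.6):
    # Dividing the summed log-probability by lp(Y) counteracts the
    # bias of raw log-probability toward very short outputs.
    return total_logprob / length_penalty(length, alpha)

# A longer hypothesis with lower raw log-probability can still rank first:
short = beam_score(-4.0, 3)     # raw -4.0 over 3 tokens
long_ = beam_score(-6.0, 20)    # raw -6.0 over 20 tokens
```

With α = 0 the penalty vanishes and ranking reduces to raw log-probability; α = 0.6 is the intermediate setting quoted above.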

Final results are reported on the test set when publicly available, for each model size and learning setting (zero-, one-, and few-shot). When the test set is private, our model is often too large to fit on the test server, so we report results on the development set. We do submit to the test server on a small number of datasets where we were able to make submission work, and we submit only the 175B few-shot results, and report development set results for everything else.

最终结果在测试集公开可用时报告,针对每种模型规模和学习设置(零样本、单样本和少样本)。当测试集不公开时,我们的模型通常太大而无法放入测试服务器,因此我们在开发集上报告结果。我们确实在少数能够成功提交的数据集上向测试服务器提交了结果,并且仅提交了 175B 的少样本结果,其余所有结果均在开发集上报告。

Measuring and Preventing Memorization Of Benchmarks

Since our training dataset is sourced from the internet, it is possible that our model was trained on some of our benchmark test sets. Accurately detecting test contamination from internet-scale datasets is a new area of research without established best practices. While it is common practice to train large models without investigating contamination, given the increasing scale of pretraining datasets, we believe this issue is becoming increasingly important to attend to.

由于我们的训练数据集来源于互联网,模型有可能在部分基准测试集上进行了训练。准确地从互联网规模的数据集中检测测试数据污染是一个新的研究领域,尚无成熟的最佳实践。尽管在训练大型模型时不调查数据污染是常见做法,但鉴于预训练数据集的规模不断扩大,我们认为这个问题正变得越来越重要。

This concern is not just hypothetical. One of the first papers to train a language model on Common Crawl data detected and removed a training document which overlapped with one of their evaluation datasets. Other work such as GPT-2 also conducted post-hoc overlap analysis. Their study was relatively encouraging, finding that although models did perform moderately better on data that overlapped between training and testing, this did not significantly impact reported results due to the small fraction of data which was contaminated.

这种担忧并非空穴来风。最早在 Common Crawl 数据上训练语言模型的论文之一就检测并移除了一个与其评估数据集重叠的训练文档。其他工作,如 GPT-2,也进行了事后重叠分析。他们的研究结果相对令人鼓舞,发现尽管模型在训练和测试重叠的数据上表现略好,但由于被污染的数据占比很小,这并未显著影响报告的结果。

GPT-3 operates in a somewhat different regime. On the one hand, the dataset and model size are about two orders of magnitude larger than those used for GPT-2, and include a large amount of Common Crawl, creating increased potential for contamination and memorization. On the other hand, precisely due to the large amount of data, even GPT-3 175B does not overfit its training set by a significant amount, measured relative to a held-out validation set with which it was deduplicated. Thus, we expect that contamination is likely to be frequent, but that its effects may not be as large as feared.

GPT-3 所处的环境有所不同。一方面,其数据集和模型规模比 GPT-2 大了约两个数量级,并且包含了大量 Common Crawl 数据,这增加了污染和记忆的可能性。另一方面,恰恰由于数据量巨大,即使 GPT-3 175B 相对于经过去重的保留验证集而言,也没有显著过拟合其训练集。因此,我们预期污染可能很常见,但其影响可能没有担心的那么大。

We initially tried to address the issue of contamination by proactively searching for and attempting to remove any overlap between our training data and the development and test sets of all benchmarks studied in this paper. Unfortunately, a bug resulted in only partial removal of all detected overlaps from the training data. Due to the cost of training, it wasn't feasible to retrain the model. To address this, we investigate in detail how the remaining detected overlap impacts results.

我们最初试图通过主动搜索并移除训练数据与本文研究的所有基准的开发集和测试集之间的任何重叠来解决污染问题。不幸的是,一个错误导致仅从训练数据中部分移除了所有检测到的重叠。由于训练成本,重新训练模型并不可行。为了解决这个问题,我们详细研究了剩余的检测到的重叠如何影响结果。

For each benchmark, we produce a 'clean' version which removes all potentially leaked examples, defined roughly as examples that have a 13-gram overlap with anything in the pretraining set. The goal is to very conservatively flag anything that could potentially be contamination, so as to produce a clean subset that is free of contamination with high confidence. The exact procedure is detailed in Appendix C.

对于每个基准,我们制作了一个"干净"版本,移除了所有可能泄露的示例,大致定义为与预训练集中任何内容存在 13-gram 重叠的示例。目标是极其保守地标记任何可能被污染的内容,以便高置信度地生成一个无污染的干净子集。具体步骤详见附录 C。
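A simplified sketch of this conservative flagging rule follows. The lowercased whitespace tokenization is our simplification; the paper's exact normalization and filtering procedure are given in Appendix C:

```python
# Simplified sketch of the 13-gram contamination filter.

N = 13

def ngrams(text, n=N):
    """All n-grams of a text, as tuples of lowercased whitespace tokens."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_index(pretraining_docs, n=N):
    """Every n-gram appearing anywhere in the pretraining corpus."""
    index = set()
    for doc in pretraining_docs:
        index |= ngrams(doc, n)
    return index

def clean_subset(eval_examples, index, n=N):
    """Keep only examples sharing no n-gram with the pretraining index."""
    return [ex for ex in eval_examples if not (ngrams(ex, n) & index)]

index = build_index(["a b c d e f g h i j k l m n o"])
flagged = "x a b c d e f g h i j k l m y"           # shares 13-gram a..m
unflagged = "p q r s t u v w one two three four five six"
```

A single shared 13-gram suffices to flag an example, which is what makes the rule conservative: many flagged examples share only incidental text with the pretraining set rather than the answer itself.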

We then evaluate GPT-3 on these clean benchmarks, and compare to the original score. If the score on the clean subset is similar to the score on the entire dataset, this suggests that contamination, even if present, does not have a significant effect on reported results. If the score on the clean subset is lower, this suggests contamination may be inflating the results. The results are summarized in Figure 4.2. Although potential contamination is often high, in most cases performance changes only negligibly, and we see no evidence that contamination level and performance difference are correlated. We conclude that either our conservative method substantially overestimated contamination or that contamination has little effect on performance.

然后,我们在这些干净的基准上评估 GPT-3,并与原始分数进行比较。如果干净子集上的分数与整个数据集的分数相似,则表明即使存在污染,也不会对报告的结果产生显著影响。如果干净子集上的分数较低,则表明污染可能夸大了结果。结果总结于图 4.2。尽管潜在污染率通常很高,但在大多数情况下,性能变化微乎其微,并且我们没有看到污染水平与性能差异相关的证据。我们得出的结论是,要么我们保守的方法大大高估了污染,要么污染对性能影响很小。

Below, we review in more detail the few specific cases where either the model performs significantly worse on the cleaned version, or potential contamination is very high, which makes measuring the performance difference difficult.

下面,我们更详细地回顾少数几种具体情况:模型在干净版本上表现显著变差,或者潜在污染率非常高,导致难以衡量性能差异。

Our analysis flagged six groups of benchmarks for further investigation: Word Scrambling, Reading Comprehension, PIQA, Winograd, language modeling tasks, and German to English translation. Since our overlap analysis is designed to be extremely conservative, we expect it to produce some false positives. We summarize the results for each group of tasks below:

我们的分析标记了六组需要进一步研究的基准:词汇重排、阅读理解、PIQA、Winograd、语言建模任务和德译英。由于我们的重叠分析设计得极为保守,我们预计会产生一些误报。以下总结每组任务的结果:

  • Reading Comprehension: Our initial analysis flagged >90% of task examples from QuAC, SQuAD2, and DROP as potentially contaminated, so large that even measuring the differential on a clean subset was difficult. Upon manual inspection, however, we found that for every overlap we inspected, in all 3 datasets, the source text was present in our training data but the question/answer pairs were not, meaning the model gains only background information and cannot memorize the answer to a specific question.
  • 阅读理解:我们的初步分析标记 QuAC、SQuAD2 和 DROP 中超过 90% 的任务示例为潜在污染,比例如此之高,以至于难以衡量干净子集上的差异。然而,经过手动检查,我们发现对于这三个数据集中我们检查的每一个重叠,源文本确实出现在我们的训练数据中,但问题/答案对并未出现,这意味着模型仅获得了背景信息,无法记忆特定问题的答案。
  • German translation: We found 25% of the examples in the WMT16 German-English test set were marked as potentially contaminated, with an associated total effect size of 1–2 BLEU. Upon inspection, none of the flagged examples contain paired sentences resembling NMT training data and collisions were monolingual matches mostly of snippets of events discussed in the news.
  • 德译英:我们发现 WMT16 德英测试集中 25% 的示例被标记为潜在污染,相关的总效应大小为 1-2 BLEU。经检查,被标记的示例中没有一个包含类似于 NMT 训练数据的配对句子,碰撞主要是新闻中讨论的事件片段在单语上的匹配。
  • Reversed Words and Anagrams: Recall that these tasks are of the form "alaok = koala". Due to the short length of these tasks, we used 2-grams for filtering. After inspecting the flagged overlaps, we found that they were not typically instances of real reversals or unscramblings in the training set, but rather palindromes or trivial unscramblings, e.g. "kayak = kayak". The amount of overlap was small, but removing the trivial tasks led to an increase in difficulty and thus a spurious signal. Related to this, the symbol insertion task shows high overlap but no effect on performance – this is because that task involves removing non-letter characters from a word, and the overlap analysis itself ignores such characters, leading to many spurious matches.
  • 倒序词与变位词:回想一下,这些任务的形式是 "alaok = koala"。由于这些任务长度较短,我们使用 2-gram 进行过滤。检查被标记的重叠后,我们发现它们通常不是训练集中真正的倒序或重排实例,而是回文或琐碎的重新排列,例如 "kayak = kayak"。重叠量很小,但移除这些琐碎的任务导致难度增加,从而产生虚假信号。与此相关的是,符号插入任务显示出高重叠但对性能没有影响——这是因为该任务涉及从单词中移除非字母字符,而重叠分析本身忽略了此类字符,导致许多虚假匹配。
  • PIQA: The overlap analysis flagged 29% of examples as contaminated, and observed a 3 percentage point absolute decrease in performance on the clean subset. Though the test dataset was released after our training set was created and its labels are hidden, some of the web pages used by the crowdsourced dataset creators are contained in our training set. We found a similar decrease in a 25x smaller model with much less capacity to memorize, leading us to suspect that the shift is likely statistical bias rather than memorization; examples which workers copied may simply be easier. Unfortunately, we cannot rigorously prove this hypothesis. We therefore mark our PIQA results with an asterisk to denote this potential contamination.
  • PIQA:重叠分析标记了 29% 的示例为污染,并观察到干净子集上的性能绝对下降了 3 个百分点。尽管测试数据集是在我们的训练集创建之后发布的,并且其标签是隐藏的,但众包数据集创建者使用的一些网页包含在我们的训练集中。我们发现在一个容量小 25 倍、记忆能力弱得多的模型中也出现了类似的下降,这使我们怀疑这种变化可能是统计偏差而非记忆所致;被标注者复制的例子可能本身更简单。不幸的是,我们无法严格证明这一假设。因此,我们在 PIQA 结果上标注星号以表示这种潜在污染。
  • Winograd: The overlap analysis flagged 45% of examples, and found a 2.6% decrease in performance on the clean subset. Manual inspection of the overlapping data points showed that 132 Winograd schemas were in fact present in our training set, though presented in a different format than we present the task to the model. Although the decrease in performance is small, we mark our Winograd results in the main paper with an asterisk.
  • Winograd:重叠分析标记了 45% 的示例,并发现干净子集上的性能下降了 2.6%。对重叠数据点的手动检查显示,有 132 个 Winograd 模式确实出现在我们的训练集中,尽管其呈现格式与我们向模型呈现任务的方式不同。虽然性能下降幅度很小,但我们在主论文的 Winograd 结果上标注了星号。
  • Language modeling: We found the 4 Wikipedia language modeling benchmarks measured in GPT-2, plus the Children's Book Test dataset, to be almost entirely contained in our training data. Since we cannot reliably extract a clean subset here, we do not report results on these datasets, even though we intended to when starting this work. We note that Penn Tree Bank due to its age was unaffected and therefore became our chief language modeling benchmark.
  • 语言建模:我们发现 GPT-2 中测量的 4 个维基百科语言建模基准,加上儿童读物测试数据集,几乎完全包含在我们的训练数据中。由于我们无法在此处可靠地提取干净子集,因此我们不报告这些数据集的结果。我们注意到,Penn Tree Bank 因其年代久远而未受影响,因此成为我们的主要语言建模基准。

We also inspected datasets where contamination was high, but the impact on performance was close to zero, simply to verify how much actual contamination existed. These appeared to often contain false positives. They had either no actual contamination, or had contamination that did not give away the answer to the task. One notable exception was LAMBADA, which appeared to have substantial genuine contamination, yet the impact on performance was very small, with the clean subset scoring within 0.5% of the full dataset. Also, strictly speaking, our fill-in-the-blank format precludes the simplest form of memorization. Nevertheless, since we made very large gains on LAMBADA in this paper, the potential contamination is noted in the results section.

我们还检查了污染率高但对性能影响接近于零的数据集,只是为了验证实际污染有多少。这些似乎经常包含误报。它们要么没有实际污染,要么污染并未泄露任务答案。一个值得注意的例外是 LAMBADA,它似乎存在大量真实污染,但对性能的影响非常小,干净子集的得分在整个数据集的 0.5% 以内。而且,严格来说,我们的填空格式排除了最简单的记忆形式。尽管如此,由于我们在 LAMBADA 上取得了巨大进步,因此在结果部分指出了潜在的污染。

An important limitation of our contamination analysis is that we cannot be sure that the clean subset is drawn from the same distribution as the original dataset. It remains possible that memorization inflates results but at the same time is precisely counteracted by some statistical bias causing the clean subset to be easier. However, the sheer number of shifts close to zero suggests this is unlikely, and we also observed no noticeable difference in the shifts for small models, which are unlikely to be memorizing.

我们污染分析的一个重要局限性是我们无法确定干净子集是否与原始数据集来自相同的分布。仍然存在这种可能性:记忆夸大了结果,但同时被某种导致干净子集更容易的统计偏差精确抵消。然而,接近零的偏移数量之多表明这种情况不太可能发生,而且我们观察到小型模型(不太可能记忆)的偏移没有明显差异,这也支持了我们的结论。

Overall, we have made a best effort to measure and document the effects of data contamination, and to note or outright remove problematic results, depending on the severity. Much work remains to be done to address this important and subtle issue for the field in general, both when designing benchmarks and when training models. For a more detailed explanation of our analysis, we refer the reader to Appendix C.

总体而言,我们已经尽了最大努力来衡量和记录数据污染的影响,并根据严重程度标记或完全移除有问题的结果。要解决这个对整个领域都很重要且微妙的问题,在设计基准和训练模型时还有很多工作要做。有关我们分析的更详细说明,请读者参阅附录 C。

Limitations

GPT-3 and our analysis of it have a number of limitations. Below we describe some of these and suggest directions for future work.

GPT-3 以及我们对其进行的分析存在一些局限性。下面我们描述其中的一些,并为未来的工作提出方向。

First, despite the strong quantitative and qualitative improvements of GPT-3, particularly compared to its direct predecessor GPT-2, it still has notable weaknesses in text synthesis and several NLP tasks. On text synthesis, although the overall quality is high, GPT-3 samples still sometimes repeat themselves semantically at the document level, start to lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences or paragraphs. We will release a collection of 500 uncurated unconditional samples to help provide a better sense of GPT-3's limitations and strengths at text synthesis. Within the domain of discrete language tasks, we have noticed informally that GPT-3 seems to have special difficulty with "common sense physics", despite doing well on some datasets that test this domain. Specifically GPT-3 has difficulty with questions of the type "If I put cheese into the fridge, will it melt?". Quantitatively, GPT-3's in-context learning performance has some notable gaps on our suite of benchmarks, as described in Section 3, and in particular it does little better than chance when evaluated one-shot or even few-shot on some "comparison" tasks, such as determining if two words are used the same way in a sentence, or if one sentence implies another, as well as on a subset of reading comprehension tasks. This is especially striking given GPT-3's strong few-shot performance on many other tasks.

首先,尽管 GPT-3 在定量和定性上都有显著提升,特别是与其直接前身 GPT-2 相比,它在文本合成和几个 NLP 任务上仍然存在明显的弱点。在文本合成方面,尽管整体质量很高,但 GPT-3 的样本有时仍会在文档层面语义上重复自身,在足够长的段落中开始失去连贯性,自相矛盾,偶尔还会包含不合逻辑的句子或段落。我们将发布 500 个未经筛选的无条件样本集合,以帮助更好地了解 GPT-3 在文本合成方面的局限性和优势。在离散语言任务领域,我们非正式地注意到,尽管 GPT-3 在某些测试“常识物理”领域的数据集上表现良好,但它似乎对这个领域有特殊的困难。具体来说,GPT-3 难以回答诸如“如果我把奶酪放进冰箱,它会融化吗?”这类问题。定量地看,如第 3 节所述,GPT-3 的上下文学习性能在我们的基准套件上存在一些显著的差距,尤其是在某些“比较”任务上,例如判断两个词在句子中是否用法相同,或一个句子是否蕴含另一个句子,以及一部分阅读理解任务上,它在单样本甚至少样本评估时表现仅略好于随机。鉴于 GPT-3 在许多其他任务上强劲的少样本性能,这一点尤其引人注目。

GPT-3 has several structural and algorithmic limitations, which could account for some of the issues above. We focused on exploring in-context learning behavior in autoregressive language models because it is straightforward to both sample and compute likelihoods with this model class. As a result our experiments do not include any bidirectional architectures or other training objectives such as denoising. This is a noticeable difference from much of the recent literature, which has documented improved fine-tuning performance when using these approaches over standard language models. Thus our design decision comes at the cost of potentially worse performance on tasks which empirically benefit from bidirectionality. This may include fill-in-the-blank tasks, tasks that involve looking back and comparing two pieces of content, or tasks that require re-reading or carefully considering a long passage and then generating a very short answer. This could be a possible explanation for GPT-3's lagging few-shot performance on a few of the tasks, such as WIC, ANLI, and several reading comprehension tasks. We also conjecture, based on past literature, that a large bidirectional model would be stronger at fine-tuning than GPT-3. Making a bidirectional model at the scale of GPT-3, and/or trying to make bidirectional models work with few- or zero-shot learning, is a promising direction for future research, and could help achieve the "best of both worlds".

GPT-3 有几个结构和算法上的局限性,这可以解释上述一些问题。我们专注于探索自回归语言模型中的上下文学习行为,因为对该模型类进行采样和计算似然都很直接。因此,我们的实验不包括任何双向架构或其他训练目标,例如去噪。这与近期许多文献存在显著差异,这些文献记录了使用这些方法比标准语言模型在微调性能上有所改进。因此,我们的设计决策的代价是在经验上受益于双向性的任务上可能表现较差。这可能包括完形填空任务、涉及回顾和比较两段内容的任务,或者需要重读或仔细考虑长段落然后生成非常简短答案的任务。这可能是 GPT-3 在少数任务上(例如 WIC、ANLI 和几个阅读理解任务)少样本性能滞后的一个可能解释。我们还根据过去的文献推测,一个大规模的双向模型在微调方面会比 GPT-3 更强。制造一个 GPT-3 规模的双向模型,和/或尝试让双向模型适用于少样本或零样本学习,是未来研究的一个有前景的方向,并可能有助于实现“两全其美”。

A more fundamental limitation of the general approach described in this paper – scaling up any LM-like model, whether autoregressive or bidirectional – is that it may eventually run into the limits of the pretraining objective. Our current objective weights every token equally and lacks a notion of what is most important to predict and what is less important. Some work demonstrates benefits of customizing prediction to entities of interest. Also, with self-supervised objectives, task specification relies on forcing the desired task into a prediction problem, whereas ultimately, useful language systems might be better thought of as taking goal-directed actions rather than just making predictions. Finally, large pretrained language models are not grounded in other domains of experience, such as video or real-world physical interaction, and thus lack a large amount of context about the world. For all these reasons, scaling pure self-supervised prediction is likely to hit limits, and augmentation with a different approach is likely to be necessary. Promising future directions in this vein might include learning the objective function from humans, fine-tuning with reinforcement learning, or adding additional modalities such as images to provide grounding and a better model of the world.

本文所描述的通用方法——扩展任何类似 LM 的模型,无论是自回归的还是双向的——的一个更根本的限制是,它最终可能会遇到预训练目标的极限。我们当前的目标平等地加权每个词元,缺乏关于什么最重要、什么不那么重要的概念。一些工作展示了根据感兴趣的实体定制预测的好处。此外,使用自监督目标,任务规范依赖于将期望的任务强制转化为预测问题,而最终,有用的语言系统或许更应该被视为采取有目标的行动,而不仅仅是做出预测。最后,大型预训练语言模型并未根植于其他经验领域,例如视频或现实世界的物理交互,因此缺乏大量关于世界的背景信息。基于所有这些原因,扩展纯自监督预测很可能会遇到瓶颈,并且很可能需要用不同的方法进行补充。这方面有前景的未来方向可能包括从人类那里学习目标函数、使用强化学习进行微调,或添加图像等额外模态以提供基础和对世界的更好模型。

Another limitation broadly shared by language models is poor sample efficiency during pre-training. While GPT-3 takes a step towards test-time sample efficiency closer to that of humans, it still sees much more text during pre-training than a human sees in their lifetime. Improving pre-training sample efficiency is an important direction for future work, and might come from grounding in the physical world to provide additional information, or from algorithmic improvements.

语言模型普遍存在的另一个局限性是预训练期间的样本效率低下。虽然 GPT-3 朝着接近人类的测试时样本效率迈出了一步,但它在预训练期间看到的文本仍然比人类一生中看到的要多。提高预训练样本效率是未来工作的一个重要方向,可能来自于在物理世界中获得基础以提供额外信息,或来自算法改进。

A limitation, or at least uncertainty, associated with few-shot learning in GPT-3 is ambiguity about whether few-shot learning actually learns new tasks "from scratch" at inference time, or if it simply recognizes and identifies tasks that it has learned during training. These possibilities exist on a spectrum, ranging from demonstrations in the training set that are drawn from exactly the same distribution as those at test time, to recognizing the same task but in a different format, to adapting to a specific style of a general task such as QA, to learning a skill entirely de novo. Where GPT-3 is on this spectrum may also vary from task to task. Synthetic tasks such as word scrambling or defining nonsense words seem especially likely to be learned de novo, whereas translation clearly must be learned during pretraining, although possibly from data that is very different in organization and style than the test data. Ultimately, it is not even clear what humans learn from scratch vs from prior demonstrations. Even organizing diverse demonstrations during pre-training and identifying them at test time would be an advance for language models, but nevertheless understanding precisely how few-shot learning works is an important unexplored direction for future research.

与 GPT-3 的少样本学习相关的一个局限性,或至少是不确定性,在于少样本学习是在推理时真正“从零开始”学习新任务,还是仅仅识别出它在训练期间已经学到的任务。这些可能性存在一个谱系,从训练集中的演示与测试时的演示分布完全相同,到识别相同但格式不同的任务,到适应通用任务(如 QA)的特定风格,再到完全从头学习一项技能。GPT-3 在这个谱系上的位置也可能因任务而异。像词汇重排或定义无意义词这样的合成任务似乎特别可能是从头学习的,而翻译显然必须在预训练期间学习,尽管可能来自组织方式和风格与测试数据截然不同的数据。最终,甚至不清楚人类哪些是从头学习,哪些是从先前的演示中学习。即使在预训练期间组织多样化的演示并在测试时识别它们,对语言模型来说也是一种进步,但精确理解少样本学习是如何工作的,是未来研究一个重要的、尚未探索的方向。

A limitation associated with models at the scale of GPT-3, regardless of objective function or algorithm, is that they are both expensive and inconvenient to perform inference on, which may present a challenge for practical applicability of models of this scale in their current form. One possible future direction to address this is distillation of large models down to a manageable size for specific tasks. Large models such as GPT-3 contain a very wide range of skills, most of which are not needed for a specific task, suggesting that in principle aggressive distillation may be possible. Distillation is well-explored in general but has not been tried at the scale of hundreds of billions of parameters; new challenges and opportunities may be associated with applying it to models of this size.

与 GPT-3 规模模型相关的局限性,无论目标函数或算法如何,是执行推理既昂贵又不方便,这可能对当前形式下此类规模模型的实际应用性构成挑战。解决这个问题的一个可能的未来方向是将大型模型蒸馏到可管理的大小,以用于特定任务。像 GPT-3 这样的大型模型包含非常广泛的技能,其中大部分是特定任务不需要的,这表明原则上激进的蒸馏是可能的。蒸馏在总体上已被广泛探索,但尚未在数千亿参数的规模上尝试;将其应用于这种规模的模型可能会带来新的挑战和机遇。

Finally, GPT-3 shares some limitations common to most deep learning systems – its decisions are not easily interpretable, it is not necessarily well-calibrated in its predictions on novel inputs as observed by the much higher variance in performance than humans on standard benchmarks, and it retains the biases of the data it has been trained on. This last issue – biases in the data that may lead the model to generate stereotyped or prejudiced content – is of special concern from a societal perspective, and will be discussed along with other issues in the next section on Broader Impacts.

最后,GPT-3 具有大多数深度学习系统共有的一些局限性——其决策不易解释,对于新颖输入的预测不一定有良好的校准,正如在标准基准上观察到的比人类高得多的性能方差所证明的那样,并且它保留了其所训练数据的偏见。最后一个问题——数据中的偏见可能导致模型生成刻板或带有偏见的内容——从社会角度来看尤其值得关注,并将在下一节关于更广泛影响的讨论中与其他问题一起讨论。

Conclusion

We presented a 175 billion parameter language model which shows strong performance on many NLP tasks and benchmarks in the zero-shot, one-shot, and few-shot settings, in some cases nearly matching the performance of state-of-the-art fine-tuned systems, as well as generating high-quality samples and strong qualitative performance at tasks defined on-the-fly. We documented roughly predictable trends of scaling in performance without using fine-tuning. We also discussed the social impacts of this class of model. Despite many limitations and weaknesses, these results suggest that very large language models may be an important ingredient in the development of adaptable, general language systems.

我们提出了一个拥有1750亿参数的语言模型,该模型在零样本、单样本和少样本设置下,在许多NLP任务和基准上展现出强劲性能,在某些情况下几乎能与最先进的微调系统相媲美,并且能够生成高质量样本,在即时定义的任务上也有出色的定性表现。我们记录了在不使用微调的情况下,性能随规模扩大而大致可预测的趋势。我们还讨论了这类模型的社会影响。尽管存在诸多局限和弱点,但这些结果表明,非常大的语言模型可能是开发具有适应性的通用语言系统的重要组成部分。