Language Models are Unsupervised Multitask Learners
Abstract
Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Introduction
Machine learning systems now excel (in expectation) at tasks they are trained for by using a combination of large datasets, high-capacity models, and supervised learning. Yet these systems are brittle and sensitive to slight changes in the data distribution and task specification. Current systems are better characterized as narrow experts rather than competent generalists. We would like to move towards more general systems which can perform many tasks – eventually without the need to manually create and label a training dataset for each one.
The dominant approach to creating ML systems is to collect a dataset of training examples demonstrating correct behavior for a desired task, train a system to imitate these behaviors, and then test its performance on independent and identically distributed (IID) held-out examples. This has served well to make progress on narrow experts. But the often erratic behavior of captioning models, reading comprehension systems, and image classifiers on the diversity and variety of possible inputs highlights some of the shortcomings of this approach.
Our suspicion is that the prevalence of single task training on single domain datasets is a major contributor to the lack of generalization observed in current systems. Progress towards robust systems with current architectures is likely to require training and measuring performance on a wide range of domains and tasks. Recently, several benchmarks have been proposed such as GLUE and decaNLP to begin studying this.
Multitask learning (Caruana) is a promising framework for improving general performance. However, multitask training in NLP is still nascent. Recent work reports modest performance improvements and the two most ambitious efforts to date have trained on a total of 10 and 17 (dataset, objective) pairs respectively. From a meta-learning perspective, each (dataset, objective) pair is a single training example sampled from the distribution of datasets and objectives. Current ML systems need hundreds to thousands of examples to induce functions which generalize well. This suggests that multitask training may need just as many effective training pairs to realize its promise with current approaches. It will be very difficult to continue to scale the creation of datasets and the design of objectives to the degree that may be required to brute force our way there with current techniques. This motivates exploring additional setups for performing multitask learning.
The current best performing systems on language tasks utilize a combination of pre-training and supervised fine-tuning. This approach has a long history with a trend towards more flexible forms of transfer. First, word vectors were learned and used as inputs to task-specific architectures, then the contextual representations of recurrent networks were transferred, and recent work suggests that task-specific architectures are no longer necessary and transferring many self-attention blocks is sufficient.
These methods still require supervised training in order to perform a task. When only minimal or no supervised data is available, another line of work has demonstrated the promise of language models to perform specific tasks, such as commonsense reasoning and sentiment analysis.
In this paper, we connect these two lines of work and continue the trend of more general methods of transfer. We demonstrate that language models can perform down-stream tasks in a zero-shot setting, without any parameter or architecture modification. We highlight the ability of language models to perform a wide range of tasks in this setting, achieving promising, competitive, and state of the art results depending on the task.
Approach
At the core of our approach is language modeling. Language modeling is usually framed as unsupervised distribution estimation from a set of examples (x_1, x_2, ..., x_n), each composed of variable length sequences of symbols (s_1, s_2, ..., s_m). Since language has a natural sequential ordering, it is common to factorize the joint probability over symbols as the product of conditional probabilities: p(x) = ∏_{i=1}^{m} p(s_i | s_1, ..., s_{i-1}).
This approach allows for tractable sampling from and estimation of p(x) as well as any conditionals of the form p(s_{n-k}, ..., s_n | s_1, ..., s_{n-k-1}).
Learning to perform a single task can be expressed in a probabilistic framework as estimating a conditional distribution p(output | input). Since a general system should be able to perform many different tasks, even for the same input, it should condition not only on the input but also on the task to be performed; that is, it should model p(output | input, task).
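The chain-rule factorization behind this framing can be sketched in a few lines. This is a minimal illustration, not the paper's model: the conditional distribution below is a stand-in that ignores its prefix, whereas a real language model would condition on it.

```python
import math

# Sketch of the autoregressive factorization: the log-probability of a
# sequence is the sum of per-symbol conditional log-probabilities.
def sequence_logprob(symbols, cond_prob):
    total = 0.0
    for i, s in enumerate(symbols):
        total += math.log(cond_prob(tuple(symbols[:i]), s))
    return total

# Stand-in conditional distribution: uniform over a 4-symbol vocabulary,
# ignoring the prefix entirely (a trained LM would not).
def uniform_cond_prob(prefix, symbol):
    return 0.25

lp = sequence_logprob(["a", "b", "b", "a"], uniform_cond_prob)
# lp == 4 * math.log(0.25)
```

Any model that exposes p(s_i | s_1, ..., s_{i-1}) can be scored this way, which is what makes the framing general.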
Language modeling is also able to, in principle, learn the tasks of McCann et al. without the need for explicit supervision of which symbols are the outputs to be predicted. Since the supervised objective is the same as the unsupervised objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective. In this slightly toy setting, the previously discussed concerns with density estimation as a principled training objective are sidestepped. The problem instead becomes whether we are able to, in practice, optimize the unsupervised objective to convergence. Preliminary experiments confirmed that sufficiently large language models are able to perform multitask learning in this toy-ish setup, but learning is much slower than in explicitly supervised approaches.
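The claim that the supervised objective is the unsupervised objective evaluated on a subset of positions can be made concrete with made-up numbers. The per-token log-probabilities and the prompt format below are purely illustrative assumptions, not measured values.

```python
# Illustration: the supervised loss is the unsupervised (full-sequence)
# loss restricted to the answer positions. Values are hypothetical.
tokens    = ["translate", "to", "french", ":", "cheese", "=>", "fromage"]
logprobs  = [-2.1, -0.4, -1.3, -0.2, -3.0, -0.5, -2.6]   # made-up per-token log p
is_output = [False, False, False, False, False, False, True]  # only the answer token

# Unsupervised objective: negative log-likelihood over every position.
unsupervised_nll = -sum(logprobs)

# Supervised objective: the same quantity, summed over answer positions only.
supervised_nll = -sum(lp for lp, o in zip(logprobs, is_output) if o)
```

Driving the full-sequence loss down necessarily drives down its restriction to the answer positions, which is why the two objectives share a global minimum.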
While it is a large step from the well-posed setup described above to the messiness of "language in the wild", Weston argued, in the context of dialog, for the need to develop systems capable of learning from natural language directly, and demonstrated a proof of concept: learning a QA task without a reward signal by using forward prediction of a teacher's outputs. While dialog is an attractive approach, we worry it is overly restrictive. The internet contains a vast amount of information that is passively available without the need for interactive communication. Our speculation is that a language model with sufficient capacity will begin to learn to infer and perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement. If a language model is able to do this it will be, in effect, performing unsupervised multitask learning. We test whether this is the case by analyzing the performance of language models in a zero-shot setting on a wide variety of tasks.
2.1. Training Dataset
Most prior work trained language models on a single domain of text, such as news articles, Wikipedia, or fiction books. Our approach motivates building as large and diverse a dataset as possible in order to collect natural language demonstrations of tasks in domains and contexts as varied as possible.
A promising source of diverse and nearly unlimited text is web scrapes such as Common Crawl. While these archives are many orders of magnitude larger than current language modeling datasets, they have significant data quality issues. Trinh & Le used Common Crawl in their work on commonsense reasoning but noted a large amount of documents “whose content are mostly unintelligible”. We observed similar data issues in our initial experiments with Common Crawl. Trinh & Le’s best results were achieved using a small subsample of Common Crawl which included only documents most similar to their target dataset, the Winograd Schema Challenge. While this is a pragmatic approach to improve performance on a specific task, we want to avoid making assumptions about the tasks to be performed ahead of time.
Instead, we created a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny.
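The karma-threshold heuristic can be sketched as a simple filter. This is a hypothetical reconstruction, not the authors' pipeline: the post fields (`url`, `karma`) and the exclusion of internal Reddit links are assumptions made for illustration.

```python
# Hypothetical sketch of the karma-based link filter: keep outbound URLs
# from posts that received at least `min_karma`. Field names are assumed.
def filter_outbound_links(posts, min_karma=3):
    links = []
    for post in posts:
        internal = post["url"].startswith("https://www.reddit.com")
        if post["karma"] >= min_karma and not internal:
            links.append(post["url"])
    return links

posts = [
    {"url": "https://example.com/article", "karma": 5},
    {"url": "https://example.com/spam", "karma": 1},       # below threshold
    {"url": "https://www.reddit.com/r/foo", "karma": 10},  # internal, excluded
]
# filter_outbound_links(posts) -> ["https://example.com/article"]
```

The threshold acts as cheap human curation: each kept link was endorsed by at least a few readers.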
The resulting dataset, WebText, contains the text subset of these 45 million links. To extract the text from HTML responses we use a combination of the Dragnet and Newspaper content extractors. All results presented in this paper use a preliminary version of WebText which does not include links created after Dec 2017 and which after de-duplication and some heuristic based cleaning contains slightly over 8 million documents for a total of 40 GB of text. We removed all Wikipedia documents from WebText since it is a common data source for other datasets and could complicate analysis due to overlapping training data with test evaluation tasks.
2.2. Input Representation
A general language model (LM) should be able to compute the probability of (and also generate) any string. Current large scale LMs include pre-processing steps such as lower-casing, tokenization, and out-of-vocabulary tokens which restrict the space of model-able strings. While processing Unicode strings as a sequence of UTF-8 bytes elegantly fulfills this requirement as exemplified in work such as Gillick et al., current byte-level LMs are not competitive with word-level LMs on large scale datasets such as the One Billion Word Benchmark. We observed a similar performance gap in our own attempts to train standard byte-level LMs on WebText.
Byte Pair Encoding (BPE) is a practical middle ground between character and word level language modeling which effectively interpolates between word level inputs for frequent symbol sequences and character level inputs for infrequent symbol sequences. Despite its name, reference BPE implementations often operate on Unicode code points and not byte sequences. These implementations would require including the full space of Unicode symbols in order to model all Unicode strings. This would result in a base vocabulary of over 130,000 before any multi-symbol tokens are added. This is prohibitively large compared to the 32,000 to 64,000 token vocabularies often used with BPE. In contrast, a byte-level version of BPE only requires a base vocabulary of size 256. However, directly applying BPE to the byte sequence results in suboptimal merges due to BPE using a greedy frequency based heuristic for building the token vocabulary. We observed BPE including many versions of common words like dog since they occur in many variations such as dog. dog! dog? This results in a sub-optimal allocation of limited vocabulary slots and model capacity. To avoid this, we prevent BPE from merging across character categories for any byte sequence. We add an exception for spaces which significantly improves the compression efficiency while adding only minimal fragmentation of words across multiple vocab tokens.
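The merge restriction can be sketched as a predicate over candidate merges. This is a simplification under stated assumptions: using the first letter of the Unicode general category as the "character category", and modeling the space exception as a leading-space special case, which may differ from the actual tokenizer's rules.

```python
import unicodedata

# First letter of the Unicode general category: 'L' letter, 'N' number,
# 'P' punctuation, 'Z' separator/space, etc.
def category(ch):
    return unicodedata.category(ch)[0]

# Sketch of the merge rule: reject merges whose boundary straddles two
# character categories, except when the left token is a single space.
def can_merge(left, right):
    if left == " ":  # space exception: let a space attach to a word
        return True
    return category(left[-1]) == category(right[0])

assert can_merge("do", "g")        # letter + letter: allowed -> "dog"
assert not can_merge("dog", "!")   # letter + punctuation: blocked
assert can_merge(" ", "dog")       # space exception
```

Blocking the letter-punctuation merge is what keeps `dog`, `dog!`, and `dog?` from each consuming a separate vocabulary slot.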
This input representation allows us to combine the empirical benefits of word-level LMs with the generality of byte-level approaches. Since our approach can assign a probability to any Unicode string, this allows us to evaluate our LMs on any dataset regardless of pre-processing, tokenization, or vocab size.
2.3. Model
We use a Transformer based architecture for our LMs. The model largely follows the details of the OpenAI GPT model with a few modifications. Layer normalization was moved to the input of each sub-block, similar to a pre-activation residual network, and an additional layer normalization was added after the final self-attention block. A modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initialization by a factor of 1/√N, where N is the number of residual layers.
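A minimal sketch of the residual initialization scaling, assuming a Gaussian base initializer; the standard deviation, shapes, and depth below are placeholder values, not the paper's configuration.

```python
import math
import random

# Sketch of the depth-aware initialization: residual-path weights are
# scaled by 1/sqrt(N), where N is the number of residual layers, so that
# the variance accumulated along the residual path stays bounded.
def init_residual_weight(fan_in, fan_out, n_residual_layers, std=0.02):
    scale = 1.0 / math.sqrt(n_residual_layers)
    return [[random.gauss(0.0, std) * scale for _ in range(fan_out)]
            for _ in range(fan_in)]

W = init_residual_weight(4, 4, n_residual_layers=48)  # depth is a placeholder
```

Without the 1/√N factor, summing N independently initialized residual contributions would grow the activation variance linearly with depth.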
Generalization vs Memorization
Recent work in computer vision has shown that common image datasets contain a non-trivial amount of near-duplicate images. For instance, CIFAR-10 has 3.3% overlap between train and test images (Barz & Denzler, 2019). This results in an over-reporting of the generalization performance of machine learning systems. As the size of datasets increases this issue becomes increasingly likely, which suggests a similar phenomenon could be happening with WebText. Therefore it is important to analyze how much test data also shows up in the training data.
To study this we created Bloom filters containing 8-grams of WebText training set tokens. To improve recall, strings were normalized to contain only lower-cased alphanumeric words with a single space as a delimiter. The Bloom filters were constructed such that the false positive rate is upper bounded by 1/10^8.
These Bloom filters let us calculate, given a dataset, the percentage of 8-grams from that dataset that are also found in the WebText training set. Table 6 shows this overlap analysis for the test sets of common LM benchmarks. Common LM datasets' test sets have between 1-6% overlap with WebText train, with an average overlap of 3.2%. Somewhat surprisingly, many datasets have larger overlaps with their own training splits, with an average of 5.9% overlap.
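The overlap measurement can be sketched with a plain Python set standing in for the Bloom filter (the Bloom filter only changes memory use and adds a bounded false positive rate, not the semantics). Normalization follows the description above: lower-cased alphanumeric words, single-space delimited.

```python
import re

# Normalize to lower-cased alphanumeric words, per the text.
def normalize(text):
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).split()

# All n-grams of a word list, as a set of tuples.
def ngrams(words, n=8):
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

# Percentage of a dataset's 8-grams also present in the training text.
# A set stands in for the Bloom filter here.
def overlap_pct(test_text, train_text, n=8):
    test_grams = ngrams(normalize(test_text), n)
    train_grams = ngrams(normalize(train_text), n)
    if not test_grams:
        return 0.0
    return 100.0 * len(test_grams & train_grams) / len(test_grams)
```

At WebText scale the training-set n-grams would not fit in a set, which is why a Bloom filter (membership queries with a tunable false positive bound, no deletions needed) is the appropriate structure.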
Our approach optimizes for recall, and while manual inspection of the overlaps shows many common phrases, there are many longer matches that are due to duplicated data. This is not unique to WebText. For instance, we discovered that the test set of WikiText-103 has an article which is also in the training dataset. Since there are only 60 articles in the test set there is at least an overlap of 1.6%. Potentially more worryingly, 1BW has an overlap of nearly 13.2% with its own training set according to our procedure.
For the Winograd Schema Challenge, we found only 10 schemata which had any 8-gram overlaps with the WebText training set. Of these, 2 were spurious matches. Of the remaining 8, only 1 schema appeared in any contexts that gave away the answer.
For CoQA, about 15% of documents in the news domain are already in WebText and the model performs about 3 F1 better on these. CoQA's development set metric reports the average performance over 5 different domains and we measure a gain of about 0.5-1.0 F1 due to overlap across the various domains. However, no actual training questions or answers are in WebText since CoQA was released after the cutoff date for links in WebText.
On LAMBADA, the average overlap is 1.2%. GPT-2 performs about 2 perplexity better on examples with greater than 15% overlap. Recalculating metrics when excluding all examples with any overlap shifts results from 8.6 to 8.7 perplexity and reduces accuracy from 63.2% to 62.9%. This very small change in overall results is likely due to only 1 in 200 examples having significant overlap.
Overall, our analysis suggests that data overlap between WebText training data and specific evaluation datasets provides a small but consistent benefit to reported results. However, for most datasets we do not notice significantly larger overlaps than those already existing between standard training and test sets, as Table 6 highlights.
Understanding and quantifying how highly similar text impacts performance is an important research question. Better de-duplication techniques such as scalable fuzzy matching could also help better answer these questions. For now, we recommend the use of n-gram overlap based de-duplication as an important verification step and sanity check during the creation of training and test splits for new NLP datasets.
Another potential way of determining whether the performance of WebText LMs is attributable to memorization is inspecting their performance on their own held-out set. As shown in Figure 4, performance on both the training and test sets of WebText are similar and improve together as model size is increased. This suggests even GPT-2 is still underfitting on WebText in many ways.
GPT-2 is also able to write news articles about the discovery of talking unicorns. An example is provided in Table 13.
Related Work
A significant portion of this work measured the performance of larger language models trained on larger datasets. This is similar to the work of Jozefowicz et al. which scaled RNN based language models on the 1 Billion Word Benchmark. Bajgar et al. also previously improved results on the Children's Book Test by creating a much larger training dataset out of Project Gutenberg to supplement the standard training dataset. Hestness et al. conducted a thorough analysis of how the performance of various deep learning models changes as a function of both model capacity and dataset size. Our experiments, while much noisier across tasks, suggest similar trends hold for sub-tasks of an objective and continue into the 1B+ parameter regime.
Interesting learned functionality in generative models has been documented before, such as the cells in an RNN language model performing line-width tracking and quote/comment detection (Karpathy et al.). More inspirational to our work was the observation of Liu et al. that a model trained to generate Wikipedia articles also learned to translate names between languages.
Previous work has explored alternative approaches to filtering and constructing a large text corpus of web pages, such as the iWeb Corpus (Davies).
There has been extensive work on pre-training methods for language tasks. In addition to those mentioned in the introduction, GloVe scaled word vector representation learning to all of Common Crawl. An influential early work on deep representation learning for text was Skip-thought Vectors (Kiros et al.). McCann et al. explored the use of representations derived from machine translation models and Howard & Ruder improved the RNN based fine-tuning approaches of Dai & Le. Conneau et al. studied the transfer performance of representations learned by natural language inference models and Subramanian et al. explored large-scale multitask training.
Ramachandran et al. demonstrated that seq2seq models benefit from being initialized with pre-trained language models as encoders and decoders. More recent work has shown that LM pre-training is helpful when fine-tuned for difficult generation tasks like chit-chat dialog and dialog based question answering systems as well (Wolf et al.) (Dinan et al.).
Discussion
Much research has been dedicated to learning, understanding, and critically evaluating the representations of both supervised and unsupervised pre-training methods. Our results suggest that unsupervised task learning is an additional promising area of research to explore. These findings potentially help explain the widespread success of pre-training techniques for down-stream NLP tasks as we show that, in the limit, one of these pre-training techniques begins to learn to perform tasks directly without the need for supervised adaption or modification.
On reading comprehension the performance of GPT-2 is competitive with supervised baselines in a zero-shot setting. However, on other tasks such as summarization, while it is qualitatively performing the task, its performance is still only rudimentary according to quantitative metrics. While suggestive as a research result, in terms of practical applications, the zero-shot performance of GPT-2 is still far from usable.
We have studied the zero-shot performance of WebText LMs on many canonical NLP tasks, but there are many additional tasks that could be evaluated. There are undoubtedly many practical tasks where the performance of GPT-2 is still no better than random. Even on common tasks that we evaluated on, such as question answering and translation, language models only begin to outperform trivial baselines when they have sufficient capacity.
While zero-shot performance establishes a baseline of the potential performance of GPT-2 on many tasks, it is not clear where the ceiling is with finetuning. On some tasks, GPT-2's fully abstractive output is a significant departure from the extractive pointer network based outputs which are currently state of the art on many question answering and reading comprehension datasets. Given the prior success of fine-tuning GPT, we plan to investigate fine-tuning on benchmarks such as decaNLP and GLUE, especially since it is unclear whether the additional training data and capacity of GPT-2 is sufficient to overcome the inefficiencies of uni-directional representations demonstrated by BERT.
Conclusion
When a large language model is trained on a sufficiently large and diverse dataset it is able to perform well across many domains and datasets. GPT-2 zero-shots to state of the art performance on 7 out of 8 tested language modeling datasets. The diversity of tasks the model is able to perform in a zero-shot setting suggests that high-capacity models trained to maximize the likelihood of a sufficiently varied text corpus begin to learn how to perform a surprising amount of tasks without the need for explicit supervision.