Improving Language Understanding by Generative Pre-Training

Abstract

Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).


Introduction

The ability to learn effectively from raw text is crucial to alleviating the dependence on supervised learning in natural language processing (NLP). Most deep learning methods require substantial amounts of manually labeled data, which restricts their applicability in many domains that suffer from a dearth of annotated resources. In these situations, models that can leverage linguistic information from unlabeled data provide a valuable alternative to gathering more annotation, which can be time-consuming and expensive. Further, even in cases where considerable supervision is available, learning good representations in an unsupervised fashion can provide a significant performance boost. The most compelling evidence for this so far has been the extensive use of pre-trained word embeddings to improve performance on a range of NLP tasks.


Leveraging more than word-level information from unlabeled text, however, is challenging for two main reasons. First, it is unclear what type of optimization objectives are most effective at learning text representations that are useful for transfer. Recent research has looked at various objectives such as language modeling, machine translation, and discourse coherence, with each method outperforming the others on different tasks. Second, there is no consensus on the most effective way to transfer these learned representations to the target task. Existing techniques involve a combination of making task-specific changes to the model architecture, using intricate learning schemes and adding auxiliary learning objectives. These uncertainties have made it difficult to develop effective semi-supervised learning approaches for language processing.


In this paper, we explore a semi-supervised approach for language understanding tasks using a combination of unsupervised pre-training and supervised fine-tuning. Our goal is to learn a universal representation that transfers with little adaptation to a wide range of tasks. We assume access to a large corpus of unlabeled text and several datasets with manually annotated training examples (target tasks). Our setup does not require these target tasks to be in the same domain as the unlabeled corpus. We employ a two-stage training procedure. First, we use a language modeling objective on the unlabeled data to learn the initial parameters of a neural network model. Subsequently, we adapt these parameters to a target task using the corresponding supervised objective.


For our model architecture, we use the Transformer, which has been shown to perform strongly on various tasks such as machine translation, document generation, and syntactic parsing. This model choice provides us with a more structured memory for handling long-term dependencies in text, compared to alternatives like recurrent networks, resulting in robust transfer performance across diverse tasks. During transfer, we utilize task-specific input adaptations derived from traversal-style approaches, which process structured text input as a single contiguous sequence of tokens. As we demonstrate in our experiments, these adaptations enable us to fine-tune effectively with minimal changes to the architecture of the pre-trained model.


We evaluate our approach on four types of language understanding tasks – natural language inference, question answering, semantic similarity, and text classification. Our general task-agnostic model outperforms discriminatively trained models that employ architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), 1.5% on textual entailment (MultiNLI) and 5.5% on the recently introduced GLUE multi-task benchmark. We also analyze zero-shot behaviors of the pre-trained model in four different settings and demonstrate that it acquires useful linguistic knowledge for downstream tasks.


Related Work

Semi-supervised learning for NLP Our work broadly falls under the category of semi-supervised learning for natural language. This paradigm has attracted significant interest, with applications to tasks like sequence labeling or text classification. The earliest approaches used unlabeled data to compute word-level or phrase-level statistics, which were then used as features in a supervised model. Over the last few years, researchers have demonstrated the benefits of using word embeddings, which are trained on unlabeled corpora, to improve performance on a variety of tasks. These approaches, however, mainly transfer word-level information, whereas we aim to capture higher-level semantics.


Recent approaches have investigated learning and utilizing more than word-level semantics from unlabeled data. Phrase-level or sentence-level embeddings, which can be trained using an unlabeled corpus, have been used to encode text into suitable vector representations for various target tasks.


Unsupervised pre-training Unsupervised pre-training is a special case of semi-supervised learning where the goal is to find a good initialization point instead of modifying the supervised learning objective. Early works explored the use of the technique in image classification and regression tasks. Subsequent research demonstrated that pre-training acts as a regularization scheme, enabling better generalization in deep neural networks. In recent work, the method has been used to help train deep neural networks on various tasks like image classification, speech recognition, entity disambiguation and machine translation.


The closest line of work to ours involves pre-training a neural network using a language modeling objective and then fine-tuning it on a target task with supervision. Dai et al. and Howard and Ruder follow this method to improve text classification. However, although the pre-training phase helps capture some linguistic information, their usage of LSTM models restricts their prediction ability to a short range. In contrast, our choice of transformer networks allows us to capture longer-range linguistic structure, as demonstrated in our experiments. Further, we also demonstrate the effectiveness of our model on a wider range of tasks including natural language inference, paraphrase detection and story completion. Other approaches use hidden representations from a pre-trained language or machine translation model as auxiliary features while training a supervised model on the target task. This involves a substantial amount of new parameters for each separate target task, whereas we require minimal changes to our model architecture during transfer.


Auxiliary training objectives Adding auxiliary unsupervised training objectives is an alternative form of semi-supervised learning. Early work by Collobert and Weston used a wide variety of auxiliary NLP tasks such as POS tagging, chunking, named entity recognition, and language modeling to improve semantic role labeling. More recently, Rei added an auxiliary language modeling objective to their target task objective and demonstrated performance gains on sequence labeling tasks. Our experiments also use an auxiliary objective, but as we show, unsupervised pre-training already learns several linguistic aspects relevant to target tasks.


Framework

Our training procedure consists of two stages. The first stage is learning a high-capacity language model on a large corpus of text. This is followed by a fine-tuning stage, where we adapt the model to a discriminative task with labeled data.


3.1 Unsupervised pre-training

Given an unsupervised corpus of tokens $\mathcal{U} = \{u_1, \ldots, u_n\}$, we use a standard language modeling objective to maximize the following likelihood:


$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta) \tag{1}$$

where $k$ is the size of the context window, and the conditional probability $P$ is modeled using a neural network with parameters $\Theta$. These parameters are trained using stochastic gradient descent.

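To make the objective concrete, here is a minimal sketch in which a count-based bigram model with add-one smoothing stands in for the neural network with parameters $\Theta$; the function names and toy corpus are illustrative, not from the paper.

```python
import math
from collections import Counter

def train_counts(tokens):
    """Toy stand-in for the neural model P(.; Theta): add-one-smoothed bigram counts."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    vocab = sorted(set(tokens))
    def cond_prob(u_i, context):
        prev = context[-1]  # a true k-window model would condition on all of `context`
        return (bigrams[(prev, u_i)] + 1) / (contexts[prev] + len(vocab))
    return cond_prob

def lm_objective(tokens, k, cond_prob):
    """Eq. (1): L1(U) = sum_i log P(u_i | u_{i-k}, ..., u_{i-1}); higher is better."""
    total = 0.0
    for i in range(1, len(tokens)):
        window = tokens[max(0, i - k):i]  # context window of size at most k
        total += math.log(cond_prob(tokens[i], window))
    return total

corpus = "the cat sat on the mat the cat sat".split()
p = train_counts(corpus)
L1 = lm_objective(corpus, 2, p)  # log-likelihood of the corpus under the toy model
```

Maximizing $L_1$ with respect to the model's parameters (here, implicitly, the counts) is exactly what stochastic gradient descent does for the neural parameterization.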

In our experiments, we use a multi-layer Transformer decoder for the language model, which is a variant of the Transformer. This model applies a multi-headed self-attention operation over the input context tokens, followed by position-wise feed-forward layers, to produce an output distribution over target tokens:


$$h_0 = U W_e + W_p$$
$$h_l = \mathrm{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n] \tag{2}$$
$$P(u) = \mathrm{softmax}(h_n W_e^{\top})$$

where $U = (u_{-k}, \ldots, u_{-1})$ is the context vector of tokens, $n$ is the number of layers, $W_e$ is the token embedding matrix, and $W_p$ is the position embedding matrix.

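The forward pass of Eq. (2) can be sketched in plain Python. A crude causal-averaging block stands in here for real masked multi-headed self-attention plus feed-forward layers, and all matrices are made-up toy values (vocabulary of 4, embedding dimension 3, context length 2); none of this is the paper's actual implementation.

```python
import math

def softmax(v):
    m = max(v)  # subtract max for numerical stability
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

# Hypothetical toy parameters.
W_e = [[0.1, 0.0, 0.2],
       [0.0, 0.3, 0.1],
       [0.2, 0.1, 0.0],
       [0.1, 0.1, 0.1]]          # token embedding matrix (vocab x d)
W_p = [[0.01, 0.02, 0.03],
       [0.02, 0.01, 0.00]]       # position embedding matrix (context x d)

def transformer_block(h):
    """Stand-in for self-attention + feed-forward: each position averages its
    causal prefix (positions <= itself), then applies a fixed nonlinearity."""
    out = []
    for i in range(len(h)):
        prefix = h[:i + 1]
        avg = [sum(col) / len(prefix) for col in zip(*prefix)]
        out.append([math.tanh(x) for x in avg])
    return out

def forward(context_ids, n_layers=2):
    # h0 = U We + Wp: embedding lookup plus position embeddings
    h = [[e + p for e, p in zip(W_e[u], W_p[i])] for i, u in enumerate(context_ids)]
    # h_l = transformer_block(h_{l-1}) for l in [1, n]
    for _ in range(n_layers):
        h = transformer_block(h)
    # P(u) = softmax(h_n We^T): logits are inner products with token embeddings
    logits = [sum(a * b for a, b in zip(h[-1], row)) for row in W_e]
    return softmax(logits)

dist = forward([0, 2])  # distribution over the next token given context (0, 2)
```

Note the weight tying in the output layer: the same matrix $W_e$ embeds inputs and scores output tokens, matching the $W_e^{\top}$ in Eq. (2).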

3.2 Supervised fine-tuning

After training the model with the objective in Eq. 1, we adapt the parameters to the supervised target task. We assume a labeled dataset $\mathcal{C}$, where each instance consists of a sequence of input tokens, $x^1, \ldots, x^m$, along with a label $y$. The inputs are passed through our pre-trained model to obtain the final transformer block's activation $h_l^m$, which is then fed into an added linear output layer with parameters $W_y$ to predict $y$:


$$P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(h_l^m W_y) \tag{3}$$

This gives us the following objective to maximize:


$$L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m) \tag{4}$$

We additionally found that including language modeling as an auxiliary objective during fine-tuning helped learning by (a) improving generalization of the supervised model, and (b) accelerating convergence. This is in line with prior work, which also observed improved performance with such an auxiliary objective. Specifically, we optimize the following objective (with weight $\lambda$):


$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C}) \tag{5}$$

Overall, the only extra parameters we require during fine-tuning are $W_y$ and the embeddings for delimiter tokens (described below in Section 3.3).

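Eqs. (3)–(5) can be worked through on toy numbers: a single linear head $W_y$ turns a final-layer activation (standing in for $h_l^m$) into class probabilities, and the fine-tuning loss adds the language-modeling term with weight $\lambda$. All values below are hypothetical, and the auxiliary log-likelihood is passed in as a precomputed number rather than recomputed from a model.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def classify(h, W_y):
    """Eq. (3): P(y | x^1..x^m) = softmax(h_l^m W_y) for one example."""
    logits = [sum(a * b for a, b in zip(h, col)) for col in zip(*W_y)]
    return softmax(logits)

def fine_tune_objective(examples, W_y, lm_log_lik, lam=0.5):
    """Eq. (5): L3(C) = L2(C) + lambda * L1(C). `examples` pairs a final-layer
    activation with a gold label index; `lm_log_lik` is the auxiliary LM term."""
    L2 = sum(math.log(classify(h, W_y)[y]) for h, y in examples)  # Eq. (4)
    return L2 + lam * lm_log_lik

# Hypothetical toy values: 3-dim activations, 2 classes.
W_y = [[0.5, -0.5],
       [0.2, 0.1],
       [-0.3, 0.4]]
examples = [([1.0, 0.2, -0.1], 0), ([-0.4, 0.1, 0.9], 1)]
L3 = fine_tune_objective(examples, W_y, lm_log_lik=-12.0, lam=0.5)
```

Since both terms are log-likelihoods, $L_3$ is maximized during training; $\lambda$ trades off fitting the labels against retaining the language-modeling behavior.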

3.3 Task-specific input transformations

For some tasks, like text classification, we can directly fine-tune our model as described above. Certain other tasks, like question answering or textual entailment, have structured inputs such as ordered sentence pairs, or triplets of document, question, and answers. Since our pre-trained model was trained on contiguous sequences of text, we require some modifications to apply it to these tasks. Previous work proposed learning task specific architectures on top of transferred representations. Such an approach re-introduces a significant amount of task-specific customization and does not use transfer learning for these additional architectural components. Instead, we use a traversal-style approach, where we convert structured inputs into an ordered sequence that our pre-trained model can process. These input transformations allow us to avoid making extensive changes to the architecture across tasks. We provide a brief description of these input transformations below and Figure 1 provides a visual illustration. All transformations include adding randomly initialized start and end tokens ($\langle s \rangle$, $\langle e \rangle$).


Textual entailment For entailment tasks, we concatenate the premise $p$ and hypothesis $h$ token sequences, with a delimiter token ($) in between.


Similarity For similarity tasks, there is no inherent ordering of the two sentences being compared. To reflect this, we modify the input sequence to contain both possible sentence orderings (with a delimiter in between) and process each independently to produce two sequence representations $h_l^m$ which are added element-wise before being fed into the linear output layer.


Question Answering and Commonsense Reasoning For these tasks, we are given a context document $z$, a question $q$, and a set of possible answers $\{a_k\}$. We concatenate the document context and question with each possible answer, adding a delimiter token in between to get $[z; q; \$; a_k]$. Each of these sequences is processed independently with our model and then normalized via a softmax layer to produce an output distribution over possible answers.

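The three transformations above amount to simple token-sequence assembly. In this sketch the $\langle s \rangle$, $\langle e \rangle$, and $ delimiter tokens follow the paper, while the function names and the string spellings of the special tokens are hypothetical.

```python
# Special tokens; in the real model these map to randomly initialized embeddings.
S, E, DELIM = "<s>", "<e>", "$"

def entailment_input(premise, hypothesis):
    """Entailment: [<s>; premise; $; hypothesis; <e>]."""
    return [S] + premise + [DELIM] + hypothesis + [E]

def similarity_inputs(sent_a, sent_b):
    """Similarity: both orderings; the two resulting representations h_l^m
    are added element-wise downstream, before the linear output layer."""
    return [entailment_input(sent_a, sent_b), entailment_input(sent_b, sent_a)]

def qa_inputs(document, question, answers):
    """QA / commonsense reasoning: one sequence [<s>; z; q; $; a_k; <e>]
    per candidate answer; a softmax over the per-sequence scores follows."""
    return [[S] + document + question + [DELIM] + ans + [E] for ans in answers]

seq = entailment_input(["a", "man", "sleeps"], ["a", "person", "rests"])
```

Because every task is reduced to one or more flat token sequences, the pre-trained model consumes them unchanged; only the scoring head on top differs per task.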

Conclusion

We introduced a framework for achieving strong natural language understanding with a single task-agnostic model through generative pre-training and discriminative fine-tuning. By pre-training on a diverse corpus with long stretches of contiguous text, our model acquires significant world knowledge and an ability to process long-range dependencies, which are then successfully transferred to solving discriminative tasks such as question answering, semantic similarity assessment, entailment determination, and text classification, improving the state of the art on 9 of the 12 datasets we study. Using unsupervised (pre-)training to boost performance on discriminative tasks has long been an important goal of Machine Learning research. Our work suggests that achieving significant performance gains is indeed possible, and offers hints as to which models (Transformers) and datasets (text with long-range dependencies) work best with this approach. We hope that this will help enable new research into unsupervised learning, for both natural language understanding and other domains, further improving our understanding of how and when unsupervised learning works.
