Training language models to follow instructions with human feedback
Abstract
Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through a language model API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
Introduction
Large language models (LMs) can be prompted to perform a range of natural language processing (NLP) tasks, given some examples of the task as input. However, these models often express unintended behaviors such as making up facts, generating biased or toxic text, or simply not following user instructions. This is because the language modeling objective used for many recent large LMs—predicting the next token on a webpage from the internet—is different from the objective "follow the user's instructions helpfully and safely". Thus, we say that the language modeling objective is misaligned. Averting these unintended behaviors is especially important for language models that are deployed and used in hundreds of applications.
We make progress on aligning language models by training them to act in accordance with the user's intention. This encompasses both explicit intentions such as following instructions and implicit intentions such as staying truthful, and not being biased, toxic, or otherwise harmful. Using the language of Askell et al., we want language models to be helpful (they should help the user solve their task), honest (they shouldn't fabricate information or mislead the user), and harmless (they should not cause physical, psychological, or social harm to people or the environment). We elaborate on the evaluation of these criteria in Section 3.5.
We focus on fine-tuning approaches to aligning language models. Specifically, we use reinforcement learning from human feedback (RLHF) to fine-tune GPT-3 to follow a broad class of written instructions (see Figure 2). This technique uses human preferences as a reward signal to fine-tune our models. We first hire a team of 40 contractors to label our data, based on their performance on a screening test (see Section 3.3 and Appendix B.1 for more details). We then collect a dataset of human-written demonstrations of the desired output behavior on (mostly English) prompts submitted to a language model API and some labeler-written prompts, and use this to train our supervised learning baselines. Next, we collect a dataset of human-labeled comparisons between outputs from our models on a larger set of API prompts. We then train a reward model (RM) on this dataset to predict which model output our labelers would prefer. Finally, we use this RM as a reward function and fine-tune our supervised learning baseline to maximize this reward using the PPO algorithm. We illustrate this process in Figure 2. This procedure aligns the behavior of GPT-3 to the stated preferences of a specific group of people (mostly our labelers and researchers), rather than any broader notion of "human values"; we discuss this further in Appendix G.2. We call the resulting models InstructGPT.
We mainly evaluate our models by having our labelers rate the quality of model outputs on our test set, consisting of prompts from held-out users (who are not represented in the training data). We also conduct automatic evaluations on a range of public NLP datasets. We train three model sizes (1.3B, 6B, and 175B parameters), and all of our models use the GPT-3 architecture. Our main findings are:
Labelers significantly prefer InstructGPT outputs over outputs from GPT-3. Outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having over 100x fewer parameters. These models have the same architecture, and differ only by the fact that InstructGPT is fine-tuned on our human data. This result holds true even when we add a few-shot prompt to GPT-3 to make it better at following instructions. Outputs from our 175B InstructGPT are preferred to 175B GPT-3 outputs
InstructGPT models show improvements in truthfulness over GPT-3. On the TruthfulQA benchmark, InstructGPT generates truthful and informative answers more often than GPT-3. On "closed-domain" tasks from our API prompt distribution, where the output should not contain information that is not present in the input, InstructGPT models make up information not present in the input about half as often as GPT-3 (a 21% vs. 41% hallucination rate, respectively).
InstructGPT shows small improvements in toxicity over GPT-3, but not bias. To measure toxicity, we use the RealToxicityPrompts dataset and conduct both automatic and human evaluations. InstructGPT models generate about 25% fewer toxic outputs than GPT-3 when prompted to be respectful. InstructGPT does not significantly improve over GPT-3 on the Winogender and CrowS-Pairs datasets.
We can minimize performance regressions on public NLP datasets by modifying our RLHF fine-tuning procedure. During RLHF fine-tuning, we observe performance regressions compared to GPT-3 on certain public NLP datasets. We can greatly reduce the performance regressions on these datasets by mixing PPO updates with updates that increase the log likelihood of the pretraining distribution (PPO-ptx), without compromising labeler preference scores.
Our models generalize to the preferences of "held-out" labelers that did not produce any training data. To test the generalization of our models, we conduct a preliminary experiment with held-out labelers, and find that they prefer InstructGPT outputs to outputs from GPT-3 at about the same rate as our training labelers. However, more work is needed to study how these models perform on broader groups of users, and how they perform on inputs where humans disagree about the desired behavior.
Public NLP datasets are not reflective of how our language models are used. We compare GPT-3 fine-tuned on our human preference data (i.e. InstructGPT) to GPT-3 fine-tuned on two different compilations of public NLP tasks: the FLAN and T0 datasets (in particular, the T0++ variant of the latter). These datasets consist of a variety of NLP tasks, combined with natural language instructions for each task. On our API prompt distribution, our FLAN and T0 models perform slightly worse than our SFT baseline, and labelers significantly prefer InstructGPT to these models.
InstructGPT models show promising generalization to instructions outside of the RLHF fine-tuning distribution. We qualitatively probe InstructGPT's capabilities, and find that it is able to follow instructions for summarizing code, answer questions about code, and sometimes follows instructions in different languages, despite these instructions being very rare in the fine-tuning distribution. This result is exciting because it suggests that our models are able to generalize the notion of "following instructions." They retain some alignment even on tasks for which they get very little direct supervision.
InstructGPT still makes simple mistakes. For example, InstructGPT can still fail to follow instructions, make up facts, give long hedging answers to simple questions, or fail to detect instructions with false premises.
Overall, our results indicate that fine-tuning large language models using human preferences significantly improves their behavior on a wide range of tasks, though much work remains to be done to improve their safety and reliability.
Related work
Research on alignment and learning from human feedback. We build on previous techniques to align models with human intentions, particularly reinforcement learning from human feedback (RLHF). Originally developed for training simple robots in simulated environments and Atari games, it has recently been applied to fine-tuning language models to summarize text. This work is in turn influenced by similar work using human feedback as a reward in domains such as dialogue, translation, semantic parsing, story generation, review generation, and evidence extraction. In concurrent work, Askell et al. propose language assistants as a testbed for alignment research, and train models using RLHF. Our work can be seen as a direct application of RLHF to aligning language models on a broad distribution of language tasks.
Training language models to follow instructions. Our work is also related to research on cross-task generalization in language models, where LMs are fine-tuned on a broad range of public NLP datasets (usually prefixed with an appropriate instruction) and evaluated on a different set of NLP tasks. There has been a range of work in this domain, which differ in training and evaluation data, formatting of instructions, size of pretrained models, and other experimental details.
Mitigating the harms of language models. A goal of modifying the behavior of language models is to mitigate the harms of these models when they're deployed in the real world. These risks have been extensively documented. Language models can produce biased outputs, leak private data, generate misinformation, and be used maliciously; for a thorough review we direct the reader to Weidinger et al. There are many ways to mitigate these harms, including by fine-tuning on a small, value-targeted dataset, filtering the pretraining dataset, or human-in-the-loop data collection.
Methods and experimental details
3.1 High-level methodology
Our methodology follows that of Ziegler et al. and Stiennon et al., who applied it in the stylistic continuation and summarization domains. We start with a pretrained language model, a distribution of prompts on which we want our model to produce aligned outputs, and a team of trained human labelers (see Section 3.3 for details). We then apply the following three steps (Figure 2).
Step 1: Collect demonstration data, and train a supervised policy. Our labelers provide demonstrations of the desired behavior on the input prompt distribution (see Section 3.2 for details on this distribution). We then fine-tune a pretrained GPT-3 model on this data using supervised learning.
Step 2: Collect comparison data, and train a reward model. We collect a dataset of comparisons between model outputs, where labelers indicate which output they prefer for a given input. We then train a reward model to predict the human-preferred output.
Step 3: Optimize a policy against the reward model using PPO. We use the output of the RM as a scalar reward. We fine-tune the supervised policy to optimize this reward using the PPO algorithm.
Steps 2 and 3 can be iterated continuously; more comparison data is collected on the current best policy, which is used to train a new RM and then a new policy. In practice, most of our comparison data comes from our supervised policies, with some coming from our PPO policies.
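The three steps above can be sketched end-to-end. The snippet below is a deliberately tiny stand-in, not the paper's implementation: the "policy" is a lookup over candidate responses, the "reward model" is a score table fit from hypothetical pairwise comparisons, and step 3's PPO optimization is replaced by a greedy argmax against that reward.

```python
from collections import defaultdict

# Step 1: supervised policy -- imitate labeler demonstrations (made-up data).
demonstrations = {"Explain gravity": "Gravity pulls masses together."}

def sft_policy(prompt, candidates):
    # Prefer the demonstrated response when one exists for this prompt.
    return demonstrations.get(prompt, candidates[0])

# Step 2: fit a "reward model" from pairwise comparisons (preferred, rejected).
comparisons = [
    ("Explain gravity", "Gravity pulls masses together.", "idk"),
    ("Explain gravity", "Gravity pulls masses together.", "Gravity is fake."),
]
reward = defaultdict(float)
for prompt, preferred, rejected in comparisons:
    reward[(prompt, preferred)] += 1.0
    reward[(prompt, rejected)] -= 1.0

# Step 3: optimize the policy against the reward model (greedy argmax here,
# standing in for PPO fine-tuning).
def rl_policy(prompt, candidates):
    return max(candidates, key=lambda y: reward[(prompt, y)])

candidates = ["idk", "Gravity pulls masses together.", "Gravity is fake."]
assert rl_policy("Explain gravity", candidates) == "Gravity pulls masses together."
```

Iterating steps 2 and 3 corresponds to collecting new comparisons on `rl_policy`'s outputs, refitting `reward`, and re-optimizing.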
3.2 Dataset
Our prompt dataset consists primarily of text prompts submitted to a commercial language model API, as well as a small number of labeler-written prompts. These prompts are very diverse and include generation, question answering, dialog, summarization, extraction, and other natural language tasks (see Appendix A). Our dataset is over 96% English. We heuristically deduplicate prompts, and ensure that the validation and test sets contain no data from users whose data is in the training set. We also filter prompts containing personally identifiable information (PII).
From these prompts, we produce three different datasets used in our fine-tuning procedure: (1) our SFT dataset, with labeler demonstrations used to train our SFT models, (2) our RM dataset, with labeler rankings of model outputs used to train our RMs, and (3) our PPO dataset, without any human labels, which are used as inputs for RLHF fine-tuning. The SFT dataset contains about 13k training prompts (from the API and labeler-written), the RM dataset has 33k training prompts (from the API and labeler-written), and the PPO dataset has 31k training prompts (only from the API). More details on dataset sizes are provided in Table 3.
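As a concrete illustration of the leakage guard described above, a user-level split might look like the following sketch (user IDs, prompts, and the exact-match dedup heuristic are all made up for illustration):

```python
import random

def split_by_user(prompts_by_user, val_frac=0.2, seed=0):
    """Split prompts so no user appears in both train and validation.

    prompts_by_user: dict mapping user_id -> list of prompt strings.
    """
    users = sorted(prompts_by_user)
    random.Random(seed).shuffle(users)
    n_val = max(1, int(len(users) * val_frac))
    val_users, train_users = users[:n_val], users[n_val:]
    dedupe = lambda ps: sorted(set(ps))  # stand-in for heuristic deduplication
    train = [p for u in train_users for p in dedupe(prompts_by_user[u])]
    val = [p for u in val_users for p in dedupe(prompts_by_user[u])]
    return train, val

data = {"u1": ["write a poem", "write a poem"],
        "u2": ["summarize this"],
        "u3": ["fix this bug"]}
train, val = split_by_user(data)
assert not set(train) & set(val)         # no prompt leaks across the split
assert train.count("write a poem") <= 1  # exact duplicates removed
```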
3.3 Human data collection
To produce our demonstration and comparison data, and to conduct our main evaluations, we hired a team of about 40 contractors on Upwork and through ScaleAI. Compared to earlier work that collects human preference data on the task of summarization, our inputs span a much broader range of tasks, and can occasionally include controversial and sensitive topics. Our aim was to select a group of labelers who were sensitive to the preferences of different demographic groups, and who were good at identifying outputs that were potentially harmful. Thus, we conducted a screening test designed to measure labeler performance on these axes (see Appendix B.1). As an initial study to see how well our model generalizes to the preferences of other labelers, we hire a separate set of labelers who do not produce any of the training data. These labelers are sourced from the same vendors, but do not undergo a screening test.
Despite the complexity of the task, we find that inter-annotator agreement rates are quite high: training labelers agree with each other
3.4 Models
Starting from GPT-3, we train models with three different techniques:
Supervised fine-tuning (SFT). We fine-tune GPT-3 on our labeler demonstrations using supervised learning. We trained for 16 epochs, using a cosine learning rate decay, and residual dropout of 0.2. We do our final SFT model selection based on the RM score on the validation set. Similarly to Wu et al., we find that our SFT models overfit on validation loss after 1 epoch; however, we find that training for more epochs helps both the RM score and human preference ratings.
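The cosine learning-rate decay mentioned above follows the standard schedule; a minimal sketch (the peak learning rate below is illustrative, not the paper's value):

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-5, lr_min=0.0):
    """Standard cosine decay from lr_max at step 0 down to lr_min at total_steps."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))

assert abs(cosine_lr(0, 100) - 1e-5) < 1e-18   # starts at the peak rate
assert abs(cosine_lr(50, 100) - 5e-6) < 1e-18  # halfway point of the decay
assert cosine_lr(100, 100) < 1e-12             # decays to (near) zero
```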
Reward modeling (RM). We fine-tune GPT-3 to take in a prompt and response, and output a scalar reward. In this paper we only use 6B RMs, as this saves a lot of compute, and we found that 175B RM training could be unstable and thus was less suitable to be used as the value function during RL (see Appendix D for more details).
In Stiennon et al., the RM is trained on a dataset of comparisons between two model outputs on the same input. They use a cross-entropy loss, with the comparisons as labels—the difference in rewards represents the log odds that one response will be preferred to the other by a human labeler. In order to speed up comparison collection, we have labelers rank between $K = 4$ and $K = 9$ responses, which produces $\binom{K}{2}$ comparisons per prompt. Specifically, the loss function for the reward model is:

$$\mathrm{loss}(\theta) = -\frac{1}{\binom{K}{2}} \, E_{(x, y_w, y_l) \sim D}\left[\log\left(\sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right)\right]$$

where $r_\theta(x, y)$ is the scalar output of the reward model for prompt $x$ and completion $y$ with parameters $\theta$, $y_w$ is the preferred completion out of the pair of $y_w$ and $y_l$, and $D$ is the dataset of human comparisons.
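The pairwise ranking loss described above can be sketched directly: the reward difference is passed through a sigmoid, and we minimize the negative log likelihood that the preferred response wins, averaged over all C(K, 2) comparisons from one prompt. This numpy snippet is an illustrative sketch, not the paper's training code:

```python
import numpy as np
from itertools import combinations

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rm_loss(rewards_ranked):
    """Pairwise ranking loss for one prompt.

    rewards_ranked: reward-model scores for the K responses to a single
    prompt, ordered from most to least preferred by the labeler.
    """
    pairs = list(combinations(range(len(rewards_ranked)), 2))  # all C(K, 2) pairs
    losses = [-np.log(sigmoid(rewards_ranked[w] - rewards_ranked[l]))
              for w, l in pairs]  # w indexes the more-preferred response
    return float(np.mean(losses))

# Scores consistent with the labeler's ranking give a lower loss than
# scores that invert it.
assert rm_loss([2.0, 0.5, -1.0]) < rm_loss([-1.0, 0.5, 2.0])
```

Treating all comparisons from one ranking as a single batch element, as above, avoids overfitting from seeing correlated pairs in separate gradient steps.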
Reinforcement learning (RL). Again following Stiennon et al., we fine-tuned the SFT model using PPO. The environment is a bandit environment which presents a random user prompt and expects a response to the prompt. Given the prompt and response, it produces a reward determined by the reward model and ends the episode. In addition, we add a per-token KL penalty from the SFT model at each token to mitigate over-optimization of the reward model. The value function is initialized from the RM. We call these models “PPO.”
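A minimal sketch of the per-token reward in this bandit setup (function names and the beta value are illustrative; this is not the paper's code): the RM score arrives only at the final token, while every token is penalized by beta times the log-probability gap between the current policy and the frozen SFT model.

```python
import numpy as np

def per_token_rewards(rm_score, logprobs_policy, logprobs_sft, beta=0.1):
    """Combine the episode-end RM reward with a per-token KL penalty.

    logprobs_*: per-token log-probs of the sampled response under the current
    policy and under the frozen SFT model; beta is the KL penalty coefficient.
    """
    kl = np.asarray(logprobs_policy) - np.asarray(logprobs_sft)
    rewards = -beta * kl       # penalty at every token of the response
    rewards[-1] += rm_score    # RM score granted when the episode ends
    return rewards

r = per_token_rewards(1.0, [-0.5, -0.4], [-0.6, -0.9], beta=0.1)
# The policy is more confident than SFT on both tokens, so each is penalized.
assert r[0] < 0
assert abs(r[-1] - 0.95) < 1e-12
```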
We also experiment with mixing the pretraining gradients into the PPO gradients, in order to fix the performance regressions on public NLP datasets (see Appendix D.4). We call these models “PPO-ptx.” Unless otherwise specified, in this paper InstructGPT refers to the PPO-ptx models.
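The PPO-ptx mixing can be written as a combined loss; the sketch below is illustrative (the coefficient name gamma and its default value are placeholders, not the paper's settings):

```python
def ppo_ptx_loss(ppo_loss, pretrain_logprobs, gamma=1.0):
    """Mix the PPO loss with a pretraining term.

    Minimizing the negative mean log-likelihood of pretraining tokens
    increases their likelihood under the policy, weighted by gamma.
    """
    pretrain_nll = -sum(pretrain_logprobs) / len(pretrain_logprobs)
    return ppo_loss + gamma * pretrain_nll

# With gamma = 0 this reduces to plain PPO.
assert ppo_ptx_loss(1.0, [-2.0, -4.0], gamma=0.0) == 1.0
assert ppo_ptx_loss(1.0, [-2.0, -4.0], gamma=0.5) == 1.0 + 0.5 * 3.0
```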
Baselines. We compare the performance of our PPO models to our SFT models and GPT-3. We also compare to GPT-3 when it is provided a few-shot prefix to ‘prompt’ it into an instruction-following mode (GPT-3-prompted). This prefix is prepended to the user-specified instruction.
We additionally compare InstructGPT to fine-tuning 175B GPT-3 on the FLAN and T0 datasets, which both consist of a variety of NLP tasks, combined with natural language instructions for each task (they differ in the NLP datasets included, and the style of instructions used). We fine-tune them on approximately 1 million examples and choose the checkpoint which obtains the highest RM score on the validation set (see Appendix D for more details).
3.5 Evaluation
Following Askell et al., we say our models are aligned if they are helpful, truthful, and harmless (we elaborate in Appendix C.2). We divide our quantitative evaluations into two parts:
Evaluations on API distribution. Our main metric is human preference ratings on a held out set of prompts from the same source as our training distribution. When using prompts from the API for evaluation, we only select prompts by users we haven't included in training. For each model we calculate how often its outputs are preferred to a baseline policy; we choose our 175B SFT model as the baseline since its performance is near the middle of the pack. Additionally, we ask labelers to judge the overall quality of each response on a 1-7 Likert scale and collect a range of metadata for each model output (see Table 11). In particular, we collect data that aims to capture different aspects of behavior in a deployed model that could end up being harmful: we have labelers evaluate whether an output is inappropriate in the context of a customer assistant, denigrates a protected class, or contains sexual or violent content.
Evaluations on public NLP datasets. We evaluate on two types of public datasets: those that capture an aspect of language model safety, particularly truthfulness, toxicity, and bias, and those that capture zero-shot performance on traditional NLP tasks like question answering, reading comprehension, and summarization. We also conduct human evaluations on the RealToxicityPrompts dataset.
Results
4.1 Results on the API distribution
Labelers significantly prefer InstructGPT outputs over outputs from GPT-3. On our test set, our labelers significantly prefer InstructGPT outputs across model sizes (Figure 1). We find that GPT-3 outputs perform the worst, and one can obtain significant stepwise improvements by using a well-crafted few-shot prompt (GPT-3 (prompted)), then by training on demonstrations using supervised learning (SFT), and finally by training on comparison data using PPO. Adding updates on the pretraining mix during PPO does not lead to large changes in labeler preference. To illustrate the magnitude of our gains: when compared directly, 175B InstructGPT outputs are preferred to GPT-3 outputs
In Figure 4 we show that labelers also rate InstructGPT outputs favorably along several more concrete axes. Specifically, compared to GPT-3, InstructGPT outputs are more appropriate in the context of a customer assistant, more often follow explicit constraints defined in the instruction (e.g. "Write your answer in 2 paragraphs or less."), are less likely to fail to follow the correct instruction entirely, and make up facts ('hallucinate') less often in closed-domain tasks.
Our models generalize to the preferences of "held-out" labelers that did not produce any training data. Held-out labelers have similar ranking preferences as workers who we used to produce training data (see Figure 3). In particular, according to held-out workers, all of our InstructGPT models still greatly outperform the GPT-3 baselines. Thus, our InstructGPT models aren't simply overfitting to the preferences of our training labelers.
Public NLP datasets are not reflective of how our language models are used. In Figure 5a, we also compare InstructGPT to our 175B GPT-3 baselines fine-tuned on the FLAN and T0 datasets (see Appendix D for details). We find that these models perform better than GPT-3, on par with GPT-3 with a well-chosen prompt, and worse than our SFT baseline. This indicates that these datasets are not sufficiently diverse to improve performance on our API prompt distribution. We believe this is partly because academic datasets focus on tasks where performance is easily measured, like classification and QA, while our API distribution consists of mostly (about 57%) open-ended generation tasks.
4.2 Results on public NLP datasets
InstructGPT models show improvements in truthfulness over GPT-3. As measured by human evaluations on the TruthfulQA dataset, our PPO models show small but significant improvements in generating truthful and informative outputs compared to GPT-3 (see Figure 5b). This behavior is the default: our models do not have to be specifically instructed to tell the truth to exhibit improved truthfulness. Interestingly, the exception is our 1.3B PPO-ptx model, which performs slightly worse than a GPT-3 model of the same size. Our improvements in truthfulness are also evidenced by the fact that our PPO models hallucinate less often on closed-domain tasks (Figure 4).
InstructGPT shows small improvements in toxicity over GPT-3, but not bias. We first evaluate our models on the RealToxicityPrompts dataset, conducting both automatic evaluations via the Perspective API and human evaluations. Our results are in Figure 5c. We find that, when instructed to produce a safe and respectful output ("respectful prompt"), InstructGPT models generate less toxic outputs than those from GPT-3 according to the Perspective API. This advantage disappears when the respectful prompt is removed ("no prompt"). We see similar results in our human evaluations.
We can minimize performance regressions on public NLP datasets by modifying our RLHF fine-tuning procedure. In Figure 25 we show that adding pretraining updates to our PPO fine-tuning (PPO-ptx) mitigates performance regressions on public NLP datasets, and even surpasses GPT-3 on HellaSwag. The performance of the PPO-ptx model still lags behind GPT-3 on DROP, SQuADv2, and translation; more work is needed to study and further eliminate these performance regressions. We also find that mixing in pretraining updates performs better than the simpler solution of increasing the KL coefficient.
4.3 Qualitative results
InstructGPT models show promising generalization to instructions outside of the RLHF fine-tuning distribution. In particular, we find that InstructGPT shows ability to follow instructions in non-English languages, and perform summarization and question-answering for code. This is interesting because non-English languages and code form a tiny minority of our fine-tuning data, and it suggests that, in some cases, alignment methods could generalize to producing the desired behavior on inputs that humans did not directly supervise. We show some qualitative examples in Figure 26.
InstructGPT still makes simple mistakes. In interacting with our 175B PPO-ptx model, we have noticed it can still make simple mistakes, despite its strong performance on many different language tasks. To give a few examples: (1) when given an instruction with a false premise, the model sometimes incorrectly assumes the premise is true, (2) the model can overly hedge; when given a simple question, it can sometimes say that there is no one answer to the question and give multiple possible answers, even when there is one fairly clear answer from the context, and (3) the model's performance degrades when instructions contain multiple explicit constraints (e.g. "list 10 movies made in the 1930's set in France") or when constraints can be challenging for language models (e.g. writing a summary in a specified number of sentences).
We show some examples of these behaviors in Figure 27. We suspect that behavior (2) emerges partly because we instruct labelers to reward epistemic humility; thus, they may tend to reward outputs that hedge, and this gets picked up by our reward model. We suspect that behavior (1) occurs because there are few prompts in the training set that assume false premises, and our models don't generalize well to these examples. We believe both these behaviors could be dramatically reduced with adversarial data collection.
我们在图27中展示了这些行为的一些示例。我们推测行为(2)的出现部分是因为我们指示标注者奖励认知上的谦逊;因此,他们可能倾向于奖励那些含糊其辞的输出,而这被我们的奖励模型捕捉到了。我们推测行为(1)的出现是因为训练集中很少有假设错误前提的提示,我们的模型不能很好地泛化到这些示例。我们相信通过对抗性数据收集,这两种行为都可以显著减少。
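The mechanism by which hedging "gets picked up by our reward model" can be made concrete with a minimal sketch of the pairwise comparison loss used to train reward models from labeler rankings. The function and variable names below are illustrative, not from the paper's code:

```python
import math

def reward_model_loss(r_preferred: float, r_rejected: float) -> float:
    """Pairwise ranking loss on one labeler comparison:
    -log sigmoid(r_w - r_l). It is minimized when the reward
    model scores the labeler-preferred output higher."""
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If labelers consistently prefer hedged answers, training pushes the
# hedged output's score upward: a larger margin yields a lower loss.
print(reward_model_loss(2.0, 0.0) < reward_model_loss(0.0, 0.0))  # True
```

Under such a loss, any systematic labeler preference, including a preference for epistemic humility, becomes a training signal that the policy later optimizes toward, which is why adversarial data collection targeting these cases could counteract it.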
Discussion
5.1 Implications for alignment research
Our approach to alignment research in this work is iterative: we are improving the alignment of current AI systems instead of focusing abstractly on aligning AI systems that don't yet exist, which provides us with a clear empirical feedback loop of what works and what does not. We believe that this feedback loop is essential to refine our alignment techniques, and it forces us to keep pace with progress in machine learning.
5.1 对齐研究的启示
我们在本工作中采用的对齐研究方法是迭代式的:我们致力于改进当前AI系统的对齐性,而非抽象地关注对齐尚不存在的AI系统,这为我们提供了关于哪些方法有效、哪些无效的清晰实证反馈循环。我们相信,这一反馈循环对于完善对齐技术至关重要,并促使我们跟上机器学习的进步步伐。
From this work, we can draw lessons for alignment research more generally. First, the cost of increasing model alignment is modest relative to pretraining. Training our 175B SFT model requires 4.9 petaflop/s-days and training our 175B PPO-ptx model requires 60 petaflop/s-days, compared to 3,640 petaflop/s-days for GPT-3. At the same time, our results show that RLHF is very effective at making language models more helpful to users, more so than a 100x model size increase. This suggests that right now increasing investments in alignment of existing language models is more cost-effective than training larger models. Second, we've seen some evidence that InstructGPT generalizes 'following instructions' to settings that we don't supervise it in. This is an important property because it's prohibitively expensive to have humans supervise models on every task they perform. Finally, we were able to mitigate most of the performance degradations introduced by our fine-tuning. If this were not the case, these performance degradations would constitute an alignment tax—an additional cost for aligning the model. Any alignment technique with a high tax might not see adoption, and thus such a tax is important to avoid.
从这项工作中,我们可以汲取更广泛的对齐研究经验。首先,与预训练相比,提升模型对齐的成本相对较低。训练我们的175B监督微调模型需要4.9 petaflop/s-days,训练175B PPO-ptx模型需要60 petaflop/s-days,而GPT-3需要3,640 petaflop/s-days。同时,我们的结果表明,RLHF在使语言模型对用户更有帮助方面非常有效,其效果甚至超过模型规模增加100倍。这表明,目前增加对现有语言模型对齐的投资比训练更大的模型更具成本效益。其次,我们观察到一些证据表明InstructGPT能将“遵循指令”泛化到我们未监督的情境中。这是一个重要的特性,因为让人工监督模型执行的每一项任务成本过高。最后,我们能够缓解微调引入的大部分性能下降。若非如此,这些性能下降将构成对齐税——即对齐模型的额外成本。任何税负过高的对齐技术都可能无法被采纳,因此避免这样的税负至关重要。
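As a back-of-envelope check of the cost claim, using only the compute figures quoted above:

```python
# Compute figures quoted above, in petaflop/s-days.
GPT3_PRETRAIN = 3640.0  # GPT-3 175B pretraining
SFT_175B = 4.9          # supervised fine-tuning (175B SFT model)
PPO_PTX_175B = 60.0     # RLHF fine-tuning (175B PPO-ptx model)

alignment_total = SFT_175B + PPO_PTX_175B
fraction = alignment_total / GPT3_PRETRAIN
print(f"alignment compute is {fraction:.1%} of pretraining compute")
# → alignment compute is 1.8% of pretraining compute
```

Even with both fine-tuning stages combined, alignment adds under 2% of the pretraining budget, which is the sense in which the cost is called modest relative to pretraining.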
5.2 Limitations
Methodology. The behavior of our InstructGPT models is determined in part by the human feedback obtained from our contractors. Some of the labeling tasks rely on value judgments that may be impacted by the identity of our contractors, their beliefs, cultural backgrounds, and personal history. We kept our team of contractors small because this facilitates high-bandwidth communication with a smaller set of contractors who are doing the task full-time. However, this group is clearly not representative of the full spectrum of people affected by these models. As a simple example, our labelers are primarily English-speaking and our data consists almost entirely of English instructions.
5.2 局限性
方法论。 我们InstructGPT模型的行为部分取决于从承包商处获得的人类反馈。某些标注任务依赖于价值判断,这些判断可能受到承包商身份、信仰、文化背景和个人经历的影响。我们保持承包商团队较小,因为这有助于与一小部分全职从事该任务的承包商进行高带宽沟通。然而,这个群体显然不能代表受这些模型影响的全体人群。举个简单的例子,我们的标注者主要使用英语,我们的数据几乎完全由英文指令组成。
Models. Our models are neither fully aligned nor fully safe; they still generate toxic or biased outputs, make up facts, and generate sexual and violent content without explicit prompting. They can also fail to generate reasonable outputs on some inputs; we show some examples of this in Figure 27. Perhaps the greatest limitation of our models is that, in most cases, they follow the user's instruction, even if that could lead to harm in the real world. For example, when prompting the models to be maximally biased, InstructGPT generates more toxic outputs than equivalently-sized GPT-3 models.
模型。 我们的模型既未完全对齐,也未完全安全;它们仍然会生成有毒或有偏见的输出,编造事实,并在没有明确提示的情况下生成色情和暴力内容。它们也可能在某些输入上无法生成合理的输出;我们在图27中展示了一些这样的例子。或许我们模型最大的局限在于,大多数情况下,它们会遵循用户的指令,即使这在现实世界中可能导致伤害。例如,当提示模型最大化地表现出偏见时,InstructGPT比同等规模的GPT-3模型生成更多有毒的输出。
5.3 Broader impacts
This work is motivated by our aim to increase the positive impact of large language models by training them to do what a given set of humans want them to do. By default, language models optimize the next word prediction objective, which is only a proxy for what we want these models to do. Our results indicate that our techniques hold promise for making language models more helpful, truthful, and harmless. In the longer term, alignment failures could lead to more severe consequences, particularly if these models are deployed in safety-critical situations.
5.3 更广泛的影响
这项工作的动机是希望通过训练大型语言模型执行特定人群希望它们做的事情,来增加其积极影响。默认情况下,语言模型优化的是下一个词预测目标,而这只是我们期望这些模型所做事情的代理指标。我们的结果表明,我们的技术有望使语言模型更有帮助、更真实、更无害。从长远来看,对齐失败可能导致更严重的后果,尤其是在安全攸关情境中部署这些模型时。
However, making language models better at following user intentions also makes them easier to misuse. It may be easier to use these models to generate convincing misinformation, or hateful or abusive content. Alignment techniques are not a panacea for resolving safety issues associated with large language models; rather, they should be used as one tool in a broader safety ecosystem. Aside from intentional misuse, there are many domains where large language models should be deployed only with great care, or not at all. Examples include high-stakes domains such as medical diagnoses, classifying people based on protected characteristics, determining eligibility for credit, employment, or housing, generating political advertisements, and law enforcement.
然而,使语言模型更好地遵循用户意图也使其更容易被滥用。使用这些模型生成令人信服的错误信息或仇恨、辱骂性内容可能变得更加容易。对齐技术并非解决与大型语言模型相关的安全问题的万能药;相反,它们应作为更广泛安全生态系统中的一种工具。除了故意滥用之外,还有许多领域应极其谨慎地部署大型语言模型,或根本不应部署。例如高风险领域,如医疗诊断、基于受保护特征对人进行分类、确定信贷、就业或住房资格、生成政治广告以及执法等。
Finally, the question of who these models are aligned to is extremely important, and will significantly affect whether the net impact of these models is positive or negative; we discuss this in Appendix G.2.
最后,这些模型与谁对齐的问题极其重要,并将显著影响这些模型的最终影响是积极还是消极;我们在附录G.2中讨论这一点。