Abstract

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4’s performance based on models trained with no more than 1/1,000th the compute of GPT-4.

Introduction

This technical report presents GPT-4, a large multimodal model capable of processing image and text inputs and producing text outputs. Such models are an important area of study as they have the potential to be used in a wide range of applications, such as dialogue systems, text summarization, and machine translation. As such, they have been the subject of substantial interest and progress in recent years.

One of the main goals of developing such models is to improve their ability to understand and generate natural language text, particularly in more complex and nuanced scenarios. To test its capabilities in such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In these evaluations it performs quite well and often outscores the vast majority of human test takers. For example, on a simulated bar exam, GPT-4 achieves a score that falls in the top 10% of test takers. This contrasts with GPT-3.5, which scores in the bottom 10%.

On a suite of traditional NLP benchmarks, GPT-4 outperforms both previous large language models and most state-of-the-art systems (which often have benchmark-specific training or hand-engineering). On the MMLU benchmark, an English-language suite of multiple-choice questions covering 57 subjects, GPT-4 not only outperforms existing models by a considerable margin in English, but also demonstrates strong performance in other languages. On translated variants of MMLU, GPT-4 surpasses the English-language state-of-the-art in 24 of 26 languages considered. We discuss these model capability results, as well as model safety improvements and results, in more detail in later sections.

This report also discusses a key challenge of the project: developing deep learning infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to make predictions about the expected performance of GPT-4 (based on small runs trained in similar ways) that were tested against the final run to increase confidence in our training.

Despite its capabilities, GPT-4 has similar limitations to earlier GPT models: it is not fully reliable (e.g. can suffer from "hallucinations"), has a limited context window, and does not learn from experience. Care should be taken when using the outputs of GPT-4, particularly in contexts where reliability is important.

GPT-4's capabilities and limitations create significant and novel safety challenges, and we believe careful study of these challenges is an important area of research given the potential societal impact. This report includes an extensive system card (after the Appendix) describing some of the risks we foresee around bias, disinformation, over-reliance, privacy, cybersecurity, proliferation, and more. It also describes interventions we made to mitigate potential harms from the deployment of GPT-4, including adversarial testing with domain experts, and a model-assisted safety pipeline.

Scope and Limitations of this Technical Report

This report focuses on the capabilities, limitations, and safety properties of GPT-4. GPT-4 is a Transformer-style model pre-trained to predict the next token in a document, using both publicly available data (such as internet data) and data licensed from third-party providers. The model was then fine-tuned using Reinforcement Learning from Human Feedback (RLHF). Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.

We are committed to independent auditing of our technologies, and shared some initial steps and ideas in this area in the system card accompanying this release. We plan to make further technical details available to additional third parties who can advise us on how to weigh the competitive and safety considerations above against the scientific value of further transparency.

Predictable Scaling

A large focus of the GPT-4 project was building a deep learning stack that scales predictably. The primary reason is that for very large training runs like GPT-4, it is not feasible to do extensive model-specific tuning. To address this, we developed infrastructure and optimization methods that have very predictable behavior across multiple scales. These improvements allowed us to reliably predict some aspects of the performance of GPT-4 from smaller models trained using 1,000× to 10,000× less compute.

3.1 Loss Prediction

The final loss of properly-trained large language models is thought to be well approximated by power laws in the amount of compute used to train the model.

To verify the scalability of our optimization infrastructure, we predicted GPT-4's final loss on our internal codebase (not part of the training set) by fitting a scaling law with an irreducible loss term, L(C) = aC^b + c, to models trained using the same methodology but at most 10,000× less compute than GPT-4. This prediction was made shortly after the run started, without use of any partial results. The fitted scaling law predicted GPT-4's final loss with high accuracy (Figure 1).

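The fit described above can be sketched in a few lines. This is a minimal illustration with made-up (compute, loss) pairs, not the actual training runs: the scaling law L(C) = aC^b + c is fit by grid-searching the irreducible loss term c and solving the remaining log-linear problem by least squares.

```python
import math

def fit_scaling_law(compute, loss):
    """Fit L(C) = a * C**b + c by grid-searching the irreducible loss c;
    for each candidate c, log(L - c) = log(a) + b*log(C) is a linear fit."""
    best = None
    for i in range(int(min(loss) * 100)):        # candidate c values, step 0.01
        c = i / 100.0
        xs = [math.log(C) for C in compute]
        ys = [math.log(L - c) for L in loss]
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
        log_a = my - b * mx
        resid = sum((y - (log_a + b * x)) ** 2 for x, y in zip(xs, ys))
        if best is None or resid < best[0]:
            best = (resid, math.exp(log_a), b, c)
    return best[1], best[2], best[3]             # a, b, c

# Hypothetical small-run losses generated from known parameters.
law = lambda C: 50.0 * C ** -0.05 + 1.7
small_runs = [1e18, 1e19, 1e20, 1e21, 1e22]
a, b, c = fit_scaling_law(small_runs, [law(C) for C in small_runs])

# Extrapolate to a run with far more compute than the largest fit point.
predicted_loss = a * 1e25 ** b + c
```

With noiseless synthetic data the fit recovers the generating parameters exactly; on real runs one would fit in the presence of noise (e.g. with `scipy.optimize.curve_fit`).
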
3.2 Scaling of Capabilities on HumanEval

Having a sense of the capabilities of a model before training can improve decisions around alignment, safety, and deployment. In addition to predicting final loss, we developed methodology to predict more interpretable metrics of capability. One such metric is pass rate on the HumanEval dataset, which measures the ability to synthesize Python functions of varying complexity. We successfully predicted the pass rate on a subset of the HumanEval dataset by extrapolating from models trained with at most 1,000× less compute (Figure 2).

For an individual problem in HumanEval, performance may occasionally worsen with scale. Despite these challenges, we find an approximate power law relationship −E_P[log(pass_rate(C))] = α·C^(−k), where k and α are positive constants and P is a subset of problems in the dataset. We hypothesize that this relationship holds for all problems in this dataset. In practice, very low pass rates are difficult or impossible to estimate, so we restrict to problems P and models M such that, given some large sample budget, every problem is solved at least once by every model.

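Under the hypothesized relationship, predicting the mean log pass rate of a larger model reduces to a linear fit in log-log space. A sketch with invented numbers (the constants and compute values are illustrative, not the paper's):

```python
import math

def fit_pass_rate_law(computes, mean_neg_log_pass):
    """Fit -E_P[log(pass_rate(C))] = alpha * C**(-k) by linear regression
    of log(mean negative log pass rate) against log(compute)."""
    xs = [math.log(C) for C in computes]
    ys = [math.log(v) for v in mean_neg_log_pass]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return math.exp(my - slope * mx), -slope     # alpha, k

# Hypothetical mean negative log pass rates for four smaller models.
alpha_true, k_true = 2000.0, 0.15
computes = [1e19, 1e20, 1e21, 1e22]
observed = [alpha_true * C ** -k_true for C in computes]

alpha, k = fit_pass_rate_law(computes, observed)
# Extrapolated (geometric-mean) pass rate for a 100x larger model:
predicted_pass = math.exp(-(alpha * 1e24 ** -k))
```

Exponentiating the predicted mean log pass rate gives a geometric-mean pass rate over the problem subset, which is the quantity being extrapolated here.
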
We registered predictions for GPT-4's performance on HumanEval before training completed, using only information available prior to training. All but the 15 hardest HumanEval problems were split into 6 difficulty buckets based on the performance of smaller models. The results on the 3rd easiest bucket are shown in Figure 2, showing that the resulting predictions were very accurate for this subset of HumanEval problems where we can accurately estimate log(pass_rate) for several smaller models. Predictions on the other five buckets performed almost as well, the main exception being GPT-4 underperforming our predictions on the easiest bucket.

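One way to implement the bucketing described above (a sketch; the ranking statistic and bucket layout are assumptions, not the paper's exact procedure):

```python
def difficulty_buckets(small_model_pass_rates, n_buckets=6, n_hardest_dropped=15):
    """Rank problems from easiest to hardest by their mean pass rate across
    smaller models, drop the hardest ones, and split the rest into
    contiguous difficulty buckets (bucket 0 = easiest)."""
    ranked = sorted(small_model_pass_rates,
                    key=small_model_pass_rates.get, reverse=True)
    kept = ranked[:len(ranked) - n_hardest_dropped]
    size = -(-len(kept) // n_buckets)            # ceiling division
    return [kept[i * size:(i + 1) * size] for i in range(n_buckets)]

# Toy example: 45 problems with made-up pass rates (problem_0 is easiest).
rates = {f"problem_{i}": 1.0 - i / 45 for i in range(45)}
buckets = difficulty_buckets(rates)
```
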
Certain capabilities remain hard to predict. For example, the Inverse Scaling Prize proposed several tasks for which model performance decreases as a function of scale. Similarly to a recent result by Wei et al., we find that GPT-4 reverses this trend, as shown on one of the tasks called Hindsight Neglect in Figure 3.

We believe that accurately predicting future capabilities is important for safety. Going forward we plan to refine these methods and register performance predictions across various capabilities before large model training begins, and we hope this becomes a common goal in the field.

Capabilities

We tested GPT-4 on a diverse set of benchmarks, including simulating exams that were originally designed for humans. We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two. We believe the results to be representative. For further details on contamination (methodology and per-exam statistics), see Appendix C.

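The decontamination protocol amounts to scoring each exam twice and reporting conservatively. A sketch (the grading function and question format are placeholders):

```python
def reported_score(questions, contaminated_ids, grade):
    """Score the full exam and a variant with training-set-contaminated
    questions removed, and report the lower of the two."""
    full = grade(questions)
    clean = grade([q for q in questions if q["id"] not in contaminated_ids])
    return min(full, clean)

# Toy exam: the grader returns the fraction answered correctly.
exam = [{"id": i, "correct": i % 2 == 0} for i in range(10)]
grade = lambda qs: sum(q["correct"] for q in qs) / len(qs)
score = reported_score(exam, contaminated_ids={0, 2}, grade=grade)
```
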
Exams were sourced from publicly-available materials. Exam questions included both multiple-choice and free-response questions; we designed separate prompts for each format, and images were included in the input for questions which required it. The evaluation setup was designed based on performance on a validation set of exams, and we report final results on held-out test exams. Overall scores were determined by combining multiple-choice and free-response question scores using publicly available methodologies for each exam. We estimate and report the percentile each overall score corresponds to. See Appendix A for further details on the exam evaluation methodology.

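The percentile estimate itself is straightforward once a distribution of human test-taker scores is available (a sketch; in practice exam providers typically publish percentile tables rather than raw score distributions):

```python
def estimated_percentile(score, population_scores):
    """Fraction of the test-taker population scoring at or below `score`,
    expressed as a percentile."""
    at_or_below = sum(1 for s in population_scores if s <= score)
    return 100.0 * at_or_below / len(population_scores)

# Toy population of 100 test-taker scores.
population = list(range(1, 101))
```
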
GPT-4 exhibits human-level performance on the majority of these professional and academic exams. Notably, it passes a simulated version of the Uniform Bar Examination with a score in the top 10% of test takers (Table 1, Figure 4).

The model's capabilities on exams appear to stem primarily from the pre-training process and are not significantly affected by RLHF. On multiple choice questions, both the base GPT-4 model and the RLHF model perform equally well on average across the exams we tested (see Appendix B).

We also evaluated the pre-trained base GPT-4 model on traditional benchmarks designed for evaluating language models. For each benchmark we report, we ran contamination checks for test data appearing in the training set (see Appendix D for full details on per-benchmark contamination). We used few-shot prompting for all benchmarks when evaluating GPT-4.

GPT-4 considerably outperforms existing language models, as well as previously state-of-the-art (SOTA) systems which often have benchmark-specific crafting or additional training protocols (Table 2).

Many existing ML benchmarks are written in English. To gain an initial understanding of GPT-4's capabilities in other languages, we translated the MMLU benchmark – a suite of multiple-choice problems spanning 57 subjects – into a variety of languages using Azure Translate (see Appendix F for example translations and prompts). We find that GPT-4 outperforms the English-language performance of GPT-3.5 and existing language models (Chinchilla and PaLM) for the majority of languages we tested, including low-resource languages such as Latvian, Welsh, and Swahili (Figure 5).

GPT-4 substantially improves over previous models in the ability to follow user intent. On a dataset of 5,214 prompts submitted to ChatGPT and the OpenAI API, the responses generated by GPT-4 were preferred over the responses generated by GPT-3.5 on 70.2% of prompts.

We are open-sourcing OpenAI Evals, our framework for creating and running benchmarks for evaluating models like GPT-4 while inspecting performance sample by sample. Evals is compatible with existing benchmarks, and can be used to track performance of models in deployment. We plan to increase the diversity of these benchmarks over time to represent a wider set of failure modes and a harder set of tasks.

4.1 Visual Inputs

GPT-4 accepts prompts consisting of both images and text, which – parallel to the text-only setting – lets the user specify any vision or language task. Specifically, the model generates text outputs given inputs consisting of arbitrarily interlaced text and images. Over a range of domains – including documents with text and photographs, diagrams, or screenshots – GPT-4 exhibits similar capabilities as it does on text-only inputs. An example of GPT-4's visual input can be found in Table 3. The standard test-time techniques developed for language models (e.g. few-shot prompting, chain-of-thought, etc) are similarly effective when using both images and text - see Appendix G for examples.

Preliminary results on a narrow set of academic vision benchmarks can be found in the GPT-4 blog post. We plan to release more information about GPT-4's visual capabilities in follow-up work.

Limitations

Despite its capabilities, GPT-4 has similar limitations as earlier GPT models. Most importantly, it still is not fully reliable (it "hallucinates" facts and makes reasoning errors). Great care should be taken when using language model outputs, particularly in high-stakes contexts, with the exact protocol (such as human review, grounding with additional context, or avoiding high-stakes uses altogether) matching the needs of specific applications. See our System Card for details.

GPT-4 significantly reduces hallucinations relative to previous GPT-3.5 models (which have themselves been improving with continued iteration). GPT-4 scores 19 percentage points higher than our latest GPT-3.5 on our internal, adversarially-designed factuality evaluations (Figure 6).

GPT-4 makes progress on public benchmarks like TruthfulQA, which tests the model's ability to separate fact from an adversarially-selected set of incorrect statements (Figure 7). These questions are paired with factually incorrect answers that are statistically appealing. The GPT-4 base model is only slightly better at this task than GPT-3.5; however, after RLHF post-training we observe large improvements over GPT-3.5. Table 4 shows both a correct and an incorrect answer. GPT-4 resists selecting common sayings (you can't teach an old dog new tricks); however, it can still miss subtle details (Elvis Presley was not the son of an actor, so Perkins is the correct answer).

GPT-4 generally lacks knowledge of events that have occurred after the vast majority of its pre-training data cuts off in September 2021, and does not learn from its experience. It can sometimes make simple reasoning errors which do not seem to comport with competence across so many domains, or be overly gullible in accepting obviously false statements from a user. It can fail at hard problems the same way humans do, such as introducing security vulnerabilities into code it produces.

GPT-4 can also be confidently wrong in its predictions, not taking care to double-check work when it's likely to make a mistake. Interestingly, the pre-trained model is highly calibrated (its predicted confidence in an answer generally matches the probability of being correct). However, after the post-training process, the calibration is reduced (Figure 8).

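Calibration here means that when the model assigns an answer a probability p, it is correct about a fraction p of the time. A common way to summarize this is expected calibration error (ECE); a minimal sketch with made-up predictions:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence; ECE is the weighted mean gap between
    each bin's average confidence and its empirical accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# A well-calibrated model: 80% confidence, right 8 times out of 10.
calibrated = expected_calibration_error([0.8] * 10, [True] * 8 + [False] * 2)
# An overconfident model: 90% confidence, right only half the time.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
```
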
GPT-4 has various biases in its outputs that we have taken efforts to correct but which will take some time to fully characterize and manage. We aim to make GPT-4 and other systems we build have reasonable default behaviors that reflect a wide swath of users' values, allow those systems to be customized within some broad bounds, and get public input on what those bounds should be. See OpenAI for more details.

Risks & mitigations

We invested significant effort towards improving the safety and alignment of GPT-4. Here we highlight our use of domain experts for adversarial testing and red-teaming, our model-assisted safety pipeline, and the improvement in safety metrics over prior models.

Adversarial Testing via Domain Experts: GPT-4 poses similar risks as smaller language models, such as generating harmful advice, buggy code, or inaccurate information. However, the additional capabilities of GPT-4 lead to new risk surfaces. To understand the extent of these risks, we engaged over 50 experts from domains such as long-term AI alignment risks, cybersecurity, biorisk, and international security to adversarially test the model. Their findings specifically enabled us to test model behavior in high-risk areas which require niche expertise to evaluate, as well as assess risks that will become relevant for very advanced AIs such as power seeking. Recommendations and training data gathered from these experts fed into our mitigations and improvements for the model; for example, we've collected additional data to improve GPT-4's ability to refuse requests on how to synthesize dangerous chemicals (Table 5).

Model-Assisted Safety Pipeline: As with prior GPT models, we fine-tune the model's behavior using reinforcement learning with human feedback (RLHF) to produce responses better aligned with the user's intent. However, after RLHF, our models can still be brittle on unsafe inputs as well as sometimes exhibit undesired behaviors on both safe and unsafe inputs. These undesired behaviors can arise when instructions to labelers were underspecified during the reward model data collection portion of the RLHF pipeline. When given unsafe inputs, the model may generate undesirable content, such as giving advice on committing crimes. Furthermore, the model may also become overly cautious on safe inputs, refusing innocuous requests or excessively hedging. To steer our models towards appropriate behavior at a more fine-grained level, we rely heavily on our models themselves as tools. Our approach to safety consists of two main components: an additional set of safety-relevant RLHF training prompts, and rule-based reward models (RBRMs).

Our rule-based reward models (RBRMs) are a set of zero-shot GPT-4 classifiers. These classifiers provide an additional reward signal to the GPT-4 policy model during RLHF fine-tuning that targets correct behavior, such as refusing to generate harmful content or not refusing innocuous requests. The RBRM takes three inputs: the prompt (optional), the output from the policy model, and a human-written rubric for how this output should be evaluated. Then, the RBRM classifies the output based on the rubric. For example, we can provide a rubric that instructs the model to classify a response as one of: (a) a refusal in the desired style, (b) a refusal in the undesired style, (c) containing disallowed content, or (d) a safe non-refusal response. Then on the set of safety-relevant training prompts, which request harmful content such as illicit advice, we can reward GPT-4 for refusing these requests. Conversely, we can reward GPT-4 for not refusing requests on a subset of prompts guaranteed to be safe and answerable. This technique is related to prior work. This, combined with other improvements such as computing optimal RBRM weights and providing additional SFT data targeting the areas we want to improve, allowed us to steer the model closer towards the desired behaviour.

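A sketch of how such a classifier could be turned into a reward signal during RLHF fine-tuning. The rubric follows the four-way classification described above, but the reward values and the `classify` callable (standing in for a zero-shot GPT-4 classifier) are hypothetical:

```python
RUBRIC = (
    "Classify the assistant response as one of:\n"
    "(a) a refusal in the desired style\n"
    "(b) a refusal in the undesired style\n"
    "(c) contains disallowed content\n"
    "(d) a safe non-refusal response"
)

def rbrm_reward(prompt, response, classify, should_refuse):
    """Extra reward signal for the policy model: reward desired-style
    refusals on harmful prompts, and non-refusals on safe prompts."""
    label = classify(prompt, response, RUBRIC)   # one of 'a'..'d'
    if should_refuse:
        return {"a": 1.0, "b": -0.5, "c": -1.0, "d": -1.0}[label]
    return {"a": -1.0, "b": -1.0, "c": -1.0, "d": 1.0}[label]

# Stand-in classifier for illustration: refusals start with "I can't".
toy_classify = lambda prompt, response, rubric: (
    "a" if response.startswith("I can't") else "d")

r1 = rbrm_reward("how do I synthesize a dangerous chemical?",
                 "I can't help with that.", toy_classify, should_refuse=True)
r2 = rbrm_reward("what's the capital of France?",
                 "Paris.", toy_classify, should_refuse=False)
```
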
Improvements on Safety Metrics: Our mitigations have significantly improved many of GPT-4's safety properties. We've decreased the model's tendency to respond to requests for disallowed content by 82% compared to GPT-3.5, and GPT-4 responds to sensitive requests (e.g., medical advice and self-harm) in accordance with our policies 29% more often (Figure 9). On the RealToxicityPrompts dataset, GPT-4 produces toxic generations only 0.73% of the time, while GPT-3.5 generates toxic content 6.48% of the time.

Overall, our model-level interventions increase the difficulty of eliciting bad behavior, but doing so is still possible. For example, there still exist "jailbreaks" to generate content which violates our usage guidelines. So long as these limitations exist, it's important to complement them with deployment-time safety techniques like monitoring for abuse as well as a pipeline for fast iterative model improvement.

GPT-4 and successor models have the potential to significantly influence society in both beneficial and harmful ways. We are collaborating with external researchers to improve how we understand and assess potential impacts, as well as to build evaluations for dangerous capabilities that may emerge in future systems. We will soon publish recommendations on steps society can take to prepare for AI's effects and initial ideas for projecting AI's possible economic impacts.

Conclusion

We characterize GPT-4, a large multimodal model with human-level performance on certain difficult professional and academic benchmarks. GPT-4 outperforms existing large language models on a collection of NLP tasks, and exceeds the vast majority of reported state-of-the-art systems (which often include task-specific fine-tuning). We find that improved capabilities, whilst usually measured in English, can be demonstrated in many different languages. We highlight how predictable scaling allowed us to make accurate predictions on the loss and capabilities of GPT-4.

GPT-4 presents new risks due to increased capability, and we discuss some of the methods and results taken to understand and improve its safety and alignment. Though there remains much work to be done, GPT-4 represents a significant step towards broadly useful and safely deployed AI systems.
