Measuring Massive Multitask Language Understanding
Abstract
We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.
Introduction
Natural Language Processing (NLP) models have achieved superhuman performance on a number of recently proposed benchmarks. However, these models are still well below human level performance for language understanding as a whole, suggesting a disconnect between our benchmarks and the actual capabilities of these models. The General Language Understanding Evaluation benchmark (GLUE) was introduced in 2018 to evaluate performance on a wide range of NLP tasks, and top models achieved superhuman performance within a year. To address the shortcomings of GLUE, researchers designed the SuperGLUE benchmark with more difficult tasks. About a year since the release of SuperGLUE, performance is again essentially human-level. While these benchmarks evaluate linguistic skills more than overall language understanding, an array of commonsense benchmarks have been proposed to measure basic reasoning and everyday knowledge. However, these recent benchmarks have similarly seen rapid progress. Overall, the near human-level performance on these benchmarks suggests that they are not capturing important facets of language understanding.
Transformer models have driven this recent progress by pretraining on massive text corpora, including all of Wikipedia, thousands of books, and numerous websites. These models consequently see extensive information about specialized topics, most of which is not assessed by existing NLP benchmarks. It consequently remains an open question just how capable current language models are at learning and applying knowledge from many domains.
To bridge the gap between the wide-ranging knowledge that models see during pretraining and the existing measures of success, we introduce a new benchmark for assessing models across a diverse set of subjects that humans learn. We design the benchmark to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model's blind spots.
We find that meaningful progress on our benchmark has only become possible in recent months. In particular, few-shot models up to 13 billion parameters achieve random chance performance of 25% accuracy, but the 175 billion parameter GPT-3 model reaches a much higher 43.9% accuracy (see Figure 1b). On the other hand, unlike human professionals GPT-3 does not excel at any single subject. Instead, we find that performance is lopsided, with GPT-3 having almost 70% accuracy for its best subject but near-random performance for several other subjects.
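The headline comparison above can be sketched in a few lines: with four answer choices per question, random guessing yields 25% expected accuracy, and the benchmark's headline number is an unweighted mean over tasks. The per-task scores below are hypothetical, purely for illustration.

```python
# Sketch of the headline comparison: 4 answer choices imply a 25% random
# baseline; the benchmark's reported number is a macro-average over tasks.
# The per-task accuracies below are hypothetical, for illustration only.

def macro_average(task_accuracies):
    """Unweighted mean over tasks, as used for the headline number."""
    return sum(task_accuracies.values()) / len(task_accuracies)

random_chance = 1 / 4  # four answer choices per question

hypothetical_scores = {"us_foreign_policy": 0.69, "college_physics": 0.26, "marketing": 0.60}
avg = macro_average(hypothetical_scores)
print(f"random chance: {random_chance:.1%}, macro-average: {avg:.1%}")
```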
Our results indicate that while recent advances have been impressive, state-of-the-art models still struggle at learning and applying knowledge from pretraining. The tasks with near-random accuracy include calculation-heavy subjects such as physics and mathematics and subjects related to human values such as law and morality. This second weakness is particularly concerning because it will be important for future models to have a strong understanding of what is legal and what is ethical. Worryingly, we also find that GPT-3 does not have an accurate sense of what it does or does not know since its average confidence can be up to 24% off from its actual accuracy. We comprehensively evaluate the breadth and depth of a model's text understanding by covering numerous topics that humans are incentivized to learn. Since our test consists of 57 tasks, it can be used to analyze aggregate properties of models across tasks and to track important shortcomings.
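The calibration gap mentioned above can be measured by comparing a model's average stated confidence on a task against its actual accuracy. A minimal sketch, with hypothetical numbers (the paper reports gaps of up to 24 percentage points for GPT-3):

```python
# Sketch of the calibration check: the absolute difference between a
# model's mean confidence and its accuracy, in percentage points.
# All numbers here are hypothetical.

def confidence_accuracy_gap(confidences, correct):
    """|mean confidence - accuracy| in percentage points."""
    avg_conf = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return abs(avg_conf - accuracy) * 100

confs = [0.9, 0.8, 0.85, 0.95]   # model's probability on its chosen answer
right = [1, 0, 0, 1]             # whether each answer was correct
print(confidence_accuracy_gap(confs, right))  # 37.5
```

A well-calibrated model would have a gap near zero on every task.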
Related Work
Pretraining. The dominant paradigm in NLP is to pretrain large models on massive text corpora including educational books and websites. In the process, these models are exposed to information about a wide range of topics. Petroni et al. (2019) found that recent models learn enough information from pretraining that they can serve as knowledge bases. However, no prior work has comprehensively measured the knowledge models have across many real-world domains.
Until recently, researchers primarily used fine-tuned models on downstream tasks (Devlin et al., 2019). However, larger pretrained models like GPT-3 (Brown et al., 2020) have made it possible to achieve competitive performance without fine-tuning by using few-shot learning, which removes the need for a large fine-tuning set. With the advent of strong zero-shot and few-shot learning, it is now possible to curate a diverse set of tasks for evaluation and remove the possibility of models relying on "spurious cues" (Geirhos et al., 2020; Hendrycks et al., 2019b) in a dataset to achieve high performance.
Benchmarks. Many recent benchmarks aim to assess a model's general world knowledge and basic reasoning ability by testing its "commonsense." A number of commonsense benchmarks have been proposed in the past year, but recent models are already nearing human-level performance on several of these, including HellaSwag, Physical IQA, and CosmosQA. By design, these datasets assess abilities that almost every child has. In contrast, we include harder specialized subjects that people must study to learn.
Some researchers have suggested that the future of NLP evaluation should focus on Natural Language Generation (NLG), an idea that reaches back to the Turing Test. However, NLG is notoriously difficult to evaluate and lacks a standard metric. Consequently, we instead create a simple-to-evaluate test that measures classification accuracy on multiple choice questions.
While several question answering benchmarks exist, they are comparatively limited in scope. Most either cover easy topics like grade school subjects for which models can already achieve strong performance, or are focused on linguistic understanding in the form of reading comprehension. In contrast, we include a wide range of difficult subjects that go far beyond linguistic understanding.
A Multitask Test
We create a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. There are 57 tasks in total (incidentally, also the number of Atari games), all of which are listed in Appendix B. The questions in the dataset were manually collected by graduate and undergraduate students from freely available sources online. These include practice questions for tests such as the Graduate Record Examination and the United States Medical Licensing Examination. It also includes questions designed for undergraduate courses and questions designed for readers of Oxford University Press books. Some tasks cover a subject, like psychology, but at a specific level of difficulty, such as "Elementary," "High School," "College," or "Professional." For example, the "Professional Psychology" task draws on questions from freely available practice questions for the Examination for Professional Practice in Psychology, while the "High School Psychology" task has questions like those from Advanced Placement Psychology examinations.
We collected 15908 questions in total, which we split into a few-shot development set, a validation set, and a test set. The few-shot development set has 5 questions per subject; the validation set, which may be used for selecting hyperparameters, is made of 1540 questions; and the test set has 14079 questions. Each subject contains a minimum of 100 test examples, which is longer than most exams designed to assess people.
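The per-subject split described above can be sketched as follows. The function name and the per-subject validation slice size are illustrative assumptions (1540 validation questions over 57 subjects is roughly 27 each), not taken from the released code.

```python
# Sketch of the per-subject split: 5 questions to the few-shot dev set,
# a slice to validation, and the remainder (at least 100) to test.
# n_val=27 is an illustrative assumption (~1540 / 57 subjects).

def split_subject(questions, n_dev=5, n_val=27):
    dev = questions[:n_dev]
    val = questions[n_dev:n_dev + n_val]
    test = questions[n_dev + n_val:]
    assert len(test) >= 100, "each subject needs at least 100 test examples"
    return dev, val, test

qs = [f"q{i}" for i in range(150)]
dev, val, test = split_subject(qs)
print(len(dev), len(val), len(test))  # 5 27 118
```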
Human-level accuracy on this test varies. Unspecialized humans from Amazon Mechanical Turk obtain 34.5% accuracy on this test. Meanwhile, expert-level performance can be far higher. For example, real-world test-taker human accuracy at the 95th percentile is around 87% for US Medical Licensing Examinations, and these questions make up our "Professional Medicine" task. If we take the 95th percentile human test-taker accuracy for exams that build up our test, and if we make an educated guess when such information is unavailable, we then estimate that expert-level accuracy is approximately 89.8%.
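The expert-accuracy estimate above aggregates 95th-percentile test-taker accuracy across the underlying exams, falling back to an educated guess where no published figure exists. A minimal sketch; every number below is hypothetical except the ~87% USMLE figure mentioned in the text.

```python
# Sketch of the expert-accuracy estimate: average the 95th-percentile
# test-taker accuracy over underlying exams, substituting educated
# guesses where no figure is published. Values are illustrative.

def estimate_expert_accuracy(known, guessed):
    scores = list(known.values()) + list(guessed.values())
    return sum(scores) / len(scores)

known_95th = {"professional_medicine": 0.87}          # from published data
guessed_95th = {"high_school_biology": 0.95,          # educated guesses
                "college_physics": 0.90}
print(f"{estimate_expert_accuracy(known_95th, guessed_95th):.1%}")
```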
Since our test aggregates different subjects and several levels of difficulty, we measure more than straightforward commonsense or narrow linguistic understanding. Instead, we measure arbitrary real-world text understanding. Since models are pretrained on the Internet, this enables us to test how well they can extract useful knowledge from massive corpora. Future models that use this test could be single models or mixture-of-experts models. To succeed at our test, future models should be well-rounded, possess extensive world knowledge, and develop expert-level problem solving ability. These properties make the test likely to be an enduring and informative goalpost.
3.1 Humanities
The humanities is a group of disciplines that make use of qualitative analysis and analytic methods rather than scientific empirical methods. Branches of the humanities include law, philosophy, history, and so on (Appendix B). Mastering these subjects requires a variety of skills. For example, legal understanding requires knowledge of how to apply rules and standards to complex scenarios, and also of how to provide answers with stipulations and explanations. We illustrate this in Figure 2. Legal understanding is also necessary for understanding and following rules and regulations, a necessary capability to constrain open-world machine learning models. For philosophy, our questions cover concepts like logical fallacies, formal logic, and famous philosophical arguments. It also covers moral scenarios, including questions from the ETHICS dataset that test a model's understanding of normative statements through predicting widespread moral intuitions about diverse everyday scenarios. Finally, our history questions cover a wide range of time periods and geographical locations, including prehistory and other advanced subjects.
3.2 Social Science
Social science includes branches of knowledge that examine human behavior and society. Subject areas include economics, sociology, politics, geography, psychology, and so on. See Figure 3 for an example question. Our economics questions include microeconomics, macroeconomics, and econometrics, and cover different types of problems, including questions that require a mixture of world knowledge, qualitative reasoning, or quantitative reasoning. We also include important but more esoteric topics such as security studies in order to test the boundaries of what is experienced and learned during pretraining. Social science also includes psychology, a field that may be especially important for attaining a nuanced understanding of humans.
3.3 Science, Technology, Engineering, And Mathematics (STEM)
STEM subjects include physics, computer science, mathematics, and more. Two examples are shown in Figure 4. Conceptual physics tests understanding of simple physics principles and may be thought of as a harder version of the physical commonsense benchmark Physical IQA. We also test mathematical problem solving ability at various levels of difficulty, from the elementary to the college level. College mathematics questions, like those found on the GRE mathematics subject test, often require chains of reasoning and abstract knowledge. To encode mathematics expressions, we use LaTeX or symbols such as * and ^ for multiplication and exponentiation respectively. STEM subjects require knowledge of empirical methods, procedural knowledge, and fluid intelligence.
3.4 Other
There is a long tail of subjects that either do not neatly fit into any of the three preceding categories or for which there are not thousands of freely available questions. We put these subjects into Other. This section includes the Professional Medicine task, which has difficult questions that require humans many years of study to master. An example is depicted in Figure 5. This section also contains business topics like finance, accounting, and marketing, as well as knowledge of global facts. The latter includes statistics about poverty in different countries over time, which may be necessary for having an accurate model of the world internationally.
Discussion
Multimodal Understanding. While text is capable of conveying an enormous number of concepts about the world, many important concepts are conveyed mainly through other modalities, such as images, audio, and physical interaction. Existing large-scale NLP models, such as GPT-3, do not incorporate multimodal information, so we design our benchmark to capture a diverse array of tasks in a text-only format. However, as models gain the ability to process multimodal inputs, benchmarks should be designed to reflect this change. One such benchmark could be a "Turk Test," consisting of Amazon Mechanical Turk Human Intelligence Tasks. These are well-defined tasks that require models to interact with flexible formats and demonstrate multimodal understanding.
The Internet as a Training Set. A major distinction between our benchmark and previous multitask NLP benchmarks is that we do not require large training sets. Instead, we assume that models have acquired the requisite knowledge from reading vast quantities of diverse text from the Internet. This process is typically called pretraining, but it can be thought of as training in its own right, where the downstream evaluation is demonstrating whatever knowledge we would expect a human to pick up from reading the same text.
This motivates us to propose a methodological change so that models are trained more like how humans learn. While most previous machine learning benchmarks have models learn from a large question bank, humans primarily learn new subjects by reading books and listening to others talk about the topic. For specialized subjects such as Professional Law, massive legal corpora are available, such as the 164-volume legal encyclopedia Corpus Juris Secundum, but there are fewer than 5,000 multistate bar exam questions available. Learning the entire law exclusively through a small number of practice tests is implausible, so future models must learn more during pretraining.
For this reason we assess pretrained models in a zero-shot, few-shot, or transfer setting and we provide a dev, val, and test set for each task. The dev set is used for few-shot prompts, the val set could be used for hyperparameter tuning, and the test set is used to compute the final accuracy. Importantly, the format of our evaluation is not identical to the format in which information is acquired during pretraining. This has the benefit of obviating concerns about spurious training set annotation artifacts and is in stark contrast to the previous paradigm of identically distributed training and test sets. This change also enables collecting a much more extensive and diverse set of tasks for evaluation. We anticipate our methodology becoming more widespread as models improve at extracting information from diverse online sources.
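The few-shot evaluation setup described above can be sketched as prompt construction: the dev set supplies k worked examples, followed by the test question with a blank answer. The exact template below is an assumption for illustration, not the paper's verbatim format.

```python
# Sketch of a k-shot evaluation prompt: dev-set examples with answers,
# then the test question ending in a blank "Answer:". The template
# wording is an assumption, not the paper's verbatim format.

def format_example(question, choices, answer=None):
    lines = [question]
    for letter, choice in zip("ABCD", choices):
        lines.append(f"{letter}. {choice}")
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_prompt(dev_examples, test_question, test_choices, subject):
    header = f"The following are multiple choice questions about {subject}.\n\n"
    shots = "\n\n".join(format_example(q, c, a) for q, c, a in dev_examples)
    return header + shots + "\n\n" + format_example(test_question, test_choices)

dev = [("2 + 2 = ?", ["3", "4", "5", "6"], "B")]
print(build_prompt(dev, "3 * 3 = ?", ["6", "9", "12", "27"], "elementary mathematics"))
```

The model's answer is then read off as the choice letter it assigns the highest probability after the final "Answer:".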
Model Limitations. We find that current large-scale Transformers have wide room for improvement. They are notably poor at modeling human (dis)approval, as evidenced by the low performance on the Professional Law and Moral Scenarios tasks. For future systems to be aligned with human values, high performance on these tasks is crucial, so future research should especially aim to increase accuracy on these tasks. Models also have difficulty performing calculations, so much so that they exhibit poor performance on Elementary Mathematics and many other STEM subjects with "plug and chug" problems. Additionally, they do not match expert-level performance (approximately 90%) on any subject, so performance on every subject remains below expert human level. On average, models are only now starting to move beyond random-chance accuracy levels.
Addressing these shortcomings may be challenging. To illustrate this, we attempted to create a better Professional Law model by pretraining on specialized data but achieved only limited success. We collected approximately 2,000 additional Professional Law training examples. After fine-tuning a RoBERTa-base model using this custom training set, our model attained 32.8% test accuracy. To test the impact of additional specialized training data, we also had RoBERTa continue pretraining on approximately 1.6 million legal case summaries using Harvard's Law Library case law corpus (case.law), but after fine-tuning it only attained 36.1% accuracy. This suggests that while additional pretraining on relevant high quality text can help, it may not be enough to substantially increase the performance of current models.
It is unclear whether simply scaling up existing language models will solve the test. Current understanding indicates that a 10× increase in model size must be accompanied by an approximate 5× increase in data. Aside from the tremendous expense in creating multi-trillion parameter language models, data may also become a bottleneck, as there is far less written about esoteric branches of knowledge than about everyday situations.
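The scaling relationship stated above (a 10× increase in model size accompanied by roughly a 5× increase in data) implies data scaling with model size to the power log10(5) ≈ 0.7. The back-of-the-envelope calculation below is an illustration of that implied exponent, not a fitted scaling law.

```python
import math

# Back-of-the-envelope: if a 10x model needs 5x data, the implied
# data-vs-model exponent is log10(5) ~= 0.699. Illustration only,
# not a fitted scaling law.

def required_data_multiplier(model_scale_factor):
    exponent = math.log10(5)  # 10x model -> 5x data
    return model_scale_factor ** exponent

print(round(required_data_multiplier(10), 2))   # 5.0
print(round(required_data_multiplier(100), 2))  # 25.0
```

Under this extrapolation, a 100× larger model would call for roughly 25× more data, which is where the bottleneck on esoteric subjects would bite.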
Conclusion
We introduced a new test that measures how well text models can learn and apply knowledge encountered during pretraining. By covering 57 subjects at varying levels of difficulty, the test assesses language understanding in greater breadth and depth than previous benchmarks. We found that it has recently become possible for models to make meaningful progress on the test, but that state-of-the-art models have lopsided performance and rarely excel at any individual task. We also showed that current models are uncalibrated and have difficulty with tasks that require calculations. Worryingly, models also perform especially poorly on socially relevant subjects including morality and law. Our expansive test can help researchers pinpoint important shortcomings of models, making it easier to gain a clearer picture of state-of-the-art capabilities.