Measuring Mathematical Problem Solving With the MATH Dataset

Hendrycks D, Burns C, Kadavath S, et al. Measuring mathematical problem solving with the math dataset[J]. arXiv preprint arXiv:2103.03874, 2021.
https://github.com/hendrycks/math/

benchmark5500+

使用MATH数据集衡量数学问题求解能力

Abstract

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.

许多智力工作都需要数学问题求解能力，但这项技能仍然是计算机所无法企及的。为了衡量机器学习模型中的这种能力，我们引入了MATH，一个包含12,500个具有挑战性的数学竞赛问题的新数据集。MATH中的每个问题都附有完整的分步解答，可用于教导模型生成答案推导过程和解释说明。为了促进未来研究并提高在MATH上的准确率，我们还贡献了一个大型辅助预训练数据集，它有助于教导模型掌握数学基础知识。尽管我们能够提高在MATH上的准确率，但我们的结果表明，即使使用巨大的Transformer模型，准确率仍然相对较低。此外，我们发现，如果当前的扩展趋势持续下去，仅仅增加预算和模型参数数量对于实现强大的数学推理将是不切实际的。虽然扩展Transformer能够自动解决大多数其他基于文本的任务，但目前扩展规模并不能解决MATH问题。为了在数学问题求解上取得更大进展，我们可能需要更广泛的研究社区提出新的算法进展。

Introduction

Mathematics is a highly effective tool in many intellectual endeavors. It enables us to count and quantify objects, and it can be relied upon because it is consistent and based on logic. Mathematics pervades the sciences and can be used to model planetary orbits, atomic motion, signal frequencies, and much more. These phenomena can be encoded with mathematics precisely and concisely. This has even led some to describe mathematics as being "unreasonably effective" (Wigner, 1960). These observations speak to the broad reach and domain-generality of mathematics.

数学是许多智力活动中极为有效的工具。它使我们能够计数和量化物体，并且由于其一致性和基于逻辑的特性，我们可以信赖它。数学渗透于科学领域，可用于模拟行星轨道、原子运动、信号频率等等。这些现象可以用数学精确而简洁地编码。这甚至让一些人将数学描述为“不合理地有效”（Wigner, 1960）。这些观察表明了数学的广泛影响力和领域通用性。

In machine learning, mathematics is a valuable testbed for problem-solving ability: the ability to analyze a problem, pick out good heuristics from a large set of possibilities, and chain them together to produce an answer. This contrasts with plug-and-chug calculations, a skill which ML models can already exhibit (Henighan et al., 2020). Visual or linguistic reasoning may involve limited problem-solving ability for tasks such as image classification, but unlike math this is not the focus of these domains.

在机器学习中，数学是衡量问题求解能力的宝贵试验场：这种能力包括分析问题、从大量可能性中挑选出良好的启发式方法，并将它们串联起来以产生答案。这与“代入计算”形成对比，后者是机器学习模型已经能够展现的技能（Henighan et al., 2020）。视觉或语言推理可能涉及有限的问题求解能力，例如在图像分类等任务中，但与数学不同的是，这不是这些领域的焦点。

To measure the problem-solving ability of machine learning models, we introduce the MATH dataset, which consists of 12,500 problems from high school math competitions. Given a problem from MATH, machine learning models generate a sequence, such as $\frac{2}{3}$, that encodes the final answer. These answers are unique after normalization, allowing MATH to be scored with exact match rather than with heuristic metrics such as BLEU. In addition, MATH problems are tagged by difficulty from 1 to 5, and span seven subjects including geometry, where diagrams can be specified in text with the Asymptote language. This enables a fine-grained assessment of mathematical problem-solving ability across difficulties and subjects. Finally, problems come with full step-by-step solutions. By training on these, models can learn to generate their own step-by-step solutions, which can facilitate learning and make model outputs more interpretable.

为了衡量机器学习模型的问题求解能力，我们引入了MATH数据集，它包含12,500个来自高中数学竞赛的问题。给定一个MATH问题，机器学习模型生成一个序列，例如 $\frac{2}{3}$，该序列编码了最终答案。这些答案在标准化后是唯一的，使得MATH能够使用精确匹配而非BLEU等启发式指标进行评分。此外，MATH问题按难度从1到5进行标记，涵盖七个学科，包括几何学，其中图形可以用Asymptote语言以文本形式指定。这使得能够对不同难度和学科下的数学问题求解能力进行细粒度的评估。最后，问题附带完整的分步解答。通过在这些解答上进行训练，模型可以学习生成自己的分步解答，这有助于学习并使模型输出更具可解释性。

The MATH dataset is challenging: large language models achieved accuracies ranging from 3.0% to 6.9%. Despite these low accuracies, models clearly possess some mathematical knowledge: they achieve up to 15% accuracy on the easiest difficulty level, and they are able to generate step-by-step solutions that are coherent and on-topic even when incorrect. We also evaluated humans on MATH, and found that a computer science PhD student who does not especially like mathematics attained approximately 40% on MATH, while a three-time IMO gold medalist attained 90%, indicating that MATH can be challenging for humans as well.

MATH数据集具有挑战性：大型语言模型取得的准确率在3.0%到6.9%之间。尽管准确率低，但模型显然具备一定的数学知识：它们在最简单难度级别上能达到高达15%的准确率，并且能够生成连贯且切题的分步解答（即使解答本身是错误的）。我们还评估了人类在MATH上的表现，发现一位并不特别喜欢数学的计算机科学博士生在MATH上取得了约40%的准确率，而一位三次IMO金牌得主则达到了90%，这表明MATH对人类来说也可能具有挑战性。

The presence of step-by-step solutions allows models to utilize "scratch space": rather than having to generate a final answer immediately, models can first generate solutions that may contain intermediate computations. Interestingly, we found that having models generate step-by-step solutions before producing an answer actually decreased accuracy relative to immediately outputting a final answer without generating solutions, indicating the solutions are currently not useful for models at test time. In contrast, having models train on solutions increases relative accuracy by 10% compared to training on the questions and answers directly. We also find that models do better with hints in the form of partial solutions. Our results show that models can make use of actual step-by-step solutions provided to them in various ways, but that they are still unable to effectively use their own generated solutions. Bridging this gap poses an interesting direction for further research.

分步解答的存在允许模型利用“草稿空间”：模型不必立即生成最终答案，而是可以先生成可能包含中间计算的解答。有趣的是，我们发现，让模型在生成答案之前先生成分步解答，实际上相对于不生成解答而直接输出最终答案会降低准确率，这表明目前这些解答在测试时对模型没有用处。相比之下，与直接在问答上进行训练相比，让模型在解答上进行训练能使相对准确率提高10%。我们还发现，模型在得到部分解答形式的提示时表现更好。我们的结果表明，模型能够以各种方式利用提供给它们的真实分步解答，但它们仍然无法有效地利用自己生成的解答。弥合这一差距为后续研究提供了一个有趣的方向。

While MATH covers advanced problem-solving techniques, models may first need to be trained thoroughly on the fundamentals of mathematics. To address this, we create the first large-scale mathematics pretraining dataset with hundreds of thousands of step-by-step solutions in natural language and LaTeX. We call this dataset the Auxiliary Mathematics Problems and Solutions (AMPS) pretraining corpus, which consists of Khan Academy and Mathematica data. AMPS has over 100,000 Khan Academy problems with step-by-step solutions in LaTeX; these exercises are used to teach human students concepts ranging from basic addition to Stokes' Theorem. It also contains over 5 million problems generated using Mathematica scripts, based on 100 hand-designed modules covering topics such as conic sections, div grad and curl, KL divergence, eigenvalues, polyhedra, and Diophantine equations. In total AMPS contains 23GB of problems and solutions. Pretraining on AMPS enables a 0.1 billion parameter model to perform comparably to a fine-tuned model that is $130 \times$ larger.

虽然MATH涵盖了高级的问题求解技巧，但模型可能需要首先在数学基础上进行彻底训练。为了解决这个问题，我们创建了第一个大规模数学预训练数据集，包含数十万个用自然语言和LaTeX编写的分步解答。我们将此数据集称为辅助数学问题与解答（AMPS）预训练语料库，它由可汗学院和Mathematica数据组成。AMPS拥有超过10万个可汗学院问题及其LaTeX分步解答；这些练习用于教导人类学生从基础加法到斯托克斯定理的概念。它还包含超过500万个使用Mathematica脚本生成的问题，基于100个手工设计的模块，涵盖圆锥曲线、散度梯度与旋度、KL散度、特征值、多面体和丢番图方程等主题。AMPS总计包含23GB的问题和解答。在AMPS上进行预训练使得一个拥有1亿参数的模型能够与一个规模大 130 倍的微调模型性能相当。

Altogether, while large Transformer models (Vaswani et al., 2017) make some progress on the MATH dataset, such as by AMPS pretraining or by training with step-by-step solutions, accuracy nonetheless remains relatively low. While enormous Transformers pretrained on massive datasets can now solve most existing text-based tasks, this low accuracy indicates that our MATH dataset is distinctly harder. Accuracy also increases only modestly with model size: assuming a log-linear scaling trend, models would need around $10^{35}$ parameters to achieve 40% accuracy on MATH, which is impractical. Instead, to make large strides on the MATH dataset with a practical amount of resources, we will need new algorithmic advancements from the broader research community.

总的来说，尽管大型Transformer模型通过在AMPS上预训练或使用分步解答训练等方式，在MATH数据集上取得了一些进展，但准确率仍然相对较低。虽然在海量数据集上预训练的巨型Transformer现在能够解决大多数现有的基于文本的任务，但这种低准确率表明我们的MATH数据集明显更难。准确率随模型规模的增加也只是略有提升：假设存在对数线性扩展趋势，模型需要大约 $10^{35}$ 个参数才能在MATH上达到40%的准确率，这是不切实际的。相反，要利用实际可行的资源在MATH数据集上取得重大进展，我们将需要更广泛的研究社区提出新的算法进展。

Neural Theorem Provers. Much of the existing work on machine learning models for mathematical reasoning relies on automated theorem proving benchmarks. Huang et al. (2019) use the Coq theorem proving environment to create a machine learning benchmark with 1,602 theorems and lemmas. Bansal et al. (2019) introduce the HOList benchmark for automated theorem proving, which uses a formal language to enable automatic evaluation. Rather than use HOList, Polu and Sutskever (2020) use the Metamath formalization language for automated theorem proving with promising results. We show an example of Metamath in Figure 1. These benchmarks can be approached with seq2seq (Sutskever et al., 2014) Transformers which have traction on the problem (Polu and Sutskever, 2020; Rabe et al., 2020; Li et al., 2020).

神经定理证明器。 现有关于数学推理的机器学习模型工作大多依赖于自动定理证明基准。Huang等人（2019）使用Coq定理证明环境创建了一个包含1,602个定理和引理的机器学习基准。Bansal等人（2019）引入了用于自动定理证明的HOList基准，该基准使用形式语言以实现自动评估。Polu和Sutskever（2020）没有使用HOList，而是使用Metamath形式化语言进行自动定理证明，并取得了有希望的结果。我们在图1中展示了一个Metamath示例。这些基准可以通过seq2seq Transformer来处理，后者在该问题上已取得进展。

Rather than prove theorems with standard pretrained Transformers, McAllester (2020) proposes that the community create theorem provers that bootstrap their mathematical capabilities through open-ended self-improvement. For bootstrapping to be feasible, models will also need to understand mathematics as humans write it, as manually converting advanced mathematics to a proof generation language is extremely time-consuming. This is why Szegedy (2020) argues that working on formal theorem provers alone will be an impractical path towards world-class mathematical reasoning. We address Szegedy (2020)'s concern by creating a dataset to test understanding of mathematics written in natural language and commonplace mathematical notation. This also means that the answers in our dataset can be assessed without the need for a cumbersome theorem proving environment, which is another advantage of our evaluation framework.

McAllester（2020）提出，与其使用标准的预训练Transformer证明定理，社区应创建能够通过开放式自我改进来引导自身数学能力的定理证明器。为了使引导可行，模型还需要理解人类书写的数学，因为手动将高级数学转换为证明生成语言极其耗时。这就是Szegedy（2020）认为仅致力于形式定理证明器将是不切实际的通往世界级数学推理之路的原因。我们通过创建一个数据集来测试对自然语言和常见数学符号书写的数学的理解，从而回应了Szegedy（2020）的担忧。这也意味着我们的数据集中的答案无需繁琐的定理证明环境即可评估，这是我们评估框架的另一个优势。

Neural Calculators. Recent work shows that Transformers can sometimes perform laborious calculations around as well as calculators and computer algebra systems. Lample and Charton (2020) use Transformers to solve algorithmically generated symbolic integration problems and achieve greater than 95% accuracy. Amini et al. (2019); Ling et al. (2017) introduce plug-and-chug multiple choice mathematics problems and focus on sequence-to-program generation. Saxton et al. (2019) introduce the DeepMind Mathematics dataset, which consists of algorithmically generated plug-and-chug problems such as addition, list sorting, and function evaluation, as shown in Figure 1. Recently, Henighan et al. (2020) show that, excluding problems with astronomically large numbers, the vast majority of the problems in the DeepMind Mathematics dataset can be straightforwardly solved with large Transformers.

神经计算器。 近期研究表明，Transformer有时能像计算器和计算机代数系统一样执行繁重的计算。Lample和Charton（2020）使用Transformer解决算法生成的符号积分问题，准确率超过95%。Amini等人（2019）；Ling等人（2017）引入了代入计算型选择题数学问题，并专注于序列到程序的生成。Saxton等人（2019）引入了DeepMind Mathematics数据集，该数据集包含算法生成的代入计算型问题，如加法、列表排序和函数求值，如图1所示。最近，Henighan等人（2020）表明，排除数字天文数字的问题，DeepMind Mathematics数据集中的绝大多数问题都可以用大型Transformer直接解决。

Benchmarks for Enormous Transformers. There are few existing natural language benchmarks left to solve, as tasks that aggregate multiple subtasks such as SuperGLUE (Wang et al., 2019) are solved by simply training enormous Transformers (He et al., 2020). Kaplan et al. (2020); Henighan et al. (2020) show that the performance of Transformers predictably increases with an increase in model size and dataset size, raising the question of whether natural language processing can be solved by simply increasing compute and funding. Additionally, Chen et al. (2021); Austin et al. (2021) show that code generation models scale reliably across several orders of magnitude, and, should scaling continue, Chen et al. (2021)'s HumanEval code generation dataset should be solved in a few orders of magnitude. In the Supplementary Materials, we even find that large GPT-3 models can perform remarkably well on a sequence completion test similar to an IQ test, the C-Test (Hernández-Orallo, 1998; Legg and Hutter, 2007). Even difficult logical understanding tasks such as LogiQA (Liu et al., 2020) will soon be straightforwardly solved by enormous Transformers should trends continue, which we also show in the Supplementary Materials. Hendrycks et al. (2021) create a multiple-choice benchmark covering 57 subjects. However, unlike our benchmark, which is a text generation task with 12,500 mathematical reasoning questions, their benchmark is a multiple choice task that includes only a few hundred questions about mathematics. In contrast to these benchmarks, we find that our MATH benchmark is unusually challenging for current models and, if trends continue, simply using bigger versions of today's Transformers will not solve our task in the foreseeable future.

巨型Transformer的基准。 现有的自然语言基准已所剩无几，因为像SuperGLUE这样聚合多个子任务的任务，只需训练巨型Transformer即可解决。Kaplan等人（2020）；Henighan等人（2020）表明，Transformer的性能可预测地随模型规模和数据集规模的增加而提升，这引发了是否仅通过增加算力和资金就能解决自然语言处理的问题。此外，Chen等人（2021）；Austin等人（2021）表明，代码生成模型在几个数量级上可靠地扩展，并且如果扩展持续下去，Chen等人（2021）的HumanEval代码生成数据集应在几个数量级内得到解决。在补充材料中，我们甚至发现大型GPT-3模型在一个类似于IQ测试的序列补全测试（C-Test）上表现异常出色。即使是困难的逻辑理解任务（如LogiQA），如果趋势持续下去，很快也会被巨型Transformer直接解决，我们在补充材料中也展示了这一点。Hendrycks等人（2021）创建了一个涵盖57个学科的多项选择基准。然而，与我们的基准（包含12,500个数学推理问题的文本生成任务）不同，他们的基准是一个多项选择任务，仅包含几百个数学问题。与这些基准相比，我们发现我们的MATH基准对当前模型来说异常具有挑战性，并且如果趋势持续下去，仅仅使用当今Transformer的更大版本在可预见的未来也无法解决我们的任务。

Datasets

In this section, we introduce two new datasets, one for benchmarking mathematical problem-solving ability (MATH) and one for pretraining (AMPS).

在本节中，我们介绍两个新数据集，一个用于衡量数学问题求解能力（MATH），另一个用于预训练（AMPS）。

3.1 The MATH Dataset

The MATH dataset consists of problems from mathematics competitions including the AMC 10, AMC 12, AIME, and more. Many of these problems can be collected from aops.com/community/c3158_usa_contests. These competitions span decades and assess the mathematical problem-solving ability of the best young mathematical talent in the United States. Unlike most prior work, most problems in MATH cannot be solved with a straightforward application of standard K-12 mathematics tools. Instead, humans often solve such problem by applying problem solving techniques and "heuristics" (Pólya, 1945).

3.1 MATH 数据集

MATH 数据集包含来自数学竞赛的问题，包括 AMC 10、AMC 12、AIME 等。其中许多问题可以从 aops.com/community/c3158_usa_contests 收集。这些竞赛跨越数十年，评估美国最优秀的年轻数学人才的数学问题求解能力。与大多数先前工作不同，MATH 中的大多数问题无法通过直接应用标准的 K-12 数学工具来解决。相反，人类通常通过应用问题求解技巧和“启发式方法”来解决此类问题。

The Mathematics Aptitude Test of Heuristics dataset, abbreviated MATH, has 12,500 problems (7,500 training and 5,000 test). With this many training problems, models can learn many useful heuristics for problem solving. Each problem has a step-by-step solution and a final boxed answer. Example problems with step-by-step solutions are shown in Figure 1.

启发式数学能力倾向测试数据集，缩写为 MATH，包含 12,500 个问题（7,500 个训练集和 5,000 个测试集）。拥有如此多的训练问题，模型可以学习许多有用的问题求解启发式方法。每个问题都有一个分步解答和一个最终的框选答案。带有分步解答的示例问题如图 1 所示。

Categorizing Problems. Problems span various subjects and difficulties. The seven subjects are Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus. While subjects like Prealgebra are generally easier than Precalculus, within a subject problems can take on different difficulty levels. We encode a problem's difficulty level from '1' to '5,' following AoPS. A subject's easiest problems for humans are assigned a difficulty level of '1,' and a subject's hardest problems are assigned a difficulty level of '5.' Concretely, the first few problems of an AMC 8 exam are often level 1, while AIME problems are level 5. This allows us to assess performance across both different subjects and different levels of difficulty.

问题分类。 问题涵盖不同的学科和难度。七个学科分别是：Prealgebra、Algebra、Number Theory、Counting and Probability、Geometry、Intermediate Algebra 和 Precalculus。虽然像 Prealgebra 这样的学科通常比 Precalculus 容易，但在一个学科内，问题可以有不同的难度级别。我们按照 AoPS 的方式，将问题的难度级别编码为从 '1' 到 '5'。一个学科中对人类来说最简单的问题被分配难度级别 '1'，而一个学科中最难的问题被分配难度级别 '5'。具体来说，AMC 8 考试的前几个问题通常是级别 1，而 AIME 问题是级别 5。这使我们能够评估在不同学科和不同难度级别上的表现。

Formatting. Problems and solutions are consistently formatted using LATEX and the Asymptote vector graphics language. Our usage of LATEX allows us to flexibly encode mathematical problems while avoiding unusual symbols or cumbersome formal languages. Meanwhile, mathematical figures are encoded in the Asymptote language rather than as raster images. This enables pure language models to process figures, diagrams, and graphics, making it possible to assess these models on subjects such as geometry for the first time.

格式化。 问题和解答统一使用 LATEX 和 Asymptote 矢量图形语言进行格式化。我们使用 LATEX 可以灵活地编码数学问题，同时避免使用不常见的符号或繁琐的形式语言。同时，数学图形使用 Asymptote 语言而非光栅图像进行编码。这使得纯语言模型能够处理图形、图表和图像，从而首次能够在几何等学科上评估这些模型。

To assess models using exact match, we force the final boxed answers to follow consistent formatting rules. Specifically, probabilities are expressed as simplified fractions. Moreover, matrix entry fractions are encoded with $x / y$ , while all other fractions are consistently encoded with the \frac{x}{y} command. Coefficients are encoded without a multiplication symbol (e.g. 5x not 5*x). Expressions with multiple variables are entered in alphabetical order; polynomials are expressed in decreasing degree order. Different fraction encodings equivalent, such as \frac{x}{y} and \dfrac{x}{y} and x/y. Different parenthesis encodings, such as \left( and (, are treated as equivalent.

为了使用精确匹配评估模型，我们强制最终的框选答案遵循一致的格式化规则。具体来说，概率表示为简化分数。此外，矩阵项中的分数用 $x / y$ 编码，而所有其他分数统一使用 \frac{x}{y} 命令编码。系数编码时不带乘法符号（例如 5x 而非 5*x）。具有多个变量的表达式按字母顺序输入；多项式按次数降序排列。不同的分数编码（如 \frac{x}{y}、\dfrac{x}{y} 和 x/y）视为等价。不同的括号编码如 \left( 和 (,视为等价。

We also allow units to be included or omitted from an answer, we ignore spaces, and we treat common equivalent ways of expressing the same number (e.g., 0.5 and 1/2, or 0.1 and .1) as the same. When the answer is a factorized polynomial, we permit different orderings of the factors, so that $4 (x + 1) (x - 1)$ is equivalent to $4 (x - 1) (x + 1)$ , and so on. These rules cover nearly all ways that different generated or actual solutions can be equivalent in practice.

我们还允许答案中包含或省略单位，忽略空格，并将表示相同数字的常见等价方式（例如 0.5 和 1/2，或 0.1 和 .1）视为相同。当答案是因式分解后的多项式时，我们允许因子顺序不同，因此 $4 (x + 1) (x - 1)$ 与 $4 (x - 1) (x + 1)$ 等价，依此类推。这些规则几乎涵盖了不同生成或实际解答在实践中可能等价的所有方式。

Automatically Assessing Generated Answers. Due to design choices in MATH, we can assess the answers generated by a model automatically, even though the space of model outputs is combinatorially large. Automatic assessment starts by determining the beginning and end of the answer. This is possible to do even if a model generates step-by-step solutions because the final answers in MATH are wrapped and delimited with the \boxed{} command. We can consequently evaluate a model's output by parsing what is inside the \boxed{} command and comparing that with the ground truth answer, while accounting for the equivalent ways of formatting a string described above. Together, the box delimiter and formatting rules provide a unique answer in a well-defined location, which allows us to test for equivalence and use accuracy as our primary metric.

自动评估生成的答案。 由于 MATH 的设计选择，即使模型输出的空间组合巨大，我们也可以自动评估模型生成的答案。自动评估首先确定答案的起始和结束。即使模型生成了分步解答，这也是可行的，因为 MATH 中的最终答案被包裹在 \boxed{} 命令中进行分隔。因此，我们可以通过解析 \boxed{} 命令内的内容，并将其与真实答案进行比较来评估模型的输出，同时考虑上述字符串格式化的等价方式。总之，框定界符和格式化规则在一个定义明确的位置提供了一个唯一的答案，这使我们能够测试等价性并使用准确率作为我们的主要指标。

Human-Level Performance. To provide a rough but informative comparison to human-level performance, we randomly sampled 20 problems from the MATH test set and gave them to humans. We artificially require that the participants have 1 hour to work on the problems and must perform calculations by hand. All participants are university students. One participant who does not like mathematics got $8 / 20 = 40 %$ correct. A participant ambivalent toward mathematics got $13 / 20$ . Two participants who like mathematics got $14 / 20$ and $15 / 20$ . A participant who got a perfect score on the AMC 10 exam and attended USAMO several times got $18 / 20$ . A three-time IMO gold medalist got $18 / 20 = 90 %$ , though missed questions were exclusively due to small errors of arithmetic. Expert-level performance is theoretically $100 %$ given enough time. Even $40 %$ would accuracy for a machine learning model would be impressive but have ramifications for cheating on homework.

人类水平表现。 为了提供一个粗略但具有参考价值的人类水平表现对比，我们从 MATH 测试集中随机抽取了 20 个问题，交给人类解答。我们人为要求参与者有 1 小时的时间解题，并且必须手工计算。所有参与者都是大学生。一位不喜欢数学的参与者答对了 $8 / 20 = 40 %$ 。一位对数学态度中立的参与者答对了 $13 / 20$ 。两位喜欢数学的参与者分别答对了 $14 / 20$ 和 $15 / 20$ 。一位在 AMC 10 考试中获得满分并多次参加 USAMO 的参与者答对了 $18 / 20$ 。一位三次 IMO 金牌得主答对了 $18 / 20 = 90 %$ ，尽管错过的题目完全是由于小的算术错误。专家级表现理论上在时间充足的情况下是 $100 %$ 。对于机器学习模型来说，即使达到 $40 %$ 的准确率也是令人印象深刻的，但可能会对防止作业作弊产生影响。

3.2 AMPS (Khan + Mathematica) Dataset

Since pretraining data can greatly influence performance (Hernandez et al., 2021; Gururangan et al., 2020) and since mathematics is a small fraction of online text, we introduce a large and diverse mathematics pretraining corpus. Our pretraining dataset, the Auxiliary Mathematics Problems and Solutions (AMPS) dataset, has problems and step-by-step solutions typeset in LaTeX. AMPS contains over 100,000 problems pulled from Khan Academy and approximately 5 million problems generated from manually designed Mathematica scripts.

3.2 AMPS（可汗学院 + Mathematica）数据集

由于预训练数据可以极大地影响性能，并且数学在在线文本中只占一小部分，我们引入了一个大规模且多样化的数学预训练语料库。我们的预训练数据集，即辅助数学问题与解答数据集，包含用 LaTeX 排版的问题和分步解答。AMPS 包含从可汗学院获取的超过 10 万个问题，以及从手动设计的 Mathematica 脚本生成的大约 500 万个问题。

Khan Academy. The Khan Academy subset of AMPS has 693 exercise types with over 100,000 problems and full solutions. Problem types range from elementary mathematics (e.g. addition) to multivariable calculus (e.g. Stokes' theorem), and are used to teach actual K-12 students. The exercises can be regenerated using code from github.com/Khan/khan-exercises/. We show the full list of problem types in the Supplementary Materials.

可汗学院。 AMPS 中的可汗学院子集包含 693 种练习类型，超过 10 万个问题及其完整解答。问题类型涵盖从基础数学（例如加法）到多变量微积分（例如斯托克斯定理），用于教导真实的 K-12 学生。这些练习可以使用 github.com/Khan/khan-exercises/ 的代码重新生成。我们在补充材料中展示了完整的问题类型列表。

Mathematica. To make AMPS larger, we also contribute our own Mathematica scripts to generate approximately $50 \times$ more problems than our Khan Academy dataset. With Mathematica, we designed 100 scripts that test distinct mathematics concepts, 37 of which include full step-by-step LaTeX solutions in addition to final answers. We generated around 50,000 exercises from each of our scripts, or around 5 million problems in total. This results in over 23 GB of mathematics problems, making it larger than the 16 GB of natural language used to train BERT (Devlin et al., 2019).

Mathematica。 为了扩大 AMPS 的规模，我们还贡献了自己的 Mathematica 脚本，以生成比可汗学院数据集多大约 $50 \times$ 的问题。使用 Mathematica，我们设计了 100 个测试不同数学概念的脚本，其中 37 个在最终答案之外还包含完整的、分步的 LaTeX 解答。我们从每个脚本生成了大约 50,000 个练习，总计约 500 万个问题。这产生了超过 23 GB 的数学问题，比用于训练 BERT 的 16 GB 自然语言数据还要大。

Problems include various aspects of algebra, calculus, counting and statistics, geometry, linear algebra, and number theory (see Table 1 for a sampling of topics). Unlike prior approaches to algorithmically generating mathematics problems, we use Mathematica's computer algebra system so that we can manipulate fractions, transcendental numbers, and analytic functions.

问题涵盖代数、微积分、计数与统计、几何、线性代数和数论的各个方面（主题示例见表 1）。与先前算法生成数学问题的方法不同，我们使用 Mathematica 的计算机代数系统，以便能够处理分数、超越数和解析函数。

Conclusion

In this paper, we laid groundwork for future research in machine learning for mathematical problem solving. We introduced the MATH benchmark, which enables the community to measure mathematical problem-solving ability. In addition to having answers, all MATH problems also include answer explanations, which models can learn from to generate their own step-by-step solutions. We also introduce AMPS, a diverse pretraining corpus that can enable future models to learn virtually all of K-12 mathematics. While most other text-based tasks are already nearly solved by enormous Transformers, MATH is notably different. We showed that accuracy is slowly increasing and, if trends continue, the community will need to discover conceptual and algorithmic breakthroughs to attain strong performance on MATH. Given the broad reach and applicability of mathematics, solving the MATH dataset with machine learning would be of profound practical and intellectual significance.

在本文中，我们为未来机器学习在数学问题求解领域的研究奠定了基础。我们引入了MATH基准，使研究社区能够衡量数学问题求解能力。除了答案之外，所有MATH问题都包含解答说明，模型可以从中学习以生成自己的分步解答。我们还引入了AMPS，一个多样化的预训练语料库，可以使未来的模型学习几乎所有的K-12数学知识。虽然大多数其他基于文本的任务已被巨型Transformer近乎解决，但MATH明显不同。我们表明准确率正在缓慢提升，并且如果当前趋势持续下去，研究社区将需要发现概念和算法上的突破，才能在MATH上取得强劲表现。鉴于数学的广泛影响力和适用性，用机器学习解决MATH数据集将具有深远的实践和智力意义。

综合类

Memory

⚛️ Next.js

📈 Seo

⚛️ React.js

🎨 css

📊 d3.js

🌿 Node.js

🌱 koa.js

🥘 GAMES101

🌌 three.js

🫧 WebGPU

高等数学

🧰 工具安装

🤖 Rasa

🥝 机器学习

🧠 LLM专题

🍿 强化学习

🍳 计算机视觉

🤖 智能体

🐬 mysql

🧪 jest

Measuring Mathematical Problem Solving With the MATH Dataset

使用MATH数据集衡量数学问题求解能力

Abstract

Introduction

Datasets

3.1 The MATH Dataset

3.1 MATH 数据集

3.2 AMPS (Khan + Mathematica) Dataset

3.2 AMPS（可汗学院 + Mathematica）数据集

Conclusion

🤖 Rasa

Measuring Mathematical Problem Solving With the MATH Dataset ​

使用MATH数据集衡量数学问题求解能力 ​

Abstract ​

Introduction ​

Related Work ​

Datasets ​

3.1 The MATH Dataset ​

3.1 MATH 数据集 ​

3.2 AMPS (Khan + Mathematica) Dataset ​

3.2 AMPS（可汗学院 + Mathematica）数据集 ​

Conclusion ​

Measuring Mathematical Problem Solving With the MATH Dataset

使用MATH数据集衡量数学问题求解能力

Abstract

Introduction

Related Work

Datasets

3.1 The MATH Dataset

3.1 MATH 数据集

3.2 AMPS (Khan + Mathematica) Dataset

3.2 AMPS（可汗学院 + Mathematica）数据集

Conclusion