Scaling Laws for Neural Language Models
Abstract
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
Introduction
Language provides a natural domain for the study of artificial intelligence, as the vast majority of reasoning tasks can be efficiently expressed and evaluated in language, and the world's text provides a wealth of data for unsupervised learning via generative modeling. Deep learning has recently seen rapid progress in language modeling, with state of the art models approaching human-level performance on many specific tasks, including the composition of coherent multi-paragraph prompted text samples.
One might expect language modeling performance to depend on model architecture, the size of neural models, the computing power used to train them, and the data available for this training process. In this work we will empirically investigate the dependence of language modeling loss on all of these factors, focusing on the Transformer architecture [VSP+17, LSP+18]. The high ceiling and low floor for performance on language tasks allows us to study trends over more than seven orders of magnitude in scale.
Throughout we will observe precise power-law scalings for performance as a function of training time, context length, dataset size, model size, and compute budget.
1.1 Summary
Our key findings for Transformer language models are as follows:
Performance depends strongly on scale, weakly on model shape: Model performance depends most strongly on scale, which consists of three factors: the number of model parameters N (excluding embeddings), the size of the dataset D, and the amount of compute C used for training. Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width. (Section 3)
Smooth power laws: Performance has a power-law relationship with each of the three scale factors N, D, C when not bottlenecked by the other two. (Section 3)
Universality of overfitting: Performance improves predictably as long as we scale up N and D in tandem, but enters a regime of diminishing returns if either N or D is held fixed while the other increases. (Section 4)
Universality of training: Training curves follow predictable power-laws whose parameters are roughly independent of the model size. By extrapolating the early part of a training curve, we can roughly predict the loss that would be achieved if we trained for much longer. (Section 5)
Transfer improves with test performance: When we evaluate models on text with a different distribution than they were trained on, the results are strongly correlated to those on the training validation set with a roughly constant offset in the loss – in other words, transfer to a different distribution incurs a constant penalty but otherwise improves roughly in line with performance on the training set. (Section 3.2.2)
Sample efficiency: Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps (Figure 2) and using fewer data points (Figure 4).
Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence. (Section 6)
Optimal batch size: The ideal batch size for training these models is roughly a power of the loss only, and continues to be determinable by measuring the gradient noise scale [MKAT18]; it is roughly 1-2 million tokens at convergence for the largest models we can train. (Section 5.1)
Taken together, these results show that language modeling performance improves smoothly and predictably as we appropriately scale up model size, data, and compute. We expect that larger language models will perform better and be more sample efficient than current models.
1.2 Summary of Scaling Laws
The test loss of a Transformer trained to autoregressively model language can be predicted using a power-law when performance is limited by only either the number of non-embedding parameters N, the dataset size D, or the optimally allocated compute budget C_min:
- For models with a limited number of parameters, trained to convergence on sufficiently large datasets:
  L(N) = (N_c/N)^(α_N); α_N ∼ 0.076, N_c ∼ 8.8 × 10^13 (non-embedding parameters)    (1.1)
- For large models trained with a limited dataset with early stopping:
  L(D) = (D_c/D)^(α_D); α_D ∼ 0.095, D_c ∼ 5.4 × 10^13 (tokens)    (1.2)
- When training with a limited amount of compute, a sufficiently large dataset, an optimally-sized model, and a sufficiently small batch size (making optimal use of compute):
  L(C_min) = (C_c^min/C_min)^(α_C^min); α_C^min ∼ 0.050, C_c^min ∼ 3.1 × 10^8 (PF-days)    (1.3)
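As a concrete, hypothetical illustration (not part of the original analysis), the fitted constants quoted for Equations (1.1) and (1.2) can be plugged into the common form L(X) = (X_c/X)^(α_X):

```python
# Illustrative sketch: evaluate the power-law form L(X) = (X_c / X)^alpha_X
# with the fitted constants quoted for Equations (1.1)-(1.2).

def power_law_loss(x, x_c, alpha):
    """Predicted cross-entropy loss (in nats) when only factor x is the bottleneck."""
    return (x_c / x) ** alpha

# Fitted constants as quoted above:
ALPHA_N, N_C = 0.076, 8.8e13   # non-embedding parameter count
ALPHA_D, D_C = 0.095, 5.4e13   # dataset size in tokens

# A 100M-parameter model trained to convergence on effectively unlimited data:
loss_from_params = power_law_loss(1e8, N_C, ALPHA_N)   # ≈ 2.83 nats

# A 10B-token dataset with an effectively unlimited model and early stopping:
loss_from_data = power_law_loss(1e10, D_C, ALPHA_D)    # ≈ 2.26 nats
```

Note how shallow the exponents are: a thousand-fold increase in parameters lowers the loss by only a modest constant factor.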
These relations hold across eight orders of magnitude in C_min, six orders of magnitude in N, and over two orders of magnitude in D. They depend very weakly on model shape and other Transformer hyperparameters (depth, width, number of self-attention heads).
The critical batch size, which determines the speed/efficiency tradeoff for data parallelism, also roughly obeys a power law in L:
  B_crit(L) = B*/L^(1/α_B); B* ∼ 2 × 10^8 tokens, α_B ∼ 0.21    (1.4)
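As a hypothetical illustration, the critical-batch-size power law with the quoted values B* ≈ 2 × 10^8 tokens and α_B ≈ 0.21 can be sketched as:

```python
# Sketch of the critical batch size as a power law in the loss,
# B_crit(L) = B* / L^(1/alpha_B), using the fitted values quoted in the text.

def critical_batch_size(loss, b_star=2e8, alpha_b=0.21):
    """Tokens per batch at the speed/compute-efficiency tradeoff point."""
    return b_star / loss ** (1.0 / alpha_b)

# Near a converged loss of ~3 nats this gives roughly a million tokens,
# consistent with the 1-2 million tokens quoted for the largest models:
b = critical_batch_size(3.0)   # ≈ 1.07e6 tokens
```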
Equation (1.1) and (1.2) together suggest that as we increase the model size, we should increase the dataset size sublinearly according to D ∝ N^(α_N/α_D) ∼ N^0.74. In fact, we find that there is a single equation combining (1.1) and (1.2) that governs the simultaneous dependence on N and D and governs the degree of overfitting:
  L(N, D) = [(N_c/N)^(α_N/α_D) + D_c/D]^(α_D)    (1.5)
with fits pictured on the left in figure 4. We conjecture that this functional form may also parameterize the trained log-likelihood for other generative modeling tasks.
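A minimal numerical sketch of this combined N, D dependence, using the fitted constants quoted earlier as illustrative assumptions; in the unbounded-data or unbounded-model limits it reduces to the single-variable power laws:

```python
# Sketch of the combined loss L(N, D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D,
# with the fitted constants quoted for Equations (1.1)-(1.2) as assumptions.

def loss_n_d(n, d, alpha_n=0.076, alpha_d=0.095, n_c=8.8e13, d_c=5.4e13):
    """Combined loss: reduces to L(N) as d -> inf and to L(D) as n -> inf."""
    return ((n_c / n) ** (alpha_n / alpha_d) + d_c / d) ** alpha_d

# Limiting-behavior checks:
l_param_limited = loss_n_d(1e8, 1e30)   # ≈ (N_c/1e8)^alpha_N ≈ 2.83
l_data_limited = loss_n_d(1e30, 1e10)   # ≈ (D_c/1e10)^alpha_D ≈ 2.26
```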
When training a given model for a finite number of parameter update steps S in the infinite data limit, after an initial transient period, the learning curves can be fit with
  L(N, S) = (N_c/N)^(α_N) + (S_c/S_min(S))^(α_S)    (1.6)
where S_c ≈ 2.1 × 10^3 and α_S ≈ 0.76, and S_min(S) is the minimum possible number of optimization steps (parameter updates), as discussed in Section 5.1.
When training within a fixed compute budget C, but with no other constraints, Equation (1.6) leads to the prediction that the optimal model size N, optimal batch size B, optimal number of steps S, and dataset size D should grow as
  N ∝ C^(α_C^min/α_N), B ∝ C^(α_C^min/α_B), S ∝ C^(α_C^min/α_S), D = B · S    (1.7)
with
  α_C^min = 1/(1/α_S + 1/α_B + 1/α_N) ≈ 0.054    (1.8)
which closely matches the empirically optimal results N ∝ C_min^0.73, B ∝ C_min^0.24, and S ∝ C_min^0.03. As the computational budget C increases, it should be spent primarily on larger models, without dramatic increases in training time or dataset size.
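The prediction α_C^min = 1/(1/α_S + 1/α_B + 1/α_N) can be checked numerically. The sketch below uses the fitted exponents α_S ≈ 0.76, α_B ≈ 0.21, α_N ≈ 0.076 as assumptions; these evaluate to ≈ 0.052, in line with the quoted ≈ 0.054 up to rounding of the fitted exponents:

```python
# Sketch: the optimal-compute exponent as a harmonic-style combination of the
# step, batch, and parameter exponents, and the implied allocation of compute.

ALPHA_N, ALPHA_B, ALPHA_S = 0.076, 0.21, 0.76

def alpha_c_min(a_n=ALPHA_N, a_b=ALPHA_B, a_s=ALPHA_S):
    """alpha_C^min = 1 / (1/alpha_S + 1/alpha_B + 1/alpha_N)."""
    return 1.0 / (1.0 / a_s + 1.0 / a_b + 1.0 / a_n)

def allocation_exponents():
    """Predicted growth exponents: X ∝ C^(alpha_C^min / alpha_X) for X in {N, B, S}."""
    a = alpha_c_min()
    return {"N": a / ALPHA_N, "B": a / ALPHA_B, "S": a / ALPHA_S}

# alpha_c_min() ≈ 0.052; the three exponents sum to 1 by construction, and most
# of a growing compute budget goes to model size:
# allocation_exponents() ≈ {"N": 0.68, "B": 0.25, "S": 0.07}
```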
We provide some basic theoretical motivation for Equation (1.5), an analysis of learning curve fits and their implications for training time, and a breakdown of our results per token. We also make some brief comparisons to LSTMs and recurrent Transformers.
1.3 Notation
We use the following notation:
- L – the cross-entropy loss in nats. Typically it will be averaged over the tokens in a context, but in some cases we report the loss for specific tokens within the context.
- N – the number of model parameters, excluding all vocabulary and positional embeddings
- C ≈ 6NBS – an estimate of the total non-embedding training compute, where B is the batch size and S is the number of training steps (i.e. parameter updates). We quote numerical values in PF-days, where one PF-day = 10^15 × 24 × 3600 = 8.64 × 10^19 floating point operations.
- D – the dataset size in tokens
- B_crit – the critical batch size [MKAT18], defined and discussed in Section 5.1. Training at the critical batch size provides a roughly optimal compromise between time and compute efficiency.
- C_min – an estimate of the minimum amount of non-embedding compute to reach a given value of the loss. This is the training compute that would be used if the model were trained at a batch size much less than the critical batch size.
- S_min – an estimate of the minimal number of training steps needed to reach a given value of the loss. This is also the number of training steps that would be used if the model were trained at a batch size much greater than the critical batch size.
- α_N, α_D, α_C, α_S, α_B, α_C^min – power-law exponents for the scaling of the loss as L ∝ 1/X^(α_X), where X can be any of N, D, C, S, B, C_min.
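The compute estimate C ≈ 6NBS from the notation above can be sketched numerically; this is an illustrative helper, not code from the paper:

```python
# Sketch of the compute estimate C ≈ 6 N B S, converted to PF-days
# (one PF-day = 1e15 FLOPs/s × 24 × 3600 s = 8.64e19 FLOPs).

PF_DAY = 8.64e19  # floating point operations in one PF-day

def train_compute_pf_days(n_params, batch_tokens, steps):
    """Total non-embedding training compute C ≈ 6*N*B*S, in PF-days."""
    return 6.0 * n_params * batch_tokens * steps / PF_DAY

# e.g. a 1B-parameter model, 0.5M-token batches, 100k steps:
c = train_compute_pf_days(1e9, 5e5, 1e5)   # ≈ 3.47 PF-days
```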
Related Work
Power laws can arise from a wide variety of sources. Power-law scalings with model and dataset size in density estimation and in random forest models may be connected with our results. These models suggest that power-law exponents may have a very rough interpretation as the inverse of the number of relevant features in the data.
Some early work found power-law scalings between performance and dataset size. More recent work also investigated scaling between model size and data size; their work is perhaps the closest to ours in the literature. Note, however, that [HNA+17] found super-linear scaling of dataset size with model size, whereas we find a sub-linear scaling. There are some parallels between our findings on optimal allocation of compute and [Kom19], including power-law learning curves. EfficientNets also appear to obey an approximate power-law relation between accuracy and model size. Very recent work studies scaling with both dataset size and model size for a variety of datasets, and fits an ansatz similar to ours.
EfficientNet advocates scaling depth and width exponentially (with different coefficients) for optimal performance of image models, resulting in a power-law scaling of width as a function of depth. We find that for language models this power should be roughly one when scaling up (as width/depth should remain fixed). But more importantly, we find that the precise architectural hyperparameters are unimportant compared to the overall scale of the language model. In [VWB16] it was argued that deep models can function as ensembles of shallower models, which could potentially explain this finding. Earlier work has compared width and depth, and found that wide ResNets can outperform deep ResNets on image classification. Some studies fix computation per data example, which tends to scale in proportion to the number of model parameters, whereas we investigate scaling with both model size and the quantity of training computation.
Various works have investigated generalization in highly overparameterized models, finding a "jamming transition" when the model size reaches the dataset size (this may require training many orders of magnitude beyond typical practice, and in particular does not use early stopping). We do not observe such a transition, and find that the necessary training data scales sublinearly in the model size. Expansions in the model size, particularly at large width, may provide a useful framework for thinking about some of our scaling relations. Our results on optimization, such as the shape of learning curves, can likely be explained using a noisy quadratic model, which can provide quite accurate predictions in realistic settings. Making this connection quantitative will require a characterization of the Hessian spectrum.
Discussion
We have observed consistent scalings of language model log-likelihood loss with non-embedding parameter count N, dataset size D, and optimized training computation C_min, as encapsulated in Equations (1.5) and (1.6).
We were able to precisely model the dependence of the loss on N and D, and alternatively on N and S, when these parameters are varied simultaneously.
It is natural to conjecture that the scaling relations will apply to other generative modeling tasks with a maximum likelihood loss, and perhaps in other settings as well. To this purpose, it will be interesting to test these relations on other domains, such as images, audio, and video models, and perhaps also for random network distillation. At this point we do not know which of our results depend on the structure of natural language data, and which are universal. It would also be exciting to find a theoretical framework from which the scaling relations can be derived: a 'statistical mechanics' underlying the 'thermodynamics' we have observed. Such a theory might make it possible to derive other more precise predictions, and provide a systematic understanding of the limitations of the scaling laws.
In the domain of natural language, it will be important to investigate whether continued improvement on the loss translates into improvement on relevant language tasks. Smooth quantitative change can mask major qualitative improvements: "more is different". For example, the smooth aggregate growth of the economy provides no indication of the specific technological developments that underwrite it. Similarly, the smooth improvements in language model loss may hide seemingly qualitative changes in capability.
Our results strongly suggest that larger models will continue to perform better, and will also be much more sample efficient than has been previously appreciated. Big models may be more important than big data. In this context, further investigation into model parallelism is warranted. Deep models can be trained using pipelining, which splits parameters depth-wise between devices, but eventually requires increased batch sizes as more devices are used. Wide networks on the other hand are more amenable to parallelization, since large layers can be split between multiple workers with less serial dependency. Sparsity or branching may allow for even faster training of large networks through increased model parallelism. And using methods like [WRH17, WYL19], which grow networks as they train, it might be possible to remain on the compute-efficient frontier for an entire training run.