Scaling Laws for Neural Language Models
Abstract
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
Introduction
Language provides a natural domain for the study of artificial intelligence, as the vast majority of reasoning tasks can be efficiently expressed and evaluated in language, and the world's text provides a wealth of data for unsupervised learning via generative modeling. Deep learning has recently seen rapid progress in language modeling, with state of the art models approaching human-level performance on many specific tasks, including the composition of coherent multi-paragraph prompted text samples.
One might expect language modeling performance to depend on model architecture, the size of neural models, the computing power used to train them, and the data available for this training process. In this work we will empirically investigate the dependence of language modeling loss on all of these factors, focusing on the Transformer architecture [VSP+17, LSP+18]. The high ceiling and low floor for performance on language tasks allows us to study trends over more than seven orders of magnitude in scale.
Throughout we will observe precise power-law scalings for performance as a function of training time, context length, dataset size, model size, and compute budget.
1.1 Summary
Our key findings for Transformer language models are as follows:
Performance depends strongly on scale, weakly on model shape: Model performance depends most strongly on scale, which consists of three factors: the number of model parameters N (excluding embeddings), the size of the dataset D, and the amount of compute C used for training. Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width. (Section 3)
Smooth power laws: Performance has a power-law relationship with each of the three scale factors N, D, C when not bottlenecked by the other two. (Section 3)
Universality of overfitting: Performance improves predictably as long as we scale up N and D in tandem, but enters a regime of diminishing returns if either N or D is held fixed while the other increases. (Section 4)
Universality of training: Training curves follow predictable power-laws whose parameters are roughly independent of the model size. By extrapolating the early part of a training curve, we can roughly predict the loss that would be achieved if we trained for much longer. (Section 5)
Transfer improves with test performance: When we evaluate models on text with a different distribution than they were trained on, the results are strongly correlated to those on the training validation set with a roughly constant offset in the loss – in other words, transfer to a different distribution incurs a constant penalty but otherwise improves roughly in line with performance on the training set. (Section 3.2.2)
Sample efficiency: Large models are more sample-efficient than small models, reaching the same level of performance with fewer optimization steps (Figure 2) and using fewer data points (Figure 4).
Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence. (Section 6)
Optimal batch size: The ideal batch size for training these models is roughly a power of the loss only, and continues to be determinable by measuring the gradient noise scale [MKAT18]; it is roughly 1-2 million tokens at convergence for the largest models we can train. (Section 5.1)
Taken together, these results show that language modeling performance improves smoothly and predictably as we appropriately scale up model size, data, and compute. We expect that larger language models will perform better and be more sample efficient than current models.
1.2 Summary of Scaling Laws
The test loss of a Transformer trained to autoregressively model language can be predicted using a power-law when performance is limited by only either the number of non-embedding parameters N, the dataset size D, or the optimally allocated compute budget C_min:
- For models with a limited number of parameters, trained to convergence on sufficiently large datasets:
  L(N) = (N_c/N)^(α_N); α_N ∼ 0.076, N_c ∼ 8.8 × 10^13 (non-embedding parameters)    (1.1)
- For large models trained with a limited dataset with early stopping:
  L(D) = (D_c/D)^(α_D); α_D ∼ 0.095, D_c ∼ 5.4 × 10^13 (tokens)    (1.2)
- When training with a limited amount of compute, a sufficiently large dataset, an optimally-sized model, and a sufficiently small batch size (making optimal use of compute):
  L(C_min) = (C_c^min/C_min)^(α_C^min); α_C^min ∼ 0.050, C_c^min ∼ 3.1 × 10^8 (PF-days)    (1.3)
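As a concrete, hypothetical illustration (not part of the original analysis), the fitted constants quoted for Equations (1.1) and (1.2) can be plugged into the common form L(X) = (X_c/X)^(α_X):

```python
# Illustrative sketch: evaluate the power-law form L(X) = (X_c / X)^alpha_X
# with the fitted constants quoted for Equations (1.1)-(1.2).

def power_law_loss(x, x_c, alpha):
    """Predicted cross-entropy loss (in nats) when only factor x is the bottleneck."""
    return (x_c / x) ** alpha

# Fitted constants as quoted above:
ALPHA_N, N_C = 0.076, 8.8e13   # non-embedding parameter count
ALPHA_D, D_C = 0.095, 5.4e13   # dataset size in tokens

# A 100M-parameter model trained to convergence on effectively unlimited data:
loss_from_params = power_law_loss(1e8, N_C, ALPHA_N)   # ≈ 2.83 nats

# A 10B-token dataset with an effectively unlimited model and early stopping:
loss_from_data = power_law_loss(1e10, D_C, ALPHA_D)    # ≈ 2.26 nats
```

Note how shallow the exponents are: a thousand-fold increase in parameters lowers the loss by only a modest constant factor.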
These relations hold across eight orders of magnitude in C_min, six orders of magnitude in N, and over two orders of magnitude in D. They depend very weakly on model shape and other Transformer hyperparameters (depth, width, number of self-attention heads).
The critical batch size, which determines the speed/efficiency tradeoff for data parallelism, also roughly obeys a power law in L:
  B_crit(L) = B*/L^(1/α_B); B* ∼ 2 × 10^8 tokens, α_B ∼ 0.21    (1.4)
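As a hypothetical illustration, the critical-batch-size power law with the quoted values B* ≈ 2 × 10^8 tokens and α_B ≈ 0.21 can be sketched as:

```python
# Sketch of the critical batch size as a power law in the loss,
# B_crit(L) = B* / L^(1/alpha_B), using the fitted values quoted in the text.

def critical_batch_size(loss, b_star=2e8, alpha_b=0.21):
    """Tokens per batch at the speed/compute-efficiency tradeoff point."""
    return b_star / loss ** (1.0 / alpha_b)

# Near a converged loss of ~3 nats this gives roughly a million tokens,
# consistent with the 1-2 million tokens quoted for the largest models:
b = critical_batch_size(3.0)   # ≈ 1.07e6 tokens
```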
Equation (1.1) and (1.2) together suggest that as we increase the model size, we should increase the dataset size sublinearly according to D ∝ N^(α_N/α_D) ∼ N^0.74. In fact, we find that there is a single equation combining (1.1) and (1.2) that governs the simultaneous dependence on N and D and governs the degree of overfitting:
  L(N, D) = [(N_c/N)^(α_N/α_D) + D_c/D]^(α_D)    (1.5)
with fits pictured on the left in figure 4. We conjecture that this functional form may also parameterize the trained log-likelihood for other generative modeling tasks.
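A minimal numerical sketch of this combined N, D dependence, using the fitted constants quoted earlier as illustrative assumptions; in the unbounded-data or unbounded-model limits it reduces to the single-variable power laws:

```python
# Sketch of the combined loss L(N, D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D,
# with the fitted constants quoted for Equations (1.1)-(1.2) as assumptions.

def loss_n_d(n, d, alpha_n=0.076, alpha_d=0.095, n_c=8.8e13, d_c=5.4e13):
    """Combined loss: reduces to L(N) as d -> inf and to L(D) as n -> inf."""
    return ((n_c / n) ** (alpha_n / alpha_d) + d_c / d) ** alpha_d

# Limiting-behavior checks:
l_param_limited = loss_n_d(1e8, 1e30)   # ≈ (N_c/1e8)^alpha_N ≈ 2.83
l_data_limited = loss_n_d(1e30, 1e10)   # ≈ (D_c/1e10)^alpha_D ≈ 2.26
```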
When training a given model for a finite number of parameter update steps S in the infinite data limit, after an initial transient period, the learning curves can be fit with
  L(N, S) = (N_c/N)^(α_N) + (S_c/S_min(S))^(α_S)    (1.6)
where S_c ≈ 2.1 × 10^3 and α_S ≈ 0.76, and S_min(S) is the minimum possible number of optimization steps (parameter updates), as discussed in Section 5.1.
When training within a fixed compute budget C, but with no other constraints, Equation (1.6) leads to the prediction that the optimal model size N, optimal batch size B, optimal number of steps S, and dataset size D should grow as
  N ∝ C^(α_C^min/α_N), B ∝ C^(α_C^min/α_B), S ∝ C^(α_C^min/α_S), D = B · S    (1.7)
with
  α_C^min = 1/(1/α_S + 1/α_B + 1/α_N) ≈ 0.054    (1.8)
which closely matches the empirically optimal results N ∝ C_min^0.73, B ∝ C_min^0.24, and S ∝ C_min^0.03. As the computational budget C increases, it should be spent primarily on larger models, without dramatic increases in training time or dataset size.
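The prediction α_C^min = 1/(1/α_S + 1/α_B + 1/α_N) can be checked numerically. The sketch below uses the fitted exponents α_S ≈ 0.76, α_B ≈ 0.21, α_N ≈ 0.076 as assumptions; these evaluate to ≈ 0.052, in line with the quoted ≈ 0.054 up to rounding of the fitted exponents:

```python
# Sketch: the optimal-compute exponent as a harmonic-style combination of the
# step, batch, and parameter exponents, and the implied allocation of compute.

ALPHA_N, ALPHA_B, ALPHA_S = 0.076, 0.21, 0.76

def alpha_c_min(a_n=ALPHA_N, a_b=ALPHA_B, a_s=ALPHA_S):
    """alpha_C^min = 1 / (1/alpha_S + 1/alpha_B + 1/alpha_N)."""
    return 1.0 / (1.0 / a_s + 1.0 / a_b + 1.0 / a_n)

def allocation_exponents():
    """Predicted growth exponents: X ∝ C^(alpha_C^min / alpha_X) for X in {N, B, S}."""
    a = alpha_c_min()
    return {"N": a / ALPHA_N, "B": a / ALPHA_B, "S": a / ALPHA_S}

# alpha_c_min() ≈ 0.052; the three exponents sum to 1 by construction, and most
# of a growing compute budget goes to model size:
# allocation_exponents() ≈ {"N": 0.68, "B": 0.25, "S": 0.07}
```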
We provide some basic theoretical motivation for Equation (1.5), an analysis of learning curve fits and their implications for training time, and a breakdown of our results per token. We also make some brief comparisons to LSTMs and recurrent Transformers.
1.3 Notation
We use the following notation:
- L – the cross-entropy loss in nats. Typically it will be averaged over the tokens in a context, but in some cases we report the loss for specific tokens within the context.
- N – the number of model parameters, excluding all vocabulary and positional embeddings
- C ≈ 6NBS – an estimate of the total non-embedding training compute, where B is the batch size and S is the number of training steps (i.e. parameter updates). We quote numerical values in PF-days, where one PF-day = 10^15 × 24 × 3600 = 8.64 × 10^19 floating point operations.
- D – the dataset size in tokens
- B_crit – the critical batch size [MKAT18], defined and discussed in Section 5.1. Training at the critical batch size provides a roughly optimal compromise between time and compute efficiency.
- C_min – an estimate of the minimum amount of non-embedding compute to reach a given value of the loss. This is the training compute that would be used if the model were trained at a batch size much less than the critical batch size.
- S_min – an estimate of the minimal number of training steps needed to reach a given value of the loss. This is also the number of training steps that would be used if the model were trained at a batch size much greater than the critical batch size.
- α_N, α_D, α_C, α_S, α_B, α_C^min – power-law exponents for the scaling of the loss as L ∝ 1/X^(α_X), where X can be any of N, D, C, S, B, C_min.
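The compute estimate C ≈ 6NBS from the notation above can be sketched numerically; this is an illustrative helper, not code from the paper:

```python
# Sketch of the compute estimate C ≈ 6 N B S, converted to PF-days
# (one PF-day = 1e15 FLOPs/s × 24 × 3600 s = 8.64e19 FLOPs).

PF_DAY = 8.64e19  # floating point operations in one PF-day

def train_compute_pf_days(n_params, batch_tokens, steps):
    """Total non-embedding training compute C ≈ 6*N*B*S, in PF-days."""
    return 6.0 * n_params * batch_tokens * steps / PF_DAY

# e.g. a 1B-parameter model, 0.5M-token batches, 100k steps:
c = train_compute_pf_days(1e9, 5e5, 1e5)   # ≈ 3.47 PF-days
```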
Related Work
Power laws can arise from a wide variety of sources. Power-law scalings with model and dataset size in density estimation and in random forest models may be connected with our results. These models suggest that power-law exponents may have a very rough interpretation as the inverse of the number of relevant features in the data.
Some early work found power-law scalings between performance and dataset size. More recent work also investigated scaling between model size and data size; their work is perhaps the closest to ours in the literature. Note, however, that [HNA+17] found super-linear scaling of dataset size with model size, whereas we find a sub-linear scaling. There are some parallels between our findings on optimal allocation of compute and [Kom19], including power-law learning curves. EfficientNets also appear to obey an approximate power-law relation between accuracy and model size. Very recent work studies scaling with both dataset size and model size for a variety of datasets, and fits an ansatz similar to ours.
EfficientNet advocates scaling depth and width exponentially (with different coefficients) for optimal performance of image models, resulting in a power-law scaling of width as a function of depth. We find that for language models this power should be roughly one when scaling up (as width/depth should remain fixed). But more importantly, we find that the precise architectural hyperparameters are unimportant compared to the overall scale of the language model. In [VWB16] it was argued that deep models can function as ensembles of shallower models, which could potentially explain this finding. Earlier work has compared width and depth, and found that wide ResNets can outperform deep ResNets on image classification. Some studies fix computation per data example, which tends to scale in proportion to the number of model parameters, whereas we investigate scaling with both model size and the quantity of training computation.
Various works have investigated generalization in highly overparameterized models, finding a "jamming transition" when the model size reaches the dataset size (this may require training many orders of magnitude beyond typical practice, and in particular does not use early stopping). We do not observe such a transition, and find that the necessary training data scales sublinearly in the model size. Expansions in the model size, particularly at large width, may provide a useful framework for thinking about some of our scaling relations. Our results on optimization, such as the shape of learning curves, can likely be explained using a noisy quadratic model, which can provide quite accurate predictions in realistic settings. Making this connection quantitative will require a characterization of the Hessian spectrum.
Discussion
We have observed consistent scalings of language model log-likelihood loss with non-embedding parameter count N, dataset size D, and optimized training computation C_min, as encapsulated in Equations (1.5) and (1.6).
We were able to precisely model the dependence of the loss on N and D, and alternatively on N and S, when these parameters are varied simultaneously.
It is natural to conjecture that the scaling relations will apply to other generative modeling tasks with a maximum likelihood loss, and perhaps in other settings as well. To this purpose, it will be interesting to test these relations on other domains, such as images, audio, and video models, and perhaps also for random network distillation. At this point we do not know which of our results depend on the structure of natural language data, and which are universal. It would also be exciting to find a theoretical framework from which the scaling relations can be derived: a 'statistical mechanics' underlying the 'thermodynamics' we have observed. Such a theory might make it possible to derive other more precise predictions, and provide a systematic understanding of the limitations of the scaling laws.
In the domain of natural language, it will be important to investigate whether continued improvement on the loss translates into improvement on relevant language tasks. Smooth quantitative change can mask major qualitative improvements: "more is different". For example, the smooth aggregate growth of the economy provides no indication of the specific technological developments that underwrite it. Similarly, the smooth improvements in language model loss may hide seemingly qualitative changes in capability.
Our results strongly suggest that larger models will continue to perform better, and will also be much more sample efficient than has been previously appreciated. Big models may be more important than big data. In this context, further investigation into model parallelism is warranted. Deep models can be trained using pipelining, which splits parameters depth-wise between devices, but eventually requires increased batch sizes as more devices are used. Wide networks on the other hand are more amenable to parallelization, since large layers can be split between multiple workers with less serial dependency. Sparsity or branching may allow for even faster training of large networks through increased model parallelism. And using methods like [WRH17, WYL19], which grow networks as they train, it might be possible to remain on the compute-efficient frontier for an entire training run.