Decoupled Weight Decay Regularization
Introduction
Adaptive gradient methods, such as AdaGrad, RMSProp, Adam, and most recently AMSGrad, have become a default choice for training feed-forward and recurrent neural networks. Nevertheless, state-of-the-art results for popular image classification datasets, such as CIFAR-10 and CIFAR-100, are still obtained by applying SGD with momentum. Furthermore, Wilson et al. suggested that adaptive gradient methods do not generalize as well as SGD with momentum when tested on a diverse set of deep learning tasks, such as image classification, character-level language modeling, and constituency parsing. Different hypotheses about the origins of this worse generalization have been investigated, such as the presence of sharp local minima and inherent problems of adaptive gradient methods. In this paper, we investigate whether it is better to use $L_2$ regularization or weight decay regularization when training deep neural networks with SGD and Adam.
Weight decay is equally effective in both SGD and Adam. For SGD, it is equivalent to $L_2$ regularization, but this equivalence does not hold for Adam.
Optimal weight decay depends on the total number of batch passes/weight updates. Our empirical analysis of SGD and Adam suggests that the longer the runtime (i.e., the more batch passes/weight updates to be performed), the smaller the optimal weight decay.
Adam can substantially benefit from a scheduled learning rate multiplier. The fact that Adam is an adaptive gradient algorithm and as such adapts the learning rate for each parameter does not rule out the possibility to substantially improve its performance by using a global learning rate multiplier, scheduled, e.g., by cosine annealing.
The main contribution of this paper is to improve regularization in Adam by decoupling the weight decay from the gradient-based update. In a comprehensive analysis, we show that Adam generalizes substantially better with decoupled weight decay than with $L_2$ regularization.
The main motivation of this paper is to improve Adam so that it becomes competitive with SGD with momentum even on those problems where it previously was not. We hope that, as a result, practitioners will no longer need to switch between Adam and SGD, which in turn should reduce the common issue of selecting dataset/task-specific training algorithms and their hyperparameters.
Decoupling the Weight Decay from the Gradient-based Update
In the weight decay described by Hanson & Pratt (1988), the weights $\theta$ decay exponentially as

$$\theta_{t+1} = (1-\lambda)\theta_t - \alpha \nabla f_t(\theta_t),$$

where $\lambda$ defines the rate of the weight decay per step and $\nabla f_t(\theta_t)$ is the $t$-th batch gradient to be multiplied by a learning rate $\alpha$.
Proposition 1 (Weight decay = $L_2$ regularization for standard SGD). Standard SGD with base learning rate $\alpha$ executes the same steps on batch loss functions $f_t(\theta)$ with weight decay $\lambda$ as it executes without weight decay on $f_t^{reg}(\theta) = f_t(\theta) + \frac{\lambda'}{2}\|\theta\|_2^2$, with $\lambda' = \frac{\lambda}{\alpha}$.
The proofs of this well-known fact, as well as our other propositions, are given in Appendix A.
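Proposition 1 can also be checked numerically. The following sketch (a toy one-dimensional loss and illustrative hyperparameters, not the paper's experiments) runs plain SGD both ways and confirms the trajectories coincide:

```python
# Numerical sketch of Proposition 1 on the toy loss f(theta) = theta**2:
# SGD with L2 regularization (coefficient lambda' = lambda/alpha folded
# into the gradient) takes exactly the same steps as SGD with weight
# decay lambda applied directly to the weights.
def grad(theta):
    return 2.0 * theta           # gradient of f(theta) = theta**2

alpha, lam = 0.1, 0.01           # learning rate, weight decay factor
lam_prime = lam / alpha          # equivalent L2 regularization factor

theta_l2, theta_wd = 1.0, 1.0
for _ in range(100):
    theta_l2 -= alpha * (grad(theta_l2) + lam_prime * theta_l2)  # L2 in the gradient
    theta_wd = (1.0 - lam) * theta_wd - alpha * grad(theta_wd)   # weight decay

print(abs(theta_l2 - theta_wd))  # zero up to floating-point error
```

Both updates reduce to $\theta \leftarrow (1-\lambda)\theta - \alpha\nabla f(\theta)$, which is exactly the equivalence the proposition states.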
Due to this equivalence, $L_2$ regularization is very frequently referred to as weight decay, including in popular deep learning libraries.
Looking first at the case of SGD, we propose to decay the weights simultaneously with the gradient-based update of $\theta_t$ in line 9 of Algorithm 1; this yields our SGD variant with decoupled weight decay (SGDW).
Algorithm 1 SGD with $L_2$ regularization and SGD with decoupled weight decay (SGDW)
1. given initial learning rate $\alpha$, momentum factor $\beta_1$, weight decay/$L_2$ regularization factor $\lambda$
2. initialize time step $t \leftarrow 0$, parameter vector $\theta_0$, first moment vector $m_0 \leftarrow 0$, schedule multiplier $\eta_0$
3. repeat
4. $t \leftarrow t + 1$
5. $\nabla f_t(\theta_{t-1}) \leftarrow \text{SelectBatch}(\theta_{t-1})$ ▷ select batch and return the corresponding gradient
6. $g_t \leftarrow \nabla f_t(\theta_{t-1}) + \lambda \theta_{t-1}$ ▷ the $\lambda\theta_{t-1}$ term implements $L_2$ regularization (omitted for SGDW)
7. $\eta_t \leftarrow \text{SetScheduleMultiplier}(t)$ ▷ can be fixed, decay, or be used for warm restarts
8. $m_t \leftarrow \beta_1 m_{t-1} + \eta_t \alpha g_t$
9. $\theta_t \leftarrow \theta_{t-1} - m_t - \eta_t \lambda \theta_{t-1}$ ▷ the $\eta_t\lambda\theta_{t-1}$ term implements decoupled weight decay (SGDW only)
10. until stopping criterion is met
11. return optimized parameters $\theta_t$
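A minimal sketch of a single SGDW step in Python (the function and variable names are our own, not from any library):

```python
# One SGDW step following Algorithm 1 (with the schedule multiplier eta
# exposed as an argument): the decoupled decay term eta * lam * theta
# acts directly on the weights and never enters the momentum buffer m.
# The L2 variant would instead add lam * theta to grad before line 8.
def sgdw_step(theta, m, grad, alpha, beta1, lam, eta=1.0):
    m = beta1 * m + eta * alpha * grad     # line 8: momentum on the pure gradient
    theta = theta - m - eta * lam * theta  # line 9: decoupled weight decay
    return theta, m

theta, m = sgdw_step(theta=1.0, m=0.0, grad=2.0, alpha=0.1, beta1=0.9, lam=0.01)
```

In the vector case the same operations apply element-wise; the key design point is that $\lambda$ multiplies the weights directly rather than being folded into the gradient.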
Now, let's turn to adaptive gradient algorithms, such as the popular optimizer Adam, which scale gradients by their historic magnitudes. Intuitively, when Adam is run on a loss function plus $L_2$ regularization, weights that tend to have large gradients do not get regularized as much as they would with decoupled weight decay, since the gradient of the regularizer is scaled down along with the gradient of the loss. This leads to an inequivalence of $L_2$ regularization and decoupled weight decay for adaptive gradient algorithms:
Proposition 2 (Weight decay ≠ $L_2$ regularization for adaptive gradients). Let $O$ denote an optimizer that has the iterates $\theta_{t+1} \leftarrow \theta_t - \alpha M_t \nabla f_t(\theta_t)$ when run on batch loss function $f_t(\theta)$ without weight decay, and $\theta_{t+1} \leftarrow (1-\lambda)\theta_t - \alpha M_t \nabla f_t(\theta_t)$ when run on $f_t(\theta)$ with weight decay, respectively, with a preconditioner $M_t \neq k I$ for all scalars $k$. Then, for $O$ there exists no $L_2$ coefficient $\lambda'$ such that running $O$ on the batch loss $f_t^{reg}(\theta) = f_t(\theta) + \frac{\lambda'}{2}\|\theta\|_2^2$ without weight decay is equivalent to running $O$ on $f_t(\theta)$ with decay $\lambda \in \mathbb{R}^+$.
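The inequivalence is easy to observe numerically. The sketch below (toy scalar loss, illustrative hyperparameters) runs bias-corrected Adam once with $L_2$ regularization, using the $\lambda' = \lambda/\alpha$ mapping that works for SGD, and once with decoupled weight decay; the trajectories differ from the very first step:

```python
import math

# Toy scalar Adam run two ways: with L2 regularization folded into the
# gradient, and with decoupled weight decay. The L2 term gets rescaled
# by the second-moment estimate, so the two runs diverge even though the
# lambda' = lambda/alpha mapping makes them identical for SGD.
def adam_run(decoupled, steps=20, alpha=0.1, beta1=0.9, beta2=0.999,
             eps=1e-8, lam=0.01):
    lam_prime = lam / alpha
    theta, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2.0 * theta                  # gradient of f(theta) = theta**2
        if not decoupled:
            g += lam_prime * theta       # L2 regularization in the gradient
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)     # bias corrections
        v_hat = v / (1 - beta2 ** t)
        step = alpha * m_hat / (math.sqrt(v_hat) + eps)
        decay = lam * theta if decoupled else 0.0  # decoupled weight decay
        theta = theta - step - decay
    return theta

print(adam_run(decoupled=False), adam_run(decoupled=True))  # the two runs disagree
```

At $t=1$ the bias-corrected ratio $\hat{m}/\sqrt{\hat{v}}$ normalizes the gradient magnitude away entirely, so the $L_2$ term has almost no effect on the step, while the decoupled decay still shrinks $\theta$ by the full factor $(1-\lambda)$.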
We decouple weight decay and loss-based gradient updates in Adam as shown in line 12 of Algorithm 2; this gives rise to our variant of Adam with decoupled weight decay (AdamW).
Algorithm 2 Adam with $L_2$ regularization and Adam with decoupled weight decay (AdamW)
1. given $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, weight decay/$L_2$ regularization factor $\lambda \in \mathbb{R}$
2. initialize time step $t \leftarrow 0$, parameter vector $\theta_0$, first moment vector $m_0 \leftarrow 0$, second moment vector $v_0 \leftarrow 0$, schedule multiplier $\eta_0$
3. repeat
4. $t \leftarrow t + 1$
5. $\nabla f_t(\theta_{t-1}) \leftarrow \text{SelectBatch}(\theta_{t-1})$ ▷ select batch and return the corresponding gradient
6. $g_t \leftarrow \nabla f_t(\theta_{t-1}) + \lambda \theta_{t-1}$ ▷ the $\lambda\theta_{t-1}$ term implements $L_2$ regularization (omitted for AdamW)
7. $m_t \leftarrow \beta_1 m_{t-1} + (1-\beta_1) g_t$ ▷ here and below all operations are element-wise
8. $v_t \leftarrow \beta_2 v_{t-1} + (1-\beta_2) g_t^2$
9. $\hat{m}_t \leftarrow m_t/(1-\beta_1^t)$ ▷ $\beta_1$ is taken to the power of $t$
10. $\hat{v}_t \leftarrow v_t/(1-\beta_2^t)$ ▷ $\beta_2$ is taken to the power of $t$
11. $\eta_t \leftarrow \text{SetScheduleMultiplier}(t)$ ▷ can be fixed, decay, or be used for warm restarts
12. $\theta_t \leftarrow \theta_{t-1} - \eta_t \left( \alpha \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) + \lambda \theta_{t-1} \right)$ ▷ the $\eta_t\lambda\theta_{t-1}$ term implements decoupled weight decay (AdamW only)
13. until stopping criterion is met
14. return optimized parameters $\theta_t$
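A minimal sketch of a single scalar AdamW step (names are our own; the vector case applies the same operations element-wise):

```python
import math

# One scalar AdamW step following Algorithm 2 (with the schedule
# multiplier eta exposed as an argument). The decay term lam * theta
# sits outside the division by sqrt(v_hat), so it is not rescaled by
# gradient history; the L2 variant would instead add lam * theta to
# grad before the moment updates.
def adamw_step(theta, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, lam=0.01, eta=1.0):
    m = beta1 * m + (1 - beta1) * grad           # first moment
    v = beta2 * v + (1 - beta2) * grad * grad    # second moment
    m_hat = m / (1 - beta1 ** t)                 # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * (alpha * m_hat / (math.sqrt(v_hat) + eps)
                           + lam * theta)        # line 12: decoupled decay
    return theta, m, v

theta, m, v = adamw_step(theta=1.0, m=0.0, v=0.0, grad=1.0, t=1)
```

Keeping $\lambda\theta_{t-1}$ outside the adaptive rescaling is the entire difference between AdamW and Adam with $L_2$ regularization.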
Having shown that weight decay and $L_2$ regularization are not equivalent for adaptive gradient methods, we can still characterize the effect of decoupled weight decay for the simplified case of a fixed preconditioner:
Proposition 3 (Weight decay = scale-adjusted $L_2$ regularization for adaptive gradient algorithms with a fixed preconditioner). Let $O$ denote an algorithm with the same characteristics as in Proposition 2, using a fixed preconditioner matrix $M_t = \mathrm{diag}(s)^{-1}$ (with $s_i > 0$ for all $i$). Then, $O$ with base learning rate $\alpha$ executes the same steps on batch loss functions $f_t(\theta)$ with weight decay $\lambda$ as it executes without weight decay on the scale-adjusted regularized batch loss

$$f_t^{sreg}(\theta) = f_t(\theta) + \frac{\lambda'}{2}\left\|\theta \odot \sqrt{s}\right\|_2^2,$$

where $\odot$ and $\sqrt{\cdot}$ denote element-wise multiplication and square root, respectively, and $\lambda' = \frac{\lambda}{\alpha}$.
We note that this proposition does not directly apply to practical adaptive gradient algorithms, since these change the preconditioner matrix at every step. Nevertheless, it can still provide intuition about the equivalent loss function being optimized in each step: parameters $\theta_i$ with a large inverse preconditioner $s_i$ (in Adam, those with a large historic gradient magnitude) are regularized relatively more than they would be with plain $L_2$ regularization, since the corresponding regularizer term is weighted by $s_i$.
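Proposition 3 can also be checked numerically. The sketch below (a two-parameter quadratic toy loss and a fixed diagonal preconditioner; all values are illustrative) confirms that preconditioned gradient descent with weight decay matches the run without decay on the scale-adjusted regularized loss:

```python
# Numerical sketch of Proposition 3 with fixed M = diag(1/s):
# preconditioned SGD with weight decay lambda matches preconditioned SGD
# without decay on f + (lambda'/2) * sum(s_i * theta_i**2), whose
# gradient contributes lambda' * s_i * theta_i, with lambda' = lambda/alpha.
alpha, lam = 0.1, 0.01
lam_prime = lam / alpha
s = [1.0, 4.0]                       # inverse preconditioner diagonal

def grad(theta):                     # gradient of f(theta) = sum(theta_i**2)
    return [2.0 * x for x in theta]

theta_wd = [1.0, 1.0]                # weight decay run
theta_sreg = [1.0, 1.0]              # scale-adjusted L2 run
for _ in range(50):
    theta_wd = [(1 - lam) * x - alpha * gx / si
                for x, gx, si in zip(theta_wd, grad(theta_wd), s)]
    g_sreg = [gx + lam_prime * si * x
              for x, gx, si in zip(theta_sreg, grad(theta_sreg), s)]
    theta_sreg = [x - alpha * gx / si
                  for x, gx, si in zip(theta_sreg, g_sreg, s)]

print(max(abs(a - b) for a, b in zip(theta_wd, theta_sreg)))  # ~0
```

The cancellation is exact because $\alpha M (\lambda' s \odot \theta) = \lambda\theta$ for $M = \mathrm{diag}(s)^{-1}$, so the extra gradient term reproduces the multiplicative shrinkage $(1-\lambda)\theta$.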
Justification Of Decoupled Weight Decay Via A View Of Adaptive Gradient Methods As Bayesian Filtering
We now discuss a justification of decoupled weight decay within the framework of Bayesian filtering, which Aitchison proposed as a unified theory of adaptive gradient algorithms. After we posted a preliminary version of the current paper on arXiv, Aitchison noted that his theory "gives us a theoretical framework in which we can understand the superiority of this weight decay over $L_2$ regularization."
Aitchison views stochastic optimization of parameters $\theta$ as a Bayesian filtering problem: the goal is to infer a posterior distribution over the optimal parameter values, which drift over time as training progresses and the other parameters change, by updating this posterior with the evidence provided by each new minibatch gradient. Adaptive gradient methods then arise as particular approximations to this filtering problem.
Decoupled weight decay very naturally fits into this unified framework as part of the state-transition distribution: Aitchison assumes a slow change of the optimizer according to a Gaussian of the form

$$p(\theta^*_{t+1} \mid \theta^*_t) = \mathcal{N}\!\big((1-\lambda)\theta^*_t,\ \sigma^2 I\big),$$

where $\theta^*_t$ denotes the optimizer of the objective at time $t$. The mean $(1-\lambda)\theta^*_t$ shrinks the optimizer toward zero at every step, i.e., the state transition itself implements decoupled weight decay, independently of any gradient-based rescaling.
Conclusion And Future Work
Following suggestions that adaptive gradient methods such as Adam might lead to worse generalization than SGD with momentum (Wilson et al., 2017), we identified and exposed the inequivalence of $L_2$ regularization and weight decay for Adam, and proposed decoupled weight decay (AdamW) as a simple fix.
Our results obtained on image classification datasets must be verified on a wider range of tasks, especially ones where the use of regularization is expected to be important. It would be interesting to integrate our findings on weight decay into other methods which attempt to improve Adam, e.g., normalized direction-preserving Adam. While we focused our experimental analysis on Adam, we believe that similar results also hold for other adaptive gradient methods, such as AdaGrad and AMSGrad.