Decoupled Weight Decay Regularization
Introduction
Adaptive gradient methods, such as AdaGrad, RMSProp, Adam, and most recently AMSGrad, have become a default choice for training feed-forward and recurrent neural networks. Nevertheless, state-of-the-art results for popular image classification datasets, such as CIFAR-10 and CIFAR-100, are still obtained by applying SGD with momentum. Furthermore, Wilson et al. suggested that adaptive gradient methods do not generalize as well as SGD with momentum when tested on a diverse set of deep learning tasks, such as image classification, character-level language modeling, and constituency parsing. Different hypotheses about the origins of this worse generalization have been investigated, such as the presence of sharp local minima and inherent problems of adaptive gradient methods. In this paper, we investigate whether it is better to use $L_2$ regularization or weight decay regularization when training deep neural networks with SGD and Adam.
Weight decay is equally effective in both SGD and Adam. For SGD, it is equivalent to $L_2$ regularization, but this equivalence does not hold for Adam.
Optimal weight decay depends on the total number of batch passes/weight updates. Our empirical analysis of SGD and Adam suggests that the longer the runtime (i.e., the more batch passes/weight updates to be performed), the smaller the optimal weight decay.
Adam can substantially benefit from a scheduled learning rate multiplier. The fact that Adam is an adaptive gradient algorithm and as such adapts the learning rate for each parameter does not rule out the possibility to substantially improve its performance by using a global learning rate multiplier, scheduled, e.g., by cosine annealing.
The main contribution of this paper is to improve regularization in Adam by decoupling the weight decay from the gradient-based update. In a comprehensive analysis, we show that Adam generalizes substantially better with decoupled weight decay than with $L_2$ regularization.
The main motivation of this paper is to improve Adam so that it becomes competitive with SGD with momentum even on those problems where it previously was not. We hope that, as a result, practitioners will no longer need to switch between Adam and SGD, which in turn should reduce the common issue of selecting dataset/task-specific training algorithms and their hyperparameters.
Decoupling the Weight Decay from the Gradient-based Update
In the weight decay described by Hanson & Pratt (1988), the weights $\theta$ decay exponentially as

$$\theta_{t+1} = (1-\lambda)\theta_t - \alpha \nabla f_t(\theta_t),$$

where $\lambda$ defines the rate of the weight decay per step and $\nabla f_t(\theta_t)$ is the $t$-th batch gradient to be multiplied by a learning rate $\alpha$.
Proposition 1 (Weight decay = $L_2$ regularization for standard SGD). Standard SGD with base learning rate $\alpha$ executes the same steps on batch loss functions $f_t(\theta)$ with weight decay $\lambda$ as it executes without weight decay on $f_t^{reg}(\theta) = f_t(\theta) + \frac{\lambda'}{2}\|\theta\|_2^2$, with $\lambda' = \frac{\lambda}{\alpha}$.
The proofs of this well-known fact, as well as our other propositions, are given in Appendix A.
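Proposition 1 can also be checked numerically. The following sketch (a toy one-dimensional loss and illustrative hyperparameters, not the paper's experiments) runs plain SGD both ways and confirms the trajectories coincide:

```python
# Numerical sketch of Proposition 1 on the toy loss f(theta) = theta**2:
# SGD with L2 regularization (coefficient lambda' = lambda/alpha folded
# into the gradient) takes exactly the same steps as SGD with weight
# decay lambda applied directly to the weights.
def grad(theta):
    return 2.0 * theta           # gradient of f(theta) = theta**2

alpha, lam = 0.1, 0.01           # learning rate, weight decay factor
lam_prime = lam / alpha          # equivalent L2 regularization factor

theta_l2, theta_wd = 1.0, 1.0
for _ in range(100):
    theta_l2 -= alpha * (grad(theta_l2) + lam_prime * theta_l2)  # L2 in the gradient
    theta_wd = (1.0 - lam) * theta_wd - alpha * grad(theta_wd)   # weight decay

print(abs(theta_l2 - theta_wd))  # zero up to floating-point error
```

Both updates reduce to $\theta \leftarrow (1-\lambda)\theta - \alpha\nabla f(\theta)$, which is exactly the equivalence the proposition states.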
Due to this equivalence, $L_2$ regularization is very frequently referred to as weight decay, including in popular deep learning libraries.
Looking first at the case of SGD, we propose to decay the weights simultaneously with the gradient-based update of $\theta_t$ in line 9 of Algorithm 1; this yields our SGD variant with decoupled weight decay (SGDW).
Algorithm 1 SGD with $L_2$ regularization and SGD with decoupled weight decay (SGDW)
1. given initial learning rate $\alpha$, momentum factor $\beta_1$, weight decay/$L_2$ regularization factor $\lambda$
2. initialize time step $t \leftarrow 0$, parameter vector $\theta_0$, first moment vector $m_0 \leftarrow 0$, schedule multiplier $\eta_0$
3. repeat
4. $t \leftarrow t + 1$
5. $\nabla f_t(\theta_{t-1}) \leftarrow \text{SelectBatch}(\theta_{t-1})$ ▷ select batch and return the corresponding gradient
6. $g_t \leftarrow \nabla f_t(\theta_{t-1}) + \lambda \theta_{t-1}$ ▷ the $\lambda\theta_{t-1}$ term implements $L_2$ regularization (omitted for SGDW)
7. $\eta_t \leftarrow \text{SetScheduleMultiplier}(t)$ ▷ can be fixed, decay, or be used for warm restarts
8. $m_t \leftarrow \beta_1 m_{t-1} + \eta_t \alpha g_t$
9. $\theta_t \leftarrow \theta_{t-1} - m_t - \eta_t \lambda \theta_{t-1}$ ▷ the $\eta_t\lambda\theta_{t-1}$ term implements decoupled weight decay (SGDW only)
10. until stopping criterion is met
11. return optimized parameters $\theta_t$
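A minimal sketch of a single SGDW step in Python (the function and variable names are our own, not from any library):

```python
# One SGDW step following Algorithm 1 (with the schedule multiplier eta
# exposed as an argument): the decoupled decay term eta * lam * theta
# acts directly on the weights and never enters the momentum buffer m.
# The L2 variant would instead add lam * theta to grad before line 8.
def sgdw_step(theta, m, grad, alpha, beta1, lam, eta=1.0):
    m = beta1 * m + eta * alpha * grad     # line 8: momentum on the pure gradient
    theta = theta - m - eta * lam * theta  # line 9: decoupled weight decay
    return theta, m

theta, m = sgdw_step(theta=1.0, m=0.0, grad=2.0, alpha=0.1, beta1=0.9, lam=0.01)
```

In the vector case the same operations apply element-wise; the key design point is that $\lambda$ multiplies the weights directly rather than being folded into the gradient.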
Now, let's turn to adaptive gradient algorithms, such as the popular optimizer Adam, which scale gradients by their historic magnitudes. Intuitively, when Adam is run on a loss function plus $L_2$ regularization, weights that tend to have large gradients do not get regularized as much as they would with decoupled weight decay, since the gradient of the regularizer is scaled down along with the gradient of the loss. This leads to an inequivalence of $L_2$ regularization and decoupled weight decay for adaptive gradient algorithms:
Proposition 2 (Weight decay ≠ $L_2$ regularization for adaptive gradients). Let $O$ denote an optimizer that has the iterates $\theta_{t+1} \leftarrow \theta_t - \alpha M_t \nabla f_t(\theta_t)$ when run on batch loss function $f_t(\theta)$ without weight decay, and $\theta_{t+1} \leftarrow (1-\lambda)\theta_t - \alpha M_t \nabla f_t(\theta_t)$ when run on $f_t(\theta)$ with weight decay, respectively, with a preconditioner $M_t \neq k I$ for all scalars $k$. Then, for $O$ there exists no $L_2$ coefficient $\lambda'$ such that running $O$ on the batch loss $f_t^{reg}(\theta) = f_t(\theta) + \frac{\lambda'}{2}\|\theta\|_2^2$ without weight decay is equivalent to running $O$ on $f_t(\theta)$ with decay $\lambda \in \mathbb{R}^+$.
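The inequivalence is easy to observe numerically. The sketch below (toy scalar loss, illustrative hyperparameters) runs bias-corrected Adam once with $L_2$ regularization, using the $\lambda' = \lambda/\alpha$ mapping that works for SGD, and once with decoupled weight decay; the trajectories differ from the very first step:

```python
import math

# Toy scalar Adam run two ways: with L2 regularization folded into the
# gradient, and with decoupled weight decay. The L2 term gets rescaled
# by the second-moment estimate, so the two runs diverge even though the
# lambda' = lambda/alpha mapping makes them identical for SGD.
def adam_run(decoupled, steps=20, alpha=0.1, beta1=0.9, beta2=0.999,
             eps=1e-8, lam=0.01):
    lam_prime = lam / alpha
    theta, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2.0 * theta                  # gradient of f(theta) = theta**2
        if not decoupled:
            g += lam_prime * theta       # L2 regularization in the gradient
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)     # bias corrections
        v_hat = v / (1 - beta2 ** t)
        step = alpha * m_hat / (math.sqrt(v_hat) + eps)
        decay = lam * theta if decoupled else 0.0  # decoupled weight decay
        theta = theta - step - decay
    return theta

print(adam_run(decoupled=False), adam_run(decoupled=True))  # the two runs disagree
```

At $t=1$ the bias-corrected ratio $\hat{m}/\sqrt{\hat{v}}$ normalizes the gradient magnitude away entirely, so the $L_2$ term has almost no effect on the step, while the decoupled decay still shrinks $\theta$ by the full factor $(1-\lambda)$.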
We decouple weight decay and loss-based gradient updates in Adam as shown in line 12 of Algorithm 2; this gives rise to our variant of Adam with decoupled weight decay (AdamW).
Algorithm 2 Adam with $L_2$ regularization and Adam with decoupled weight decay (AdamW)
1. given $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, weight decay/$L_2$ regularization factor $\lambda \in \mathbb{R}$
2. initialize time step $t \leftarrow 0$, parameter vector $\theta_0$, first moment vector $m_0 \leftarrow 0$, second moment vector $v_0 \leftarrow 0$, schedule multiplier $\eta_0$
3. repeat
4. $t \leftarrow t + 1$
5. $\nabla f_t(\theta_{t-1}) \leftarrow \text{SelectBatch}(\theta_{t-1})$ ▷ select batch and return the corresponding gradient
6. $g_t \leftarrow \nabla f_t(\theta_{t-1}) + \lambda \theta_{t-1}$ ▷ the $\lambda\theta_{t-1}$ term implements $L_2$ regularization (omitted for AdamW)
7. $m_t \leftarrow \beta_1 m_{t-1} + (1-\beta_1) g_t$ ▷ here and below all operations are element-wise
8. $v_t \leftarrow \beta_2 v_{t-1} + (1-\beta_2) g_t^2$
9. $\hat{m}_t \leftarrow m_t/(1-\beta_1^t)$ ▷ $\beta_1$ is taken to the power of $t$
10. $\hat{v}_t \leftarrow v_t/(1-\beta_2^t)$ ▷ $\beta_2$ is taken to the power of $t$
11. $\eta_t \leftarrow \text{SetScheduleMultiplier}(t)$ ▷ can be fixed, decay, or be used for warm restarts
12. $\theta_t \leftarrow \theta_{t-1} - \eta_t \left( \alpha \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) + \lambda \theta_{t-1} \right)$ ▷ the $\eta_t\lambda\theta_{t-1}$ term implements decoupled weight decay (AdamW only)
13. until stopping criterion is met
14. return optimized parameters $\theta_t$
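A minimal sketch of a single scalar AdamW step (names are our own; the vector case applies the same operations element-wise):

```python
import math

# One scalar AdamW step following Algorithm 2 (with the schedule
# multiplier eta exposed as an argument). The decay term lam * theta
# sits outside the division by sqrt(v_hat), so it is not rescaled by
# gradient history; the L2 variant would instead add lam * theta to
# grad before the moment updates.
def adamw_step(theta, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, lam=0.01, eta=1.0):
    m = beta1 * m + (1 - beta1) * grad           # first moment
    v = beta2 * v + (1 - beta2) * grad * grad    # second moment
    m_hat = m / (1 - beta1 ** t)                 # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * (alpha * m_hat / (math.sqrt(v_hat) + eps)
                           + lam * theta)        # line 12: decoupled decay
    return theta, m, v

theta, m, v = adamw_step(theta=1.0, m=0.0, v=0.0, grad=1.0, t=1)
```

Keeping $\lambda\theta_{t-1}$ outside the adaptive rescaling is the entire difference between AdamW and Adam with $L_2$ regularization.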
Having shown that weight decay and $L_2$ regularization are not equivalent for adaptive gradient methods, we can still characterize the effect of decoupled weight decay for the simplified case of a fixed preconditioner:
Proposition 3 (Weight decay = scale-adjusted $L_2$ regularization for adaptive gradient algorithms with a fixed preconditioner). Let $O$ denote an algorithm with the same characteristics as in Proposition 2, using a fixed preconditioner matrix $M_t = \mathrm{diag}(s)^{-1}$ (with $s_i > 0$ for all $i$). Then, $O$ with base learning rate $\alpha$ executes the same steps on batch loss functions $f_t(\theta)$ with weight decay $\lambda$ as it executes without weight decay on the scale-adjusted regularized batch loss

$$f_t^{sreg}(\theta) = f_t(\theta) + \frac{\lambda'}{2}\left\|\theta \odot \sqrt{s}\right\|_2^2,$$

where $\odot$ and $\sqrt{\cdot}$ denote element-wise multiplication and square root, respectively, and $\lambda' = \frac{\lambda}{\alpha}$.
We note that this proposition does not directly apply to practical adaptive gradient algorithms, since these change the preconditioner matrix at every step. Nevertheless, it can still provide intuition about the equivalent loss function being optimized in each step: parameters $\theta_i$ with a large inverse preconditioner $s_i$ (in Adam, those with a large historic gradient magnitude) are regularized relatively more than they would be with plain $L_2$ regularization, since the corresponding regularizer term is weighted by $s_i$.
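Proposition 3 can also be checked numerically. The sketch below (a two-parameter quadratic toy loss and a fixed diagonal preconditioner; all values are illustrative) confirms that preconditioned gradient descent with weight decay matches the run without decay on the scale-adjusted regularized loss:

```python
# Numerical sketch of Proposition 3 with fixed M = diag(1/s):
# preconditioned SGD with weight decay lambda matches preconditioned SGD
# without decay on f + (lambda'/2) * sum(s_i * theta_i**2), whose
# gradient contributes lambda' * s_i * theta_i, with lambda' = lambda/alpha.
alpha, lam = 0.1, 0.01
lam_prime = lam / alpha
s = [1.0, 4.0]                       # inverse preconditioner diagonal

def grad(theta):                     # gradient of f(theta) = sum(theta_i**2)
    return [2.0 * x for x in theta]

theta_wd = [1.0, 1.0]                # weight decay run
theta_sreg = [1.0, 1.0]              # scale-adjusted L2 run
for _ in range(50):
    theta_wd = [(1 - lam) * x - alpha * gx / si
                for x, gx, si in zip(theta_wd, grad(theta_wd), s)]
    g_sreg = [gx + lam_prime * si * x
              for x, gx, si in zip(theta_sreg, grad(theta_sreg), s)]
    theta_sreg = [x - alpha * gx / si
                  for x, gx, si in zip(theta_sreg, g_sreg, s)]

print(max(abs(a - b) for a, b in zip(theta_wd, theta_sreg)))  # ~0
```

The cancellation is exact because $\alpha M (\lambda' s \odot \theta) = \lambda\theta$ for $M = \mathrm{diag}(s)^{-1}$, so the extra gradient term reproduces the multiplicative shrinkage $(1-\lambda)\theta$.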
Justification Of Decoupled Weight Decay Via A View Of Adaptive Gradient Methods As Bayesian Filtering
We now discuss a justification of decoupled weight decay within the framework of Bayesian filtering, which Aitchison proposed as a unified theory of adaptive gradient algorithms. After we posted a preliminary version of the current paper on arXiv, Aitchison noted that his theory "gives us a theoretical framework in which we can understand the superiority of this weight decay over $L_2$ regularization."
Aitchison views stochastic optimization of parameters $\theta$ as a Bayesian filtering problem: the goal is to infer a posterior distribution over the optimal parameter values, which drift over time as training progresses and the other parameters change, by updating this posterior with the evidence provided by each new minibatch gradient. Adaptive gradient methods then arise as particular approximations to this filtering problem.
Decoupled weight decay very naturally fits into this unified framework as part of the state-transition distribution: Aitchison assumes a slow change of the optimizer according to a Gaussian of the form

$$p(\theta^*_{t+1} \mid \theta^*_t) = \mathcal{N}\!\big((1-\lambda)\theta^*_t,\ \sigma^2 I\big),$$

where $\theta^*_t$ denotes the optimizer of the objective at time $t$. The mean $(1-\lambda)\theta^*_t$ shrinks the optimizer toward zero at every step, i.e., the state transition itself implements decoupled weight decay, independently of any gradient-based rescaling.
Conclusion And Future Work
Following suggestions that adaptive gradient methods such as Adam might lead to worse generalization than SGD with momentum (Wilson et al., 2017), we identified and exposed the inequivalence of $L_2$ regularization and weight decay for Adam, and proposed decoupled weight decay (AdamW) as a simple fix.
Our results obtained on image classification datasets must be verified on a wider range of tasks, especially ones where the use of regularization is expected to be important. It would be interesting to integrate our findings on weight decay into other methods which attempt to improve Adam, e.g., normalized direction-preserving Adam. While we focused our experimental analysis on Adam, we believe that similar results also hold for other adaptive gradient methods, such as AdaGrad and AMSGrad.