Adam: A Method for Stochastic Optimization
Abstract
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has low memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, which inspired Adam, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
Introduction
Stochastic gradient-based optimization is of core practical importance in many fields of science and engineering. Many problems in these fields can be cast as the optimization of some scalar parameterized objective function requiring maximization or minimization with respect to its parameters. If the function is differentiable w.r.t. its parameters, gradient descent is a relatively efficient optimization method, since the computation of first-order partial derivatives w.r.t. all the parameters is of the same computational complexity as just evaluating the function. Often, objective functions are stochastic. For example, many objective functions are composed of a sum of subfunctions evaluated at different subsamples of data; in this case optimization can be made more efficient by taking gradient steps w.r.t. individual subfunctions, i.e. stochastic gradient descent (SGD) or ascent. SGD proved itself as an efficient and effective optimization method that was central in many machine learning success stories, such as recent advances in deep learning. Objectives may also have other sources of noise than data subsampling, such as dropout regularization. For all such noisy objectives, efficient stochastic optimization techniques are required. The focus of this paper is on the optimization of stochastic objectives with high-dimensional parameter spaces. In these cases, higher-order optimization methods are ill-suited, and discussion in this paper will be restricted to first-order methods.
We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients; the name Adam is derived from adaptive moment estimation. Our method is designed to combine the advantages of two recently popular methods: AdaGrad, which works well with sparse gradients, and RMSProp, which works well in on-line and non-stationary settings; important connections to these and other stochastic optimization methods are clarified in section 5. Some of Adam's advantages are that the magnitudes of parameter updates are invariant to rescaling of the gradient, its stepsizes are approximately bounded by the stepsize hyperparameter, it does not require a stationary objective, it works with sparse gradients, and it naturally performs a form of step size annealing.
In section 2 we describe the algorithm and the properties of its update rule. Section 3 explains our initialization bias correction technique, and section 4 provides a theoretical analysis of Adam's convergence in online convex programming. Empirically, our method consistently outperforms other methods for a variety of models and datasets, as shown in section 6. Overall, we show that Adam is a versatile algorithm that scales to large-scale high-dimensional machine learning problems.
Algorithm 1: Adam, our proposed algorithm for stochastic optimization. See section 2 for details, and for a slightly more efficient (but less clear) order of computation.
Require: α: Stepsize
Require: β1, β2 ∈ [0, 1): Exponential decay rates for the moment estimates
Require: f(θ): Stochastic objective function with parameters θ
Require: θ0: Initial parameter vector
  m0 ← 0 (Initialize 1st moment vector)
  v0 ← 0 (Initialize 2nd moment vector)
  t ← 0 (Initialize timestep)
while θt not converged do
  t ← t + 1
  g_t ← ∇_θ f_t(θ_{t−1}) (Get gradients w.r.t. stochastic objective at timestep t)
  m_t ← β1 · m_{t−1} + (1 − β1) · g_t (Update biased first moment estimate)
  v_t ← β2 · v_{t−1} + (1 − β2) · g_t² (Update biased second raw moment estimate; g_t² is the element-wise square)
  m̂_t ← m_t / (1 − β1^t) (Compute bias-corrected first moment estimate)
  v̂_t ← v_t / (1 − β2^t) (Compute bias-corrected second raw moment estimate)
  θ_t ← θ_{t−1} − α · m̂_t / (√v̂_t + ε) (Update parameters)
end while
return θ_t (Resulting parameters)
Algorithm
See algorithm 1 for pseudo-code of our proposed algorithm Adam. Let f(θ) be a noisy objective function: a stochastic scalar function that is differentiable w.r.t. parameters θ. We are interested in minimizing the expected value of this function, E[f(θ)], w.r.t. its parameters θ. With f_1(θ), ..., f_T(θ) we denote the realisations of the stochastic function at subsequent timesteps 1, ..., T. The stochasticity might come from the evaluation at random subsamples (minibatches) of datapoints, or arise from inherent function noise. With g_t = ∇_θ f_t(θ) we denote the gradient, i.e. the vector of partial derivatives of f_t w.r.t. θ, evaluated at timestep t.
The algorithm updates exponential moving averages of the gradient (m_t) and the squared gradient (v_t), where the hyper-parameters β1, β2 ∈ [0, 1) control the exponential decay rates of these moving averages. The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient. However, these moving averages are initialized as (vectors of) 0's, leading to moment estimates that are biased towards zero, especially during the initial timesteps, and especially when the decay rates are small (i.e. the βs are close to 1). The good news is that this initialization bias can be easily counteracted, resulting in bias-corrected estimates m̂_t and v̂_t. See section 3 for more details.
Note that the efficiency of algorithm 1 can, at the expense of clarity, be improved upon by changing the order of computation, e.g. by replacing the last three lines in the loop with the following lines:

  α_t = α · √(1 − β2^t) / (1 − β1^t)
  θ_t ← θ_{t−1} − α_t · m_t / (√v_t + ε̂)
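The full update loop of algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation; the function and variable names (`adam_minimize`, `grad_fn`) are my own, and `grad_fn` stands in for whatever stochastic gradient oracle the problem provides.

```python
import numpy as np

def adam_minimize(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, num_steps=1000):
    """Minimal sketch of Adam (algorithm 1) given a gradient oracle."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)  # exponential moving average of the gradient (1st moment)
    v = np.zeros_like(theta)  # exponential moving average of the squared gradient (2nd moment)
    for t in range(1, num_steps + 1):
        g = grad_fn(theta)                   # gradient at timestep t
        m = beta1 * m + (1 - beta1) * g      # update biased 1st moment estimate
        v = beta2 * v + (1 - beta2) * g * g  # update biased 2nd raw moment estimate
        m_hat = m / (1 - beta1 ** t)         # bias-corrected 1st moment estimate
        v_hat = v / (1 - beta2 ** t)         # bias-corrected 2nd raw moment estimate
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2*theta.
theta = adam_minimize(lambda th: 2.0 * th, [1.0, -2.0], alpha=0.1, num_steps=500)
```

On this toy quadratic the iterates oscillate briefly around the optimum and then settle, since the signal-to-noise ratio m̂_t/√v̂_t shrinks as the gradient alternates sign near the minimum.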
2.1 Adam's update rule
An important property of Adam's update rule is its careful choice of stepsizes. Assuming ε = 0, the effective step taken in parameter space at timestep t is Δ_t = α · m̂_t / √v̂_t. The effective stepsize has two upper bounds: |Δ_t| ≤ α · (1 − β1) / √(1 − β2) in the case (1 − β1) > √(1 − β2), and |Δ_t| ≤ α otherwise.
The first case only happens in the most severe case of sparsity: when a gradient has been zero at all timesteps except at the current timestep. For less sparse cases, the effective stepsize will be smaller. When (1 − β1) = √(1 − β2) we have that |m̂_t / √v̂_t| < 1, therefore |Δ_t| < α. In more common scenarios, we will have that m̂_t / √v̂_t ≈ ±1 since |E[g] / √E[g²]| ≤ 1, so the effective magnitude of the steps taken in parameter space at each timestep is approximately bounded by the stepsize setting α. The ratio m̂_t / √v̂_t can be understood as a signal-to-noise ratio (SNR): with a smaller SNR the effective stepsize Δ_t will be closer to zero. This is a desirable property, since a smaller SNR means greater uncertainty about whether the direction of m̂_t corresponds to the direction of the true gradient. The SNR typically becomes closer to 0 towards an optimum, leading to smaller effective steps in parameter space: a form of automatic annealing.
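These two regimes can be illustrated numerically. The sketch below (my own construction, with synthetic gradient streams and ε taken as 0) evaluates the effective step for the severe-sparsity case, where a single nonzero gradient arrives after a long run of zeros, and for a persistent-gradient case, where |Δ_t| stays close to α.

```python
import numpy as np

alpha, beta1, beta2 = 0.001, 0.9, 0.999

# Severe sparsity: all gradients zero except a single g = 1 at timestep t.
t = 5000
m_hat = (1 - beta1) * 1.0 / (1 - beta1 ** t)   # first moment after one lone gradient
v_hat = (1 - beta2) * 1.0 / (1 - beta2 ** t)   # second moment after one lone gradient
delta_sparse = alpha * m_hat / np.sqrt(v_hat)  # effective step |Delta_t| with eps = 0

# The corresponding upper bound: alpha * (1 - beta1) / sqrt(1 - beta2) (~3.16 * alpha here).
sparse_bound = alpha * (1 - beta1) / np.sqrt(1 - beta2)

# Common scenario: a persistent gradient direction with mild noise,
# where m_hat / sqrt(v_hat) stays near 1 and |Delta_t| is roughly alpha.
rng = np.random.default_rng(0)
m = v = 0.0
steps = []
for i in range(1, 2001):
    g = 1.0 + 0.1 * rng.standard_normal()
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    steps.append(abs(alpha * (m / (1 - beta1 ** i)) / np.sqrt(v / (1 - beta2 ** i))))
max_step = max(steps)
```

With the default decay rates, `delta_sparse` comes out close to the 3.16·α bound, while `max_step` in the persistent case stays near α.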
Initialization Bias Correction
As explained in section 2, Adam utilizes initialization bias correction terms. We will here derive the term for the second moment estimate; the derivation for the first moment estimate is completely analogous. Let g be the gradient of the stochastic objective f, and suppose we wish to estimate its second raw moment (uncentered variance) using an exponential moving average of the squared gradient, with decay rate β2. Let g_1, ..., g_T be the gradients at subsequent timesteps, each a draw from an underlying gradient distribution g_t ∼ p(g_t). Let us initialize the exponential moving average as v_0 = 0 (a vector of zeros). First note that the update at timestep t of the exponential moving average, v_t = β2 · v_{t−1} + (1 − β2) · g_t², can be written as a function of the gradients at all previous timesteps:

  v_t = (1 − β2) Σ_{i=1}^{t} β2^{t−i} · g_i²
We wish to know how E[v_t], the expected value of the exponential moving average at timestep t, relates to the true second moment E[g_t²], so we can correct for the discrepancy between the two. Taking expectations of the expression above:

  E[v_t] = E[(1 − β2) Σ_{i=1}^{t} β2^{t−i} · g_i²] = E[g_t²] · (1 − β2^t) + ζ
where ζ = 0 if the true second moment E[g_i²] is stationary; otherwise ζ can be kept small since the exponential decay rate β2 can (and should) be chosen such that the exponential moving average assigns small weights to gradients too far in the past. What is left is the term (1 − β2^t), which is caused by initializing the running average with zeros. In algorithm 1 we therefore divide by this term to correct the initialization bias.
In case of sparse gradients, for a reliable estimate of the second moment one needs to average over many gradients by choosing a value of β2 close to 1; however, it is exactly this case of β2 close to 1 where a lack of initialization bias correction would lead to initial steps that are much larger.
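The effect of the (1 − β2^t) term can be checked with a small Monte-Carlo simulation. The sketch below (my own, with synthetic Gaussian gradients of known second moment) shows that the raw moving average is biased towards zero by exactly this factor, and that dividing by it removes the bias.

```python
import numpy as np

rng = np.random.default_rng(1)
beta2, t_max, runs = 0.999, 50, 20000

v = np.zeros(runs)                        # one running average per independent run
for t in range(1, t_max + 1):
    g = rng.normal(0.0, 2.0, size=runs)   # stationary gradients with E[g^2] = 4
    v = beta2 * v + (1 - beta2) * g * g   # update of the second moment estimate

raw = v.mean()                            # ~ 4 * (1 - beta2**50), far below 4
corrected = raw / (1 - beta2 ** t_max)    # bias-corrected estimate, ~ 4
```

After only 50 steps with β2 = 0.999 the raw average is roughly 5% of the true second moment, which is why algorithm 1 divides v_t by (1 − β2^t) before using it.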
Convergence Analysis
We analyze the convergence of Adam using the online learning framework proposed in Zinkevich (2003). Given an arbitrary, unknown sequence of convex cost functions f_1(θ), f_2(θ), ..., f_T(θ), at each time t our goal is to predict the parameter θ_t and evaluate it on a previously unknown cost function f_t. Since the nature of the sequence is unknown in advance, we evaluate our algorithm using the regret, that is the sum over all previous timesteps of the difference between the online prediction f_t(θ_t) and the best fixed parameter f_t(θ*) from a feasible set X. Concretely, the regret is defined as:

  R(T) = Σ_{t=1}^{T} [f_t(θ_t) − f_t(θ*)]
where θ* = argmin_{θ∈X} Σ_{t=1}^{T} f_t(θ). We show Adam has an O(√T) regret bound, comparable to the best known bound for this general convex online learning problem. We also use some definitions to simplify our notation: g_t ≜ ∇f_t(θ_t), with g_{t,i} its i-th element, and g_{1:t,i} = [g_{1,i}, g_{2,i}, ..., g_{t,i}] ∈ R^t the vector containing the i-th dimension of the gradients over all iterations up to t.
Theorem 4.1. Assume that the function f_t has bounded gradients, ‖∇f_t(θ)‖_2 ≤ G, ‖∇f_t(θ)‖_∞ ≤ G_∞ for all θ ∈ R^d, and that the distance between any θ_t generated by Adam is bounded, ‖θ_n − θ_m‖_2 ≤ D, ‖θ_m − θ_n‖_∞ ≤ D_∞ for any m, n ∈ {1, ..., T}, and that β1, β2 ∈ [0, 1) satisfy β1²/√β2 < 1. Let α_t = α/√t and β_{1,t} = β1·λ^{t−1}, λ ∈ (0, 1). Adam achieves the following guarantee, for all T ≥ 1:

  R(T) ≤ (D² / (2α(1 − β1))) Σ_{i=1}^{d} √(T · v̂_{T,i}) + (α(1 + β1)G_∞ / ((1 − β1)√(1 − β2)(1 − γ)²)) Σ_{i=1}^{d} ‖g_{1:T,i}‖_2 + Σ_{i=1}^{d} (D_∞² G_∞ √(1 − β2)) / (2α(1 − β1)(1 − λ)²)

where γ ≜ β1²/√β2.
Our Theorem 4.1 implies that when the data features are sparse and the gradients are bounded, the summation term can be much smaller than its upper bound Σ_{i=1}^{d} ‖g_{1:T,i}‖_2 ≤ dG_∞√T, in particular if the class of functions and data features are in the form of section 1.2 of Duchi et al. (2011). Their results for the expected value E[‖g_{1:T,i}‖_2] also apply to Adam. In particular, adaptive methods such as Adam and AdaGrad can achieve O(log d · √T), an improvement over O(√(dT)) for non-adaptive methods.
Finally, we can show that the average regret of Adam converges:
Corollary 4.2. Assume that the function f_t has bounded gradients, ‖∇f_t(θ)‖_2 ≤ G, ‖∇f_t(θ)‖_∞ ≤ G_∞ for all θ ∈ R^d, and that the distance between any θ_t generated by Adam is bounded, ‖θ_n − θ_m‖_2 ≤ D, ‖θ_m − θ_n‖_∞ ≤ D_∞ for any m, n ∈ {1, ..., T}. Adam achieves the following guarantee, for all T ≥ 1:

  R(T)/T = O(1/√T)
This result can be obtained by using Theorem 4.1 and Σ_{i=1}^{d} ‖g_{1:T,i}‖_2 ≤ dG_∞√T. Thus, lim_{T→∞} R(T)/T = 0.
Related Work
Optimization methods bearing a direct relation to Adam are RMSProp and AdaGrad; these relationships are discussed below. Other stochastic optimization methods include vSGD, AdaDelta and the natural Newton method from Roux & Fitzgibbon, all setting stepsizes by estimating curvature from first-order information. The Sum-of-Functions Optimizer (SFO) is a quasi-Newton method based on minibatches, but (unlike Adam) has memory requirements linear in the number of minibatch partitions of a dataset, which is often infeasible on memory-constrained systems such as a GPU. Like natural gradient descent (NGD), Adam employs a preconditioner that adapts to the geometry of the data, since v̂_t is an approximation to the diagonal of the Fisher information matrix; however, Adam's preconditioner (like AdaGrad's) is more conservative in its adaptation than vanilla NGD, preconditioning with the square root of the inverse of the diagonal Fisher information matrix approximation.
RMSProp: An optimization method closely related to Adam is RMSProp. A version with momentum has sometimes been used. There are a few important differences between RMSProp with momentum and Adam: RMSProp with momentum generates its parameter updates using a momentum on the rescaled gradient, whereas Adam updates are directly estimated using a running average of the first and second moment of the gradient. RMSProp also lacks a bias-correction term; this matters most in case of a value of β2 close to 1 (required in case of sparse gradients), since in that case not correcting the bias leads to very large stepsizes and often divergence, as we also empirically demonstrate in section 6.4.
AdaGrad: An algorithm that works well for sparse gradients is AdaGrad. Its basic version updates parameters as θ_{t+1} = θ_t − α · g_t / √(Σ_{i=1}^{t} g_i²). Note that if we choose β2 to be infinitesimally close to 1 from below, then lim_{β2→1} v̂_t = t^{−1} · Σ_{i=1}^{t} g_i². AdaGrad corresponds to a version of Adam with β1 = 0, infinitesimal (1 − β2) and a replacement of α by an annealed version α_t = α · t^{−1/2}, namely θ_t − α · t^{−1/2} · m̂_t / √(lim_{β2→1} v̂_t) = θ_t − α · t^{−1/2} · g_t / √(t^{−1} Σ_{i=1}^{t} g_i²) = θ_t − α · g_t / √(Σ_{i=1}^{t} g_i²).
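For comparison with the Adam loop above, the basic AdaGrad update is sketched below. The helper name `adagrad_step` is my own; the update itself divides each gradient by the root of the running sum of squared gradients, so frequently-updated coordinates get smaller effective stepsizes.

```python
import numpy as np

def adagrad_step(theta, g, accum, alpha=0.01, eps=1e-8):
    """One step of basic AdaGrad: theta <- theta - alpha * g / sqrt(sum of g^2)."""
    accum = accum + g * g                          # accumulate squared gradients
    theta = theta - alpha * g / (np.sqrt(accum) + eps)
    return theta, accum

# Toy usage on f(theta) = ||theta||^2 (gradient 2*theta).
theta = np.array([1.0, -2.0])
accum = np.zeros_like(theta)
for _ in range(2000):
    theta, accum = adagrad_step(theta, 2.0 * theta, accum, alpha=0.1)
```

Note the contrast with Adam: the denominator here is a monotonically growing sum rather than a decaying average, which is why AdaGrad's effective stepsize anneals like t^{−1/2} without an explicit schedule.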
Extensions
7.1 AdaMax
In Adam, the update rule for individual weights is to scale their gradients inversely proportional to a (scaled) L² norm of their individual current and past gradients. We can generalize the L² norm based update rule to an Lᵖ norm based update rule. Such variants become numerically unstable for large p. However, in the special case where we let p → ∞, a surprisingly simple and stable algorithm emerges; see algorithm 2. In case of the Lᵖ norm, the stepsize at time t is inversely proportional to v_t^{1/p}, where:

  v_t = β2ᵖ · v_{t−1} + (1 − β2ᵖ) · |g_t|ᵖ = (1 − β2ᵖ) Σ_{i=1}^{t} β2^{p(t−i)} · |g_i|ᵖ
Note that the decay term is here equivalently parameterised as β2ᵖ instead of β2. Now let p → ∞, and define u_t = lim_{p→∞} (v_t)^{1/p}; then:

  u_t = lim_{p→∞} ((1 − β2ᵖ) Σ_{i=1}^{t} β2^{p(t−i)} · |g_i|ᵖ)^{1/p} = max(β2^{t−1}·|g_1|, β2^{t−2}·|g_2|, ..., β2·|g_{t−1}|, |g_t|)
This corresponds to the remarkably simple recursive formula:

  u_t = max(β2 · u_{t−1}, |g_t|)
with initial value u_0 = 0. Note that, conveniently enough, we need not correct for initialization bias in this case. Also note that the magnitude of parameter updates has a simpler bound with AdaMax than with Adam, namely: |Δ_t| ≤ α.
Algorithm 2: AdaMax, a variant of Adam based on the infinity norm. See section 7.1 for details. Good default settings for the tested machine learning problems are α = 0.002, β1 = 0.9 and β2 = 0.999. With β1^t we denote β1 to the power t; (α / (1 − β1^t)) is the learning rate with the bias-correction term for the first moment.
Require: α: Stepsize
Require: β1, β2 ∈ [0, 1): Exponential decay rates
Require: f(θ): Stochastic objective function with parameters θ
Require: θ0: Initial parameter vector
  m0 ← 0 (Initialize 1st moment vector)
  u0 ← 0 (Initialize the exponentially weighted infinity norm)
  t ← 0 (Initialize timestep)
while θt not converged do
  t ← t + 1
  g_t ← ∇_θ f_t(θ_{t−1}) (Get gradients w.r.t. stochastic objective at timestep t)
  m_t ← β1 · m_{t−1} + (1 − β1) · g_t (Update biased first moment estimate)
  u_t ← max(β2 · u_{t−1}, |g_t|) (Update the exponentially weighted infinity norm)
  θ_t ← θ_{t−1} − (α / (1 − β1^t)) · m_t / u_t (Update parameters)
end while
return θ_t (Resulting parameters)
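Algorithm 2 can be sketched in NumPy much like Adam. This is an illustrative sketch, not a reference implementation; the tiny constant added to u guards against division by zero before any gradient has been observed, and is my own addition (the pseudo-code above has no ε).

```python
import numpy as np

def adamax_minimize(grad_fn, theta0, alpha=0.002, beta1=0.9, beta2=0.999,
                    num_steps=1000):
    """Minimal sketch of AdaMax (algorithm 2) given a gradient oracle."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)   # 1st moment estimate
    u = np.zeros_like(theta)   # exponentially weighted infinity norm
    for t in range(1, num_steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        u = np.maximum(beta2 * u, np.abs(g))  # u_t = max(beta2 * u_{t-1}, |g_t|)
        # Only the first moment needs bias correction; u_t is bias-free.
        theta -= (alpha / (1 - beta1 ** t)) * m / (u + 1e-12)
    return theta

theta = adamax_minimize(lambda th: 2.0 * th, [1.0, -2.0], alpha=0.1, num_steps=500)
```

The only changes relative to the Adam sketch are the max-based denominator and the absence of second-moment bias correction.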
7.2 Temporal averaging
Since the last iterate is noisy due to stochastic approximation, better generalization performance is often achieved by averaging. Previously in Moulines & Bach (2011), Polyak-Ruppert averaging (Polyak & Juditsky, 1992; Ruppert, 1988) has been shown to improve the convergence of standard SGD, where θ̄_T = (1/T) Σ_{t=1}^{T} θ_t. Polyak-Ruppert averaging can be trivially adapted to Adam, giving higher weight to more recent parameter values by means of an exponential moving average: θ̄_t = β2 · θ̄_{t−1} + (1 − β2) · θ_t, with θ̄_0 = 0. Initialization bias can again be corrected by the estimator θ̂_t = θ̄_t / (1 − β2^t).
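A small sketch of this temporal averaging (my own, with synthetic noisy iterates standing in for the output of a stochastic optimizer whose optimum is at 3.0):

```python
import numpy as np

rng = np.random.default_rng(2)
beta2 = 0.99          # decay rate of the parameter average
theta_bar = 0.0       # EMA of the iterates, initialized at zero
T = 1000
for t in range(1, T + 1):
    theta_t = 3.0 + rng.normal(0.0, 0.5)  # noisy iterates fluctuating around 3.0
    theta_bar = beta2 * theta_bar + (1 - beta2) * theta_t

theta_hat = theta_bar / (1 - beta2 ** T)  # bias-corrected averaged parameters
```

The averaged estimate `theta_hat` sits much closer to the optimum than a typical individual iterate, illustrating why averaging helps when the last iterate is noisy.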
Conclusion
We have introduced a simple and computationally efficient algorithm for gradient-based optimization of stochastic objective functions. Our method is aimed towards machine learning problems with large datasets and/or high-dimensional parameter spaces. The method combines the advantages of two recently popular optimization methods: the ability of AdaGrad to deal with sparse gradients, and the ability of RMSProp to deal with non-stationary objectives. The method is straightforward to implement and requires little memory. The experiments confirm the analysis on the rate of convergence in convex problems. Overall, we found Adam to be robust and well-suited to a wide range of non-convex optimization problems in the field of machine learning.