

Adam: A Method for Stochastic Optimization

Abstract

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has low memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, by which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.


Introduction

Stochastic gradient-based optimization is of core practical importance in many fields of science and engineering. Many problems in these fields can be cast as the optimization of some scalar parameterized objective function requiring maximization or minimization with respect to its parameters. If the function is differentiable w.r.t. its parameters, gradient descent is a relatively efficient optimization method, since the computation of first-order partial derivatives w.r.t. all the parameters is of the same computational complexity as just evaluating the function. Often, objective functions are stochastic. For example, many objective functions are composed of a sum of subfunctions evaluated at different subsamples of data; in this case optimization can be made more efficient by taking gradient steps w.r.t. individual subfunctions, i.e. stochastic gradient descent (SGD) or ascent. SGD proved itself as an efficient and effective optimization method that was central in many machine learning success stories, such as recent advances in deep learning. Objectives may also have other sources of noise than data subsampling, such as dropout regularization. For all such noisy objectives, efficient stochastic optimization techniques are required. The focus of this paper is on the optimization of stochastic objectives with high-dimensional parameter spaces. In these cases, higher-order optimization methods are ill-suited, and discussion in this paper will be restricted to first-order methods.


We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients; the name Adam is derived from adaptive moment estimation. Our method is designed to combine the advantages of two recently popular methods: AdaGrad, which works well with sparse gradients, and RMSProp, which works well in on-line and non-stationary settings; important connections to these and other stochastic optimization methods are clarified in section 5. Some of Adam's advantages are that the magnitudes of parameter updates are invariant to rescaling of the gradient, its stepsizes are approximately bounded by the stepsize hyperparameter, it does not require a stationary objective, it works with sparse gradients, and it naturally performs a form of step size annealing.


In section 2 we describe the algorithm and the properties of its update rule. Section 3 explains our initialization bias correction technique, and section 4 provides a theoretical analysis of Adam's convergence in online convex programming. Empirically, our method consistently outperforms other methods for a variety of models and datasets, as shown in section 6. Overall, we show that Adam is a versatile algorithm that scales to large-scale high-dimensional machine learning problems.


Algorithm 1: Adam, our proposed algorithm for stochastic optimization. See section 2 for details, and for a slightly more efficient (but less clear) order of computation. $g_t^2$ indicates the elementwise square $g_t \odot g_t$. Good default settings for the tested machine learning problems are $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$. All operations on vectors are element-wise. With $\beta_1^t$ and $\beta_2^t$ we denote $\beta_1$ and $\beta_2$ to the power $t$.

Require: $\alpha$: Stepsize
Require: $\beta_1, \beta_2 \in [0, 1)$: Exponential decay rates for the moment estimates
Require: $f(\theta)$: Stochastic objective function with parameters $\theta$
Require: $\theta_0$: Initial parameter vector

  $m_0 \leftarrow 0$ (Initialize 1st moment vector)
  $v_0 \leftarrow 0$ (Initialize 2nd moment vector)
  $t \leftarrow 0$ (Initialize timestep)
  while $\theta_t$ not converged do
    $t \leftarrow t + 1$
    $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$ (Get gradients w.r.t. stochastic objective at timestep $t$)
    $m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$ (Update biased first moment estimate)
    $v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ (Update biased second raw moment estimate)
    $\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$ (Compute bias-corrected first moment estimate)
    $\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$ (Compute bias-corrected second raw moment estimate)
    $\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ (Update parameters)
  end while
  return $\theta_t$ (Resulting parameters)

Algorithm

See algorithm 1 for pseudo-code of our proposed algorithm Adam. Let $f(\theta)$ be a noisy objective function: a stochastic scalar function that is differentiable w.r.t. parameters $\theta$. We are interested in minimizing the expected value of this function, $\mathbb{E}[f(\theta)]$, w.r.t. its parameters $\theta$. With $f_1(\theta), \ldots, f_T(\theta)$ we denote the realisations of the stochastic function at subsequent timesteps $1, \ldots, T$. The stochasticity might come from the evaluation at random subsamples (minibatches) of datapoints, or arise from inherent function noise. With $g_t = \nabla_\theta f_t(\theta)$ we denote the gradient, i.e. the vector of partial derivatives of $f_t$ w.r.t. $\theta$, evaluated at timestep $t$.


The algorithm updates exponential moving averages of the gradient ($m_t$) and the squared gradient ($v_t$), where the hyper-parameters $\beta_1, \beta_2 \in [0, 1)$ control the exponential decay rates of these moving averages. The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient. However, these moving averages are initialized as (vectors of) 0's, leading to moment estimates that are biased towards zero, especially during the initial timesteps, and especially when the decay rates are small (i.e. the $\beta$s are close to 1). The good news is that this initialization bias can be easily counteracted, resulting in bias-corrected estimates $\hat{m}_t$ and $\hat{v}_t$. See section 3 for more details.

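For concreteness, the full update loop of algorithm 1 can be sketched in NumPy. This is a minimal illustrative sketch rather than a reference implementation; the `grad` callback interface, the `num_steps` stopping rule (in place of a convergence test), and the quadratic test objective in the usage line are our own choices:

```python
import numpy as np

def adam(grad, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, num_steps=1000):
    """Run Adam on a stochastic objective given its gradient function."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)  # 1st moment vector
    v = np.zeros_like(theta)  # 2nd moment vector
    for t in range(1, num_steps + 1):
        g = grad(theta)                      # gradient at timestep t
        m = beta1 * m + (1 - beta1) * g      # biased 1st moment estimate
        v = beta2 * v + (1 - beta2) * g * g  # biased 2nd raw moment estimate
        m_hat = m / (1 - beta1 ** t)         # bias-corrected 1st moment
        v_hat = v / (1 - beta2 ** t)         # bias-corrected 2nd raw moment
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# usage: minimize f(theta) = theta^2, whose gradient is 2 * theta
theta = adam(lambda th: 2 * th, np.array([1.0]), alpha=0.01, num_steps=2000)
```

Since the effective steps are approximately bounded by $\alpha$ (section 2.1), the stepsize here is raised to 0.01 so the minimum is reachable within the given number of iterations.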

Note that the efficiency of algorithm 1 can, at the expense of clarity, be improved upon by changing the order of computation, e.g. by replacing the last three lines in the loop with the following lines: $\alpha_t = \alpha \cdot \sqrt{1 - \beta_2^t}/(1 - \beta_1^t)$ and $\theta_t \leftarrow \theta_{t-1} - \alpha_t \cdot m_t/(\sqrt{v_t} + \hat{\epsilon})$.

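The equivalence of this fused form with the three explicit lines can be checked numerically. In the sketch below we take $\hat{\epsilon} = \epsilon \cdot \sqrt{1 - \beta_2^t}$, one consistent choice that makes the two forms agree exactly; the gradient sequence is made up for illustration:

```python
import numpy as np

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
g_seq = [0.4, -0.1, 0.7]  # made-up gradient sequence
m = v = 0.0
clear, fused = [], []
for t, g in enumerate(g_seq, start=1):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    # clear form: explicit bias-corrected moment estimates
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    clear.append(alpha * m_hat / (np.sqrt(v_hat) + eps))
    # fused form: fold both corrections into the stepsize alpha_t
    alpha_t = alpha * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    eps_hat = eps * np.sqrt(1 - beta2 ** t)
    fused.append(alpha_t * m / (np.sqrt(v) + eps_hat))

assert np.allclose(clear, fused)  # both orders of computation agree
```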

2.1 Adam's update rule

An important property of Adam's update rule is its careful choice of stepsizes. Assuming $\epsilon = 0$, the effective step taken in parameter space at timestep $t$ is $\Delta_t = \alpha \cdot \hat{m}_t/\sqrt{\hat{v}_t}$. The effective stepsize has two upper bounds: $|\Delta_t| \le \alpha \cdot (1 - \beta_1)/\sqrt{1 - \beta_2}$ in the case $(1 - \beta_1) > \sqrt{1 - \beta_2}$, and $|\Delta_t| \le \alpha$ otherwise.


The first case only happens in the most severe case of sparsity: when a gradient has been zero at all timesteps except at the current timestep. For less sparse cases, the effective stepsize will be smaller. When $(1 - \beta_1) = \sqrt{1 - \beta_2}$ we have that $|\hat{m}_t/\sqrt{\hat{v}_t}| < 1$, therefore $|\Delta_t| < \alpha$. In more common scenarios, we will have that $\hat{m}_t/\sqrt{\hat{v}_t} \approx \pm 1$ since $|\mathbb{E}[g]/\sqrt{\mathbb{E}[g^2]}| \le 1$. The effective magnitude of the steps taken in parameter space at each timestep is approximately bounded by the stepsize setting $\alpha$, i.e., $|\Delta_t| \lessapprox \alpha$. This can be understood as establishing a trust region around the current parameter value, beyond which the current gradient estimate does not provide sufficient information. This typically makes it relatively easy to know the right scale of $\alpha$ in advance. For many machine learning models, for instance, we often know in advance that good optima are with high probability within some set region in parameter space; it is not uncommon, for example, to have a prior distribution over the parameters. Since $\alpha$ sets (an upper bound of) the magnitude of steps in parameter space, we can often deduce the right order of magnitude of $\alpha$ such that optima can be reached from $\theta_0$ within some number of iterations. With a slight abuse of terminology, we will call the ratio $\hat{m}_t/\sqrt{\hat{v}_t}$ the signal-to-noise ratio (SNR). With a smaller SNR the effective stepsize $\Delta_t$ will be closer to zero. This is a desirable property, since a smaller SNR means that there is greater uncertainty about whether the direction of $\hat{m}_t$ corresponds to the direction of the true gradient. For example, the SNR value typically becomes closer to 0 towards an optimum, leading to smaller effective steps in parameter space: a form of automatic annealing. The effective stepsize $\Delta_t$ is also invariant to the scale of the gradients; rescaling the gradients $g$ with factor $c$ will scale $\hat{m}_t$ with a factor $c$ and $\hat{v}_t$ with a factor $c^2$, which cancel out: $(c \cdot \hat{m}_t)/(\sqrt{c^2 \cdot \hat{v}_t}) = \hat{m}_t/\sqrt{\hat{v}_t}$.

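The scale invariance is easy to verify numerically. In the sketch below (with $\epsilon = 0$ and a made-up gradient sequence), rescaling every gradient by $c = 100$ leaves the effective steps $\Delta_t$ unchanged:

```python
import numpy as np

def effective_steps(grads, alpha=0.001, beta1=0.9, beta2=0.999):
    """Effective steps Delta_t = alpha * m_hat / sqrt(v_hat), with eps = 0."""
    m = v = 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias-corrected 1st moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected 2nd raw moment
        steps.append(alpha * m_hat / np.sqrt(v_hat))
    return np.array(steps)

grads = [0.5, -0.2, 0.8, 0.1]  # made-up gradient sequence
s_original = effective_steps(grads)
s_rescaled = effective_steps([100.0 * g for g in grads])  # rescaled by c = 100
assert np.allclose(s_original, s_rescaled)  # the factor c cancels out
```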

Initialization Bias Correction

As explained in section 2, Adam utilizes initialization bias correction terms. We will here derive the term for the second moment estimate; the derivation for the first moment estimate is completely analogous. Let $g$ be the gradient of the stochastic objective $f$, and we wish to estimate its second raw moment (uncentered variance) using an exponential moving average of the squared gradient, with decay rate $\beta_2$. Let $g_1, \ldots, g_T$ be the gradients at subsequent timesteps, each a draw from an underlying gradient distribution $g_t \sim p(g_t)$. Let us initialize the exponential moving average as $v_0 = 0$ (a vector of zeros). First note that the update at timestep $t$ of the exponential moving average, $v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ (where $g_t^2$ indicates the elementwise square $g_t \odot g_t$), can be written as a function of the gradients at all previous timesteps:


$$v_t = (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} \cdot g_i^2 \tag{1}$$

We wish to know how $\mathbb{E}[v_t]$, the expected value of the exponential moving average at timestep $t$, relates to the true second moment $\mathbb{E}[g_t^2]$, so we can correct for the discrepancy between the two. Taking expectations of the left-hand and right-hand sides of eq. (1):


$$\begin{aligned} \mathbb{E}[v_t] &= \mathbb{E}\left[(1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} \cdot g_i^2\right] && (2) \\ &= \mathbb{E}[g_t^2] \cdot (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{t-i} + \zeta && (3) \\ &= \mathbb{E}[g_t^2] \cdot (1 - \beta_2^t) + \zeta && (4) \end{aligned}$$

where $\zeta = 0$ if the true second moment $\mathbb{E}[g_i^2]$ is stationary; otherwise $\zeta$ can be kept small since the exponential decay rate $\beta_2$ can (and should) be chosen such that the exponential moving average assigns small weights to gradients too far in the past. What is left is the term $(1 - \beta_2^t)$, which is caused by initializing the running average with zeros. In algorithm 1 we therefore divide by this term to correct the initialization bias.


In case of sparse gradients, for a reliable estimate of the second moment one needs to average over many gradients by choosing a small value of $\beta_2$; however it is exactly this case of small $\beta_2$ where a lack of initialisation bias correction would lead to initial steps that are much larger.

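A small numerical example makes the bias concrete. Suppose the squared gradient is constant at 4.0 (so $\zeta = 0$ and the true second raw moment is 4.0); after only 10 steps with $\beta_2 = 0.999$, the raw moving average $v_t$ is still close to its zero initialization, while dividing by $(1 - \beta_2^t)$ recovers the true moment:

```python
beta2 = 0.999
g_sq = 4.0  # constant squared gradient, so E[g_t^2] = 4.0
v = 0.0
for t in range(1, 11):
    v = beta2 * v + (1 - beta2) * g_sq
# raw estimate is heavily biased toward its zero initialization:
# v = 4.0 * (1 - beta2**10), roughly 0.04
v_hat = v / (1 - beta2 ** 10)  # bias-corrected estimate, 4.0 up to rounding
```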

Convergence Analysis

We analyze the convergence of Adam using the online learning framework. Given an arbitrary, unknown sequence of convex cost functions $f_1(\theta), f_2(\theta), \ldots, f_T(\theta)$, at each time $t$ our goal is to predict the parameter $\theta_t$ and evaluate it on a previously unknown cost function $f_t$. Since the nature of the sequence is unknown in advance, we evaluate our algorithm using the regret, that is, the sum over all previous steps of the difference between the online prediction $f_t(\theta_t)$ and the value $f_t(\theta^*)$ at the best fixed parameter from a feasible set $\mathcal{X}$. Concretely, the regret is defined as:


$$R(T) = \sum_{t=1}^{T} \left[ f_t(\theta_t) - f_t(\theta^*) \right] \tag{5}$$

where $\theta^* = \arg\min_{\theta \in \mathcal{X}} \sum_{t=1}^{T} f_t(\theta)$. We show Adam has an $O(\sqrt{T})$ regret bound and a proof is given in the appendix. Our result is comparable to the best known bound for this general convex online learning problem. We also use some definitions to simplify our notation, where $g_t \triangleq \nabla f_t(\theta_t)$ and $g_{t,i}$ denotes its $i$th element. We define $g_{1:t,i} \in \mathbb{R}^t$ as a vector that contains the $i$th dimension of the gradients over all iterations till $t$, $g_{1:t,i} = [g_{1,i}, g_{2,i}, \ldots, g_{t,i}]$. Also, we define $\gamma \triangleq \frac{\beta_1^2}{\sqrt{\beta_2}}$. Our following theorem holds when the learning rate $\alpha_t$ is decaying at a rate of $t^{-\frac{1}{2}}$ and the first moment running average coefficient $\beta_{1,t}$ decays exponentially with $\lambda$, which is typically close to 1, e.g. $1 - 10^{-8}$.


Theorem 4.1. Assume that the function $f_t$ has bounded gradients, $\|\nabla f_t(\theta)\|_2 \le G$, $\|\nabla f_t(\theta)\|_\infty \le G_\infty$ for all $\theta \in \mathbb{R}^d$, and the distance between any $\theta_t$ generated by Adam is bounded, $\|\theta_n - \theta_m\|_2 \le D$, $\|\theta_m - \theta_n\|_\infty \le D_\infty$ for any $m, n \in \{1, \ldots, T\}$, and $\beta_1, \beta_2 \in [0, 1)$ satisfy $\frac{\beta_1^2}{\sqrt{\beta_2}} < 1$. Let $\alpha_t = \frac{\alpha}{\sqrt{t}}$ and $\beta_{1,t} = \beta_1 \lambda^{t-1}$, $\lambda \in (0, 1)$. Adam achieves the following guarantee, for all $T \ge 1$.


$$R(T) \le \frac{D^2}{2\alpha(1 - \beta_1)} \sum_{i=1}^{d} \sqrt{T \hat{v}_{T,i}} + \frac{\alpha(1 + \beta_1) G_\infty}{(1 - \beta_1)\sqrt{1 - \beta_2}\,(1 - \gamma)^2} \sum_{i=1}^{d} \left\| g_{1:T,i} \right\|_2 + \sum_{i=1}^{d} \frac{D_\infty^2 G_\infty \sqrt{1 - \beta_2}}{2\alpha(1 - \beta_1)(1 - \lambda)^2}$$

Our Theorem 4.1 implies that when the data features are sparse and gradients are bounded, the summation terms can be much smaller than their upper bounds, $\sum_{i=1}^{d} \|g_{1:T,i}\|_2 \ll dG_\infty\sqrt{T}$ and $\sum_{i=1}^{d} \sqrt{T \hat{v}_{T,i}} \ll dG_\infty\sqrt{T}$, in particular if the class of functions and data features are in the form of section 1.2 of (Duchi et al., 2011). Their results for the expected value $\mathbb{E}\left[\sum_{i=1}^{d} \|g_{1:T,i}\|_2\right]$ also apply to Adam. In particular, adaptive methods such as Adam and AdaGrad can achieve $O(\log d \cdot \sqrt{T})$, an improvement over the $O(\sqrt{dT})$ of non-adaptive methods. Decaying $\beta_{1,t}$ towards zero is important in our theoretical analysis and also matches previous empirical findings; e.g. (Sutskever et al., 2013) suggests that reducing the momentum coefficient at the end of training can improve convergence.


Finally, we can show the average regret of Adam converges,


Corollary 4.2. Assume that the function $f_t$ has bounded gradients, $\|\nabla f_t(\theta)\|_2 \le G$, $\|\nabla f_t(\theta)\|_\infty \le G_\infty$ for all $\theta \in \mathbb{R}^d$, and the distance between any $\theta_t$ generated by Adam is bounded, $\|\theta_n - \theta_m\|_2 \le D$, $\|\theta_m - \theta_n\|_\infty \le D_\infty$ for any $m, n \in \{1, \ldots, T\}$. Adam achieves the following guarantee, for all $T \ge 1$:


$$\frac{R(T)}{T} = O\left(\frac{1}{\sqrt{T}}\right)$$

This result can be obtained by using Theorem 4.1 and $\sum_{i=1}^{d} \|g_{1:T,i}\|_2 \le dG_\infty\sqrt{T}$. Thus, $\lim_{T \to \infty} \frac{R(T)}{T} = 0$.


Related Work

Optimization methods bearing a direct relation to Adam are RMSProp and AdaGrad; these relationships are discussed below. Other stochastic optimization methods include vSGD, AdaDelta and the natural Newton method from Roux & Fitzgibbon, all setting stepsizes by estimating curvature from first-order information. The Sum-of-Functions Optimizer (SFO) is a quasi-Newton method based on minibatches, but (unlike Adam) has memory requirements linear in the number of minibatch partitions of a dataset, which is often infeasible on memory-constrained systems such as a GPU. Like natural gradient descent (NGD), Adam employs a preconditioner that adapts to the geometry of the data, since $\hat{v}_t$ is an approximation to the diagonal of the Fisher information matrix; however, Adam's preconditioner (like AdaGrad's) is more conservative in its adaption than vanilla NGD by preconditioning with the square root of the inverse of the diagonal Fisher information matrix approximation.


RMSProp: An optimization method closely related to Adam is RMSProp. A version with momentum has sometimes been used. There are a few important differences between RMSProp with momentum and Adam: RMSProp with momentum generates its parameter updates using a momentum on the rescaled gradient, whereas Adam updates are directly estimated using a running average of first and second moment of the gradient. RMSProp also lacks a bias-correction term; this matters most in case of a value of $\beta_2$ close to 1 (required in case of sparse gradients), since in that case not correcting the bias leads to very large stepsizes and often divergence, as we also empirically demonstrate in section 6.4.


AdaGrad: An algorithm that works well for sparse gradients is AdaGrad. Its basic version updates parameters as $\theta_{t+1} = \theta_t - \alpha \cdot g_t/\sqrt{\sum_{i=1}^{t} g_i^2}$. Note that if we choose $\beta_2$ to be infinitesimally close to 1 from below, then $\lim_{\beta_2 \to 1} \hat{v}_t = t^{-1} \cdot \sum_{i=1}^{t} g_i^2$. AdaGrad corresponds to a version of Adam with $\beta_1 = 0$, infinitesimal $(1 - \beta_2)$ and a replacement of $\alpha$ by an annealed version $\alpha_t = \alpha \cdot t^{-1/2}$, namely $\theta_t - \alpha \cdot t^{-1/2} \cdot \hat{m}_t/\sqrt{\lim_{\beta_2 \to 1} \hat{v}_t} = \theta_t - \alpha \cdot t^{-1/2} \cdot g_t/\sqrt{t^{-1} \cdot \sum_{i=1}^{t} g_i^2} = \theta_t - \alpha \cdot g_t/\sqrt{\sum_{i=1}^{t} g_i^2}$. Note that this direct correspondence between Adam and AdaGrad does not hold when removing the bias-correction terms; without bias correction, like in RMSProp, a $\beta_2$ infinitesimally close to 1 would lead to infinitely large bias, and infinitely large parameter updates.

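The limit $\lim_{\beta_2 \to 1} \hat{v}_t = t^{-1} \cdot \sum_{i=1}^{t} g_i^2$ can be checked numerically; the sketch below uses a made-up, sparse-ish gradient sequence and $\beta_2 = 1 - 10^{-8}$:

```python
import numpy as np

grads = np.array([0.5, 0.0, 2.0, -1.0])  # made-up, sparse-ish gradients
beta2 = 1.0 - 1e-8  # infinitesimally close to 1 from below
v = 0.0
for g in grads:
    v = beta2 * v + (1 - beta2) * g * g
v_hat = v / (1 - beta2 ** len(grads))  # bias-corrected 2nd raw moment
# with beta2 -> 1, v_hat approaches the plain average of squared gradients
assert np.isclose(v_hat, np.mean(grads ** 2), rtol=1e-6)
```

Without the division by $(1 - \beta_2^t)$, the raw $v$ here would be of order $10^{-8}$, illustrating the infinitely large bias (and updates) mentioned above.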

Extensions

7.1 ADAMAX

In Adam, the update rule for individual weights is to scale their gradients inversely proportional to a (scaled) $L^2$ norm of their individual current and past gradients. We can generalize the $L^2$ norm based update rule to an $L^p$ norm based update rule. Such variants become numerically unstable for large $p$. However, in the special case where we let $p \to \infty$, a surprisingly simple and stable algorithm emerges; see algorithm 2. We'll now derive the algorithm. Let, in case of the $L^p$ norm, the stepsize at time $t$ be inversely proportional to $v_t^{1/p}$, where:

$$\begin{aligned} v_t &= \beta_2^p v_{t-1} + (1 - \beta_2^p)|g_t|^p && (6) \\ &= (1 - \beta_2^p) \sum_{i=1}^{t} \beta_2^{p(t-i)} \cdot |g_i|^p && (7) \end{aligned}$$

Note that the decay term is here equivalently parameterised as $\beta_2^p$ instead of $\beta_2$. Now let $p \to \infty$, and define $u_t = \lim_{p \to \infty} (v_t)^{1/p}$, then:


$$\begin{aligned} u_t = \lim_{p \to \infty} (v_t)^{1/p} &= \lim_{p \to \infty} \left( (1 - \beta_2^p) \sum_{i=1}^{t} \beta_2^{p(t-i)} \cdot |g_i|^p \right)^{1/p} && (8) \\ &= \lim_{p \to \infty} (1 - \beta_2^p)^{1/p} \left( \sum_{i=1}^{t} \beta_2^{p(t-i)} \cdot |g_i|^p \right)^{1/p} && (9) \\ &= \lim_{p \to \infty} \left( \sum_{i=1}^{t} \left( \beta_2^{(t-i)} \cdot |g_i| \right)^p \right)^{1/p} && (10) \\ &= \max\left( \beta_2^{t-1}|g_1|,\ \beta_2^{t-2}|g_2|,\ \ldots,\ \beta_2|g_{t-1}|,\ |g_t| \right) && (11) \end{aligned}$$

This corresponds to the remarkably simple recursive formula:


$$u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|) \tag{12}$$

with initial value $u_0 = 0$. Note that, conveniently enough, we don't need to correct for initialization bias in this case. Also note that the magnitude of parameter updates has a simpler bound with AdaMax than with Adam, namely: $|\Delta_t| \le \alpha$.

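The equivalence between the max expression in eq. (11) and the recursion in eq. (12) can be verified directly (the gradient values are made up for illustration):

```python
def u_recursive(grads, beta2):
    """u_t via eq. (12): u_t = max(beta2 * u_{t-1}, |g_t|), with u_0 = 0."""
    u = 0.0
    for g in grads:
        u = max(beta2 * u, abs(g))
    return u

def u_closed_form(grads, beta2):
    """u_t via eq. (11): the max over beta2^(t-i) * |g_i|."""
    t = len(grads)
    return max(beta2 ** (t - i) * abs(g) for i, g in enumerate(grads, start=1))

grads = [0.3, -1.2, 0.5]  # made-up gradients
assert abs(u_recursive(grads, 0.9) - u_closed_form(grads, 0.9)) < 1e-12
```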

Algorithm 2: AdaMax, a variant of Adam based on the infinity norm. See section 7.1 for details. Good default settings for the tested machine learning problems are $\alpha = 0.002$, $\beta_1 = 0.9$ and $\beta_2 = 0.999$. With $\beta_1^t$ we denote $\beta_1$ to the power $t$. Here, $(\alpha/(1 - \beta_1^t))$ is the learning rate with the bias-correction term for the first moment. All operations on vectors are element-wise.

Require: $\alpha$: Stepsize
Require: $\beta_1, \beta_2 \in [0, 1)$: Exponential decay rates
Require: $f(\theta)$: Stochastic objective function with parameters $\theta$
Require: $\theta_0$: Initial parameter vector

  $m_0 \leftarrow 0$ (Initialize 1st moment vector)
  $u_0 \leftarrow 0$ (Initialize the exponentially weighted infinity norm)
  $t \leftarrow 0$ (Initialize timestep)
  while $\theta_t$ not converged do
    $t \leftarrow t + 1$
    $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$ (Get gradients w.r.t. stochastic objective at timestep $t$)
    $m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$ (Update biased first moment estimate)
    $u_t \leftarrow \max(\beta_2 \cdot u_{t-1}, |g_t|)$ (Update the exponentially weighted infinity norm)
    $\theta_t \leftarrow \theta_{t-1} - (\alpha/(1 - \beta_1^t)) \cdot m_t/u_t$ (Update parameters)
  end while
  return $\theta_t$ (Resulting parameters)
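Mirroring the sketch style used for algorithm 1, algorithm 2 translates to NumPy as follows. This is again an illustrative sketch with our own `grad` callback and fixed iteration count; note the division by $u_t$ assumes every coordinate has seen at least one nonzero gradient:

```python
import numpy as np

def adamax(grad, theta0, alpha=0.002, beta1=0.9, beta2=0.999,
           num_steps=1000):
    """Minimize a stochastic objective with AdaMax (infinity-norm Adam)."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)  # 1st moment vector
    u = np.zeros_like(theta)  # exponentially weighted infinity norm
    for t in range(1, num_steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g           # biased 1st moment estimate
        u = np.maximum(beta2 * u, np.abs(g))      # infinity-norm update
        # (alpha / (1 - beta1**t)) is the bias-corrected learning rate;
        # no bias correction is needed for u
        theta = theta - (alpha / (1 - beta1 ** t)) * m / u
    return theta

# usage: minimize f(theta) = theta^2, whose gradient is 2 * theta
theta = adamax(lambda th: 2 * th, np.array([1.0]), alpha=0.01, num_steps=2000)
```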

7.2 Temporal averaging

Since the last iterate is noisy due to stochastic approximation, better generalization performance is often achieved by averaging. Previously in Moulines & Bach (2011), Polyak-Ruppert averaging (Polyak & Juditsky, 1992; Ruppert, 1988) has been shown to improve the convergence of standard SGD, where $\bar{\theta}_t = \frac{1}{t} \sum_{k=1}^{t} \theta_k$. Alternatively, an exponential moving average over the parameters can be used, giving higher weight to more recent parameter values. This can be trivially implemented by adding one line to the inner loops of algorithms 1 and 2: $\bar{\theta}_t \leftarrow \beta_2 \cdot \bar{\theta}_{t-1} + (1 - \beta_2)\theta_t$, with $\bar{\theta}_0 = 0$. Initialization bias can again be corrected by the estimator $\hat{\theta}_t = \bar{\theta}_t/(1 - \beta_2^t)$.

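In the sketch style used above, the averaging line looks as follows (the iterate values are made up; in practice $\theta_t$ would come from a run of algorithm 1 or 2):

```python
thetas = [1.0, 0.8, 0.7, 0.65]  # made-up iterates from an optimizer run
beta2 = 0.9
theta_bar = 0.0  # EMA of the parameters, initialized at zero
for t, theta_t in enumerate(thetas, start=1):
    theta_bar = beta2 * theta_bar + (1 - beta2) * theta_t
    theta_hat = theta_bar / (1 - beta2 ** t)  # bias-corrected average
# theta_hat is a weighted average that favors the more recent iterates
```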

Conclusion

We have introduced a simple and computationally efficient algorithm for gradient-based optimization of stochastic objective functions. Our method is aimed towards machine learning problems with large datasets and/or high-dimensional parameter spaces. The method combines the advantages of two recently popular optimization methods: the ability of AdaGrad to deal with sparse gradients, and the ability of RMSProp to deal with non-stationary objectives. The method is straightforward to implement and requires little memory. The experiments confirm the analysis on the rate of convergence in convex problems. Overall, we found Adam to be robust and well-suited to a wide range of non-convex optimization problems in the field of machine learning.
