Difference Between Gradient Descent, Stochastic Gradient Descent, Mini-Batch Gradient Descent, Momentum Gradient Descent

When tuning how a model updates its weight and bias parameters, have you considered which optimization algorithm will make the model train faster and perform better? Should you use gradient descent, stochastic gradient descent, or Adam?

This post describes the main differences between the different optimization algorithms and how to choose the best optimization method.

The job of an optimization algorithm is to minimize (or maximize) a loss function E(x) by adjusting the model's parameters during training.

The model's internal parameters are used to compute the predicted values, and the loss function E(x) measures how far those predictions deviate from the actual target values Y in the training set.

Weights (W) and biases (b) are examples of such internal parameters. They are used to compute the output values and play a major role in training neural network models.

Because the internal parameters determine how efficiently a model trains and how accurate its results are, we use optimization strategies and algorithms to update them so that they approach or reach their optimal values.

Optimization algorithms fall into two broad categories:

1. First-order optimization algorithm

This algorithm uses the gradient value of each parameter to minimize or maximize the loss function E(x). The most commonly used first-order optimization algorithm is gradient descent.

Gradient of a function: the multivariate generalization of the derivative dy/dx, which represents the instantaneous rate of change of y with respect to x. For a function of several variables, the gradient takes the place of the derivative, and it is computed from the partial derivatives. A key difference between the two is that the gradient of a function forms a vector field.

In short, derivatives are used to analyze single-variable functions, while gradients are used for multivariate functions. Further theoretical details are beyond the scope of this post.
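To make the link between partial derivatives and the gradient concrete, here is a minimal sketch that approximates each partial derivative with a central difference. The helper name numerical_gradient and the example function f are illustrative choices, not something from the original post.

```python
def numerical_gradient(f, point, h=1e-6):
    """Approximate the gradient of f at `point` with central differences."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h
        backward[i] -= h
        # Partial derivative of f with respect to coordinate i.
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def f(p):
    # Illustrative function f(x, y) = x^2 + 3y; its exact gradient is (2x, 3).
    x, y = p
    return x ** 2 + 3 * y


grad_at_point = numerical_gradient(f, [2.0, 1.0])
print(grad_at_point)  # close to [4.0, 3.0]
```

Each entry of the result is one partial derivative, and together the entries form the gradient vector at that point.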

2. Second-order optimization algorithm

Second-order optimization algorithms use second-order derivatives, collected in a matrix called the Hessian, to minimize or maximize the loss function. Because second derivatives are expensive to compute, these methods are not widely used.
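As a rough illustration of what using the second derivative can look like, the sketch below applies one step of Newton's method to a one-variable loss. Newton's method is only one example of a second-order algorithm and is not named in the original post. The loss J(θ) = (θ − 3)² and the function names are assumptions made for the example.

```python
def newton_step(theta, first_derivative, second_derivative):
    # Newton's method rescales the gradient step by the curvature.
    return theta - first_derivative(theta) / second_derivative(theta)


def d_loss(theta):
    # Illustrative loss J(theta) = (theta - 3)^2 has derivative 2 * (theta - 3).
    return 2.0 * (theta - 3.0)


def d2_loss(theta):
    # The second derivative of that loss is the constant 2.
    return 2.0


theta_after_one_step = newton_step(0.0, d_loss, d2_loss)
print(theta_after_one_step)  # 3.0, the minimizer of this quadratic
```

On a quadratic, a single Newton step lands on the minimum. In a network with many parameters the second derivative becomes a full Hessian matrix, which is where the computational cost comes from.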

Gradient descent

Gradient descent is one of the most important techniques, and a foundation, for training and optimizing intelligent systems. Its purpose is:

To find the minimum, control the variance, and update the model parameters so that the model eventually converges.

The parameter update rule is θ = θ − η · ∇_θ J(θ), where η is the learning rate and ∇_θ J(θ) is the gradient of the loss function J(θ) with respect to θ.

This is the most commonly used optimization algorithm in neural networks.
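The update rule can be sketched in a few lines. This is a minimal example on an assumed one-parameter loss J(θ) = (θ − 3)². The function name gradient_descent and the learning rate and step count are illustrative values, not recommendations.

```python
def gradient_descent(gradient, theta, learning_rate=0.1, steps=100):
    """Repeatedly apply theta = theta - eta * gradient(theta)."""
    for _ in range(steps):
        theta = theta - learning_rate * gradient(theta)
    return theta


def loss_gradient(theta):
    # Illustrative loss J(theta) = (theta - 3)^2 has gradient 2 * (theta - 3).
    return 2.0 * (theta - 3.0)


theta_min = gradient_descent(loss_gradient, theta=0.0)
print(theta_min)  # very close to 3.0
```

Each step moves θ against the gradient, so the distance to the minimum shrinks by a constant factor (0.8 with these settings) on every iteration.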

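Today, gradient descent is mainly used to update the weights of neural network models, that is, to adjust the model's parameters in one direction so as to minimize the loss function.

Backpropagation, popularized in the 1980s, made it practical to train multi-layer neural networks. In the forward pass, the network computes the products of the input signals and their corresponding weights, then applies an activation function to the sum of those products. Turning input signals into output signals this way is an important means of modeling complex nonlinear functions, and the nonlinear activation functions let the model learn almost any functional mapping. In the backward pass, the network propagates the error backwards and updates the weights with gradient descent: it computes the gradient of the error function E with respect to the weight parameters W, then updates the weights in the direction opposite to that gradient.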
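The forward and backward passes can be illustrated with a single neuron. This is a deliberately small sketch that assumes a sigmoid activation and a squared error E = 0.5 · (a − target)². The input values, initial weights and learning rate are made up for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, bias, inputs):
    # Weighted sum of the inputs, then the nonlinear activation.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def backward(weights, bias, inputs, target):
    # Chain rule for E = 0.5 * (a - target)^2:
    # dE/dw_i = (a - target) * a * (1 - a) * x_i, and dE/db drops the x_i.
    a = forward(weights, bias, inputs)
    delta = (a - target) * a * (1.0 - a)
    return [delta * x for x in inputs], delta


def error(weights, bias, inputs, target):
    return 0.5 * (forward(weights, bias, inputs) - target) ** 2


inputs, target = [1.0, 2.0], 1.0
weights, bias, learning_rate = [0.5, -0.5], 0.0, 0.5

error_before = error(weights, bias, inputs, target)
grad_w, grad_b = backward(weights, bias, inputs, target)
# Step against the gradient of the error.
weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
bias = bias - learning_rate * grad_b
error_after = error(weights, bias, inputs, target)
print(error_before, error_after)  # the error decreases after the update
```

A real network repeats the same chain-rule computation layer by layer, passing the error term backwards from the output to the input.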

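Variants of gradient descent

Traditional batch gradient descent computes the gradient over the entire dataset but performs only one update per pass, so it is slow and hard to control on large datasets, and the data may not even fit in memory.

The learning rate η determines how large each weight update is. Batch gradient descent converges to the global optimum on convex error surfaces and may settle in a local optimum on non-convex ones.

Standard batch gradient descent has another problem: on large datasets it performs redundant computation, because it recomputes gradients for similar examples before every update.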
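A minimal sketch of batch gradient descent, assuming a toy linear regression problem with a mean squared error loss. The data, learning rate and number of passes are illustrative. Note that every single update needs a full pass over the data.

```python
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]  # toy data on the line y = 2x + 1


def batch_gradient(w, b, xs, ys):
    # Gradient of the mean squared error over the WHOLE dataset.
    n = len(xs)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b


w, b = 0.0, 0.0
learning_rate = 0.05

for epoch in range(2000):
    grad_w, grad_b = batch_gradient(w, b, xs, ys)  # one full pass ...
    w -= learning_rate * grad_w                    # ... one update
    b -= learning_rate * grad_b

print(w, b)  # close to 2.0 and 1.0
```

With five examples the full pass is cheap. With millions of examples, the same loop spends almost all of its time recomputing the gradient for a single step.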

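Stochastic gradient descent addresses these problems of standard gradient descent.

1. Stochastic gradient descent (SGD)

Stochastic gradient descent (SGD) performs a parameter update for each training example, one update per example, so each step is much faster.

θ = θ − η · ∇_θ J(θ; x(i), y(i)), where x(i) and y(i) are a training example and its label.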
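The per-example update can be sketched as follows, again on an assumed toy linear regression problem. The data, the fixed random seed, and the learning rate are illustrative choices.

```python
import random

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]  # toy data on the line y = 2x + 1

w, b = 0.0, 0.0
learning_rate = 0.02
rng = random.Random(0)  # fixed seed so the run is repeatable

for epoch in range(1000):
    order = list(range(len(xs)))
    rng.shuffle(order)  # visit the examples in a new random order each epoch
    for i in order:
        # Gradient of the squared error (w * x + b - y)^2 for ONE example.
        residual = w * xs[i] + b - ys[i]
        w -= learning_rate * 2.0 * residual * xs[i]
        b -= learning_rate * 2.0 * residual

print(w, b)  # close to 2.0 and 1.0
```

The parameters change after every example rather than after every pass. This toy data has no noise, so the run settles cleanly. On noisy data the same loop would keep jittering around the minimum.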

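These frequent updates have high variance, so the loss function fluctuates heavily. This can actually be a good thing, because it helps us discover new and possibly better local minima, whereas standard gradient descent only converges to the minimum of the basin it starts in.

The problem with SGD is that the same frequent updates and fluctuations complicate convergence to the exact minimum: SGD keeps overshooting it.

However, it has been shown that when the learning rate η is decreased slowly, SGD shows the same convergence behavior as standard gradient descent.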
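One simple way to decrease the learning rate slowly is an inverse-time decay schedule. The post does not say which schedule is meant, so the formula η_t = η_0 / (1 + k · t), the function name, and the constants below are all assumptions made for illustration.

```python
def decayed_learning_rate(initial_rate, decay, step):
    # Inverse-time decay: eta_t = eta_0 / (1 + decay * t).
    return initial_rate / (1.0 + decay * step)


# Learning rate at steps 0, 250, 500, 750 and 1000.
schedule = [decayed_learning_rate(0.1, 0.01, t) for t in range(0, 1001, 250)]
print(schedule)
```

Inside an SGD loop, this value would replace the constant learning rate, so early updates are large and later ones become progressively smaller.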

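Another variant, called mini-batch gradient descent, addresses the high-variance parameter updates and unstable convergence.

2. Mini-batch gradient descent

To avoid the problems of both SGD and standard gradient descent, an improved method is mini-batch gradient descent, which performs one update for every batch of n training examples.

The advantages of mini-batch gradient descent are:

1) It reduces the variance of the parameter updates, which leads to better and more stable convergence.

2) It can use the highly optimized matrix operations common to modern deep learning libraries, which makes computing the gradient of a mini-batch very efficient.

3) Mini-batch sizes typically range from 50 to 256, but they vary with the problem at hand.

4) Mini-batch gradient descent is usually the algorithm of choice when training a neural network.

This method is sometimes still referred to as SGD.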
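A minimal sketch of the mini-batch loop, assuming a toy linear regression problem. The helper minibatches, the batch size of 4, and the other constants are chosen only to keep the example small. Real batch sizes are typically much larger, as noted above.

```python
import random

xs = [0.5 * i for i in range(8)]
ys = [2.0 * x + 1.0 for x in xs]  # toy data on the line y = 2x + 1
data = list(zip(xs, ys))


def minibatches(examples, batch_size, rng):
    """Yield shuffled, non-overlapping batches that cover all examples."""
    shuffled = examples[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


w, b = 0.0, 0.0
learning_rate = 0.05
rng = random.Random(0)  # fixed seed so the run is repeatable

for epoch in range(3000):
    for batch in minibatches(data, batch_size=4, rng=rng):
        # Average the per-example gradients over the batch, then update once.
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(w, b)  # close to 2.0 and 1.0
```

Averaging over a batch smooths out the noise of single-example updates while still giving several updates per pass over the data.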

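Challenges when using gradient descent and its variants

1. Choosing a proper learning rate is difficult. A learning rate that is too small makes the network converge painfully slowly, while one that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even diverge.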
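The effect of the learning rate is easy to see on a one-parameter example. The sketch below assumes the loss J(θ) = θ² and three illustrative learning rates. The specific values only make sense for this toy loss.

```python
def final_distance(learning_rate, steps=20, theta=1.0):
    # Gradient descent on J(theta) = theta^2, whose gradient is 2 * theta.
    for _ in range(steps):
        theta = theta - learning_rate * 2.0 * theta
    return abs(theta)  # distance from the minimum at theta = 0


too_small = final_distance(0.01)   # barely moves in 20 steps
reasonable = final_distance(0.4)   # converges quickly
too_large = final_distance(1.1)    # overshoots further on every step

print(too_small, reasonable, too_large)
```

With the smallest rate the parameter is still far from the minimum after 20 steps, and with the largest rate each update overshoots by more than it corrects, so the iterate moves away from the minimum.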

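2. In addition, the same learning rate does not suit every parameter update. If the training data is sparse and the features have very different frequencies, the parameters should not all be updated to the same extent; rarely occurring features should get larger updates.

3. Another key challenge in minimizing the non-convex error functions of neural networks is avoiding getting trapped in their numerous suboptimal local minima. In practice, the difficulty comes less from local minima than from saddle points, that is, points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error value, which makes it hard for SGD to escape, because the gradient is close to zero in all dimensions.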
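A standard textbook saddle is f(x, y) = x² − y², which curves up along x and down along y. This function is chosen here purely for illustration and is not from the original post. The sketch checks that the gradient vanishes at the saddle and stays small nearby.

```python
def f(x, y):
    return x ** 2 - y ** 2


def gradient(x, y):
    return (2.0 * x, -2.0 * y)


def gradient_norm(x, y):
    gx, gy = gradient(x, y)
    return (gx ** 2 + gy ** 2) ** 0.5


at_saddle = gradient_norm(0.0, 0.0)      # exactly zero at the saddle point
near_saddle = gradient_norm(1e-3, 1e-3)  # still tiny close to it

# The origin is neither a minimum nor a maximum:
# f rises along x and falls along y.
rises_along_x = f(0.1, 0.0) > f(0.0, 0.0)
falls_along_y = f(0.0, 0.1) < f(0.0, 0.0)

print(at_saddle, near_saddle, rises_along_x, falls_along_y)
```

Because every update is proportional to the gradient, an optimizer that lands near such a point takes very small steps, even though a lower error is available along y.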

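Further optimizing gradient descent

Now we turn to algorithms that further optimize gradient descent.

1. Momentum

The high-variance oscillations of SGD make it hard for the network to converge stably, so researchers proposed a technique called momentum, which accelerates SGD by reinforcing movement in the relevant direction and dampening oscillations in irrelevant directions. In other words, it adds a fraction γ of the previous step's update vector to the current update vector.

V(t) = γ · V(t−1) + η · ∇_θ J(θ)

The parameters are then updated with θ = θ − V(t).

The momentum term γ is usually set to 0.9 or a similar value.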
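The two momentum equations translate directly into code. The sketch below compares plain gradient descent with momentum on an assumed, deliberately stretched quadratic loss J(x, y) = 0.5 · (x² + 50y²). The loss, the learning rate of 0.02 and the step count are illustrative, and the outcome on this toy problem says nothing general about other losses.

```python
def loss(theta):
    x, y = theta
    return 0.5 * (x ** 2 + 50.0 * y ** 2)


def gradient(theta):
    x, y = theta
    return [x, 50.0 * y]


def run_plain(theta, learning_rate, steps):
    theta = list(theta)
    for _ in range(steps):
        g = gradient(theta)
        theta = [t - learning_rate * gi for t, gi in zip(theta, g)]
    return theta


def run_momentum(theta, learning_rate, gamma, steps):
    theta = list(theta)
    velocity = [0.0] * len(theta)  # V(0) = 0
    for _ in range(steps):
        g = gradient(theta)
        # V(t) = gamma * V(t-1) + eta * gradient
        velocity = [gamma * v + learning_rate * gi for v, gi in zip(velocity, g)]
        # theta = theta - V(t)
        theta = [t - v for t, v in zip(theta, velocity)]
    return theta


start = [1.0, 1.0]
loss_plain = loss(run_plain(start, learning_rate=0.02, steps=200))
loss_momentum = loss(run_momentum(start, learning_rate=0.02, gamma=0.9, steps=200))
print(loss_plain, loss_momentum)  # momentum ends at a lower loss here
```

Along the shallow x direction the gradients keep the same sign, so the velocity accumulates and the momentum run covers that direction much faster than plain gradient descent with the same learning rate.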

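Momentum here plays the same role as momentum in classical physics: like a ball pushed down a hill, it accumulates momentum as it rolls and keeps getting faster.

The same thing happens with the parameter updates, which:

1) lets the network converge better and more stably;

2) reduces oscillation.

For dimensions whose gradients keep pointing in the direction of movement, the accumulated update V(t) grows; for dimensions whose gradients change direction, it shrinks. The coefficient γ itself stays fixed. As a result, momentum suppresses unnecessary parameter updates, which gives faster, more stable convergence with less oscillation.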
