10. Batch gradient descent

The core operation in gradient descent is computing the partial derivatives of a function, a tool from calculus. In essence, gradient descent is a method that uses the gradient to iteratively update the weight parameters so as to minimize an objective function.
There are three forms of gradient descent: Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent. Of these, mini-batch gradient descent is the form most commonly used for model training in deep learning.
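
All three variants share the same update rule and differ only in how many samples are used to estimate the gradient. In standard notation, with $\theta$ the parameters, $\eta$ the learning rate, and $J(\theta)$ the objective function, the update at step $t$ is:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} J(\theta_t)$$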

1. Batch Gradient Descent (BGD)

Batch gradient descent is the most basic form: it uses all of the training samples to compute the gradient at every iteration.

Advantages:
(1) Each iteration computes the gradient over all samples, so the computation can be expressed as matrix operations and parallelized.
(2) A fixed learning rate can be used throughout training, without worrying about learning rate decay.
(3) The update direction determined by the full data set better represents the sample population and therefore points more accurately toward the extremum. When the objective function is convex, BGD is guaranteed to converge to the global minimum; when it is non-convex, it converges to a local minimum.
Disadvantages:
(1) When the number of samples m is large, every iteration must process all of the samples, so training is very slow. (In terms of the number of iterations, however, BGD needs relatively few.)
(2) Each update happens only after traversing all of the examples, even though some examples may be redundant and contribute little to the parameter update.
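
As a concrete illustration, here is a minimal NumPy sketch of batch gradient descent on a linear-regression objective (the model and hyper-parameters are illustrative choices, not taken from the original post):

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, epochs=100):
    """Minimize mean squared error of a linear model, using ALL samples per update."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / m   # gradient over the entire training set (one matrix operation)
        w -= lr * grad                 # one parameter update per full pass over the data
    return w

# Usage on synthetic data: y = 2 - 3*x plus noise
X = np.hstack([np.ones((200, 1)), np.random.randn(200, 1)])
y = X @ np.array([2.0, -3.0]) + 0.1 * np.random.randn(200)
print(batch_gradient_descent(X, y))   # approximately [2, -3]
```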

2. Stochastic Gradient Descent (SGD)

Unlike batch gradient descent, stochastic gradient descent uses a single sample per iteration to update the parameters.

Advantages:
(1) Because each iteration optimizes the loss on one randomly chosen training example rather than on all of the training data, each parameter update is much faster.
(2) The noise introduced during learning can help reduce the generalization error.

Disadvantages:
(1) Accuracy drops: even when the objective function is strongly convex, SGD cannot achieve linear convergence.
(2) A single sample does not represent the trend of the whole data set, so the updates are noisy; the iterates do not converge exactly but fluctuate around the minimum, and SGD may end up at a local optimum.
(3) Updates based on one sample cannot take advantage of vectorized (matrix) computation, the method is hard to parallelize, and the learning process becomes slow.
(4) At a local minimum or saddle point the gradient is zero, so SGD can get stuck there.
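
For comparison, a minimal sketch of stochastic gradient descent on the same illustrative linear-regression setup as above; note that each update uses a single randomly chosen sample:

```python
import numpy as np

def stochastic_gradient_descent(X, y, lr=0.01, epochs=20):
    """Update the parameters once per randomly chosen sample."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        for i in np.random.permutation(m):    # shuffle the sample order each epoch
            xi, yi = X[i], y[i]
            grad = xi * (xi @ w - yi)         # gradient on one sample only (no vectorization)
            w -= lr * grad
    return w
```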

3. Mini-Batch Gradient Descent (MBGD)

Mini-batch gradient descent is a compromise between batch gradient descent and stochastic gradient descent: each iteration uses **batch_size** samples to update the parameters.

Advantages:
(1) Thanks to matrix operations, optimizing the network parameters on a batch at a time is not much slower than on a single example, and the computation can be parallelized.
(2) Computation is faster than in batch gradient descent, because an update only requires traversing part of the samples.
(3) Randomly selecting the samples in each batch helps avoid redundant examples and examples that contribute little to the parameter update.
(4) Using one batch per update greatly reduces the number of iterations needed to converge, while making the result closer to that of full batch gradient descent. (For example, with 300,000 training samples and batch_size = 100, only 3,000 iterations are needed to pass over the data once, far fewer than the 300,000 iterations SGD would need.)

Disadvantages:
(1) Noise in the gradient estimates makes the learning process fluctuate, so the iterates hover in the region of the minimum rather than converging exactly.
(2) Because of these oscillations, a learning rate decay term is needed to shrink the learning rate over time and approach the minimum without oscillating too much.
(3) An improperly chosen batch_size can cause problems: too small a batch behaves like SGD with noisy updates, while too large a batch approaches BGD and loses the speed advantage.
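
Continuing the same illustrative setup, a minimal sketch of mini-batch gradient descent: the shuffled data is split into chunks of batch_size samples and one update is performed per chunk.

```python
import numpy as np

def minibatch_gradient_descent(X, y, lr=0.05, epochs=50, batch_size=32):
    """Update the parameters once per mini-batch of `batch_size` samples."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        idx = np.random.permutation(m)                 # reshuffle so batches differ each epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = Xb.T @ (Xb @ w - yb) / len(batch)   # matrix operation over the batch
            w -= lr * grad
    return w
```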

A key hyperparameter in mini-batch SGD is the learning rate. In practice, it is usually necessary to reduce the learning rate gradually over time, which is known as learning rate decay.
Why is learning rate decay needed?
In the early stage of gradient descent, a larger step size (learning rate) is acceptable and allows the descent to proceed quickly. Near convergence, we want the step size to be smaller, so that the iterates oscillate only slightly around the minimum. Once the model has reached a region where the gradient is small, keeping the original learning rate would only make it hover around the optimum; reducing the learning rate allows the objective function to decrease further, which helps the algorithm converge and approach the optimal solution.
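
One common way to implement this, sketched below with illustrative hyper-parameters (the schedule is a standard inverse-time decay, not one specified in the original post):

```python
def inverse_time_decay(lr0, step, decay_rate=0.01):
    """Shrink the learning rate as training progresses: large steps early, small steps near the minimum."""
    return lr0 / (1.0 + decay_rate * step)

# Example: inside the mini-batch loop above, replace the fixed `lr` with
#   lr = inverse_time_decay(lr0=0.1, step=global_step)
#   w -= lr * grad
# where `global_step` counts the total number of updates performed so far.
```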

