Gradient descent method (SGD) principle

Table of contents

Principle of gradient descent (SGD): finding partial derivatives

1. Gradient (mathematical definition)

2. Gradient descent method iteration steps

BGD (batch gradient descent) algorithm

Tips for choosing between BGD and SGD in practice


Principle of gradient descent (SGD): finding partial derivatives

1. Gradient (mathematical definition)

The gradient of a function at a point is the direction along which the directional derivative of the function attains its maximum value at that point (that is, the direction of the largest directional derivative). In other words, the function changes fastest along this direction at this point, and the maximum rate of change equals the magnitude (norm) of the gradient.
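As a brief formal sketch (added here for illustration; the two-variable notation is my own assumption, not from the original post): for a differentiable function $f(x, y)$, the gradient collects the partial derivatives, and the directional derivative along a unit vector $\boldsymbol{u}$ is

$$
\nabla f(x, y) = \left( \frac{\partial f}{\partial x},\ \frac{\partial f}{\partial y} \right),
\qquad
D_{\boldsymbol{u}} f = \nabla f \cdot \boldsymbol{u} = \lVert \nabla f \rVert \cos\theta ,
$$

which is maximized at $\theta = 0$, i.e. when $\boldsymbol{u}$ points in the same direction as the gradient, and the maximum rate of change is $\lVert \nabla f \rVert$.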

(Figure: gradient descent)

2.  Gradient descent method iteration steps

An intuitive explanation of gradient descent: imagine we are standing somewhere on a large mountain. Since we do not know the way down, we decide to take one step at a time: at each position we compute the gradient at the current point and take a step in the negative gradient direction, i.e. along the steepest descent from where we stand; then we compute the gradient at the new position and again step downhill along the steepest direction. We keep walking step by step like this until we feel we have reached the foot of the mountain. Of course, proceeding this way we may not reach the true foot of the mountain, but only some lower point on the mountainside (a local minimum).
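A minimal sketch of this iteration in code (my own illustration, not code from the original post), where the "position" is a parameter vector and each step moves along the negative gradient with a fixed step size (learning rate):

```python
import numpy as np

def gradient_descent(grad_fn, theta0, lr=0.1, n_iters=100):
    """Repeatedly step in the negative gradient direction (steepest descent)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - lr * grad_fn(theta)  # one step "downhill"
    return theta

# Example: f(theta) = ||theta||^2 has gradient 2 * theta and its minimum at the origin.
print(gradient_descent(lambda t: 2 * t, theta0=[3.0, -4.0]))  # close to [0, 0]
```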

BGD (batch gradient descent) algorithm

BGD (batch gradient descent) is a gradient-based optimization method that finds the minimum of the error function through repeated iterations. In each iteration, the algorithm computes the gradient of the error function over the whole set of training samples and updates the model parameters accordingly. Because BGD must compute the gradients of all training samples at every iteration, it usually puts considerable pressure on memory and computing resources.
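For concreteness, here is a hedged sketch of one BGD update for mean-squared-error linear regression (the setup and names are illustrative assumptions, not the original author's code). The key point is that the gradient is averaged over every training sample before a single parameter update is made:

```python
import numpy as np

def bgd_step(w, X, y, lr=0.5):
    """One batch gradient descent step: the gradient uses all training samples."""
    n = X.shape[0]
    grad = X.T @ (X @ w - y) / n  # gradient of the mean squared error over the full batch
    return w - lr * grad

# Illustrative data: y = 2 * x, with a bias column of ones.
X = np.c_[np.ones(100), np.linspace(0.0, 1.0, 100)]
y = 2 * X[:, 1]
w = np.zeros(2)
for _ in range(1000):
    w = bgd_step(w, X, y)
print(w)  # approaches [0, 2]
```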

Compared with other gradient descent algorithms, BGD has the following advantages:

  • A better convergence effect can be obtained in a shorter period of time.
  • It is usually possible to avoid getting stuck in local minima.
  • It has strong robustness and can handle larger input data sets.

Although the BGD algorithm has the advantages above, some issues still deserve attention. One important problem is slow convergence: because the gradients of all training samples must be computed at every iteration, each iteration is expensive and the algorithm tends to converge slowly. In addition, BGD does not handle online learning well, because online learning usually requires updating from a single sample, whereas batch gradient descent requires computing over all samples.

To address these problems of BGD, researchers have proposed several variant algorithms. The most common is the stochastic gradient descent (SGD) algorithm. Unlike BGD, SGD computes the gradient of only a single training sample in each iteration, which greatly speeds up each update. SGD also handles online learning problems better, because it only needs to compute on one sample at a time.
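A corresponding hedged sketch of SGD under the same assumed linear-regression setup (again my own illustration): each parameter update uses the gradient of just one randomly chosen sample, so an epoch makes many cheap, noisy updates instead of one expensive exact one.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_epoch(w, X, y, lr=0.1):
    """One SGD epoch: update the parameters once per (shuffled) single sample."""
    for i in rng.permutation(X.shape[0]):
        grad_i = X[i] * (X[i] @ w - y[i])  # gradient contributed by one sample only
        w = w - lr * grad_i
    return w

X = np.c_[np.ones(100), np.linspace(0.0, 1.0, 100)]
y = 2 * X[:, 1]
w = np.zeros(2)
for _ in range(50):
    w = sgd_epoch(w, X, y)
print(w)  # fluctuates around [0, 2]
```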

In short, BGD is a commonly used optimization algorithm in machine learning that is applied to the optimization of large-scale data sets. Although it has some shortcomings, these can be addressed by its variant algorithms. In practice, the most appropriate optimization algorithm should be chosen according to the size of the data set and the requirements of the problem.

Tips for choosing between BGD and SGD in practice

BGD: relatively low noise and large step amplitude; it can keep making steady progress toward the minimum.
SGD: most of the time each step moves closer to the global minimum, but sometimes a single sample points in the wrong direction, so SGD is noisy. On average it ends up near the minimum, but it never truly converges; instead it keeps fluctuating around the minimum. Processing only one training sample at a time is also very inefficient.
Mini-batch: in practice it is best to choose a mini-batch size that is neither too large nor too small. It benefits heavily from vectorization, is efficient, and converges quickly.
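A minimal mini-batch sketch under the same assumed setup (illustrative only, not the original author's code): each update averages the gradient over a small random batch, which keeps the noise lower than single-sample SGD while staying much cheaper per step than full-batch BGD and still exploiting vectorization.

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_step(w, X, y, batch_size=32, lr=0.5):
    """One update computed from a randomly drawn mini-batch."""
    idx = rng.choice(X.shape[0], size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / batch_size  # gradient averaged over the mini-batch
    return w - lr * grad

X = np.c_[np.ones(100), np.linspace(0.0, 1.0, 100)]
y = 2 * X[:, 1]
w = np.zeros(2)
for _ in range(2000):
    w = minibatch_step(w, X, y)
print(w)  # close to [0, 2]
```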

How does adjusting Batch_Size affect the training effect?

  1. If Batch_Size is too small, the model performance will be extremely poor (error will soar).
  2. As Batch_Size increases, the same amount of data can be processed faster.
  3. As Batch_Size increases, the number of epochs required to achieve the same accuracy increases.
  4. Because of the trade-off between the two factors above, there is a Batch_Size at which the training time is optimal.
  5. Because the final result converges to different local extrema, there is also a Batch_Size at which the final convergence accuracy is optimal.

If the training set is small (fewer than 2000 samples), use BGD directly. A typical mini-batch size is 64 to 512. Considering how computer memory is laid out and accessed, the code runs faster if the mini-batch size is a power of 2 (2^n).
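As a small illustration of this rule of thumb (the helper name and the "at least ten batches per epoch" cutoff are my own assumptions, not from the original post):

```python
def choose_batch_size(n_samples):
    """Rule of thumb: full batch (BGD) below 2000 samples, otherwise a power of two in [64, 512]."""
    if n_samples < 2000:
        return n_samples  # small training set: plain batch gradient descent
    for size in (512, 256, 128, 64):
        if n_samples >= 10 * size:  # assumed heuristic: keep at least ~10 batches per epoch
            return size
    return 64

print(choose_batch_size(1500))   # 1500 -> full batch
print(choose_batch_size(50000))  # 512
```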

Origin blog.csdn.net/qq_38998213/article/details/133361586