Introduction to Gradient Descent, Stochastic Gradient Descent (SGD), and Mini-batch Gradient Descent

This article introduces the differences among gradient descent, stochastic gradient descent, and mini-batch gradient descent. Throughout, assume the training set contains N = 1000 samples. The derivation of the formulas will be discussed in another article; here I focus on the advantages and disadvantages of each optimization method, since the derivations involve other background knowledge.

1 Gradient descent

The first method is plain (batch) gradient descent. The original figure with the formulas is not available, so here they are in standard form:

    L(θ) = (1/N) · Σ_{i=1..N} loss(x_i, y_i; θ)    (average loss over all N samples)
    θ ← θ − η · ∇L(θ)                               (one update step against the gradient)

Look at the first formula: with N = 1000, my training set has 1000 samples in total. The first formula is the loss function, and the second computes the descent step. We first compute the total loss over all 1000 samples, take the average, differentiate it, and move in the direction of gradient descent, so that the loss over the 1000 samples keeps getting smaller.

Its shortcoming is immediately apparent: although the result is very accurate and the descent direction is found precisely, all 1000 samples must be processed before the parameters can be updated even once, so training is very slow. For this reason it is rarely used in its pure form.
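The behavior above can be sketched with a tiny NumPy example: fitting the slope of a noisy line y = 3x with mean squared error, where every update averages the gradient over all N = 1000 samples. The model, learning rate, and data are illustrative choices, not from the original post.

```python
import numpy as np

# Toy setup matching the article's N = 1000: fit y ≈ w * x with MSE.
rng = np.random.default_rng(0)
N = 1000
x = rng.normal(size=N)
y = 3.0 * x + rng.normal(scale=0.1, size=N)

w = 0.0    # parameter to learn
lr = 0.1   # learning rate (eta)
for epoch in range(100):
    # One pass over ALL 1000 samples yields exactly ONE update.
    grad = np.mean(2 * (w * x - y) * x)  # d/dw of the mean squared error
    w -= lr * grad

print(w)  # should be close to the true slope 3.0
```

Note that 100 epochs here means only 100 parameter updates, which is exactly the slowness the article describes.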

2 Stochastic Gradient Descent

Next is stochastic gradient descent (SGD). Again the original figure is not available; the standard form of the update is:

    θ ← θ − η · ∇ loss(x_i, y_i; θ)    (one randomly chosen sample i per update)

The difference between the two formulas is that stochastic gradient descent has no N. It is not that N disappears; rather, each update computes the error of only one sample, takes the descent direction from that single sample's gradient, and updates immediately.

Its advantage is therefore obvious: over the same 1000 samples, gradient descent can update only once, while stochastic gradient descent updates 1000 times. But its drawback is also clear: it is inaccurate. If one of my samples is too noisy, the descent direction computed from it can be wrong, and that update actually moves in the direction of greater loss. This is only an occasional occurrence, though; on average the updates still move in the direction of decreasing loss. So in my view, stochastic gradient descent is still better than plain gradient descent.
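The same toy problem can illustrate SGD: the data and model are the same illustrative setup as before, but now every single sample triggers its own update, so one pass over the data gives 1000 updates instead of one.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
x = rng.normal(size=N)
y = 3.0 * x + rng.normal(scale=0.1, size=N)

w = 0.0
lr = 0.01  # smaller step, since single-sample gradients are noisy
for epoch in range(5):
    for i in rng.permutation(N):  # shuffle, then update once PER SAMPLE
        grad = 2 * (w * x[i] - y[i]) * x[i]  # gradient from one sample only
        w -= lr * grad
```

Each individual step can point the wrong way (a noisy sample), but the sequence of 1000 updates per epoch still drifts toward the true slope on average.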

3 Mini-batch gradient descent

So is there a way to get both accuracy and speed? That is mini-batch gradient descent. The original figure is not available; the standard form is:

    L_batch(θ) = (1/M) · Σ_{i ∈ batch} loss(x_i, y_i; θ)
    θ ← θ − η · ∇L_batch(θ)

The formula is almost exactly the same as the gradient descent formula, so what does it mean? Assume M = 100. M is actually the batch size, which in practice is usually 64 or 128; I take 100 here for convenience of calculation. It means I compute the loss of 100 samples at a time for each update, so it takes only 10 such updates, i.e. 10 iterations, to go through all 1000 samples once.

It therefore combines the advantages of the two methods above: each update uses a sample set that is not too large, which keeps training fast, yet large enough to keep the descent direction accurate. For ordinary training, this is the method we use.
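Continuing the same illustrative setup, here is the mini-batch version with the article's numbers: N = 1000, M = 100, so each epoch performs exactly 10 updates.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 100  # M is the batch size from the article
x = rng.normal(size=N)
y = 3.0 * x + rng.normal(scale=0.1, size=N)

w = 0.0
lr = 0.1
for epoch in range(30):
    idx = rng.permutation(N)              # shuffle once per epoch
    for start in range(0, N, M):          # 1000 / 100 = 10 mini-batches
        batch = idx[start:start + M]
        # Average the gradient over the 100 samples in this batch.
        grad = np.mean(2 * (w * x[batch] - y[batch]) * x[batch])
        w -= lr * grad
```

Averaging over 100 samples smooths out the noise of single-sample SGD, while still giving 10 updates per pass instead of batch gradient descent's one.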

In fact, after all this discussion, these optimizers are already packaged in Python libraries; we only need to call them. Still, it is worth understanding the underlying ideas, so they can be looked up and recalled when forgotten. This much must be mastered.

Origin blog.csdn.net/qq_45710342/article/details/122479121