Practical Guide | Understanding SGD (Stochastic Gradient Descent), BGD, and MBGD Through Code and Principles

The main contents of the article are as follows:

1. The principle of batch gradient descent (BGD), explained
2. The principle of stochastic gradient descent (SGD), explained
3. The principle of mini-batch gradient descent (MBGD), explained
4. A concrete example with detailed code for all three implementations
5. A summary of the three gradient descent methods

When applying machine learning algorithms, we usually train them with gradient descent. In fact, the commonly used gradient descent method comes in three different forms, each with its own advantages and disadvantages.

Below we compare the three gradient descent methods using a linear regression algorithm as the example.

The hypothesis function of a general linear regression model is:

h_θ(x) = θ_0 + θ_1·x_1 + ... + θ_n·x_n = Σ_{j=0..n} θ_j·x_j    (with x_0 = 1)

The corresponding loss function is:

J(θ) = (1/2) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i))²

(the factor 1/2 is included to simplify the derivative calculation later)

The following figure visualizes the corresponding cost function for a two-dimensional parameter pair (θ_0, θ_1):

[Figure: contour/surface plot of the cost function over (θ_0, θ_1)]

Let's explain the three gradient descent methods separately.

1

Batch Gradient Descent (BGD)

Our goal is to make the error function as small as possible, that is, to find the weights that minimize it. First we initialize the weights randomly, then repeatedly update them to reduce the error function until a stopping criterion is met. For the update we choose the gradient descent algorithm: starting from the initialized weights, we repeatedly apply:

θ_j := θ_j − α · ∂J(θ)/∂θ_j

Here α is the learning rate; it determines how large a step we take in the direction of steepest descent of J. To update the weights we need the partial derivative of J. First, when we have only a single data point (x, y), the partial derivative of J is:

∂J(θ)/∂θ_j = (h_θ(x) − y) · x_j
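To make the single-sample gradient concrete, here is a quick numerical check in Python. The numbers and variable names are purely illustrative and not from the original article; the check compares the analytic gradient (h_θ(x) − y)·x_j against a finite difference on J.

```python
# Analytic gradient of J = 0.5 * (h_theta(x) - y)^2 for ONE data point,
# checked against a finite difference. All values here are made up.
theta = [0.5, 0.5, 0.5]          # theta_0, theta_1, theta_2
x = [1.0, 3.1, 5.5]              # x_0 = 1 is the intercept term
y = 9.5

h = sum(t * xi for t, xi in zip(theta, x))    # hypothesis h_theta(x) = 4.8
grad = [(h - y) * xj for xj in x]             # dJ/dtheta_j = (h - y) * x_j

def J(th):
    return 0.5 * (sum(t * xi for t, xi in zip(th, x)) - y) ** 2

eps = 1e-6
th_plus = theta[:]
th_plus[1] += eps                             # perturb theta_1 only
numeric = (J(th_plus) - J(theta)) / eps       # finite-difference estimate

print(grad[1], numeric)   # both are close to (4.8 - 9.5) * 3.1 = -14.57
```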

Then, summing over all m data points, the partial derivative of the loss function is:

∂J(θ)/∂θ_j = Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) · x_j^(i)

In the process of minimizing the loss function, the weights need to be updated repeatedly to reduce the error. The update rule is:

θ_j := θ_j − α · Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) · x_j^(i)
The pseudo-code for each parameter update is therefore:

repeat until convergence {
    θ_j := θ_j − α · Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) · x_j^(i)    (simultaneously for every j)
}

From the update formula above, we can see that every parameter update uses all the training data (if there are m samples, all m are used). When the training set is very large, this is very time-consuming.

The convergence trajectory of batch gradient descent is shown below:

[Figure: convergence path of BGD on the cost surface]

From the figure we can see that BGD needs relatively few iterations to converge.

2

Stochastic Gradient Descent (SGD)

Since batch gradient descent uses all the samples every time it updates the parameters, training becomes very slow as the number of samples grows. Stochastic gradient descent (SGD) was proposed to solve this problem. It uses the loss of a single sample, takes the partial derivative with respect to θ, and updates θ with the resulting gradient. The loss for one sample (x^(i), y^(i)) is:

cost(θ, (x^(i), y^(i))) = (1/2) · (h_θ(x^(i)) − y^(i))²

The update rule, applied to one randomly chosen sample i at a time, is:

θ_j := θ_j − α · (h_θ(x^(i)) − y^(i)) · x_j^(i)

Stochastic gradient descent updates iteratively with one sample at a time. By contrast, batch gradient descent needs all the training samples for a single iteration (and real-world training sets are often huge nowadays), and one iteration is not guaranteed to reach the optimum; if ten iterations are needed, the training set must be traversed ten times.

However, one problem with SGD is that it is noisier than BGD, so not every iteration moves in the direction of the overall optimum.

The convergence diagram of stochastic gradient descent is as follows:

[Figure: noisy convergence path of SGD on the cost surface]
We can see from the figure that SGD needs many iterations, and its search through the solution space looks rather blind; on the whole, though, it still moves toward the optimum.

3

Mini-batch Gradient Descent (MBGD)

From the two methods above we can see that each has its own advantages and disadvantages. Can we strike a compromise between them, keeping the training process reasonably fast while still guaranteeing the accuracy of the final parameters? That was the original intention behind mini-batch gradient descent (MBGD).

We assume that 10 samples are used for each parameter update (the right number differs completely between tasks; 10 is just an example).

The update pseudo-code is as follows:

repeat until convergence {
    for i = 1, 11, 21, ..., m − 9 {
        θ_j := θ_j − α · Σ_{k=i..i+9} (h_θ(x^(k)) − y^(k)) · x_j^(k)    (simultaneously for every j)
    }
}
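That pseudo-code can be sketched in Python as follows. This is my own minimal illustration, not the article's code; the function name, the synthetic data in the example, and the choice to average the gradient over each mini-batch (which keeps the step size comparable across batch sizes) are all assumptions.

```python
def mbgd(samples, y, alpha=0.05, epochs=5000, batch_size=10):
    """Mini-batch gradient descent for h(x) = t0 + t1*x1 + t2*x2.

    samples: list of (x1, x2) pairs; y: list of targets.
    Each update uses batch_size samples instead of 1 (SGD) or all m (BGD).
    """
    theta = [0.0, 0.0, 0.0]
    m = len(samples)
    for _ in range(epochs):
        for start in range(0, m, batch_size):
            batch = range(start, min(start + batch_size, m))
            grad = [0.0, 0.0, 0.0]
            for i in batch:                   # accumulate over the mini-batch
                x1, x2 = samples[i]
                err = theta[0] + theta[1] * x1 + theta[2] * x2 - y[i]
                grad[0] += err
                grad[1] += err * x1
                grad[2] += err * x2
            n = len(batch)
            for j in range(3):                # one parameter update per mini-batch
                theta[j] -= alpha * grad[j] / n
    return theta

# Example on 20 synthetic points generated from C = 2 + A + 0.5*B
samples = [(i % 5 + 1.0, i % 4 + 1.0) for i in range(20)]
targets = [2.0 + x1 + 0.5 * x2 for x1, x2 in samples]
theta = mbgd(samples, targets)                # theta approaches [2.0, 1.0, 0.5]
```

With batch_size=1 this reduces to stochastic gradient descent, and with batch_size=m to batch gradient descent.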

4

Examples and detailed code

Here, following another blogger's post, a small dataset is created, as shown in the figure below:

[Figure: training data table; columns A and B are the features, column C is the target]

In the training data, A and B are the independent variables and C is the dependent variable. We want to use this training data to fit a linear model and then predict the following test set:

[Figure: test data table]

For example, given (3.1, 5.5), how far is the value predicted by the model from the 9.5 we supplied? But that is not the point. The point is how the parameters are updated while training the model (this is the focus of this article): how batch gradient descent and stochastic gradient descent are implemented in code.

Let's look at each implementation in turn.


First, let's look at the code for the batch gradient descent method:

[Code screenshot: batch gradient descent implementation]
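The original code was a screenshot and cannot be recovered, so here is a minimal batch gradient descent sketch consistent with the update formula above. The function name, data layout, and hyperparameters are my own illustrative choices, not the article's:

```python
def bgd(samples, y, alpha=0.01, epochs=50000):
    """Batch gradient descent for h(x) = t0 + t1*x1 + t2*x2.

    samples: list of (x1, x2) pairs; y: list of targets.
    EVERY update sums the gradient over ALL m training samples.
    """
    theta = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for (x1, x2), yi in zip(samples, y):  # full pass over the data
            err = theta[0] + theta[1] * x1 + theta[2] * x2 - yi
            grad[0] += err
            grad[1] += err * x1
            grad[2] += err * x2
        for j in range(3):                    # single update per full pass
            theta[j] -= alpha * grad[j]
    return theta

# Example on synthetic data generated from C = 2 + A + 0.5*B
samples = [(1.0, 1.0), (2.0, 3.0), (3.0, 2.0), (4.0, 5.0), (5.0, 4.0)]
targets = [2.0 + x1 + 0.5 * x2 for x1, x2 in samples]
theta = bgd(samples, targets)                 # theta approaches [2.0, 1.0, 0.5]
```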

This may still feel abstract, so to help everyone better understand these two important methods, the code is explained line by line with the example below:

[Image: line-by-line annotated walkthrough of the batch gradient descent code]

Next, let's look at the code for the stochastic gradient descent method:

[Code screenshot: stochastic gradient descent implementation]
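Again, the screenshot is not recoverable; the following is a minimal sketch of stochastic gradient descent for the same model. Names, data, and hyperparameters are illustrative assumptions; the essential part is the random single-sample pick:

```python
import random

def sgd(samples, y, alpha=0.01, iterations=100000, seed=42):
    """Stochastic gradient descent: each update uses ONE randomly chosen sample."""
    rng = random.Random(seed)
    theta = [0.0, 0.0, 0.0]
    m = len(samples)
    for _ in range(iterations):
        i = rng.randrange(m)                  # the random pick: the key difference from BGD
        x1, x2 = samples[i]
        err = theta[0] + theta[1] * x1 + theta[2] * x2 - y[i]
        theta[0] -= alpha * err               # update from this single point only
        theta[1] -= alpha * err * x1
        theta[2] -= alpha * err * x2
    return theta

# Example on the same kind of synthetic data, C = 2 + A + 0.5*B
samples = [(1.0, 1.0), (2.0, 3.0), (3.0, 2.0), (4.0, 5.0), (5.0, 4.0)]
targets = [2.0 + x1 + 0.5 * x2 for x1, x2 in samples]
theta = sgd(samples, targets)                 # theta approaches [2.0, 1.0, 0.5]
```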

The biggest difference between stochastic gradient descent and batch gradient descent is that when updating the parameters we no longer consider all the training samples (summing and dividing by the total). Instead, a single random sample point is drawn (the call to the random function in the code makes this clear), and the update uses that one point alone. This is the biggest difference!

It is then easy to see that mini-batch gradient descent is implemented on the same basis: a random batch of samples is drawn instead of a single one. Once you grasp the essence, it is very easy to implement!

All the code for this linear model, including training, prediction, and results, is given below for reference. Besides the implementations of the two methods above, there is the prediction code and the main logic code.

Prediction code:

[Code screenshot: prediction function]
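The prediction screenshot is likewise unrecoverable. For a model of this form, prediction just evaluates the learned hypothesis; the function name and the example parameter vector below are my assumptions:

```python
def predict(theta, x1, x2):
    """Evaluate the learned linear hypothesis h(x) = t0 + t1*x1 + t2*x2."""
    return theta[0] + theta[1] * x1 + theta[2] * x2

# Purely illustrative: with theta = [0.9, 1.0, 1.0], the article's test
# point (3.1, 5.5) would be predicted as 0.9 + 3.1 + 5.5 = 9.5.
value = predict([0.9, 1.0, 1.0], 3.1, 5.5)
```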

Main logic code:

[Code screenshot: main program logic]
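The main-logic screenshot is gone as well, so here is a self-contained sketch of how the pieces fit together: load data, train, then predict. The training data here is synthetic (the article's actual table was an image and cannot be recovered), chosen so that the test point (3.1, 5.5) has the true value 9.5 mentioned in the text:

```python
import random

def train_sgd(samples, y, alpha=0.01, iterations=100000, seed=0):
    """Fit h(x) = t0 + t1*x1 + t2*x2 by stochastic gradient descent."""
    rng = random.Random(seed)
    theta = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        i = rng.randrange(len(samples))
        x1, x2 = samples[i]
        err = theta[0] + theta[1] * x1 + theta[2] * x2 - y[i]
        theta[0] -= alpha * err
        theta[1] -= alpha * err * x1
        theta[2] -= alpha * err * x2
    return theta

def predict(theta, x1, x2):
    return theta[0] + theta[1] * x1 + theta[2] * x2

# Synthetic stand-in for the article's (A, B) -> C table: C = A + B + 0.9
train_x = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
train_y = [x1 + x2 + 0.9 for x1, x2 in train_x]

theta = train_sgd(train_x, train_y)
prediction = predict(theta, 3.1, 5.5)   # close to the true value 9.5
print(prediction)
```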

The final running result is:

[Image: program output showing the predicted values for the test points]

It shows that the predictions closely match the true values we provided.

5

Summary of three gradient descent methods

1. Batch gradient descent uses all the training data for each update to minimize the loss function. If the loss has only one minimum, batch gradient descent considers all the data in the training set and iterates straight toward that minimum. Its disadvantage: when the number of samples is large, each update becomes very slow.

2. Stochastic gradient descent considers only one sample point per update, which greatly speeds up training; this is exactly the weakness of batch gradient descent. However, because the training data may contain noisy points, an update based on a noisy point does not necessarily move toward the minimum. Over many rounds of updates, though, the overall direction still heads toward the minimum, so the speed increases.

3. Mini-batch gradient descent is designed to remedy both the slow training of batch gradient descent and the limited accuracy of stochastic gradient descent. Note, though, that the right batch size differs between problems. A colleague tells me that the batch size for the parser-training part of our NLP system is usually set to 10000. Why 10000? Like the question of how many layers a neural network needs, I don't think anyone can answer precisely; the hyperparameter can only be tuned based on experimental results.

Okay, that is everything I wanted to cover in this article. I sincerely hope it helps your understanding. Corrections and comments are welcome!

Reference:
Gradient descent algorithm and its Python implementation
[Machine Learning] Three forms of gradient descent: BGD, SGD and MBGD
Thanks: Guo Jiang, Xiaoming, Tokugawa



Origin: blog.51cto.com/15009309/2553990