Three Tips for Gradient Descent

Table of contents

1. Introduction

2. Overview

3. Learning rate

3.1 The problem of choosing the value

3.2 Solutions

4. Stochastic Gradient Descent

5. Feature Scaling

5.1 Why feature scaling

5.2 How to do feature scaling


1. Introduction

Recently I have been studying Professor Li Hongyi's machine learning course, and I have to say it is extremely detailed and easy to understand. Here are the Bilibili (station B) link ( poke me to go ) and the Baidu AIStudio link ( poke me to go ). The course content is the same in both places, but Baidu AIStudio provides the corresponding course assignments, while the advantage of watching on Bilibili is that enthusiastic viewers post bullet comments explaining the English terms (Professor Li Hongyi's slides are in English). My suggestion is to watch the lectures on Bilibili and then do the homework on Baidu AIStudio.

2. Overview

We know that the quality of a model is judged with a Loss function: the model with the smallest Loss value is more effective and more accurate. The purpose of gradient descent is to find the minimum of the Loss function through iteration. Of course, it can also be applied to other functions; as long as a function is differentiable, gradient descent can be used. I will not say much about the principle and mathematical derivation of gradient descent, since another blogger has already explained them very well ( poke me to go ). This article is a brief discussion of the three optimization tips for gradient descent explained by Professor Li Hongyi.

3. Learning rate

3.1 The problem of choosing the value

The mathematical formula for gradient descent:

$$\theta^{t+1} = \theta^{t} - \gamma \nabla L(\theta^{t})$$

γ is the learning rate, also known as the step size. Many articles, including Professor Li Hongyi's lectures, use the analogy of walking down a mountain to explain gradient descent: through repeated iterations we want to find the θ that minimizes the Loss, that is, to walk to the lowest point of the valley. However, the learning rate must be appropriate, neither too large nor too small!
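To make the update rule concrete, here is a minimal gradient-descent sketch in Python (NumPy). The toy loss L(θ) = (θ − 3)², its gradient, and every name in it are illustrative assumptions, not code from the course:

```python
import numpy as np

def gradient_descent(grad_fn, theta0, lr=0.1, n_iters=100):
    """Repeatedly apply theta^{t+1} = theta^t - gamma * gradient with a fixed learning rate."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - lr * grad_fn(theta)
    return theta

# Toy example: minimize L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
print(gradient_descent(lambda th: 2 * (th - 3), theta0=[0.0], lr=0.1))  # approaches [3.]
```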

If it is too small: the descent will be very slow, so within a limited number of iterations it may never reach the lowest point of the valley, that is, it may not find the θ that minimizes the Loss. Of course, given enough time it will get there sooner or later, but training a model this way is far too inefficient.

If it is too large: each step is too big. Although progress is fast at first, the updates may keep overshooting the lowest point and oscillating around it without ever settling there. If the learning rate is far too large, the Loss value may even rise instead of fall.

As shown in the figure below:

[Figure: how the Loss evolves when the learning rate is just right, too small, large, and far too large]

3.2 Solutions

There are three ways to deal with this:

① Visualization: for each candidate learning rate, plot how the Loss value changes with the number of iterations, as shown in the figure below (the colors correspond to the figure above):

After drawing this plot you can compare the curves produced by different learning rates, pick the one that decreases most smoothly and quickly, and thereby obtain an appropriate learning rate.
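As a rough sketch of how such a plot could be produced, the snippet below tracks the Loss of a made-up one-dimensional problem, L(θ) = (θ − 3)², for four assumed learning rates; it requires matplotlib and is not taken from the course materials:

```python
import matplotlib.pyplot as plt

def loss_curve(lr, n_iters=50):
    """Record L(theta) = (theta - 3)^2 after each update for one learning rate."""
    theta, losses = 0.0, []
    for _ in range(n_iters):
        losses.append((theta - 3) ** 2)
        theta -= lr * 2 * (theta - 3)        # gradient descent step
    return losses

for lr in (0.01, 0.3, 0.9, 1.05):            # too small, just right, large, far too large
    plt.plot(loss_curve(lr), label=f"lr={lr}")
plt.yscale("log")                            # the diverging curve would otherwise dwarf the rest
plt.xlabel("number of iterations")
plt.ylabel("Loss")
plt.legend()
plt.show()
```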

② Adjust the learning rate: we can make the learning rate decrease as the number of iterations increases. When walking down the mountain we want to descend quickly at first, so the early steps are large; but when we are about to reach the valley floor we need to shorten the step, so that we do not stride over the lowest point and miss it.

With t denoting the number of iterations, a simple decaying schedule of this kind is, for example:

$$\gamma^{t} = \frac{\gamma}{\sqrt{t+1}}$$
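A tiny sketch of this decay schedule in Python (the initial value 0.1 is just an example):

```python
def decayed_lr(gamma0, t):
    """gamma^t = gamma0 / sqrt(t + 1): the step shrinks as the iteration count t grows."""
    return gamma0 / (t + 1) ** 0.5

print([round(decayed_lr(0.1, t), 4) for t in range(5)])  # [0.1, 0.0707, 0.0577, 0.05, 0.0447]
```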

③ Adaptive gradient algorithm (Adagrad): the best approach is to give each parameter its own learning rate that adapts at every iteration, which is what the adaptive gradient algorithm (Adagrad) does.

The original gradient descent update, written for a single parameter w, with the derivative ∂L(θ^t)/∂w abbreviated as g^t below:

$$w^{t+1} = w^{t} - \gamma^{t} g^{t}$$

The adaptive gradient (Adagrad) update divides the learning rate by σ^t, the root mean square of all past derivative values:

$$w^{t+1} = w^{t} - \frac{\gamma^{t}}{\sigma^{t}} g^{t}, \qquad \sigma^{t} = \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}\left(g^{i}\right)^{2}}$$

Derivation: substituting γ^t = γ/√(t+1), the factors of √(t+1) cancel and the update simplifies to:

$$w^{t+1} = w^{t} - \frac{\gamma}{\sqrt{\sum_{i=0}^{t}\left(g^{i}\right)^{2}}}\, g^{t}$$
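Below is a minimal Adagrad sketch in Python (NumPy) following the simplified update above; the toy loss L = w1² + 100·w2² and all names are illustrative assumptions, not the course's code:

```python
import numpy as np

def adagrad(grad_fn, theta0, lr=1.0, n_iters=2000, eps=1e-8):
    """Each parameter's step is divided by the root of its own accumulated squared gradients."""
    theta = np.asarray(theta0, dtype=float)
    sum_sq_grad = np.zeros_like(theta)
    for _ in range(n_iters):
        g = grad_fn(theta)
        sum_sq_grad += g ** 2                            # accumulate (g^i)^2 over all past iterations
        theta -= lr / (np.sqrt(sum_sq_grad) + eps) * g   # w <- w - lr / sqrt(sum (g^i)^2) * g
    return theta

# Toy loss with very different gradient scales per parameter: L = w1^2 + 100 * w2^2
print(adagrad(lambda w: np.array([2 * w[0], 200 * w[1]]), theta0=[3.0, 3.0]))  # both approach 0
```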

4. Stochastic Gradient Descent

The original gradient descent algorithm uses all of the data in every iteration. For example, with a data set of ten examples, all ten must be substituted into the Loss and its gradient before one update is made and one step forward is taken. Stochastic gradient descent instead takes a single example at a time and updates with it, so every individual example lets us take a step forward. When the amount of data is large, stochastic gradient descent may reach a small Loss value without ever having to go through all of the data.
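Here is a small stochastic-gradient-descent sketch on a toy linear-regression problem with the squared-error Loss; the data, learning rate, and function names are made up for illustration:

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, n_epochs=1000, seed=0):
    """Update the weights with the gradient of ONE randomly chosen example at a time."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            error = X[i] @ w - y[i]      # prediction error on a single example
            w -= lr * error * X[i]       # gradient of 0.5 * error^2 with respect to w
    return w

# Ten training examples; each one triggers its own parameter update
X = np.c_[np.ones(10), np.arange(10.0)]  # bias column plus one feature
y = 2 + 3 * np.arange(10.0)
print(sgd_linear_regression(X, y))       # approaches [2, 3]
```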

5. Feature Scaling

5.1 Why feature scaling

Feature scaling can be illustrated by the figure below: the idea is to make the values of different features lie in the same or a similar range, which speeds up convergence when running gradient descent.

Suppose we have a function of the form:

$$y = b + w_{1}x_{1} + w_{2}x_{2}$$

If the feature x1 takes values in the range (1, 10) while x2 takes values in the range (100, 1000), then a change in w1 has only a small effect on y, whereas the same change in w2 has a very large effect on y. The Loss contours are then stretched into narrow ellipses, so gradient descent takes many detours, needs more time, is inefficient, and may even require different learning rates in different directions to reach the center of the contours. If, after feature scaling, x1 and x2 lie in the same range, say (1, 10), the contours become close to circles and gradient descent heads smoothly toward the minimum, saving time.

5.2 How to do feature scaling

There are many ways to do feature scaling, such as min-max normalization and decimal scaling. Here we use the zero-mean normalization (Z-score normalization) mentioned by Professor Li Hongyi.

Formula (where x^r is the r-th example, each x^r has one value per feature dimension i, m_i is the mean of dimension i over all examples, and σ_i is its standard deviation):

$$x_{i}^{r} \leftarrow \frac{x_{i}^{r} - m_{i}}{\sigma_{i}}$$

The processed data has a mean of 0 and a standard deviation of 1.
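A quick NumPy sketch of Z-score normalization. Note that, unlike the lecture's layout where each row is one feature dimension, this example follows the common NumPy convention of one example per row and one feature per column; the data values are made up:

```python
import numpy as np

# Each row is one example, each column is one feature dimension
X = np.array([[1.0, 100.0],
              [5.0, 500.0],
              [9.0, 900.0]])

mean = X.mean(axis=0)        # per-feature mean m_i
std = X.std(axis=0)          # per-feature standard deviation sigma_i
X_scaled = (X - mean) / std  # each column now has mean 0 and standard deviation 1

print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # ~[0, 0] and [1, 1]
```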
