Deep Learning: A Visual Understanding of Gradient Descent and the Learning Rate

Preface

When I was first studying gradient descent and the learning rate, I did not really understand what these two processes were, and I puzzled over them for a long time. Recently I watched the first lecture of Li Hongyi's deep learning course, and it finally clicked, so I am writing this down.

Gradient Descent

When we train a model, we first compute the loss, and then use gradient descent to propagate updates backwards and optimize the parameters so that the loss decreases. This is the usual procedure, but how can we understand it visually?
First, assume the loss is a function of a parameter w (this matches how the loss of a neural network is actually computed), where w stands for any one of the network's many parameters. Suppose we can draw the graph of the loss as a function of w.
Next we take the partial derivative of the loss with respect to w; this is exactly the slope of the loss curve, and since the current value of w in the network is known, we can compute the current slope.
Suppose we are now at point P, as shown in the figure below. What, then, is gradient descent? To get a better network we need to make the loss smaller, and in the figure that means moving down the curve. At P the curve is higher on the left and lower on the right (the slope is negative), so increasing w makes the loss fall; as we do this, the magnitude of the gradient, i.e. the slope, also shrinks. This is how gradient descent reduces the loss.

  • We only need to move each parameter in the direction that makes the loss fall (the direction of gradient descent) to drive the loss function down; that is what gradient descent means (see the sketch after the figure below).

[Figure: loss curve over w, with the current parameter marked at point P on a downward slope]
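To make the picture concrete, here is a minimal sketch (my own illustration, not from the course) with a toy loss L(w) = (w − 3)² and a made-up starting value w = 1 standing in for point P; the sign of the slope tells us which way to move w.

```python
# Toy loss L(w) = (w - 3)^2, whose minimum is at w = 3.
def loss(w):
    return (w - 3) ** 2

# Its slope dL/dw = 2 * (w - 3).
def slope(w):
    return 2 * (w - 3)

w = 1.0                     # hypothetical "point P", to the left of the minimum
print(loss(w), slope(w))    # loss = 4.0, slope = -4.0 (negative: downhill to the right)

# The slope is negative, so increasing w should reduce the loss.
w_new = w + 0.5             # take a small step to the right
print(loss(w_new))          # 2.25 < 4.0: the loss did decrease
```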

Learning Rate

Once we understand gradient descent, a question follows: we know we should update the parameters in the direction that makes the loss fall, but by how much should each parameter increase (or decrease) at every step? The quantity that decides the size of this change is our learning rate.
To train faster, we would like the parameter change to be larger when the gradient is large and smaller when the gradient is small. So how does the learning rate come into play?
We can simply set the change to slope × learning rate, which is why the learning rate is often called the step size. Every parameter update is like taking a step down a mountain: how far each step carries us is determined by how steep the mountain is (the size of the gradient) and by the length of our stride (the learning rate).
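Putting the two ideas together, every update is simply w ← w − learning_rate × slope. Below is a small sketch of that loop on the same toy loss; the learning rate of 0.1 and the 20 steps are arbitrary illustrative choices.

```python
def loss(w):
    return (w - 3) ** 2

def slope(w):
    return 2 * (w - 3)

learning_rate = 0.1   # the "step size"
w = 1.0               # starting parameter

for step in range(20):
    g = slope(w)                  # how steep the mountain is right here
    w = w - learning_rate * g     # step = learning rate * slope, taken downhill

print(w, loss(w))  # w has moved close to 3, where the loss is near its minimum
```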

Effect of Learning Rate

We often use a decaying learning rate when training a model. Let's draw a few pictures to understand why this practice makes sense.

Learning Rate is relatively small

Suppose the learning rate is a relatively small, constant value. Then the following situation is likely: because the steps are so small, once we reach point P the model believes the loss can no longer decrease; it is stuck at such a point with no way to jump out. The model thinks it has found the optimal w and the corresponding minimum loss, but viewed from a global perspective it has clearly not reached the ideal value.
If it could jump past that point, a w with an even lower loss lies beyond it.
When this happens, what we can do is save the best model obtained so far, stop training, and retrain with a larger learning rate so that it can jump out of that point.

[Figure: loss curve where a small learning rate leaves w stuck at point P, a local low point, even though a lower loss exists further along the curve]
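Here is a rough sketch of that behaviour on a made-up non-convex toy loss. The function, the starting point, and the two learning rates are all invented for illustration, and a larger rate is not guaranteed to escape a bad minimum in general; it just happens to work on this example.

```python
# Toy non-convex loss with a shallow local minimum near w = 1
# and a lower (better) minimum near w = -1.
def loss(w):
    return (w ** 2 - 1) ** 2 + 0.3 * w + 1

def slope(w):
    return 4 * w * (w ** 2 - 1) + 0.3

def train(w, learning_rate, steps=200):
    for _ in range(steps):
        w = w - learning_rate * slope(w)
    return w

w0 = 1.5  # hypothetical starting parameter

w_small = train(w0, learning_rate=0.01)  # tiny steps: settles in the nearby local minimum
w_large = train(w0, learning_rate=0.2)   # bigger steps: happens to jump past it here

print(w_small, loss(w_small))   # w ≈ 0.96,  loss ≈ 1.29  (stuck, "thinks" it is done)
print(w_large, loss(w_large))   # w ≈ -1.04, loss ≈ 0.69  (the better minimum)
```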

Learning Rate is relatively large

If, on the other hand, the learning rate stays relatively large, the following is very likely: last time we were at point P1, but because the step is so large the update carries the parameter all the way to point P2; and at P2, again because the learning rate is so large, the model unfortunately jumps straight back to P1. The model oscillates back and forth like this, and there is no way for w to reach the optimal solution.
[Figure: loss curve where a large learning rate makes w bounce back and forth between points P1 and P2 on either side of the minimum]
So how do we deal with this situation, as well as the small-learning-rate problem above? The most practical answer is the decaying learning rate mentioned earlier: a large learning rate at the start lets us quickly reach a reasonably good parameter value, and a small learning rate afterwards lets us home in on the best one.
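Both failure modes, and the fix, can be seen on the simplest possible toy loss, L(w) = w²; the concrete learning-rate values and the 0.7 decay factor below are just illustrative choices.

```python
# Toy loss L(w) = w^2 (minimum at w = 0); its slope is 2w.
def slope(w):
    return 2 * w

# A large, constant learning rate: with lr = 1.0 the update
# w <- w - 1.0 * 2w = -w just flips the sign, so w bounces
# between "P1" and "P2" forever and never settles at 0.
w = 2.0
for step in range(6):
    w = w - 1.0 * slope(w)
    print("constant lr:", w)    # -2, 2, -2, 2, ...  (oscillation)

# Start large, then decay: big early steps move quickly,
# and the smaller later steps let w settle near the minimum.
w, lr = 2.0, 1.0
for step in range(10):
    w = w - lr * slope(w)
    lr = lr * 0.7               # illustrative decay factor
print("decayed lr:", w)         # very close to 0
```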

Original post: blog.csdn.net/scarecrow_sun/article/details/120527738