Gradient Descent: Algorithm Principles for Neural Networks

         When solving for the model parameters of a neural network, gradient descent is the most commonly used method. The following is my personal understanding of gradient descent from my studies. If anything is wrong, please point it out.

1. ✌ Gradient definition

         In calculus we learned to take the partial derivative of each variable of a multivariate function; writing these partial derivatives together as a vector gives the gradient. For example, for the function f(x,y), taking the partial derivatives with respect to x and y gives the gradient vector (∂f/∂x, ∂f/∂y)ᵀ, written grad f(x,y) or ∇f(x,y). The gradient at a specific point (x0, y0) is this vector evaluated at (x0, y0), written ∇f(x0, y0). For a function of three variables the gradient is (∂f/∂x, ∂f/∂y, ∂f/∂z)ᵀ, and so on.
         So what is the significance of this gradient vector? Geometrically, it points in the direction in which the function increases fastest. Specifically, for the function f(x,y) at the point (x0, y0), moving along the gradient direction ∇f(x0, y0) increases f(x,y) fastest; in other words, following the gradient makes it easier to find the function's maximum. Conversely, moving in the opposite direction, −∇f(x0, y0), decreases the function fastest, which makes it easier to find the minimum.
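To make the definition concrete, here is a minimal numerical check (my own illustrative example, with f(x, y) = x² + y² assumed; it is not from the article): the gradient is just the two partial derivatives stacked into a vector, and a central difference approximates each one.

```python
# Numerically approximating the gradient (df/dx, df/dy) of an assumed
# function f(x, y) = x**2 + y**2 at a point (x0, y0).
def f(x, y):
    return x**2 + y**2

def gradient(x, y, h=1e-6):
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)  # central difference in x
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)  # central difference in y
    return (dfdx, dfdy)

print(gradient(1.0, 2.0))  # close to the analytic gradient (2, 4)
```

For this f, the analytic gradient is (2x, 2y), so the numerical result at (1, 2) should be very close to (2, 4).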
(figure: gradient directions at points F and B on a curve)
For point F, the gradient points in the direction of the green vector, so the opposite direction is where the function decreases fastest.
For point B, the gradient is negative, so the opposite direction of the gradient points to the lower right, again the direction of fastest decrease.
In both cases, moving against the gradient drives the function toward its minimum.

2. ✌ Gradient descent and gradient ascent

         Generally, gradient descent is used to minimize a loss function, and gradient ascent is used to maximize one. Both methods update the parameters iteratively.
The principle of gradient descent is introduced below.

3. ✌ Gradient descent illustrated

(figure: loss surface z = J over two parameters x and y, shaped like a bowl)

         First, look at this picture. The z-axis is the loss function, and the x-axis and y-axis are the two parameters. The problem is to find the parameter values at which the loss function reaches its minimum. You might think of exhaustively trying every parameter value, but that clearly does not work: the parameter values are unbounded, so you cannot enumerate them all. You might instead differentiate the loss function and solve for its extrema analytically; in theory this is fine, but the loss function differs every time, so the method is hard to encapsulate as a general routine: different function forms require different derivations, and there is no universal closed-form solution. So what should we do?
         Think of the surface as a bowl. When we drop a small ball into the bowl, by a natural phenomenon the ball will certainly roll down. So what is special about the ball's path? Where the slope is steep, the ball rolls down more easily and more quickly. Doesn't that correspond exactly to our gradient? If the ball rolls in the opposite direction of the gradient at every step, there will always be a moment when it reaches the lowest point.
         That is the principle of gradient descent, but it raises a new problem. Let's look at another picture.
(figure: a surface with several valleys, i.e. local minima)
         According to the theory above, the ball will certainly roll to a lowest point, but must that point be the lowest point overall? Definitely not. As the figure shows, if the ball falls into some recessed area, it stops there and never reaches the global lowest point. In that case we have found a local optimum instead of the global optimum. One caveat: if the loss function is convex, we are guaranteed to reach the global optimal solution.
         From calculus you may recall that a local minimum is not necessarily the global minimum. So what can be done? Many optimized algorithms have been developed, derived through various mathematical arguments; they are not explained here, since this article only explains the principle of gradient descent. If you are interested, you can look up the relevant literature on your own.
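The local-versus-global issue above can be reproduced in a few lines. This is my own sketch, using an assumed non-convex function f(x) = x⁴ − 4x² + x; the starting point alone decides which valley the "ball" ends up in.

```python
# Gradient descent on an assumed non-convex function f(x) = x**4 - 4*x**2 + x,
# which has a global minimum near x = -1.47 and a local minimum near x = 1.35.
def grad(x):
    return 4 * x**3 - 8 * x + 1  # derivative f'(x)

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x = x - lr * grad(x)  # roll against the gradient
    return x

print(descend(-2.0))  # starts on the left:  settles near -1.47 (global minimum)
print(descend(+2.0))  # starts on the right: settles near +1.35 (local minimum)
```

Neither run "knows" it is in the wrong valley; only the initial value differs, which is exactly the local-optimum trap described above.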

4. ✌ Concepts related to gradient descent

w = w − a * dJ/dw
This is the core formula of gradient descent; it is used to update the value of w. Why the minus sign? Rather than explain in words, look at the picture.
(figure: a curve with point b on the right, where the slope is positive, and point a on the left, where the slope is negative)
         When we are at point b, the gradient (the derivative value) is positive; to reach the minimum we must move left, so we subtract the learning rate times this value.
If we are at point a, the gradient is negative (the derivative value is negative); then we need to move right, and subtracting a negative value adds it, which moves us right.
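A toy check of the minus sign (my own example, with an assumed loss J(w) = w², minimized at w = 0): subtracting a positive derivative moves left, and subtracting a negative derivative moves right.

```python
# Why the minus sign in w = w - a * dJ/dw, checked on an assumed loss J(w) = w**2.
def dJ(w):
    return 2 * w  # derivative of w**2

a = 0.1  # learning rate

w = 3.0               # right of the minimum: derivative is positive
print(w - a * dJ(w))  # moves left to about 2.4, toward the minimum at 0

w = -3.0              # left of the minimum: derivative is negative
print(w - a * dJ(w))  # subtracting a negative moves right, to about -2.4
```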

  1. Loss function: After learning linear regression, you may know that we use MSE (mean squared error) to evaluate a model and measure how well it fits.
    J(w1, w2) = (1/m) ∑_{i=1}^{m} (y_i − y_i′)²
    Obviously, the smaller this function, the better. We are seeking the values of w1 and w2 that minimize the loss function, and that is where gradient descent comes in.
  2. Learning rate: this is the a in the formula above; some texts also call it the step size, a name I find somewhat misleading. I personally think of it as a scaling factor: w and its derivative may differ by orders of magnitude, so the derivative must be multiplied by a small number to control the size of each update.
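Putting the loss function and the learning rate together, here is a sketch of gradient descent on the MSE loss J(w1, w2). The linear model y′ = w1·x + w2 is my assumption, since the article only names the loss; the partial derivatives follow directly from differentiating J.

```python
# Gradient descent on the MSE loss J(w1, w2) = (1/m) * sum((y' - y)**2)
# for an assumed linear model y' = w1*x + w2.
def fit(xs, ys, lr=0.05, steps=5000):
    w1, w2 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        # partial derivatives of J with respect to w1 and w2
        d1 = sum(2 * (w1 * x + w2 - y) * x for x, y in zip(xs, ys)) / m
        d2 = sum(2 * (w1 * x + w2 - y) for x, y in zip(xs, ys)) / m
        w1, w2 = w1 - lr * d1, w2 - lr * d2  # core update rule w = w - a*dJ/dw
    return w1, w2

# Data generated from y = 2x + 1; the fit should recover w1 ≈ 2, w2 ≈ 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
print(fit(xs, ys))
```

Since the data lie exactly on a line, the loss here is convex and the iteration converges to the global optimum, as the convexity caveat above promises.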

5. ✌ The calculation process of gradient descent

         The general case involves multi-dimensional matrix operations and a lot of notation, which is hard for beginners to follow. Here we simplify it to a low-dimensional version; the principle is the same, and it extends from low dimensions to many.
Not much more to say (formulas are hard to typeset in this editor, so I worked the process out on scratch paper); see the picture!
(figure: the gradient descent calculation worked out by hand on scratch paper)

6. ✌ Algorithm process:

  1. Determine the gradient (derivative) at the current parameter value, dJ/dw

  2. Multiply the learning rate by the gradient to get the parameter update distance, namely a*dJ/dw

  3. Decide the number of iterations and a threshold; there are two stopping cases

    3.1 In the first case, the iteration count is reached, and the computation ends.
    3.2 In the second case, the parameter update value falls below the threshold. Put plainly, a*dJ/dw tends to 0, indicating that we are close to the optimal position
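The steps above can be sketched as a loop with both stopping conditions: a maximum iteration count (case 3.1) and a threshold on the update size (case 3.2). The loss J(w) = (w − 3)² is my own assumed example.

```python
# The algorithm process above, with both stopping conditions, on an
# assumed loss J(w) = (w - 3)**2 whose minimum is at w = 3.
def dJ(w):
    return 2 * (w - 3)  # gradient of the assumed loss

def gradient_descent(w, lr=0.1, max_iters=10000, threshold=1e-8):
    for _ in range(max_iters):      # case 3.1: stop at the iteration limit
        step = lr * dJ(w)           # update distance a * dJ/dw
        if abs(step) < threshold:   # case 3.2: update tends to 0, stop
            break
        w = w - step
    return w

print(gradient_descent(0.0))  # converges near the minimum at w = 3
```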

7. ✌ Algorithm optimization:

Where can the algorithm be improved?

  1. The choice of learning rate:
    If the learning rate is too small, each parameter update is small and progress is slow, which increases the number of iterations and lengthens model training. If the learning rate is too large, the parameters change too much per step and the iterates can jump over the position of the optimal solution.
    Look at the picture to understand
    (figure: too small a learning rate converges slowly; too large a rate overshoots the minimum)

  2. The initial value of the parameters:
    Different initial values can also affect the resulting model, because gradient descent sometimes finds only a local optimum; choosing the starting location well can avoid this situation.

  3. Normalization of data to eliminate the influence of scale:
    After normalization, different features fall into the same range of values, which also reduces a certain amount of computation.
    x = (x − mean(x)) / std(x)
    Each sample has the mean subtracted and is then divided by the standard deviation, so the processed data has zero mean and unit standard deviation.
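The standardization formula above in code (a small sketch of my own, using Python's standard library):

```python
# Standardizing a feature as in the formula above: x = (x - mean(x)) / std(x).
import statistics

def standardize(xs):
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)  # population standard deviation
    return [(x - mu) / sigma for x in xs]

xs = [10.0, 20.0, 30.0, 40.0]
zs = standardize(xs)
print(zs)  # the standardized values have zero mean and unit standard deviation
```

After this transformation every feature lives on the same scale, so no single feature's gradient dominates the update.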

Original post: blog.csdn.net/m0_47256162/article/details/113834516