Principle of Gradient Descent Algorithm - Study Notes

1. What is a gradient

In calculus, the gradient of a multivariate function is obtained by taking the partial derivative with respect to each parameter and collecting these partial derivatives into a vector. For example, for the function f(x, y), taking the partial derivatives with respect to x and y gives the gradient vector (∂f/∂x, ∂f/∂y)T, written grad f(x, y) or ∇f(x, y). The gradient at a specific point (x0, y0) is the vector of partial derivatives evaluated at that point, written ∇f(x0, y0). For a function of three parameters the gradient is (∂f/∂x, ∂f/∂y, ∂f/∂z)T, and so on.

The geometric meaning of the gradient vector is that it points in the direction in which the function changes fastest. Specifically, for the function f(x, y) at the point (x0, y0), moving along the gradient direction ∇f(x0, y0) is the direction in which f(x, y) increases fastest; in other words, following the gradient makes it easier to find a maximum of the function. Conversely, moving in the opposite direction, -∇f(x0, y0), is the direction in which f(x, y) decreases fastest, which makes it easier to find a minimum of the function. Therefore, the minimum of the sum of squared errors, and the regression parameters corresponding to that minimum, can be found by an iterative method.
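
As a concrete illustration (not from the original post), the short Python sketch below uses the assumed example f(x, y) = x² + y², whose gradient is (2x, 2y), and checks numerically that a small step against the gradient decreases f while a step along the gradient increases it.

```python
def f(x, y):
    # Example objective assumed for illustration: f(x, y) = x^2 + y^2
    return x**2 + y**2

def grad_f(x, y):
    # Analytic gradient vector (df/dx, df/dy) = (2x, 2y)
    return (2 * x, 2 * y)

x0, y0 = 3.0, 4.0
gx, gy = grad_f(x0, y0)          # gradient at (x0, y0) = (6, 8)
step = 0.01

f_here = f(x0, y0)
f_down = f(x0 - step * gx, y0 - step * gy)   # move against the gradient
f_up   = f(x0 + step * gx, y0 + step * gy)   # move along the gradient

print(f_down < f_here < f_up)    # True: -grad decreases f, +grad increases f
```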

2. The process of gradient descent

Figure 1-1 shows the iterative process of the gradient descent method. J(θ0, θ1) is a function of two independent variables, similar to the sum-of-squared-errors function in unary linear regression, which also has two parameters, b and c. For more than two variables the mathematical treatment is the same, but the process can no longer be drawn as an intuitive graph the way it can with two variables. The goal is to start from an arbitrary point on the surface and find the minimum of the function, that is, the point where the sum of squared errors is smallest; the two independent variables corresponding to that point (θ0 and θ1 in Figure 1-1, b and c in unary linear regression) are the regression parameters to be solved for.

Suppose the starting point is where the sum of squared errors is largest. Compute the gradient at that point; the negative gradient direction is the direction in which the error sum decreases fastest. The parameters b and c then move a fixed distance in that direction to a second point (marked with an x in the figure), where the gradient is computed again to obtain a new direction, and the movement continues. This process iterates until the gradient is 0 (in practice, until it is smaller than a very small number rather than exactly 0), which corresponds to the minimum of the function.

Figure 1-1 Two random paths along which gradient descent finds the minimum of the function

The whole gradient descent process can be understood as follows (first initialize the relevant parameters of the algorithm: mainly the model parameters θ0, θ1, ..., θn, the termination distance ε, and the step size α). A code sketch of the four steps appears after step (4).

(1) Determine the position (compute the gradient of the loss function at the current position): for each parameter θi, the gradient component is ∂J(θ0, θ1, ..., θn)/∂θi.

(2) Take a step (multiply the gradient of the loss function by the step size to get the distance to move from the current position): for each θi, the distance is α·∂J(θ0, θ1, ..., θn)/∂θi.

(3) Judge whether to terminate:

Check whether the descent distance α·∂J/∂θi is less than ε for every θi. If it is, the algorithm terminates and the current θi (i = 0, 1, ..., n) are the final result. Otherwise, go to step (4).

(4) Update the position (update every θi, then return to step (1)): θi = θi − α·∂J(θ0, θ1, ..., θn)/∂θi.
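
To tie the four steps together, here is a minimal sketch assuming a toy loss J(θ0, θ1) = (θ0 − 1)² + (θ1 + 2)² (an illustrative choice, not from the original post), whose gradient is (2(θ0 − 1), 2(θ1 + 2)) and whose minimum is at θ = (1, −2):

```python
# Toy loss assumed for illustration: J(theta0, theta1) = (theta0 - 1)^2 + (theta1 + 2)^2
def grad_J(theta):
    return [2 * (theta[0] - 1), 2 * (theta[1] + 2)]

theta = [5.0, 5.0]      # arbitrary starting point
alpha = 0.1             # step size
eps = 1e-8              # termination distance

while True:
    grad = grad_J(theta)                      # (1) gradient at the current position
    steps = [alpha * g for g in grad]         # (2) distance to move for each theta_i
    if all(abs(s) < eps for s in steps):      # (3) terminate when every distance < eps
        break
    theta = [t - s for t, s in zip(theta, steps)]   # (4) update and go back to (1)

print(theta)   # approximately [1.0, -2.0]
```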

 3. Reference code:
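
Below is a minimal sketch of gradient descent for unary linear regression, assuming the model y = b + c·x and a small synthetic dataset (both are illustrative assumptions, not from the original post); it fits b and c by minimizing the sum of squared errors discussed above.

```python
import random

# Small synthetic dataset assumed for illustration: y ≈ 2 + 3*x plus noise.
random.seed(0)
xs = [i / 10 for i in range(50)]
ys = [2.0 + 3.0 * x + random.gauss(0, 0.1) for x in xs]

def gradients(b, c, xs, ys):
    """Partial derivatives of the sum of squared errors
    J(b, c) = sum((b + c*x - y)^2) with respect to b and c."""
    db = sum(2 * (b + c * x - y) for x, y in zip(xs, ys))
    dc = sum(2 * (b + c * x - y) * x for x, y in zip(xs, ys))
    return db, dc

b, c = 0.0, 0.0      # initial parameters
alpha = 0.001        # step size
eps = 1e-9           # termination distance

for _ in range(100000):                 # safety cap on iterations
    db, dc = gradients(b, c, xs, ys)
    step_b, step_c = alpha * db, alpha * dc
    if abs(step_b) < eps and abs(step_c) < eps:
        break
    b, c = b - step_b, c - step_c

print(b, c)   # close to 2 and 3 for this synthetic data
```

The step size (0.001 here) must be small enough for the iteration to converge; if it is too large, the sum of squared errors grows at each step instead of decreasing.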


Source: blog.csdn.net/hu_666666/article/details/127202816