Simple understanding of gradient descent algorithm
The formula of the gradient descent algorithm is very simple, and the rule "move along the opposite direction of the gradient (the steepest slope)" matches everyday experience. But what is the essential reason behind it? Why is the direction of fastest local decrease the negative direction of the gradient? Many readers may still be unclear on this. It doesn't matter; next I will explain the mathematical derivation of the gradient descent formula in plain language.
The downhill problem
Suppose we are standing somewhere on the side of Huangshan Mountain. The ridges stretch endlessly, and we do not know the way down. So we decide to descend one step at a time: from the current position, take a small step in the steepest downhill direction, then from the new position take another small step in its steepest downhill direction, and so on, until we feel we have reached the foot of the mountain. The steepest downhill direction here is the negative direction of the gradient.
First, what is a gradient? In layman's terms, the gradient of a function at a point is the direction along which the directional derivative attains its maximum value; in the one-dimensional case it is simply the derivative of the function at the current position:

∇f(θ) = df(θ)/dθ

In the above formula, θ is the independent variable, f(θ) is a function of θ, and ∇f(θ) denotes the gradient.
If the function f(θ) is convex, then it can be optimized using the gradient descent algorithm. We are already familiar with the update formula of gradient descent:

θ = θo − η·∇f(θo)

Here θo is the current parameter value, i.e. the coordinates of our position on the mountain; η is the learning rate, i.e. the length of each small downhill step; and θ is the updated θo, i.e. the position after taking one small step downhill.
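The update rule above can be sketched in a few lines of Python. This is a minimal, illustrative sketch, using f(θ) = θ² as an example convex function (the function, starting point, and learning rate are all assumptions, not from the text):

```python
# Minimal sketch of the gradient descent update: theta = theta0 - eta * grad_f(theta0).
# Example function f(theta) = theta^2, whose gradient is 2*theta and whose minimum is at 0.

def grad_f(theta):
    """Gradient of the example function f(theta) = theta^2."""
    return 2.0 * theta

eta = 0.1      # learning rate: the length of each small downhill step
theta = 5.0    # starting position "on the mountainside" (arbitrary choice)

for _ in range(100):
    theta = theta - eta * grad_f(theta)   # one small step in the negative gradient direction

print(theta)   # approaches the minimum at theta = 0
```

Each iteration multiplies θ by (1 − 2η), so with η = 0.1 the position shrinks toward 0 geometrically.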
First-order Taylor expansion
If a function is smooth enough, then, given the values of its derivatives at a point, Taylor's formula can use those derivative values as coefficients to construct a polynomial that approximates the function in a neighborhood of that point.
A little mathematical background is needed here, namely some familiarity with Taylor expansions. Simply put, the first-order Taylor expansion rests on the idea of a local linear approximation of a function. Let us take the first-order expansion as an example.
Consider a small segment [θo, θ] of the convex function f(θ) (shown as the black curve in the original figure). The value of f(θ) can be obtained by linear approximation: a straight line through the point (θo, f(θo)) whose slope equals the derivative of f(θ) at θo (the red line in the figure). From the equation of this line, the approximate expression for f(θ) follows directly:

f(θ) ≈ f(θo) + (θ − θo)·∇f(θo)
This is the derivation of the first-order Taylor expansion; the main mathematical idea used is the linear approximation of a curved function.
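The local nature of this approximation can be checked numerically. The sketch below is illustrative, using f(θ) = exp(θ) as an assumed smooth example (the function and evaluation points are not from the text); the key observation is that the first-order approximation is accurate near θo and degrades farther away:

```python
import math

# First-order Taylor check: f(theta) ~ f(theta0) + (theta - theta0) * f'(theta0).
# Example: f(theta) = exp(theta), so f'(theta) = exp(theta) as well.

theta0 = 1.0

def f(theta):
    return math.exp(theta)

def df(theta):
    return math.exp(theta)

def taylor1(theta):
    """Linear (first-order Taylor) approximation of f around theta0."""
    return f(theta0) + (theta - theta0) * df(theta0)

# Error close to theta0 vs. far from theta0.
err_near = abs(f(1.01) - taylor1(1.01))
err_far = abs(f(2.0) - taylor1(2.0))
print(err_near, err_far)
```

The near-point error is tiny while the far-point error is large, which is exactly why the derivation below requires θ − θo to stay small.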
Gradient Descent Mathematics
Now that we know the first-order Taylor expansion, we come to the key step: let us see how the gradient descent algorithm is derived.
First, write down the first-order Taylor expansion:

f(θ) ≈ f(θo) + (θ − θo)·∇f(θo)
Here θ − θo is a tiny vector; its magnitude is the step length η mentioned earlier, analogous to one small step in the descent. η is a scalar, and the unit vector along θ − θo is denoted v. Then θ − θo can be expressed as:

θ − θo = η·v
It is especially important to note that θ − θo cannot be too large: if it is, the linear approximation is no longer accurate enough and the first-order Taylor expansion no longer holds. After substitution, the expression for f(θ) becomes:

f(θ) ≈ f(θo) + η·v·∇f(θo)
Here comes the point: the purpose of the descent is to make the function value f(θ) smaller at every update of θ. That is, in the formula above we want f(θ) < f(θo), which requires:

f(θ) − f(θo) ≈ η·v·∇f(θo) < 0
Because η is a scalar and is generally set to a positive value, it can be dropped from the inequality, which becomes:

v·∇f(θo) < 0
The above inequality is very important! Both v and ∇f(θo) are vectors: ∇f(θo) is the gradient at the current position, and v is the unit vector of the next step, which is what we need to find. Once we have v, we can determine θ from θ − θo = η·v.
To make the dot product of two vectors less than zero, first recall what the dot product of two vectors involves. Let A and B be vectors and α the angle between them. The dot product of A and B is:

A·B = ||A||·||B||·cos(α)
Both ||A|| and ||B|| are scalars. With ||A|| and ||B|| fixed, the product A·B is minimized (most negative) exactly when cos(α) = −1, that is, when A and B point in completely opposite directions.
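This minimization claim is easy to verify numerically. A minimal sketch, with the vector norms fixed at arbitrary assumed values (2 and 3, not from the text), sweeping the angle α:

```python
import math

# A.B = ||A|| * ||B|| * cos(alpha): with the magnitudes fixed, the dot product
# is smallest (most negative) when alpha = pi, i.e. the vectors point opposite ways.

norm_a, norm_b = 2.0, 3.0   # fixed vector magnitudes (arbitrary example values)

def dot_from_angle(alpha):
    """Dot product of two vectors with the given angle between them."""
    return norm_a * norm_b * math.cos(alpha)

angles = [0.0, math.pi / 2, math.pi]          # same direction, perpendicular, opposite
products = [dot_from_angle(a) for a in angles]
print(products)   # the value at alpha = pi is the minimum, -||A||*||B||
```

The minimum over the sweep occurs at α = π, matching cos(α) = −1.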
Accordingly, when v and ∇f(θo) point in opposite directions, that is, when v is the negative of the current gradient direction, v·∇f(θo) is as small as possible, which guarantees that v is the direction of fastest local descent.
Knowing that v points opposite to ∇f(θo), we can obtain directly:

v = −∇f(θo) / ||∇f(θo)||
We divide by the norm ||∇f(θo)|| because v is a unit vector.
Substituting this optimal v into θ − θo = η·v gives:

θ = θo − η·∇f(θo) / ||∇f(θo)||
In general, since ||∇f(θo)|| is a scalar, it can be absorbed into the step factor η, which simplifies the update to:

θ = θo − η·∇f(θo)
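To see the simplified update at work beyond one dimension, here is a minimal sketch on an assumed two-dimensional convex example, f(x, y) = (x − 1)² + (y + 2)², whose minimum sits at (1, −2) (the function and starting point are illustrative choices, not from the text):

```python
# Sketch of the derived update theta = theta0 - eta * grad_f(theta0) in 2D.
# Example convex function: f(x, y) = (x - 1)^2 + (y + 2)^2, minimum at (1, -2).

def grad_f(theta):
    """Gradient of the example function: (2(x - 1), 2(y + 2))."""
    x, y = theta
    return (2.0 * (x - 1.0), 2.0 * (y + 2.0))

eta = 0.1
theta = (4.0, 4.0)   # arbitrary starting position

for _ in range(200):
    g = grad_f(theta)
    theta = (theta[0] - eta * g[0], theta[1] - eta * g[1])

print(theta)   # approaches the minimum (1.0, -2.0)
```

Each coordinate contracts toward its optimum by a factor of (1 − 2η) per step, so the iterates converge to (1, −2).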
In this way, we have derived the update expression for θ in the gradient descent algorithm.
Summary
We have now understood the mathematical principle behind the gradient descent algorithm through the first-order Taylor expansion, using the ideas of linear approximation and of minimizing a vector dot product.