Simple understanding of gradient descent algorithm: first-order Taylor expansion, mathematical principle of gradient descent

Table of contents

Simple understanding of gradient descent algorithm

First order Taylor expansion

Gradient Descent Mathematics


Simple understanding of gradient descent algorithm

θ = θo − η⋅∇f(θo)

The formula of the gradient descent algorithm is very simple: "move along the opposite direction of the gradient (the steepest slope)." That rule matches our everyday experience, but what is the essential reason behind it? Why is the direction of fastest local descent the negative direction of the gradient? Many readers may still be unsure. It doesn't matter; below I will walk through the mathematical derivation of the gradient descent formula in plain language.

The downhill problem

Suppose we are standing somewhere on a mountainside of Huangshan Mountain. The ridges stretch endlessly, and we don't know the way down. So we decide to take it one step at a time: from the current position, take a small step in the steepest downhill direction, then from the new position take another small step in its steepest downhill direction, and so on, until we feel we have reached the foot of the mountain. The steepest downhill direction here is the negative direction of the gradient.


First, what is a gradient? In plain terms, the gradient of a function at a point is the direction along which the directional derivative attains its maximum value; its magnitude is the rate of change of the function at the current position.

∇f(θ) = df(θ) / dθ

In the above formula, θ is the independent variable, f(θ) is a function of θ, and ∇f(θ) represents the gradient.

If the function f(θ) is convex, it can be optimized with the gradient descent algorithm. We are already familiar with the update formula of gradient descent:

θ = θo − η⋅∇f(θo)

Here, θo is the current value of the parameter (the coordinates of our current position on the hill), η is the learning rate (the step length of each small step downhill), and θ is the updated value of θo, that is, the position after taking one small step downhill.
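As a minimal sketch of this single update step (using a hypothetical convex function f(θ) = θ², chosen only for illustration; its gradient is 2θ):

```python
# One gradient descent update: theta = theta_0 - eta * gradient(theta_0).
# f(theta) = theta**2 is an illustrative convex function; its gradient is 2*theta.
def gradient(theta):
    return 2 * theta

def step(theta0, eta):
    # Move a small step against the gradient direction.
    return theta0 - eta * gradient(theta0)

theta = step(5.0, 0.1)  # starting from theta_0 = 5.0
print(theta)            # 5.0 - 0.1 * 10.0 = 4.0
```

Repeating this step moves θ steadily toward the minimum of f, as the rest of the derivation will justify.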

First order Taylor expansion

If a function is smooth enough, then, knowing the values of its derivatives at a point, Taylor's formula can use those derivative values as coefficients to construct a polynomial that approximates the function in a neighborhood of that point.

A little mathematical background is needed here: some familiarity with Taylor expansions. Simply put, the first-order Taylor expansion uses the idea of a local linear approximation of a function. Let's take the first-order expansion as an example:

f(θ) ≈ f(θo) + (θ − θo)⋅∇f(θo)

[Figure: the black curve f(θ) on [θo, θ], approximated by the red tangent line through (θo, f(θo))]

Consider a small segment [θo, θ] of the convex function f(θ), shown as the black curve in the figure above. The value of f(θ) can be obtained by linear approximation, shown as the red straight line in the figure. The slope of this line equals the derivative of f(θ) at θo. From the equation of the straight line, the approximate expression for f(θ) follows easily:

f(θ) ≈ f(θo) + (θ − θo)⋅∇f(θo)

This is the derivation of the first-order Taylor expansion; the main mathematical idea is the linear approximation of a curved function.
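The quality of this linear approximation is easy to check numerically. The sketch below (using exp as an illustrative smooth function, not one from the original post; note that exp is its own derivative) compares the exact value with the first-order approximation for a nearby point:

```python
import math

# First-order Taylor approximation of f near theta0:
#   f(theta) ≈ f(theta0) + (theta - theta0) * f'(theta0)
def taylor1(f, df, theta0, theta):
    return f(theta0) + (theta - theta0) * df(theta0)

theta0, theta = 1.0, 1.01  # theta close to theta0
exact = math.exp(theta)
approx = taylor1(math.exp, math.exp, theta0, theta)
print(abs(exact - approx))  # tiny error (on the order of 1e-4) for a small step
```

The error shrinks quadratically as θ approaches θo, which is exactly why the expansion is trustworthy only locally.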

Gradient Descent Mathematics

Now that we know the first-order Taylor expansion, we come to the key point: let's see how the gradient descent algorithm is derived.

First write the expression of the first-order Taylor expansion:

f(θ) ≈ f(θo) + (θ − θo)⋅∇f(θo)

Here θ − θo is a tiny vector; its magnitude is the step length η mentioned earlier, analogous to each small step in the descent. η is a scalar, and the unit vector along θ − θo is denoted v. Then θ − θo can be written as:

θ − θo = η⋅v

It is especially important that θ − θo not be too large: if it is, the linear approximation is no longer accurate, and the first-order Taylor approximation fails to hold. After substitution, the expression for f(θ) becomes:

f(θ) ≈ f(θo) + η⋅v⋅∇f(θo)
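The warning that the step must stay small can also be seen numerically. This sketch (with an illustrative quadratic, not from the original post) compares the exact value of f with its first-order approximation for a small and a large step:

```python
# First-order approximation error for f(theta) = theta**2, f'(theta) = 2*theta.
f = lambda t: t * t
df = lambda t: 2 * t
theta0 = 1.0

def taylor_error(delta):
    # |f(theta0 + delta) - [f(theta0) + delta * f'(theta0)]|
    approx = f(theta0) + delta * df(theta0)
    return abs(f(theta0 + delta) - approx)

print(taylor_error(0.01))  # small step: error ≈ 1e-4
print(taylor_error(1.0))   # large step: error ≈ 1.0
```

For this quadratic the error is exactly delta², so a step 100 times larger gives an approximation 10,000 times worse — the linear picture only holds locally.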

Here comes the point: the goal of each update of θ is to make the function value f(θ) smaller. That is, in the formula above we want f(θ) < f(θo). Then we have:

f(θ) − f(θo) ≈ η⋅v⋅∇f(θo) < 0

Because η is a scalar and is generally set to a positive value, it can be dropped, and the inequality becomes:

v⋅∇f(θo) < 0

The above inequality is very important! Both v and ∇f(θo) are vectors: ∇f(θo) is the gradient at the current position, and v is the unit vector of the next step, which is what we need to find. Once we have v, we can determine the value of θ from θ − θo = η⋅v.

We want the dot product of these two vectors to be less than zero, so let's first look at what the dot product of two vectors involves:

[Figure: two vectors A and B with angle α between them]

Both A and B are vectors, and α is the angle between them. The dot product of A and B is:

A⋅B = ||A||⋅||B||⋅cos(α)

Both ||A|| and ||B|| are scalars. With ||A|| and ||B|| fixed, the dot product of A and B is minimized (most negative) exactly when cos(α) = −1, that is, when A and B point in completely opposite directions.

Accordingly, when v and ∇f(θo) are opposite to each other, i.e., when v points in the negative direction of the current gradient, v⋅∇f(θo) is made as small as possible, which guarantees that the direction of v is the direction of fastest local descent.
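This extremal property of the dot product can be checked directly. The sketch below uses a hypothetical gradient vector [3, 4] (norm 5) and compares the unit vector against the gradient with the unit vector along it:

```python
import math

# A·B = ||A|| * ||B|| * cos(alpha): with fixed magnitudes, the dot product is
# smallest (most negative) when cos(alpha) = -1, i.e. the vectors are opposite.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

grad = [3.0, 4.0]                     # hypothetical gradient, norm 5
norm = math.sqrt(dot(grad, grad))
opposite = [-g / norm for g in grad]  # unit vector against the gradient
aligned = [g / norm for g in grad]    # unit vector along the gradient

print(dot(opposite, grad))  # ≈ -5.0, the minimum over all unit vectors
print(dot(aligned, grad))   # ≈ +5.0, the maximum
```

No other unit vector v can make v⋅grad smaller than −||grad||, which is the whole argument in miniature.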

Knowing that v points in the opposite direction of ∇f(θo), we obtain directly:

v = −∇f(θo) / ||∇f(θo)||

The division by the modulus ||∇f(θo)|| is there because v is a unit vector.

Substituting this optimal v into θ − θo = η⋅v, we get:

θ = θo − η⋅∇f(θo) / ||∇f(θo)||

In general, since ||∇f(θo)|| is a scalar, it can be absorbed into the step factor η, which simplifies the update to:

θ = θo − η⋅∇f(θo)

In this way, we derive  the update expression for θ  in the gradient descent algorithm .
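Iterating this update on a convex function drives θ toward the minimizer, just as the downhill story promised. A minimal sketch (with an illustrative quadratic f(θ) = (θ − 3)², minimized at θ = 3, not a function from the original post):

```python
# Repeated gradient descent updates: theta = theta - eta * gradient(theta).
# f(theta) = (theta - 3)**2 is an illustrative convex function; gradient 2*(theta - 3).
def gradient(theta):
    return 2 * (theta - 3.0)

theta, eta = 10.0, 0.1
for _ in range(100):
    theta = theta - eta * gradient(theta)

print(theta)  # very close to 3.0, the minimizer
```

Each step shrinks the distance to the minimum by the factor (1 − 2η), so with η = 0.1 the iterate converges geometrically.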

Summary

We have understood the mathematical principle of the gradient descent algorithm through the first-order Taylor expansion, using the ideas of linear approximation and minimizing a vector dot product.


Origin blog.csdn.net/qq_38998213/article/details/132524014