Basic understanding of optimization functions

Basic concepts

Derivative

Let the function $y=f(x)$ be defined in some neighborhood of the point $x_0$. When the independent variable $x$ takes an increment $\Delta x$ at $x_0$ (with $x_0+\Delta x$ still in that neighborhood), the function takes the corresponding increment $\Delta y = f(x_0+\Delta x) - f(x_0)$. If the limit of the ratio of $\Delta y$ to $\Delta x$ exists as $\Delta x \to 0$, then $y=f(x)$ is differentiable at the point $x_0$, and this limit is called the derivative of $y=f(x)$ at $x_0$, denoted $f'(x_0)$:

$$f'(x_0)=\lim_{\Delta x\to 0}\frac{\Delta y}{\Delta x}=\lim_{\Delta x\to 0}\frac{f(x_0+\Delta x)-f(x_0)}{\Delta x}$$
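For example, for $f(x)=x^2$ the definition gives:

$$f'(x_0)=\lim_{\Delta x\to 0}\frac{(x_0+\Delta x)^2-x_0^2}{\Delta x}=\lim_{\Delta x\to 0}(2x_0+\Delta x)=2x_0$$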

Partial derivative

For multivariate functions, the derivative and the partial derivative are essentially the same: both are the limit of the ratio of the change in the function value to the change in the independent variable as the latter approaches 0. Simply put, the derivative applies to a univariate function $y=f(x)$ and gives the rate of change at a point along the positive direction of the x-axis; the partial derivative applies to a multivariate function $y=f(x_0, x_1, \dots, x_n)$ and gives the rate of change at a point $(x_0, x_1, \dots, x_n)$ along the positive direction of one coordinate axis.
$$\frac{\partial f}{\partial x_j}(x_0, x_1, \dots, x_n) = \lim_{\Delta x_j\to 0}\frac{\Delta y}{\Delta x_j} = \lim_{\Delta x_j\to 0}\frac{f(x_0, \dots, x_j+\Delta x_j, \dots, x_n) - f(x_0, \dots, x_j, \dots, x_n)}{\Delta x_j}$$
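For example, for $f(x, y) = x^2 y$, holding the other variable fixed gives:

$$\frac{\partial f}{\partial x} = 2xy, \qquad \frac{\partial f}{\partial y} = x^2$$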

Directional derivative

Derivatives and partial derivatives both describe the rate of change of a function along the positive direction of a coordinate axis. The directional derivative arises when we consider the rate of change at a point along an arbitrary direction.
In a multivariate function, a partial derivative lets the independent variable change along a single coordinate axis, while a directional derivative lets it change along several coordinate axes at once, i.e. along some chosen direction. Partial derivatives are therefore a special case of directional derivatives.
$$\frac{\partial f}{\partial l}(x_0, x_1, \dots, x_n) = \lim_{\rho\to 0}\frac{\Delta y}{\rho} = \lim_{\rho\to 0}\frac{f(x_0+\Delta x_0, \dots, x_j+\Delta x_j, \dots, x_n+\Delta x_n) - f(x_0, \dots, x_j, \dots, x_n)}{\rho}$$
$$\rho = \sqrt{(\Delta x_0)^2 + \dots + (\Delta x_j)^2 + \dots + (\Delta x_n)^2}$$

Gradient

The gradient is a vector: it points in the direction along which the directional derivative at a point attains its maximum value. That is, among all directional derivatives at the point, the gradient gives the direction in which the function changes fastest, and its magnitude is that maximum rate of change.
$$\operatorname{grad} f(x_0, \dots, x_j, \dots, x_n) = \left(\frac{\partial f}{\partial x_0}, \dots, \frac{\partial f}{\partial x_j}, \dots, \frac{\partial f}{\partial x_n}\right)$$

The relationship between directional derivative and gradient

For a function of two variables, the directional derivative can be written as follows:
$$\frac{\partial f}{\partial l} = \frac{\partial f}{\partial x}\cos\theta + \frac{\partial f}{\partial y}\sin\theta$$
where $\theta$ is the angle between the direction vector and the x-axis. The formula can be rewritten as the inner product of two vectors:
$$\frac{\partial f}{\partial l} = \left[\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right] \cdot \left[\cos\theta, \sin\theta\right]$$
The inner product of two vectors equals the product of their magnitudes multiplied by the cosine of the angle between them:
$$\frac{\partial f}{\partial l} = \sqrt{\left(\frac{\partial f}{\partial x}\right)^2 + \left(\frac{\partial f}{\partial y}\right)^2} \cdot \sqrt{\cos^2\theta + \sin^2\theta} \cdot \cos\varphi = \sqrt{\left(\frac{\partial f}{\partial x}\right)^2 + \left(\frac{\partial f}{\partial y}\right)^2} \cdot \cos\varphi$$

By the geometric meaning of the inner product, $\varphi$ is the angle between the two vectors, that is, the angle between the gradient and the chosen direction. The directional derivative attains its maximum value if and only if this angle is 0. In other words, the direction of the maximum directional derivative at a point is exactly the direction of the gradient.
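As a short worked example (my own, for illustration), take $f(x, y) = x^2 + y^2$ at the point $(1, 2)$:

$$\nabla f(1, 2) = (2, 4), \qquad \frac{\partial f}{\partial l} = 2\cos\theta + 4\sin\theta \le \sqrt{2^2 + 4^2} = 2\sqrt{5}$$

with equality exactly when $(\cos\theta, \sin\theta)$ points along $(2, 4)$, i.e. along the gradient.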

How to understand the gradient descent method

In simple terms, in neural networks, gradient descent is a method for finding the minimum of a loss function.

Intuitive version

Gradient Descent

Take the simplest univariate convex function $f(x)=x^2$ as an example to demonstrate how gradient descent finds the minimum of a function.
Suppose the starting point is $x_0 = 10$, analogous to the initial value of a neural network weight after initialization, as shown in the figure below:
*(figure: $f(x)=x^2$ with the starting point $x_0 = 10$)*
The gradient at $x_0$ is:

$$\operatorname{grad}(f(x_0)) = \frac{\partial f}{\partial x} = f'(x_0) = 2x\,\big|_{x=10} = 20$$

This corresponds to a vector along the x-axis: $\nabla f(x)$ points in the direction in which the function grows fastest, so $-\nabla f(x)$ points in the direction in which it decays fastest.
According to the gradient descent method, moving a certain distance gives the next position:

$$x_1 = x_0 - \eta\nabla f(x_0)$$

where $\eta$ is the step size, which controls the distance moved. Assuming $\eta = 0.1$, the next point is:

$$x_1 = x_0 - \eta\nabla f(x_0) = 10 - 0.1 \times 20 = 8$$
Iterating gradient descent in this way yields the following update process:

*(figure: successive gradient descent updates on $f(x)=x^2$)*

As the process repeats, the gradient $\nabla f(x)$ keeps shrinking and eventually approaches 0; when $\nabla f(x) = f'(x) = 0$, the function attains its minimum.
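A minimal sketch of this iteration in Python (the function, starting point, and step size follow the example above; the helper name `gradient_descent` is my own):

```python
def gradient_descent(grad, x0, eta, steps):
    """Iterate x <- x - eta * grad(x) and return all visited points."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - eta * grad(xs[-1]))
    return xs

# f(x) = x^2, so f'(x) = 2x; start at x0 = 10 with eta = 0.1
points = gradient_descent(lambda x: 2 * x, x0=10.0, eta=0.1, steps=10)
print(points)  # 10.0, 8.0, 6.4, 5.12, ... -> approaches 0
```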

Learning step size

In the calculation above, the learning step size $\eta$ controls the distance moved. Let us now observe the effect of different values of $\eta$ on the final result.

  1. $\eta$ too small

Set $\eta$ to 0.01 and iterate 10 times; the iterate is still a long way from the bottom.

*(figure: $\eta = 0.01$, 10 iterations)*

  1. $\eta$ suitable

Set $\eta$ to 0.2 and iterate 10 times; the iterate just reaches the bottom.

*(figure: $\eta = 0.2$, 10 iterations)*

  1. $\eta$ large

Set $\eta$ to 0.5 and iterate 10 times; the function value swings back and forth between two points.

*(figure: $\eta = 0.5$, 10 iterations)*

  1. $\eta$ too large

Set $\eta$ to 1.1 and iterate 10 times; the function value crosses the bottom and keeps climbing.

*(figure: $\eta = 1.1$, 10 iterations)*
In summary, under different learning step sizes $\eta$, the function value evolves in very different ways as the number of iterations increases.
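A small sketch that reproduces this comparison (assuming the `gradient_descent` helper from the sketch above):

```python
for eta in (0.01, 0.2, 0.5, 1.1):
    xs = gradient_descent(lambda x: 2 * x, x0=10.0, eta=eta, steps=10)
    print(f"eta={eta}: iterates {[round(x, 3) for x in xs]}")
```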

Math version

Gradient descent is a commonly used first-order optimization method and one of the simplest classical methods for solving unconstrained problems. For an unconstrained optimization problem $\min f(x)$, where $f(x)$ is a continuously differentiable function, if we can construct a sequence $x_0, x_1, \dots, x_n$ satisfying:

$$f(x_{t+1}) < f(x_t), \quad t = 0, 1, \dots, n$$
then the sequence converges to a local minimum, as shown in the figure below:

*(figure: a sequence of iterates descending to a local minimum)*
Then $\min f(x)$ becomes the problem of how to find the next point $x_{t+1}$ while guaranteeing $f(x_{t+1}) < f(x_t)$. For a univariate function, the function value changes only with $x$, so the next point $x_{t+1}$ is obtained by taking a small step $\Delta x$ from the previous point $x_t$ in some direction.
By Taylor expansion:
$$f(x+\Delta x) \simeq f(x) + \Delta x\, f'(x)$$
The left side is the value after moving a small step $\Delta x$ from the current $x$, which is approximately equal to the right side. Since we need $f(x_{t+1}) < f(x_t)$, i.e. $f(x+\Delta x) < f(x)$, we require $\Delta x\, f'(x) < 0$.
Let $\Delta x = -\alpha f'(x)$ with $\alpha > 0$, where $\alpha$ is the learning step size; then $\Delta x\, f'(x) = -\alpha f'(x)^2$. Since the square of any non-zero number is positive, this guarantees $\Delta x\, f'(x) < 0$.
Thus, setting

$$f(x+\Delta x) = f(x - \alpha f'(x))$$

guarantees $f(x+\Delta x) < f(x)$, i.e. $x_{t+1} = x_t - \alpha f'(x_t)$: the moving direction is the negative gradient direction.
The overall flow of gradient descent is therefore: initialize $x_0$, then repeat the update $x_{t+1} = x_t - \alpha f'(x_t)$ until the gradient is close to 0 or an iteration budget is exhausted.

Optimization functions

Stochastic gradient descent

Stochastic gradient descent (SGD) updates the weight parameters using a single randomly selected sample at a time.
Advantages:

  • Training is fast, avoiding the computational redundancy of batch gradient updates;
  • Even when the amount of training data is large, it can converge quickly;

Disadvantages:

  • Since samples are drawn at random, a training run may effectively rely on only a small portion of the data, and the single-sample gradient can deviate from the true gradient;
  • In addition, the gradient estimate has higher variance, so training is more volatile.
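A minimal sketch of the SGD update for linear regression with squared loss (the model, data, and names here are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])            # targets from a known weight vector

w, eta = np.zeros(3), 0.01
for epoch in range(10):
    for i in rng.permutation(len(X)):         # one randomly chosen sample per update
        grad = 2 * (X[i] @ w - y[i]) * X[i]   # gradient of (x_i . w - y_i)^2
        w -= eta * grad
print(w)  # should approach [1.0, -2.0, 0.5]
```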

Batch gradient descent

Batch gradient descent traverses all sample data for each update of the weight parameters.
Advantages:

  • Since every update uses the complete data set, the gradient direction is exact; on convex problems it converges to the global minimum;

Disadvantages:

  • Model training is slow; when the data set is large it cannot be fully loaded into memory, and each gradient computation is expensive;

Mini-batch gradient descent

Each iteration uses a small batch of samples to update the weight parameters.
Advantages:

  • Reduces the high variance of SGD, making convergence more stable;
  • Matrix operations over a batch speed up the computation;

Disadvantages:

  • Sample selection is still random, so good convergence of the model is not guaranteed;
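A sketch of the mini-batch update under the same linear-regression setup as the SGD sketch above (the batch size is my own choice); with `batch_size = len(X)` it reduces to batch gradient descent, and with `batch_size = 1` to SGD:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w, eta, batch_size = np.zeros(3), 0.05, 16
for epoch in range(20):
    order = rng.permutation(len(X))           # shuffle, then walk through in batches
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # batch-averaged gradient
        w -= eta * grad
print(w)  # should approach [1.0, -2.0, 0.5]
```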

Exponentially weighted average

Before introducing the remaining optimization functions, it is necessary to understand the exponentially weighted average, which is essentially a way of averaging. It can be used to estimate the local mean of a variable, so that each update depends on the variable's historical values over a period of time; each observation is given a different weight, and the weights decay exponentially with time.
Compared with a plain average, the exponentially weighted average does not need to store all past values, so the amount of computation is significantly reduced.
The formula of the exponential weighted average method is as follows:
$$v_{t+1} = \beta v_t + (1-\beta)\theta_{t+1}$$

where $v_{t+1}$ denotes the average up to day $t+1$, $\theta_{t+1}$ denotes the observation (e.g. a temperature reading) on day $t+1$, and $\beta$ is a tunable hyperparameter. The three important properties of the exponentially weighted average are verified below:

  • Local average
  • Weighting coefficients
  • Coefficients that decay exponentially with time

Suppose $\beta = 0.9$; then the averaging process unrolls as follows:
$$v_{100} = 0.9 v_{99} + 0.1\theta_{100};$$
$$v_{99} = 0.9 v_{98} + 0.1\theta_{99};$$
$$v_{98} = 0.9 v_{97} + 0.1\theta_{98};$$
Substituting repeatedly gives the following expansion:
$$v_{100} = 0.1\theta_{100} + 0.9 v_{99}$$
$$= 0.1\theta_{100} + 0.9(0.1\theta_{99} + 0.9 v_{98}) = 0.1\theta_{100} + 0.1 \cdot 0.9\,\theta_{99} + 0.9^2(0.1\theta_{98} + 0.9 v_{97})$$
$$= 0.1\theta_{100} + 0.1 \cdot 0.9\,\theta_{99} + 0.1 \cdot 0.9^2\,\theta_{98} + 0.9^3 v_{97}$$

From this expansion we can see that $v_{100}$ is a sum of the observations, each multiplied by a weight, and the weights decay exponentially with time. Moreover, the exponentially weighted average is a local average: roughly $\frac{1}{1-\beta}$ observations are being averaged. In this example $\beta = 0.9$, so about 10 observations are referenced.
To make the average accurate in the early stages as well, a bias correction is needed. Here is why: assume $\beta = 0.98$, initialize $v_0 = 0$, and let $\theta_1 = 40$. Then
$$v_1 = \beta v_0 + (1-\beta)\theta_1 = 0.98 \cdot 0 + 0.02 \cdot 40 = 0.8$$
so the estimate on day 1 is far too low. Likewise $v_2 = 0.98 v_1 + 0.02\theta_2 = 0.98 \cdot 0.02\,\theta_1 + 0.02\theta_2$; since $\theta_1$ and $\theta_2$ are both positive, the computed $v_2$ is much smaller than $\theta_1$ and $\theta_2$.
To correct the inaccurate early estimates, use $\frac{v_t}{1-\beta^t}$ as the estimate for day $t$, where $t$ is the number of days. The corrected estimate for day 2 is:
$$\frac{v_2}{1 - 0.98^2} = \frac{v_2}{0.0396}$$
As $t$ increases, $\beta^t$ approaches 0, so when $t$ is large the bias correction has little effect.
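A minimal sketch of the exponentially weighted average with bias correction (the function name and signature are my own):

```python
def ewa(observations, beta=0.9, correct_bias=True):
    """Exponentially weighted averages v_t, optionally bias-corrected."""
    v, out = 0.0, []
    for t, theta in enumerate(observations, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta**t) if correct_bias else v)
    return out

print(ewa([40, 42, 41], beta=0.98))  # corrected values stay near the observations
```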

Momentum gradient descent method

The basic idea of momentum gradient descent is to compute an exponentially weighted average of the gradients and use that average to update the weights.
With mini-batch gradient descent, because of the randomness of sampling, each step does not move strictly toward the minimum, although the overall downward trend does point toward it.
Momentum gradient descent reduces the amplitude of these oscillations and improves the learning speed.
The expression of the momentum gradient descent method is as follows:
$$v_{dW} = \beta v_{dW} + (1-\beta)\,dW$$
$$v_{db} = \beta v_{db} + (1-\beta)\,db$$
$$W = W - \alpha v_{dW}; \quad b = b - \alpha v_{db}$$
The exponentially weighted average produces a local average of the gradients, which replaces the single-iteration gradient in the parameter update. The method borrows the concept of momentum from physics, picturing parameter optimization as pushing a ball downhill, with the gradient playing the role of the ball's acceleration. When successive gradients point in the same direction, learning accelerates; when they point in different directions, the oscillations are damped.
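A sketch of one momentum step for a parameter vector (a minimal illustration; `grad_fn` and the hyperparameter values are my own assumptions):

```python
def momentum_step(w, v, grad_fn, alpha=0.01, beta=0.9):
    """One momentum update: smooth the gradient, then step along the average."""
    g = grad_fn(w)
    v = beta * v + (1 - beta) * g  # exponentially weighted average of gradients
    w = w - alpha * v
    return w, v

# usage: w, v = momentum_step(w, v, grad_fn=lambda w: 2 * w)
```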

RMSprop

The momentum algorithm already alleviates the problem of large oscillations during parameter optimization, where the oscillation refers to how far the parameters swing from one update to the next; RMSprop reduces it further. In the figure below, blue is the route taken by the Momentum optimizer and green is the route taken by RMSprop.

*(figure: optimization paths; blue: Momentum, green: RMSprop)*
The expression of RMSprop is as follows:
$$S_{dW} = \beta S_{dW} + (1-\beta)(dW)^2$$
$$S_{db} = \beta S_{db} + (1-\beta)(db)^2$$
$$W = W - \alpha\frac{dW}{\sqrt{S_{dW}}}; \quad b = b - \alpha\frac{db}{\sqrt{S_{db}}}$$
RMSprop first uses the exponentially weighted average to obtain an average of the squared gradients, then divides the gradient by its square root when updating the weight parameters.
In other words, RMSprop computes an exponentially weighted average of the squared gradients. This helps suppress directions with large oscillations: when $dW$ or $db$ is relatively large, the update divides it by the square root of the accumulated squared gradients, so the update amplitude becomes small. On the one hand this corrects the oscillation, keeping the swing in each dimension small; on the other hand it lets the network converge faster.
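A sketch of one RMSprop step (minimal; the $\epsilon$ term, omitted in the formulas above, is the usual small constant added for numerical stability, and the names are my own):

```python
import numpy as np

def rmsprop_step(w, s, grad_fn, alpha=0.001, beta=0.9, eps=1e-8):
    """One RMSprop update: scale the step by the RMS of recent gradients."""
    g = grad_fn(w)
    s = beta * s + (1 - beta) * g**2  # weighted average of squared gradients
    w = w - alpha * g / (np.sqrt(s) + eps)
    return w, s
```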

Adam

The Adam optimization algorithm is essentially RMSprop combined with Momentum. The calculation proceeds as follows:

  1. Calculate the gradients $dW$, $db$;
  2. Compute the momentum exponentially weighted averages: $v_{dW} = \beta_1 v_{dW} + (1-\beta_1)dW$, $v_{db} = \beta_1 v_{db} + (1-\beta_1)db$;
  3. Apply the RMSprop update: $S_{dW} = \beta_2 S_{dW} + (1-\beta_2)(dW)^2$, $S_{db} = \beta_2 S_{db} + (1-\beta_2)(db)^2$;
  4. Apply bias correction: $v_{dW}^{corrected} = \frac{v_{dW}}{1-\beta_1^t}$, $v_{db}^{corrected} = \frac{v_{db}}{1-\beta_1^t}$, $S_{dW}^{corrected} = \frac{S_{dW}}{1-\beta_2^t}$, $S_{db}^{corrected} = \frac{S_{db}}{1-\beta_2^t}$;
  5. Update the weights: $W = W - \alpha\frac{v_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}}+\epsilon}$, $b = b - \alpha\frac{v_{db}^{corrected}}{\sqrt{S_{db}^{corrected}}+\epsilon}$.

The Adam algorithm combines RMSprop and Momentum and is an extremely commonly used optimizer across different neural networks.
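A sketch of one Adam step following the five steps above (a minimal single-parameter-vector version; the names and the common defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$ are my own choices):

```python
import numpy as np

def adam_step(w, v, s, t, grad_fn, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1)."""
    g = grad_fn(w)
    v = beta1 * v + (1 - beta1) * g       # momentum term
    s = beta2 * s + (1 - beta2) * g**2    # RMSprop term
    v_hat = v / (1 - beta1**t)            # bias correction
    s_hat = s / (1 - beta2**t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s
```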

