Basic understanding of optimization functions

Basic concepts

Derivative

Let the function $y=f(x)$ be defined in some neighborhood of the point $x_0$. When the independent variable $x$ takes an increment $\Delta x$ at $x_0$ (with $x_0+\Delta x$ still in that neighborhood), the function takes the corresponding increment $\Delta y = f(x_0+\Delta x) - f(x_0)$. If the limit of the ratio of $\Delta y$ to $\Delta x$ exists as $\Delta x \to 0$, then $y=f(x)$ is differentiable at the point $x_0$, and this limit is called the derivative of $y=f(x)$ at $x_0$, denoted $f'(x_0)$:

$$f'(x_0)=\lim_{\Delta x\to 0}\frac{\Delta y}{\Delta x}=\lim_{\Delta x\to 0}\frac{f(x_0+\Delta x)-f(x_0)}{\Delta x}$$
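For example, for $f(x)=x^2$ the definition gives:

$$f'(x_0)=\lim_{\Delta x\to 0}\frac{(x_0+\Delta x)^2-x_0^2}{\Delta x}=\lim_{\Delta x\to 0}(2x_0+\Delta x)=2x_0$$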

Partial derivative

For multivariate functions, the derivative and the partial derivative are essentially the same: both are the limit of the ratio of the change in the function value to the change in the independent variable as the latter approaches 0. Simply put, the derivative applies to a univariate function $y=f(x)$ and gives the rate of change at a point along the positive direction of the x-axis; the partial derivative applies to a multivariate function $y=f(x_0, x_1, \dots, x_n)$ and gives the rate of change at a point $(x_0, x_1, \dots, x_n)$ along the positive direction of one coordinate axis.
$$\frac{\partial f}{\partial x_j}(x_0, x_1, \dots, x_n) = \lim_{\Delta x_j\to 0}\frac{\Delta y}{\Delta x_j} = \lim_{\Delta x_j\to 0}\frac{f(x_0, \dots, x_j+\Delta x_j, \dots, x_n) - f(x_0, \dots, x_j, \dots, x_n)}{\Delta x_j}$$
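For example, for $f(x, y) = x^2 y$, holding the other variable fixed gives:

$$\frac{\partial f}{\partial x} = 2xy, \qquad \frac{\partial f}{\partial y} = x^2$$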

Directional derivative

Derivatives and partial derivatives both describe the rate of change of a function along the positive direction of a coordinate axis. The directional derivative arises when we consider the rate of change at a point along an arbitrary direction.
In a multivariate function, a partial derivative lets the independent variable change along a single coordinate axis, while a directional derivative lets it change along several coordinate axes at once, i.e. along some chosen direction. Partial derivatives are therefore a special case of directional derivatives.
$$\frac{\partial f}{\partial l}(x_0, x_1, \dots, x_n) = \lim_{\rho\to 0}\frac{\Delta y}{\rho} = \lim_{\rho\to 0}\frac{f(x_0+\Delta x_0, \dots, x_j+\Delta x_j, \dots, x_n+\Delta x_n) - f(x_0, \dots, x_j, \dots, x_n)}{\rho}$$
$$\rho = \sqrt{(\Delta x_0)^2 + \dots + (\Delta x_j)^2 + \dots + (\Delta x_n)^2}$$

Gradient

The gradient is a vector: it points in the direction along which the directional derivative at a point attains its maximum value. That is, among all directional derivatives at the point, the gradient gives the direction in which the function changes fastest, and its magnitude is that maximum rate of change.
$$\operatorname{grad} f(x_0, \dots, x_j, \dots, x_n) = \left(\frac{\partial f}{\partial x_0}, \dots, \frac{\partial f}{\partial x_j}, \dots, \frac{\partial f}{\partial x_n}\right)$$

The relationship between directional derivative and gradient

For a function of two variables, the directional derivative can be written as follows:
$$\frac{\partial f}{\partial l} = \frac{\partial f}{\partial x}\cos\theta + \frac{\partial f}{\partial y}\sin\theta$$
where $\theta$ is the angle between the direction vector and the x-axis. The formula can be rewritten as the inner product of two vectors:
$$\frac{\partial f}{\partial l} = \left[\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right] \cdot \left[\cos\theta, \sin\theta\right]$$
The inner product of two vectors equals the product of their magnitudes multiplied by the cosine of the angle between them:
$$\frac{\partial f}{\partial l} = \sqrt{\left(\frac{\partial f}{\partial x}\right)^2 + \left(\frac{\partial f}{\partial y}\right)^2} \cdot \sqrt{\cos^2\theta + \sin^2\theta} \cdot \cos\varphi = \sqrt{\left(\frac{\partial f}{\partial x}\right)^2 + \left(\frac{\partial f}{\partial y}\right)^2} \cdot \cos\varphi$$

By the geometric meaning of the inner product, $\varphi$ is the angle between the two vectors, that is, the angle between the gradient and the chosen direction. The directional derivative attains its maximum value if and only if this angle is 0. In other words, the direction of the maximum directional derivative at a point is exactly the direction of the gradient.
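As a short worked example (my own, for illustration), take $f(x, y) = x^2 + y^2$ at the point $(1, 2)$:

$$\nabla f(1, 2) = (2, 4), \qquad \frac{\partial f}{\partial l} = 2\cos\theta + 4\sin\theta \le \sqrt{2^2 + 4^2} = 2\sqrt{5}$$

with equality exactly when $(\cos\theta, \sin\theta)$ points along $(2, 4)$, i.e. along the gradient.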

How to understand the gradient descent method

In simple terms, in neural networks, gradient descent is a method for finding the minimum of a loss function.

Intuitive version

Gradient Descent

Take the simplest univariate convex function $f(x)=x^2$ as an example to demonstrate how gradient descent finds the minimum of a function.
Suppose the starting point is $x_0 = 10$, analogous to the initial value of a neural network weight after initialization, as shown in the figure below:
*(figure: $f(x)=x^2$ with the starting point $x_0 = 10$)*
The gradient at $x_0$ is:

$$\operatorname{grad}(f(x_0)) = \frac{\partial f}{\partial x} = f'(x_0) = 2x\,\big|_{x=10} = 20$$

This corresponds to a vector along the x-axis: $\nabla f(x)$ points in the direction in which the function grows fastest, so $-\nabla f(x)$ points in the direction in which it decays fastest.
According to the gradient descent method, moving a certain distance gives the next position:

$$x_1 = x_0 - \eta\nabla f(x_0)$$

where $\eta$ is the step size, which controls the distance moved. Assuming $\eta = 0.1$, the next point is:

$$x_1 = x_0 - \eta\nabla f(x_0) = 10 - 0.1 \times 20 = 8$$
Iterating gradient descent in this way yields the following update process:

*(figure: successive gradient descent updates on $f(x)=x^2$)*

As the process repeats, the gradient $\nabla f(x)$ keeps shrinking and eventually approaches 0; when $\nabla f(x) = f'(x) = 0$, the function attains its minimum.
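A minimal sketch of this iteration in Python (the function, starting point, and step size follow the example above; the helper name `gradient_descent` is my own):

```python
def gradient_descent(grad, x0, eta, steps):
    """Iterate x <- x - eta * grad(x) and return all visited points."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - eta * grad(xs[-1]))
    return xs

# f(x) = x^2, so f'(x) = 2x; start at x0 = 10 with eta = 0.1
points = gradient_descent(lambda x: 2 * x, x0=10.0, eta=0.1, steps=10)
print(points)  # 10.0, 8.0, 6.4, 5.12, ... -> approaches 0
```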

Learning step size

In the calculation above, the learning step size $\eta$ controls the distance moved. Let us now observe the effect of different values of $\eta$ on the final result.

  1. $\eta$ too small

Set $\eta$ to 0.01 and iterate 10 times; the iterate is still a long way from the bottom.

*(figure: $\eta = 0.01$, 10 iterations)*

  1. $\eta$ suitable

Set $\eta$ to 0.2 and iterate 10 times; the iterate just reaches the bottom.

*(figure: $\eta = 0.2$, 10 iterations)*

  1. $\eta$ large

Set $\eta$ to 0.5 and iterate 10 times; the function value swings back and forth between two points.

*(figure: $\eta = 0.5$, 10 iterations)*

  1. $\eta$ too large

Set $\eta$ to 1.1 and iterate 10 times; the function value crosses the bottom and keeps climbing.

*(figure: $\eta = 1.1$, 10 iterations)*
In summary, under different learning step sizes $\eta$, the function value evolves in very different ways as the number of iterations increases.
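A small sketch that reproduces this comparison (assuming the `gradient_descent` helper from the sketch above):

```python
for eta in (0.01, 0.2, 0.5, 1.1):
    xs = gradient_descent(lambda x: 2 * x, x0=10.0, eta=eta, steps=10)
    print(f"eta={eta}: iterates {[round(x, 3) for x in xs]}")
```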

Math version

Gradient descent is a commonly used first-order optimization method and one of the simplest classical methods for solving unconstrained problems. For an unconstrained optimization problem $\min f(x)$, where $f(x)$ is a continuously differentiable function, if we can construct a sequence $x_0, x_1, \dots, x_n$ satisfying:

$$f(x_{t+1}) < f(x_t), \quad t = 0, 1, \dots, n$$
then the sequence converges to a local minimum, as shown in the figure below:

*(figure: a sequence of iterates descending to a local minimum)*
Then $\min f(x)$ becomes the problem of how to find the next point $x_{t+1}$ while guaranteeing $f(x_{t+1}) < f(x_t)$. For a univariate function, the function value changes only with $x$, so the next point $x_{t+1}$ is obtained by taking a small step $\Delta x$ from the previous point $x_t$ in some direction.
By Taylor expansion:
$$f(x+\Delta x) \simeq f(x) + \Delta x\, f'(x)$$
The left side is the value after moving a small step $\Delta x$ from the current $x$, which is approximately equal to the right side. Since we need $f(x_{t+1}) < f(x_t)$, i.e. $f(x+\Delta x) < f(x)$, we require $\Delta x\, f'(x) < 0$.
Let $\Delta x = -\alpha f'(x)$ with $\alpha > 0$, where $\alpha$ is the learning step size; then $\Delta x\, f'(x) = -\alpha f'(x)^2$. Since the square of any non-zero number is positive, this guarantees $\Delta x\, f'(x) < 0$.
Thus, setting

$$f(x+\Delta x) = f(x - \alpha f'(x))$$

guarantees $f(x+\Delta x) < f(x)$, i.e. $x_{t+1} = x_t - \alpha f'(x_t)$: the moving direction is the negative gradient direction.
The overall flow of gradient descent is therefore: initialize $x_0$, then repeat the update $x_{t+1} = x_t - \alpha f'(x_t)$ until the gradient is close to 0 or an iteration budget is exhausted.

Optimization functions

Stochastic gradient descent

Stochastic gradient descent (SGD) updates the weight parameters using a single randomly selected sample at a time.
Advantages:

  • Training is fast, avoiding the computational redundancy of batch gradient updates;
  • Even when the amount of training data is large, it can converge quickly;

Disadvantages:

  • Since samples are drawn at random, a training run may effectively rely on only a small portion of the data, and the single-sample gradient can deviate from the true gradient;
  • In addition, the gradient estimate has higher variance, so training is more volatile.
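A minimal sketch of the SGD update for linear regression with squared loss (the model, data, and names here are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])            # targets from a known weight vector

w, eta = np.zeros(3), 0.01
for epoch in range(10):
    for i in rng.permutation(len(X)):         # one randomly chosen sample per update
        grad = 2 * (X[i] @ w - y[i]) * X[i]   # gradient of (x_i . w - y_i)^2
        w -= eta * grad
print(w)  # should approach [1.0, -2.0, 0.5]
```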

Batch gradient descent

Batch gradient descent traverses all sample data for each update of the weight parameters.
Advantages:

  • Since every update uses the complete data set, the gradient direction is exact; on convex problems it converges to the global minimum;

Disadvantages:

  • Model training is slow; when the data set is large it cannot be fully loaded into memory, and each gradient computation is expensive;

Mini-batch gradient descent

Each iteration uses a small batch of samples to update the weight parameters.
Advantages:

  • Reduces the high variance of SGD, making convergence more stable;
  • Matrix operations over a batch speed up the computation;

Disadvantages:

  • Sample selection is still random, so good convergence of the model is not guaranteed;
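A sketch of the mini-batch update under the same linear-regression setup as the SGD sketch above (the batch size is my own choice); with `batch_size = len(X)` it reduces to batch gradient descent, and with `batch_size = 1` to SGD:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w, eta, batch_size = np.zeros(3), 0.05, 16
for epoch in range(20):
    order = rng.permutation(len(X))           # shuffle, then walk through in batches
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # batch-averaged gradient
        w -= eta * grad
print(w)  # should approach [1.0, -2.0, 0.5]
```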

Exponentially weighted average

Before introducing the remaining optimization functions, it is necessary to understand the exponentially weighted average, which is essentially a way of averaging. It can be used to estimate the local mean of a variable, so that each update depends on the variable's historical values over a period of time; each observation is given a different weight, and the weights decay exponentially with time.
Compared with a plain average, the exponentially weighted average does not need to store all past values, so the amount of computation is significantly reduced.
The formula of the exponential weighted average method is as follows:
$$v_{t+1} = \beta v_t + (1-\beta)\theta_{t+1}$$

where $v_{t+1}$ denotes the average up to day $t+1$, $\theta_{t+1}$ denotes the observation (e.g. a temperature reading) on day $t+1$, and $\beta$ is a tunable hyperparameter. The three important properties of the exponentially weighted average are verified below:

  • Local average
  • Weighting coefficients
  • Coefficients that decay exponentially with time

Suppose $\beta = 0.9$; then the averaging process unrolls as follows:
$$v_{100} = 0.9 v_{99} + 0.1\theta_{100};$$
$$v_{99} = 0.9 v_{98} + 0.1\theta_{99};$$
$$v_{98} = 0.9 v_{97} + 0.1\theta_{98};$$
Substituting repeatedly gives the following expansion:
$$v_{100} = 0.1\theta_{100} + 0.9 v_{99}$$
$$= 0.1\theta_{100} + 0.9(0.1\theta_{99} + 0.9 v_{98}) = 0.1\theta_{100} + 0.1 \cdot 0.9\,\theta_{99} + 0.9^2(0.1\theta_{98} + 0.9 v_{97})$$
$$= 0.1\theta_{100} + 0.1 \cdot 0.9\,\theta_{99} + 0.1 \cdot 0.9^2\,\theta_{98} + 0.9^3 v_{97}$$

From this expansion we can see that $v_{100}$ is a sum of the observations, each multiplied by a weight, and the weights decay exponentially with time. Moreover, the exponentially weighted average is a local average: roughly $\frac{1}{1-\beta}$ observations are being averaged. In this example $\beta = 0.9$, so about 10 observations are referenced.
To make the average accurate in the early stages as well, a bias correction is needed. Here is why: assume $\beta = 0.98$, initialize $v_0 = 0$, and let $\theta_1 = 40$. Then
$$v_1 = \beta v_0 + (1-\beta)\theta_1 = 0.98 \cdot 0 + 0.02 \cdot 40 = 0.8$$
so the estimate on day 1 is far too low. Likewise $v_2 = 0.98 v_1 + 0.02\theta_2 = 0.98 \cdot 0.02\,\theta_1 + 0.02\theta_2$; since $\theta_1$ and $\theta_2$ are both positive, the computed $v_2$ is much smaller than $\theta_1$ and $\theta_2$.
To correct the inaccurate early estimates, use $\frac{v_t}{1-\beta^t}$ as the estimate for day $t$, where $t$ is the number of days. The corrected estimate for day 2 is:
$$\frac{v_2}{1 - 0.98^2} = \frac{v_2}{0.0396}$$
As $t$ increases, $\beta^t$ approaches 0, so when $t$ is large the bias correction has little effect.
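A minimal sketch of the exponentially weighted average with bias correction (the function name and signature are my own):

```python
def ewa(observations, beta=0.9, correct_bias=True):
    """Exponentially weighted averages v_t, optionally bias-corrected."""
    v, out = 0.0, []
    for t, theta in enumerate(observations, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta**t) if correct_bias else v)
    return out

print(ewa([40, 42, 41], beta=0.98))  # corrected values stay near the observations
```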

Momentum gradient descent method

The basic idea of momentum gradient descent is to compute an exponentially weighted average of the gradients and use that average to update the weights.
With mini-batch gradient descent, because of the randomness of sampling, each step does not move strictly toward the minimum, although the overall downward trend does point toward it.
Momentum gradient descent reduces the amplitude of these oscillations and improves the learning speed.
The expression of the momentum gradient descent method is as follows:
$$v_{dW} = \beta v_{dW} + (1-\beta)\,dW$$
$$v_{db} = \beta v_{db} + (1-\beta)\,db$$
$$W = W - \alpha v_{dW}; \quad b = b - \alpha v_{db}$$
The exponentially weighted average produces a local average of the gradients, which replaces the single-iteration gradient in the parameter update. The method borrows the concept of momentum from physics, picturing parameter optimization as pushing a ball downhill, with the gradient playing the role of the ball's acceleration. When successive gradients point in the same direction, learning accelerates; when they point in different directions, the oscillations are damped.
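A sketch of one momentum step for a parameter vector (a minimal illustration; `grad_fn` and the hyperparameter values are my own assumptions):

```python
def momentum_step(w, v, grad_fn, alpha=0.01, beta=0.9):
    """One momentum update: smooth the gradient, then step along the average."""
    g = grad_fn(w)
    v = beta * v + (1 - beta) * g  # exponentially weighted average of gradients
    w = w - alpha * v
    return w, v

# usage: w, v = momentum_step(w, v, grad_fn=lambda w: 2 * w)
```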

RMSprop

The momentum algorithm already alleviates the problem of large oscillations during parameter optimization, where the oscillation refers to how far the parameters swing from one update to the next; RMSprop reduces it further. In the figure below, blue is the route taken by the Momentum optimizer and green is the route taken by RMSprop.

*(figure: optimization paths; blue: Momentum, green: RMSprop)*
The expression of RMSprop is as follows:
$$S_{dW} = \beta S_{dW} + (1-\beta)(dW)^2$$
$$S_{db} = \beta S_{db} + (1-\beta)(db)^2$$
$$W = W - \alpha\frac{dW}{\sqrt{S_{dW}}}; \quad b = b - \alpha\frac{db}{\sqrt{S_{db}}}$$
RMSprop first uses the exponentially weighted average to obtain an average of the squared gradients, then divides the gradient by its square root when updating the weight parameters.
In other words, RMSprop computes an exponentially weighted average of the squared gradients. This helps suppress directions with large oscillations: when $dW$ or $db$ is relatively large, the update divides it by the square root of the accumulated squared gradients, so the update amplitude becomes small. On the one hand this corrects the oscillation, keeping the swing in each dimension small; on the other hand it lets the network converge faster.
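A sketch of one RMSprop step (minimal; the $\epsilon$ term, omitted in the formulas above, is the usual small constant added for numerical stability, and the names are my own):

```python
import numpy as np

def rmsprop_step(w, s, grad_fn, alpha=0.001, beta=0.9, eps=1e-8):
    """One RMSprop update: scale the step by the RMS of recent gradients."""
    g = grad_fn(w)
    s = beta * s + (1 - beta) * g**2  # weighted average of squared gradients
    w = w - alpha * g / (np.sqrt(s) + eps)
    return w, s
```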

Adam

The Adam optimization algorithm is essentially RMSprop combined with Momentum. The calculation proceeds as follows:

  1. Calculate the gradients $dW$, $db$;
  2. Compute the momentum exponentially weighted averages: $v_{dW} = \beta_1 v_{dW} + (1-\beta_1)dW$, $v_{db} = \beta_1 v_{db} + (1-\beta_1)db$;
  3. Apply the RMSprop update: $S_{dW} = \beta_2 S_{dW} + (1-\beta_2)(dW)^2$, $S_{db} = \beta_2 S_{db} + (1-\beta_2)(db)^2$;
  4. Apply bias correction: $v_{dW}^{corrected} = \frac{v_{dW}}{1-\beta_1^t}$, $v_{db}^{corrected} = \frac{v_{db}}{1-\beta_1^t}$, $S_{dW}^{corrected} = \frac{S_{dW}}{1-\beta_2^t}$, $S_{db}^{corrected} = \frac{S_{db}}{1-\beta_2^t}$;
  5. Update the weights: $W = W - \alpha\frac{v_{dW}^{corrected}}{\sqrt{S_{dW}^{corrected}}+\epsilon}$, $b = b - \alpha\frac{v_{db}^{corrected}}{\sqrt{S_{db}^{corrected}}+\epsilon}$.

The Adam algorithm combines RMSprop and Momentum and is an extremely commonly used optimizer across different neural networks.
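A sketch of one Adam step following the five steps above (a minimal single-parameter-vector version; the names and the common defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$ are my own choices):

```python
import numpy as np

def adam_step(w, v, s, t, grad_fn, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1)."""
    g = grad_fn(w)
    v = beta1 * v + (1 - beta1) * g       # momentum term
    s = beta2 * s + (1 - beta2) * g**2    # RMSprop term
    v_hat = v / (1 - beta1**t)            # bias correction
    s_hat = s / (1 - beta2**t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s
```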

