Neural Networks: Fundamentals of Deep Learning

1. Concept and simple derivation of backpropagation algorithm (BP)

The backpropagation (BP) algorithm is commonly used together with an optimization method (such as gradient descent) to train artificial neural networks. BP computes the gradient of the loss function with respect to every weight in the network and feeds those gradients to the optimizer, which updates the weights to minimize the loss. The algorithm first computes (and caches) each node's output in a forward pass, then computes the partial derivative of the loss with respect to each parameter in a backward pass through the graph.

Next, we derive the logic of the BP algorithm using as an example a fully connected neural network with sigmoid activations and Softmax + MSE as the loss function. Due to space limitations, only a brief derivation is given here. Rocky will write a dedicated article on the complete derivation of the BP algorithm in the future, so stay tuned.

First, let's look at the expression of the sigmoid activation function and its derivatives:

Sigmoid expression: $\sigma(x) = \frac{1}{1+e^{-x}}$

Sigmoid derivative: $\frac{d}{dx}\sigma(x) = \sigma(x) - \sigma(x)^2 = \sigma(1-\sigma)$

It can be seen that the derivative of the sigmoid activation function can ultimately be expressed as a simple operation on the output value.
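This property is easy to verify numerically; a minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(output):
    """Derivative expressed purely via the cached forward output: sigma * (1 - sigma)."""
    return output * (1.0 - output)

# The derivative at any x can be computed from the forward output alone:
s = sigmoid(0.0)
print(s)                 # 0.5
print(sigmoid_grad(s))   # 0.25, the analytic derivative of sigmoid at x = 0
```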

Let’s look at the expression of the MSE loss function and its derivatives:

MSE loss function: $L = \frac{1}{2}\sum_{k=1}^{K}(y_k - o_k)^2$

where $y_k$ denotes the ground-truth (gt) value and $o_k$ denotes the network output.

Partial derivative of the MSE loss: $\frac{\partial L}{\partial o_i} = (o_i - y_i)$

The partial derivative simplifies because only the $k = i$ term of the sum contributes to $\frac{\partial L}{\partial o_i}$.
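A minimal NumPy sketch of this loss and its gradient (the sample values are illustrative):

```python
import numpy as np

def mse_loss(y, o):
    """L = 1/2 * sum_k (y_k - o_k)^2."""
    return 0.5 * np.sum((y - o) ** 2)

def mse_grad(y, o):
    """dL/do_i = o_i - y_i (only the k = i term of the sum survives)."""
    return o - y

y = np.array([1.0, 0.0])   # ground truth
o = np.array([0.8, 0.3])   # network output
print(mse_loss(y, o))      # 0.5 * (0.04 + 0.09) = 0.065
print(mse_grad(y, o))      # [-0.2  0.3]
```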

Next, let’s look at the gradient output by the fully connected layer:

MSE loss function: $L = \frac{1}{2}\sum_{i=1}^{K}(o_i^1 - t_i)^2$

Partial derivative of the MSE loss: $\frac{\partial L}{\partial w_{jk}} = (o_k - t_k)o_k(1-o_k)x_j$

Writing $\delta_k = (o_k - t_k)o_k(1-o_k)$, this can be simplified further:

Partial derivative of the MSE loss: $\frac{\partial L}{\partial w_{jk}} = \delta_k x_j$

Finally, let's look at the partial derivatives at each layer in the BP algorithm:

Output layer:
$\frac{\partial L}{\partial w_{jk}} = \delta_k^K o_j$
$\delta_k^K = (o_k - t_k)o_k(1-o_k)$

Penultimate layer:
$\frac{\partial L}{\partial w_{ij}} = \delta_j^J o_i$
$\delta_j^J = o_j(1 - o_j) \sum_{k}\delta_k^K w_{jk}$

Third-to-last layer:
$\frac{\partial L}{\partial w_{ni}} = \delta_i^I o_n$
$\delta_i^I = o_i(1 - o_i) \sum_{j}\delta_j^J w_{ij}$

By propagating the deltas backwards layer by layer in this way, and then iteratively optimizing the network parameters with gradient descent, the BP algorithm is complete.
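The delta rules above can be sanity-checked numerically. Below is a minimal NumPy sketch of backprop on a tiny 2-3-2 fully connected network with sigmoid activations and MSE loss (the shapes, seed, and data are illustrative); the analytic gradient is compared against a central-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Tiny 2-3-2 network: sigmoid everywhere, MSE loss.
W1 = rng.normal(size=(2, 3))
W2 = rng.normal(size=(3, 2))
x = np.array([0.5, -0.2])    # input
t = np.array([1.0, 0.0])     # target

def forward(W1, W2):
    h = sigmoid(x @ W1)      # hidden activations o_j
    o = sigmoid(h @ W2)      # output activations o_k
    return h, o

def loss(W1, W2):
    _, o = forward(W1, W2)
    return 0.5 * np.sum((o - t) ** 2)

# Backprop exactly as in the delta rules above.
h, o = forward(W1, W2)
delta_k = (o - t) * o * (1 - o)          # output-layer delta
dW2 = np.outer(h, delta_k)               # dL/dw_jk = delta_k * o_j
delta_j = h * (1 - h) * (W2 @ delta_k)   # hidden-layer delta
dW1 = np.outer(x, delta_j)               # dL/dw_ij = delta_j * o_i

# Central-difference check of one entry of dW1.
eps = 1e-6
Wp = W1.copy(); Wp[0, 0] += eps
Wm = W1.copy(); Wm[0, 0] -= eps
num = (loss(Wp, W2) - loss(Wm, W2)) / (2 * eps)
print(abs(num - dW1[0, 0]) < 1e-7)  # True
```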

2. Related concepts of moving average

The moving average, also called the exponentially weighted moving average (EMA), can be used to estimate the local mean of a variable, so that updates to the variable depend on its history over a recent window.

Let $v_t$ denote the value of the variable $v$ at time $t$, and let $\theta_t$ be the value of $v$ obtained from training at time $t$. Without the moving-average model, $v_t = \theta_t$; with the moving-average model, $v_t$ is updated as follows:

$v_t = \beta v_{t-1} + (1-\beta)\theta_t \quad (1)$

In the formula above, $\beta \in [0,1)$; $\beta = 0$ is equivalent to not using a moving average.

At time $t$, the moving average of $v$ is roughly the average of the past $1/(1-\beta)$ values of $\theta$. Bias correction divides $v_t$ by $(1 - \beta^t)$ to correct the estimate of the mean.

After adding bias correction, the update formula for $v_{biased_t}$ is as follows:

$v_{biased_t} = \frac{v_t}{1 - \beta^t} \quad (2)$

As $t$ grows, $1 - \beta^t$ approaches 1, and the results of formulas (1) and (2) ($v_t$ and $v_{biased_t}$) become closer and closer.

The larger $\beta$ is, the more the moving average reflects historical values of $\theta$. If $\beta = 0.9$, the result is roughly the average of the past 10 values of $\theta$; if $\beta = 0.99$, it is roughly the average of the past 100 values of $\theta$.
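The update and bias-correction formulas above can be sketched in a few lines of Python (the function name is illustrative). On a constant sequence, the bias-corrected estimate recovers the true mean from the very first step, whereas the raw $v_t$ would start near zero:

```python
def ema(values, beta):
    """Exponentially weighted moving average with bias correction."""
    v = 0.0
    out = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta   # formula (1)
        out.append(v / (1 - beta ** t))     # formula (2): bias-corrected estimate
    return out

data = [1.0, 1.0, 1.0, 1.0]
print(ema(data, beta=0.9))  # stays at 1.0: correction removes the zero-init bias
```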

The figure below represents the results of calculating weights in different ways:

As shown in the figure above, the moving average can be seen as the average of the variable over a recent window. Compared with assigning the variable directly, the moving-averaged value is smoother, jitters less, and is not thrown off by occasional abnormal values.

The advantage of the moving average: it uses little memory. There is no need to store the past 10 or 100 values of $\theta$ to estimate its mean. The moving average is not as accurate as averaging all saved historical values, but the latter costs more memory and more computation.

Why is the moving average used during testing?

The moving average can make the model more robust on test data.

When training a neural network with stochastic gradient descent, using a moving average of the weights can, in many applications, improve the final model's performance on test data to some extent.

During training, the network's weights are tracked with a moving average, and at test time the moving-averaged weights are used in place of the raw weights; this tends to perform better on test data. The reason is that the moving-averaged weights change more smoothly, and for stochastic gradient descent a smoother update means the weights do not stray far from the optimum. For example, with decay = 0.999, an intuitive view is that during the last 1,000 training steps the model is already trained and merely jittering; the moving average effectively averages those last 1,000 jitters, making the weights more robust.
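A minimal sketch of maintaining shadow (moving-averaged) weights during training, in the spirit of utilities such as tf.train.ExponentialMovingAverage (the class name and dict-of-floats representation here are illustrative):

```python
class ShadowWeights:
    """Keep a moving average of model weights for use at test time.

    After each optimizer step, call update() with the current weights;
    at evaluation time, read from .shadow instead of the raw weights.
    """

    def __init__(self, weights, decay=0.999):
        self.decay = decay
        # Initialize the shadow copy from the current weights.
        self.shadow = {name: float(value) for name, value in weights.items()}

    def update(self, weights):
        # shadow = decay * shadow + (1 - decay) * current, per weight.
        for name, value in weights.items():
            self.shadow[name] = (self.decay * self.shadow[name]
                                 + (1 - self.decay) * value)

# Toy usage: decay = 0.5 so the effect is visible after one step.
ema = ShadowWeights({"w": 1.0}, decay=0.5)
ema.update({"w": 3.0})
print(ema.shadow["w"])  # 0.5 * 1.0 + 0.5 * 3.0 = 2.0
```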


Origin blog.csdn.net/weixin_51390582/article/details/135172910