The Backpropagation Principle and Its Derivation for a Deep Neural Network (DNN)

DNN backpropagation derivation


1. Unified notation

[Figure: the example network used to define the notation w, z, b, a, σ]

As shown in the figure above, let's unify the symbols first:

$w$: weight
$z$: weighted input (pre-activation) value
$b$: bias
$a$: activation value
$\sigma$: activation function

$w_{43}^{2}$ denotes the weight connecting the third neuron of the previous layer to the fourth neuron of the second layer
$b_{2}^{3}$ denotes the bias of the second neuron of the third layer
$a_{1}^{2}$ denotes the activation value of the first neuron of the second layer
$z_{1}^{2}$ denotes the weighted input of the first neuron of the second layer

So there is

$$z_2^3 = (a_1^2 w_{21}^3 + a_2^2 w_{22}^3 + a_3^2 w_{23}^3 + a_4^2 w_{24}^3) + b_2^3$$
$$a_2^3 = \sigma(z_2^3)$$
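As a quick numeric sketch of the two formulas above (all values below are made up for illustration, not taken from the figure, and $\sigma$ is assumed to be the sigmoid):

```python
import numpy as np

# Made-up numbers: a weighted sum of the previous layer's activations plus a bias,
# followed by the activation function.
a_prev = np.array([0.5, 0.1, 0.8, 0.3])   # a_1^2 ... a_4^2
w_row  = np.array([0.2, -0.4, 0.7, 0.1])  # w_21^3 ... w_24^3
b      = 0.05                             # b_2^3

z = np.dot(w_row, a_prev) + b             # z_2^3: weighted sum plus bias
a = 1.0 / (1.0 + np.exp(-z))              # a_2^3 = sigma(z_2^3), assuming sigmoid
print(z, a)
```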


2. Forward propagation

OK, now let's replace the concrete indices with general symbols:

$w_{jk}^{l}$ denotes the weight connecting the k-th neuron of layer $l-1$ to the j-th neuron of layer $l$
$b_{j}^{l}$ denotes the bias of the j-th neuron of layer $l$
$a_{j}^{l}$ denotes the activation value of the j-th neuron of layer $l$
$z_{j}^{l}$ denotes the weighted input of the j-th neuron of layer $l$

So we have:
$$z_j^l = \sum_k w_{jk}^l a_k^{l-1} + b_j^l$$
$$a_j^l = \sigma(z_j^l) = \sigma\Big(\sum_k w_{jk}^l a_k^{l-1} + b_j^l\Big)$$

In vector form:
$$a^l = \sigma(z^l) = \sigma(w^l a^{l-1} + b^l)$$
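A minimal NumPy sketch of this forward pass (an illustration of the formulas above, assuming sigmoid for $\sigma$; the function and variable names are hypothetical, not code from the original post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward propagation a^l = sigma(w^l a^{l-1} + b^l) for every layer.

    weights[l] is assumed to have shape (n_l, n_{l-1}) and biases[l] shape (n_l,).
    Returns all the z^l and a^l, which backpropagation will need later.
    """
    a = x
    zs, activations = [], [x]
    for w, b in zip(weights, biases):
        z = w @ a + b        # z^l = w^l a^{l-1} + b^l
        a = sigmoid(z)       # a^l = sigma(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations
```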


3. Loss function

Suppose we use the mean squared error (MSE) as the loss function:
$$C = \frac{1}{2n} \sum_x \| y(x) - a^L(x) \|^2$$

Here $a^L$ is the activation output of the last layer, $L$ denotes the last layer, and $y(x)$ is the ground truth when the input is $x$.

For a single training example, this can be written in vector form:
$$C = \frac{1}{2}\|y - a^L\|^2 = \frac{1}{2} \sum_j (y_j - a_j^L)^2$$

Our goal is to minimize $C$.
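As a one-function sketch (assuming `y` and `a_L` are NumPy arrays of the same shape), the single-example quadratic cost above is simply:

```python
import numpy as np

def quadratic_cost(y, a_L):
    # C = 1/2 * ||y - a^L||^2 for a single training example
    return 0.5 * np.sum((y - a_L) ** 2)
```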


4. Important note about BP

For $C$, the variables are not the input $x$ and the activations $a$, but the weights $w$. We perform gradient descent with the gradient of $C$ with respect to $w$ in order to change $w$, with one goal: to find a set of $w$ that makes the average $C$ over all input images (or data points) $x$ as small as possible. So what we really care about in this problem is the gradient of $C$ with respect to $w$:

$$\frac{\partial C}{\partial w_{jk}^l}$$
and the gradient of $C$ with respect to the bias:
$$\frac{\partial C}{\partial b_j^l}$$

Backpropagation (BP) is the counterpart of computing the derivatives in the forward direction. So why use BP to compute these two gradients?
The answer is that it is faster. To see why, let's first look at why computing the gradient directly is slow.

Let's look at a neural network with 2 hidden layers, and suppose we want to compute $\frac{\partial C}{\partial w_{11}^{2}}$.

[Figure: the example network with two hidden layers; the red arrows mark the paths through which $w_{11}^{2}$ affects $a_1^4$]

$$\textcolor{red}{\frac{\partial C}{\partial w_{11}^{2}} = \frac{\partial(loss(a_1^4)+loss(a_2^4))}{\partial w_{11}^{2}} = \frac{\partial(loss(a_1^4))}{\partial w_{11}^{2}} + \frac{\partial(loss(a_2^4))}{\partial w_{11}^{2}} = \frac{\partial C}{\partial a_1^4} \frac{\partial a_1^4}{\partial w_{11}^{2}} + \frac{\partial C}{\partial a_2^4} \frac{\partial a_2^4}{\partial w_{11}^{2}}}$$

Suppose we want to compute the contribution through $a_1^4$ to $C$: as shown by the red arrows in the figure, every neuron affected by $w_{11}^{2}$ enters the computation.

$\frac{\partial C}{\partial a_1^4}$ is easy to obtain.

So let's focus on $\frac{\partial a_1^4}{\partial w_{11}^{2}}$.
Since $a_1^4 = \sigma(z_1^4)$ and $z_1^4 = a_1^3 w_{11}^4 + a_2^3 w_{12}^4 + a_3^3 w_{13}^4 + a_4^3 w_{14}^4 + b_1^4$ (the bias does not depend on $w_{11}^{2}$, so it drops out of the derivatives below), we have:
$$\frac{\partial a_1^4}{\partial w_{11}^{2}} = \frac{\partial a_1^4}{\partial z_{1}^{4}} \left( \textcolor{red}{\frac{\partial a_1^3 w_{11}^4}{\partial w_{11}^{2}}} + \frac{\partial a_2^3 w_{12}^4}{\partial w_{11}^{2}} + \frac{\partial a_3^3 w_{13}^4}{\partial w_{11}^{2}} + \frac{\partial a_4^3 w_{14}^4}{\partial w_{11}^{2}} \right)$$
OK, let's keep expanding the red term in the formula above:
$$\frac{\partial a_1^3 w_{11}^4}{\partial w_{11}^{2}} = w_{11}^4 \frac{\partial a_1^3}{\partial z_1^3} \left( \frac{\partial a_1^2 w_{11}^3}{\partial w_{11}^{2}} + \frac{\partial a_2^2 w_{12}^3}{\partial w_{11}^{2}} + \frac{\partial a_3^2 w_{13}^3}{\partial w_{11}^{2}} \right)$$
At this point we notice that $\frac{\partial a_2^2 w_{12}^3}{\partial w_{11}^{2}}$ and $\frac{\partial a_3^2 w_{13}^3}{\partial w_{11}^{2}}$ do not depend on $w_{11}^{2}$, so both are 0.
Therefore,
$$\frac{\partial a_1^3 w_{11}^4}{\partial w_{11}^{2}} = w_{11}^4 \frac{\partial a_1^3}{\partial z_1^3} \frac{\partial a_1^2 w_{11}^3}{\partial w_{11}^{2}}$$
So in the same way,

$$\frac{\partial a_1^4}{\partial w_{11}^{2}} = \frac{\partial a_1^4}{\partial z_{1}^{4}} \left( \frac{\partial a_1^3 w_{11}^4}{\partial w_{11}^{2}} + \frac{\partial a_2^3 w_{12}^4}{\partial w_{11}^{2}} + \frac{\partial a_3^3 w_{13}^4}{\partial w_{11}^{2}} + \frac{\partial a_4^3 w_{14}^4}{\partial w_{11}^{2}} \right) \\ = \frac{\partial a_1^4}{\partial z_{1}^{4}} \left( w_{11}^4 \frac{\partial a_1^3}{\partial z_1^3} \frac{\partial a_1^2 w_{11}^3}{\partial w_{11}^{2}} + w_{12}^4 \frac{\partial a_2^3}{\partial z_2^3} \frac{\partial a_1^2 w_{21}^3}{\partial w_{11}^{2}} + w_{13}^4 \frac{\partial a_3^3}{\partial z_3^3} \frac{\partial a_1^2 w_{31}^3}{\partial w_{11}^{2}} + w_{14}^4 \frac{\partial a_4^3}{\partial z_4^3} \frac{\partial a_1^2 w_{41}^3}{\partial w_{11}^{2}} \right) \\ = \frac{\partial a_1^4}{\partial z_{1}^{4}} \left( w_{11}^4 \frac{\partial a_1^3}{\partial z_1^3} \, w_{11}^{3}\, \sigma'(z_1^2)\, x_1 + w_{12}^4 \frac{\partial a_2^3}{\partial z_2^3} \, w_{21}^{3}\, \sigma'(z_1^2)\, x_1 + w_{13}^4 \frac{\partial a_3^3}{\partial z_3^3} \, w_{31}^{3}\, \sigma'(z_1^2)\, x_1 + w_{14}^4 \frac{\partial a_4^3}{\partial z_4^3} \, w_{41}^{3}\, \sigma'(z_1^2)\, x_1 \right) \\ = \textcolor{red}{A}$$
(Here $\frac{\partial a_1^2 w_{k1}^3}{\partial w_{11}^{2}} = w_{k1}^3\, \sigma'(z_1^2)\, x_1$, because $a_1^2 = \sigma(z_1^2)$ and $\frac{\partial z_1^2}{\partial w_{11}^{2}} = x_1$.)

$$\frac{\partial a_2^4}{\partial w_{11}^{2}} = \frac{\partial a_2^4}{\partial z_{2}^{4}} \left( w_{21}^4 \frac{\partial a_1^3}{\partial z_1^3}\, w_{11}^{3}\, \sigma'(z_1^2)\, x_1 + w_{22}^4 \frac{\partial a_2^3}{\partial z_2^3}\, w_{21}^{3}\, \sigma'(z_1^2)\, x_1 + w_{23}^4 \frac{\partial a_3^3}{\partial z_3^3}\, w_{31}^{3}\, \sigma'(z_1^2)\, x_1 + w_{24}^4 \frac{\partial a_4^3}{\partial z_4^3}\, w_{41}^{3}\, \sigma'(z_1^2)\, x_1 \right) \\ = \textcolor{red}{B}$$

Finally,
$$\frac{\partial C}{\partial w_{11}^{2}} = \frac{\partial C}{\partial a_1^4} \frac{\partial a_1^4}{\partial w_{11}^{2}} + \frac{\partial C}{\partial a_2^4} \frac{\partial a_2^4}{\partial w_{11}^{2}} = \frac{\partial C}{\partial a_1^4} \textcolor{red}{A} + \frac{\partial C}{\partial a_2^4} \textcolor{red}{B}$$

At this point, the problem is basically clear.

We can see that in a direct forward calculation, every $\frac{\partial C}{\partial w_{jk}^l}$ requires computing the local partial derivatives along every path from $w_{jk}^l$ to $C$. The number of connections in a fully connected deep neural network can easily run into the millions or even billions, so evaluating an enormous number of paths for every single gradient is extremely time-consuming. As the derivation above shows, many of the calculations inside $\textcolor{red}{A}$ and $\textcolor{red}{B}$ are repeated. Is there a way to compute these repeated pieces once, in advance, and then obtain the gradient simply from the relationship between adjacent layers, without searching over all the relevant paths?

Yes, there is: that is BP. Of course, BP is just a mathematical technique for computing partial derivatives; it is a means, not an end.

With the gradients in hand, we only need a learning rate $\eta$: updating every weight as $w_{jk}^l - \eta\, \frac{\partial C}{\partial w_{jk}^l}$ gives a new set of weights. (The gradient is the direction of fastest increase of the function, hence the minus sign, which gives the direction of fastest decrease.) For a small enough learning rate this new set of weights decreases $C$, and so on, until every $\frac{\partial C}{\partial w_{jk}^l}$ approaches 0. In theory the network has then reached a local minimum; the weights change more and more slowly until they stop changing, and the learning of the network is complete.
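As a sketch of this update rule (hypothetical names; `grad_w` and `grad_b` are assumed to hold the gradients that the BP procedure derived in the next section produces):

```python
def gradient_descent_step(weights, biases, grad_w, grad_b, eta=0.1):
    """One update w <- w - eta * dC/dw, b <- b - eta * dC/db for every layer.

    grad_w[l] / grad_b[l] are the gradients for layer l; eta is the learning rate.
    """
    for l in range(len(weights)):
        weights[l] -= eta * grad_w[l]
        biases[l]  -= eta * grad_b[l]
    return weights, biases
```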


5. BP derivation

1) We define:
$$\delta_j^l = \frac{\partial C}{\partial z_j^l}$$
**This quantity is called the BP error; the essence of BP lies in it.**

2) The error of the last layer:
$$\delta_j^L = \frac{\partial C}{\partial z_j^L}$$
Since:
$$a_j^L = \sigma(z_j^L)$$

by the chain rule:
$$\delta_j^L = \frac{\partial C}{\partial a_j^L} \frac{\partial a_j^L}{\partial z_j^L} = \frac{\partial C}{\partial a_j^L} \sigma'(z_j^L) \quad \textbf{(1)}$$
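As a concrete instance (the post does not fix a particular activation, so this is just a common choice): if $\sigma$ is the sigmoid, then
$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$$
which is what the code sketch at the end of this post uses for $\sigma'$.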

3) Since
$$\delta_j^l = \frac{\partial C}{\partial z_j^l} = \sum_k \frac{\partial C}{\partial z_k^{l+1}} \frac{\partial z_k^{l+1}}{\partial z_j^l}$$
(The formula above deserves careful thought, especially why the $\sum_k$ is needed: it collects every path through which $z_j^l$ influences $C$, namely through all the neurons $k$ of layer $l+1$.) And since
$$\delta_k^{l+1} = \frac{\partial C}{\partial z_k^{l+1}}$$
then:
$$\delta_j^l = \sum_k \frac{\partial z_k^{l+1}}{\partial z_j^l}\, \delta_k^{l+1} \quad \textbf{(2)}$$
Since $z_k^{l+1} = \sum_m w_{km}^{l+1}\, \sigma(z_m^l) + b_k^{l+1}$, we have $\frac{\partial z_k^{l+1}}{\partial z_j^l} = w_{kj}^{l+1}\, \sigma'(z_j^l)$, so (2) can also be written as $\delta_j^l = \sigma'(z_j^l) \sum_k w_{kj}^{l+1}\, \delta_k^{l+1}$.
At this point, the relationship between the error of a layer and the error of the next layer has surfaced.

4) Next, let's look at what we care about most: the gradient of $C$ with respect to $w$, $\frac{\partial C}{\partial w_{jk}^l}$, and the gradient of $C$ with respect to the bias, $\frac{\partial C}{\partial b_j^l}$.

Since
$$z_j^l = \sum_k w_{jk}^l a_k^{l-1} + b_j^l$$

$$\frac{\partial C}{\partial w_{jk}^l} = \frac{\partial C}{\partial z_j^l}\frac{\partial z_j^l}{\partial w_{jk}^{l}} = \delta_j^l\, a_k^{l-1} \quad \textbf{(3)}$$

$$\frac{\partial C}{\partial b_j^l} = \frac{\partial C}{\partial z_j^l}\frac{\partial z_j^l}{\partial b_{j}^{l}} = \delta_j^l \cdot 1 = \delta_j^l \quad \textbf{(4)}$$

So far we have proved the four BP formulas $(1)(2)(3)(4)$.
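For reference (a standard vectorized restatement, not spelled out in the post), the four formulas can also be written layer-wise, where $\odot$ is the elementwise (Hadamard) product and $\nabla_a C$ is the vector of $\frac{\partial C}{\partial a_j^L}$:
$$\delta^L = \nabla_a C \odot \sigma'(z^L), \qquad \delta^l = \big((w^{l+1})^T \delta^{l+1}\big) \odot \sigma'(z^l), \qquad \frac{\partial C}{\partial w^l} = \delta^l (a^{l-1})^T, \qquad \frac{\partial C}{\partial b^l} = \delta^l$$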


6. Summary

When we have:

$$\delta_j^L = \frac{\partial C}{\partial a_j^L} \sigma'(z_j^L) \quad \textbf{(1)}$$

$$\delta_j^l = \sum_k \frac{\partial z_k^{l+1}}{\partial z_j^l}\, \delta_k^{l+1} \quad \textbf{(2)}$$

$$\frac{\partial C}{\partial w_{jk}^l} = \delta_j^l\, a_k^{l-1} \quad \textbf{(3)}$$

$$\frac{\partial C}{\partial b_j^l} = \delta_j^l \cdot 1 = \delta_j^l \quad \textbf{(4)}$$

and since the loss function is:
$$C = \frac{1}{2}\|y - a^L\|^2 = \frac{1}{2} \sum_j (y_j - a_j^L)^2$$
we have
$$\frac{\partial C}{\partial a_j^L} = (a_j^L - y_j)$$
We find that every quantity on the right-hand side of these equations can be computed. The most remarkable point is that each gradient depends only on the error and the output of the previous layer, so there is no need to walk all the paths; moreover, within a single gradient-descent step the $\delta$ of each neuron only needs to be computed once, and once stored it can be reused when computing the other nodes, which reduces the complexity enormously. At this point we fully understand both why BP is used and how BP is derived.
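To tie formulas (1) through (4) together, here is a minimal, self-contained NumPy sketch of one forward and backward pass for a single training example, assuming sigmoid activations and the quadratic cost above. It is an illustration in the spirit of this derivation, not the post's original code, and all names and the example network sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # sigma'(z) = sigma(z) * (1 - sigma(z))

def backprop(x, y, weights, biases):
    """One forward + backward pass for a single example, following (1)-(4).

    weights[l] has shape (n_l, n_{l-1}); biases[l] has shape (n_l,).
    Returns (grad_w, grad_b) with the same shapes as (weights, biases).
    """
    # Forward pass: store every z^l and a^l, since BP reuses them.
    a = x
    activations = [x]
    zs = []
    for w, b in zip(weights, biases):
        z = w @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)

    grad_w = [np.zeros_like(w) for w in weights]
    grad_b = [np.zeros_like(b) for b in biases]

    # (1): delta^L = dC/da^L * sigma'(z^L), with dC/da^L = a^L - y for the quadratic cost.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    # (3) and (4) for the last layer.
    grad_w[-1] = np.outer(delta, activations[-2])
    grad_b[-1] = delta

    # (2): propagate the error backwards, layer by layer, reusing each delta once computed.
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_w[-l] = np.outer(delta, activations[-l - 1])   # (3)
        grad_b[-l] = delta                                  # (4)

    return grad_w, grad_b

# Hypothetical usage: a 3-4-4-2 network with random parameters.
rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]
grad_w, grad_b = backprop(rng.standard_normal(3), rng.standard_normal(2), weights, biases)
```

Each $\delta$ is computed exactly once per layer and immediately reused for the weight and bias gradients of that layer, which is precisely the saving over the path-by-path calculation of section 4.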


Origin blog.csdn.net/catscanner/article/details/110003102