How to understand the backpropagation algorithm in neural networks?

Author: Zhihu User
Link: https://www.zhihu.com/question/24827633/answer/91489990
Source: Zhihu
The copyright belongs to the author. For commercial reprints, please contact the author for authorization, and for non-commercial reprints, please indicate the source.

Generally, the chain rule is used to explain it. Consider a small neural network like the following, with inputs i_1, i_2, hidden nodes h_1, h_2, and output nodes o_1, o_2:
  • forward propagation
For the node h_1, the net input net_{h_1} is:
net_{h_1}=w_1\times i_1+w_2\times i_2+b_1\times 1
Then a sigmoid function is applied to net_{h_1} to get the output of node h_1:
out_{h_1}=\frac{1}{1+e^{-net_{h_1}}}
Similarly, we can get the outputs out_{h_2}, out_{o_1}, out_{o_2} of the nodes h_2, o_1, o_2.
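To make the forward pass concrete, here is a minimal Python sketch of the small network described above. The names w_1..w_8, b_1, b_2 follow the formulas in this answer (w_3, w_4, w_8 are assumed by symmetry, since they do not appear explicitly), and the numeric values are placeholders only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Placeholder inputs and parameters; w3, w4, w8 are assumed by symmetry.
i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30   # input  -> hidden weights
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55   # hidden -> output weights
b1, b2 = 0.35, 0.60                       # hidden / output biases

# Hidden layer: net input, then sigmoid activation.
net_h1 = w1 * i1 + w2 * i2 + b1 * 1
net_h2 = w3 * i1 + w4 * i2 + b1 * 1
out_h1, out_h2 = sigmoid(net_h1), sigmoid(net_h2)

# Output layer, computed the same way from the hidden outputs.
net_o1 = w5 * out_h1 + w6 * out_h2 + b2 * 1
net_o2 = w7 * out_h1 + w8 * out_h2 + b2 * 1
out_o1, out_o2 = sigmoid(net_o1), sigmoid(net_o2)
```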

  • error
After the result is obtained, the output error of the entire neural network can be expressed as:
E_{total}=\sum\frac{1}{2}(target-output)^2
where output is what was just calculated through forward propagation (out_{o_1}, out_{o_2}); target is the target value of the nodes o_1, o_2; and E_{total} measures the error between the two. This can also be considered a cost function, except that the regularization term (\sum{w_i^2}) used to prevent overfitting is omitted here. Written out:

E_{total}=E_{o_1}+E_{o_2}=\frac{1}{2}(target_{o_1}-out_{o_1})^2+\frac{1}{2}(target_{o_2}-out_{o_2})^2
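A one-function sketch of this error, assuming the outputs come from a forward pass like the one above and the target values are placeholders:

```python
def total_error(targets, outputs):
    # E_total = sum over output nodes of 1/2 * (target - output)^2
    return sum(0.5 * (t - o) ** 2 for t, o in zip(targets, outputs))

# Usage with the forward-pass sketch above (target values are placeholders):
# E_total = total_error([0.01, 0.99], [out_o1, out_o2])
```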

  • backward propagation
For the output layer, take w_5 as an example.
To adjust w_5 by gradient descent, we need \frac{\partial {E_{total}}}{\partial {w_5}}, which by the chain rule is:
\frac{\partial {E_{total}}}{\partial {w_5}}=\frac{\partial {E_{total}}}{\partial {out_{o_1}}}\frac{\partial {out_{o_1}}}{\partial {net_{o_1}}}\frac{\partial {net_{o_1}}}{\partial {w_5}}
as shown in the following figure. The three factors are:
\frac{\partial {E_{total}}}{\partial {out_{o_1}}}=\frac{\partial}{\partial {out_{o_1}}}(\frac{1}{2}(target_{o_1}-out_{o_1})^2+\frac{1}{2}(target_{o_2}-out_{o_2})^2)=-(target_{o_1}-out_{o_1})
\frac{\partial {out_{o_1}}}{\partial {net_{o_1}}}=\frac{\partial }{\partial {net_{o_1}}}\frac{1}{1+e^{-net_{o_1}}}=out_{o_1}(1-out_{o_1})
\frac{\partial {net_{o_1}}}{\partial {w_5}}=\frac{\partial}{\partial {w_5}}(w_5\times out_{h_1}+w_6\times out_{h_2}+b_2\times 1)=out_{h_1}
Multiplying these three factors gives the gradient \frac{\partial {E_{total}}}{\partial {w_5}}, which can then be used for training:
w_5^+=w_5-\eta\frac{\partial {E_{total}}}{\partial {w_5}}
Many textbooks, such as the Stanford course, write the intermediate result as
\delta_{o_1}=\frac{\partial {E_{total}}}{\partial {net_{o_1}}}=\frac{\partial {E_{total}}}{\partial {out_{o_1}}}\frac{\partial {out_{o_1}}}{\partial {net_{o_1}}}
which indicates how much responsibility node o_1 bears for the final error. With this notation,
\frac{\partial {E_{total}}}{\partial {w_5}}=\delta_{o_1}\times out_{h_1}
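As a sketch of the output-layer update just derived (the learning rate \eta=0.5 and the target value in the usage comment are placeholder choices, not from the original):

```python
def output_delta(target, out_o):
    # delta_o = dE_total/dnet_o = -(target - out_o) * out_o * (1 - out_o)
    # for a sigmoid output node with squared error.
    return -(target - out_o) * out_o * (1.0 - out_o)

def updated_output_weight(w, out_h, delta_o, eta=0.5):
    # w+ = w - eta * dE_total/dw, where dE_total/dw = delta_o * out_h
    return w - eta * delta_o * out_h

# Usage with the forward-pass sketch above (placeholder target and eta):
# delta_o1 = output_delta(0.01, out_o1)
# w5_new = updated_output_weight(w5, out_h1, delta_o1)
```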



For the hidden layer, take w_1 as an example.
To adjust w_1 by gradient descent, we need \frac{\partial {E_{total}}}{\partial {w_1}}, which by the chain rule is:
\frac{\partial {E_{total}}}{\partial {w_1}}=\frac{\partial {E_{total}}}{\partial {out_{h_1}}}\frac{\partial {out_{h_1}}}{\partial {net_{h_1}}}\frac{\partial {net_{h_1}}}{\partial {w_1}}
as shown in the following figure: the parameter w_1 affects net_{h_1}, which in turn affects out_{h_1}, which in turn affects E_{o_1} and E_{o_2}.
Solve for each part:
\frac{\partial {E_{total}}}{\partial {out_{h_1}}}=\frac{\partial {E_{o_1}}}{\partial {out_{h_1}}}+\frac{\partial {E_{o_2}}}{\partial {out_{h_1}}}
where, using the \delta_{o_1} computed previously,
\frac{\partial {E_{o_1}}}{\partial {out_{h_1}}}=\frac{\partial {E_{o_1}}}{\partial {net_{o_1}}}\times \frac{\partial {net_{o_1}}}{\partial {out_{h_1}}}=\delta_{o_1}\times \frac{\partial}{\partial {out_{h_1}}}(w_5\times out_{h_1}+w_6\times out_{h_2}+b_2\times 1)=\delta_{o_1}w_5
The calculation of \frac{\partial {E_{o_2}}}{\partial {out_{h_1}}} is similar, so we get
\frac{\partial {E_{total}}}{\partial {out_{h_1}}}=\delta_{o_1}w_5+\delta_{o_2}w_7
The other two terms in the chain for \frac{\partial {E_{total}}}{\partial {w_1}} are:
\frac{\partial {out_{h_1}}}{\partial {net_{h_1}}}=out_{h_1}(1-out_{h_1})
\frac{\partial {net_{h_1}}}{\partial {w_1}}=\frac{\partial }{\partial {w_1}}(w_1\times i_1+w_2\times i_2+b_1\times 1)=i_1
Multiplying them gives the gradient:
\frac{\partial {E_{total}}}{\partial {w_1}}=\frac{\partial {E_{total}}}{\partial {out_{h_1}}}\frac{\partial {out_{h_1}}}{\partial {net_{h_1}}}\frac{\partial {net_{h_1}}}{\partial {w_1}}=(\delta_{o_1}w_5+\delta_{o_2}w_7)\times out_{h_1}(1-out_{h_1})\times i_1
which can then be used to update w_1:
w_1^+=w_1-\eta\frac{\partial {E_{total}}}{\partial {w_1}}
As before, we can also define
\delta_{h_1}=\frac{\partial {E_{total}}}{\partial {out_{h_1}}}\frac{\partial {out_{h_1}}}{\partial {net_{h_1}}}=(\delta_{o_1}w_5+\delta_{o_2}w_7)\times out_{h_1}(1-out_{h_1})=(\sum_o \delta_o w_{ho})\times out_{h_1}(1-out_{h_1})
so the entire gradient can be written as
\frac{\partial {E_{total}}}{\partial {w_1}}=\delta_{h_1}\times i_1
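A corresponding sketch for the hidden-layer update, using the \delta_{h} form just derived (again, \eta=0.5 is a placeholder learning rate):

```python
def hidden_delta(out_h, downstream_deltas, downstream_weights):
    # delta_h = (sum over output nodes of delta_o * w_ho) * out_h * (1 - out_h)
    back = sum(d * w for d, w in zip(downstream_deltas, downstream_weights))
    return back * out_h * (1.0 - out_h)

def updated_hidden_weight(w, inp, delta_h, eta=0.5):
    # w+ = w - eta * dE_total/dw, where dE_total/dw = delta_h * input
    return w - eta * delta_h * inp

# Usage: h1 feeds o1 via w5 and o2 via w7, so
# delta_h1 = hidden_delta(out_h1, [delta_o1, delta_o2], [w5, w7])
# w1_new  = updated_hidden_weight(w1, i1, delta_h1)
```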

========================
The \delta above is where the third-step calculation in the Unsupervised Feature Learning and Deep Learning Tutorial comes from.

 

The so-called backward propagation actually means: "if the final output turns out to be off, every node has to take its share of the responsibility." The amount each node is responsible for is denoted by \delta, and the amount a hidden node is responsible for is obtained from the amounts the output nodes are responsible for, propagated back layer by layer toward the front of the network.

References:
[1] A Step by Step Backpropagation Example
[2] Unsupervised Feature Learning and Deep Learning Tutorial
