A derivation of the back-propagation algorithm for artificial neural networks

I searched the Internet for this and the results were unsatisfactory, but I finally found a very detailed proof, which I have transcribed as the LaTeX excerpt below.

(Original: https://blog.csdn.net/weixin_41718085/article/details/79381863 )

\documentclass{article}
\usepackage{xeCJK}
\usepackage{amsmath}
\setCJKmainfont{Noto Serif CJK SC}

\title{The Back-Propagation Algorithm for Artificial Neural Networks: \\ An Application of the Chain Rule in Artificial Intelligence}

\begin{document}
\maketitle

\section{Background}

\subsection{Artificial Neuron}

An artificial neuron is a unit of computation: it multiplies each of its inputs by a weight, sums the weighted inputs, adds an offset, and applies a nonlinear operation to obtain its output. In mathematical language,

\[
x^{(j+1)} = \sigma\left(b + \sum_i w_i x_i^{(j)}\right)
\]

where \(j\) is the index of the layer in which the neuron sits, \(x_i^{(j)}\) is the input coming from the \(i\)-th neuron of layer \(j\), \(w_i\) is the weight the current neuron assigns to the input from neuron \(i\), \(b\) is an offset value, and \(\sigma\) is called the activation function, typically a nonlinear function. Classical activation functions include the sigmoid, ReLU, and arctangent functions.
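For concreteness (the numbers here are my own illustration, not from the original): with weights \(w = (1, -2)\), offset \(b = 0.5\), inputs \(x^{(j)} = (2, 1)\), and the sigmoid activation \(\sigma(z) = 1/(1 + e^{-z})\), the neuron outputs

\[
x^{(j+1)} = \sigma\bigl(0.5 + 1 \cdot 2 + (-2) \cdot 1\bigr) = \sigma(0.5) \approx 0.62.
\]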

\subsection{Artificial Neural Network}

An artificial neural network (Artificial Neural Network, abbreviated
ANN) consists of several layers of artificial neurons. In the model considered here, each layer of neurons receives its input from the previous layer, performs its computation, and passes the result to the next layer. In fact, if the parameters of all neurons in the network are collectively denoted by a matrix
\(A\), the entire neural network may be regarded as a function \(f(A, x)\), where \(x\)
is the input of the entire network.

In applications, we often have a set of inputs \(x\) and corresponding desired outputs
\(y\). Training the neural network is the process of finding a suitable \(A_0\) such that
\(f(A_0, x) = y\). To make this problem tractable, we introduce a loss function \(J(y, y^*)\)
measuring how well \(f(A, x)\) is trained, where \(y^*\)
is the output of the neural network. A typical loss function is the squared Euclidean distance, i.e.
\(J_0(y, y^*) = (y - y^*)^T (y - y^*)\). The problem is thus converted into: find
\(A_0\) such that \(J(y, f(A_0, x))\) attains its minimum.

Since \(y\) is a constant for a given training datum, for readability we may write
\(J(y, f(A, x)) = h(A, x)\), so that the question we need to solve becomes:

find the parameters \(A\) such that the function \(h(A, x)\) attains its minimum.

For this class of problems, as we learn in calculus, it would suffice to compute
\({\partial h(A, x)}/{\partial A}\) and set it to zero.
In practice, however, differentiating \(h\)
in closed form is all but impossible, so instead we use an iterative method: in each iteration we evaluate the derivative of \(h\)
at the current point \(A_0\),
adjust \(A\) by an amount determined by the sign and magnitude of that derivative value,
and proceed to the next iteration. This method is called gradient descent. It has a notable drawback: it can only find a local minimum (which is obvious). But considering that for the usual neural network a local minimum is already good enough, we can accept this shortcoming. Even so, gradient descent requires the derivative values, and the back-propagation algorithm (Back
Propagation Algorithm, often abbreviated BP algorithm) described in this article is one way to obtain them.
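In symbols, one gradient-descent iteration can be written as follows (the step size \(\eta\), a small positive constant usually called the learning rate, is standard but is not named in the original text):

\[
A \leftarrow A - \eta \, \frac{\partial h(A, x)}{\partial A}.
\]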

\section{Derivation}

\subsection{Notation}

For convenience of the derivation, the symbols are redefined here; the derivation below uses only the conventions of this section, not those used above.

Sign conventions:
\begin{itemize}
\item the weight on the connection from the \(k\)-th neuron of layer \(l\) to the \(j\)-th neuron of layer \(l+1\) is denoted \(w_{jk}^{(l)}\); the matrix of all weights of one layer is denoted \(W^{(l)}\); all weights collectively are denoted \(W\);
\item the offset used in computing the \(j\)-th neuron of layer \(l+1\) is denoted \(b_j^{(l)}\); the vector of all offsets of one layer is denoted \(b^{(l)}\); all offsets collectively are denoted \(b\);
\item the input value of the \(j\)-th neuron of layer \(l\) is denoted \(z_j^{(l)}\); the vector of all input values of one layer is denoted \(z^{(l)}\);
\item the number of neurons in layer \(l\) is denoted \(s_l\); the network has \(n\) layers in total;
\item the output value of the \(j\)-th neuron of layer \(l\) is denoted \(x_j^{(l)}\); the vector of all output values of one layer is denoted \(x^{(l)}\);
\item the activation function of layer \(l\) is denoted \(\sigma_l\) (generally all neurons of a layer share the same activation function);
\item the loss function is denoted \(J(W, b; y, y_0)\), where \(y_0\) is the true value and \(y\) is the output of the network (within one iteration \(y\) and \(y_0\) are constants, so these two arguments are omitted below);
\item \({\partial J(W, b)}/{\partial z_j^{(l)}}\) is denoted \(\delta_j^{(l)}\); the vector of all \(\delta\) of one layer is denoted \(\delta^{(l)}\);
\item the training set is \(T = \{(x_0, y_0), (x_1, y_1), \ldots, (x_m, y_m)\}\), with \(|T| = m\);
\item \(*\) applied between matrices or vectors denotes element-wise multiplication.
\end{itemize}
By definition, clearly:

\[
z_j^{(l+1)} = \left( \sum_{k=1}^{s_l} w_{jk}^{(l)} x_k^{(l)} \right) + b_j^{(l)} \tag{a}
\]

\[
x_j^{(l)} = \sigma_l\left(z_j^{(l)}\right) \tag{b}
\]

\subsection{Objective}

We give an algorithm that computes, for any neural network, \(\partial J / \partial w_{jk}^{(l)}\) and \(\partial J / \partial b_j^{(l)}\). For reasons of space this article carries out only the former computation; the proof for the latter is omitted, as it proceeds identically.

\subsection{Proof}

By the chain rule we know:

\[
\frac{\partial J(W, b)}{\partial w_{jk}^{(l)}} = \frac{\partial J(W, b)}{\partial x_j^{(l+1)}} \frac{\partial x_j^{(l+1)}}{\partial z_j^{(l+1)}} \frac{\partial z_j^{(l+1)}}{\partial w_{jk}^{(l)}} \tag{0}
\]

The derivative thus splits into three parts, which are solved one by one below.

\subsubsection{Part One}

For the first part we have

\[
\begin{aligned}
\frac{\partial J(W, b)}{\partial x_j^{(l+1)}}
&= \sum_{i=1}^{s_{l+2}} \frac{\partial J}{\partial z_i^{(l+2)}} \frac{\partial z_i^{(l+2)}}{\partial x_j^{(l+1)}} \\
&\overset{(a)}{=} \sum_{i=1}^{s_{l+2}} \frac{\partial J}{\partial z_i^{(l+2)}} \frac{\partial}{\partial x_j^{(l+1)}} \left( b_i^{(l+1)} + \sum_{k=1}^{s_{l+1}} w_{ik}^{(l+1)} x_k^{(l+1)} \right) \\
&= \sum_{i=1}^{s_{l+2}} \delta_i^{(l+2)} w_{ij}^{(l+1)}
\end{aligned} \tag{1}
\]

\subsubsection{Part Two}

For the second part we have

\[
\begin{aligned}
  \frac{\partial x_j^{(l+1)}}{\partial z_j^{(l+1)}}
  &\overset{(b)}{=} \frac{\partial}{\partial z_j^{(l+1)}} \sigma_{l+1}\left(z_j^{(l+1)}\right) \\
  &= \sigma_{l+1}'\left(z_j^{(l+1)}\right)
\end{aligned}
\]

For an activation function \(\sigma_l\), if there exists a function \(f_l\) such that
\[
f_l(\sigma_l(x)) = \sigma_l'(x),
\]

then

\[
\begin{aligned}
  \frac{\partial x_j^{(l+1)}}{\partial z_j^{(l+1)}} = f_{l+1}\left(x_j^{(l+1)}\right)
\end{aligned} \tag{2}
\]
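Such an \(f_l\) exists for the common activations (these two instances are my own, not from the original): for the sigmoid, \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\); for the hyperbolic tangent, \(\tanh'(x) = 1 - \tanh^2(x)\). Hence

\[
f_l(y) = y(1 - y) \quad \text{(sigmoid)}, \qquad f_l(y) = 1 - y^2 \quad \text{(tanh)}.
\]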

\subsubsection{Part Three}

\[
\begin{aligned}
  \frac{\partial z_j^{(l+1)}}{\partial w_{jk}^{(l)}}
  &\overset{(a)}{=} \frac{\partial}{\partial w_{jk}^{(l)}} \left( b_j^{(l)} + \sum_{k'=1}^{s_l} w_{jk'}^{(l)} x_{k'}^{(l)} \right) \\
  &= x_k^{(l)}
\end{aligned} \tag{3}
\]

\subsubsection{Putting the Parts Together}

Combining \((0)\), \((1)\), \((2)\), and \((3)\), we have
\[
\begin{aligned}
  \frac{\partial J(W, b)}{\partial w_{jk}^{(l)}} = \left( \sum_{i=1}^{s_{l+2}} \delta_i^{(l+2)} w_{ij}^{(l+1)} \right) f_{l+1}\left(x_j^{(l+1)}\right) x_k^{(l)}
\end{aligned} \tag{4}
\]

and because
\[
\delta_j^{(l+1)} = \frac{\partial J(W, b)}{\partial z_j^{(l+1)}} = \frac{\partial J(W, b)}{\partial x_j^{(l+1)}} \frac{\partial x_j^{(l+1)}}{\partial z_j^{(l+1)}} \overset{(1),(2)}{=} \left( \sum_{i=1}^{s_{l+2}} \delta_i^{(l+2)} w_{ij}^{(l+1)} \right) f_{l+1}\left(x_j^{(l+1)}\right), \tag{5}
\]
equation \((4)\) simplifies to \(\partial J(W, b) / \partial w_{jk}^{(l)} = \delta_j^{(l+1)} x_k^{(l)}\).

where \(0 < j \le s_{l+1}\) and \(0 < l < n\).

\subsubsection{Vector Form}

Written in vector form, these become

\[
\frac{\partial J(W, b)}{\partial W^{(l)}} = \delta^{(l+1)} \left( x^{(l)} \right)^T \tag{4*}
\]

\[
\delta^{(l)} = \left( W^{(l)} \right)^T \delta^{(l+1)} * f_l\left(x^{(l)}\right) \tag{5*}
\]
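As a dimension check (my own remark, not in the original): \(\delta^{(l+1)}\) is an \(s_{l+1} \times 1\) column vector and \((x^{(l)})^T\) is \(1 \times s_l\), so \((4^*)\) produces an \(s_{l+1} \times s_l\) matrix, exactly the shape of \(W^{(l)}\); and in \((5^*)\), \((W^{(l)})^T \delta^{(l+1)}\) is \(s_l \times 1\), matching the \(s_l\) entries of \(f_l(x^{(l)})\):

\[
\underbrace{\frac{\partial J}{\partial W^{(l)}}}_{s_{l+1} \times s_l} = \underbrace{\delta^{(l+1)}}_{s_{l+1} \times 1} \underbrace{\left( x^{(l)} \right)^T}_{1 \times s_l}.
\]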

\subsubsection{Boundary Condition}

Above we obtained two recursion formulas. Clearly, formula \((5^*)\) cannot be applied when \(l = n\), since the index \(l + 1\) would exceed the number of layers; the starting value \(\delta^{(n)}\) therefore requires a separate boundary condition.

\[
\begin{aligned}
  \delta_j^{(n)}
  &= \frac{\partial J(W, b; y, y_0)}{\partial z_j^{(n)}} \\
  &\overset{y = x^{(n)}}{=} \frac{\partial J(W, b; x^{(n)}, y_0)}{\partial x_j^{(n)}} \frac{\partial x_j^{(n)}}{\partial z_j^{(n)}} \\
  &\overset{(2)}{=} \frac{\partial J(W, b; x^{(n)}, y_0)}{\partial x_j^{(n)}} f_n\left(x_j^{(n)}\right)
\end{aligned} \tag{6}
\]

The concrete form of \(\partial J / \partial x_j^{(n)}\) depends on the loss function, so it cannot be given in general; but the remaining computation is evident, and omitting it here does no harm.
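As one instance (my own, for illustration): with the squared Euclidean loss \(J_0 = (x^{(n)} - y_0)^T (x^{(n)} - y_0)\), we get \(\partial J_0 / \partial x_j^{(n)} = 2 (x_j^{(n)} - y_{0,j})\), so the boundary condition reads

\[
\delta^{(n)} = 2\left(x^{(n)} - y_0\right) * f_n\left(x^{(n)}\right).
\]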

\section{Conclusion}

With one boundary condition and two recursion formulas:
\[
\begin{cases}
  \delta_j^{(n)} = \dfrac{\partial J(W, b; x^{(n)}, y_0)}{\partial x_j^{(n)}} f_n\left(x_j^{(n)}\right) \\[2ex]
  \dfrac{\partial J(W, b)}{\partial W^{(l)}} = \delta^{(l+1)} \left( x^{(l)} \right)^T \\[2ex]
  \delta^{(l)} = \left( W^{(l)} \right)^T \delta^{(l+1)} * f_l\left(x^{(l)}\right)
\end{cases} \tag{Conclusion}
\]
the derivation of back-propagation for a neural network is complete.

\end{document}
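To sanity-check the conclusion formulas numerically, here is a small NumPy sketch (all variable and function names are my own, not from the derivation) that applies \((6)\), \((4^*)\), and \((5^*)\) to a sigmoid network with squared Euclidean loss and compares the result against a finite-difference derivative:

```python
# Numerical check of the conclusion: boundary condition (6) plus
# recursions (4*) and (5*), for a sigmoid network with squared loss.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A three-layer network: W[l] maps the outputs of layer l to layer l+1.
sizes = [4, 5, 3]
W = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(2)]
b = [rng.standard_normal(sizes[l + 1]) for l in range(2)]
x_in = rng.standard_normal(sizes[0])   # network input
y0 = rng.standard_normal(sizes[-1])    # desired output

def forward(W, b):
    xs = [x_in]
    for Wl, bl in zip(W, b):
        xs.append(sigmoid(Wl @ xs[-1] + bl))   # equations (a) and (b)
    return xs

def loss(W, b):
    y = forward(W, b)[-1]
    return (y - y0) @ (y - y0)                 # squared Euclidean loss

# Back-propagation; for the sigmoid, f(y) = y * (1 - y).
xs = forward(W, b)
delta = 2.0 * (xs[-1] - y0) * xs[-1] * (1.0 - xs[-1])   # boundary (6)
grads = [None, None]
for l in (1, 0):
    grads[l] = np.outer(delta, xs[l])                   # (4*)
    delta = (W[l].T @ delta) * xs[l] * (1.0 - xs[l])    # (5*)

# Central finite difference on one entry of W[0].
eps = 1e-5
Wp = [w.copy() for w in W]
Wm = [w.copy() for w in W]
Wp[0][1, 2] += eps
Wm[0][1, 2] -= eps
num = (loss(Wp, b) - loss(Wm, b)) / (2 * eps)
err = abs(num - grads[0][1, 2])
assert err < 1e-6  # analytic gradient matches numeric gradient
```

The per-layer gradient matrices produced by `np.outer` have exactly the shapes of the corresponding weight matrices, as the vector form \((4^*)\) requires.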


Origin www.cnblogs.com/rabbull/p/10995697.html