Error back propagation algorithm (BP algorithm) derivation

The back-propagation (Error Back Propagation, BP) algorithm applies to multilayer perceptrons (multilayer feedforward networks) with nonlinear, continuous transformation functions.

The basic idea of the BP algorithm is that learning consists of two processes: forward propagation of signals and back propagation of errors. In forward propagation, an input sample enters at the input layer, is processed layer by layer through the hidden layers, and reaches the output layer. If the actual output of the output layer does not match the expected output (the teacher signal), the algorithm switches to the error back-propagation phase. In back propagation, the output error is passed backward from the output layer through the hidden layers in some form, apportioning the error to all units of each layer; the resulting per-unit error signals are then used as the basis for correcting each unit's weights. Forward signal propagation and backward error propagation, with the accompanying weight adjustments in each layer, are carried out in cycles; this continual adjustment of the weights is the network's learning and training process. The process continues until the output error of the network is reduced to an acceptable level, or until a preset number of learning iterations is reached.
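The two alternating phases can be sketched as a training loop. The following Python skeleton is a minimal structural sketch of the process described above, not the concrete derivation that follows; `forward` and `backward_update` are hypothetical callables supplied by the caller.

```python
# Structural sketch of the BP training loop: forward pass, error check,
# backward pass with weight adjustment, repeated until the error is acceptable.
def train_bp(samples, targets, forward, backward_update, max_epochs=1000, tol=1e-3):
    for epoch in range(max_epochs):
        total_error = 0.0
        for x, d in zip(samples, targets):
            o = forward(x)                              # phase 1: signal propagates forward
            total_error += 0.5 * sum((dk - ok) ** 2 for dk, ok in zip(d, o))
            backward_update(x, d, o)                    # phase 2: error propagates backward,
                                                        # weights of each layer are adjusted
        if total_error <= tol:                          # stop at an acceptable error level
            break                                       # otherwise stop after max_epochs
    return total_error
```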

The multilayer perceptron trained with the BP algorithm is the most widely used neural network to date. In applications of the multilayer perceptron, the single-hidden-layer network shown in the figure below is the most common. A single-hidden-layer feedforward network is generally called a three-layer perceptron; the three layers are the input layer, the hidden layer, and the output layer.

(Figure: three-layer BP network)

x_{0}=-1 is the bias (threshold) input introduced for the hidden-layer neurons;
y_{0}=-1 is the bias (threshold) input introduced for the output-layer neurons;
the input vector is X=(x_{1},x_{2},...,x_{i},...,x_{n})^{T};
the hidden-layer output vector is Y=(y_{1},y_{2},...,y_{j},...,y_{m})^{T};
the output-layer output vector is O=(o_{1},...,o_{k},...,o_{l})^{T};
the desired output vector is d=(d_{1},d_{2},...,d_{k},...,d_{l})^{T};
the weight matrix from the input layer to the hidden layer is V=(V_{1},V_{2},...,V_{j},...,V_{m}), where the column vector V_{j} is the weight vector corresponding to the j-th hidden-layer neuron (i=0,1,2,...,n;\: j=1,2,...,m);
the weight matrix from the hidden layer to the output layer is W=(W_{1},W_{2},...,W_{k},...,W_{l}), where the column vector W_{k} is the weight vector corresponding to the k-th output-layer neuron (j=0,1,2,...,m;\: k=1,2,...,l).

Next, we analyze the mathematical relationships among these signals.
(Note: in the formulas below, blue indicates the output layer, magenta indicates the hidden layer, and red indicates the input layer.)

For the output layer, there are

                                     {\color{Blue} o_{k}=f(net_{k})\: \: \: \: \: \: \: \: \: k=1,2,...,l}                                                                          (1)
                                     {\color{Blue} net_{k}=\sum_{j=0}^{m}w_{jk}y_{j}\: \: \: \: \: \: \: \: \: k=1,2,...,l}                                                                   (2)

For the hidden layer, there are

                                     {\color{Magenta} y_{j}=f(net_{j})\: \: \: \: \: \: \: \: \: j=1,2,...,m}                                                                         (3)
                                     {\color{Magenta} net_{j}=\sum_{i=0}^{n}v_{ij}x_{i}\: \: \: \: \: \: \: \: \: j=1,2,...,m}                                                                   (4)

In both formulas above, the transformation function  f(x) is the unipolar Sigmoid function

                                    f(x)=\frac{1}{1+e^{-x}}                                                                                                     (5)

f(x) is continuous and differentiable, and satisfies

                                    f'(x)=f(x)[1-f(x)]                                                                                         (6)
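Property (6) is easy to confirm numerically. The following NumPy snippet (mine, not from the text) checks the identity f'(x)=f(x)[1-f(x)] against a finite-difference estimate.

```python
import numpy as np

def f(x):
    """Unipolar sigmoid, equation (5)."""
    return 1.0 / (1.0 + np.exp(-x))

def f_prime(x):
    """Derivative via the identity of equation (6): f'(x) = f(x) * (1 - f(x))."""
    return f(x) * (1.0 - f(x))

# Compare equation (6) with a central finite-difference approximation.
x = np.linspace(-5, 5, 11)
h = 1e-6
finite_diff = (f(x + h) - f(x - h)) / (2 * h)
assert np.allclose(f_prime(x), finite_diff, atol=1e-6)
```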

Equations (1) through (5) together constitute the mathematical model of the three-layer perceptron.
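To make the model concrete, here is a minimal NumPy sketch of the forward pass defined by equations (1) through (4), assuming the layout of V and W given above (row 0 of each matrix holds the threshold weights for the constant inputs x_0 = -1 and y_0 = -1); the dimensions and random initialization are illustrative only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, V, W):
    """Forward pass of the three-layer perceptron, equations (1)-(4).

    x : input vector of length n
    V : (n+1) x m weight matrix, row 0 holds the hidden-layer thresholds (x_0 = -1)
    W : (m+1) x l weight matrix, row 0 holds the output-layer thresholds (y_0 = -1)
    """
    x_aug = np.concatenate(([-1.0], x))        # prepend x_0 = -1
    net_j = x_aug @ V                          # equation (4)
    y = sigmoid(net_j)                         # equation (3)
    y_aug = np.concatenate(([-1.0], y))        # prepend y_0 = -1
    net_k = y_aug @ W                          # equation (2)
    o = sigmoid(net_k)                         # equation (1)
    return y, o

# Illustrative dimensions and random weights (not from the text).
n, m, l = 3, 4, 2
rng = np.random.default_rng(0)
V = rng.uniform(-0.5, 0.5, size=(n + 1, m))
W = rng.uniform(-0.5, 0.5, size=(m + 1, l))
y, o = forward(rng.uniform(size=n), V, W)
```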

When the network output is not equal to the expected output, there is an output error  E, which is defined as follows

                                    E=\frac{1}{2}(d-O)^{2}=\frac{1}{2}\sum_{k=1}^{l}(d_{k}-o_{k})^{2}                                                                     (7)
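As a quick sanity check of equation (7), with made-up vectors d and O:

```python
import numpy as np

d = np.array([1.0, 0.0])        # desired output (made-up values)
o = np.array([0.8, 0.3])        # actual network output (made-up values)
E = 0.5 * np.sum((d - o) ** 2)  # equation (7): 0.5 * (0.2**2 + 0.3**2) = 0.065
print(E)                        # 0.065
```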

Expanding this error definition back to the hidden layer gives

                                    {\color{Magenta} E=\frac{1}{2}\sum_{k=1}^{l}[d_{k}-f(net_{k})]^{2}=\frac{1}{2}\sum_{k=1}^{l}[d_{k}-f(\sum_{j=0}^{m}w_{jk}y_{j})]^{2}}                                   (8)

Expand further to the input layer

                                    {\color{Red} E=\frac{1}{2}\sum_{k=1}^{l}\left \{ d_{k}-f[\sum_{j=0}^{m}w_{jk}f(net_{j})] \right \}^{2}=\frac{1}{2}\sum_{k=1}^{l}\left \{ d_{k}-f[\sum_{j=0}^{m}w_{jk}f(\sum_{i=0}^{n}v_{ij}x_{i})] \right \}^{2}}     (9)

From the equations above, the network error E is a function of the weights  w_{jk}, v_{ij} of every layer, so E can be changed by adjusting the weights. The principle of adjustment is to reduce the error continually, so the adjustment should be proportional to the negative gradient of the error, i.e.,

                                  \Delta w_{jk}=-\eta \frac{\partial E}{\partial w_{jk}}\: \: \: \: \: \: \: \: j=0,1,2,...,m;\: \: k=1,2,...,l                                       (10 a)
                                  \Delta v_{ij}=-\eta \frac{\partial E}{\partial v_{ij}}\: \: \: \: \: \: \: \: i=0,1,2,...,n;\: \: j=1,2,...,m                                       (10 b)

In these formulas, the negative sign indicates gradient descent, and the constant  \eta \in (0,1) is the proportionality coefficient, which is the learning rate used during training.

For the output layer, formula (10 a) can be written as

                                   {\color{Blue} \Delta w_{jk}=-\eta \frac{\partial E}{\partial w_{jk}}=-\eta \frac{\partial E}{\partial net_{k}}\frac{\partial net_{k}}{\partial w_{jk}}}                                                                    (11 a)

For the hidden layer, equation (10 b) can be written as

                                    {\color{Magenta} \Delta v_{ij}=-\eta \frac{\partial E}{\partial v_{ij}}=-\eta \frac{\partial E}{\partial net_{j}}\frac{\partial net_{j}}{\partial v_{ij}}}                                                                       (11 b)

Define an error signal for the output layer and the hidden layer, and let

                                    \delta _{k}^{o}=-\frac{\partial E}{\partial net_{k}}
                                    \delta _{j}^{y}=-\frac{\partial E}{\partial net_{j}}

Combining equation (2) with equation (11 a), the weight adjustment in equation (10 a) can be rewritten as

                                   \Delta w_{jk}=\eta \delta _{k}^{o}y_{j}                                                                                                          (12 a)

Combining equation (3) with equation (11 b), the weight adjustment in equation (10 b) can be rewritten as

                                   \Delta v_{ij}=\eta \delta _{j}^{y}x_{i}                                                                                                           (12 b)

Next we derive how to compute  \delta _{k}^{o} and  \delta _{j}^{y}.

For the output layer,  \delta _{k}^{o} can be expanded as

                                   {\color{Blue} \delta _{k}^{o}=-\frac{\partial E}{\partial net_{k}}=-\frac{\partial E}{\partial o_{k}}\frac{\partial o_{k}}{\partial net_{k}}=-\frac{\partial E}{\partial o_{k}}f'(net_{k})}                                                        (13 a)

For the hidden layer,  \delta _{j}^{y} can be expanded as

                                   {\color{Magenta} \delta _{j}^{y}=-\frac{\partial E}{\partial net_{j}}=-\frac{\partial E}{\partial y_{j}}\frac{\partial y_{j}}{\partial net_{j}}=-\frac{\partial E}{\partial y_{j}}f'(net_{j})}                                                         (13 b)

Next we compute the partial derivatives of the network error with respect to each layer's output that appear in equation (13).

For the output layer, using equation (7), we can get

                                   {\color{Blue} \frac{\partial E}{\partial o_{k}}=-(d_{k}-o_{k})}                                                                                                    (14 a)

For the hidden layer, using equation (8), we can get

                                 {\color{Magenta} \frac{\partial E}{\partial y_{j}}=-\sum_{k=1}^{l}(d_{k}-o_{k})f'(net_{k})w_{jk}}                                                                             (14 b)

Substituting the above results into equation (13) and applying equation (6), we obtain

                               \delta _{k}^{o}=(d_{k}-o_{k})o_{k}(1-o_{k})                                                                                              (15 a)
                              \delta _{j}^{y}=[\sum_{k=1}^{l}(d_{k}-o_{k})f'(net_{k})w_{jk}]f'(net_{j})=(\sum_{k=1}^{l}\delta _{k}^{o}w_{jk})y_{j}(1-y_{j})                             (15 b)
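Equation (15) translates directly into code. The following NumPy sketch (variable names are mine) computes the two error-signal vectors from quantities a forward pass already provides; the threshold row of W is excluded when propagating the error back, since the constant input y_0 receives no error signal.

```python
import numpy as np

# d : desired output vector (length l)
# o : actual output vector (length l), o_k = f(net_k)
# y : hidden-layer output vector (length m), y_j = f(net_j)
# W : (m+1) x l hidden-to-output weights; row 0 corresponds to the threshold input y_0

def error_signals(d, o, y, W):
    delta_o = (d - o) * o * (1.0 - o)                 # equation (15 a)
    delta_y = (W[1:, :] @ delta_o) * y * (1.0 - y)    # equation (15 b), threshold row excluded
    return delta_o, delta_y
```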

This completes the derivation of the two error signals. Substituting equation (15) back into equation (12) gives the weight-adjustment formulas for the three-layer perceptron:

                               \left\{\begin{matrix} \Delta w_{jk}=\eta \delta _{k}^{o}y_{j}=\eta (d_{k}-o_{k})o_{k}(1-o_{k})y_{j}\\ \Delta v_{ij}=\eta \delta _{j}^{y}x_{i}=\eta (\sum_{k=1}^{l}\delta _{k}^{o}w_{jk})y_{j}(1-y_{j})x_{i} \end{matrix}\right.
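Putting everything together, here is a hedged, self-contained sketch of one complete BP update step for the three-layer perceptron, following equations (1) through (4) for the forward pass and the boxed formulas above for the weight adjustment; the network dimensions, learning rate, and toy data are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bp_step(x, d, V, W, eta=0.5):
    """One forward + backward pass, updating V and W in place per equations (12) and (15)."""
    x_aug = np.concatenate(([-1.0], x))               # x_0 = -1
    y = sigmoid(x_aug @ V)                            # equations (3)-(4)
    y_aug = np.concatenate(([-1.0], y))               # y_0 = -1
    o = sigmoid(y_aug @ W)                            # equations (1)-(2)

    delta_o = (d - o) * o * (1.0 - o)                 # equation (15 a)
    delta_y = (W[1:, :] @ delta_o) * y * (1.0 - y)    # equation (15 b)

    W += eta * np.outer(y_aug, delta_o)               # equation (12 a): dw_jk = eta * delta_o_k * y_j
    V += eta * np.outer(x_aug, delta_y)               # equation (12 b): dv_ij = eta * delta_y_j * x_i
    return 0.5 * np.sum((d - o) ** 2)                 # equation (7)

# Toy usage on XOR-like data (illustrative only): n=2 inputs, m=4 hidden units, l=1 output.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
V = rng.uniform(-0.5, 0.5, size=(3, 4))
W = rng.uniform(-0.5, 0.5, size=(5, 1))
for epoch in range(5000):
    err = sum(bp_step(x, d, V, W) for x, d in zip(X, D))
```

The weight matrices are updated in place each step, which mirrors the per-sample (online) adjustment described in the derivation above.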

Source: blog.csdn.net/IT_charge/article/details/108698051