Forward calculation and backpropagation principle of fully connected neural network

Derivation of the weight update principle of the fully connected neural network (forward calculation + backpropagation)

In this article, we first introduce the basic mathematical principle behind the weight updates of a fully connected neural network: the gradient descent algorithm. We then introduce the weight update principle of the single-layer perceptron (a single-output fully connected neural network), and build on it to further explain the weight update principle of a fully connected neural network.

1. Gradient descent algorithm

The gradient descent algorithm is an algorithm for finding the minimum value of a function. It will be used later to minimize the objective function during the weight update (the objective function itself is explained later).

1.1 Gradient

The first thing to understand is: what is a gradient?
The gradient is a vector that points in the direction along which the function increases the fastest. Its components are the partial derivatives of the function with respect to each of its variables. For example, for the function:

  • $f(x, y) = x^2 + y^2$

its gradient is:
$$\mathrm{gradient} = \left(\frac{\partial f(x, y)}{\partial x}, \frac{\partial f(x, y)}{\partial y}\right) = (2x, 2y)$$
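As a quick numerical check of this result (a sketch with illustrative values, not part of the original article), the analytic gradient $(2x, 2y)$ can be compared against a finite-difference approximation at an arbitrary point:

```python
def f(x, y):
    return x**2 + y**2

def analytic_gradient(x, y):
    # (df/dx, df/dy) = (2x, 2y)
    return 2 * x, 2 * y

def numeric_gradient(x, y, eps=1e-6):
    # central finite differences as an independent check
    dfdx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    dfdy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return dfdx, dfdy

print(analytic_gradient(3.0, -2.0))   # (6.0, -4.0)
print(numeric_gradient(3.0, -2.0))    # approximately (6.0, -4.0)
```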

1.2 Gradient Descent Algorithm

Since the direction of the gradient is the direction in which the function increases the fastest, the opposite direction of the gradient is the direction in which the function value decreases the fastest. Therefore, we can move the independent variables in the direction opposite to the gradient, so that the function value gradually approaches its minimum. (Strictly speaking, the direction does not have to be exactly opposite to the gradient; any direction along which the function value decreases will do.)

Then, taking the function to be optimized $f(w, b) = w^2 x + b$ as an example (where $x$ is treated as a fixed constant), the basic steps of the gradient descent algorithm are:

  1. Initialize the values $w_0$ and $b_0$.
  2. Update $w$ and $b$ according to $w_1 = w_0 - \alpha_1 \cdot 2 w_0 x$ and $b_1 = b_0 - \alpha_2 \cdot 1$ ($\alpha$ is the learning rate, e.g. 0.001).
  3. Keep updating until convergence (a minimal code sketch of this loop is given below).
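As a minimal sketch of these three steps (shown here on the earlier example $f(x, y) = x^2 + y^2$, whose gradient is $(2x, 2y)$ and whose minimum lies at the origin; the variable names and learning rate are illustrative choices, not values from the original article):

```python
# Gradient descent on f(x, y) = x^2 + y^2
x, y = 4.0, -3.0            # step 1: initialize the independent variables
alpha = 0.1                 # learning rate

for step in range(200):     # step 3: keep updating until (approximate) convergence
    grad_x, grad_y = 2 * x, 2 * y                    # gradient of f at the current point
    x, y = x - alpha * grad_x, y - alpha * grad_y    # step 2: move against the gradient

print(x, y)  # both values approach 0, the minimizer of f
```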

2. Weight update principle of single output perceptron

First, define the naming conventions for the variables, as shown in the figure:
[Figure: variable naming conventions]
$w$ denotes a weight; its subscript gives the indices of the two connected nodes in their respective layers, and its superscript gives the index of the layer that the connection feeds into.
$x$ denotes an input value; its subscript gives the node index within its layer, and its superscript gives the index of that layer.
$O$ denotes a node's value after the activation function; as with $x$, its subscript gives the node index within its layer and its superscript gives the index of that layer.

The structure of a single-output perceptron is shown in the figure below:
[Figure: structure of the single-output perceptron]
The mathematical model of the single-output perceptron is:

$$x^1 = x_0 w_0 + x_1 w_1 + x_2 w_2 + \dots + x_n w_n = \sum_j x_j w_j$$
$$O_1 = \mathrm{sigmoid}(x^1)$$
$$E = \frac{1}{2}(O_1 - t_1)^2$$

Here $x^1$ is the weighted sum (pre-activation) of the output node; sigmoid is the activation function (a differentiable activation function must be chosen, otherwise the gradient cannot be computed later); $E$ is the error function, which is the quantity to be minimized; and $t_1$ is the true value (label).
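A minimal sketch of this forward calculation (the function and variable names are illustrative, not from the original article):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w):
    """Forward pass of the single-output perceptron: x and w are length-(n+1) vectors."""
    z = np.dot(x, w)        # weighted sum  x_0*w_0 + ... + x_n*w_n
    return sigmoid(z)       # O_1 = sigmoid(z)

def error(o1, t1):
    return 0.5 * (o1 - t1) ** 2   # E = 1/2 * (O_1 - t_1)^2
```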
For the single-output perceptron, our goal is: given an input sample (with features $x_0$ through $x_n$), the model should output a value $O_1$ that is accurate (close to the true value). To achieve this, the weights $w_0, w_1, w_2, \dots, w_n$ need to be updated. Once the error function converges to its minimum, the update is complete; the values of $w_0, w_1, w_2, \dots, w_n$ are then fixed and used to predict new data.
Let us now work through the update process of $w_0, w_1, w_2, \dots, w_n$ (the detailed mathematical derivation is skipped for now; we go straight to the result):
The partial derivative of $E$ with respect to $w_j$ is:
$$\frac{\partial E}{\partial w_j} = (O_1 - t_1)\,O_1 (1 - O_1)\, x_j$$
Therefore, according to the gradient descent formula, the update rule for $w_j$ is:
$$w_j \leftarrow w_j - \alpha \frac{\partial E}{\partial w_j} = w_j - \alpha\,(O_1 - t_1)\,O_1 (1 - O_1)\, x_j$$
where the $w_j$ on the right-hand side is its current value (at the first update, its initialization value).
That is, for each input sample $X$ (a sample consists of the features $x_0, x_1, \dots, x_n$), the weights $w_0, w_1, w_2, \dots, w_n$ are updated once. Therefore, to keep updating $w_0, w_1, w_2, \dots, w_n$, enough training samples are needed.
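A sketch of this per-sample update rule on a toy training set (all names, shapes, and the learning rate are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, t1, w, alpha=0.1):
    """One forward calculation + one weight update for a single sample (x, t1)."""
    o1 = sigmoid(np.dot(x, w))                 # forward: O_1 = sigmoid(sum_j x_j * w_j)
    grad = (o1 - t1) * o1 * (1.0 - o1) * x     # dE/dw_j = (O_1 - t_1) O_1 (1 - O_1) x_j, for all j
    return w - alpha * grad                    # gradient descent step

# toy usage: three 4-feature samples with labels in {0, 1}
rng = np.random.default_rng(0)
w = rng.normal(size=4)
samples = [(rng.normal(size=4), 1.0), (rng.normal(size=4), 0.0), (rng.normal(size=4), 1.0)]
for epoch in range(100):
    for x, t1 in samples:
        w = train_step(x, t1, w)
```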

3. Weight update principle of multi-output perceptron

The schematic diagram of the multi-output perceptron is as follows:
[Figure: structure of the multi-output perceptron]
Similar to the single-output perceptron, the mathematical model of the multi-output perceptron is:

$$x_k^1 = x_0^0 w_{0k}^1 + x_1^0 w_{1k}^1 + x_2^0 w_{2k}^1 + \dots + x_n^0 w_{nk}^1 = \sum_j x_j^0 w_{jk}^1$$
$$O_k^1 = \mathrm{sigmoid}(x_k^1)$$
$$E = \displaystyle\sum_{k=1}^{m} \frac{1}{2}\,(O_k^1 - t_k)^2$$

For the multi-output perceptron, the objective function is still $E$, and the independent variables of $E$ are all of the weights $w_{jk}^1$. For any weight $w_{jk}^1$, the gradient derivation is therefore similar to the single-output case.

For any weight $w_{jk}^1$, after a sample is input and a forward calculation is performed, $w_{jk}^1$ is updated once:
$$w_{jk}^1 \leftarrow w_{jk}^1 - \alpha \frac{\partial E}{\partial w_{jk}^1} = w_{jk}^1 - \alpha\,(O_k^1 - t_k)\,O_k^1 (1 - O_k^1)\, x_j^0$$
where the $w_{jk}^1$ on the right-hand side is its current value (at the first update, its initialization value).
As with the single-output perceptron, each time a sample is fed into the multi-output perceptron, one forward calculation and one backpropagation update every weight $w_{jk}^1$ once. When enough samples have been input and the weights have been updated enough times, the predictive performance of the multi-output perceptron becomes strong enough (given an input sample, it outputs a prediction whose error with respect to the true value is sufficiently small).
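A sketch of this multi-output update, where `W` is an $(n{+}1)\times m$ matrix with `W[j, k]` holding $w_{jk}^1$ (the names, shapes, and learning rate are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, t, W, alpha=0.1):
    """One forward calculation + weight update; x has shape (n+1,), t and the output have shape (m,)."""
    o = sigmoid(x @ W)                   # forward: O_k^1 = sigmoid(sum_j x_j^0 * w_jk^1)
    delta = (o - t) * o * (1.0 - o)      # (O_k^1 - t_k) O_k^1 (1 - O_k^1), one value per output node
    grad = np.outer(x, delta)            # grad[j, k] = dE/dw_jk^1 = delta_k * x_j^0
    return W - alpha * grad

# toy usage: 4 input features, 3 output nodes
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
x, t = rng.normal(size=4), np.array([1.0, 0.0, 1.0])
W = train_step(x, t, W)
```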
Next, we can introduce the fully connected neural network model and its weight update principle by adding a hidden layer.

4. Weight update principle of fully connected neural network

The structure of the fully connected neural network is shown in the figure below:
[Figure: structure of the fully connected neural network (one hidden layer)]
Between the input layer and the output layer there are several hidden layers (for convenience of analysis, only one hidden layer is drawn here). The layers from input to output are denoted layer I, layer J, and layer K respectively.
The forward calculation is the same as for the multi-output perceptron, so it is not repeated here; this section mainly derives the gradients used in the weight update:
[Figure: derivation of the output-layer and hidden-layer gradients]
According to the derivation, for an output-layer node $k \in K$:
$$\frac{\partial E}{\partial w_{jk}^K} = (O_k^K - t_k)\,O_k^K (1 - O_k^K)\, O_j^J = \delta_k^K\, O_j^J$$
where: $\delta_k^K = (O_k^K - t_k)\,O_k^K (1 - O_k^K)$

For a hidden-layer node $j \in J$:
$$\frac{\partial E}{\partial w_{ij}^J} = O_j^J (1 - O_j^J)\, O_i^I \sum_{k \in K} (O_k^K - t_k)\,O_k^K (1 - O_k^K)\, w_{jk}^K = \delta_j^J\, O_i^I$$
where: $\delta_j^J = O_j^J (1 - O_j^J) \sum_{k \in K} (O_k^K - t_k)\,O_k^K (1 - O_k^K)\, w_{jk}^K = O_j^J (1 - O_j^J) \sum_{k \in K} \delta_k^K\, w_{jk}^K$

For an output-layer node $k \in K$, the term $(O_k^K - t_k)\,O_k^K (1 - O_k^K)$ depends only on layer K and what comes after it, so it is defined as $\delta_k^K$. Similarly, for a hidden-layer node $j \in J$, the term $O_j^J (1 - O_j^J) \sum_{k \in K} \delta_k^K\, w_{jk}^K$ depends only on layer J and the layers after it, so it is defined as $\delta_j^J$.
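A sketch of these two definitions in plain loop form, for a network with layers I → J → K and sigmoid activations (the weight-matrix layout, shapes, and names are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# illustrative shapes: 4 input nodes (layer I), 5 hidden nodes (layer J), 3 output nodes (layer K)
rng = np.random.default_rng(0)
O_I = rng.normal(size=4)          # activations of layer I (the inputs)
W_J = rng.normal(size=(4, 5))     # W_J[i, j] = w_ij^J
W_K = rng.normal(size=(5, 3))     # W_K[j, k] = w_jk^K
t = np.array([1.0, 0.0, 1.0])     # labels t_k

# forward calculation
O_J = sigmoid(O_I @ W_J)          # O_j^J
O_K = sigmoid(O_J @ W_K)          # O_k^K

# delta_k^K = (O_k^K - t_k) * O_k^K * (1 - O_k^K)
delta_K = [(O_K[k] - t[k]) * O_K[k] * (1 - O_K[k]) for k in range(3)]

# delta_j^J = O_j^J * (1 - O_j^J) * sum_k delta_k^K * w_jk^K
delta_J = [O_J[j] * (1 - O_J[j]) * sum(delta_K[k] * W_K[j, k] for k in range(3))
           for j in range(5)]

# per-weight gradients:  dE/dw_jk^K = delta_k^K * O_j^J,   dE/dw_ij^J = delta_j^J * O_i^I
grad_W_K = np.array([[delta_K[k] * O_J[j] for k in range(3)] for j in range(5)])
grad_W_J = np.array([[delta_J[j] * O_I[i] for j in range(5)] for i in range(4)])
```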

We can also see that, when computing the gradients, $\delta_k^K$ must be computed first in order to compute $\delta_j^J$. Likewise, as the number of layers grows, the gradient terms can only be computed from back to front. This is why this weight update process is called backpropagation.

In actual programming, the following definitions make the gradient convenient to compute (taking layer K as an example):
$$\delta^K = \begin{bmatrix} \delta_0^K \\ \delta_1^K \\ \delta_2^K \\ \vdots \\ \delta_m^K \end{bmatrix}, \qquad O^J = \begin{bmatrix} O_0^J & O_1^J & O_2^J & \dots & O_n^J \end{bmatrix}$$
Then $\frac{\partial E}{\partial w^K}$ can be computed as:
$$\frac{\partial E}{\partial w^K} = \delta^K O^J$$
i.e. the entry in row $k$ and column $j$ of this matrix is $\delta_k^K\, O_j^J = \frac{\partial E}{\partial w_{jk}^K}$.
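A sketch of this vectorized computation, under the same illustrative shapes and names as the previous sketch (weights are stored here as `W[j, k]`, so the outer-product gradients $\delta^K O^J$ and $\delta^J O^I$ are transposed before the update):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
O_I = rng.normal(size=4)                  # layer I activations (the inputs)
W_J = rng.normal(size=(4, 5))             # weights into layer J
W_K = rng.normal(size=(5, 3))             # weights into layer K
t = np.array([1.0, 0.0, 1.0])             # labels
alpha = 0.1                               # learning rate

# forward calculation
O_J = sigmoid(O_I @ W_J)
O_K = sigmoid(O_J @ W_K)

# backpropagation: the delta vectors are computed from back to front
delta_K = (O_K - t) * O_K * (1 - O_K)         # delta_k^K for all k at once
delta_J = O_J * (1 - O_J) * (W_K @ delta_K)   # delta_j^J for all j at once

# dE/dw^K = delta^K O^J as an outer product; entry (k, j) equals delta_k^K * O_j^J
grad_K = np.outer(delta_K, O_J)
grad_J = np.outer(delta_J, O_I)               # same pattern one layer down

# gradient descent update (transposed to match the W[j, k] storage layout used here)
W_K -= alpha * grad_K.T
W_J -= alpha * grad_J.T
```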

Origin blog.csdn.net/weixin_41670608/article/details/115059295