Neural Network Basics Study Notes (4): The Error Backpropagation Method

Table of Contents

Error Backpropagation
Foreword
5.1 Computational Graphs
5.1.1 Solving Problems with Computational Graphs
5.2 Chain Rule
5.2.1 Backpropagation in a Computational Graph
5.2.2 What Is the Chain Rule?
5.2.3 The Chain Rule and Computational Graphs
5.3 Backpropagation
5.3.1 Backpropagation of the Addition Node
5.3.2 Backpropagation of the Multiplication Node
5.3.3 The Apple Example
5.4 Implementation of Simple Layers
5.4.1 Implementation of the Multiplication Layer
5.4.2 Implementation of the Addition Layer
5.5 Implementation of the Activation Function Layers
5.5.1 ReLU Layer
5.5.2 Sigmoid Layer
5.6 Implementation of the Affine/Softmax Layer
5.6.2 Batch Version of the Affine Layer
5.6.3 Softmax-with-Loss Layer
5.7 Implementation of the Error Backpropagation Method
5.7.1 Overall Picture of Neural Network Learning
5.7.2 Implementation of a Neural Network Using the Error Backpropagation Method
5.7.3 Gradient Check for the Error Backpropagation Method
5.7.4 Learning Using the Error Backpropagation Method
5.8 Summary


Error backpropagation

Foreword:

Although numerical differentiation is simple and easy to implement, its disadvantage is that it takes a lot of time to compute. In this chapter, we learn a method that can efficiently calculate the gradients of the weight parameters: the error backpropagation method.

There are two ways to correctly understand the error backpropagation method: one is based on mathematical formulas; the other is based on computational graphs.

5.1 Computational Graphs

5.1.1 Solving problems with computational graphs

Nodes are represented by circles (○), and the content of the calculation is written inside the circle.

It can also be expressed as:

5.2 Chain Rule

5.2.1 Backpropagation of computational graph

5.2.2 What is the chain rule

Take a composite function from calculus as the starting point:

z = (x + y)², which can be decomposed into z = t² and t = x + y.

The rule:

If a function is represented as a composite function, its derivative can be expressed as the product of the derivatives of the functions that make up the composition.
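As a short worked example of this rule (my own addition, using t = x + y as the intermediate variable), for z = (x + y)² the chain rule gives:

```latex
z = t^2, \qquad t = x + y
\frac{\partial z}{\partial x}
  = \frac{\partial z}{\partial t}\,\frac{\partial t}{\partial x}
  = 2t \cdot 1
  = 2(x + y)
```

The derivative of the composite function is exactly the product of the derivatives of its constituent functions.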

5.2.3 The chain rule and computational graphs

Let us try to express the chain-rule calculation of formula (5.4) with a computational graph.

From this, the result above is easy to obtain:

5.3 Backpropagation

The previous section showed that backpropagation in a computational graph is based on the chain rule. Below we look at concrete operations such as + and ×.

5.3.1 Backpropagation of the addition node

The derivative of z = x + y is ∂z/∂x = 1 and ∂z/∂y = 1.

Then we get: the addition node passes the upstream derivative downstream unchanged.

5.3.2 Backpropagation of multiplication nodes

"Flip value"---it's too intuitive, it is the situation when xy seeks partial derivative

5.3.3 The apple example

Exercise:

Answer:

5.4 Implementation of Simple Layers

The multiplication node is called the "multiplication layer" (MulLayer), and the addition node is called the "addition layer" (AddLayer).

5.4.1 Implementation of the multiplication layer

Here is an example:

Implementation code:
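A minimal sketch of the multiplication layer (MulLayer) with the forward()/backward() interface used in this chapter:

```python
class MulLayer:
    def __init__(self):
        self.x = None
        self.y = None

    def forward(self, x, y):
        # remember the inputs; they are needed for backward()
        self.x = x
        self.y = y
        return x * y

    def backward(self, dout):
        # "flip" the inputs: multiply the upstream derivative by the other input
        dx = dout * self.y
        dy = dout * self.x
        return dx, dy
```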

Note: a separate layer instance is used for each multiplication node.

The derivative of each variable can then be obtained by calling backward(), as in the usage sketch below.
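As a usage sketch, here is the apple example from section 5.3.3 (apple price 100, 2 apples, consumption tax 1.1), assuming the MulLayer class sketched above:

```python
apple = 100
apple_num = 2
tax = 1.1

# each multiplication node gets its own layer instance
mul_apple_layer = MulLayer()
mul_tax_layer = MulLayer()

# forward
apple_price = mul_apple_layer.forward(apple, apple_num)      # 200
price = mul_tax_layer.forward(apple_price, tax)              # 220.0

# backward (traverse the layers in reverse order)
dprice = 1
dapple_price, dtax = mul_tax_layer.backward(dprice)          # 1.1, 200
dapple, dapple_num = mul_apple_layer.backward(dapple_price)  # 2.2, 110

print(price, dapple, dapple_num, dtax)
```

A separate MulLayer instance is needed per node because each instance stores its own forward-propagation inputs.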

5.4.2 Implementation of the addition layer

The forward() of the addition layer receives two arguments x and y, adds them, and outputs the result.

backward() passes the derivative (dout) from upstream to downstream intact

Implementation code:
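A minimal sketch of the addition layer (AddLayer):

```python
class AddLayer:
    def __init__(self):
        pass  # an addition layer has no state to remember

    def forward(self, x, y):
        return x + y

    def backward(self, dout):
        # the upstream derivative is passed downstream unchanged
        dx = dout * 1
        dy = dout * 1
        return dx, dy
```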

5.5 Implementation of the activation function layer

We now apply the idea of computational graphs to neural networks.

We implement each layer that makes up the neural network as a class. Let's first implement the ReLU and Sigmoid layers for the activation functions.

5.5.1 ReLU layer

In the implementation of neural network layers, it is generally assumed that the parameters of forward() and backward() are NumPy arrays.

Implementation code:
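A minimal sketch of the ReLU layer, assuming the inputs are NumPy arrays as noted above:

```python
class Relu:
    def __init__(self):
        self.mask = None

    def forward(self, x):
        # mask is a boolean array: True where the input is <= 0
        self.mask = (x <= 0)
        out = x.copy()
        out[self.mask] = 0
        return out

    def backward(self, dout):
        # where the forward input was <= 0, no gradient flows downstream
        dout[self.mask] = 0
        dx = dout
        return dx
```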


Explanation of (x <= 0):

If an input value during forward propagation is less than or equal to 0, the backpropagated value is 0. Therefore, the mask saved during forward propagation is used in backpropagation: the elements of the upstream dout are set to 0 wherever the mask is True.
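A quick sketch of how the boolean mask behaves (the input values are just for illustration):

```python
import numpy as np

x = np.array([[1.0, -0.5], [-2.0, 3.0]])
mask = (x <= 0)
print(mask)
# [[False  True]
#  [ True False]]
```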

5.5.2 Sigmoid layer

Next we implement the sigmoid function, which is given by formula (5.9): y = 1 / (1 + exp(−x)).

Expressing formula (5.9) as a computational graph gives Figure 5-19.

The graph contains the new "exp" and "/" nodes, which compute y = exp(x) and y = 1/x respectively.

Next, let us work out its backpropagation.

It proceeds in four steps; the figures in the book show this most intuitively.

Step 1

The "/" node computes y = 1/x, whose derivative is ∂y/∂x = −1/x² = −y². In backpropagation, the upstream value is therefore multiplied by −y² (the square of the forward-propagation output, times −1) and passed downstream.

Step 2

The "+" node passes the upstream value intact to the downstream. The calculation diagram is shown below.

Step 3

The "exp" node computes y = exp(x); its derivative is also exp(x), so the upstream value is multiplied by the node's forward-propagation output and passed downstream.

Step 4

For the "×" node, we only need the "flip": the upstream value is multiplied by the other forward-propagation input, which in this graph is −1.

Finally, we obtain ∂L/∂y · y² exp(−x) as the backpropagated value.

Therefore, this value can be computed from the forward-propagation input x and output y alone; moreover, it simplifies to ∂L/∂y · y(1 − y), which depends only on the output y.

The simplified version of the computational graph omits the intermediate steps of the backpropagation calculation, so it is more efficient.

It also lets us ignore the internal details of the Sigmoid layer and focus only on its input and output.

The back propagation of the Sigmoid layer shown in Figure 5-21 can be calculated based only on the output of the forward propagation
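A minimal sketch of the Sigmoid layer; as noted above, backward() needs only the output saved during forward propagation (NumPy inputs assumed):

```python
import numpy as np

class Sigmoid:
    def __init__(self):
        self.out = None

    def forward(self, x):
        out = 1 / (1 + np.exp(-x))
        self.out = out          # keep the output; backward() needs only this
        return out

    def backward(self, dout):
        # dy/dx = y * (1 - y), where y is the forward-propagation output
        dx = dout * (1.0 - self.out) * self.out
        return dx
```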

5.6 Implementation of Affine/Softmax layer

In the forward propagation of the neural network, in order to calculate the sum of the weighted signals, the matrix product operation is used

The weighted sum of the neurons can be computed as Y = np.dot(X, W) + B.

For the product of X and W, the numbers of elements in the corresponding dimensions must match.

Note that a shape written as (2,) denotes a one-dimensional array with 2 elements, not a 2-row matrix.
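A minimal sketch with the shapes used in the book's example (the values themselves are placeholders):

```python
import numpy as np

X = np.random.rand(2)      # input, shape (2,)   -- a 1-D array with 2 elements
W = np.random.rand(2, 3)   # weights, shape (2, 3)
B = np.random.rand(3)      # bias, shape (3,)

Y = np.dot(X, W) + B       # weighted sum, shape (3,)
print(Y.shape)             # (3,)
```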

The matrix product performed in the forward propagation of a neural network is called an "affine transformation" in the field of geometry. Therefore, the processing that performs the affine transformation is implemented here as an "Affine layer".

The calculation np.dot(X, W) + B can be represented by the computational graph shown in Figure 5-24.

Until now, scalars flowed between the nodes of our computational graphs; in this example, matrices propagate between the nodes.

Why pay attention to the shapes of the matrices? Because the matrix product requires the number of elements in the corresponding dimensions to match, and by confirming this consistency, formula (5.13) can be derived.
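For reference, carrying out this shape check for Y = X·W + B gives the following backward formulas (this is what I take formula (5.13) to be; Wᵀ denotes the transpose of W):

```latex
\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \cdot W^{T}
\qquad
\frac{\partial L}{\partial W} = X^{T} \cdot \frac{\partial L}{\partial Y}
```

For a single sample, the bias gradient ∂L/∂B simply equals ∂L/∂Y; in the batch version below it is summed over the data axis.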

5.6.2 Batch version of Affine layer

Now consider forward propagation for N pieces of data at once, i.e., the batch version of the Affine layer.

Unlike before, the shape of the input X is now (N, 2). The matrix calculations on the computational graph then proceed just as before.

Because forward propagation adds the bias to every piece of data (the first, the second, ...), backpropagation must sum the backpropagated values of all pieces of data into the corresponding bias elements.

In this example, assume there are 2 pieces of data (N = 2). The bias backpropagation sums the derivatives of these 2 pieces of data element by element.

Here np.sum() sums the elements along the 0th axis (the axis along which the pieces of data are stacked, axis=0), as sketched below.
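A minimal sketch of the bias backpropagation with N = 2 (the numbers are just an illustration):

```python
import numpy as np

dY = np.array([[1, 2, 3], [4, 5, 6]])
dB = np.sum(dY, axis=0)   # sum over the batch axis (axis=0)
print(dB)                 # [5 7 9]
```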

Note that the book's full implementation also handles input data given as a tensor (four-dimensional data), so it differs slightly from the two-dimensional version described here.
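A sketch of the batch Affine layer for 2-D inputs; as just noted, the reshaping that the full version performs for tensor (e.g. four-dimensional) inputs is omitted here:

```python
import numpy as np

class Affine:
    def __init__(self, W, b):
        self.W = W
        self.b = b
        self.x = None
        self.dW = None
        self.db = None

    def forward(self, x):
        self.x = x
        out = np.dot(x, self.W) + self.b
        return out

    def backward(self, dout):
        dx = np.dot(dout, self.W.T)        # dL/dX = dL/dY . W^T
        self.dW = np.dot(self.x.T, dout)   # dL/dW = X^T . dL/dY
        self.db = np.sum(dout, axis=0)     # sum the bias gradient over the batch
        return dx
```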

5.6.3 Softmax-with-Loss layer

Consider the Softmax function, for example in handwritten digit recognition:

Because handwritten digit recognition classifies inputs into 10 categories, the Softmax layer also has 10 inputs.

note:

Processing in a neural network has two phases: inference and learning. Inference usually does not use the Softmax layer: when inference only needs to give an answer, we are interested only in the highest score, so the Softmax layer is not needed. The learning phase of the neural network, however, does require the Softmax layer.

Because this layer also includes the cross-entropy error as its loss function, it is called the "Softmax-with-Loss layer". The computational graph of the Softmax-with-Loss layer (the Softmax function plus the cross-entropy error) is shown in Figure 5-29.

The computational graph of Figure 5-29 can be simplified to Figure 5-30.

The softmax function is denoted as the Softmax layer, and the cross-entropy error as the Cross Entropy Error layer. Here we assume 3 classes, so 3 inputs (scores) are received from the previous layer. As shown in Figure 5-30, the Softmax layer normalizes the inputs (a1, a2, a3) and outputs (y1, y2, y3). The Cross Entropy Error layer receives the Softmax outputs (y1, y2, y3) and the teacher labels (t1, t2, t3), and outputs the loss L from this data.

The backpropagation of the Softmax layer yields the "beautiful" result (y1 − t1, y2 − t2, y3 − t3). Since (y1, y2, y3) is the output of the Softmax layer and (t1, t2, t3) is the teacher label, (y1 − t1, y2 − t2, y3 − t3) is exactly the difference between the Softmax output and the teacher label.

The purpose of neural network learning is to make the output of the neural network (the output of Softmax) close to the teacher label by adjusting the weight parameters .

The error between the output of the neural network and the teacher label must be efficiently passed to the previous layer

A specific example:

Consider the case where the teacher label is (0, 1, 0) and the output of the Softmax layer is (0.3, 0.2, 0.5). Because the probability assigned to the correct label is only 0.2 (20%), the network has failed to recognize the input correctly. The backpropagation of the Softmax layer then transmits the large error (0.3, −0.8, 0.5). Because this large error propagates to the preceding layers, the layers in front of the Softmax layer learn a "large" amount from it.

note:

Using the "sum of square error" as the loss function of the "identity function", backpropagation can get such "beautiful" results as (y1 − t1, y2 − t2, y3 − t3).

For another example, consider the case where the teacher label is (0, 1, 0) and the output of the Softmax layer is (0.01, 0.99, 0) (this network recognizes the input quite accurately). The backpropagation of the Softmax layer then transmits the small error (0.01, −0.01, 0). This small error also propagates to the preceding layers, and because it is so small, the layers in front of the Softmax layer learn only a "small" amount.

Implementation of Softmax-with-Loss layer

Note that in backpropagation, the value being propagated is divided by the batch size (batch_size), so that the error passed to the previous layer is that of a single piece of data. A sketch follows below.
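A minimal sketch of the Softmax-with-Loss layer, assuming one-hot teacher labels and batched 2-D inputs; softmax() and cross_entropy_error() are defined inline to keep the sketch self-contained:

```python
import numpy as np

def softmax(a):
    a = a - np.max(a, axis=-1, keepdims=True)   # guard against overflow
    exp_a = np.exp(a)
    return exp_a / np.sum(exp_a, axis=-1, keepdims=True)

def cross_entropy_error(y, t):
    # assumes y and t are 2-D arrays of shape (batch_size, num_classes), t one-hot
    batch_size = y.shape[0]
    return -np.sum(t * np.log(y + 1e-7)) / batch_size

class SoftmaxWithLoss:
    def __init__(self):
        self.loss = None
        self.y = None   # output of softmax
        self.t = None   # teacher labels (one-hot)

    def forward(self, x, t):
        self.t = t
        self.y = softmax(x)
        self.loss = cross_entropy_error(self.y, self.t)
        return self.loss

    def backward(self, dout=1):
        batch_size = self.t.shape[0]
        # dividing by batch_size passes the per-example error to the previous layer
        dx = (self.y - self.t) / batch_size
        return dx
```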

5.7 Implementation of the error backpropagation method

5.7.1 Overall picture of neural network learning

The learning of a neural network is divided into the following 4 steps: (1) select a mini-batch from the training data; (2) compute the gradient of the loss function with respect to each weight parameter; (3) update the weight parameters slightly in the gradient direction; (4) repeat steps 1 to 3.

The error backpropagation method will appear in step 2.

In the experiments so far, we used numerical differentiation for step 2. Although it is simple to implement, it takes too much time to compute.

5.7.2 Implementation of a neural network using the error backpropagation method

Here we implement the 2-layer neural network as a class called TwoLayerNet.

It is very similar to the implementation in the previous chapter.

The main difference is that layers are used here. By using layers, both the processing that obtains the recognition result (predict()) and the processing that computes the gradient (gradient()) are completed simply by passing data between the layers.

Only the parts that differ are shown below:
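A sketch of the differing parts, assuming the Relu, Affine, and SoftmaxWithLoss classes sketched earlier in these notes; the accuracy() and numerical_gradient() methods of the book's full version are omitted, and the parameter initialization mirrors the previous chapter:

```python
import numpy as np
from collections import OrderedDict

class TwoLayerNet:
    def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
        # parameters, initialized as in the previous chapter
        self.params = {}
        self.params['W1'] = weight_init_std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = weight_init_std * np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

        # build the layers in order with an OrderedDict
        self.layers = OrderedDict()
        self.layers['Affine1'] = Affine(self.params['W1'], self.params['b1'])
        self.layers['Relu1'] = Relu()
        self.layers['Affine2'] = Affine(self.params['W2'], self.params['b2'])
        self.lastLayer = SoftmaxWithLoss()

    def predict(self, x):
        # forward propagation: call forward() in the order the layers were added
        for layer in self.layers.values():
            x = layer.forward(x)
        return x

    def loss(self, x, t):
        y = self.predict(x)
        return self.lastLayer.forward(y, t)

    def gradient(self, x, t):
        # forward
        self.loss(x, t)
        # backward: call the layers in reverse order
        dout = self.lastLayer.backward(1)
        for layer in reversed(list(self.layers.values())):
            dout = layer.backward(dout)
        # collect the gradients stored in each Affine layer
        grads = {}
        grads['W1'] = self.layers['Affine1'].dW
        grads['b1'] = self.layers['Affine1'].db
        grads['W2'] = self.layers['Affine2'].dW
        grads['b2'] = self.layers['Affine2'].db
        return grads
```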

The important point of this implementation is that the neural network's layers are stored in an OrderedDict. An OrderedDict is an ordered dictionary: "ordered" means it remembers the order in which elements were added. Forward propagation therefore only has to call each layer's forward() method in the order the layers were added, while backpropagation only has to call the layers in the reverse order.

Because the Affine and ReLU layers handle forward and backward propagation correctly on their own, all that remains is to connect the layers in the right order and then call them in order (or in reverse order).

To build a larger neural network, we simply add the necessary layers, like assembling Lego blocks.

5.7.3 Gradient confirmation of error back propagation method

There are two methods for computing the gradient:

One is based on numerical differentiation; the other solves the mathematical expressions analytically. The latter, realized as the error backpropagation method, computes the gradient efficiently even when there are a large number of parameters, so we use the error backpropagation method to find the gradient.

Numerical differentiation is still needed, however, to confirm that the implementation of the error backpropagation method is correct.

The operation of confirming that the gradient obtained by numerical differentiation and the gradient obtained by the error backpropagation method agree (strictly speaking, are very close) is called a gradient check. A sketch follows.
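A sketch of the gradient check, assuming the book's dataset.mnist.load_mnist helper and a TwoLayerNet that also provides a numerical_gradient() method (not included in the sketch above):

```python
import numpy as np
from dataset.mnist import load_mnist   # assumed: the book's MNIST loader

# read the data and take a few samples
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)

network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

x_batch = x_train[:3]
t_batch = t_train[:3]

grad_numerical = network.numerical_gradient(x_batch, t_batch)
grad_backprop = network.gradient(x_batch, t_batch)

# mean absolute difference of the corresponding elements of each parameter
for key in grad_numerical.keys():
    diff = np.average(np.abs(grad_backprop[key] - grad_numerical[key]))
    print(key + ":" + str(diff))
```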

The error here is computed as the average of the absolute differences of the corresponding elements of each weight parameter. Running the code above produces output like the following.

For example, the error for the bias of the first layer is about 9.7e-13 (0.00000000000097). Because the errors are this small, we know that the gradient computed by the error backpropagation method is correct, i.e., that the implementation of the error backpropagation method is correct.

5.7.4 Learning using error back propagation method

Let's take a look at the implementation of neural network learning using the error back propagation method. Compared with the previous implementation, the only difference is that the gradient is obtained by the error back propagation method.

The code that uses the network differs from the previous version only in the following part, sketched below:
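A sketch of the differing part of the training loop, assuming the data arrays, network, iters_num, train_size, batch_size, and learning_rate are set up as in the previous chapter:

```python
import numpy as np

for i in range(iters_num):
    # pick a mini-batch
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    # gradient by the error backpropagation method
    # (previously: grad = network.numerical_gradient(x_batch, t_batch))
    grad = network.gradient(x_batch, t_batch)

    # update the parameters
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]
```

Because -= updates the parameter arrays in place, the Affine layers, which hold references to the same arrays, use the updated weights on the next forward pass.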

5.8 Summary

In this chapter we used computational graphs to introduce the error backpropagation method and implemented the processing of a neural network in units of layers, for example the ReLU layer, the Softmax-with-Loss layer, the Affine layer, and the Softmax layer. Each layer implements a forward and a backward method, and by propagating data forward and backward, the gradients of the weight parameters can be computed efficiently. Because the layers are modular, they can be assembled freely, making it easy to build whatever network you like.