[Pytorch] Learning Record (3) Backpropagation

Backpropagation is a key algorithm in neural networks: it propagates gradients through a computational graph. Figure 1 shows the simplest neural network. ω is the weight in the model and is what we want to train; y is the final output, and * is the node where the computation takes place. Training actually means updating the weight ω, and the update rule is \omega = \omega-\alpha\frac{\partial loss}{\partial \omega}. Our goal is to minimize the loss value.

Figure 1 The simplest neural network

The gradient descent update contains a partial derivative. For the simple model \hat{y}=x\omega with loss (x_n\omega-y_n)^2, the derivative works out to \frac{\partial loss}{\partial \omega}=2x_n(x_n\omega-y_n). For a simple model like this we can write the gradient analytically, but for complex models we can no longer do so.
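
A minimal sketch of this analytical approach, assuming the one-weight model \hat{y}=x\omega with squared loss; the sample values, learning rate, and the helper name `gradient` are made up for illustration.

```python
# Analytic gradient for y_hat = x * w with loss = (x * w - y) ** 2.
def gradient(x, y, w):
    return 2 * x * (x * w - y)   # d(loss)/dw, derived by hand

w, alpha = 1.0, 0.01             # initial weight and learning rate (made up)
x, y = 2.0, 4.0                  # one training sample (made up)

w = w - alpha * gradient(x, y, w)   # one gradient descent update
print(w)                            # 1.08: w moves toward the optimum 2.0
```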

Figure 2 Complex model

As shown in Figure 2, each node has its own set of weights ω, so the number of weights is very large. The hidden layer h1 in the second column has 6 elements, so h1 is a six-dimensional vector, and the input x is a five-dimensional vector. To get h1 in the form h_1=\omega x we need a matrix multiplication, and from the shapes we can see that ω must be a 6×5 matrix, i.e. 30 different weights. Likewise, the third layer has 7 elements, which requires another 6×7=42 weights, and across the whole network there are hundreds of weights in total. It is practically impossible to write out an analytical gradient for each one.
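
A short sketch of the shapes described above; the random values are only placeholders, and the variable names are illustrative.

```python
import torch

x = torch.randn(5, 1)        # 5-dimensional input column vector
W1 = torch.randn(6, 5)       # first layer: 6 x 5 = 30 weights
h1 = W1 @ x                  # hidden layer h1 is a 6-dimensional vector

W2 = torch.randn(7, 6)       # next layer: 7 x 6 = 42 weights
h2 = W2 @ h1                 # third-layer output is 7-dimensional

print(h1.shape, h2.shape)    # torch.Size([6, 1]) torch.Size([7, 1])
```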

Therefore, we need an algorithm that treats the complex network as a graph, lets gradients propagate along that graph, and computes each gradient with the chain rule. This algorithm is called the backpropagation algorithm.
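
To make the chain rule concrete, here is a hand-worked sketch on the simplest graph from Figure 1: each node only knows its local derivative, and multiplying the local derivatives along the path gives the gradient of the loss. The numbers are made up.

```python
# Graph: y_hat = x * w, loss = (y_hat - y) ** 2.
x, y, w = 2.0, 4.0, 1.0

y_hat = x * w                      # forward pass through the graph
dloss_dyhat = 2 * (y_hat - y)      # local derivative of the squaring node
dyhat_dw = x                       # local derivative of the * node w.r.t. w
dloss_dw = dloss_dyhat * dyhat_dw  # chain rule: equals 2 * x * (x * w - y)

print(dloss_dw)                    # -8.0, matching the analytic formula above
```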

For example, take the two-layer neural network \hat{y}=W_2(W_1\cdot X+b_1)+b_2. The input X is first multiplied by W_1 and added to b_1, giving the inner layer's output. That result is then multiplied by the outer W_2 and added to b_2 to obtain the final output \hat{y}. This is exactly a computational graph, and each green box (computing module) has its own rule for computing its local partial derivatives.
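
A minimal sketch of this two-layer forward pass; the layer sizes and random values are assumptions made for illustration.

```python
import torch

X = torch.randn(5, 1)                          # made-up input
W1, b1 = torch.randn(6, 5), torch.randn(6, 1)  # inner layer parameters
W2, b2 = torch.randn(1, 6), torch.randn(1, 1)  # outer layer parameters

h = W1 @ X + b1        # inner layer output
y_hat = W2 @ h + b2    # final output
print(y_hat.shape)     # torch.Size([1, 1])
```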

For a multi-layer neural network this alone is not enough, because \hat{y} can be expanded and simplified: no matter how many layers are stacked, the whole thing collapses into the form y=WX+b (as shown in Figure 3). To solve this problem, we apply a non-linear transformation to the output of each layer. For example, for the three outputs of the first layer, we pass x_1, x_2, x_3 through the sigmoid function \frac{1}{1+e^{-x}}, feed the results into the next layer, and so on.

Figure 3 Merge of multi-layer neural networks
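
The sketch below illustrates the collapse described above and how a nonlinearity prevents it; the sizes and random values are again made up.

```python
import torch

X = torch.randn(5, 1)
W1, b1 = torch.randn(6, 5), torch.randn(6, 1)
W2, b2 = torch.randn(1, 6), torch.randn(1, 1)

# Without a nonlinearity the two layers merge into a single linear layer:
stacked = W2 @ (W1 @ X + b1) + b2
merged = (W2 @ W1) @ X + (W2 @ b1 + b2)
print(torch.allclose(stacked, merged))            # True: just y = W X + b

# Inserting a sigmoid between the layers prevents this collapse:
nonlinear = W2 @ torch.sigmoid(W1 @ X + b1) + b2
```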

Figure 4 Backward propagation starting from z


The following explains how to implement the forward and backward passes in PyTorch. First of all, the basic data type in PyTorch is the Tensor, and all data values must be stored in Tensors. A Tensor can hold scalars, vectors, matrices, and so on. A tensor contains two parts, data and grad, which store the value of ω itself and the derivative of the loss with respect to ω, respectively. When you build a model, you are in fact building a computational graph.
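
A minimal sketch of a weight Tensor carrying both data and grad for the simple model from Figure 1; the sample values are made up.

```python
import torch

x, y = torch.tensor([2.0]), torch.tensor([4.0])
w = torch.tensor([1.0], requires_grad=True)   # weight whose gradient PyTorch tracks

y_hat = x * w                 # forward pass builds the computational graph
loss = (y_hat - y) ** 2

loss.backward()               # gradient propagates back along the graph
print(w.data)                 # tensor([1.]) -- the value of w itself
print(w.grad)                 # tensor([-8.]) -- d(loss)/dw = 2*x*(x*w - y)
```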

Summary: the backpropagation training procedure has four steps. First, compute the loss (forward pass); second, call backward; third, update the weight with \omega = \omega-\alpha\frac{\partial loss}{\partial \omega}; fourth, clear the grad. Wrapping these four steps in an outer loop completes the training.
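
A minimal training loop following these four steps, assuming the one-weight model \hat{y}=x\omega; the data, learning rate, and epoch count are made up.

```python
import torch

x_data = [1.0, 2.0, 3.0]      # made-up data for y = 2x
y_data = [2.0, 4.0, 6.0]
w = torch.tensor([1.0], requires_grad=True)
alpha = 0.01

for epoch in range(100):
    for x, y in zip(x_data, y_data):
        loss = (x * w - y) ** 2                  # step 1: compute the loss (forward)
        loss.backward()                          # step 2: backward fills w.grad
        w.data = w.data - alpha * w.grad.data    # step 3: w = w - alpha * d(loss)/dw
        w.grad.data.zero_()                      # step 4: clear the grad

print(w.data)   # approaches tensor([2.])
```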

Origin blog.csdn.net/m0_55080712/article/details/122837290