Explanation of PyTorch's automatic differentiation mechanism

PS: When I set out to understand the automatic differentiation mechanism in depth, I was lucky to start with this blog post translated by ronghuaiyang, https://blog.csdn.net/u011984148/article/details/99670194 , so my overall understanding did not go astray. Still, there was a lot I only half understood, so I spent an afternoon collecting and sorting related material, and combined the blog's explanation with code to look at this simple but not so simple topic from a more intuitive angle. The structure follows the blog above, with the details simplified. Thanks also to these reference articles: https://ptorch.com/news/172.html , https://blog.csdn.net/tsq292978891/article/details/79333707 , https://blog.csdn.net/ch1209498273/article/details/79018160 , https://blog.csdn.net/ronaldo_hu/article/details/91359018 , https://blog.csdn.net/douhaoexia/article/details/78821428 (this last reference is the most complete and helped with many details).

1. PyTorch basics

Everything here assumes PyTorch 0.4.0 or later; earlier versions are not considered. Without further ado, straight to the code:

import torch
import numpy as np

x = torch.randn(2, 2, requires_grad = True)
print(x.requires_grad)
print('##############')
y = np.array([1., 2., 3.])
y = torch.from_numpy(y)
print(y.requires_grad)
y.requires_grad_(True)
print('after:',y.requires_grad)

True
##############
False
after: True

This example is adapted from the original blog. It is quite intuitive: x is created directly as a tensor with gradients enabled, while y is created as a plain tensor first, and then y.requires_grad_(True) turns on gradient tracking for it.

Note (also quoted from the original blog): "By PyTorch's design, gradients can only be computed for floating-point tensors. That is why I created a floating-point numpy array and then converted it to a PyTorch tensor with gradients enabled."
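The floating-point restriction can be checked directly. The following is a minimal sketch; the variable names int_tensor and float_tensor are illustrative and not from the original blog.

```python
import numpy as np
import torch

# An integer tensor cannot require gradients.
int_tensor = torch.from_numpy(np.array([1, 2, 3]))
try:
    int_tensor.requires_grad_(True)
    int_error = None
except RuntimeError as err:
    int_error = str(err)
print("integer tensor:", int_error)

# Converting to a floating-point dtype first makes it work.
float_tensor = int_tensor.float().requires_grad_(True)
print("float tensor:", float_tensor.requires_grad)
```

The first call raises a RuntimeError, while the converted tensor accepts requires_grad_(True) without complaint.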

2. Neural network and backpropagation

1. The network initializes the weight;

2. The input data is propagated forward through the convolutional layer, down-sampling layer, and fully connected layer to obtain the output value;

3. Find the error between the output value of the network and the target value, which is the loss;

4. Backpropagation: compute the gradient of each weight;

5. Update the weights according to the computed gradients, then return to step 2.

The change in the loss caused by a slight change in a weight is called the gradient of that weight (this is the gradient we need when updating the weights), and it is computed by backpropagation. The gradient, scaled by the learning rate, is then used to update the weights so as to reduce the overall loss and train the neural network.
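The five steps above can be sketched in a few lines. This is only a toy illustration under stated assumptions: a single scalar weight stands in for the convolutional and fully connected layers, and the data (targets = 3 * inputs), the learning rate, and the number of steps are made up for the example.

```python
import torch

torch.manual_seed(0)

# Toy data for a hypothetical target relation y = 3x (not from the original post).
inputs = torch.linspace(-1.0, 1.0, 20)
targets = 3.0 * inputs

# Step 1: initialize the weight.
w = torch.randn((), requires_grad=True)
learning_rate = 0.1

losses = []
for step in range(100):
    outputs = w * inputs                       # Step 2: forward pass
    loss = ((outputs - targets) ** 2).mean()   # Step 3: loss
    loss.backward()                            # Step 4: gradient of the loss w.r.t. w
    with torch.no_grad():
        w -= learning_rate * w.grad            # Step 5: weight update
    w.grad.zero_()                             # clear the accumulated gradient
    losses.append(loss.item())

print(w.item(), losses[0], losses[-1])
```

The weight moves toward 3 and the loss shrinks. The update is wrapped in torch.no_grad() so that the update itself is not recorded in the graph, and the gradient is zeroed because backward() accumulates into .grad.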

3. Dynamic computation graph

I copy the two figures from the original blog here, and then describe my understanding in my own words.

First, the requires_grad=False or True shown under the two graphs refers only to the tensor X. Second, X can be regarded as the weight W, Y as the input of the network, and Z as the output. Then output = input * W, so the recorded gradient is \frac{\partial output}{\partial W} = input, and since the input, that is, Y, is 2, the grad of X is 2. It is worth noting that the grad_fn of Z is MulBackward, indicating that it is the backward function of a multiplication.
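The two graphs can be reproduced with a short sketch. The value Y = 2 comes from the text above; the value 1.0 for X and the names x_off and z_off are assumptions made for illustration.

```python
import torch

x = torch.tensor(1.0, requires_grad=True)  # plays the role of the weight W
y = torch.tensor(2.0)                      # plays the role of the input
z = x * y                                  # output

print(z.requires_grad)            # True, because x requires a gradient
print(type(z.grad_fn).__name__)   # the name starts with "MulBackward"
z.backward()
print(x.grad)                     # dz/dx = y = 2
print(y.grad)                     # None: y does not require a gradient

# With requires_grad=False on x, nothing is recorded.
x_off = torch.tensor(1.0)
z_off = x_off * y
print(z_off.requires_grad, z_off.grad_fn)
```

In the second case z_off has no grad_fn, so there is no graph to backpropagate through.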

4. The backward() function

This is a direct example to illustrate.

import torch
w1 = torch.tensor(2.0, requires_grad=True)  # treat w1 and w2 as the parameters of functions f1 and f2
w2 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(3.0, requires_grad=True)
y2 = x2**w1            # f1 operation
z2 = w2*y2+1           # f2 operation
z2.backward()
print(x2.grad)
print(y2.grad)
print(w1.grad)
print(w2.grad)


tensor(12.)
None
tensor(19.7750)
tensor(9.)

The underlying calculation of the gradients of the four variables is as follows:
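Written out with the chain rule for y = x^w1 and z = w2*y + 1, the gradients are dz/dy = w2, dz/dx = w2 * w1 * x^(w1-1), dz/dw1 = w2 * x^w1 * ln(x), and dz/dw2 = y. The sketch below compares these hand-derived values with what autograd reports; the names ending in _v and the dz_* variables are illustrative only.

```python
import math
import torch

w1 = torch.tensor(2.0, requires_grad=True)
w2 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(3.0, requires_grad=True)
z2 = w2 * x2 ** w1 + 1
z2.backward()

# Hand-derived gradients for z = w2 * x**w1 + 1 at w1 = 2, w2 = 2, x = 3.
w1_v, w2_v, x_v = 2.0, 2.0, 3.0
dz_dy = w2_v                                    # dz/dy = w2
dz_dx = dz_dy * w1_v * x_v ** (w1_v - 1)        # 2 * 2 * 3 = 12
dz_dw1 = dz_dy * x_v ** w1_v * math.log(x_v)    # 2 * 9 * ln(3)
dz_dw2 = x_v ** w1_v                            # y = 9

print(dz_dx, x2.grad.item())
print(dz_dw1, w1.grad.item())
print(dz_dw2, w2.grad.item())
```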

 

You may be surprised that y2.grad could be computed but shows None. The explanation is fairly simple. Some bloggers put it this way: x2.grad, w1.grad, and w2.grad have values but y2.grad is None, which means the gradients of x2, w1, and w2 are retained while the gradient of y2 is not. If you think about it carefully, x2, w1, and w2 are all leaf nodes. In this computation tree, x2 and w1 are leaf nodes at the same depth (the bottom layer); y2 and w2 are at the same depth, where w2 is a leaf node on its own and y2 is the parent node of x2 and w1. So y2 is the only one of the four that is not a leaf, and therefore the only one whose gradient is not retained, which confirms the statement above. It also shows that the computation graph here has a structure similar to a binary tree, as shown below:

Put more simply, think of x as the input and z as the output. To update the weights by backpropagation you only need the derivatives of z with respect to w1 and w2; the gradients with respect to x and y are not needed for that.
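The leaf and non-leaf distinction can be inspected with is_leaf, and a non-leaf gradient can be kept on request with retain_grad(). A minimal sketch, reusing the variables of the example above:

```python
import torch

w1 = torch.tensor(2.0, requires_grad=True)
w2 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(3.0, requires_grad=True)
y2 = x2 ** w1
y2.retain_grad()       # ask autograd to keep the gradient of this non-leaf tensor
z2 = w2 * y2 + 1

print(x2.is_leaf, w1.is_leaf, w2.is_leaf)  # True True True
print(y2.is_leaf, z2.is_leaf)              # False False

z2.backward()
print(y2.grad)         # dz2/dy2 = w2 = 2
```

Without the retain_grad() call, y2.grad stays None, as in the output shown earlier.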

With that explained, Example 2 is easier to follow:

import torch

x = torch.tensor([1.0, 2.0, 8.0], requires_grad=True)
y = torch.tensor([5.0, 1.0, 7.0], requires_grad=True)
z = x * y
z.backward(torch.tensor([1.0, 1.0, 1.0]))  # one weight per element of z
print(y.grad)


tensor([1., 2., 8.])


import torch

x = torch.tensor([1.0, 2.0, 8.0], requires_grad=True)
y = torch.tensor([5.0, 1.0, 7.0], requires_grad=True)
z = x * y
v = torch.tensor([0.1, 1.0, 0.001], dtype=torch.float)
z.backward(v)
print(y.grad)


tensor([0.1000, 2.0000, 0.0080])

At first I did not understand the point the original author makes here: calling z.backward() without an argument raises "RuntimeError: grad can be implicitly created only for scalar outputs". The following looks at it from another angle to understand what z.backward(v) changes.

The backward function is the entry point of backpropagation. Calling backward on the node to be differentiated computes the gradients and stores them on the corresponding leaf nodes. backward takes an important parameter, grad_tensor (named gradient in Tensor.backward), but if the node holds only a scalar value, this parameter can be omitted (for example, the most common loss.backward() is equivalent to loss.backward(torch.tensor(1.0))).
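Both claims can be checked with a short sketch: the error for a non-scalar output, and the equivalence of the implicit and explicit forms for a scalar. Reducing z with sum() to get a scalar loss is an assumption made for the example, as are the names message, grad_implicit, and grad_explicit.

```python
import torch

x = torch.tensor([1.0, 2.0, 8.0], requires_grad=True)
y = torch.tensor([5.0, 1.0, 7.0], requires_grad=True)

# Non-scalar output: backward() without an argument fails.
z = x * y
try:
    z.backward()
    message = None
except RuntimeError as err:
    message = str(err)
print(message)

# Scalar output: the argument can be omitted ...
loss = (x * y).sum()
loss.backward()
grad_implicit = y.grad.clone()

# ... or passed explicitly as 1.0, with the same result.
x.grad.zero_()
y.grad.zero_()
loss = (x * y).sum()
loss.backward(torch.tensor(1.0))
grad_explicit = y.grad.clone()

print(grad_implicit, grad_explicit)
```

The gradients are zeroed between the two calls because backward() accumulates into .grad rather than overwriting it.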
 

5. Mathematics: Jacobian matrices and vectors

torch.autograd is essentially an engine for computing vector-Jacobian products, v^T⋅J, and the parameter grad_tensor is the v here. It follows from the definition that grad_tensor must have the same size as the tensor on which backward is called. By choosing grad_tensor appropriately, any weighted combination of the gradients of the output components can be computed.

 

In backpropagation, v is generally the gradient passed down from upstream, which is how the chain rule is realized. A simple derivation is as follows:
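A sketch of that derivation in the notation of this section. The vector function y = f(x) with m outputs and n inputs and the scalar l = g(y) are illustrative names that do not appear in the original post.

```latex
J =
\begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{pmatrix},
\qquad
v = \left( \frac{\partial l}{\partial y_1}, \ldots, \frac{\partial l}{\partial y_m} \right)^T

v^T \cdot J
= \left( \sum_{i=1}^{m} \frac{\partial l}{\partial y_i} \frac{\partial y_i}{\partial x_1},
  \; \ldots, \;
  \sum_{i=1}^{m} \frac{\partial l}{\partial y_i} \frac{\partial y_i}{\partial x_n} \right)
= \left( \frac{\partial l}{\partial x_1}, \ldots, \frac{\partial l}{\partial x_n} \right)
```

When v is the upstream gradient of l with respect to y, the product v^T⋅J is exactly the gradient of l with respect to x, which is the chain rule.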

The key is what v is. In practice, v, such as the torch.tensor([0.1,1.0,0.001],dtype =torch.float) above, can be understood as the error of the output relative to the label. For example, if Z label=[5,2,56] and Z actual=[5.1,3.0,56.001], then the error is [0.1,1.0,0.001]. Since z = x * y is element-wise, the Jacobian of z with respect to y is a diagonal matrix with x = [1, 2, 8] on the diagonal, so the tensor([0.1000, 2.0000, 0.0080]) in the previous example follows easily from v^T⋅J.
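The same number can be obtained by building the Jacobian explicitly and multiplying by v. This sketch assumes a PyTorch version that provides torch.autograd.functional.jacobian (newer than the 0.4.0 baseline mentioned at the start); the names J and manual are illustrative.

```python
import torch

x = torch.tensor([1.0, 2.0, 8.0])
y = torch.tensor([5.0, 1.0, 7.0], requires_grad=True)
v = torch.tensor([0.1, 1.0, 0.001])

# Jacobian of z = x * y with respect to y: a diagonal matrix with x on the diagonal.
J = torch.autograd.functional.jacobian(lambda t: x * t, y.detach())
manual = v @ J          # the vector-Jacobian product v^T . J

# The same product computed by autograd through backward(v).
z = x * y
z.backward(v)

print(J)
print(manual, y.grad)
```

Both routes give the same vector, but backward(v) never materializes the full Jacobian, which matters when the tensors are large.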

 


Origin blog.csdn.net/qq_36401512/article/details/100934259