Notes on PyTorch autograd

There are plenty of introductions on the Internet; I am organizing my own notes here to consolidate my understanding and for future reference.
PyTorch has overtaken TensorFlow as the most popular deep learning framework. In my view, the two most important parts of a deep learning framework are convenient GPU operations and automatic backpropagation. PyTorch does both very well: its autograd module implements automatic differentiation for backpropagation.

1. The chain rule and backpropagation

A deep neural network can be viewed as a multi-layer nested composite function: each convolution layer, each ReLU layer, and so on is a function. Differentiating a composite function uses the classic chain rule:

$$\frac{dy}{dx} = \frac{dy}{du}\cdot\frac{du}{dx}$$

where $u$ is an intermediate variable (there can be more than one). For multivariate functions, such as each layer in a deep network, the derivatives become partial derivatives, and at that point the derivative is usually called the gradient.
This formula is not hard to understand: a derivative is the rate of change of an output with respect to an input, and when several input-output stages are chained in series, their rates of change multiply to give the rate of change of the final output with respect to the initial input. Of course, a composite function is not only a chain with a single input and a single output; it can also take the form of multiple inputs and one output (one input with multiple outputs would not be a function). If you think of the composite function as a network in which data is processed node by node, it becomes much easier to understand. Quoting a picture from Zhihu: the arrows indicate the backpropagation process, and the forward pass runs in the opposite direction of the arrows.
[Figure: computation graph from Zhihu; the arrows show the direction of backpropagation]
The purpose of backpropagation is to obtain the gradient of the final output with respect to the inputs; we decompose it into the product of the gradients of each layer's function. The formula for each layer's gradient is fixed and known in advance, but each layer's input and output values are only known after forward propagation, and computing the derivatives is the reverse process. The gradient of each layer depends on the gradient of the layer after it and on that layer's own derivative formula.
Note that in a deep neural network, the input and the parameters of each layer are leaf nodes. The gradients of the leaf nodes do not participate in the main flow of backpropagation, so skipping the derivative of a leaf node does not affect the derivatives of other nodes. Of course, the gradients of the leaf nodes are exactly what we want, so they are obtained by branching off from the main backpropagation path.
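
To make this concrete, here is a minimal sketch of my own (not from the original post) showing the chain rule and the leaf / non-leaf distinction in autograd:

import torch

# y = u**2 with u = 3x, so dy/dx = dy/du * du/dx = 2u * 3 = 18x
x = torch.tensor(2.0, requires_grad=True)   # leaf node
u = 3 * x                                   # intermediate (non-leaf) node
y = u ** 2                                  # output node
y.backward()
print(x.is_leaf, u.is_leaf)   # True False
print(x.grad)                 # tensor(36.) = 18 * 2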

2. requires_grad

With the above in mind, let's look at the corresponding PyTorch code. Since the gradients of leaf nodes are not needed by the backpropagation process itself, the user can decide whether to compute them by setting requires_grad. Non-leaf nodes are different: their gradients are required for backpropagation, so they cannot be set to requires_grad=False.
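
A small illustration of my own, showing how requires_grad propagates from leaf nodes to the results built from them:

import torch

x = torch.tensor([1., 2.], requires_grad=True)   # leaf, gradient wanted
w = torch.tensor([3., 4.])                        # leaf, requires_grad defaults to False
y = (x * w).sum()                                 # non-leaf: inherits requires_grad=True from x
print(x.requires_grad, w.requires_grad, y.requires_grad)   # True False True
y.backward()
print(x.grad)   # tensor([3., 4.])
print(w.grad)   # None: no gradient is tracked for w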

3. retain_grad

For non-leaf nodes, their gradients must be computed, but they are usually only intermediate results. To save memory (GPU memory), PyTorch by default discards them as soon as they have been used and does not keep them. If we do need one, we can call .retain_grad() on that non-leaf tensor, and its gradient will be kept after backward().
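
A quick sketch (my own example) of the difference retain_grad() makes for a non-leaf tensor:

import torch

x = torch.tensor([1., 2.], requires_grad=True)
u = x * 3           # non-leaf: its grad is normally discarded after backward()
u.retain_grad()     # ask autograd to keep u.grad
y = u.sum()
y.backward()
print(u.grad)       # tensor([1., 1.]) -- kept because of retain_grad()
print(x.grad)       # tensor([3., 3.]) -- leaf grads are always kept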

4. retain_graph

PyTorch uses a dynamic graph mechanism: each forward pass up to the output node (usually the loss value) records, at the loss, the relationship between all the nodes that produced it (represented as a directed acyclic graph). This graph guides the path of backpropagation, but it is freed after a single backward pass, which allows the next forward pass to take a different path. This is the dynamic graph mechanism, and it is very flexible. Sometimes, however, after one backward pass we do not want to run the forward pass again; we want to add some new computation and backpropagate once more (for example, when several loss terms are added up, we can backpropagate each loss as it is computed rather than summing them first, which saves GPU memory). In that case the computation graph from the previous backward pass must be kept, so we call loss.backward(retain_graph=True). Note that the last backward call does not need to retain the graph, which saves some GPU memory.
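
A toy sketch of my own, assuming the "several losses backpropagated one by one" scenario described above:

import torch

x = torch.tensor([1., 2.], requires_grad=True)
h = (x ** 2).sum()                  # shared intermediate computation

loss1 = 2 * h
loss1.backward(retain_graph=True)   # keep the graph for the next backward pass
loss2 = 3 * h
loss2.backward()                    # last backward: no need to retain the graph
print(x.grad)                       # gradients accumulate: (2 + 3) * 2x = tensor([10., 20.])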

5. In-place operations

Note that after a tensor has been used by another function, its value cannot be modified with an in-place operation, because the original value needed to compute the gradient during backpropagation would no longer be available. If a modification is needed, use a non-in-place operation instead. In-place operations include += 1, slice assignment such as x[0] = 1, and so on.
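
A minimal example of my own that triggers the error described above:

import torch

x = torch.tensor([1., 2.], requires_grad=True)
u = x * 2
y = u * u            # the backward of this multiplication needs the value of u
u += 1               # in-place change to u after it has already been used
try:
    y.sum().backward()
except RuntimeError as e:
    print(e)         # "... has been modified by an inplace operation ..."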

6. Non-differentiable functions

If non-differentiable functions appear in the computation, backpropagation can run into problems.
Functions such as argmax() return integer indices that carry no gradient at all, so backpropagating through them reports an error.
round(), ceil(), floor(), and sign() do not report an error, but their derivative is zero (almost) everywhere, so the gradient becomes 0 and is meaningless; do not use them.
abs(), relu, and the like are not strictly differentiable in the mathematical sense (they have isolated non-differentiable points), but PyTorch treats those points specially, so they can be used.
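
A few checks of my own illustrating the three cases above:

import torch

x = torch.tensor([0.4, 1.6], requires_grad=True)

# round() has derivative 0 almost everywhere, so the gradient vanishes
torch.round(x).sum().backward()
print(x.grad)                # tensor([0., 0.])

# argmax() returns integer indices, which carry no gradient at all
idx = torch.argmax(x)
print(idx.requires_grad)     # False -- backpropagating through it would fail

# relu() has a single non-differentiable point at 0; PyTorch uses 0 there
x.grad = None                # clear the previous gradient
torch.relu(x).sum().backward()
print(x.grad)                # tensor([1., 1.])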

7. detach()

x1 = x.detach() detaches x from the current computation graph. The new tensor x1 shares the same storage as x, but x1 does not require a gradient.
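
A short sketch of my own showing the shared storage and the blocked gradient flow:

import torch

x = torch.tensor([1., 2.], requires_grad=True)
x1 = x.detach()
print(x1.requires_grad)                 # False: x1 is outside the computation graph
print(x.data_ptr() == x1.data_ptr())    # True: same underlying storage

# Gradients still flow through x itself, but not through anything built from x1
y = (x * 2).sum() + (x1 * 10).sum()
y.backward()
print(x.grad)                           # tensor([2., 2.]) -- the x1 branch contributes nothing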

Finally, here is a small piece of code that can be used to test the features discussed above. You can also experiment with the optimizer; its usage will be summarized next time.

import torch

# Leaf nodes: the input x1 and the weight w1 both track gradients.
x1 = torch.tensor([[1., 2.], [2., 3.], [1., 1.]], requires_grad=True)
w1 = torch.tensor([[1., 2., 3.], [3., 2., 1.]], requires_grad=True)

for epoch in range(1):
    y1 = torch.mm(w1, x1)    # non-leaf intermediate result
    y1.retain_grad()         # keep y1.grad after backward()
    w2 = torch.tensor([[1., 3.], [2., 1.]])   # requires_grad=False, so w2.grad stays None

    y2 = torch.mm(w2, y1)
    # w1 += 1                # uncomment to trigger an in-place operation error
    y = y2.view(-1).max()

    optimizer = torch.optim.SGD([x1, w1], lr=0.001)
    optimizer.zero_grad()
    y.backward(retain_graph=True)   # keep the graph for the second backward pass
    y3 = y ** 2
    y3.backward()                   # gradients from both passes accumulate

    optimizer.step()
    print(epoch, y)

print('w1:', w1, '\n', 'w1.grad:', w1.grad)
print('w2:', w2, '\n', 'w2.grad:', w2.grad)   # None: w2 does not require grad
print('x1:', x1, '\n', 'x1.grad:', x1.grad)

Origin blog.csdn.net/Brikie/article/details/115397713