Summary of gradient knowledge in PyTorch

1. Leaf nodes, intermediate nodes, gradient calculation

  • All tensors with requires_grad=False are leaf nodes (i.e. leaf tensors, also called leaf node tensors).
  • A tensor with requires_grad=True may or may not be a leaf node. If it has requires_grad=True and was created directly by the user, its grad_fn is None and it is a leaf node. If it has requires_grad=True but was not created directly by the user, i.e. it was produced from other tensors by some operation, then it is not a leaf tensor but an intermediate node tensor, and its grad_fn is not None. For example, grad_fn=<MeanBackward0> means the tensor was produced by a torch.mean() operation; it is an intermediate result, hence an intermediate node tensor rather than a leaf node tensor.
  • To judge whether a tensor is a leaf node, check its is_leaf attribute.
  • The requires_grad attribute of a tensor indicates whether a gradient needs to be computed for this tensor during backpropagation. If requires_grad=False, no gradient is computed for the tensor and the tensor is not optimized or learned.
  • For any PyTorch operation, if all input tensors have requires_grad=False, then the output tensor of that operation also has requires_grad=False; otherwise it is True. In other words, as long as one input tensor requires a gradient (requires_grad=True), the resulting tensor also requires a gradient; only when no input requires a gradient does the result not require one.
  • For tensors with requires_grad=True, gradients are computed during backpropagation. However, for efficiency, PyTorch's autograd mechanism does not save gradients for intermediate results: it only saves the gradients computed for leaf nodes, storing them in the grad attribute of the leaf tensor, and stores nothing in the grad attribute of intermediate node tensors, which remains None. If you need to keep the gradient of an intermediate node, call retain_grad() on it and its gradient will be saved in its grad attribute.
  • By default, only leaf nodes end up with a gradient value in grad; for non-leaf nodes it is None. Only non-leaf nodes have a grad_fn; for leaf nodes it is None (see the short example after this list).
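
A minimal sketch of these rules (the tensor names x and m are just illustrative):

```python
import torch

# Leaf tensor created directly by the user
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
print(x.is_leaf, x.grad_fn)    # True None

# Intermediate tensor produced by an operation
m = x.mean()
print(m.is_leaf, m.grad_fn)    # False <MeanBackward0 object at ...>

m.retain_grad()                # ask autograd to also keep the intermediate gradient
m.backward()

print(x.grad)                  # tensor([0.3333, 0.3333, 0.3333]) -- stored on the leaf
print(m.grad)                  # tensor(1.) -- only kept because retain_grad() was called
```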

2. Leaf tensor (leaf node) and detach()

  • In PyTorch, by default, the gradient values of non-leaf nodes are cleared after being used during backpropagation and are not retained; only the gradient values of leaf nodes are preserved.
  • In a PyTorch neural network, backpropagation via backward() computes the gradients of the leaf nodes. The weight tensors w of the network layers are leaf nodes: their requires_grad is True and they are created by the user (through the layer constructors), so they are leaf nodes, and backward() computes their gradients.
  • When backward() is called, a node's gradient value is only stored in its grad attribute if both requires_grad and is_leaf are True (see the sketch below).
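
As a small illustration of these points (the layer shape and input are arbitrary), the weights of a layer are leaves whose gradients backward() fills in, while the input is a leaf that needs no gradient:

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 1)                                    # weight and bias are leaf tensors
print(layer.weight.is_leaf, layer.weight.requires_grad)    # True True

x = torch.randn(4, 3)                  # model input: requires_grad=False, also a leaf
loss = layer(x).sum()
loss.backward()

print(layer.weight.grad.shape)         # torch.Size([1, 3]) -- gradient kept on the leaf
print(x.grad)                          # None -- no gradient requested for the input
```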

2.1 Why do we need leaf nodes?

Those non-leaf nodes are generated by a series of operations on the leaf nodes defined by the user; that is, these non-leaf nodes are all intermediate variables. Generally, users do not go back and use the derivatives of these intermediate variables, so to save memory they are freed as soon as they have been used.

In PyTorch's autograd mechanism, a tensor's gradient is only computed during backward() backpropagation when its requires_grad is True. Among all tensors with requires_grad=True, by default the gradient values of non-leaf nodes are cleared after use during backpropagation and are not retained (that is, calling loss.backward() clears the gradients of the intermediate variables in the computation graph). By default only the gradient values of leaf nodes are preserved; a retained leaf gradient is stored in the tensor's grad attribute, and the leaf node's data is updated during optimizer.step(), which is how the parameters get updated. A small demonstration follows.
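
A short demonstration of this default behaviour (w and h are just illustrative names):

```python
import torch

w = torch.randn(3, requires_grad=True)   # user-created leaf (think of it as a weight)
h = w * 2                                # intermediate node
loss = h.sum()
loss.backward()

print(w.grad)   # tensor([2., 2., 2.]) -- kept on the leaf
print(h.grad)   # None -- the intermediate gradient was not retained (PyTorch warns here)
```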

2.2 detach() strips a node off into a leaf node

If you need to turn a node into a leaf node, just use detach() to separate it from the computation graph that created it. In other words, the detach() function strips a node out of its computation graph and makes it a leaf node, as in the sketch below.
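
A quick sketch of what detach() changes (the names are illustrative):

```python
import torch

x = torch.randn(3, requires_grad=True)
m = x * 2                 # intermediate node
d = m.detach()            # stripped from the computation graph

print(m.is_leaf, m.grad_fn is None)   # False False
print(d.is_leaf, d.grad_fn is None)   # True  True
print(d.requires_grad)                # False
```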

2.3 What kind of node is a leaf node

① All tensors whose requires_grad is False are by convention treated as leaf tensors. This is just like the inputs of our training model: they have requires_grad=False because they do not need gradients (we train the network to learn the weights of the model, not to train the input). They are the starting points of a computation graph.

② Tensors whose requires_grad is True are leaf tensors (leaf Tensor) if they are created by the user. For example, the parameters of network layers such as nn.Linear(), nn.Conv2d(), etc.: the layers are created by the user and their parameters need to be trained, so requires_grad=True; the parameters are not the result of an operation, so their grad_fn is None.

2.4 The function and difference of detach() and detach_()

detach()

detach() returns a new tensor that is separated from the current computation graph but still points to the storage of the original variable. The difference is that its requires_grad is False: the returned tensor does not need its gradient computed and has no grad. If we keep computing with this new tensor, then later, during backpropagation, the backward pass stops at the tensor that detach() was called on and cannot propagate any further back; even if the detached tensor's requires_grad is later set back to True, gradients still do not flow back through the detach point into the original graph.

The tensor returned by detach() shares the same memory as the original tensor, i.e. if one is modified, the other changes accordingly.
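
A small sketch of both properties, shared storage and blocked backpropagation (the values are arbitrary):

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
m = x * 3
d = m.detach()            # new tensor outside the graph, requires_grad=False

d[0] = 100.0              # shares storage with m
print(m)                  # tensor([100., 6.], grad_fn=<MulBackward0>)

y = (d * 2).sum()
print(y.requires_grad)    # False -- nothing upstream of d gets a gradient
# y.backward()            # would raise: y does not require grad
```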

detach_()

detach_() detaches a tensor from the graph that created it and makes it a leaf tensor. Suppose the relationship between the variables is x -> m -> y, where the leaf tensor is x. Calling m.detach_() on m actually performs two operations:

First, m's grad_fn is set to None, so m is no longer connected to the earlier node x; the relationship becomes x, m -> y, and m becomes a leaf node. Second, m's requires_grad is set to False, so that when backward() is called on y, the gradient of m is not computed.

detach() is very similar to detach_(); the difference is that detach_() changes the tensor itself, while detach() produces a new tensor.

For example, if detach() is applied to m in x -> m -> y, you can still operate on the original computation graph if you change your mind later. But if detach_() is applied, the original computation graph itself has changed, and there is no way to undo it.
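
A sketch contrasting the two (x and m are illustrative):

```python
import torch

x = torch.tensor([2.0], requires_grad=True)

# detach(): m itself stays in the graph, only the new tensor d is cut off
m = x * 3
d = m.detach()
(m * 2).sum().backward()
print(x.grad)                         # tensor([6.]) -- backward through m still works

# detach_(): m itself is cut out of the graph and becomes a leaf
x.grad = None
m2 = x * 3
m2.detach_()
print(m2.is_leaf, m2.requires_grad, m2.grad_fn)   # True False None
# (m2 * 2).sum().backward()           # would raise: the path back to x no longer exists
```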

2.5 The difference between clone() and detach()

clone() does not share memory with the original tensor; detach() shares memory with the original tensor.

clone() stays in the computation graph, so gradients can still flow back through it to the original tensor; detach() does not allow gradients to flow back.

If you want to keep the gradient of non-leaf nodes, you can use retain_grad().
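
A brief sketch of both differences (the tensor values are arbitrary):

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

c = x.clone()     # new memory, still part of the computation graph
d = x.detach()    # shared memory, removed from the graph

print(c.data_ptr() == x.data_ptr())   # False -- clone copies the storage
print(d.data_ptr() == x.data_ptr())   # True  -- detach shares the storage

c.sum().backward()                    # gradient flows back through clone()
print(x.grad)                         # tensor([1., 1.])
```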

3. optimizer.zero_grad()

optimizer.zero_grad() clears the gradient x.grad of every parameter x held by the optimizer. Do not forget to call it before each loss.backward(); otherwise the gradients from previous iterations accumulate, which is usually not what we expect (although some use cases deliberately rely on this accumulation). A minimal training-loop sketch follows.
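
A minimal training-loop sketch showing where zero_grad() usually goes (the model, data, and learning rate are placeholders):

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(2, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

x = torch.randn(8, 2)
target = torch.randn(8, 1)

for step in range(3):
    optimizer.zero_grad()                 # clear x.grad of every parameter in the optimizer
    loss = criterion(model(x), target)
    loss.backward()                       # fresh gradients, nothing carried over
    optimizer.step()                      # update the parameters
```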

4. loss.backward()

The loss function loss defines the standard for how good the model is: the smaller the loss, the better the model. Common loss functions include mean squared error MSE (Mean Squared Error), MAE (Mean Absolute Error), cross-entropy CE (Cross-Entropy), etc.

loss.backward(), as the name suggests, backpropagates the loss from the output toward the input side. For every variable x that requires a gradient (requires_grad=True), it computes the gradient $\frac{d}{dx}loss$ and accumulates it into x.grad, which is retained, namely:

$x.grad = x.grad + \frac{d}{dx}loss$
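
A tiny example of this accumulation (x = 3 is arbitrary):

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

(x ** 2).backward()
print(x.grad)    # tensor(6.)  -- d(x^2)/dx at x = 3

(x ** 2).backward()
print(x.grad)    # tensor(12.) -- the new gradient was added to the old one
```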

5. optimizer.step()

optimizer.step() is where the optimizer updates the value of each parameter x. Taking stochastic gradient descent (SGD) as an example, the learning rate lr controls the step size, i.e. $x = x - lr \cdot x.grad$; the minus sign is because the parameter is adjusted in the direction opposite to the gradient in order to reduce the cost.
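
A hand-rolled sketch of the same update for a single scalar parameter (lr = 0.1 is arbitrary):

```python
import torch

lr = 0.1
x = torch.tensor(3.0, requires_grad=True)

(x ** 2).backward()          # x.grad = 6
with torch.no_grad():
    x -= lr * x.grad         # x = x - lr * x.grad, the same rule plain SGD applies in step()

print(x)                     # tensor(2.4000, requires_grad=True)
```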
