param.grad, requires_grad, grad_fn: why is the grad/gradient None?

Basic concepts

1) is_leaf

The difference between leaf nodes and non-leaf nodes: the nodes in the computation graph are divided into leaf nodes and non-leaf nodes. A non-leaf node is one that is computed from other tensors, while a leaf node is not (for example, with b = a + 1, b needs a in order to be computed, so b is not a leaf node, while a is). Note, however, that if a tensor is created directly by the user, it is a leaf node even if other tensors later use it in calculations.

How to tell whether a tensor is a leaf node:

  • All tensors whose requires_grad is False are leaf nodes, that is, their is_leaf attribute returns True.
  • If a tensor is created directly by the user, it is a leaf node. A variable obtained by further computation on leaf nodes is called a non-leaf node. After backpropagation the gradient of a (requires_grad=True) leaf node is not None, whereas the gradients of non-leaf nodes are not kept in memory, so their grad is None:
import torch

# If a tensor is created by the user, it is a leaf node
x = torch.tensor([1.0, 2.0, 3.0, 4.0], requires_grad=True)
print(x.is_leaf)     # True

# A variable obtained by further computation on a leaf node is a non-leaf node
x = x.view(2, 2)
print(x.is_leaf)     # False

out = x.sum()
print(out.is_leaf)   # False

out.backward()
# The gradients of non-leaf nodes are not kept in memory, so asking a non-leaf node for its gradient gives None
print(x.grad)        # None (PyTorch also warns about accessing .grad of a non-leaf tensor)
  • A node produced by an operation whose inputs all have requires_grad set to False is still a leaf node. Setting its requires_grad to True afterwards does not change whether it is a leaf, but it does affect whether the nodes computed from it later are leaves (as sketched below). A guess at the reason for this design: since autograd did not record that such a node was produced by an operation, there is no way to turn it into a non-leaf node just by setting requires_grad.
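
A minimal sketch of this behaviour (variable names are illustrative):

import torch

a = torch.ones(2, 2)            # requires_grad is False by default
b = a * 2                       # produced by an operation, but autograd did not track it
print(b.is_leaf)                # True: no grad_fn was recorded

b.requires_grad_(True)          # b stays a leaf...
print(b.is_leaf)                # True

c = b.sum()                     # ...but nodes computed from it are now non-leaf
print(c.is_leaf)                # False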

2) grad

This attribute holds the gradient of a parameter. After backpropagation, gradients are computed for parameters with requires_grad=True and stored in param.grad. For parameters that do not require a gradient, or for which no gradient has been computed, param.grad is None. You can read param.grad to get the gradient of the parameter and then perform custom operations with it, such as a manual parameter update or gradient clipping.
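
A rough sketch of such custom use of param.grad, assuming a generic linear model and a made-up loss (none of this is code from the original post):

import torch

model = torch.nn.Linear(4, 1)
loss = model(torch.randn(8, 4)).sum()
loss.backward()

# read the stored gradients
for name, param in model.named_parameters():
    print(name, param.grad.shape)

# custom operations: gradient clipping, then a manual SGD-style update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
with torch.no_grad():
    for param in model.parameters():
        param -= 0.01 * param.grad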

When gradients are computed, only leaf nodes retain their gradients; to save memory, the grads of all intermediate (non-leaf) nodes are cleared after backward() has been computed.

When the backward() function is used to backpropagate and compute gradients, PyTorch does not compute and keep gradients for every tensor; it only keeps the gradients of tensors that meet all of the following conditions (see the sketch after this list).

  • it is a leaf node,
  • its requires_grad is True,
  • the requires_grad of all tensors that depend on it is True.
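
A quick sketch of these conditions: only the leaf tensor x with requires_grad=True ends up with a stored gradient.

import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)   # leaf, requires_grad=True
w = torch.tensor([3.0, 4.0])                        # leaf, requires_grad=False
y = (x * w).sum()                                   # non-leaf

y.backward()
print(x.grad)   # tensor([3., 4.])
print(w.grad)   # None: w does not require a gradient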

3) grad_fn

grad_fn records how a variable was computed. For example, with y = x * 3, y.grad_fn records the process by which y was computed from x. Therefore (a small example follows the two rules below):

  • When grad_fn is None: whether requires_grad is True or False, it is a leaf variable; that is, any tensor that is directly initialized is a leaf variable.
  • When grad_fn is not None: with requires_grad = False it would be a leaf variable, with requires_grad = True it is a non-leaf variable.
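
A small sketch of these rules:

import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
print(x.grad_fn)            # None: directly created, so x is a leaf

y = x * 3
print(y.grad_fn)            # <MulBackward0 ...>: records how y was computed from x
print(y.is_leaf)            # False

z = torch.zeros(2) * 3      # no input requires a gradient
print(z.grad_fn, z.is_leaf) # None True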

4) requires_grad

This is a boolean attribute indicating whether gradients should be computed for the parameter. By default, all model parameters have requires_grad set to True, so that gradients are computed during backpropagation. If you want to freeze parameters and prevent them from being updated, you can set this attribute to False. When you freeze a parameter, its gradient is no longer computed, which means that parameter will not be updated in subsequent training iterations.
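
For example, freezing part of a model is usually done like this (the two-layer model below is purely illustrative):

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(10, 10),   # pretend this is a pretrained backbone to freeze
    torch.nn.Linear(10, 2),    # head that we still want to train
)

# freeze the first layer: its gradients are no longer computed or used for updates
for param in model[0].parameters():
    param.requires_grad = False

# only hand the trainable parameters to the optimizer
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.01
)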

By convention, all tensors with the attribute requires_grad=False are leaf nodes (i.e. leaf tensors / leaf node tensors).

A tensor with the attribute requires_grad=True may be a leaf node tensor, or it may instead be an intermediate node tensor rather than a leaf. If the tensor has requires_grad=True and was created directly, i.e. its grad_fn is None, then it is a leaf node. If the tensor has requires_grad=True but was not created directly by the user and was instead produced from other tensors by some operation, then it is not a leaf tensor but an intermediate node tensor, and its grad_fn is not None, for example grad_fn=<MeanBackward0>, which means the tensor was produced by a torch.mean() operation and is an intermediate result; so it is an intermediate node tensor, not a leaf node tensor. To judge whether a tensor is a leaf node, check its attribute is_leaf.

The attribute requires_grad of a tensor indicates whether the gradient needs to be computed for this tensor during backpropagation. If the tensor has requires_grad=False, there is no need to compute a gradient for it, and no need to optimize it during learning.

In a PyTorch operation, if the requires_grad attribute of every input tensor participating in the operation is False, then the result of the operation, i.e. the output tensor, also has requires_grad=False; otherwise it is True. As long as one of the input tensors requires a gradient (requires_grad=True), the resulting tensor also requires a gradient (requires_grad=True). Only when none of the input tensors require a gradient does the resulting tensor not require one.
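
A minimal sketch of this propagation rule:

import torch

a = torch.ones(2)                      # requires_grad=False
b = torch.ones(2)                      # requires_grad=False
c = torch.ones(2, requires_grad=True)

print((a + b).requires_grad)   # False: no input requires a gradient
print((a + c).requires_grad)   # True: at least one input requires a gradient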


For a tensor with requires_grad=True, a gradient will be computed during backpropagation. However, PyTorch's automatic gradient machinery does not keep gradients for intermediate results: it only stores the gradients computed for leaf nodes, in the grad attribute of the leaf node tensor. The gradient of an intermediate node tensor is not saved in its grad attribute; this is done for efficiency, so the grad attribute of an intermediate node tensor is None. If you need to keep the gradient of an intermediate node, call retain_grad() on it, and its gradient will then be saved in its grad attribute.
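
A short sketch of retain_grad() keeping the gradient of an intermediate node:

import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0], requires_grad=True)
y = x.view(2, 2)        # intermediate (non-leaf) node
y.retain_grad()         # ask autograd to keep y's gradient

out = y.sum()
out.backward()

print(x.grad)           # tensor([1., 1., 1., 1.])
print(y.grad)           # tensor([[1., 1.], [1., 1.]]) instead of None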
 

During training, params.grad is NoneType (value is None)

First of all, the gradient in this case is not empty; it simply does not exist at all. There are many possible reasons, for example:

  1. params is not a leaf node
  2. The requires_grad attribute of params is False
  3. A layer is defined when the network is built (i.e. written in def __init__(self)) but is never used in def forward(). Since that layer never takes part in forward propagation, there is no backpropagation through it, so it has no gradient information and its grad shows up as NoneType.
  4. Before the backward() function is called, the grad attribute of both leaf and non-leaf nodes is None, no matter whether requires_grad=True was set (leaf node) or retain_grad() was called (non-leaf node).

Case 1:

import torch

a = torch.ones((2, 2), requires_grad=True).to("cuda")
b = a.sum()
b.backward()

print(a.is_leaf)
print(a.grad)

Output: False, None

Reason:

Since .to("cuda") is an operation, the a obtained above is its result, so it is no longer a leaf node.

Change it into:

import torch

a = torch.ones((2, 2), requires_grad=True)
c = a.to("cuda")
b = c.sum()
b.backward()

print(a.is_leaf)
print(a.grad)

Case 2:

When defining a parameter that involves a multiplication, put the whole expression inside torch.nn.Parameter() instead of multiplying outside of it.

Wrong:

self.miu = torch.nn.Parameter(torch.ones(self.dimensional)) * 0.01

should be

self.miu = torch.nn.Parameter(torch.ones(self.dimensional) * 0.01)
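
The reason is that the multiplication outside torch.nn.Parameter() produces an ordinary non-leaf tensor, which is not registered as a parameter of the module. A small check (the Demo class and attribute names are made up for illustration):

import torch

class Demo(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # wrong: the result of "* 0.01" is a plain non-leaf Tensor, not a Parameter
        self.miu_wrong = torch.nn.Parameter(torch.ones(3)) * 0.01
        # right: the whole initial value goes inside Parameter, so it stays a leaf
        self.miu_right = torch.nn.Parameter(torch.ones(3) * 0.01)

m = Demo()
print(type(m.miu_wrong), m.miu_wrong.is_leaf)       # <class 'torch.Tensor'> False
print(type(m.miu_right), m.miu_right.is_leaf)       # Parameter, True
print([name for name, _ in m.named_parameters()])   # only 'miu_right' is registered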

What if the grad is NoneType (the value is None)?

Print the gradient of every parameter in the network (model.named_parameters()) and check at which layer the vanishing-gradient problem begins: https://blog.csdn.net/weixin_43135178/article/details/131754210
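
A sketch of that check; the toy two-layer model stands in for the real network, the point is the loop over named_parameters():

import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 1))
loss = model(torch.randn(2, 4)).sum()
loss.backward()

# walk through every parameter and see where gradients stop appearing
for name, param in model.named_parameters():
    state = "None" if param.grad is None else "present"
    print(f"{name}: requires_grad={param.requires_grad}, grad is {state}")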

It turned out that the problem is here:

Searching for "q_proj", ".key_value", and "query_key_value" respectively shows that they all use "mpu.ColumnParallelLinear + mpu.RowParallelLinear", so the problem should lie in these two classes.

 

[Note] The main thing to check is whether the code is wrapped in torch.no_grad() or uses detach(): inside such a block (or after a detach()), gradient tracking is turned off, so no gradient is kept for the tensors computed there.
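
A minimal illustration of how both of them cut the gradient flow:

import torch

x = torch.ones(2, requires_grad=True)

with torch.no_grad():
    y = x * 2           # computed without gradient tracking
print(y.requires_grad)  # False: nothing will flow back through y

z = (x * 2).detach()    # detach() also cuts the graph
print(z.requires_grad)  # False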

There are several ways to debug this:

1. Check whether the gradient of the variable is 0 or None. For PyTorch intermediate variables, see the blog post on how to print their gradients: pytorch obtains the gradient of the intermediate variable

If it is None or 0, the gradient has not reached that variable. Print the gradients of the variables along the code path until a gradient appears, then check why the gradient disappears at that point.

2. After printing the gradient, check whether the gradient multiplied by the learning rate is too small. For example, if the gradient is 5e-2, the learning rate is 1e-4, and the value of the variable only keeps five decimal places, then the update is too small to change the variable and is effectively ignored; the learning rate needs to be increased.

3. Most importantly, check whether the class holding the parameter has been added to the optimizer's list of parameters to optimize.

(Otherwise, although the gradient is backpropagated, the optimizer will not touch your parameters. This was the cause of the problem in my code this time; a sketch of the check follows.)
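
One way to check this (the small model and the forgotten extra parameter are made up for illustration) is to compare parameter ids against the optimizer's param_groups:

import torch

model = torch.nn.Linear(4, 1)
extra = torch.nn.Parameter(torch.zeros(4))            # imagine this one was forgotten
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# collect every tensor the optimizer will actually update
optimized = {id(p) for group in optimizer.param_groups for p in group["params"]}

for name, p in list(model.named_parameters()) + [("extra", extra)]:
    print(name, "in optimizer:", id(p) in optimized)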

4. Check whether the variable is replaced before the optimizer's step() function, i.e. whether the parameter is reassigned after the gradient has been backpropagated but before step() is called. (Uncommon.)

Note: if you have a list of model classes, do not use a plain Python list; use nn.ModuleList. If a plain list containing three A instances is used as an attribute of class B (assigned in its __init__ function), then none of the parameters in that list (the parameters of the A instances) will be optimized. Using nn.ModuleList avoids this, as sketched below.
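
A small sketch of the difference (the class names A and B follow the example above):

import torch

class A(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(2, 2)

class B(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks_list = [A(), A(), A()]                         # plain list: parameters are NOT registered
        self.blocks_mlist = torch.nn.ModuleList([A(), A(), A()])   # ModuleList: parameters are registered

b = B()
print(len(list(b.parameters())))   # 6: only the tensors from blocks_mlist are visible to the optimizer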

Custom network parameters in PyTorch: gradients exist but are not updated (Programmer Sought)

The problem that the gradient is None after PyTorch loss backpropagation (Programmer Sought)

PyTorch computation graph, gradient-related operations, fixed-parameter training, and why grad is NoneType during training (Zhihu)

Some pitfalls of torch requires_grad / backward / is_leaf (Jianshu)

"requires_grad=True is set, but the computed gradient (grad) is None" (AINLPer's Blog, CSDN)

https://www.cnblogs.com/Monster1728/p/15865708.html
