with torch.no_grad() and backward()

I ran into with torch.no_grad() while programming, and while trying to understand it I found a few unexpected behaviors, so I am writing them down here.
First, the environment: all of the tests below were run with Python 3.6 and PyTorch 1.2.0.
The relevant part of the official documentation comes down to two important points:

  • torch.no_grad is a context manager: you can use it to disable gradient calculation when you are sure you will not call Tensor.backward().
  • A tensor computed inside a torch.no_grad block has requires_grad set to False.
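
A minimal sketch (my own addition, not from the documentation) to check both points directly:

import torch

x = torch.tensor([1.0], requires_grad=True)

print(torch.is_grad_enabled())      # True: gradient tracking is on by default
with torch.no_grad():
    print(torch.is_grad_enabled())  # False: tracking is disabled inside the block
    y = x * 2
print(y.requires_grad)              # False: y was created under no_grad
print(x.requires_grad)              # True: flags on pre-existing tensors are untouched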

Let me walk through my thought process, from simple to more involved, with a few computation graphs:

1. The simplest case first, which behaves exactly as expected

import torch
a = torch.tensor([1.1], requires_grad=True)
b = a * 2
print(b)
c = b + 2
print(c)
c.backward()
print(a.grad)
### answer
tensor([2.2000], grad_fn=<MulBackward0>)
tensor([4.2000], grad_fn=<AddBackward0>)
tensor([2.])

You can see that both tensors b and c record a grad_fn, which means their requires_grad is True, and after backpropagating from c, the gradient of a is 2.0 [note that it is 2.0 rather than 2: gradient tensors are generally torch.float32, never integers].
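
As a quick side check (my own illustration, not part of the original example), both the gradient dtype and the floating-point requirement can be verified directly:

import torch

a = torch.tensor([1.1], requires_grad=True)
c = a * 2 + 2
c.backward()
print(a.grad)        # tensor([2.])
print(a.grad.dtype)  # torch.float32: gradients are floating point, never integer

# requires_grad=True is only allowed on floating-point tensors; uncommenting
# the next line raises:
# RuntimeError: Only Tensors of floating point dtype can require gradients
# i = torch.tensor([1], requires_grad=True)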

Now, with a slight change, let's look at what with torch.no_grad() does:

import torch
a = torch.tensor([1.1], requires_grad=True)
b = a * 2
print(b)
with torch.no_grad():
    c = b + 2
print(c)
print(c.requires_grad) 
c.backward()  # raises the RuntimeError described below
print(a.grad)
### answer
tensor([2.2000], grad_fn=<MulBackward0>)
tensor([4.2000])
False

You can see that under with torch.no_grad(), tensor c no longer has a grad_fn, i.e. the operation is not tracked for gradients, and requires_grad is False. Calling c.backward() now raises: RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
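
A small sketch (my own addition) that reproduces the error and shows how checking requires_grad first avoids it:

import torch

a = torch.tensor([1.1], requires_grad=True)
b = a * 2
with torch.no_grad():
    c = b + 2

# c has no grad_fn, so backward() on it fails
try:
    c.backward()
except RuntimeError as e:
    print(e)  # element 0 of tensors does not require grad and does not have a grad_fn

# checking requires_grad beforehand avoids the exception
if c.requires_grad:
    c.backward()
else:
    print("c is not part of the autograd graph; skipping backward")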

2. Up to this point, everything matches the official documentation. But I also wondered: if tensor c goes on to participate in further operations, what happens at c when the result is backpropagated?

This leads to the following computation graph and code:

import torch
a = torch.tensor([1.1], requires_grad=True)
b = a * 2
print(b)
with torch.no_grad():
    c = b + 2
print(c)
print(c.requires_grad)
d = torch.tensor([10.0], requires_grad=True)
e = c * d
print(e)
print(e.requires_grad)
e.backward()
print(a.grad)
### answer
tensor([2.2000], grad_fn=<MulBackward0>)
tensor([4.2000])
False
tensor([42.], grad_fn=<MulBackward0>)
True
None

Something surprising happens here: tensor e, produced by multiplying c (requires_grad=False) with d (requires_grad=True), has requires_grad=True. I then called e.backward() to see what gradient a would end up with, and it turns out to be None. This is exactly the kind of "error" the documentation warns about: with torch.no_grad() should only be used when backward is not needed, otherwise the gradients of some tensors come out unexpectedly.
Analysis: because the gradient is no longer tracked at c, the upstream gradient is effectively cut off there, so a has no way to receive a gradient and its grad stays None.
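
For comparison (my own check, not in the original post), here is the same graph built without the no_grad block; it shows the gradient that gets blocked, since e = (2a + 2) * d gives de/da = 2d = 20:

import torch

a = torch.tensor([1.1], requires_grad=True)
b = a * 2
c = b + 2                      # tracked this time: c.requires_grad is True
d = torch.tensor([10.0], requires_grad=True)
e = c * d
e.backward()
print(a.grad)                  # tensor([20.])
print(d.grad)                  # tensor([4.2000]), i.e. the value of c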

3. Let a have another path

Since the gradient coming through c is blocked, I let a participate in another operation to create an extra path. What will a's gradient be then? This leads to the following computation graph and code:

import torch
a = torch.tensor([1.1], requires_grad=True)
b = a * 2
print(b)
with torch.no_grad():
    c = b + 2
print(c)
print(c.requires_grad)
d = torch.tensor([10.0], requires_grad=True)
e = c * d
print(e)
print(e.requires_grad)
f = a + e
print(f)
f.backward()
print(a.grad)
### answer
tensor([2.2000], grad_fn=<MulBackward0>)
tensor([4.2000])
False
tensor([42.], grad_fn=<MulBackward0>)
True
tensor([43.1000], grad_fn=<AddBackward0>)
tensor([1.])

You can see that a now receives a gradient through the other path (the direct a + e edge contributes df/da = 1), but this is obviously not the full gradient we would want.
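
Again for comparison (my own check), the same graph without the no_grad block: f = a + (2a + 2) * d, so df/da = 1 + 2d = 21 rather than the 1.0 obtained above:

import torch

a = torch.tensor([1.1], requires_grad=True)
b = a * 2
c = b + 2
d = torch.tensor([10.0], requires_grad=True)
e = c * d
f = a + e
f.backward()
print(a.grad)   # tensor([21.])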

Summary

Here is a summary of the main points to keep in mind:

  • 1. Any tensor with requires_grad=True must have a floating-point dtype, not an integer one, and the same goes for its grad; otherwise you get an error like: RuntimeError: Only Tensors of floating point dtype can require gradients
  • 2. with torch.no_grad() saves computing gradients for some tensors and reduces the computational load, but if backward is called afterwards, errors or unexpected gradients are likely. Only use it when you are sure there will be no backward pass; if GPU memory allows, simply not using with torch.no_grad() avoids this kind of mistake.
  • 3. Experimentally, even when an in-place operation is wrapped in with torch.no_grad(), the tensor still shows a grad_fn (the one recorded before the in-place op) and backward still works; the in-place multiplication itself is simply not recorded, which is why a.grad below is 2.0 rather than 4.0:
import torch
a = torch.tensor([1.1], requires_grad=True)
b = a * 2
with torch.no_grad():
    b.mul_(2)
print(b)
b.backward()
print(a.grad)
### answer
tensor([4.4000], grad_fn=<MulBackward0>)
tensor([2.])
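
For comparison (my own check, not in the original), the same in-place multiplication performed outside the no_grad block is recorded, so the gradient doubles:

import torch

a = torch.tensor([1.1], requires_grad=True)
b = a * 2
b.mul_(2)       # recorded this time; b is a non-leaf tensor, so in-place is allowed
b.backward()
print(a.grad)   # tensor([4.])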

Origin blog.csdn.net/laizi_laizi/article/details/112711521