PyTorch autograd explained (backward, autograd.grad)

PyTorch uses a dynamic graph: the computation graph is built as the operations run, so results can be inspected at any point. TensorFlow, by contrast, uses a static graph.

There are only two kinds of elements in PyTorch's computation graph: data (tensors) and operations.

Operations include addition, subtraction, multiplication, division, square root, exponentials and logarithms, trigonometric functions, and so on.

Data can be divided into leaf nodes and non-leaf nodes. Leaf nodes are nodes created by the user that do not depend on other nodes. The difference between the two is that after backpropagation finishes, the gradients of non-leaf nodes are released and only the gradients of leaf nodes are retained, which saves memory. If you want to preserve the gradient of a non-leaf node, call its retain_grad() method.
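A minimal sketch of retain_grad(): without the call, the gradient of the non-leaf node a below would be None after backward().

```python
import torch

x = torch.tensor(2., requires_grad=True)  # leaf node, created by the user
a = x + 1                                 # non-leaf node
a.retain_grad()                           # ask autograd to keep a.grad after backward
y = a * a

y.backward()
print(a.grad)  # tensor(6.) -- would be None without retain_grad()
print(x.grad)  # tensor(6.)
```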

A torch.Tensor has the following attributes:

  • requires_grad — whether the tensor participates in gradient computation
  • grad_fn — the operation that produced the tensor
  • is_leaf — whether the tensor is a leaf node
  • grad — the stored gradient value

For the requires_grad attribute: user-created leaf nodes default to False, weights in a neural network default to True, and a non-leaf node's requires_grad is True whenever any of its inputs requires gradients. A useful rule of thumb for which nodes end up True is: there must be a differentiable path from the leaf node whose gradient you need up to the loss node.
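These defaults can be checked directly; note in particular that a non-leaf node built only from non-requires-grad inputs also ends up False:

```python
import torch
import torch.nn as nn

x = torch.tensor(1.)        # user-created leaf: requires_grad defaults to False
w = nn.Linear(2, 1).weight  # network weight: requires_grad defaults to True
y = x + 1                   # non-leaf whose only input does not require grad

print(x.requires_grad)  # False
print(w.requires_grad)  # True
print(y.requires_grad)  # False -- no requires_grad=True leaf feeds into it
```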

When we want to compute the gradient of a Tensor, we first need to set its requires_grad attribute to True. There are two main ways to do this:

x = torch.tensor(1.).requires_grad_()  # first way

x = torch.tensor(1., requires_grad=True)  # second way

PyTorch provides two methods for computing gradients: backward() and torch.autograd.grad(). The difference between them is that the former fills in the .grad field of leaf nodes, while the latter returns the gradients to you directly; examples follow below. It is also worth knowing that y.backward() is equivalent to torch.autograd.backward(y).

Using backward()

x = torch.tensor(2., requires_grad=True)

a = torch.add(x, 1)
b = torch.add(x, 2)
y = torch.mul(a, b)

y.backward()
print(x.grad)
>>>tensor(7.)

Take a look at the properties of these tensors

print("requires_grad: ", x.requires_grad, a.requires_grad, b.requires_grad, y.requires_grad)
print("is_leaf: ", x.is_leaf, a.is_leaf, b.is_leaf, y.is_leaf)
print("grad: ", x.grad, a.grad, b.grad, y.grad)

>>>requires_grad:  True True True True
>>>is_leaf:  True False False False
>>>grad:  tensor(7.) None None None

When backward() is used to compute gradients, it does not compute the gradient of every tensor; it only computes gradients for tensors that satisfy all of these conditions: 1. the tensor is a leaf node; 2. requires_grad=True; 3. the tensors that depend on it also have requires_grad=True (i.e. there is a differentiable path to the output). The gradients of all qualifying tensors are automatically saved into their grad attribute.
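A quick illustration of these conditions: a leaf with requires_grad=False participates in the forward computation but gets no gradient stored.

```python
import torch

x = torch.tensor(2., requires_grad=True)  # leaf, requires_grad=True
c = torch.tensor(3.)                      # leaf, but requires_grad=False
y = x * c

y.backward()
print(x.grad)  # tensor(3.) -- satisfies all three conditions
print(c.grad)  # None -- requires_grad is False, so no gradient is stored
```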

Using autograd.grad()

x = torch.tensor(2., requires_grad=True)

a = torch.add(x, 1)
b = torch.add(x, 2)
y = torch.mul(a, b)

grad = torch.autograd.grad(outputs=y, inputs=x)
print(grad[0])
>>>tensor(7.)

Because the output y and the input x are specified, the return value is the gradient ∂y/∂x. The full return value is actually a tuple, with one gradient per specified input; since only x was passed here, we just keep the first element.
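To see the tuple structure, pass several inputs at once — the gradients come back in the same order (here with respect to the intermediate nodes a and b):

```python
import torch

x = torch.tensor(2., requires_grad=True)
a = x + 1  # a = 3
b = x + 2  # b = 4
y = a * b

# One gradient per entry of inputs, in order
grads = torch.autograd.grad(outputs=y, inputs=(a, b))
print(grads)  # (tensor(4.), tensor(3.)) -- dy/da = b = 4, dy/db = a = 3
```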

To compute first-order derivatives you can use backward()

x = torch.tensor(2., requires_grad=True)
y = torch.tensor(3., requires_grad=True)

z = x * x * y

z.backward()
print(x.grad, y.grad)
>>>tensor(12.) tensor(4.)

They can also be computed with autograd.grad()

x = torch.tensor(2.).requires_grad_()
y = torch.tensor(3.).requires_grad_()

z = x * x * y

grad_x = torch.autograd.grad(outputs=z, inputs=x)
print(grad_x[0])
>>>tensor(12.)

Why not compute the gradient with respect to y at the same time? Because the computation graph is freed after one gradient computation, no matter whether backward() or autograd.grad() is used; to keep it around for another call, you need to pass retain_graph=True.

x = torch.tensor(2.).requires_grad_()
y = torch.tensor(3.).requires_grad_()

z = x * x * y

grad_x = torch.autograd.grad(outputs=z, inputs=x, retain_graph=True)
grad_y = torch.autograd.grad(outputs=z, inputs=y)

print(grad_x[0], grad_y[0])
>>>tensor(12.) tensor(4.) 

Now let's look at higher-order derivatives. In theory, it is just taking the grad_x from above and differentiating it with respect to x again. Let's try it.

x = torch.tensor(2.).requires_grad_()
y = torch.tensor(3.).requires_grad_()

z = x * x * y

grad_x = torch.autograd.grad(outputs=z, inputs=x, retain_graph=True)
grad_xx = torch.autograd.grad(outputs=grad_x, inputs=x)

print(grad_xx[0])
>>>RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

An error is reported. Although retain_graph=True keeps the computation graph and the gradients of intermediate variables, it does not record the operations that produced grad_x itself. You need create_graph=True, which builds an additional computation graph on top of the original one for the derivative — that is, it records the operation ∂z/∂x = 2xy so it can itself be differentiated.

# autograd.grad() + autograd.grad()
x = torch.tensor(2.).requires_grad_()
y = torch.tensor(3.).requires_grad_()

z = x * x * y

grad_x = torch.autograd.grad(outputs=z, inputs=x, create_graph=True)
grad_xx = torch.autograd.grad(outputs=grad_x, inputs=x)

print(grad_xx[0])
>>>tensor(6.)

Here grad_xx can also be obtained with backward(), which is equivalent to backpropagating directly from ∂z/∂x = 2xy:

# autograd.grad() + backward()
x = torch.tensor(2.).requires_grad_()
y = torch.tensor(3.).requires_grad_()

z = x * x * y

grad = torch.autograd.grad(outputs=z, inputs=x, create_graph=True)
grad[0].backward()

print(x.grad)
>>>tensor(6.)

You can also call backward() first and then continue differentiating the first derivative stored in x.grad:

# backward() + autograd.grad()
x = torch.tensor(2.).requires_grad_()
y = torch.tensor(3.).requires_grad_()

z = x * x * y

z.backward(create_graph=True)
grad_xx = torch.autograd.grad(outputs=x.grad, inputs=x)

print(grad_xx[0])
>>>tensor(6.)

Can backward() be used twice in a row, backpropagating from x.grad the second time? Let's try:

# backward() + backward()
x = torch.tensor(2.).requires_grad_()
y = torch.tensor(3.).requires_grad_()

z = x * x * y

z.backward(create_graph=True) # x.grad = 12
x.grad.backward()

print(x.grad)
>>>tensor(18., grad_fn=<CopyBackwards>)

A problem appears: the result is 18 rather than 6. The x gradient after the first backward pass was 12, and PyTorch accumulates gradients across backward() calls by default, so the second pass adds 6 onto the existing 12. You need to clear the previous gradient manually:

x = torch.tensor(2.).requires_grad_()
y = torch.tensor(3.).requires_grad_()

z = x * x * y

z.backward(create_graph=True)
x.grad.data.zero_()
x.grad.backward()

print(x.grad)
>>>tensor(6., grad_fn=<CopyBackwards>)

Notice that all the derivations so far were for scalar outputs. What if the output is not a scalar?

x = torch.tensor([1., 2.]).requires_grad_()
y = x + 1

y.backward()
print(x.grad)
>>>RuntimeError: grad can be implicitly created only for scalar outputs

An error was reported, because gradients can only be implicitly created for a scalar output: x may be a scalar or a vector, but y must be a scalar. So you just need to reduce y to a scalar first; summing works, since it does not affect the individual partial derivatives.

x = torch.tensor([1., 2.]).requires_grad_()
y = x * x

y.sum().backward()
print(x.grad)
>>>tensor([2., 4.])

 

x = torch.tensor([1., 2.]).requires_grad_()
y = x * x

y.backward(torch.ones_like(y))
print(x.grad)
>>>tensor([2., 4.])

autograd.grad() can also be used. Both above and here, torch.ones_like(y) is the vector that left-multiplies the Jacobian matrix (a vector–Jacobian product).

x = torch.tensor([1., 2.]).requires_grad_()
y = x * x

grad_x = torch.autograd.grad(outputs=y, inputs=x, grad_outputs=torch.ones_like(y))
print(grad_x[0])
>>>tensor([2., 4.])

or:

x = torch.tensor([1., 2.]).requires_grad_()
y = x * x

grad_x = torch.autograd.grad(outputs=y.sum(), inputs=x)
print(grad_x[0])
>>>tensor([2., 4.])

Gradient reset

PyTorch's autograd does not clear gradients automatically; they accumulate across backward passes, so they need to be cleared manually after each backpropagation.

x.grad.zero_()
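The accumulation, and the effect of zero_(), can be seen directly:

```python
import torch

x = torch.tensor(2., requires_grad=True)

y = x * x
y.backward()
print(x.grad)  # tensor(4.)

y = x * x
y.backward()
print(x.grad)  # tensor(8.) -- the second gradient was added to the first

x.grad.zero_()  # clear before the next backward pass
y = x * x
y.backward()
print(x.grad)  # tensor(4.)
```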

Whereas in a neural network, we only need to perform

optimizer.zero_grad()

Using detach() to cut off the graph

No gradient will be computed past the detach point. Suppose we have model A and model B, and we need to use A's output as B's input, but during training we only want to train model B; then we can do this:

input_B = output_A.detach()

Taking the previous example and detaching a, only the path through b remains, and a becomes a leaf node.

x = torch.tensor([2.], requires_grad=True)

a = torch.add(x, 1).detach()
b = torch.add(x, 2)
y = torch.mul(a, b)

y.backward()

print("requires_grad: ", x.requires_grad, a.requires_grad, b.requires_grad, y.requires_grad)
print("is_leaf: ", x.is_leaf, a.is_leaf, b.is_leaf, y.is_leaf)
print("grad: ", x.grad, a.grad, b.grad, y.grad)

>>>requires_grad:  True False True True
>>>is_leaf:  True True False False
>>>grad:  tensor([3.]) None None None

In-place operations

(they change the value without changing the object's address)

Leaf nodes that require gradients cannot be modified in-place, because backpropagation needs to access their original values through the saved object address.
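Attempting it raises a RuntimeError immediately, before any backward pass:

```python
import torch

x = torch.tensor(2., requires_grad=True)
try:
    x.add_(1)  # in-place op on a leaf that requires grad
except RuntimeError as e:
    print(e)   # a leaf Variable that requires grad is being used in an in-place operation
```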
Reprinted from: An article explaining PyTorch's autograd (backward, autograd.grad) - Zhihu (zhihu.com)


Origin blog.csdn.net/swx595182208/article/details/130084925