A detailed explanation of backward() in PyTorch

The basic workflow for building a deep learning model is: build a computation graph, obtain a loss function, compute the derivatives of the loss with respect to the model parameters, and then update the parameters with a method such as gradient descent. Building the computation graph is called "forward propagation". We have to do this part ourselves, because it is where we design the structure of the model. Computing the derivatives starting from the loss function is called "backpropagation". Differentiating by hand is hard work, so automatic differentiation is one of the most basic and most important features of every deep learning framework, and PyTorch is no exception.

1. A first look at PyTorch automatic differentiation

For example, take the function y = x^2. Its derivative is dy/dx = 2x, which equals 6 at x = 3. The code below demonstrates this.

    import torch

    x = torch.tensor(3.0, requires_grad=True)
    y = torch.pow(x, 2)

    # Check whether x and y require gradients
    print(x.requires_grad)
    print(y.requires_grad)

    # Differentiate by calling backward()
    y.backward()

    # Inspect the derivative, also known as the gradient
    print(x.grad)

The output is:

    True
    True
    tensor(6.)

This is exactly the same as our own calculation.

Here are the key points.

1.1 Tensor creation and attribute setting

First look at the signature of torch.tensor:

tensor(data, dtype=None, device=None, requires_grad=False) -> Tensor

Parameters:

data (array_like): initial value of the tensor. It can be a list, tuple, NumPy array, scalar, etc.
dtype: data type of the tensor elements.
device: the device (CPU or GPU) that holds the tensor; the default is None.
requires_grad: whether a gradient should be computed for the tensor; the default is False, meaning no gradient is tracked.
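To make the parameters concrete, here is a minimal sketch that creates two tensors and inspects their attributes. The variable names and the values are only illustrative, and the device is assumed to be the CPU so that the snippet runs anywhere.

```python
import torch

# Illustrative values; dtype, device and requires_grad are set explicitly
w = torch.tensor([0.2, 0.4, 0.6], dtype=torch.float64, device="cpu", requires_grad=True)

# Only data is given, so requires_grad falls back to its default of False
c = torch.tensor([1.5, 2.5, 3.5])

print(w.dtype, w.device, w.requires_grad)
print(c.dtype, c.requires_grad)
```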

(1) The requires_grad attribute of a tensor

Every tensor has a requires_grad attribute that indicates whether a gradient is tracked for it. If it is True, derivatives with respect to the tensor can be computed; otherwise they cannot. The syntax is:

x.requires_grad returns a Boolean value that tells whether the tensor x requires a gradient.

Note that a function y fails to require a gradient only when none of its leaf variables requires one; as soon as one leaf variable has requires_grad=True, so does y. What is a leaf variable? The concept comes from the computation graph (roughly, a tensor created directly by the user rather than produced by an operation), and the following example makes it concrete:

    import torch

    # A function of two variables, z = f(x, y) = x^2 + y^2;
    # x requires a gradient, y does not
    x = torch.tensor(3.0, requires_grad=True)
    y = torch.tensor(4.0, requires_grad=False)
    z = torch.pow(x, 2) + torch.pow(y, 2)

    # Check whether x, y and z require gradients
    print(x.requires_grad)
    print(y.requires_grad)
    print(z.requires_grad)

    # Differentiate by calling backward()
    z.backward()

    # Inspect the derivatives, also known as gradients
    print(x.grad)
    print(y.grad)

The output is:

    True        # x requires a gradient
    False       # y does not require a gradient
    True        # z requires a gradient because one of its leaf variables, x, does
    tensor(6.)  # derivative of z with respect to x
    None        # y does not require a gradient, so y.grad is None

If the leaf variable x above is also set to not require a gradient, then z does not require one either: when neither x nor y requires a gradient, there is nothing for z to be differentiated with respect to.
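The case where no leaf variable requires a gradient can be checked with a short sketch. It reuses the values from the example above; the try/except is added here only to capture the error instead of stopping the script.

```python
import torch

x = torch.tensor(3.0, requires_grad=False)
y = torch.tensor(4.0, requires_grad=False)
z = torch.pow(x, 2) + torch.pow(y, 2)

# No leaf variable requires a gradient, so z does not either
print(z.requires_grad)

try:
    z.backward()            # nothing was recorded, so this call fails
    backward_error = None
except RuntimeError as err:
    backward_error = str(err)

print(backward_error)
```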

(2) The requires_grad_() method of a leaf variable (which is also a tensor)

What if a leaf variable does not require a gradient at first, but we want to change that later, or vice versa? Tensors provide a method for this:

x.requires_grad_(True) or x.requires_grad_(False) sets whether the tensor x requires a gradient. Note the trailing underscore!

Note that this flag can only be changed on leaf variables; otherwise the following error occurs:

RuntimeError: you can only change requires_grad flags of leaf variables.
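A minimal sketch of both situations: switching the flag on a leaf tensor, and trying to switch it off on a tensor produced by an operation. The values are illustrative, and the try/except only captures the message so the script keeps running.

```python
import torch

x = torch.tensor(3.0)        # leaf tensor, requires_grad defaults to False
x.requires_grad_(True)       # switch gradient tracking on, in place
y = torch.pow(x, 2)          # non-leaf tensor: the result of an operation on x

try:
    y.requires_grad_(False)  # changing the flag on a non-leaf tensor
    error_message = None
except RuntimeError as err:
    error_message = str(err)

print(x.requires_grad, y.is_leaf)
print(error_message)
```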

1.2 Differentiating a function: the y.backward() method

The examples above only show differentiation of simple functions.

Note that for a composite function, say y is a function of x, z is a function of y, and f is a function of z, f.backward() by default only stores the derivative of f with respect to the leaf variable x. That value is read directly from x.grad. The derivatives with respect to the intermediate variables y and z are not kept, so y.grad and z.grad are both None.
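The sketch below illustrates this with a hypothetical chain x -> y -> z -> f. It also uses retain_grad(), a tensor method not mentioned in the original text, to show one way of keeping the gradient of an intermediate variable when it is needed.

```python
import warnings
import torch

x = torch.tensor(2.0, requires_grad=True)  # leaf variable
y = torch.pow(x, 2)                        # intermediate: y = x^2
z = y + 1                                  # intermediate: z = y + 1
f = 3 * z                                  # output:       f = 3z

y.retain_grad()   # ask autograd to keep df/dy on this non-leaf tensor
f.backward()

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # reading .grad of a non-leaf tensor warns
    z_grad = z.grad                  # not retained, so this is None

print(x.grad)   # df/dx = 3 * 2x = 12
print(y.grad)   # df/dy = 3, kept only because of retain_grad()
print(z_grad)   # None
```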

Let's take a look at the signature of backward:

backward(gradient=None, retain_graph=None, create_graph=False)

All three parameters are optional, and none of them was used in the examples above. They are covered in detail later, so we skip them here.

1.3 Viewing the computed derivative: the x.grad attribute

The computed gradient is read from the grad attribute of the tensor.

Summary:

(1) torch.tensor() takes the requires_grad keyword argument.

(2) The x.requires_grad attribute tells whether a tensor requires a gradient.

(3) The x.requires_grad_() method changes whether a leaf variable requires a gradient.

(4) The automatic differentiation method y.backward(), called directly, only stores derivatives for the leaf nodes of the computation graph.

(5) The x.grad attribute holds the computed derivative.

Common pitfalls:

Why are the scalar values above written as 3.0 and 4.0 rather than as integers? Because a tensor must have a floating-point (or complex) dtype to require a gradient, so the initial values need a decimal point ("."); otherwise an error is raised. In other words, [1,2,3] does not work and should be written [1.,2.,3.]: the former gives an integer tensor, the latter a floating-point tensor, and only floating-point tensors can be differentiated.
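The pitfall is easy to reproduce. The following sketch tries to create an integer tensor that requires a gradient and then shows two ways around it; passing dtype explicitly is an alternative added here, not something from the text above.

```python
import torch

try:
    torch.tensor([1, 2, 3], requires_grad=True)   # integer dtype
    int_error = None
except RuntimeError as err:
    int_error = str(err)

# Fix 1: write the values with a decimal point
x = torch.tensor([1., 2., 3.], requires_grad=True)

# Fix 2 (alternative): keep integer literals but request a float dtype
w = torch.tensor([1, 2, 3], dtype=torch.float32, requires_grad=True)

print(int_error)
print(x.dtype, w.dtype)
```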

2. The core function of differentiation: backward() in detail

2.1 Default differentiation rules

By default, PyTorch only differentiates a [scalar] with respect to a [scalar], or a [scalar] with respect to a [vector/matrix]. This is critical!

(1) Derivative of a scalar with respect to a scalar

See the examples above: x, y and z are all scalars, so the differentiation is straightforward and is not repeated here.

(2) Derivative of a scalar with respect to a vector/matrix

Why is scalar with respect to vector/matrix the default? Because in deep learning we usually differentiate the loss function. The loss is generally a scalar, the sum of the losses of all terms, while the parameters are usually vectors or matrices, so this is the natural default. See the example below.

For example, consider a simple neural network with an input layer of 3 nodes and an output layer of 1 node. For one sample we have

X = (x1, x2, x3) = (1.5, 2.5, 3.5), so X has shape (1,3). The weight matrix of the output layer is W = (w1, w2, w3)^T = (0.2, 0.4, 0.6)^T, the initial weights, where T denotes the transpose, so W has shape (3,1). The bias term is the scalar b = 0.1. The model is then:

Y = XW + b, where W and b are the variables whose derivatives we need. Y is a scalar, W is a vector and b is a scalar; W and b are leaf nodes (leaf variables).

Expanding the above gives:

Y = x1*w1 + x2*w2 + x3*w3 + b (the 1, 2, 3 here are subscripts, not powers!)

Computing by hand:

The derivative of Y with respect to w1 is 1.5

The derivative of Y with respect to w2 is 2.5

The derivative of Y with respect to w3 is 3.5

The derivative of Y with respect to b is 1

Let's verify it:

    import torch

    # A multivariate function, Y = XW + b = x1*w1 + x2*w2 + x3*w3 + b;
    # X does not require a gradient, W and b do
    X = torch.tensor([1.5, 2.5, 3.5], requires_grad=False)
    W = torch.tensor([0.2, 0.4, 0.6], requires_grad=True)
    b = torch.tensor(0.1, requires_grad=True)
    Y = torch.add(torch.dot(X, W), b)

    # Check whether each tensor requires a gradient
    print(X.requires_grad)
    print(W.requires_grad)
    print(b.requires_grad)
    print(Y.requires_grad)

    # Differentiate by calling backward()
    Y.backward()

    # Inspect the derivatives, also known as gradients
    print(W.grad)
    print(b.grad)

The output is:

    False
    True
    True
    True
    tensor([1.5000, 2.5000, 3.5000])
    tensor(1.)

We found this to be the same as our own calculations.

(3) A closer look at differentiating a scalar with respect to a vector/matrix

Consider the composite function below, which operates on a matrix and is defined as follows:

x is a (2,3) matrix that requires a gradient; it is a leaf node (leaf variable).
y is an intermediate variable; since x requires a gradient, so does y.
z is an intermediate variable; since x and y require gradients, so does z.
f is the mean of all elements of z, so the final result is a scalar.

    import torch

    x = torch.tensor([[1., 2., 3.], [4., 5., 6.]], requires_grad=True)
    y = torch.add(x, 1)
    z = 2 * torch.pow(y, 2)
    f = torch.mean(z)

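Then the actual functional relationship of x, y, z and f is as follows:

$$y_{ij} = x_{ij} + 1, \qquad z_{ij} = 2\,y_{ij}^2, \qquad f = \frac{1}{6}\sum_{i=1}^{2}\sum_{j=1}^{3} z_{ij}$$

that is:

$$f = \frac{1}{3}\sum_{i=1}^{2}\sum_{j=1}^{3} (x_{ij} + 1)^2, \qquad \frac{\partial f}{\partial x_{ij}} = \frac{2}{3}\,(x_{ij} + 1)$$

So we can now work out the derivatives of f with respect to x11, x12, x13, x21, x22 and x23 by hand. Let's check them with torch.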

    print(x.requires_grad)
    print(y.requires_grad)
    print(z.requires_grad)
    print(f.requires_grad)
    print('===================================')
    f.backward()
    print(x.grad)

The output is:

    True
    True
    True
    True
    ===================================
    tensor([[1.3333, 2.0000, 2.6667],
            [3.3333, 4.0000, 4.6667]])

Do we understand the rules of automatic differentiation better now?

These are the default cases: differentiating a scalar with respect to scalars, vectors and matrices.
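As a cross-check of the matrix example, the gradient can also be written by hand. Since f is the mean of 2(x + 1)^2 over six elements, each partial derivative is 4(x_ij + 1)/6. The sketch below compares this hand-derived expression with autograd; the name manual is introduced here only for the comparison.

```python
import torch

x = torch.tensor([[1., 2., 3.], [4., 5., 6.]], requires_grad=True)
f = torch.mean(2 * torch.pow(torch.add(x, 1), 2))
f.backward()

# Hand-derived gradient: df/dx_ij = 4 * (x_ij + 1) / 6
manual = 4 * (x.detach() + 1) / x.numel()

print(x.grad)
print(torch.allclose(x.grad, manual))
```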

2.2 Vector/matrix with respect to vector/matrix: the first parameter of backward, gradient

(1) A rule for differentiation

For example, the following example:

x is a (2,3) matrix that requires a gradient; it is a leaf node (leaf variable).
y is also a (2,3) matrix, y = x^2 + x (the elementwise square of x plus x); in effect, each element of y is differentiated with respect to the corresponding element of x.

    import torch

    x = torch.tensor([[1., 2., 3.], [4., 5., 6.]], requires_grad=True)
    y = torch.add(torch.pow(x, 2), x)

    gradient = torch.tensor([[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]])
    y.backward(gradient)

    print(x.grad)

The output is:

    tensor([[ 3.,  5.,  7.],
            [ 9., 11., 13.]])

This again matches our own calculation, since dy/dx = 2x + 1 for each element.

Compared with the scalar with respect to vector/matrix case above, the key difference is the first parameter of backward(), gradient. So what does this parameter mean?

To find out what the gradient argument does, let's run a further experiment with the following vector with respect to vector differentiation:

    import torch

    x = torch.tensor([1., 2., 3.], requires_grad=True)
    y = torch.pow(x, 2)

    gradient = torch.tensor([1.0, 1.0, 1.0])
    y.backward(gradient)

    print(x.grad)

The output is:

    tensor([2., 4., 6.])

This is what we expect.

Because the entries of gradient here are all 1, its effect cannot be seen. Now change the value of gradient as follows:

gradient=torch.tensor([1.0,0.1,0.01])

The output is:

tensor([2.0000, 0.4000, 0.0600])

Judging from the results, the second component of the gradient is scaled down by a factor of ten and the third by a factor of 100. The scale factors are exactly the corresponding entries of gradient.

In effect, the gradient argument gives each component its own weight. For example, suppose there are three losses, loss1, loss2 and loss3, whose weights differ; the gradient argument can be used to set those weights, i.e.

dy/dx = 0.1*dy1/dx + 1.0*dy2/dx + 0.0001*dy3/dx.

Note that gradient must have the same shape as the final y being differentiated, as the two examples above show.

Summary: the gradient argument has the same shape as the final function y, and each of its elements is the weight applied to the corresponding element of y.
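One way to see why the entries of gradient act as weights is to compare two routes to the same result: passing the weights to backward(), and building the weighted sum of y explicitly before differentiating. The sketch below does both; the weight values are the ones from the formula above, and the variable names are only illustrative.

```python
import torch

weights = torch.tensor([0.1, 1.0, 0.0001])

# Route 1: pass the weights as the gradient argument
x1 = torch.tensor([1., 2., 3.], requires_grad=True)
y1 = torch.pow(x1, 2)
y1.backward(gradient=weights)

# Route 2: form the weighted scalar sum(w_i * y_i) and differentiate it
x2 = torch.tensor([1., 2., 3.], requires_grad=True)
y2 = torch.pow(x2, 2)
(weights * y2).sum().backward()

print(x1.grad)
print(torch.allclose(x1.grad, x2.grad))
```

Both routes give 2 * x_i * w_i for each component, which is why gradient must have the same shape as y.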

2.3 The second and third parameters of backward

(1) Retaining the computation graph: retain_graph

When functional relationships are built, especially with several composite functions, they form a computation graph. For example, suppose we have the following relations:

    p = f(y), where y = f(x)
    q = f(z), where z = f(x)

After backpropagation has run through a computation graph, the buffers of that graph are freed to save memory. If you try to backpropagate through it again, an error is raised.

For example, suppose you first differentiate p. Backpropagation goes from p to y and then from y to x. Once it finishes, the buffers of the subgraph formed by these three nodes are released.

If you now run

p.backward()

again, it cannot be carried out, because the intermediate results saved for the backward pass have been freed. The same applies to

q.backward()

whenever its path passes through intermediate nodes that were already freed (a path that shares only the leaf x with p, as in the relations above, still works). The error is as follows:

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

What should we do then? In this situation, we can keep the computation graph by setting retain_graph=True.

That is, pass retain_graph=True to the first backward call and then call backward again. The computation graph is kept and no error is raised. But this costs memory! In particular, when parameters are updated over a large number of iterations, memory can run out quickly, so this parameter should not be used in most cases.
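A small sketch of this behaviour, using a hypothetical chain x -> y -> p built from torch.pow (an operation that saves its input for the backward pass). It also shows a side effect worth knowing: repeated backward calls accumulate into x.grad rather than overwrite it.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.pow(x, 2)
p = torch.pow(y, 2)                  # p = x^4, so dp/dx = 4x^3 = 32

p.backward(retain_graph=True)        # keep the buffers for another pass
first = x.grad.clone()

p.backward()                         # works; buffers are freed afterwards
accumulated = x.grad.clone()         # gradients add up: 32 + 32

try:
    p.backward()                     # third pass: buffers are gone
    raised = False
except RuntimeError:
    raised = True

print(first, accumulated, raised)
```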

(2) Higher-order derivatives: create_graph

There is little material on the create_graph parameter, and I have not found more detailed usage notes. Its official description is roughly as follows:

When it is True, a graph of the derivative computation itself is constructed, which allows higher-order derivatives (second-order, third-order and so on) to be computed. Here is a small example:

    import torch

    x = torch.tensor(5.0, requires_grad=True)
    y = torch.pow(x, 3)

    grad_x = torch.autograd.grad(y, x, create_graph=True)
    print(grad_x)  # dy/dx = 3 * x^2, i.e. 75

    grad_grad_x = torch.autograd.grad(grad_x[0], x)
    print(grad_grad_x)  # second derivative, d(3 * x^2)/dx = 6 * x = 30

The output is:

    (tensor(75., grad_fn=<MulBackward0>),)
    (tensor(30.),)
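The same pattern extends to the third derivative: every call except the last one passes create_graph=True so that its result can itself be differentiated. This is a sketch for y = x^3 at x = 5; the names d1, d2 and d3 are only illustrative.

```python
import torch

x = torch.tensor(5.0, requires_grad=True)
y = torch.pow(x, 3)

(d1,) = torch.autograd.grad(y, x, create_graph=True)    # 3x^2
(d2,) = torch.autograd.grad(d1, x, create_graph=True)   # 6x
(d3,) = torch.autograd.grad(d2, x)                      # 6

print(d1.item(), d2.item(), d3.item())
```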

3. Vector with respect to vector differentiation explained

Supplementary note: a further discussion of vector with respect to vector gradients.

For example, suppose the gradient of a three-dimensional vector z, computed from x and y, is needed:

Then, to compute the gradient of z with respect to x or y, an external gradient must be passed to z.backward() as follows:

z.backward(torch.FloatTensor([1.0, 1.0, 1.0]))

The tensor passed to backward() acts as the weights of a gradient-weighted output. Mathematically, it is the vector in a vector-Jacobian product for a non-scalar tensor (discussed further below), so it is almost always a tensor of ones with the same shape as the tensor on which backward() is called, unless weighted outputs need to be computed.

Note: the backward graph is created automatically and dynamically by autograd during the forward pass. backward() simply computes the gradients by passing its argument through the already built backward graph.
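The following sketch makes this concrete. The definitions of x, y and z are hypothetical (an elementwise product is assumed purely for illustration). It first shows that calling backward() on a non-scalar without an argument fails, then passes a tensor of ones.

```python
import torch

# Hypothetical inputs and operation, chosen only for illustration
x = torch.tensor([1., 2., 3.], requires_grad=True)
y = torch.tensor([4., 5., 6.], requires_grad=True)
z = x * y                        # non-scalar output with shape (3,)

try:
    z.backward()                 # no external gradient for a non-scalar output
    scalar_only_error = None
except RuntimeError as err:
    scalar_only_error = str(err)

z.backward(torch.ones_like(z))   # external gradient with the same shape as z

print(scalar_only_error)
print(x.grad)   # dz_i/dx_i = y_i
print(y.grad)   # dz_i/dy_i = x_i
```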

Math: Jacobians and vectors

Mathematically, autograd is just an engine for computing vector-Jacobian products. A Jacobian is simply the matrix of all partial derivatives of one vector with respect to another; it is the gradient of one vector with respect to another.

Note: PyTorch never explicitly constructs the entire Jacobian matrix in this process. Computing the vector-Jacobian product directly is usually simpler and more efficient.

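If a vector X = [x1, x2, ..., xn] is mapped to another vector by f(X) = [f1, f2, ..., fm], the Jacobian matrix J contains all the partial derivatives of the outputs with respect to the inputs:

$$J = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix}$$

Jacobian matrix

Note: the Jacobian matrix describes a mapping from n-dimensional vectors to m-dimensional vectors.

The above matrix represents the gradient of f(X) with respect to X.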

Assume a tensor X with gradient tracking enabled in PyTorch:

X = [x1, x2, …, xn]

X undergoes some operations to form a vector Y:

Y = f(X) = [y1, y2,…,ym]

Y is then used to compute a scalar loss l. Suppose the vector v happens to be the gradient of the scalar loss l with respect to the vector Y (pay attention to this sentence, it is very important!):

$$v = \left[\frac{\partial l}{\partial y_1}, \frac{\partial l}{\partial y_2}, \ldots, \frac{\partial l}{\partial y_m}\right]^T$$

The vector v is called grad_tensor (gradient tensor) and is passed as an argument to the backward() function.

To get the gradient of the loss l with respect to X, the transpose of the Jacobian matrix J is multiplied by the vector v:

$$\frac{\partial l}{\partial X} = J^T v$$

This way of multiplying the Jacobian with a vector v is what lets PyTorch accept external gradients for non-scalar outputs.
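The relation between backward(v) and the Jacobian can be checked numerically. The sketch below uses torch.autograd.functional.jacobian to build J explicitly, for verification only, and compares the product of its transpose with v against what backward(v) stores in X.grad. The function f and the vector v are hypothetical choices made for this illustration.

```python
import torch
from torch.autograd.functional import jacobian

def f(t):
    # Hypothetical map from 3 inputs to 2 outputs, for illustration only
    return torch.stack([t[0] * t[1], t[1] + t[2] ** 2])

X = torch.tensor([1., 2., 3.], requires_grad=True)
v = torch.tensor([1.0, 0.5])     # stands in for dl/dY

Y = f(X)
Y.backward(v)                    # autograd applies v without forming J

J = jacobian(f, X.detach())      # explicit Jacobian with shape (2, 3)
manual = J.T @ v                 # transpose of J times v

print(J)
print(X.grad, manual)
```

Here J has one row per output and one column per input, so multiplying its transpose by v yields one value per input, the same shape as X.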

4. Two other ways to compute derivatives

Method 1: Differentiation through torch.autograd.backward()

The basic form of differentiation introduced earlier is:

y.backward(gradient=None, retain_graph=None, create_graph=False), whose three parameters were described above.

Calling it is equivalent to:

torch.autograd.backward(tensors, grad_tensors=None, retain_graph=None, create_graph=False), where the tensors parameter corresponds to y and grad_tensors corresponds to gradient.

So:

y.backward() for a scalar y is equivalent to torch.autograd.backward(y).

Note that this function only performs the differentiation and does not return the result. It always returns None, as in the following example:

    import torch

    x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    y = torch.tensor([4.0, 5.0, 6.0], requires_grad=True)
    z = torch.sum(torch.pow(x, 2) + torch.pow(y, 3))  # z = sum(x^2 + y^3)

    torch.autograd.backward([z])  # differentiate; equivalent to z.backward()

    print(x.grad)  # read the computed derivatives
    print(y.grad)

Output:

    tensor([2., 4., 6.])
    tensor([ 48.,  75., 108.])

Notes:

(1) This method only performs the differentiation and always returns None.

(2) When differentiating a vector with respect to a vector, the grad_tensors parameter must be passed. Its meaning is the same as that of the gradient argument of y.backward() described above.

(3) retain_graph=None and create_graph=False have the same meaning as before.
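Notes (1) and (2) can be seen together in a short sketch: a non-scalar output is differentiated by passing grad_tensors, and the return value is captured to show that it is None. The weights reuse the values from the earlier gradient experiment.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = torch.pow(x, 2)                      # non-scalar output

result = torch.autograd.backward(
    [y], grad_tensors=[torch.tensor([1.0, 0.1, 0.01])]
)

print(result)   # the function itself returns nothing
print(x.grad)   # the gradients are stored on the leaf tensor
```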

Method 2: Differentiation through torch.autograd.grad()

Besides the two previous methods, namely

y.backward() and torch.autograd.backward(y),

there is another way to compute derivatives:

torch.autograd.grad()

First look at the signature of this function.

def grad(outputs, inputs, grad_outputs=None, retain_graph=None, create_graph=False, only_inputs=True, allow_unused=False):

outputs: the dependent variable, that is, the function to be differentiated; in this example it is z. It can be one tensor or several, such as [tensor1, tensor2, tensor3, ...].
inputs: the independent variables of the function; in this example they are [x, y]. It can be one tensor or several, such as [tensor1, tensor2, tensor3, ...].
grad_outputs: has the same meaning as grad_tensors in the previous two methods, and must be specified for vector with respect to vector differentiation.

Using the same example, let's see how it is done:

    import torch

    x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    y = torch.tensor([4.0, 5.0, 6.0], requires_grad=True)
    z = torch.sum(torch.pow(x, 2) + torch.pow(y, 3))  # z = sum(x^2 + y^3)

    print(torch.autograd.grad(z, [x, y]))  # differentiate and return the values

Output:

    (tensor([2., 4., 6.]), tensor([ 48.,  75., 108.]))

Notes:

This function performs the differentiation and returns the derivative with respect to each independent variable. This is where it differs from the previous methods.
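A sketch of the difference in practice: with a non-scalar output, grad_outputs plays the role of the weights, the derivative comes back as the return value, and the leaf's own grad attribute stays untouched. The weights reuse the values from the earlier gradient experiment.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = torch.pow(x, 2)                      # non-scalar output

(dx,) = torch.autograd.grad(
    y, x, grad_outputs=torch.tensor([1.0, 0.1, 0.01])
)

print(dx)       # returned directly
print(x.grad)   # nothing was stored on the leaf
```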

Original link:  How to use pytorch to automatically find the gradient (baidu.com)


Origin blog.csdn.net/weixin_40895135/article/details/130001387