A summary of tensors and backward() in PyTorch

This article addresses the following questions:

  • Why a tensor needs requires_grad=True before its gradient can be computed
  • Why gradients need to be cleared (e.g. with Tensor.grad.data.zero_())
  • The two parameters of tensor.backward(): gradient and retain_graph

Explanation

In older versions of PyTorch there were two basic objects: Tensor and Variable. A plain tensor could not be backpropagated; a Variable wrapped a tensor with the mutable state needed for parameter updates, so it could be backpropagated.

In fact, Variable is a wrapper around a tensor. Its operations are basically the same as a tensor's, but every Variable has three extra attributes.

  • .data: the data of the tensor itself
  • .grad: the gradient of the tensor
  • .grad_fn: the function from which .grad is obtained, i.e. where the gradient of this tensor comes from. Loosely comparing to independent and dependent variables in a function: for an independent variable (a tensor created directly by the user) this attribute is None; for a dependent variable, its gradient comes from computations on the independent variables, and .grad_fn records that operation.

Every time a tensor is created, if its gradient needs to be computed, it has to be wrapped into a Variable before the computation, and the attribute requires_grad must be set to True.

from torch.autograd import Variable
import torch

# Create a tensor of shape torch.Size([2, 3])
x_tensor = torch.randn(2, 3)

# Wrap the tensor into a Variable so that gradients can be computed; requires_grad must be set to True
x = Variable(x_tensor, requires_grad=True)

y = 3 * x ** 2 + 1

print(y.type())  # torch.FloatTensor
print(y.grad_fn)  # <AddBackward0 object at 0x0000021679AB9700>

# The gradient argument grad_variables must have the same shape as the Variable
grad_variables = torch.FloatTensor([[1, 2, 3],
                                    [1, 1, 1]])

# Compute the gradient of y with respect to x; the argument is required here,
# otherwise: "grad can be implicitly created only for scalar outputs"
y.backward(grad_variables)
print(x.grad)  # the gradient of x

In the new version of PyTorch, torch.autograd.Variable and torch.Tensor belong to the same class, so a Variable wrapper is no longer needed. All of the following operations are performed with the new version of PyTorch.
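As a bridge to the new API (a minimal sketch, not from the original post), the Variable example above can be written directly with tensors:

import torch

# requires_grad is now set directly on the tensor; no Variable wrapper is needed
x = torch.randn(2, 3, requires_grad=True)

y = 3 * x ** 2 + 1
print(y.grad_fn)  # <AddBackward0 object at ...>

# the gradient argument must still match y's shape because y is not a scalar
y.backward(torch.ones_like(y))
print(x.grad)  # equals 6 * x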

Tensor

Tensor creation

A Tensor is an n-dimensional array, conceptually the same as a NumPy array, except that a Tensor can track a computation graph and compute gradients. It can be created from Python's built-in data types, from NumPy arrays, and in other ways. Here are several ways to create a tensor:

import torch
import numpy as np

# Create from numpy
numpy_array = np.array([2, 1, 2])
torch_tensor1 = torch.from_numpy(numpy_array)
torch_tensor2 = torch.Tensor(numpy_array)
torch_tensor3 = torch.tensor(numpy_array)

# Convert a tensor back to a numpy array
numpy_array = torch_tensor1.numpy()        # if the tensor is on the CPU
numpy_array = torch_tensor1.cpu().numpy()  # if the tensor is on the GPU
print(type(numpy_array))  # output: <class 'numpy.ndarray'>

torch.Tensor() is an alias for the default tensor type torch.FloatTensor(); that is, torch.Tensor() returns the float data type. The default type can be changed:

torch.set_default_tensor_type(torch.DoubleTensor)

torch.tensor(), in contrast, infers the data type from the input and returns the corresponding torch.LongTensor, torch.FloatTensor, or torch.DoubleTensor.
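A minimal sketch of this type inference (assuming the default tensor type has not been changed):

import torch

print(torch.tensor([1, 2]).dtype)       # torch.int64
print(torch.tensor([1.0, 2.0]).dtype)   # torch.float32
print(torch.Tensor([1, 2]).dtype)       # torch.float32 (the default tensor type)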

# Create a Tensor of shape `size` filled with fill_value
a = torch.full(size=[2, 3], fill_value=2)
# Create a Tensor of shape `size` filled with ones
b = torch.ones(size=[2, 3])
# Create a Tensor of shape `size` filled with zeros
c = torch.zeros(size=[2, 3])
# Create an identity-matrix Tensor
d = torch.eye(3)
# Create a Tensor of shape `size` with random integers in [low, high)
e = torch.randint(low=1, high=10, size=[2, 2])

When creating a tensor with torch.tensor(), you can set the data type (dtype) and the device (device):

a = torch.tensor(data=[[1, 3, 1],  # data
                       [4, 5, 4]],
                 dtype=torch.float64,            # data type
                 device=torch.device('cuda:0'))  # device

# These can be inspected through the tensor's attributes
print(a.dtype)  # torch.float64
print(a.device)  # cuda:0
print(a.is_cuda)  # True

Placing a tensor on the GPU accelerates computations on that tensor.

# 1. Define a cuda data type
# and convert the Tensor to that cuda type
dtype = torch.cuda.FloatTensor
gpu_tensor = torch.tensor(data=[[1, 3, 1],
                                [4, 5, 4]]).type(dtype)

# 2. Put the Tensor directly on the GPU
a = torch.tensor(data=[[1, 3, 1],
                       [4, 5, 4]])
gpu_tensor = a.cuda(0)  # put the Tensor on the first GPU
gpu_tensor = a.cuda(1)  # put the Tensor on the second GPU

# Moving a tensor from the GPU back to the CPU is just as simple
a_cpu = gpu_tensor.cpu()
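Alternatively (a sketch, not in the original), the device-agnostic .to() API moves tensors between devices:

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

gpu_tensor = a.to(device)     # move to the GPU if one is available
a_cpu = gpu_tensor.to('cpu')  # move back to the CPU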

Basic data types and common attributes of Tensor

The most basic data type of Tensor:

  • 32-bit floating point: torch.float32 (the most commonly used)
  • 64-bit floating point: torch.float64
  • 32-bit integer: torch.int32
  • 16-bit integer: torch.int16
  • 64-bit integer: torch.int64

a.type() can be used to create a new tensor of a different data type from the same data.

import torch

# Create a tensor
a = torch.tensor(data=[[1, 3, 1],
                       [4, 5, 4]])
# Check the tensor's data type
print(a.dtype)  # torch.int64

# type() returns a new tensor converted to the given dtype; the original tensor is unchanged
b = a.type(torch.float64)
print(a.dtype)  # torch.int64
print(b.dtype)  # torch.float64

# Check the tensor's shape
print(a.shape)  # torch.Size([2, 3])

# Check whether the tensor is on the GPU
print(a.is_cuda)  # False

# Check the device the tensor is stored on
print(a.device)  # cpu

# Check the tensor's gradient; None if no gradient has been computed
print(a.grad)  # None

# requires_grad must be set to True if the gradient is to be computed
print(a.requires_grad)  # False

Automatic Differentiation of Tensor

If the requires_grad attribute of a torch.Tensor is set to True, PyTorch starts tracking all operations on that tensor. When the computation is complete, the .backward() method can be called to compute all gradients automatically, and the gradient computed for the tensor is accumulated into its .grad attribute.

  • .data: the data of the tensor itself
  • .grad: the accumulated gradient of the tensor
  • .grad_fn: the function from which .grad is obtained (None if the tensor was created directly by the user)
# Create a tensor
x = torch.tensor(data=[[1, 2],
                       [2, 1]],
                 dtype=torch.float64,
                 requires_grad=True)
print(x.requires_grad)  # True

y = x ** 2 + 2
y = y.mean()

# When calling backward, y must be a scalar; otherwise the gradient argument of backward must be set.
y.backward()
print(x.grad)
"""
tensor([[0.5000, 1.0000],
        [1.0000, 0.5000]], dtype=torch.float64)
"""

To illustrate the gradient calculation above, some definitions are needed. Let
$$\boldsymbol{x}=\begin{bmatrix} x_1 & x_2 \\ x_3 & x_4 \end{bmatrix}, \qquad y=f\left( \boldsymbol{x} \right) =f\left( x_1,x_2,x_3,x_4 \right) =\frac{x_{1}^{2}+x_{2}^{2}+x_{3}^{2}+x_{4}^{2}}{4}+2$$

The gradient of $y$ with respect to $\boldsymbol{x}$ is therefore
$$\frac{dy}{d\boldsymbol{x}}=\begin{bmatrix} \frac{\partial y}{\partial x_1} & \frac{\partial y}{\partial x_2} \\ \frac{\partial y}{\partial x_3} & \frac{\partial y}{\partial x_4} \end{bmatrix}=\begin{bmatrix} \frac{x_1}{2} & \frac{x_2}{2} \\ \frac{x_3}{2} & \frac{x_4}{2} \end{bmatrix}$$

Since $x.data=\begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}$, substituting into the gradient formula gives
$$x.grad=\begin{bmatrix} \frac{1}{2} & \frac{2}{2} \\ \frac{2}{2} & \frac{1}{2} \end{bmatrix}=\begin{bmatrix} 0.5000 & 1.0000 \\ 1.0000 & 0.5000 \end{bmatrix}$$

There is one catch: gradients are accumulated in .grad. If several functions compute gradients with respect to the same tensor x, .grad holds their accumulated sum. It can be cleared with Tensor.grad.data.zero_().

# Create a tensor
x = torch.tensor(data=[[1, 2],
                       [2, 1]],
                 dtype=torch.float64,
                 requires_grad=True)
print(x.requires_grad)  # True

z = 4 * x + 2
z = z.mean()

z.backward()
print(x.grad)
"""
tensor([[1., 1.],
        [1., 1.]], dtype=torch.float64)
"""

# Zero x's gradient here; otherwise the gradients from the different functions are added together.
# x.grad.data.zero_()

y = x ** 2 + 2
y = y.mean()

# When calling backward, y must be a scalar; otherwise the gradient argument of backward must be set.
y.backward()
print(x.grad)
"""
tensor([[1.5000, 2.0000],
        [2.0000, 1.5000]], dtype=torch.float64)
"""

The gradient of x in this example is accumulated from two functions $z$ and $y$, where
$$z=f\left( \boldsymbol{x} \right) =f\left( x_1,x_2,x_3,x_4 \right) =x_1+x_2+x_3+x_4+2$$
$$y=f\left( \boldsymbol{x} \right) =f\left( x_1,x_2,x_3,x_4 \right) =\frac{x_{1}^{2}+x_{2}^{2}+x_{3}^{2}+x_{4}^{2}}{4}+2$$

The gradient first accumulated in x's .grad attribute comes from $z$:
$$\frac{dz}{d\boldsymbol{x}}=\begin{bmatrix} \frac{\partial z}{\partial x_1} & \frac{\partial z}{\partial x_2} \\ \frac{\partial z}{\partial x_3} & \frac{\partial z}{\partial x_4} \end{bmatrix}=\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$$
so with $x.data=\begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}$ the gradient is $x.grad=\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$.

Since the gradient of x was not cleared, after computing the gradient of $y$ with respect to $\boldsymbol{x}$, x.grad consists of two parts, the previous gradient plus the new one:
$$x.grad=\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}+\begin{bmatrix} 0.5000 & 1.0000 \\ 1.0000 & 0.5000 \end{bmatrix}=\begin{bmatrix} 1.5000 & 2.0000 \\ 2.0000 & 1.5000 \end{bmatrix}$$
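For comparison, a minimal sketch of the same two backward passes with the gradient cleared in between, so only y's contribution remains:

import torch

x = torch.tensor(data=[[1., 2.],
                       [2., 1.]], dtype=torch.float64, requires_grad=True)

z = (4 * x + 2).mean()
z.backward()
print(x.grad)        # tensor([[1., 1.], [1., 1.]], dtype=torch.float64)

x.grad.data.zero_()  # clear the accumulated gradient before the next backward

y = (x ** 2 + 2).mean()
y.backward()
print(x.grad)        # tensor([[0.5000, 1.0000], [1.0000, 0.5000]], dtype=torch.float64)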

Disabling gradient tracking

Sometimes there is no need to track computations, for example on data from the test set (disable tracking for everything), or when freezing weights (disable tracking for part of the model).

The way to achieve this is to set the .requires_grad attribute of the tensors that do not need gradients to False.

In PyTorch, if a tensor's requires_grad is True, the tensor takes part in automatic differentiation during backpropagation. When requires_grad is False, no derivatives are computed for it during backpropagation, which can save a great deal of GPU memory or RAM. The default value of requires_grad is False.

Statements that change the requires_grad value of a tensor:

# Change the requires_grad attribute of tensor a to True in place
a.requires_grad = True
a.requires_grad_(True)  # equivalent to the line above; note the trailing underscore

For the test set, use with torch.no_grad(): every tensor computed inside this block automatically has its .requires_grad attribute set to False.

import torch

x = torch.randn(3, 4, requires_grad=True)
y = torch.randn(3, 4, requires_grad=True)

u = x + y
print(u.requires_grad)  # True
print(u.grad_fn)  # <AddBackward0 object at 0x000001D0DC709700>

with torch.no_grad():
    w = x + y
    print(w.requires_grad)  # False
    print(w.grad_fn)  # None
print(w.requires_grad)  # False

The with statement is suited to accessing resources: it guarantees that, whether or not an exception occurs, the necessary "cleanup" operations run and the resource is released.

Example: with open("1.txt") as file:

How with works:

  1. The expression following with is evaluated and its __enter__ method is called; the return value is assigned to the variable after as.
  2. When the body of the with statement has finished executing, the __exit__ method of that object is called (a small sketch illustrating both steps follows this list).
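A minimal sketch of a custom context manager (the class name Resource is hypothetical) that makes these two steps visible:

class Resource:
    def __enter__(self):
        print("acquire")   # called first; the return value is bound to the as-variable
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        print("release")   # always called when the with body finishes, even on an exception
        return False       # do not suppress exceptions

with Resource() as r:
    print("use resource")
# output: acquire / use resource / release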

requires_grad is also used to freeze weights: traverse the parameters of the network and set requires_grad to False for the parameters that should be frozen, leaving the others set to True.

# Iterate over the model's named parameters
for name, para in model.named_parameters():
    # Freeze parameters by name: everything except the final fc layer (fc weight and fc.bias) is frozen here
    if "fc" not in name and "fc.bias" not in name:
        para.requires_grad_(False)
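A related sketch (assuming the same model object as in the snippet above, and a placeholder learning rate): after freezing, it is common to hand only the still-trainable parameters to the optimizer:

import torch

# only parameters that still have requires_grad=True will be updated
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable_params, lr=0.01)  # lr is a placeholder value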

PyTorch computation graphs

PyTorch uses a dynamic graph mechanism, so a new computation graph is built at every iteration during training. The computation graph represents the relationships between the variables in the program.

Dynamic computation means the program executes in the order we write the commands. This mechanism makes debugging easier and makes it easier to translate the ideas in our heads into actual code. Static computation means the program first builds the structure of the neural network when it is compiled, and then performs the corresponding operations. In theory, a static mechanism lets the compiler optimize more aggressively, but it also means a larger gap between what you expect your program to do and what the compiler actually executes. It also means errors in the code are harder to find (for example, a problem in the structure of the computation graph may only surface when the code executes the corresponding operation). Although static computation graphs in theory have better performance than dynamic ones, in practice this is often not the case.

A computation graph is a directed acyclic graph used to describe operations. It consists mainly of nodes (Node) and edges (Edge), where nodes represent data and edges represent operations.

Take $y=f\left( x_1,x_2,x_3,x_4 \right) =(x_{1}+x_{2})(x_{2}-x_{3}-x_{4})$ as an example. Its computation graph is shown in the figure below, where $x_1,x_2,x_3,x_4$ are leaf nodes.

[Figure: computation graph of $y=(x_1+x_2)(x_2-x_3-x_4)$, with intermediate nodes $f=x_1+x_2$ and $g=x_2-x_3-x_4$]

From the chain rule, the derivative of the root node with respect to a leaf node is the sum of the derivatives along all paths from the root node to that leaf node. (This also further explains why .grad is additive.)

For example, to find the derivative of $y$ with respect to $x_2$, there are two paths, so
$$\frac{\partial y}{\partial x_2}=\frac{\partial y}{\partial f}\frac{\partial f}{\partial x_2}+\frac{\partial y}{\partial g}\frac{\partial g}{\partial x_2}$$
With $(x_1,x_2,x_3,x_4)=(1,2,3,4)$, the value of $\frac{\partial y}{\partial x_2}$ is
$$\frac{\partial y}{\partial x_2}=\frac{\partial y}{\partial f}\frac{\partial f}{\partial x_2}+\frac{\partial y}{\partial g}\frac{\partial g}{\partial x_2} = g \cdot 1 + f \cdot 1 = -5 + 3 = -2$$

import torch

# Create the leaf tensors
x_1 = torch.tensor(data=[1.], requires_grad=True)
x_2 = torch.tensor(data=[2.], requires_grad=True)
x_3 = torch.tensor(data=[3.], requires_grad=True)
x_4 = torch.tensor(data=[4.], requires_grad=True)

f = x_1 + x_2
g = x_2 - x_3 - x_4

y = f * g
# y.retain_grad()

y.backward()
print(x_2.grad)

# Check which tensors are leaf nodes
print(x_1.is_leaf, x_2.is_leaf, x_3.is_leaf, x_4.is_leaf, f.is_leaf, g.is_leaf, y.is_leaf)
# output: True True True True False False False

# Check the gradients
print(x_1.grad, x_2.grad, x_3.grad, x_4.grad, f.grad, g.grad, y.grad)
# output: tensor([-5.]) tensor([-2.]) tensor([-3.]) tensor([-3.]) None None None

torch.Tensor has an attribute is_leaf that indicates whether the tensor is a leaf node. Leaf nodes are the foundation of the entire computation graph: whether when building the graph or during backpropagation, all gradient computations depend on the leaf nodes. The main purpose of distinguishing leaf nodes is to save memory: after gradient backpropagation finishes, the gradients of non-leaf nodes are released.

During execution, a warning may be reported: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed.

It means: the .grad attribute of a non-leaf tensor is being accessed; this attribute will not be populated during autograd.backward(). If you really want the gradient of a non-leaf tensor, call .retain_grad() on it. If you accessed the non-leaf tensor by mistake, make sure you access the leaf tensor instead.

For the three non-leaf tensors $f, g, y$, if you want to use their .grad attributes, call .retain_grad() on them before backward() so that their gradients are saved.
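A minimal sketch, reusing the graph above, that retains the gradients of the non-leaf tensors:

import torch

x_1 = torch.tensor(data=[1.], requires_grad=True)
x_2 = torch.tensor(data=[2.], requires_grad=True)
x_3 = torch.tensor(data=[3.], requires_grad=True)
x_4 = torch.tensor(data=[4.], requires_grad=True)

f = x_1 + x_2
g = x_2 - x_3 - x_4
y = f * g

# retain the gradients of the non-leaf tensors before calling backward
f.retain_grad()
g.retain_grad()
y.retain_grad()

y.backward()
print(f.grad, g.grad, y.grad)  # tensor([-5.]) tensor([3.]) tensor([1.])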

Some details of backward()

Its general form is:

tensor.backward(gradient, retain_graph)

The computation graph built by PyTorch is a dynamic graph. To save memory, the graph is freed from memory after each iteration, so calling backward multiple times raises an error. Setting the flag retain_graph=True keeps the graph so that it is not freed.

import torch

x = torch.randn(4, 4, requires_grad=True)
y = 3 * x + 2
y = torch.sum(y)

y.backward(retain_graph=True)  # retain_graph=True keeps the computation graph from being freed immediately
y.backward()  # no error
y.backward()  # error: the graph has already been freed
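A small sketch of the interaction with gradient accumulation: each additional backward over the retained graph adds into .grad, so the gradient below doubles:

import torch

x = torch.randn(4, 4, requires_grad=True)
y = torch.sum(3 * x + 2)

y.backward(retain_graph=True)
first = x.grad.clone()      # every entry is 3 after the first backward

y.backward()                # works because the graph was retained; gradients accumulate
print(torch.allclose(x.grad, 2 * first))  # True: every entry is now 6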

Also, all of the gradient computations above are for the case where y is a scalar. If y is not a scalar but a multi-dimensional tensor, the gradient argument must be passed to backward. Why is this parameter needed, and how is it used?

Reference: Automatic Differentiation with torch.autograd

Here is an example. Suppose
$$Y=\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}, \quad W=\begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix}, \quad X=\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$
and the computation is $Y=WX$, $A=f(Y)$, where the concrete definition of the function $f(y_1,y_2)$ is unknown:
$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}=\begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad A=f(y_1,y_2)$$
Written in scalar form:
$$y_1=w_{11}x_1+w_{12}x_2 \\ y_2=w_{21}x_1+w_{22}x_2 \\ A=f(y_1,y_2)$$
Based on this, we want the partial derivatives of $A$ with respect to $x_1, x_2$, i.e. $\frac{\partial A}{\partial x_1}$ and $\frac{\partial A}{\partial x_2}$. By the chain rule for composite functions,
$$\begin{aligned} \frac{\partial A}{\partial x_1}&=\frac{\partial A}{\partial y_1}\frac{\partial y_1}{\partial x_1}+\frac{\partial A}{\partial y_2}\frac{\partial y_2}{\partial x_1}\\ \frac{\partial A}{\partial x_2}&=\frac{\partial A}{\partial y_1}\frac{\partial y_1}{\partial x_2}+\frac{\partial A}{\partial y_2}\frac{\partial y_2}{\partial x_2} \end{aligned}$$
These two equations can be written as a matrix product:
$$\left[ \frac{\partial A}{\partial x_1},\frac{\partial A}{\partial x_2} \right] =\left[ \frac{\partial A}{\partial y_1},\frac{\partial A}{\partial y_2} \right] \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2}\\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \end{bmatrix}$$
where
$$\begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2}\\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \end{bmatrix}$$
is called the Jacobian matrix, which can be obtained from the known conditions.
Now only $\left[ \frac{\partial A}{\partial y_1},\frac{\partial A}{\partial y_2} \right]$ is needed: even without knowing the concrete form of $f(y_1,y_2)$, the result can be obtained as long as this row vector is known. The question is how to supply $\left[ \frac{\partial A}{\partial y_1},\frac{\partial A}{\partial y_2} \right]$.

This is exactly what the gradient parameter of PyTorch's backward function provides.

Here is a concrete calculation. Let $W=\begin{bmatrix} 1 & 2\\ 3 & 4 \end{bmatrix}$. Then the Jacobian is
$$\begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2}\\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 1 & 2\\ 3 & 4 \end{bmatrix}$$
(here it happens to equal $W$; the Jacobian does not depend on $x_1, x_2$, and it is not necessarily such a coincidence in other situations).

So whatever values $x_1, x_2$ take, the Jacobian is fixed. Passing torch.tensor([0.1, 0.2], dtype=torch.float) through the gradient argument then gives:
$$\left[ \frac{\partial A}{\partial x_1},\frac{\partial A}{\partial x_2} \right] =\left[ \frac{\partial A}{\partial y_1},\frac{\partial A}{\partial y_2} \right] \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2}\\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \end{bmatrix} =\left[ 0.1,0.2 \right] \begin{bmatrix} 1 & 2\\ 3 & 4 \end{bmatrix} =\left[ 0.7,1.0 \right]$$
The code implementation is as follows:

# The implementation uses column vectors as a shortcut, so the computed gradient is the transpose of the theoretical row vector
import torch

x = torch.tensor(data=[[1], [2]], dtype=torch.float64, requires_grad=True)  # define the input variable

w = torch.tensor([[1, 2],
                  [3, 4]], dtype=torch.float64)

y = torch.mm(w, x)  # matrix multiplication
y.backward(gradient=torch.tensor([[0.1], [0.2]], dtype=torch.float64))  # the gradient's dtype should match y's
print(x.grad.data)

# output:
"""
tensor([[0.7000],
        [1.0000]], dtype=torch.float64)
"""

Of course, here is an example of a three-dimensional input and output:
$$\begin{gathered} y_1=x_1+x_2+x_3 \\ y_2=x_1 x_2+x_3 \\ y_3=x_1-x_2 x_3 \\ A=f\left(y_1, y_2, y_3\right) \end{gathered}$$
The matrix form of the chain rule for this multivariate composite function is:
$$\left[\frac{\partial A}{\partial x_1}, \frac{\partial A}{\partial x_2}, \frac{\partial A}{\partial x_3}\right]=\left[\frac{\partial A}{\partial y_1}, \frac{\partial A}{\partial y_2}, \frac{\partial A}{\partial y_3}\right]\begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \frac{\partial y_1}{\partial x_3} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \frac{\partial y_2}{\partial x_3} \\ \frac{\partial y_3}{\partial x_1} & \frac{\partial y_3}{\partial x_2} & \frac{\partial y_3}{\partial x_3} \end{bmatrix}$$
Assume the gradient argument passed in is torch.tensor([0.1, 0.2, 0.3], dtype=torch.float) and that $x_1=1, x_2=2, x_3=3$. The theoretical gradient is:
$$\left[ \frac{\partial A}{\partial x_1},\frac{\partial A}{\partial x_2},\frac{\partial A}{\partial x_3} \right] =\left[ \frac{\partial A}{\partial y_1},\frac{\partial A}{\partial y_2},\frac{\partial A}{\partial y_3} \right] \begin{bmatrix} 1 & 1 & 1\\ x_2 & x_1 & 1\\ 1 & -x_3 & -x_2 \end{bmatrix} =\left[ 0.1,0.2,0.3 \right] \begin{bmatrix} 1 & 1 & 1\\ 2 & 1 & 1\\ 1 & -3 & -2 \end{bmatrix} =\left[ 0.8,-0.6,-0.3 \right]$$
The corresponding code is as follows:

import torch

x = torch.tensor([1, 2, 3], requires_grad=True, dtype=torch.float)

y = torch.randn(3)

y[0] = x[0] + x[1] + x[2]
y[1] = x[0] * x[1] + x[2]
y[2] = x[0] - x[1] * x[2]

y.backward(torch.tensor([0.1, 0.2, 0.3], dtype=torch.float))

print(x.grad)
# output:
"""
tensor([ 0.8000, -0.6000, -0.3000])
"""


Source: blog.csdn.net/weixin_41012765/article/details/127956604