This article addresses the following questions:
- Why `requires_grad=True` must be set before a tensor's gradient can be computed
- Why accumulated gradients need to be zeroed between calls to `tensor.backward()`
- What the two parameters of `tensor.backward()`, `gradient` and `retain_graph`, are for
Introduction
In older versions of PyTorch there were two basic objects: Tensor and Variable. A Tensor could not be backpropagated through; a Variable was a value that changes as parameters are updated, so it could be backpropagated through.
In fact, a Variable was a wrapper around a Tensor. Its operations were basically the same as a Tensor's, but each Variable had three additional attributes:
- `.data`: the data of the tensor itself
- `.grad`: the gradient corresponding to the tensor
- `.grad_fn`: how `.grad` is obtained, that is, which function the tensor's gradient is computed from; if there is none, it is `None`. (This may be hard to grasp at first and is explained later. A simple analogy is the independent and dependent variables of a function: for an independent variable the attribute is `None`; for a dependent variable, the gradient comes from the computation on its independent variables.)
Whenever a tensor was created and its gradient needed to be computed, it had to be wrapped in a Variable before the computation, with the attribute `requires_grad` set to True.
from torch.autograd import Variable
import torch
# Create a tensor of shape torch.Size([2, 3])
x_tensor = torch.randn(2, 3)
# Wrap the tensor in a Variable so that gradients can be computed; requires_grad must be True
x = Variable(x_tensor, requires_grad=True)
y = 3 * x ** 2 + 1
print(y.type())  # torch.FloatTensor
print(y.grad_fn)  # <AddBackward0 object at 0x0000021679AB9700>
# The gradient argument must have the same shape as y
grad_variables = torch.FloatTensor([[1, 2, 3],
                                    [1, 1, 1]])
# Compute the gradient of y with respect to x. y is not a scalar, so this argument is required;
# without it the call fails with "grad can be implicitly created only for scalar outputs"
y.backward(grad_variables)
print(x.grad)  # the gradient of x
In newer versions of PyTorch (0.4 and later), `torch.autograd.Variable` and `torch.Tensor` have been merged into a single class. All the following operations are performed under the new version of PyTorch.
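As a rough sketch of what the merge means in practice, the Variable example above can be written with a plain tensor. The input is replaced by `torch.ones` here (an assumption made only so the result is predictable); the name `upstream` is illustrative.

```python
import torch

# A plain tensor can track gradients directly; no Variable wrapper is needed.
x = torch.ones(2, 3, requires_grad=True)
y = 3 * x ** 2 + 1  # element-wise, so dy/dx = 6 * x

# y is not a scalar, so backward() needs an upstream gradient of the same shape.
upstream = torch.tensor([[1.0, 2.0, 3.0],
                         [1.0, 1.0, 1.0]])
y.backward(upstream)

# Each entry of x.grad is upstream * 6 * x, and x is all ones here.
print(x.grad)
```

With an all-ones input, `x.grad` is simply six times the upstream gradient.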
Tensor
Tensor creation
A Tensor is an n-dimensional array, conceptually the same as a NumPy array, except that a Tensor can also track the computation graph and compute gradients. Tensors can be created from Python's built-in data types or from other libraries such as NumPy. Here are several ways to create one:
import torch
import numpy as np
# Create tensors from a NumPy array
numpy_array = np.array([2, 1, 2])
torch_tensor1 = torch.from_numpy(numpy_array)
torch_tensor2 = torch.Tensor(numpy_array)
torch_tensor3 = torch.tensor(numpy_array)
# Convert a tensor back to a NumPy array
numpy_array = torch_tensor1.numpy()  # if the tensor is on the CPU
numpy_array = torch_tensor1.cpu().numpy()  # if the tensor is on the GPU, move it to the CPU first
print(type(numpy_array))  # Output: <class 'numpy.ndarray'>
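One practical difference between these constructors is worth a small sketch: `torch.from_numpy` shares memory with the source array, while `torch.tensor` copies the data. The variable names `shared` and `copied` are illustrative.

```python
import numpy as np
import torch

arr = np.array([2, 1, 2])
shared = torch.from_numpy(arr)  # shares memory with arr
copied = torch.tensor(arr)      # makes an independent copy

arr[0] = 100  # modify the NumPy array in place

print(shared)  # the change is visible through the shared tensor
print(copied)  # the copy still holds the original values
```

If the array will be modified later and the tensor should not change with it, the copying constructor is the safer choice.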
`torch.Tensor()` is an alias for the default tensor type `torch.FloatTensor()`; that is, `torch.Tensor()` returns a tensor of the float data type. The default type can be changed with `torch.set_default_tensor_type(torch.DoubleTensor)`.
`torch.tensor()`, in contrast, infers the type from the input data and produces the corresponding `torch.LongTensor`, `torch.FloatTensor`, or `torch.DoubleTensor`.
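The inference rule can be checked with a short sketch. It assumes the default dtype has not been changed (so the default float type is `torch.float32`); the variable names are illustrative.

```python
import numpy as np
import torch

from_ints = torch.tensor([1, 2])                      # Python ints
from_floats = torch.tensor([1.0, 2.0])                # Python floats
from_float64 = torch.tensor(np.array([1.0, 2.0]))     # NumPy float64 array
legacy = torch.Tensor([1, 2])                         # always the default tensor type

print(from_ints.dtype)     # torch.int64
print(from_floats.dtype)   # torch.float32
print(from_float64.dtype)  # torch.float64
print(legacy.dtype)        # torch.float32
```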
# Create a tensor of shape size filled with fill_value
a = torch.full(size=[2, 3], fill_value=2)
# Create a tensor of shape size filled with ones
b = torch.ones(size=[2, 3])
# Create a tensor of shape size filled with zeros
c = torch.zeros(size=[2, 3])
# Create an identity matrix
d = torch.eye(3)
# Create a tensor of shape size with random integers in [low, high) (high is exclusive)
e = torch.randint(low=1, high=10, size=[2, 2])
When creating a tensor with `torch.tensor()`, you can set the data type (dtype) and the device (device):
a = torch.tensor(data=[[1, 3, 1],  # data
                       [4, 5, 4]],
                 dtype=torch.float64,  # data type
                 device=torch.device('cuda:0'))  # device
# These can be inspected through the tensor's attributes
print(a.dtype)  # torch.float64
print(a.device)  # cuda:0
print(a.is_cuda)  # True
Placing a tensor on the GPU can speed up computations on it.
# 1. Define a CUDA data type and convert the tensor to it
dtype = torch.cuda.FloatTensor
gpu_tensor = torch.tensor(data=[[1, 3, 1],
                                [4, 5, 4]]).type(dtype)
# 2. Move the tensor to the GPU directly
a = torch.tensor(data=[[1, 3, 1],
                       [4, 5, 4]])
gpu_tensor = a.cuda(0)  # place the tensor on the first GPU
gpu_tensor = a.cuda(1)  # place the tensor on the second GPU (requires at least two GPUs)
# Moving a GPU tensor back to the CPU is just as simple
a_cpu = gpu_tensor.cpu()
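The snippets above fail on a machine without a GPU. A common alternative, sketched here under the assumption that falling back to the CPU is acceptable, is to pick the device once and move tensors with `.to(device)`. The names `device`, `a_dev`, and `b_cpu` are illustrative.

```python
import torch

# Use the first GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

a = torch.tensor([[1, 3, 1],
                  [4, 5, 4]], dtype=torch.float32)
a_dev = a.to(device)   # no-op on a CPU-only machine
b_dev = a_dev * 2      # computed on whichever device was selected
b_cpu = b_dev.cpu()    # bring the result back to the CPU

print(a_dev.device, b_cpu.device)
```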
Tensor (tensor) basic data types and common attributes
The most basic data type of Tensor:
- 32-bit floating point type: torch.float32 (most commonly used)
- 64-bit floating-point type: torch.float64 (most commonly used)
- 32-bit integer: torch.int32
- 16-bit integer: torch.int16
- 64-bit integer: torch.int64
`a.type(dtype)` returns a new tensor with the given data type; the original tensor is left unchanged.
import torch
# Create a tensor
a = torch.tensor(data=[[1, 3, 1],
                       [4, 5, 4]])
# Check the tensor's data type
print(a.dtype)  # torch.int64
# Called without an argument, a.type() returns the type name as a string (here 'torch.LongTensor')
b = a.type(torch.float64)
print(a.dtype)  # torch.int64
print(b.dtype)  # torch.float64
# Check the tensor's shape
print(a.shape)  # torch.Size([2, 3])
# Check whether the tensor is on the GPU
print(a.is_cuda)  # False
# Check which device stores the tensor
print(a.device)  # cpu
# Check the tensor's gradient; None if there is none
print(a.grad)  # None
# To compute gradients, requires_grad must be set to True
print(a.requires_grad)  # False
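One detail that follows from the integer dtype above: only floating-point (and complex) tensors can require gradients, so the `int64` tensor has to be converted first. The sketch below illustrates this; the names `failed` and `a_float` are illustrative.

```python
import torch

a = torch.tensor([[1, 3, 1],
                  [4, 5, 4]])  # torch.int64

# Asking an integer tensor to track gradients raises a RuntimeError.
try:
    a.requires_grad_(True)
    failed = False
except RuntimeError as err:
    failed = True
    print(err)

# Converting to a floating-point dtype first works.
a_float = a.to(torch.float32).requires_grad_(True)
print(a_float.requires_grad)
```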
Automatic Differentiation of Tensor
Setting the `requires_grad` attribute of a `torch.Tensor` to True makes PyTorch track all operations on that tensor. Once the computation is complete, calling the `.backward()` method computes all gradients automatically, and the gradient for that tensor is accumulated into its `.grad` attribute.
- `.data`: the data of the tensor itself
- `.grad`: the accumulated sum of gradients for the tensor
- `.grad_fn`: how `.grad` is obtained (which function produced the tensor; `None` if you created the tensor yourself)
# Create a tensor
x = torch.tensor(data=[[1, 2],
                       [2, 1]],
                 dtype=torch.float64,
                 requires_grad=True)
print(x.requires_grad)  # True
y = x ** 2 + 2
y = y.mean()
# When calling backward, y must be a scalar; otherwise the gradient argument of backward must be given.
y.backward()
print(x.grad)
"""
tensor([[0.5000, 1.0000],
[1.0000, 0.5000]], dtype=torch.float64)
"""
To illustrate how the gradient above is computed, define $\boldsymbol{x}=\left[ \begin{matrix} x_1 & x_2 \\ x_3 & x_4 \end{matrix} \right]$ and
$$y=f\left( \boldsymbol{x} \right)=f\left( x_1,x_2,x_3,x_4 \right)=\frac{x_1^2+x_2^2+x_3^2+x_4^2}{4}+2$$
The gradient of $y$ with respect to $\boldsymbol{x}$ is therefore:
$$\frac{d y}{d \boldsymbol{x}}=\left[ \begin{matrix} \frac{\partial y}{\partial x_1} & \frac{\partial y}{\partial x_2} \\ \frac{\partial y}{\partial x_3} & \frac{\partial y}{\partial x_4} \end{matrix} \right]=\left[ \begin{matrix} \frac{x_1}{2} & \frac{x_2}{2} \\ \frac{x_3}{2} & \frac{x_4}{2} \end{matrix} \right]$$
Substituting $x.data=\left[ \begin{matrix} 1 & 2 \\ 2 & 1 \end{matrix} \right]$ into this formula gives the gradient
$$x.grad=\left[ \begin{matrix} \frac{1}{2} & \frac{2}{2} \\ \frac{2}{2} & \frac{1}{2} \end{matrix} \right]=\left[ \begin{matrix} 0.5000 & 1.0000 \\ 1.0000 & 0.5000 \end{matrix} \right]$$
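The closed form $\frac{dy}{d\boldsymbol{x}}=\boldsymbol{x}/2$ can be compared against autograd with a few lines. This is only a sanity check of the derivation above; the name `expected` is illustrative.

```python
import torch

x = torch.tensor([[1.0, 2.0],
                  [2.0, 1.0]], dtype=torch.float64, requires_grad=True)
y = (x ** 2 + 2).mean()
y.backward()

# Closed-form gradient from the derivation: each entry is x_i / 2.
expected = x.detach() / 2
print(x.grad)
print(torch.allclose(x.grad, expected))  # True
```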
There is a catch, however. Because gradients are accumulated in `.grad`, if several functions compute gradients with respect to the same data x, `.grad` holds the sum of all of them. `.grad` can be cleared with `Tensor.grad.data.zero_()`.
# Create a tensor
x = torch.tensor(data=[[1, 2],
                       [2, 1]],
                 dtype=torch.float64,
                 requires_grad=True)
print(x.requires_grad)  # True
z = 4 * x + 2
z = z.mean()
z.backward()
print(x.grad)
"""
tensor([[1., 1.],
        [1., 1.]], dtype=torch.float64)
"""
# Zero the gradient of x here; otherwise the gradients from different functions are added together.
# x.grad.data.zero_()
y = x ** 2 + 2
y = y.mean()
# When calling backward, y must be a scalar; otherwise the gradient argument of backward must be given.
y.backward()
print(x.grad)
"""
tensor([[1.5000, 2.0000],
        [2.0000, 1.5000]], dtype=torch.float64)
"""
In this example the gradient of x is accumulated from two functions, $z$ and $y$, where
$$z=f\left( \boldsymbol{x} \right)=f\left( x_1,x_2,x_3,x_4 \right)=x_1+x_2+x_3+x_4+2$$
$$y=f\left( \boldsymbol{x} \right)=f\left( x_1,x_2,x_3,x_4 \right)=\frac{x_1^2+x_2^2+x_3^2+x_4^2}{4}+2$$
At first, the `.grad` attribute of x comes from $z$, so
$$\frac{d z}{d \boldsymbol{x}}=\left[ \begin{matrix} \frac{\partial z}{\partial x_1} & \frac{\partial z}{\partial x_2} \\ \frac{\partial z}{\partial x_3} & \frac{\partial z}{\partial x_4} \end{matrix} \right]=\left[ \begin{matrix} 1 & 1 \\ 1 & 1 \end{matrix} \right]$$
This gradient does not depend on the value of $x.data=\left[ \begin{matrix} 1 & 2 \\ 2 & 1 \end{matrix} \right]$, so $x.grad=\left[ \begin{matrix} 1 & 1 \\ 1 & 1 \end{matrix} \right]$. Because the gradient of x is not cleared, after the gradient of $y$ with respect to $\boldsymbol{x}$ is computed, `x.grad` consists of two parts, the previous gradient plus the new gradient:
$$x.grad=\left[ \begin{matrix} 1 & 1 \\ 1 & 1 \end{matrix} \right]+\left[ \begin{matrix} 0.5000 & 1.0000 \\ 1.0000 & 0.5000 \end{matrix} \right]=\left[ \begin{matrix} 1.5000 & 2.0000 \\ 2.0000 & 1.5000 \end{matrix} \right]$$
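For comparison, here is a sketch of the same two backward passes with the gradient zeroed in between, so that each result reflects only one function. It uses `x.grad.zero_()`, which has the same effect here as the `x.grad.data.zero_()` call mentioned earlier; `grad_z` and `grad_y` are illustrative names.

```python
import torch

x = torch.tensor([[1.0, 2.0],
                  [2.0, 1.0]], dtype=torch.float64, requires_grad=True)

z = (4 * x + 2).mean()
z.backward()
grad_z = x.grad.clone()  # gradient of z alone: all ones

x.grad.zero_()           # clear the accumulated gradient in place

y = (x ** 2 + 2).mean()
y.backward()
grad_y = x.grad.clone()  # gradient of y alone: x / 2

print(grad_z)
print(grad_y)
```

In a training loop the same role is usually played by the optimizer's `zero_grad()` call before each backward pass.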
Disabling gradient tracking
Sometimes computations do not need to be tracked, for example when evaluating on the test set (nothing is tracked) or when freezing weights (only some parameters are tracked).
This is implemented by setting the `.requires_grad` attribute of tensors that do not need gradients to False.
In PyTorch, if a tensor's `requires_grad` is True, its gradient is computed automatically during backpropagation. When `requires_grad` is False, no gradient is computed for it during backpropagation, which can save a substantial amount of GPU or host memory. The default value of `requires_grad` for a tensor is False.
The `requires_grad` value of a tensor can be changed as follows:
# Change requires_grad of tensor a to True in place (a must be a floating-point tensor)
a.requires_grad = True
a.requires_grad_(True)  # equivalent to the line above; note the trailing underscore
For the test set, use `with torch.no_grad():`. Every tensor computed inside this block has its `.requires_grad` attribute automatically set to False.
import torch
x = torch.randn(3, 4, requires_grad=True)
y = torch.randn(3, 4, requires_grad=True)
u = x + y
print(u.requires_grad)  # True
print(u.grad_fn)  # <AddBackward0 object at 0x000001D0DC709700>
with torch.no_grad():
    w = x + y
    print(w.requires_grad)  # False
    print(w.grad_fn)  # None
print(w.requires_grad)  # False (w stays untracked after the block ends)
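Applied to evaluation, the pattern might look like the sketch below. The model and batch are hypothetical stand-ins (`nn.Linear(4, 2)` and random inputs), not part of the original example.

```python
import torch
from torch import nn

model = nn.Linear(4, 2)     # hypothetical stand-in for a trained model
inputs = torch.randn(8, 4)  # hypothetical test batch

model.eval()                # switch layers such as dropout to evaluation behaviour
with torch.no_grad():       # no graph is recorded inside this block
    outputs = model(inputs)

print(outputs.requires_grad)  # False
print(outputs.grad_fn)        # None
```

The model's parameters still have `requires_grad=True`; only the results computed inside the block are untracked.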
The `with` statement is suited to accessing resources: it guarantees that the necessary cleanup runs and the resource is released, whether or not an exception occurs during use.
Statement:
with open("1.txt") as file:
    data = file.read()
How `with` works:
- After the expression following `with` is evaluated, the `__enter__` method of the resulting object is called, and its return value is assigned to the variable after `as`;
- When the body of the `with` statement has finished executing, the `__exit__` method of that object is called.
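The two steps above can be seen with a minimal hand-written context manager. The `Tracker` class is purely illustrative; it records the order of calls and shows that `__exit__` runs even when the body raises.

```python
class Tracker:
    """Toy context manager that records when it is entered and exited."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self  # this value is bound to the name after `as`

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append("exit")
        return False  # do not suppress exceptions raised in the body


tracker = Tracker()
try:
    with tracker as t:
        t.events.append("body")
        raise ValueError("something went wrong")
except ValueError:
    pass

print(tracker.events)  # ['enter', 'body', 'exit']
```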
To freeze weights, iterate over the parameters of the network and set `requires_grad` to False for the parameters to be frozen, leaving it True for the others.
# Iterate over the model's parameters
for name, para in model.named_parameters():
    # Freeze by name: every parameter outside the final fc layer (fc.weight, fc.bias) is frozen
    if "fc" not in name:
        para.requires_grad_(False)
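A self-contained version of this loop might look as follows. `TinyNet`, its layer sizes, and the SGD optimizer are assumptions made for illustration; only the name-based freezing comes from the snippet above.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical two-layer model with a final layer named fc."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 4)
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(torch.relu(self.backbone(x)))


model = TinyNet()
for name, para in model.named_parameters():
    if "fc" not in name:  # freeze everything except the fc layer
        para.requires_grad_(False)

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['fc.weight', 'fc.bias']

# Hand only the trainable parameters to the optimizer.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.1
)

loss = model(torch.randn(3, 8)).sum()
loss.backward()
print(model.backbone.weight.grad)  # None: frozen parameters receive no gradient
```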
PyTorch computation graph
PyTorch uses a dynamic graph mechanism, so during training a new computation graph is built in every iteration. The computation graph represents the relationships between variables in the program.
Dynamic computation means the program executes in the order in which we write the commands. This makes debugging easier and makes it easier to turn ideas into working code. Static computation means the program first builds the structure of the neural network when it is compiled, and only then performs the corresponding operations. In theory, static computation lets the compiler optimize more aggressively, but it also widens the gap between what you expect the program to do and what is actually executed, so errors in the code are harder to find (for example, a problem in the structure of the computation graph may only surface when the corresponding operation is executed). Although static graphs should in theory outperform dynamic graphs, in practice this is often not the case.
A computation graph is a directed acyclic graph that describes operations. It consists of nodes and edges, where nodes represent data and edges represent operations.
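A small sketch of what "dynamic" means in practice: ordinary Python control flow decides the shape of the graph on each run. The helper `repeated_product` is a hypothetical function written for this illustration.

```python
import torch


def repeated_product(x, n):
    """Multiply x by itself n more times; the graph depth depends on n."""
    y = x
    for _ in range(n):
        y = y * x
    return y  # equals x ** (n + 1)


x = torch.tensor(2.0, requires_grad=True)

y = repeated_product(x, 2)  # builds a graph for x ** 3
y.backward()
grad_n2 = x.grad.item()     # 3 * x**2 = 12

x.grad.zero_()
y = repeated_product(x, 3)  # a new, deeper graph for x ** 4
y.backward()
grad_n3 = x.grad.item()     # 4 * x**3 = 32

print(grad_n2, grad_n3)
```

Each call rebuilds the graph from scratch, which is why a different loop count needs no special handling.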
Take $y=(x_1+x_2)(x_2-x_3-x_4)$ as an example. Its computation graph is shown in the figure below, where $x_1,x_2,x_3,x_4$ are leaf nodes.
By the chain rule, the derivative of the root node with respect to a leaf node is the sum of the derivatives along all paths from the root node to that leaf node. (This also explains why `.grad` is accumulated.)
For example, write $f=x_1+x_2$ and $g=x_2-x_3-x_4$, so that $y=f\cdot g$. There are two paths from $y$ to $x_2$, therefore
$$\frac{\partial y}{\partial x_2}=\frac{\partial y}{\partial f}\frac{\partial f}{\partial x_2}+\frac{\partial y}{\partial g}\frac{\partial g}{\partial x_2}$$
If $(x_1,x_2,x_3,x_4)=(1,2,3,4)$, then $f=3$, $g=-5$, and $\frac{\partial y}{\partial x_2}$ satisfies:
$$\frac{\partial y}{\partial x_2}=\frac{\partial y}{\partial f}\frac{\partial f}{\partial x_2}+\frac{\partial y}{\partial g}\frac{\partial g}{\partial x_2}=g \cdot 1+f \cdot 1=-5+3=-2$$
import torch
# Create the tensors
x_1 = torch.tensor(data=[1.], requires_grad=True)
x_2 = torch.tensor(data=[2.], requires_grad=True)
x_3 = torch.tensor(data=[3.], requires_grad=True)
x_4 = torch.tensor(data=[4.], requires_grad=True)
f = x_1 + x_2
g = x_2 - x_3 - x_4
y = f * g
# y.retain_grad()
y.backward()
print(x_2.grad)
# Check which tensors are leaf nodes
print(x_1.is_leaf, x_2.is_leaf, x_3.is_leaf, x_4.is_leaf, f.is_leaf, g.is_leaf, y.is_leaf)
# Output: True True True True False False False
# Check the gradients
print(x_1.grad, x_2.grad, x_3.grad, x_4.grad, f.grad, g.grad, y.grad)
# Output: tensor([-5.]) tensor([-2.]) tensor([-3.]) tensor([-3.]) None None None
`torch.Tensor` has an attribute `is_leaf` that indicates whether the tensor is a leaf node. Leaf nodes are the foundation of the whole computation graph: both building the graph and backpropagating through it ultimately depend on them. The main purpose of distinguishing leaf nodes is to save memory: once backpropagation finishes, the gradients of non-leaf nodes are released.
At run time you may see the warning: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed.
It means: the .grad attribute of a tensor that is not a leaf tensor is being accessed, and its .grad attribute will not be populated during autograd.backward(). If you really want gradients for non-leaf tensors, use .retain_grad() on them. If you accessed the non-leaf tensor by mistake, make sure you access the leaf tensor instead.
For the three non-leaf tensors $f$, $g$, $y$, if you want to use their `.grad` attribute, call `.retain_grad()` on them before `backward` to keep their gradients.
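As a sketch of that last point, the example can be rerun with `.retain_grad()` called on the three non-leaf tensors before `backward`. The expected values follow from $y=f\cdot g$ with $f=3$ and $g=-5$.

```python
import torch

x_1 = torch.tensor([1.0], requires_grad=True)
x_2 = torch.tensor([2.0], requires_grad=True)
x_3 = torch.tensor([3.0], requires_grad=True)
x_4 = torch.tensor([4.0], requires_grad=True)

f = x_1 + x_2
g = x_2 - x_3 - x_4
y = f * g

# Ask autograd to keep the gradients of these non-leaf tensors.
for node in (f, g, y):
    node.retain_grad()

y.backward()

# dy/df = g = -5, dy/dg = f = 3, dy/dy = 1
print(f.grad, g.grad, y.grad)  # tensor([-5.]) tensor([3.]) tensor([1.])
```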
Some details of backward
Form:
tensor.backward(gradient, retain_graph)
The computation graph built by PyTorch is dynamic. To save memory, the graph is freed after each call to `backward`, so calling `backward` more than once on the same graph raises an error. Setting the flag `retain_graph=True` keeps the graph so that it is not freed.
import torch
x = torch.randn(4, 4, requires_grad=True)
y = 3 * x + 2
y = torch.sum(y)
y.backward(retain_graph=True)  # retain_graph=True keeps the graph from being freed immediately
y.backward()  # no error
y.backward()  # error
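To observe the failure without crashing a script, the last call can be wrapped in `try`/`except`. This sketch swaps in `x ** 2` (an assumption made for illustration) because its backward pass needs the saved input, so the freed graph is detected reliably; `raised` and `grad_after_two` are illustrative names.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = (x ** 2).sum()

y.backward(retain_graph=True)  # graph is kept
y.backward()                   # works, and the graph is freed afterwards
grad_after_two = x.grad.clone()  # two passes accumulated: 2 * (2 * x)

try:
    y.backward()               # the saved tensors are gone by now
    raised = False
except RuntimeError as err:
    raised = True
    print(err)

print(grad_after_two)  # tensor([4., 8.])
```

Note that the two successful passes also accumulate into `x.grad`, as described earlier.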
Also, all the gradient computations above assume that y is a scalar. When y is not a scalar but a multi-dimensional tensor, the `gradient` argument must be passed to `backward`. So why is this parameter needed, and how is it used?
Here is an example. Suppose $Y=\left[ \begin{array}{c} y_1 \\ y_2 \end{array} \right]$, $W=\left[ \begin{matrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{matrix} \right]$, $X=\left[ \begin{array}{c} x_1 \\ x_2 \end{array} \right]$, with $Y=WX$ and $A=f(Y)$, where the specific definition of the function $f(y_1,y_2)$ is unknown.
$$\begin{gathered} \left[ \begin{array}{c} y_1 \\ y_2 \end{array} \right]=\left[ \begin{matrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{matrix} \right]\left[ \begin{array}{c} x_1 \\ x_2 \end{array} \right] \\ A=f(y_1,y_2) \end{gathered}$$
Written in scalar form:
$$\begin{gathered} y_1=w_{11}x_1+w_{12}x_2 \\ y_2=w_{21}x_1+w_{22}x_2 \\ A=f(y_1,y_2) \end{gathered}$$
From this we need the partial derivatives of $A$ with respect to $x_1,x_2$, that is, $\frac{\partial A}{\partial x_1}$ and $\frac{\partial A}{\partial x_2}$. By the chain rule for composite functions,
$$\begin{aligned} \frac{\partial A}{\partial x_1}&=\frac{\partial A}{\partial y_1}\frac{\partial y_1}{\partial x_1}+\frac{\partial A}{\partial y_2}\frac{\partial y_2}{\partial x_1} \\ \frac{\partial A}{\partial x_2}&=\frac{\partial A}{\partial y_1}\frac{\partial y_1}{\partial x_2}+\frac{\partial A}{\partial y_2}\frac{\partial y_2}{\partial x_2} \end{aligned}$$
These two equations can be written as a matrix product:
$$\left[ \frac{\partial A}{\partial x_1},\frac{\partial A}{\partial x_2} \right]=\left[ \frac{\partial A}{\partial y_1},\frac{\partial A}{\partial y_2} \right]\left[ \begin{matrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \end{matrix} \right]$$
where
$$\left[ \begin{matrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \end{matrix} \right]$$
is called the Jacobian matrix. It can be obtained from the known conditions.
So as long as the value of $\left[ \frac{\partial A}{\partial y_1},\frac{\partial A}{\partial y_2} \right]$ is known, $\left[ \frac{\partial A}{\partial x_1},\frac{\partial A}{\partial x_2} \right]$ can be computed even without knowing the specific form of $f(y_1,y_2)$. The remaining question is where $\left[ \frac{\partial A}{\partial y_1},\frac{\partial A}{\partial y_2} \right]$ comes from.
It is supplied through the `gradient` parameter of the `backward` function in PyTorch.
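In other words, `y.backward(gradient=v)` computes the product of the row vector `v` with the Jacobian. A minimal sketch of that claim, with arbitrary illustrative values for `w`, `v`, and `x`, compares three routes: `backward` with `gradient`, the scalar `sum(v * y)`, and an explicit Jacobian from `torch.autograd.functional.jacobian`.

```python
import torch

w = torch.tensor([[2.0, 0.0],
                  [1.0, 3.0]], dtype=torch.float64)  # illustrative weights
v = torch.tensor([1.0, 10.0], dtype=torch.float64)   # plays the role of dA/dy

# Route 1: pass v as the gradient argument.
x = torch.tensor([1.0, 1.0], dtype=torch.float64, requires_grad=True)
y = w @ x
y.backward(gradient=v)
grad_backward = x.grad.clone()

# Route 2: reduce to a scalar sum(v * y) and call backward without arguments.
x2 = x.detach().clone().requires_grad_(True)
(v * (w @ x2)).sum().backward()
grad_scalar = x2.grad.clone()

# Route 3: build the Jacobian dy/dx explicitly and multiply by v.
jac = torch.autograd.functional.jacobian(lambda t: w @ t, x.detach())
grad_explicit = v @ jac

print(grad_backward, grad_scalar, grad_explicit)  # all equal to [12., 30.]
```

Autograd never materializes the Jacobian in the first two routes, which is why the `gradient` form scales to large tensors.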
Here is a worked calculation. Let $W=\left[ \begin{matrix} 1 & 2 \\ 3 & 4 \end{matrix} \right]$; then the Jacobian matrix is
$$\left[ \begin{matrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \end{matrix} \right]=\left[ \begin{matrix} 1 & 2 \\ 3 & 4 \end{matrix} \right]$$
(Here it happens to equal $W$, and the Jacobian does not depend on $x_1,x_2$; this will not necessarily be the case in other situations.)
So whatever values $x_1,x_2$ take, the Jacobian is fixed. Passing `torch.tensor([0.1, 0.2], dtype=torch.float)` as `gradient` then gives:
$$\left[ \frac{\partial A}{\partial x_1},\frac{\partial A}{\partial x_2} \right]=\left[ \frac{\partial A}{\partial y_1},\frac{\partial A}{\partial y_2} \right]\left[ \begin{matrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \end{matrix} \right]=\left[ 0.1,0.2 \right]\left[ \begin{matrix} 1 & 2 \\ 3 & 4 \end{matrix} \right]=\left[ 0.7,1.0 \right]$$
The code implementation is as follows:
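# The implementation uses column vectors, so the computed gradient is the transpose of the theoretical row vector
import torch
x = torch.tensor(data=[[1], [2]], dtype=torch.float64, requires_grad=True)  # define the input variable
w = torch.tensor([[1, 2],
                  [3, 4]], dtype=torch.float64)
y = torch.mm(w, x)  # matrix multiplication
# The gradient argument uses the same dtype as y
y.backward(gradient=torch.tensor([[0.1], [0.2]], dtype=torch.float64))
print(x.grad.data)
# Output:
"""
tensor([[0.7000],
        [1.0000]], dtype=torch.float64)
"""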
Here is another example, this time with three-dimensional input and output:
$$\begin{gathered} y_1=x_1+x_2+x_3 \\ y_2=x_1x_2+x_3 \\ y_3=x_1-x_2x_3 \\ A=f\left(y_1, y_2, y_3\right) \end{gathered}$$
The matrix form of the derivative of this multivariate composite function is:
$$\left[\frac{\partial A}{\partial x_1}, \frac{\partial A}{\partial x_2}, \frac{\partial A}{\partial x_3}\right]=\left[\frac{\partial A}{\partial y_1}, \frac{\partial A}{\partial y_2}, \frac{\partial A}{\partial y_3}\right]\left[\begin{array}{lll} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \frac{\partial y_1}{\partial x_3} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \frac{\partial y_2}{\partial x_3} \\ \frac{\partial y_3}{\partial x_1} & \frac{\partial y_3}{\partial x_2} & \frac{\partial y_3}{\partial x_3} \end{array}\right]$$
Assume the `gradient` argument passed in is `torch.tensor([0.1, 0.2, 0.3], dtype=torch.float)` and that $x_1=1,x_2=2,x_3=3$. The theoretical gradient is:
$$\left[ \frac{\partial A}{\partial x_1},\frac{\partial A}{\partial x_2},\frac{\partial A}{\partial x_3} \right]=\left[ \frac{\partial A}{\partial y_1},\frac{\partial A}{\partial y_2},\frac{\partial A}{\partial y_3} \right]\left[ \begin{matrix} 1 & 1 & 1 \\ x_2 & x_1 & 1 \\ 1 & -x_3 & -x_2 \end{matrix} \right]=\left[ 0.1,0.2,0.3 \right]\left[ \begin{matrix} 1 & 1 & 1 \\ 2 & 1 & 1 \\ 1 & -3 & -2 \end{matrix} \right]=\left[ 0.8,-0.6,-0.3 \right]$$
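The Jacobian and the product above can be cross-checked numerically. This sketch builds the three outputs with `torch.stack` inside a hypothetical helper `outputs` and lets `torch.autograd.functional.jacobian` evaluate the matrix at $(1,2,3)$.

```python
import torch


def outputs(x):
    """The three functions y_1, y_2, y_3 from the example, stacked into one vector."""
    return torch.stack([
        x[0] + x[1] + x[2],
        x[0] * x[1] + x[2],
        x[0] - x[1] * x[2],
    ])


x = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([0.1, 0.2, 0.3])  # plays the role of dA/dy

jac = torch.autograd.functional.jacobian(outputs, x)
print(jac)      # rows: [1, 1, 1], [2, 1, 1], [1, -3, -2]
print(v @ jac)  # approximately [0.8, -0.6, -0.3]
```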
The corresponding code is as follows:
import torch
x = torch.tensor([1, 2, 3], requires_grad=True, dtype=torch.float)
y = torch.randn(3)
y[0] = x[0] + x[1] + x[2]
y[1] = x[0] * x[1] + x[2]
y[2] = x[0] - x[1] * x[2]
y.backward(torch.tensor([0.1, 0.2, 0.3], dtype=torch.float))
print(x.grad)
# Output:
"""
tensor([ 0.8000, -0.6000, -0.3000])
"""