Automatic differentiation

import torch
torch.__version__
'1.7.0+cu101'

Using PyTorch to calculate gradients

PyTorch's autograd module implements the backpropagation of derivatives used in deep learning algorithms. For all operations on tensors (the Tensor class), autograd can automatically compute their derivatives, which removes the tedious process of deriving gradients by hand.

In versions prior to 0.4, PyTorch used the Variable class to compute all gradients automatically. The Variable class has three main attributes: data, which holds the Tensor wrapped by the Variable; grad, which holds the gradient corresponding to data (grad is itself a Variable rather than a Tensor, and has the same shape as data); and grad_fn, which points to a Function object used to backpropagate the gradient to the inputs.

Since 0.4, Variable has been officially merged into the Tensor class, and the automatic differentiation that used to require wrapping tensors in a Variable is now built into Tensor itself. Although Variable(tensor) still works for backward compatibility, the wrapping is effectively a no-op.

Therefore, new code should operate on the Tensor class directly, since the official documentation marks Variable as deprecated.
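A quick check shows the wrapper really is a no-op now (a minimal sketch; the import remains only for backward compatibility):

from torch.autograd import Variable

t = torch.ones(2, 2)
v = Variable(t)                      # wrapping does nothing in modern PyTorch
print(type(v))                       # <class 'torch.Tensor'>
print(isinstance(v, torch.Tensor))   # True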

To use autograd through the Tensor class itself, you only need to set .requires_grad=True on the tensor.

The grad and grad_fn attributes of the old Variable class are now attributes of the Tensor class.
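For a tensor that already exists, gradient tracking can also be switched on in place with the requires_grad_() method (a small sketch for illustration):

a = torch.ones(2, 2)          # requires_grad defaults to False
print(a.requires_grad)        # False
a.requires_grad_(True)        # enable gradient tracking in place
print(a.requires_grad)        # True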

Autograd

When creating a tensor, setting the requires_grad flag to True tells PyTorch that the tensor needs automatic differentiation; PyTorch then records every operation performed on the tensor and can compute the gradients automatically.

x = torch.rand((5,5),requires_grad=True)
x
tensor([[0.1075, 0.1141, 0.7433, 0.7210, 0.8306],
        [0.2503, 0.9014, 0.1539, 0.9115, 0.5156],
        [0.7588, 0.6406, 0.2270, 0.9964, 0.5961],
        [0.1173, 0.3876, 0.1362, 0.4134, 0.2740],
        [0.0655, 0.6979, 0.7824, 0.1312, 0.7737]], requires_grad=True)
y = torch.rand(5,5,requires_grad=True)
y
tensor([[0.7243, 0.3929, 0.8745, 0.0246, 0.2233],
        [0.6284, 0.5150, 0.9777, 0.0754, 0.4568],
        [0.9365, 0.7749, 0.2340, 0.6136, 0.5249],
        [0.6051, 0.5057, 0.5160, 0.3853, 0.3365],
        [0.0239, 0.8699, 0.3889, 0.7705, 0.0522]], requires_grad=True)

PyTorch automatically tracks and records all operations on these tensors. When the computation is finished, calling the .backward() method computes the gradients automatically and stores them in the grad attribute of each input tensor.

z = torch.sum(x+y)
z

tensor(24.3645, grad_fn=<SumBackward0>)
print(y.grad_fn)
None

After an operation is applied to a tensor, the result's grad_fn is assigned a new function: it references the Function object that created this tensor. Tensors and Functions are connected into an acyclic graph that records and encodes the complete computation history. Every tensor has a .grad_fn attribute; if the tensor was created manually by the user, its grad_fn is None.

Let's call the backpropagation function to calculate its gradient

Simple automatic differentiation

z.backward()
print(x.grad,y.grad)
tensor([[1.3425, 2.7768, 2.9883, 1.7514, 2.1914],
        [2.4705, 1.3789, 1.7443, 1.5527, 1.9504],
        [2.3451, 2.3237, 1.6954, 1.3253, 2.3724],
        [1.3260, 1.2323, 1.4379, 1.9817, 2.3667],
        [1.6207, 2.3254, 2.7262, 2.0970, 2.6398]]) tensor([[3.0319, 3.0911, 1.4558, 3.8858, 1.3954],
        [1.1199, 1.2671, 1.2118, 2.4410, 1.1631],
        [1.0528, 1.5557, 3.2638, 2.8477, 1.1546],
        [1.0346, 3.0870, 1.0037, 1.0129, 1.1856],
        [2.4351, 1.2002, 2.9536, 1.8786, 2.7736]])

If the tensor is a scalar (that is, it holds exactly one element), you do not need to pass any argument to backward(); if it has more elements, you must pass a gradient argument whose shape matches the tensor. The z.backward() above is shorthand for z.backward(torch.tensor(1.)). A scalar output like this often appears in single-label image classification, where the network ends up producing a single scalar value.
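The equivalence can be checked directly; a fresh pair of tensors (the names x2, y2, z2 are just for this sketch) keeps the gradients separate from the ones accumulated above:

x2 = torch.rand(5, 5, requires_grad=True)
y2 = torch.rand(5, 5, requires_grad=True)
z2 = torch.sum(x2 + y2)
z2.backward(torch.tensor(1.))   # identical to z2.backward() for a scalar output
print(x2.grad)                  # all ones: d(sum(x2 + y2))/dx2 is 1 for every element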

Complex automatic differentiation

x = torch.rand(5,5,requires_grad=True)
y = torch.rand(5,5,requires_grad=True)
z = x**2+y**3
z
tensor([[0.5867, 1.3712, 1.0476, 1.0846, 0.4027],
        [0.5485, 0.0625, 0.1572, 0.4093, 0.2385],
        [0.4546, 0.5178, 0.7764, 0.5098, 0.4825],
        [0.0278, 0.5937, 0.0480, 0.2412, 0.4824],
        [0.4272, 0.4564, 1.2704, 0.4593, 1.1268]], grad_fn=<AddBackward0>)
# our return value is not a scalar, so we need to pass a tensor of the same size as an argument; here we use ones_like to create a tensor shaped like x
z.backward(torch.ones_like(x))
print(x.grad)
tensor([[0.3425, 1.7768, 1.9883, 0.7514, 1.1914],
        [1.4705, 0.3789, 0.7443, 0.5527, 0.9504],
        [1.3451, 1.3237, 0.6954, 0.3253, 1.3724],
        [0.3260, 0.2323, 0.4379, 0.9817, 1.3667],
        [0.6207, 1.3254, 1.7262, 1.0970, 1.6398]])

We can use the with torch.no_grad() context manager to temporarily disable automatic differentiation for tensors that have requires_grad=True. This is often used when evaluating a model, e.g. when computing accuracy on the test set:

with torch.no_grad():
  print((x**2).requires_grad)
print(x.requires_grad)
False
True

Inside a torch.no_grad() block, operations are not tracked in the computation graph; not saving that history reduces memory usage and speeds up the computation slightly.
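A minimal sketch of the test-set use case mentioned above; model and test_loader are hypothetical placeholders for an already trained network and its DataLoader:

# evaluate accuracy without building a computation graph
correct = 0
total = 0
with torch.no_grad():
    for inputs, labels in test_loader:   # hypothetical DataLoader
        outputs = model(inputs)          # hypothetical model, output shape (batch, num_classes)
        predicted = outputs.argmax(dim=1)
        correct += (predicted == labels).sum().item()
        total += labels.size(0)
print('accuracy:', correct / total)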

Autograd process analysis

To illustrate how PyTorch's automatic differentiation works, let's poke around its internals a little. Although PyTorch's Tensor and TensorBase are implemented in C++, Python tools can still be used to inspect the attributes and state of these objects. Python's dir() returns a list of an object's attributes and methods; z is a Tensor, so let's see what it contains.

dir(z)
['T',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_priority__',
 '__array_wrap__',
 '__bool__',
 '__class__',
 '__complex__',
 '__contains__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__div__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__idiv__',
 '__ifloordiv__',
 '__ilshift__',
 '__imul__',
 '__index__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__irshift__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__long__',
 '__lshift__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pow__',
 '__radd__',
 '__rdiv__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rfloordiv__',
 '__rmul__',
 '__rpow__',
 '__rshift__',
 '__rsub__',
 '__rtruediv__',
 '__setattr__',
 '__setitem__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__torch_function__',
 '__truediv__',
 '__weakref__',
 '__xor__',
 '_backward_hooks',
 '_base',
 '_cdata',
 '_coalesced_',
 '_dimI',
 '_dimV',
 '_grad',
 '_grad_fn',
 '_indices',
 '_is_view',
 '_make_subclass',
 '_nnz',
 '_update_names',
 '_values',
 '_version',
 'abs',
 'abs_',
 'absolute',
 'absolute_',
 'acos',
 'acos_',
 'acosh',
 'acosh_',
 'add',
 'add_',
 'addbmm',
 'addbmm_',
 'addcdiv',
 'addcdiv_',
 'addcmul',
 'addcmul_',
 'addmm',
 'addmm_',
 'addmv',
 'addmv_',
 'addr',
 'addr_',
 'align_as',
 'align_to',
 'all',
 'allclose',
 'amax',
 'amin',
 'angle',
 'any',
 'apply_',
 'arccos',
 'arccos_',
 'arccosh',
 'arccosh_',
 'arcsin',
 'arcsin_',
 'arcsinh',
 'arcsinh_',
 'arctan',
 'arctan_',
 'arctanh',
 'arctanh_',
 'argmax',
 'argmin',
 'argsort',
 'as_strided',
 'as_strided_',
 'as_subclass',
 'asin',
 'asin_',
 'asinh',
 'asinh_',
 'atan',
 'atan2',
 'atan2_',
 'atan_',
 'atanh',
 'atanh_',
 'backward',
 'baddbmm',
 'baddbmm_',
 'bernoulli',
 'bernoulli_',
 'bfloat16',
 'bincount',
 'bitwise_and',
 'bitwise_and_',
 'bitwise_not',
 'bitwise_not_',
 'bitwise_or',
 'bitwise_or_',
 'bitwise_xor',
 'bitwise_xor_',
 'bmm',
 'bool',
 'byte',
 'cauchy_',
 'ceil',
 'ceil_',
 'char',
 'cholesky',
 'cholesky_inverse',
 'cholesky_solve',
 'chunk',
 'clamp',
 'clamp_',
 'clamp_max',
 'clamp_max_',
 'clamp_min',
 'clamp_min_',
 'clip',
 'clip_',
 'clone',
 'coalesce',
 'conj',
 'contiguous',
 'copy_',
 'cos',
 'cos_',
 'cosh',
 'cosh_',
 'count_nonzero',
 'cpu',
 'cross',
 'cuda',
 'cummax',
 'cummin',
 'cumprod',
 'cumsum',
 'data',
 'data_ptr',
 'deg2rad',
 'deg2rad_',
 'dense_dim',
 'dequantize',
 'det',
 'detach',
 'detach_',
 'device',
 'diag',
 'diag_embed',
 'diagflat',
 'diagonal',
 'digamma',
 'digamma_',
 'dim',
 'dist',
 'div',
 'div_',
 'divide',
 'divide_',
 'dot',
 'double',
 'dtype',
 'eig',
 'element_size',
 'eq',
 'eq_',
 'equal',
 'erf',
 'erf_',
 'erfc',
 'erfc_',
 'erfinv',
 'erfinv_',
 'exp',
 'exp2',
 'exp2_',
 'exp_',
 'expand',
 'expand_as',
 'expm1',
 'expm1_',
 'exponential_',
 'fft',
 'fill_',
 'fill_diagonal_',
 'fix',
 'fix_',
 'flatten',
 'flip',
 'fliplr',
 'flipud',
 'float',
 'floor',
 'floor_',
 'floor_divide',
 'floor_divide_',
 'fmod',
 'fmod_',
 'frac',
 'frac_',
 'gather',
 'gcd',
 'gcd_',
 'ge',
 'ge_',
 'geometric_',
 'geqrf',
 'ger',
 'get_device',
 'grad',
 'grad_fn',
 'greater',
 'greater_',
 'greater_equal',
 'greater_equal_',
 'gt',
 'gt_',
 'half',
 'hardshrink',
 'has_names',
 'heaviside',
 'heaviside_',
 'histc',
 'hypot',
 'hypot_',
 'i0',
 'i0_',
 'ifft',
 'imag',
 'index_add',
 'index_add_',
 'index_copy',
 'index_copy_',
 'index_fill',
 'index_fill_',
 'index_put',
 'index_put_',
 'index_select',
 'indices',
 'int',
 'int_repr',
 'inverse',
 'irfft',
 'is_coalesced',
 'is_complex',
 'is_contiguous',
 'is_cuda',
 'is_distributed',
 'is_floating_point',
 'is_leaf',
 'is_meta',
 'is_mkldnn',
 'is_nonzero',
 'is_pinned',
 'is_quantized',
 'is_same_size',
 'is_set_to',
 'is_shared',
 'is_signed',
 'is_sparse',
 'isclose',
 'isfinite',
 'isinf',
 'isnan',
 'isneginf',
 'isposinf',
 'isreal',
 'istft',
 'item',
 'kthvalue',
 'layout',
 'lcm',
 'lcm_',
 'le',
 'le_',
 'lerp',
 'lerp_',
 'less',
 'less_',
 'less_equal',
 'less_equal_',
 'lgamma',
 'lgamma_',
 'log',
 'log10',
 'log10_',
 'log1p',
 'log1p_',
 'log2',
 'log2_',
 'log_',
 'log_normal_',
 'log_softmax',
 'logaddexp',
 'logaddexp2',
 'logcumsumexp',
 'logdet',
 'logical_and',
 'logical_and_',
 'logical_not',
 'logical_not_',
 'logical_or',
 'logical_or_',
 'logical_xor',
 'logical_xor_',
 'logit',
 'logit_',
 'logsumexp',
 'long',
 'lstsq',
 'lt',
 'lt_',
 'lu',
 'lu_solve',
 'map2_',
 'map_',
 'masked_fill',
 'masked_fill_',
 'masked_scatter',
 'masked_scatter_',
 'masked_select',
 'matmul',
 'matrix_exp',
 'matrix_power',
 'max',
 'maximum',
 'mean',
 'median',
 'min',
 'minimum',
 'mm',
 'mode',
 'movedim',
 'mul',
 'mul_',
 'multinomial',
 'multiply',
 'multiply_',
 'mv',
 'mvlgamma',
 'mvlgamma_',
 'name',
 'names',
 'nanquantile',
 'nansum',
 'narrow',
 'narrow_copy',
 'ndim',
 'ndimension',
 'ne',
 'ne_',
 'neg',
 'neg_',
 'negative',
 'negative_',
 'nelement',
 'new',
 'new_empty',
 'new_full',
 'new_ones',
 'new_tensor',
 'new_zeros',
 'nextafter',
 'nextafter_',
 'nonzero',
 'norm',
 'normal_',
 'not_equal',
 'not_equal_',
 'numel',
 'numpy',
 'orgqr',
 'ormqr',
 'outer',
 'output_nr',
 'permute',
 'pin_memory',
 'pinverse',
 'polygamma',
 'polygamma_',
 'pow',
 'pow_',
 'prelu',
 'prod',
 'put_',
 'q_per_channel_axis',
 'q_per_channel_scales',
 'q_per_channel_zero_points',
 'q_scale',
 'q_zero_point',
 'qr',
 'qscheme',
 'quantile',
 'rad2deg',
 'rad2deg_',
 'random_',
 'real',
 'reciprocal',
 'reciprocal_',
 'record_stream',
 'refine_names',
 'register_hook',
 'reinforce',
 'relu',
 'relu_',
 'remainder',
 'remainder_',
 'rename',
 'rename_',
 'renorm',
 'renorm_',
 'repeat',
 'repeat_interleave',
 'requires_grad',
 'requires_grad_',
 'reshape',
 'reshape_as',
 'resize',
 'resize_',
 'resize_as',
 'resize_as_',
 'retain_grad',
 'rfft',
 'roll',
 'rot90',
 'round',
 'round_',
 'rsqrt',
 'rsqrt_',
 'scatter',
 'scatter_',
 'scatter_add',
 'scatter_add_',
 'select',
 'set_',
 'sgn',
 'sgn_',
 'shape',
 'share_memory_',
 'short',
 'sigmoid',
 'sigmoid_',
 'sign',
 'sign_',
 'signbit',
 'sin',
 'sin_',
 'sinh',
 'sinh_',
 'size',
 'slogdet',
 'smm',
 'softmax',
 'solve',
 'sort',
 'sparse_dim',
 'sparse_mask',
 'sparse_resize_',
 'sparse_resize_and_clear_',
 'split',
 'split_with_sizes',
 'sqrt',
 'sqrt_',
 'square',
 'square_',
 'squeeze',
 'squeeze_',
 'sspaddmm',
 'std',
 'stft',
 'storage',
 'storage_offset',
 'storage_type',
 'stride',
 'sub',
 'sub_',
 'subtract',
 'subtract_',
 'sum',
 'sum_to_size',
 'svd',
 'symeig',
 't',
 't_',
 'take',
 'tan',
 'tan_',
 'tanh',
 'tanh_',
 'to',
 'to_dense',
 'to_mkldnn',
 'to_sparse',
 'tolist',
 'topk',
 'trace',
 'transpose',
 'transpose_',
 'triangular_solve',
 'tril',
 'tril_',
 'triu',
 'triu_',
 'true_divide',
 'true_divide_',
 'trunc',
 'trunc_',
 'type',
 'type_as',
 'unbind',
 'unflatten',
 'unfold',
 'uniform_',
 'unique',
 'unique_consecutive',
 'unsafe_chunk',
 'unsafe_split',
 'unsafe_split_with_sizes',
 'unsqueeze',
 'unsqueeze_',
 'values',
 'var',
 'vdot',
 'view',
 'view_as',
 'where',
 'zero_']

That returns a lot. Ignoring Python's special methods (names starting and ending with __) and private members (names starting with _), let's look at a few key attributes. .is_leaf records whether the tensor is a leaf node, and this attribute tells us what kind of variable it is. The "graph leaves" or "leaf variables" mentioned in the official documentation are variables such as x and y that were created manually rather than computed; we call them created variables. Variables like z, obtained as the result of a computation, are called result variables.

# whether a variable is a created variable or a result variable can be checked with .is_leaf
print(x.is_leaf)
print(z.is_leaf)
True
False

x was created manually and is not the result of any computation, so it is regarded as a leaf node, i.e. a created variable; z was obtained through a series of computations on x and y, so it is not a leaf node: it is a result variable.

Why does executing z.backward() update x.grad and y.grad? The .grad_fn attribute records the operations involved. Although the .backward() method is also implemented in C++, we can explore it a bit from Python.

grad_fn: records and encodes the complete computation history

z.grad_fn
<AddBackward0 at 0x7fbad990a400>

grad_fn is an object of type AddBackward0. AddBackward0 is also written in C++, but its name tells us roughly that it is the backward of an addition. Let's see what's inside it.

dir(z.grad_fn)
['__call__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_register_hook_dict',
 'metadata',
 'name',
 'next_functions',
 'register_hook',
 'requires_grad']
z.grad_fn.next_functions  # the key part
((<PowBackward0 at 0x7fbad990a5c0>, 0), (<PowBackward0 at 0x7fbad990a2e8>, 0))

Why are there two tuples? Because our operation was z = x**2 + y**3: the AddBackward0 we just saw is the addition, and the operations before it are the powers, PowBackward0. The first element of the tuple is the operation record related to x.

xg = z.grad_fn.next_functions[0][0]
xg
<PowBackward0 at 0x7fbad990a5c0>
dir(xg)
['__call__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_register_hook_dict',
 'metadata',
 'name',
 'next_functions',
 'register_hook',
 'requires_grad']
x_leaf=xg.next_functions[0][0]
type(x_leaf)
AccumulateGrad
x_leaf.variable  # this is exactly the variable x we created
# In PyTorch's backward graph, the AccumulateGrad type represents a leaf node, i.e. a terminal node of the computation graph. The AccumulateGrad class has a .variable attribute that points to the leaf tensor.
tensor([[0.1712, 0.8884, 0.9942, 0.3757, 0.5957],
        [0.7352, 0.1895, 0.3721, 0.2764, 0.4752],
        [0.6725, 0.6619, 0.3477, 0.1627, 0.6862],
        [0.1630, 0.1161, 0.2189, 0.4909, 0.6834],
        [0.3103, 0.6627, 0.8631, 0.5485, 0.8199]], requires_grad=True)
print(id(x_leaf.variable) == id(x))
True

So the whole procedure is now clear:

  • When we execute z.backward(), the call uses the grad_fn attribute of z to carry out the derivative computation.
  • The engine traverses the next_functions of each grad_fn, takes out the Function found there, and performs its derivative computation. This is a recursive process that stops once the nodes reached are leaf nodes (of type AccumulateGrad).
  • Once a result is computed, it is saved to the grad attribute of the tensor (x or y) referenced by the corresponding .variable.
  • The derivation is then complete, and the grad of every leaf node has been updated.

In short, after we execute z.backward(), the grad values of x and y have been updated.
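As an illustration of this traversal, here is a minimal sketch (not PyTorch's actual engine, just a toy walker) that recursively follows next_functions from z.grad_fn and prints every node until it reaches the AccumulateGrad leaves:

def walk_graph(fn, depth=0):
    # a sketch: recursively print the backward graph; this is not the real autograd engine
    if fn is None:
        return
    print('  ' * depth + type(fn).__name__)
    for next_fn, _ in getattr(fn, 'next_functions', ()):
        walk_graph(next_fn, depth + 1)

walk_graph(z.grad_fn)
# expected shape of the output for z = x**2 + y**3:
# AddBackward0
#   PowBackward0
#     AccumulateGrad
#   PowBackward0
#     AccumulateGrad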

Extend Autograd

If you need to extend autograd with a new operation, you need to subclass the Function class, because Function is what autograd uses to compute results and gradients and to encode the operation history. The most important methods of the Function class are forward() and backward(), which implement forward propagation and backward propagation, respectively.

A custom Function requires the following three methods:

  • __init__ (optional): if the operation needs extra parameters, define a constructor for the Function; otherwise it can be omitted.
  • forward(): the code that performs the forward computation.
  • backward(): the code that computes gradients during backpropagation. The number of arguments equals the number of values returned by forward(), and each argument is the gradient flowing back into this operation.
# import Function so we can extend it
from torch.autograd.function import Function


# define an operation that multiplies a tensor by a constant (the inputs are a tensor and a constant)
# the methods must be static methods, so they are decorated with @staticmethod
class MulConstant(Function):
    @staticmethod
    def forward(ctx, tensor, constant):
        # ctx is used to store information, similar to self; attributes saved on ctx can be read in backward
        ctx.constant = constant
        return tensor * constant

    @staticmethod
    def backward(ctx, grad_output):
        # the returned values must correspond one-to-one to the inputs of forward:
        # the first input is the tensor (3x3 in the test below), the second is a constant
        # the gradient for the constant must be None
        return grad_output, None

After defining our new operation, let's test it:

a = torch.rand(3,3,requires_grad=True)
b = MulConstant.apply(a,5)
print(a,b)
tensor([[0.0843, 0.8616, 0.6903],
        [0.1071, 0.0301, 0.0123],
        [0.4286, 0.2768, 0.3485]], requires_grad=True) tensor([[0.4213, 4.3082, 3.4516],
        [0.5356, 0.1506, 0.0616],
        [2.1431, 1.3840, 1.7427]], grad_fn=<MulConstantBackward>)
# backward pass: the result is not a scalar, so the backward method needs a gradient argument
b.backward(torch.ones_like(a))
a.grad  # the gradient of a
tensor([[2., 2., 2.],
        [2., 2., 2.],
        [2., 2., 2.]])
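As a way to sanity-check a custom Function, torch.autograd.gradcheck compares the analytical backward against a numerical finite-difference gradient (a sketch; gradcheck wants double-precision inputs, and the constant is bound with a lambda):

from torch.autograd import gradcheck

inp = torch.rand(3, 3, dtype=torch.double, requires_grad=True)
ok = gradcheck(lambda t: MulConstant.apply(t, 5), (inp,),
               eps=1e-6, atol=1e-4, raise_exception=False)
print(ok)
# prints False for the backward defined above, which returns grad_output unscaled,
# while the numerical gradient equals the constant; returning grad_output * ctx.constant
# in backward would make the check pass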

Origin blog.csdn.net/qq_49821869/article/details/113725340