PyTorch Learning Series Tutorial: How Tensor Implements Automatic Differentiation

Guide

Today's article continues the PyTorch learning series. Although the first few posts in this series have not been read very widely (perhaps because the series takes a new direction), one must stick to the path one has chosen!

The previous post introduced the basic process of building a deep learning model. After a few epochs, a simple handwritten-digit classification model was trained, and it performed reasonably well. An important detail in that process is how the model learns its optimal parameters: the answer is gradient descent. Gradient descent is an optimization method used throughout deep learning and can fairly be called its cornerstone. This article does not set out to explain gradient descent itself; instead it focuses on how Tensor implements automatic differentiation, because only by understanding this process can we go on to understand the principles behind the various gradient descent methods.

To explain how Tensor implements automatic differentiation, this article approaches it from two angles, theoretical analysis and hands-on code:

  • Automatic differentiation in Tensor: gradient-related attributes, forward propagation, and backpropagation

  • Hands-on exploration: using linear regression as an example to walk through the automatic differentiation process

01 Automatic differentiation in Tensor

Tensor is the basic data structure in PyTorch and the cornerstone of deep learning; it is essentially a multi-dimensional array. An earlier post already mentioned that when creating a Tensor you can specify whether it needs a gradient. So what difference does requires_grad make? Setting this parameter to True or False directly determines whether the Tensor supports automatic differentiation and participates in subsequent gradient updates. Concretely, the relationships among the gradient-related attributes of the Tensor data structure are as follows:

[Figure: class diagram of the gradient-related attributes of Tensor — an original rough sketch, just to convey the idea]

The class diagram above is meant to convey the following points:

  • In the Tensor data structure, the core attribute is data, which stores the multi-dimensional array the Tensor represents (despite the name, it can actually have any number of dimensions, starting from 0);

  • Two attributes are controlled by the requires_grad parameter: grad and grad_fn. The former holds the gradient of the current Tensor, while the latter records the backward function of the operation that produced the current Tensor. When requires_grad=False, grad and grad_fn are None and will never take a value; when requires_grad=True, they still start out as None, but grad is populated during subsequent backpropagation and grad_fn is set on Tensors produced by operations in the graph;

  • backward() can be called directly only on a scalar Tensor, i.e. a Tensor with dimension 0; for a non-scalar Tensor a gradient argument must be supplied (a quick sketch follows this list);

  • is_leaf marks the position of the current Tensor in the constructed computation graph, which can be viewed as a directed acyclic graph (DAG) or a tree structure. When a Tensor is an initial node of the graph it is a leaf node and is_leaf=True; otherwise is_leaf=False.
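A minimal sketch of the scalar restriction on backward() mentioned above (the variable names t and out are purely illustrative):

import torch

t = torch.tensor([1.0, 2.0], requires_grad=True)
out = t * 3                                 # out is a non-scalar (1-D) Tensor

# out.backward()                            # would raise: grad can be implicitly created only for scalar outputs
out.backward(gradient=torch.ones_like(out)) # supplying a gradient argument makes it work
print(t.grad)                               # tensor([3., 3.])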

"Currently, autograd is supported only for floating point Tensor types (half, float, double and bfloat16) and complex Tensor types (cfloat, cdouble)." — quoted from the official PyTorch documentation
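The dtype restriction can be checked directly; a minimal sketch (the exact error wording may differ between PyTorch versions):

import torch

a = torch.tensor([1.0, 2.0], requires_grad=True)    # floating point dtype: OK
try:
    b = torch.tensor([1, 2], requires_grad=True)     # integer dtype: not supported
except RuntimeError as e:
    print(e)  # only floating point and complex dtypes can require gradients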

Having understood the attributes and methods above, how does a Tensor actually perform automatic differentiation? This involves two important concepts: forward propagation and backpropagation. Put simply, if each layer of a neural network is viewed as one function in a chain of mappings (f1, f2, ..., fn), then:

  • Forward propagation carries out the computation on the data and, at the same time, builds the computation graph according to the computation flow:

    $y = f_n(f_{n-1}(\cdots f_2(f_1(x)) \cdots))$

  • Backpropagation recursively computes and assigns gradients in the reverse direction of the constructed computation graph (the chain rule):

    $\dfrac{\partial y}{\partial x} = \dfrac{\partial f_n}{\partial f_{n-1}} \cdot \dfrac{\partial f_{n-1}}{\partial f_{n-2}} \cdots \dfrac{\partial f_1}{\partial x}$

Described graphically, the process looks like this:

[Figure: forward propagation builds the computation graph; backpropagation traverses it in reverse]

During forward propagation, the computation proceeds from the initial inputs (typically the training data plus the network weights) to the final output (typically the loss), and the computation graph is built along the way. During backpropagation, calling loss.backward() recursively differentiates each level in the reverse direction of the computation graph (essentially the chain rule of differentiation). Tensors with requires_grad=False are not differentiated or updated during backpropagation; the corresponding branch of the reverse chain is simply cut off.
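To make the chain rule concrete, here is a tiny sketch with two chained functions; the names z, f1 and f2 are just illustrative:

import torch

z = torch.tensor(2.0, requires_grad=True)

# Forward propagation: compute the outputs and build the graph f2(f1(z))
f1 = z ** 2          # f1 = z^2
f2 = 3 * f1 + 1      # f2 = 3*z^2 + 1

# Backpropagation: d(f2)/dz = d(f2)/d(f1) * d(f1)/dz = 3 * 2z = 12 at z = 2
f2.backward()
print(z.grad)        # tensor(12.)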

It is also worth adding that in early versions of PyTorch, the data type designed to support automatic differentiation was Variable, referring specifically to the parameters to be optimized in the network. Variable was a thin wrapper around Tensor dedicated to gradient computation. Later, Variable was gradually deprecated: its automatic differentiation functionality was merged into Tensor, and the requires_grad attribute is now used to distinguish whether a Tensor supports differentiation. This design is clearly more reasonable, and more convenient and unified to use.

[Figure: the Variable type, now consigned to history]

02 Automatic differentiation practice with Tensor

Here we take simple univariate linear regression as an example to demonstrate the automatic differentiation process of Tensor.

1. Create training data x, y and initial weights w, b

import torch

# Training data; the target is to fit the linear regression y = 2*x + 3
x = torch.tensor([1., 2.])
y = torch.tensor([5., 7.])
# Initial weights: w = 1.0, b = 0.0
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

At this point, check the gradient-related attributes of w, b and of x, y; the results are as follows:

# 1. Note: x and y have requires_grad=False (the default)
x.grad, x.grad_fn, x.is_leaf, y.grad, y.grad_fn, y.is_leaf
# Output: (None, None, True, None, None, True)


# 2. The initial gradients of w and b are both None, and both are leaf nodes
w.grad, w.grad_fn, w.is_leaf, b.grad, b.grad_fn, b.is_leaf
# Output: (None, None, True, None, None, True)

2. Build the computation step by step to perform forward propagation

# Carry out the computation step by step to perform forward propagation
wx = w * x
wx_b = wx + b
loss = (wx_b - y)
loss2 = loss ** 2
loss2_sum = sum(loss2)

View the gradient-related properties of each intermediate variable:

# 1. Check whether each is a leaf node
wx.is_leaf, wx_b.is_leaf, loss.is_leaf, loss2.is_leaf, loss2_sum.is_leaf
# Output: (False, False, False, False, False)


# 2. Check grad
wx.grad, wx_b.grad, loss.grad, loss2.grad, loss2_sum.grad
# Triggers a UserWarning:
# UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at  aten\src\ATen/core/TensorBody.h:417.)
  return self._grad
# Output: (None, None, None, None, None)
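As the warning suggests, if you do want the gradient of an intermediate (non-leaf) Tensor to be kept, you can call .retain_grad() on it before backward(). A self-contained sketch, separate from the w/b example above (the names a, m and out are illustrative):

a = torch.tensor(2.0, requires_grad=True)
m = a * 3            # m is a non-leaf, intermediate Tensor
m.retain_grad()      # ask autograd to keep m's gradient after backward()
out = m ** 2
out.backward()
print(m.grad)        # tensor(12.)  -> d(out)/dm = 2*m = 12
print(a.grad)        # tensor(36.)  -> d(out)/da = 2*m*3 = 36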


# 3. Check grad_fn
wx.grad_fn, wx_b.grad_fn, loss.grad_fn, loss2.grad_fn, loss2_sum.grad_fn
# Output:
(<MulBackward0 at 0x23a875ee550>,
 <AddBackward0 at 0x23a875ee8e0>,
 <SubBackward0 at 0x23a875ee4c0>,
 <PowBackward0 at 0x23a875ee490>,
 <AddBackward0 at 0x23a93dde040>)
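These grad_fn objects are chained together, which is how the computation graph is stored: each grad_fn references the grad_fn of its inputs through its next_functions attribute. A quick, hedged peek (the exact objects and addresses will of course differ):

wx_b.grad_fn.next_functions
# Roughly: ((<MulBackward0 ...>, 0), (<AccumulateGrad ...>, 0))
# MulBackward0 comes from wx = w*x; AccumulateGrad is the node that deposits the gradient into the leaf b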

3. Call backward() on the final loss to perform backpropagation

loss2_sum.backward()

Check the gradients of the intermediate variables and of the initial inputs in turn:

# 1. The gradients of intermediate variables (non-leaf nodes) are only used during backpropagation and are not exposed
wx.grad, wx_b.grad, loss.grad, loss2.grad, loss2_sum.grad
# Output: (None, None, None, None, None)


# 2. Check whether the leaf nodes received gradients: w and b did; x and y do not support differentiation and remain None
w.grad, b.grad, x.grad, y.grad
# Output: (tensor(-28.), tensor(-18.), None, None)

At this point, through the computation graph built during forward propagation and the gradient propagation of backpropagation, the gradients of the initial weight parameters have been assigned. Note that w and b are the network parameters to be optimized; once they have gradients, they can be further updated by applying gradient descent.
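As a minimal sketch of that next step, here is a single plain gradient-descent update applied by hand (the learning rate of 0.1 is an arbitrary choice, and this modifies w and b in place):

lr = 0.1
with torch.no_grad():      # update the weights outside the computation graph
    w -= lr * w.grad       # w: 1.0 -> 1.0 - 0.1*(-28.) = 3.8
    b -= lr * b.grad       # b: 0.0 -> 0.0 - 0.1*(-18.) = 1.8
w.grad.zero_()             # clear the accumulated gradients before the next forward/backward pass
b.grad.zero_()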

Going one step further, how were the values of w.grad and b.grad obtained? Let us derive them manually. First, write down the partial derivatives of the loss with respect to w and b:

$\text{loss} = \sum_i (w \cdot x_i + b - y_i)^2$

$\dfrac{\partial\,\text{loss}}{\partial w} = \sum_i 2\,(w \cdot x_i + b - y_i)\,x_i, \qquad \dfrac{\partial\,\text{loss}}{\partial b} = \sum_i 2\,(w \cdot x_i + b - y_i)$

Then plug in the two training samples (x, y) = (1, 5) and (x, y) = (2, 7), and sum the gradients contributed by the two samples:

# Gradient of w: 2*(1*1 + 0 - 5)*1 + 2*(1*2 + 0 - 7)*2 = -28
# Gradient of b: 2*(1*1 + 0 - 5) + 2*(1*2 + 0 - 7) = -18

Clearly, the manual results match the results demonstrated above.

Note: when multiple training samples (a batch) participate in one backpropagation, the resulting parameter gradient is the sum of the gradients computed on each individual sample, as the sketch below verifies.
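This summing behavior can be checked by running backward() on each sample separately; gradients then accumulate across the two backward() calls. A small sketch with fresh copies of the weights (w2, b2 and the loop variables are illustrative names):

w2 = torch.tensor(1.0, requires_grad=True)
b2 = torch.tensor(0.0, requires_grad=True)

for xi, yi in [(1., 5.), (2., 7.)]:
    loss_i = (w2 * xi + b2 - yi) ** 2
    loss_i.backward()          # gradients accumulate into w2.grad and b2.grad

print(w2.grad, b2.grad)        # tensor(-28.) tensor(-18.) — same as the batch result above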

[Figure: excerpt from the official PyTorch documentation]



Origin blog.csdn.net/weixin_43841688/article/details/123492607