A detailed explanation of PyTorch's "dynamic graph" and "automatic differentiation" mechanisms

foreword

As we all know, PyTorch is a very popular and well-received deep learning training framework, and much of that popularity comes from its two signature features: the "dynamic graph" and "automatic differentiation". The dynamic graph makes debugging PyTorch code very simple: every step and every intermediate result can be precisely controlled, inspected and printed, and the entire network can even be rebuilt at each iteration. This is very awkward to handle in static-graph training frameworks, where the whole network must be constructed first and only then does training start; if you want to print the data of intermediate nodes, or change the structure of the network a little during training, you need very convoluted operations, and sometimes it is simply impossible. "Automatic differentiation" means that when writing a deep learning network you only need to implement the forward pass of an operator, instead of implementing both the forward and the backward pass for the same operator, as Caffe requires. Since backpropagation is usually more complicated than forward propagation, and it is easy to make mistakes when deriving it by hand, automatic differentiation saves a great deal of work and improves efficiency.

dynamic graph

Anyone who has used Caffe or (graph-mode) TensorFlow knows that a neural network has to be built before training: Caffe describes it with a prototxt configuration file, and TensorFlow describes it with Python code. Before training, the framework parses this description and builds the network; only after construction is complete do data loading and training begin. The network generally does not change during training, which is why this style is called a "static graph". Obtaining the output of an intermediate variable is possible but troublesome: with Caffe's C++ training you have to fetch the layer's top blob and then print it, and TensorFlow requires fetching it through a session. Controlling how the network runs, for example stopping it right after a certain op, is harder still. In other words, you cannot precisely control every step of the execution; you can only wait for the run to finish and then pull out the relevant data through the framework's interfaces.

PyTorch's "dynamic graph" mechanism gives you very precise control over the network. Before the code runs, no so-called neural network is created at all: the network is described entirely by the forward function written in Python. That is, the forward function we write by hand is PyTorch's dynamic graph for the forward pass; whichever statement the code is currently executing is exactly the step the network has reached. So when you debug, break into, or modify the forward function, the neural network is correspondingly debugged, interrupted, or modified. Put differently, PyTorch's forward code is the execution flow of the neural network, i.e. PyTorch's "dynamic graph", and control over forward is control over the network. As shown below:

Because of this implementation mechanism, the debugging of the neural network can be debugged like ordinary python code, which is very convenient and friendly. And the structure of the network can be modified at any time, which is the benefit of the dynamic graph.
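To make this concrete, here is a minimal sketch of my own (not from the original post): the forward pass below contains ordinary Python control flow and a print of an intermediate tensor, and any breakpoint or edit in forward takes effect on the very next call.

import torch

class DynamicNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(8, 8)
        self.fc2 = torch.nn.Linear(8, 2)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        # Any intermediate tensor can be printed or inspected like a normal Python variable.
        print("intermediate mean:", h.mean().item())
        # Ordinary Python control flow is allowed; the "graph" is simply whichever path runs.
        if h.sum() > 0:
            h = h * 2
        return self.fc2(h)

net = DynamicNet()
out = net(torch.rand(1, 8))   # the network only "exists" while forward is executing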

automatic differentiation

The section above explained in detail how PyTorch "constructs" the forward dynamic network; in fact PyTorch never explicitly builds any dynamic graph for the forward pass, it simply runs through the forward code. Backpropagation is different: we never wrote any backward code, so PyTorch cannot follow a hand-written execution flow the way it does in the forward pass. How, then, does PyTorch carry out backpropagation precisely? The biggest secret is hidden in the grad_fn attribute of the tensor. Some readers may have stumbled upon this grad_fn attribute while debugging PyTorch code. As shown below:
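As a quick illustration, here is a throwaway snippet of my own (the exact class names and addresses depend on the PyTorch version):

import torch

x = torch.rand(2, 2, requires_grad=True)   # created directly, not computed from other tensors
y = x * 2                                   # produced by an op
z = y.sum()

print(x.grad_fn)   # None
print(y.grad_fn)   # e.g. <MulBackward0 object at 0x...>
print(z.grad_fn)   # e.g. <SumBackward0 object at 0x...>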

The tensor object in PyTorch has an attribute called grad_fn, which is essentially a node in a linked structure implemented under autograd in the PyTorch source code. This attribute records how the tensor was produced from the tensors that came before it. Before diving into grad_fn, let's first look at leaf tensors and non-leaf tensors in PyTorch.

Leaf / non-leaf tensors:

There are two ways a tensor comes into being in PyTorch. The first is to be created from scratch, such as the parameters inside some ops or the training images; these tensors are not computed from other tensors but are created directly through torch.zeros(), torch.ones(), torch.from_numpy(), and so on. The second is to be computed from another tensor through an op, for example tensor a producing tensor b through a conv. These two creation methods correspond to leaf nodes and non-leaf nodes respectively. The figure below shows the leaf and non-leaf nodes of a CNN: the yellow tensors are created from scratch and are leaf nodes, while the blue tensors are computed from other tensors and are non-leaf nodes. It is then clear that the grad_fn of a blue non-leaf node is non-None, because its gradient has to keep propagating backwards to the nodes that created it, whereas the grad_fn of a yellow leaf node is None, because it was not created by other nodes and its gradient does not need to propagate any further back.
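A small sketch of my own showing how is_leaf and grad_fn behave for the two kinds of tensors:

import torch

w = torch.rand(3, 3, requires_grad=True)   # created from scratch: a leaf tensor
img = torch.rand(3, 3)                     # also a leaf tensor (does not require grad)
out = img @ w                              # computed from other tensors: a non-leaf tensor

print(w.is_leaf, w.grad_fn)       # True  None
print(img.is_leaf, img.grad_fn)   # True  None
print(out.is_leaf, out.grad_fn)   # False, plus an MmBackward-style grad_fn (exact name depends on version)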

Digging into grad_fn:

grad_fn is a Python-level wrapper; its implementation corresponds to the Node object under autograd in the PyTorch source code, which is written in C++, as shown in the figure below:

Node is essentially a linked structure: it has a next_edges_ attribute that stores the edges pointing to all the nodes of the next level. Note that it is not a simple singly linked list, because a tensor can be created from several tensors; for example, for tensor a = tensor b + tensor c, the next edges in the grad_fn of tensor a contain two pointers, pointing to the grad_fn of tensor b and the grad_fn of tensor c respectively. At the Python layer the next_edges_ attribute is exposed as next_functions, so the precise statement is: next_functions in the grad_fn of tensor a points to the grad_fn of tensor b and the grad_fn of tensor c. With this complete chain of links, the computation graph of backpropagation is fully expressed and a complete backward pass can be carried out.

Below we use a small example to further illustrate how grad_fn expresses the backpropagation graph. First we define a very simple network with only a conv layer, a relu layer and a pool layer, as shown in the figure below (the conv layer has the parameters weights and bias):

The corresponding code snippet is as follows:

import torch

class TinyCnn(torch.nn.Module):
    def __init__(self):
        super(TinyCnn, self).__init__()
        self.conv = torch.nn.Conv2d(3, 3, kernel_size=2, stride=2)
        self.relu = torch.nn.ReLU(inplace=True)
        self.pool = torch.nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, images):
        conv_out = self.conv(images)
        relu_out = self.relu(conv_out)
        pool_out = self.pool(relu_out)
        return pool_out

cnn = TinyCnn()
# Note: BCELoss expects predictions in [0, 1]; this toy network does not guarantee that
# (normally a sigmoid would come first), but it is kept as-is so the grad_fn chain below
# matches the figure.
loss_fun = torch.nn.BCELoss()
images = torch.rand((1, 3, 4, 4))
labels = torch.rand((1, 3, 1, 1))
preds = cnn(images)     # (1, 3, 4, 4) -> conv -> (1, 3, 2, 2) -> pool -> (1, 3, 1, 1)
loss = loss_fun(preds, labels)
loss.backward()

Then, when execution reaches loss = loss_fun(preds, labels), let's look at the grad_fn of loss and its corresponding next_functions:
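One way to look at them is simply to print them; this is my own snippet, and the object names and addresses below will vary with the PyTorch version:

print(loss.grad_fn)
# e.g. <BinaryCrossEntropyBackward object at 0x...>
print(loss.grad_fn.next_functions[0])
# e.g. (<MaxPool2DWithIndicesBackward object at 0x...>, 0)
print(loss.grad_fn.next_functions[0][0].next_functions[0])
# e.g. (<ReluBackward1 object at 0x...>, 0)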

You can see that the grad_fn of loss is <BinaryCrossEntropyBackward object at 0x000001A07E079FC8>, and its next_functions is (<MaxPool2DWithIndicesBackward object at 0x000001A07E08BC88>, 0); continuing to follow it, the next_functions of MaxPool2DWithIndicesBackward is (<ReluBackward1 object at 0x000001A07E079B88>, 0). If you keep tracking downwards like this, the computation graph of the entire backward pass becomes very intuitive, as shown in the following figure:
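The chain can also be walked programmatically. Here is a small helper of my own (the function name and print format are arbitrary) that recursively follows next_functions:

def walk_backward_graph(fn, depth=0):
    # fn is a grad_fn node, an AccumulateGrad node, or None for a leaf that needs no grad
    if fn is None:
        print("  " * depth + "None")
        return
    print("  " * depth + type(fn).__name__)
    for next_fn, _input_idx in fn.next_functions:
        walk_backward_graph(next_fn, depth + 1)

walk_backward_graph(loss.grad_fn)
# For the TinyCnn above this prints roughly:
# BinaryCrossEntropyBackward
#   MaxPool2DWithIndicesBackward
#     ReluBackward1
#       ThnnConv2DBackward
#         None             <- the images input, a leaf that needs no gradient
#         AccumulateGrad   <- conv weight
#         AccumulateGrad   <- conv bias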

Since images is a leaf tensor that does not require a gradient, the first entry in the next_functions of ThnnConv2DBackward is None. The weights and bias of the conv are also leaf tensors, but they do require gradients, so each of them is given an AccumulateGrad node, meaning their gradients are accumulated; this is what actually stores the gradients of weights and bias.
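After loss.backward() has run, the effect of those AccumulateGrad nodes can be checked directly on the leaf tensors (again a snippet of my own):

print(images.grad)                  # None: a leaf that does not require grad
print(cnn.conv.weight.grad.shape)   # torch.Size([3, 3, 2, 2]): accumulated by AccumulateGrad
print(cnn.conv.bias.grad.shape)     # torch.Size([3])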

How is grad_fn generated?

With the introduction above, everyone should now have a general picture of how PyTorch's automatic differentiation works: it is organized through the tensor's grad_fn. grad_fn is essentially a linked structure whose entries point to the grad_fn nodes of the tensors that produced it, and this chain of links forms the complete dynamic graph of the backward computation. The last question, then, is how a tensor's grad_fn is constructed. Neither the high-level code we write ourselves nor the low-level op implementations in PyTorch create grad_fn explicitly, so when and how is it assembled? A clue can be found by building the PyTorch source code: PyTorch wraps every underlying operator a second time, and after the op's normal forward finishes, the wrapper adds the steps that set grad_fn and next_functions. The figure below compares the original forward of convolution with the automatically wrapped convolution forward generated by PyTorch; the extra code that sets up grad_fn is clearly visible.
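The generated wrappers themselves are C++ and hard to quote here, but the same machinery is exposed at the Python level through torch.autograd.Function: when you define a custom op this way, autograd wraps its forward and attaches the corresponding backward node to the output for you. A minimal runnable sketch of my own (not PyTorch's generated code):

import torch

class MySquare(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)       # remember what backward will need
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_out        # d(x*x)/dx = 2x

a = torch.rand(3, requires_grad=True)
b = MySquare.apply(a)
print(b.grad_fn)                 # e.g. <torch.autograd.function.MySquareBackward object at 0x...>
print(b.grad_fn.next_functions)  # e.g. ((<AccumulateGrad object at 0x...>, 0),)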

postscript

The process above is the core logic of PyTorch's "dynamic graph" and "automatic differentiation", based on an analysis of the PyTorch 1.6.0 source code. Given the author's limited knowledge and scope, some mistakes are inevitable; if something is wrong, please bear with me and point it out.

 

Origin: blog.csdn.net/cjnewstar111/article/details/113885096