Significantly reduce GPU memory usage: The Reversible Residual Network

Prologue:

  The recent Reformer paper from Google AI received high review scores at ICLR 2020. It makes two main changes to the currently popular Transformer: one is locality-sensitive hashing (LSH); the other is replacing the standard residual network with a reversible residual network. This article focuses on the second change, the reversible residual network. We start from backpropagation in neural networks, move on to the standard residual network, and then transition naturally to the reversible residual network. After reading this article, you should have a clear picture of how this line of network architecture developed.

1. Background introduction

  Today's neural networks are almost all trained with backpropagation. Backpropagation needs to store the network's intermediate results in order to compute gradients, so its memory consumption is proportional to the number of network units. The deeper and wider the network, the more memory it consumes, which becomes a bottleneck for many applications. Because GPU memory is limited, it is often impossible to train the ideal network structure, since some architectures can reach depths of thousands of layers; and training across multiple GPUs is expensive and complicated, which makes it impractical for individual researchers.

[Figure: torchsummary output for a network, listing the parameter count and the forward/backward pass size]
  The figure above is a screenshot of torchsummary output. The forward/backward pass size is the memory needed for the intermediate results that must be saved, and it clearly accounts for most of the GPU memory. If the intermediate layers' results did not have to be stored, GPU memory usage could be reduced dramatically, making it possible to train deeper and wider networks. Aidan N. Gomez and Mengye Ren of the University of Toronto proposed the reversible residual network, in which the activations of the current layer can be computed from those of the next layer. In other words, given the final outputs of the network, the intermediate results of every earlier layer can be recovered, so we only need to store the network's parameters and the outputs of the last layer. The storage for activations then no longer depends on network depth, which greatly reduces memory usage. Perhaps surprisingly, the experiments show that the reversible residual network loses essentially no performance and is on par with the standard residual network.
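For readers who want to reproduce a summary like the one in the screenshot, here is a minimal sketch (the choice of torchvision's resnet18 and the input size are my own assumptions, not necessarily the model in the figure):

```python
from torchvision.models import resnet18
from torchsummary import summary

# Build any model; resnet18 is only a stand-in here.
model = resnet18()

# Prints per-layer output shapes and parameter counts, plus estimates of
# "Params size (MB)" and "Forward/backward pass size (MB)"; the latter is
# the memory taken by intermediate activations stored for backpropagation.
summary(model, input_size=(3, 224, 224), device="cpu")
```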

  If you have forgotten many of the calculation details, don't worry: we will work through BP backpropagation and the standard residual network step by step, so that by the end everything fits together. First, let's recall the chain rule for multivariate composite functions:

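In its usual form (written out here because it is used throughout the derivations below): for z = f(u, v) with u = u(x) and v = v(x),

dz/dx = (∂z/∂u) · (du/dx) + (∂z/∂v) · (dv/dx)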

2. Backpropagation (BP) in Neural Networks

[Figure: a fully connected network with three input nodes (x1, x2, x3), one hidden layer, and two output nodes]

Notation:

x1, x2, x3: the three input-layer nodes

w^t_{ji}: the weight from layer t-1 to layer t, where j indexes the j-th node in layer t and i indexes the i-th node in layer t-1

a^t_i: the output of the i-th node in layer t after activation

g(x): the activation function

Forward propagation:

<Hidden layer>

[Formula: weighted inputs and activations of the hidden layer]

<Output layer>

[Formula: weighted inputs and activations of the output layer]
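In the notation above, both layers take the same standard form (a reconstruction of the usual equations, with b^t_j the bias of node j in layer t and z^t_j its weighted input):

z^t_j = Σ_i w^t_{ji} · a^{t-1}_i + b^t_j,    a^t_j = g(z^t_j),    for t = 2 (hidden) and t = 3 (output),

with a^1_i = x_i for the input layer.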

Backpropagation:

Take a single sample as an example: the input vector is [x1, x2, x3], the target output is [y1, y2], and the cost function is denoted L. The idea of backpropagation is to propagate the overall output error backward through the network, compute the gradient at each layer's nodes, and update each layer's weights w and biases b by gradient descent; this is how the network learns. The advantage of error backpropagation is that the otherwise complicated derivative calculation can be expressed as a series of recurrences, which simplifies the computation.

 We use the squared error to walk through backpropagation; the cost function is expressed as follows:

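For a single sample, the squared-error cost is commonly written with a factor of 1/2 for convenience (the exact constant in the original figure may differ):

L = 1/2 · [ (y1 - a^3_1)^2 + (y2 - a^3_2)^2 ]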

Using the chain rule, we work backwards to express the derivatives of L with respect to the hidden -> output weights and the input -> hidden weights:

[Formula: chain-rule expansion of the derivatives of L with respect to the hidden -> output and input -> hidden weights]

 We now introduce a shorthand for part of these derivatives, called the neural unit error:

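In its standard form, the neural unit error of node j in layer l is the derivative of the cost with respect to that node's weighted input (z^l_j as defined in the forward pass above):

δ^l_j = ∂L/∂z^l_j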

Here l = 2, 3 indexes the layer and j indexes a node within that layer. Substituting δ into the expressions above gives:

[Formula: the chain-rule expressions rewritten in terms of δ]

So we can summarize the general calculation formula:

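In its usual general form:

∂L/∂w^l_{ji} = δ^l_j · a^{l-1}_i,    ∂L/∂b^l_j = δ^l_j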

As the formula above shows, once the neural unit errors δ are known, the partial derivatives of the total error with respect to every layer's weights w and biases b follow immediately, and the parameters can then be optimized by gradient descent.

Solving for δ at each layer:

Output layer:

[Formula: δ for the output layer]

Hidden layer:

[Formula: δ for the hidden layer, expressed in terms of the output-layer δ]

In other words, from the output layer's neural unit errors δ we can directly obtain the hidden layer's neural unit errors, skipping the tedious differentiation for the hidden layer. More generally:

[Formula: general relation between the neural unit errors of layer l and layer l+1]

This gives the relation between the neural unit errors of layer l and layer l+1, and it is the heart of the error backpropagation algorithm: once the output layer's neural unit errors are known, those of every other layer follow from the formula above without computing any further partial derivatives.
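In their standard form (assuming the squared-error cost above), these relations read:

δ^3_j = (a^3_j - y_j) · g'(z^3_j)    (output layer)

δ^l_j = ( Σ_k w^{l+1}_{kj} · δ^{l+1}_k ) · g'(z^l_j)    (any earlier layer l)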

 

3. Residual Network

A residual network mainly addresses two problems: 1) the vanishing-gradient problem; 2) the network-degradation problem. Its structure is shown below:

[Figure: a residual block consisting of a two-layer network, with a skip connection that adds the input X to F(X) before the final ReLU]

The structure above is a residual block built from a two-layer network. A residual block can have 2, 3, or even more layers, but with only a single layer it reduces to a linear transformation and is of little use. The figure can be written as the formula:

F(x) = W_2 · ReLU(W_1 · x)

Before the second layer's ReLU, F(x) + x is formed as the new input. The skip connection is also called an identity mapping: when the block's input is x, its output can still be x, so the block only needs to learn F(x) = 0.

Why add an extra x, instead of letting the model learn F(x) = x directly?

  Because making F(x) = 0 is easy: initializing the weights W to values very close to 0 already pushes the output toward 0, and if the first layer's output is negative, the ReLU maps it to 0 as well, so many parameter settings yield F(x) = 0. Making F(x) = x, by contrast, is genuinely hard, because the parameters must be exactly right for the output to equal x.
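As an illustration, here is a minimal sketch of such a block in PyTorch (fully connected layers and the feature width are my own choices; the block in the figure may use convolutions):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two-layer residual block: output = ReLU(F(x) + x), with F(x) = W2 @ ReLU(W1 @ x)."""

    def __init__(self, dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, dim)
        self.w2 = nn.Linear(dim, dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        f_x = self.w2(self.relu(self.w1(x)))  # F(x)
        return self.relu(f_x + x)             # identity shortcut added before the final ReLU

x = torch.randn(4, 16)
print(ResidualBlock(16)(x).shape)  # torch.Size([4, 16])
```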

What does the identity mapping buy us?

  The identity mapping addresses network degradation. As networks get deeper, accuracy can actually drop, which suggests there is an optimal depth for a given task: too deep or too shallow both hurt accuracy. With the identity mapping available, the network can learn which layers are redundant and pass through them losslessly, so in theory extra depth no longer hurts accuracy, solving the degradation problem.

Why does this help with the vanishing-gradient problem?

  Consider a concrete structure of two residual blocks, each made up of a two-layer network, as shown below:

[Figure: two stacked residual blocks, each containing a two-layer network]

 

Suppose the ReLU activation is written as g(x), the training example is [X1, Y1] (input X1, target Y1), and the loss is again the squared loss. Then each layer's computation is as follows:

[Formula: layer-by-layer forward computation through the two residual blocks]

Now differentiate the loss with respect to the weights of the first residual block. By the chain rule:

[Formula: chain-rule derivative of the loss with respect to the first block's weights]

Notice the extra +1 terms in the derivative: they turn the pure product of factors in the ordinary chain rule into a sum, which effectively prevents the gradient from vanishing.
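The effect is easiest to see on a single block. For an output y = x + F(x) (a simplified one-block view, not the full two-block expression in the figure):

dL/dx = dL/dy · dy/dx = dL/dy · (1 + dF/dx)

Even if dF/dx is very small, the gradient reaching x is still roughly dL/dy, so stacking many blocks no longer multiplies many small factors together.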

 

4. Reversible Residual Network

1) Reversible block structure

 A reversible network splits the units of each layer into two halves, x1 and x2; each reversible block takes (x1, x2) as input and produces (y1, y2) as output. Its structure is as follows:

Forward computation (diagram):

[Figure: forward computation of a reversible block]

In formulas:

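As given in the referenced paper, with F and G the two residual functions:

y1 = x1 + F(x2)
y2 = x2 + G(y1)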

 

Inverse computation (diagram):

[Figure: inverse computation of a reversible block]

In formulas:

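Again following the paper, the inputs are recovered from the outputs in the reverse order:

x2 = y2 - G(y1)
x1 = y1 - F(x2)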

Here F and G are residual functions like the ones in the residual network above. Unlike ordinary residual blocks, reversible blocks can only have stride 1 and must be connected one directly after another: if some other kind of layer is inserted between them, information is lost and the inverse computation is no longer possible. If a structure similar to a residual network is really needed, with ordinary layers in the middle, then the activations of that middle part must be stored explicitly.
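A minimal PyTorch sketch of such a block, with F and G as small fully connected residual functions (my own choice; the paper uses convolutional sub-networks), shows that the inputs can indeed be recovered from the outputs:

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Minimal reversible block: y1 = x1 + F(x2), y2 = x2 + G(y1)."""

    def __init__(self, dim: int):
        super().__init__()
        # F and G are small residual functions; real RevNets use conv layers here.
        self.F = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.G = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x1, x2):
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Recover the inputs from the outputs; no stored activations needed.
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return x1, x2

block = ReversibleBlock(8)
x1, x2 = torch.randn(4, 8), torch.randn(4, 8)
y1, y2 = block(x1, x2)
r1, r2 = block.inverse(y1, y2)
print(torch.allclose(x1, r1, atol=1e-6), torch.allclose(x2, r2, atol=1e-6))  # True True
```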

2) Backpropagation without storing activations

   To state the backpropagation steps more cleanly, we rewrite the forward and inverse computations above as follows:

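Following the paper, an explicit intermediate node z1 is introduced:

Forward:  z1 = x1 + F(x2),   y1 = z1,   y2 = x2 + G(z1)
Inverse:  z1 = y1,   x2 = y2 - G(z1),   x1 = z1 - F(x2)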

  Although z1 and y1 have the same value, they are different nodes in the computation graph, so their total derivatives in backpropagation differ: the derivative with respect to z1 includes the indirect contribution that flows through y2, while the derivative with respect to y1 receives no such contribution.

  During backpropagation, each block is first given the activations (y1, y2) of its output and the corresponding total derivatives (dL/dy1, dL/dy2); from these it must compute the inputs (x1, x2), their derivatives (dL/dx1, dL/dx2), and the total derivatives of the weight parameters inside the residual functions F and G. The solution steps are as follows:

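In outline (using the notation above; wF and wG are the parameters of F and G, and J^T · v denotes the vector-Jacobian product that autograd computes):

1. z1 = y1,   x2 = y2 - G(z1),   x1 = z1 - F(x2)    (reconstruct the inputs)
2. dL/dz1 = dL/dy1 + (∂G/∂z1)^T · dL/dy2
3. dL/dx2 = dL/dy2 + (∂F/∂x2)^T · dL/dz1,   dL/dx1 = dL/dz1
4. dL/dwF = (∂F/∂wF)^T · dL/dz1,   dL/dwG = (∂G/∂wG)^T · dL/dy2

A minimal PyTorch sketch of this per-block backward step, reusing the hypothetical ReversibleBlock above and letting autograd on the recomputed sub-graphs supply the vector-Jacobian products:

```python
import torch

def reversible_block_backward(block, y1, y2, dy1, dy2):
    """Sketch of one block's backward pass without stored inputs.

    block is the ReversibleBlock defined earlier; y1, y2 are its outputs and
    dy1, dy2 the gradients of the loss with respect to them. The parameter
    gradients of F and G are accumulated into the block's .grad fields.
    """
    with torch.enable_grad():
        # Step 1: reconstruct the inputs, rebuilding the local graphs of G and F.
        z1 = y1.detach().requires_grad_(True)
        g_z1 = block.G(z1)
        x2 = (y2 - g_z1).detach().requires_grad_(True)
        f_x2 = block.F(x2)
        x1 = (z1 - f_x2).detach()

        # Step 2: gradient at z1 = direct term dy1 + contribution through y2.
        torch.autograd.backward(g_z1, grad_tensors=dy2)  # fills z1.grad and G's param grads
        dz1 = dy1 + z1.grad

        # Steps 3 and 4: gradient at x2 and the parameter gradients of F.
        torch.autograd.backward(f_x2, grad_tensors=dz1)  # fills x2.grad and F's param grads
        dx2 = dy2 + x2.grad
        dx1 = dz1  # x1 enters only through z1, additively

    return x1, x2.detach(), dx1, dx2
```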

 3) Computational overhead

  For a network with N connections, the theoretical cost of a forward pass is about N multiply-add operations, and backpropagation costs about 2N (it involves both the composite-function derivatives and the multiplications for the weight gradients). A reversible network needs one extra step to recompute the inputs, so its theoretical cost is about 4N, roughly 33% more than an ordinary network. In practice, however, the forward and backward passes cost about the same on a GPU, each roughly N; in that case an ordinary network costs about 2N overall while a reversible network costs about 3N, roughly 50% more.

 

Reference paper: The Reversible Residual Network: Backpropagation Without Storing Activations




Original article: https://www.cnblogs.com/gczr/p/12181354.html