ResNet Identity Mapping

Foreword

Since ResNet won the ImageNet competition in 2015, it has become hugely popular. Over the past two years, research on ResNet has kept growing, with extensions and improvements such as FractalNet, Wide ResNet, and DenseNet, which I will introduce one by one in later articles. The one I will introduce today is called Gated ResNet, from a paper published at ICLR this year. It is not as complicated as the networks mentioned above. After more careful thinking, it argues that identity mapping is the essence of both ResNet and the Highway Network, and proposes that if a deep network has the ability to degenerate into an identity mapping, then it is easy to optimize and performs well.

Essence: Identity Mapping

The residual block proposed in ResNet is successful for two reasons. First, its shortcut connection improves the flow of information. Second, the authors argue that for a stack of nonlinear layers, the optimal solution is often close to an identity mapping, and the shortcut connection makes it much easier for the block to become an identity mapping. To be honest, when I first read the original ResNet paper I did not fully understand the second point; its meaning only became clear after I read the paper discussed here.

Look at the figure below:
[Figure: a network (top row) and the same network with one extra layer stacked on top (bottom row)]

The network in the bottom row is the network in the top row with one additional layer stacked on top. If the weights of the new layer can learn to become an identity matrix I, then the two networks compute exactly the same function. In other words, if every newly stacked layer can learn an identity mapping, the deeper network will perform no worse than the original one; and if learning an identity mapping is easy, then a deeper network is more likely to achieve better performance. This is the root idea of ResNet and the focus of this paper.
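A tiny numerical illustration of this argument (hypothetical layer sizes, just to make the claim concrete): if the extra layer's weight matrix is the identity I, the deeper network's output is unchanged.

```python
import torch
import torch.nn as nn

# A "shallow" network and the same network with one extra linear layer stacked on top.
shallow = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
extra = nn.Linear(8, 8, bias=False)
nn.init.eye_(extra.weight)                     # the new layer's weights are the identity matrix I
deeper = nn.Sequential(shallow, extra)

x = torch.randn(4, 8)
assert torch.allclose(deeper(x), shallow(x))   # stacking an identity layer changes nothing
```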

Consider a convolutional layer f(x,W) in a plain network, where W is the layer's weights. For this layer to be an identity mapping, i.e. f(x,W)=x, W essentially has to become an identity matrix I, and as the model gets deeper it is not easy to drive W towards I. For each residual block of ResNet, making it an identity mapping, i.e. f(x,W)+x=x, only requires W=0, and learning an all-zero matrix is much easier than learning an identity matrix. This is why ResNet still optimizes well when the depth reaches hundreds or even a thousand layers.
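As a concrete sketch, here is a simplified residual block in PyTorch (batch normalization omitted, so this is not exactly the original ResNet block): when all the weights of the residual branch are zero, the block reduces to an identity mapping.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Minimal residual block: y = f(x, W) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        fx = self.conv2(F.relu(self.conv1(x)))   # residual branch f(x, W)
        return fx + x                            # shortcut connection

block = ResidualBlock(16)
for p in block.parameters():
    nn.init.zeros_(p)                            # W = 0: the residual branch outputs zero
x = torch.randn(1, 16, 8, 8)
assert torch.allclose(block(x), x)               # the whole block is now an identity mapping
```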

Improved: Residual Gates

Learning an all-zero matrix means driving every value in the matrix to 0. Is there an even easier way, say, driving just a single value to 0? That is exactly the point of this paper: Residual Gates.

The following figure shows the improvement applied to a plain network:
[Figure: a plain layer f(x,W) wrapped with a residual gate g(k)]
so f(x,W) becomes g(k)f(x,W)+(1-g(k))x. Does that look familiar? It resembles the Highway Network, whose output is T(x,Wt)H(x,Wh)+(1-T(x,Wt))x, but the Highway Network still has to learn a gating function T(x,Wt) that takes x as input, so the network only becomes an identity mapping when the weights Wt drive T to zero. Here, the layer represents an identity mapping as soon as g(k) equals 0. Note that k is itself a parameter of the model, learned through the ordinary forward and backward passes, and g is the activation function (ReLU). In other words, as long as k learns a value close to 0 or below 0 (thanks to the ReLU), the layer becomes an identity mapping; alternatively, g(k)=1 with W=I also gives an identity mapping. Either way, this is much simpler than relying on W alone to learn I. In the author's view, because this model has the ability to degenerate into an identity mapping, its performance can keep improving as the number of layers grows.
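A minimal PyTorch sketch of such a gated plain layer (the exact form of f(x,W) and the initialization k = 1 are my assumptions, not necessarily the paper's choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedPlainLayer(nn.Module):
    """y = g(k) * f(x, W) + (1 - g(k)) * x, with g = ReLU and k a single scalar."""
    def __init__(self, channels, k_init=1.0):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.k = nn.Parameter(torch.tensor(k_init))   # one learnable parameter per layer

    def forward(self, x):
        g = F.relu(self.k)        # g(k) = 0 whenever k <= 0, making the layer an identity mapping
        fx = F.relu(self.conv(x)) # f(x, W): assumed conv + ReLU
        return g * fx + (1.0 - g) * x
```

With k at or below 0 the layer simply passes x through; with g(k) = 1 it behaves like the original plain layer.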

For ResNet, the same gates can also be used:
[Figure: a residual block f(x,W)+x wrapped with a residual gate g(k)]
i.e. g(k)(f(x,W)+x)+(1-g(k))x = g(k)f(x,W)+x. In this form g(k) does not even have to act as a gate; it simply scales the residual branch. Compared with the original ResNet, which needs W to learn an all-zero matrix, driving the single value g(k) to 0 is simpler, so the authors argue that the gated version, Gated ResNet, should be stronger than the original ResNet.
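And a sketch of the gated residual block, implemented with the simplified right-hand side g(k)f(x,W)+x (again, the internal structure of f is my assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualBlock(nn.Module):
    """y = g(k) * (f(x, W) + x) + (1 - g(k)) * x, which simplifies to g(k) * f(x, W) + x."""
    def __init__(self, channels, k_init=1.0):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.k = nn.Parameter(torch.tensor(k_init))

    def forward(self, x):
        g = F.relu(self.k)                       # scalar gate; g(k) = 0 removes the residual branch
        fx = self.conv2(F.relu(self.conv1(x)))   # f(x, W)
        return g * fx + x                        # simplified form: g(k) * f(x, W) + x
```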

Experiments

I will not repeat or explain the model's results on the MNIST and CIFAR-10 datasets here, but a few things interested me.
First:
[Figure: mean value of k as a function of network depth d]
When the model is shallow, optimization is easy, so the effect of k is not really visible; its value stays large and may simply amplify or strengthen the signal. As the number of layers grows, the value of k gradually decreases. In the figure above, at d = 100 the mean value of k is only 0.67, which means that in many layers k must be very close to 0; those layers act as identity mappings, which confirms the author's point.

Second, another figure, this time for a 100-layer model:
[Figure: per-layer k values of a 100-layer model, with a curve on the right showing the effect of removing layers]
The author found that in ResNet the value of k tends to be lower in the middle layers. When k is close to 0, the layer is close to an identity mapping, so it probably does more information transmission than information extraction and therefore has little impact on the whole model; removing such layers should not hurt performance much. The curve on the right confirms this. This finding also offers a new idea for model compression.
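A hedged sketch of what such k-based pruning might look like after training (the threshold, the attribute name k, and the block class follow the sketches above and are hypothetical, not the paper's procedure):

```python
import torch.nn.functional as F

def prune_by_gate(blocks, threshold=0.05):
    """Keep only blocks whose gate g(k) = relu(k) exceeds `threshold`.

    Blocks with g(k) close to 0 behave almost like identity mappings,
    so dropping them should barely change the model's output.
    """
    return [b for b in blocks if F.relu(b.k).item() > threshold]
```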

Based on the second finding, look at a shallower model with only 24 layers:
[Figure: per-block k values of a 24-layer model]
We find that in the 1st, 5th, and 9th residual blocks the value of k is very low, and these are exactly the blocks where the dimension increases (if this is unclear, go back and look at the structure of Wide ResNet or ResNet). This shows that in the dimension-increasing residual blocks, the shortcut connection used to raise the dimension of the convolutional layer plays the more important role. In the last block, the value of k is very high, which means the shortcut connection has little effect there, so removing it would have almost no impact.

Summary

This paper offers a good idea for model design and optimization. It proposes the notion of a model's ability to degenerate into an identity mapping: if a model can degenerate into an identity mapping, then stacking many such layers will perform no worse than a shallower model. It introduces a gating mechanism with only a single parameter, which makes an ordinary plain network, and even the already strong ResNet, perform better, because learning one scalar is always simpler than learning a multi-dimensional weight matrix. Finally, the author also gives an idea for understanding and optimizing a trained model: for a model with the gating mechanism, we can judge the role of each layer by inspecting the value of k and, based on its importance, remove the unimportant layers without affecting the model's accuracy, which serves as a form of model compression.

Overall, this is a very good paper and well worth reading carefully. Of course, there are still quite a few places I have not understood well; I hope readers will point them out so we can discuss and learn together!
