A detailed explanation of the DenseNet deep learning algorithm

Paper: Densely Connected Convolutional Networks 
Paper link: https://arxiv.org/pdf/1608.06993.pdf 
Code on GitHub: https://github.com/liuzhuang13/DenseNet 
MXNet version (with an ImageNet pre-trained model; if you find it helpful, remember to give it a star): https://github.com/miraclewkf/DenseNet

Detailed explanation of the article: 
This paper is an oral at CVPR 2017, which says a lot. The DenseNet (Dense Convolutional Network) it proposes is mainly compared with ResNet and the Inception networks; it borrows from their ideas, yet it is a brand-new structure. The network structure is not complicated, but it is very effective. As we all know, in the past year or two the main directions for improving convolutional neural networks have been going deeper (for example ResNet, which alleviates the vanishing-gradient problem that appears when networks get deep) or going wider (for example GoogLeNet's Inception). The author instead starts from the features themselves: by reusing features to the extreme, DenseNet achieves better results with fewer parameters. Although I have not read that many papers, after reading this one I was genuinely moved, just like when I read the ResNet paper back then.


Let's first list several advantages of DenseNet to get a feel for its strength:  
1. It alleviates the vanishing-gradient problem.  
2. It strengthens feature propagation.  
3. It makes more effective use of features (feature reuse).  
4. It reduces the number of parameters to a certain extent.


In deep networks, the vanishing-gradient problem becomes more and more pronounced as depth increases. Many papers have proposed remedies, such as ResNet, Highway Networks, Stochastic Depth, and FractalNets. Although these architectures differ, their core idea is the same: create short paths from early layers to later layers. So what does the author do? Following this idea to its conclusion, DenseNet simply connects all layers directly, while ensuring maximum information flow between layers in the network.


First, look at the structure diagram of a dense block. In a traditional convolutional neural network with L layers there are L connections, but in DenseNet there are L(L+1)/2 connections. Simply put, the input of each layer is the output of all preceding layers. As shown in the figure below: x0 is the input, the input of H1 is x0, the input of H2 is x0 and x1 (x1 is the output of H1), and so on.

[Figure 1: a dense block; each layer takes the feature maps of all preceding layers as input]

One advantage of DenseNet is that the network is narrower and has fewer parameters, which is largely due to the design of the dense block. As mentioned later, the number of feature maps output by each convolutional layer in a dense block is very small (less than 100), rather than the hundreds or thousands used in other networks. At the same time, this connection pattern makes the propagation of features and gradients more efficient, so the network is easier to train. I really like this sentence from the original text: "Each layer has direct access to the gradients from the loss function and the original input signal, leading to an implicit deep supervision." It directly explains why this network works well. As mentioned earlier, gradients vanish more easily as the network gets deeper, because the input information and the gradient information have to pass through many layers; with dense connections, every layer effectively has direct access to the input and to the gradient from the loss, so the vanishing-gradient phenomenon is mitigated and deeper networks are no longer a problem. In addition, the author observes that dense connections have a regularizing effect and therefore suppress overfitting to some extent. I believe this is because the number of parameters is reduced (the reason for the reduction is described later), which in turn reduces overfitting.
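To make the connectivity concrete, here is a minimal PyTorch-style sketch of a basic dense block. It is my own illustration written from the paper's description, not the authors' Torch/MXNet code; the class and variable names are my own, and each H_l is the BN-ReLU-3x3-conv composite described later in this post.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """H_l: BN -> ReLU -> 3x3 conv, producing growth_rate (k) new feature maps."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_channels, growth_rate,
                              kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        return self.conv(self.relu(self.bn(x)))

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of the block input and all previous outputs."""
    def __init__(self, num_layers, in_channels, growth_rate):
        super().__init__()
        self.layers = nn.ModuleList([
            DenseLayer(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            new_features = layer(torch.cat(features, dim=1))  # concat along the channel dim
            features.append(new_features)
        return torch.cat(features, dim=1)

# Quick check: a 5-layer block with k = 4 on a toy input
block = DenseBlock(num_layers=5, in_channels=16, growth_rate=4)
out = block(torch.randn(1, 16, 32, 32))
print(out.shape)  # torch.Size([1, 36, 32, 32]): 16 + 5 * 4 channels
```

The block output concatenates the block input with the k new feature maps from every layer, which is exactly why the channel count grows linearly with depth inside a block.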


One nice thing about this paper is that it contains almost no formulas; it is not one of those filler papers that pile up complex formulas to dazzle the reader. There are only two formulas, used to explain the relationship between DenseNet and ResNet, and they are essential for understanding the two networks in principle.

The first formula is for ResNet. Here l denotes the layer index, x_l the output of layer l, and H_l a nonlinear transformation. For ResNet, the output of layer l is the output of layer l-1 plus a nonlinear transformation of the output of layer l-1:

x_l = H_l(x_{l-1}) + x_{l-1}

The second formula is that of DenseNet. [x_0, x_1, ..., x_{l-1}] denotes the concatenation of the feature maps output by layers 0 through l-1. Concatenation merges channels, as in Inception, whereas the addition in ResNet adds values element-wise and leaves the number of channels unchanged. H_l consists of BN, ReLU, and a 3x3 convolution:

x_l = H_l([x_0, x_1, ..., x_{l-1}])

These two formulas capture the essential difference between DenseNet and ResNet, and they do so very concisely.
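In code the difference comes down to a single line: element-wise addition versus channel-wise concatenation. The snippet below is only a toy illustration with placeholder tensors and a placeholder H that I made up; it is not how the real layers are implemented.

```python
import torch

# Toy stand-ins: x_prev plays the role of x_{l-1}; H is a placeholder nonlinear transformation.
x_prev = torch.randn(1, 64, 32, 32)
H = lambda t: torch.relu(t)  # stands in for a real BN-ReLU-conv composite

# ResNet: element-wise addition; the channel count stays at 64
x_resnet = H(x_prev) + x_prev

# DenseNet: channel-wise concatenation of all previous outputs [x_0, ..., x_{l-1}]
x0 = torch.randn(1, 32, 32, 32)                   # a made-up earlier feature map
x_densenet = H(torch.cat([x0, x_prev], dim=1))    # channels grow to 32 + 64 = 96

print(x_resnet.shape, x_densenet.shape)  # (1, 64, 32, 32) and (1, 96, 32, 32)
```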

The earlier Figure 1 shows a dense block, while Figure 2 below shows the structure of a complete DenseNet containing 3 dense blocks. The author splits DenseNet into multiple dense blocks so that the feature maps within each dense block have the same spatial size, which avoids any size mismatch when concatenating.

[Figure 2: a deep DenseNet with three dense blocks]

Table 1 shows the structure of the entire network. The k in k = 32 and k = 48 is the growth rate, i.e., the number of feature maps output by each layer inside a dense block. To keep the network from becoming very wide, the author uses a small k such as 32, and the experiments also show that a small k already gives good results. Because of the dense-block design, later layers receive the inputs of all previous layers, so the number of channels after concatenation is still quite large. In addition, each 3x3 convolution in a dense block is preceded by a 1x1 convolution, the so-called bottleneck layer, whose purpose is to reduce the number of input feature maps: it lowers the dimensionality and the amount of computation while also fusing the features of each channel, so why not do it. Furthermore, to compress the model, the author inserts a 1x1 convolution between every two dense blocks; this is the transition layer, whose output channels default to half of its input channels. In the experimental comparisons, a DenseNet-C network is one with this transition-layer compression added, and a DenseNet-BC network has both the bottleneck layer and the transition layer. (A minimal sketch of these two layers follows Table 1 below.)

[Table 1: DenseNet network architectures]
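Here is a hedged sketch of what these two pieces might look like in PyTorch. The 4 x growth-rate width of the 1x1 bottleneck and the default compression of 0.5 come from the discussion above, and the 2x2 average pooling in the transition layer comes from the paper; the class names and everything else are my own illustrative choices.

```python
import torch
import torch.nn as nn

class BottleneckLayer(nn.Module):
    """DenseNet-B layer: BN-ReLU-1x1 conv (to 4 * growth_rate channels),
    then BN-ReLU-3x3 conv (to growth_rate channels)."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        inter_channels = 4 * growth_rate  # the 4x factor used in the reference code
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, inter_channels, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(inter_channels)
        self.conv2 = nn.Conv2d(inter_channels, growth_rate,
                               kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv1(self.relu(self.bn1(x)))    # 1x1 conv shrinks the channel count
        out = self.conv2(self.relu(self.bn2(out)))  # 3x3 conv outputs k new feature maps
        return out

class Transition(nn.Module):
    """Between dense blocks: BN-ReLU-1x1 conv that compresses channels (theta = 0.5 by default),
    followed by 2x2 average pooling to halve the spatial size, as in the paper."""
    def __init__(self, in_channels, compression=0.5):
        super().__init__()
        out_channels = int(in_channels * compression)
        self.bn = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        return self.pool(self.conv(self.relu(self.bn(x))))
```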

Let's discuss the bottleneck and transition layer operations in more detail. Each dense block contains many sub-structures. Take Dense Block (3) of DenseNet-169 as an example: it contains 32 pairs of 1x1 and 3x3 convolutions, so the input to the 32nd sub-structure is the output of the preceding 31 layers, each of which outputs 32 channels (the growth rate). Without the bottleneck operation, the input to the 32nd layer's 3x3 convolution would be 31 x 32 channels plus the output channels of the previous dense block, close to a thousand. With the 1x1 bottleneck convolution added, whose output channels in the code are 4 times the growth rate, i.e. 128, that becomes the input to the 3x3 convolution, which greatly reduces the amount of computation. That is the bottleneck. As for the transition layer, it sits between two dense blocks because the number of channels coming out of each dense block is large, so a 1x1 convolution is needed to reduce the dimension. Again take Dense Block (3) of DenseNet-169: although the 3x3 convolution of the 32nd layer outputs only 32 channels (the growth rate), its output is concatenated with its input just as in the previous layers, and since the input of the 32nd layer is roughly a thousand channels, the output of the whole dense block is also well over a thousand channels. The transition layer therefore has a compression parameter (ranging from 0 to 1) that determines what fraction of these channels is kept; its default is 0.5, so the channel count is halved before being passed to the next dense block. That is the role of the transition layer. The paper also uses dropout to randomly drop connections and avoid overfitting; after all, this network has a great many connections.
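The channel bookkeeping above can be checked with a few lines of arithmetic. The sketch below assumes the growth rate k = 32 and the 32-layer Dense Block (3) of DenseNet-169 from Table 1; the number of channels entering the block depends on the earlier blocks and transitions, so it is treated here as an assumed value.

```python
growth_rate = 32           # k in the paper
num_layers = 32            # layers in Dense Block (3) of DenseNet-169
block_input_channels = 256 # assumed output of the previous transition layer (illustrative)

# Input channels seen by the 3x3 conv of layer i (1-indexed) without a bottleneck:
for i in (1, 16, 32):
    in_ch = block_input_channels + (i - 1) * growth_rate
    print(f"layer {i:2d}: {in_ch} input channels without the bottleneck, "
          f"{4 * growth_rate} after the 1x1 bottleneck")

# Channels leaving the whole block, and after a transition layer with compression 0.5:
block_output = block_input_channels + num_layers * growth_rate
print("block output channels:", block_output)
print("after transition (theta = 0.5):", block_output // 2)
```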


Experimental results: 
The DenseNet the author uses differs slightly across datasets. For example, DenseNet-BC on the ImageNet dataset uses 4 dense blocks, while only 3 dense blocks are used on the other datasets. For more details, see the Implementation Details in Section 3 of the paper; the training details and hyperparameter settings are in Section 4.2. When testing on ImageNet, a 224x224 center crop is used.
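If you just want to poke at the ImageNet-style configuration (4 dense blocks, 224x224 input) without writing the modules yourself, torchvision ships a DenseNet implementation; a minimal usage sketch, assuming torchvision is installed:

```python
import torch
from torchvision.models import densenet169

# DenseNet-169: 4 dense blocks with (6, 12, 32, 32) layers and growth rate k = 32
model = densenet169()  # random weights; see the torchvision docs for loading pretrained weights
x = torch.randn(1, 3, 224, 224)  # ImageNet-style input, matching the 224x224 center crop
logits = model(x)
print(logits.shape)  # torch.Size([1, 1000])
```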


Table 2 compares DenseNet with other algorithms on three datasets (C10, C100, SVHN). ResNet [11] is Kaiming He's paper, and the comparison is clear at a glance: DenseNet-BC indeed has far fewer parameters than a DenseNet of the same depth! Besides saving memory, fewer parameters also reduce overfitting. On the SVHN dataset, DenseNet-BC does not do as well as DenseNet (k=24); the author attributes this mainly to SVHN being relatively simple, so deeper models tend to overfit. The comparison of three DenseNets with different depths L and growth rates k in the second-to-last block of the table shows that the larger L and k are, the better the model performs.

[Table 2: error rates of DenseNet and other methods on C10, C100, and SVHN]

Figure 3 compares DenseNet-BC and ResNet on the ImageNet dataset. The left plot shows the number of parameters versus the error rate: whether you compare parameter counts at the same error rate or error rates at the same parameter count, the improvement is very clear. The right plot shows FLOPs (which can be understood as computational complexity) versus the error rate, with the same conclusion.

[Figure 3: DenseNet-BC vs. ResNet on ImageNet, parameters vs. error rate (left) and FLOPs vs. error rate (right)]

Figure 4 is also important. The left plot compares the parameters and error of different DenseNet variants. The middle plot compares the parameters and error of DenseNet-BC and ResNet; at the same error, the parameter count of DenseNet-BC is much smaller. The right plot shows that DenseNet-BC-100 can achieve the same result as ResNet-1001 with only a fraction of the parameters.

[Figure 4: parameter efficiency of DenseNet variants compared with ResNet]

It is also worth mentioning the relationship between DenseNet and stochastic depth. In stochastic depth, layers inside the residual blocks are randomly dropped during training, which in effect creates direct connections between neighboring layers; this is very similar in spirit to DenseNet.


Summary: 
After reading this paper, I really felt I had come to it late. It was posted on arXiv half a year ago and reportedly caused quite a stir at the time, and it was later accepted as an oral at CVPR 2017. I feel it may well shake ResNet's position; many classification and detection networks are now built on top of ResNet, so this is no small earthquake. To summarize: the core idea of DenseNet is to establish connections between different layers and make full use of features, which further alleviates the vanishing-gradient problem, makes deepening the network unproblematic, and trains very well. In addition, the bottleneck layer, the transition layer, and a small growth rate make the network narrower and reduce the number of parameters, which effectively suppresses overfitting and also reduces the amount of computation. DenseNet has many advantages, and its advantages over ResNet are quite clear.
