Classic classification network models (6): DenseNet (CVPR 2017)

Densely Connected Convolutional Networks (DenseNet, CVPR 2017)

Dense connectivity in convolutional networks

To strengthen a traditional CNN there are two obvious options: increase the number of layers, making the network deeper and deeper, or increase the number of convolutional filters in each layer, making it wider and wider. Both options, however, multiply the number of training parameters and push the model toward the abyss of overfitting.

Later, ResNet introduced identity mappings (skip connections) into deep networks, which made us realize that with careful architecture design a CNN can reach excellent performance with a limited budget of training parameters.

ResNet's skip connections allow the activation maps produced by earlier layers to be passed directly to later layers, which largely avoids problems such as exploding and vanishing gradients. DenseNet goes a step further in strengthening the links between the front and back layers of a CNN: the network is designed so that feature information computed by the earlier layers is effectively reused by the layers behind them. As the number of layers grows, each new layer adds its own newly produced feature information on top of the global feature information the network has already accumulated. The paper's experiments show that this reuse of feature maps between layers is effective and reduces the number of parameters each layer needs to learn.

The DenseNet connectivity can be written as x_l = H_l([x_0, x_1, ..., x_{l-1}]), where [x_0, x_1, ..., x_{l-1}] denotes the concatenation of the feature maps produced by all layers preceding the l-th layer that have the same feature-map size.
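To make the connectivity rule concrete, here is a minimal PyTorch sketch of my own (not code from the paper) in which each layer receives the concatenation of all earlier feature maps along the channel dimension; the channel counts and the choice of a bare 3 × 3 convolution for H_l are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TinyDenseBlock(nn.Module):
    """Minimal illustration of x_l = H_l([x_0, ..., x_{l-1}])."""
    def __init__(self, in_channels=16, growth_rate=12, num_layers=3):
        super().__init__()
        # Layer i sees the input plus everything produced so far.
        self.layers = nn.ModuleList([
            nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                      kernel_size=3, padding=1, bias=False)
            for i in range(num_layers)
        ])

    def forward(self, x0):
        features = [x0]                          # x_0
        for h in self.layers:
            x_l = h(torch.cat(features, dim=1))  # H_l([x_0, ..., x_{l-1}])
            features.append(x_l)
        return torch.cat(features, dim=1)        # all feature maps, concatenated

x = torch.randn(1, 16, 32, 32)
print(TinyDenseBlock()(x).shape)  # torch.Size([1, 52, 32, 32]) = 16 + 3 * 12
```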

Reference Links: https://www.jianshu.com/p/59bb32202353

Abstract

Recent work has shown that convolutional networks can be substantially deeper, more accurate, and more efficient to train if they contain shorter connections between layers close to the input and layers close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. **Whereas a traditional convolutional network with L layers has L connections (one between each layer and the next), our network has L(L+1)/2 direct connections.** For each layer, the feature maps of all preceding layers are used as inputs, and its own feature maps are used as inputs to all subsequent layers. **DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters.** We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state of the art on most of them, while requiring less computation to achieve high performance. Code and pre-trained models are available at https://github.com/liuzhuang13/DenseNet.

1. Introduction

Convolutional neural networks (CNNs) have become the dominant machine learning approach for visual object recognition. Although they were originally introduced over 20 years ago [18], it was improvements in computer hardware and network structure that made training truly deep CNNs possible. The original LeNet5 [19] consisted of 5 layers, VGG featured 19 [29], and only last year did Highway Networks [34] and Residual Networks (ResNets) [11] surpass the 100-layer barrier.
[Figure 1: a 5-layer dense block in which each layer takes the feature maps of all preceding layers as input.]
As CNNs become increasingly deep, a new research problem emerges: as information about the input or gradient passes through many layers, it can vanish, or be "washed out", by the time it reaches the end (or beginning) of the network. Many recent publications address this or related problems. ResNets [11] and Highway Networks [34] bypass signal from one layer to the next via identity connections. Stochastic depth [13] shortens ResNets by randomly dropping layers during training to allow better information and gradient flow. FractalNets [17] repeatedly combine several parallel layer sequences with different numbers of convolutional blocks to obtain a large nominal depth while maintaining many short paths in the network. Although these approaches vary in network topology and training procedure, they all share a key characteristic: they create short paths from early layers to later layers.

In this paper, we propose an architecture that distills this insight into a simple connectivity pattern: to ensure maximum information flow between the layers of the network, we connect all layers (with matching feature-map sizes) directly to each other. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own feature maps to all subsequent layers. Figure 1 illustrates this layout schematically. Crucially, in contrast to ResNets, we never combine features by summing them before they are passed into a layer; instead, we combine them by concatenation. Hence, the l-th layer has l inputs, consisting of the feature maps of all preceding convolutional blocks, and its own feature maps are passed on to all L - l subsequent layers. This introduces L(L+1)/2 connections in an L-layer network, instead of just L as in traditional architectures. Because of its dense connectivity pattern, we refer to this approach as the Dense Convolutional Network (DenseNet).

This dense connectivity pattern may seem counter-intuitive, yet it requires fewer parameters than a conventional convolutional network, because there is no need to relearn redundant feature maps. A traditional feed-forward architecture can be viewed as an algorithm with a state that is passed on from layer to layer. Each layer reads the state from its preceding layer and writes to the subsequent layer; it changes the state but also passes on information that needs to be preserved. ResNets [11] make this information preservation explicit through additive identity transformations. Recent variations of ResNets [13] show that many layers contribute very little and can in fact be randomly dropped during training. This makes the state of a ResNet similar to an (unrolled) recurrent neural network [21], but the number of parameters of a ResNet is substantially larger because each layer has its own weights. Our proposed DenseNet architecture explicitly differentiates between information that is added to the network and information that is preserved. DenseNet layers are very narrow (e.g., 12 filters per layer), adding only a small set of feature maps to the "collective knowledge" of the network while keeping the remaining feature maps unchanged; the final classifier makes its decision based on all feature maps in the network.

**Besides better parameter efficiency, one big advantage of DenseNets is their improved flow of information and gradients throughout the network, which makes them easy to train.** Each layer has direct access to the gradients from the loss function and to the original input signal, leading to an implicit deep supervision [20]. This helps the training of deeper network architectures. Further, we also observe that dense connections have a regularizing effect, which reduces overfitting on tasks with smaller training sets.

We evaluate DenseNets on four highly competitive benchmark datasets (CIFAR-10, CIFAR-100, SVHN, and ImageNet). Our models tend to require far fewer parameters than existing algorithms with comparable accuracy. In addition, on most of the benchmark tasks our results are significantly better than the current state of the art.

2. Related Work

The exploration of network architectures has been part of neural network research since their initial discovery. The recent resurgence in popularity of neural networks has also revived this research domain. The increasing number of layers in modern networks amplifies the differences between architectures and motivates the exploration of different connectivity patterns and the revisiting of old research ideas.

A cascade structure similar to our proposed dense network layout was already studied in the neural network literature of the 1980s [3]. That pioneering work focused on fully connected multi-layer perceptrons trained in a layer-by-layer fashion. More recently, fully connected cascade networks trained with batch gradient descent were proposed [40]. Although effective on small datasets, this approach only scales to networks with a few hundred parameters. In [9,23,31,41], utilizing multi-level features in CNNs through skip connections was found to be effective for various vision tasks. Parallel to our work, [1] derived a purely theoretical framework for networks with cross-layer connections similar to ours.

Highway Networks [34] were among the first architectures that provided a means to effectively train end-to-end networks with more than 100 layers. Using bypassing paths along with gating units, Highway Networks with hundreds of layers can be optimized without difficulty. The bypassing paths are presumed to be the key factor that eases the training of these very deep networks. This point is further supported by ResNets [11], in which pure identity mappings are used as bypassing paths. ResNets have achieved impressive, record-breaking performance on many challenging image recognition, localization, and detection tasks, such as ImageNet and COCO object detection [11]. Recently, stochastic depth was proposed as a way to successfully train a 1202-layer ResNet [13]. Stochastic depth improves the training of deep residual networks by dropping layers randomly during training. This shows that not all layers may be needed and highlights that there is a great amount of redundancy in deep (residual) networks. Our paper is partly inspired by that observation. ResNets with pre-activation also facilitate the training of state-of-the-art networks with more than 1000 layers [12].
An orthogonal approach to making networks deeper (e.g., with the help of skip connections) is to increase the network width. GoogLeNet [36,37] uses an "Inception module" that concatenates feature maps produced by filters of different sizes. In [38], a variant of ResNets with wide generalized residual blocks was proposed. In fact, simply increasing the number of filters in each layer of a ResNet can improve its performance, provided the depth is sufficient [42]. FractalNets also achieve competitive results on several datasets using a wide network structure [17].

Instead of drawing representational power from extremely deep or wide architectures, DenseNets exploit the potential of the network through feature reuse, yielding condensed models that are easy to train and highly parameter-efficient. Concatenating feature maps learned by different layers increases the variation in the input of subsequent layers and improves efficiency. This constitutes a major difference between DenseNet and ResNet. Compared to Inception networks [36,37], which also concatenate features from different layers, DenseNets are simpler and more efficient.

There are other notable network architecture innovations that have yielded competitive results. The Network in Network (NIN) [22] structure includes micro multi-layer perceptrons in the filters of convolutional layers to extract more complicated features. In Deeply Supervised Networks (DSN) [20], internal layers are directly supervised by auxiliary classifiers, which can strengthen the gradients received by earlier layers. Ladder Networks [27,25] introduce lateral connections into autoencoders, producing impressive accuracies on semi-supervised learning tasks. In [39], Deeply-Fused Nets (DFNs) were proposed to improve information flow by combining intermediate layers of different base networks. Augmenting networks with pathways that minimize reconstruction losses was also shown to improve image classification models [43].

3. DenseNets

Consider a single image x_0 that is passed through a convolutional network. The network comprises L layers, each of which implements a non-linear transformation H_l(·), where l indexes the layer. H_l(·) can be a composite function of operations such as batch normalization (BN) [14], rectified linear units (ReLU) [6], pooling [19], or convolution (Conv). We denote the output of the l-th layer as x_l.

ResNets. A traditional convolutional feed-forward network connects the output of the l-th layer as input to the (l+1)-th layer [16], which gives rise to the layer transition x_l = H_l(x_{l-1}). ResNets [11] add a skip connection that bypasses the non-linear transformation with an identity function:
x_l = H_l(x_{l-1}) + x_{l-1}    (1)
An advantage of ResNets is that the gradient can flow directly through the identity function from later layers to earlier layers. However, the identity function and the output of H_l are combined by summation, which may impede the information flow in the network.

Dense connectivity. To further improve the information flow between layers, we propose a different connectivity pattern: we introduce direct connections from any layer to all subsequent layers. Figure 1 illustrates the layout of the resulting DenseNet schematically. Consequently, the l-th layer receives the feature maps of all preceding layers, x_0, ..., x_{l-1}, as input:
x_l = H_l([x_0, x_1, ..., x_{l-1}])    (2)
where [x_0, x_1, ..., x_{l-1}] refers to the concatenation of the feature maps produced in layers 0, ..., l-1. Because of its dense connectivity we refer to this network architecture as the Dense Convolutional Network (DenseNet). For ease of implementation, we concatenate the multiple inputs of H_l(·) in Eq. (2) into a single tensor.

Composite function. Following [12], we define H_l(·) as a composite function of three consecutive operations: batch normalization (BN) [14], followed by a rectified linear unit (ReLU) [6] and a 3 × 3 convolution (Conv).
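A sketch of this composite function in PyTorch (my own illustration, with the helper name and arguments chosen for this post rather than taken from the reference code):

```python
import torch.nn as nn

def composite_function(in_channels, growth_rate):
    """H_l as described above: BN -> ReLU -> 3x3 Conv producing k feature maps."""
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, growth_rate, kernel_size=3,
                  padding=1, bias=False),
    )
```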

Pooling layers. The concatenation operation used in Eq. (2) is not viable when the size of the feature maps changes. However, an essential part of convolutional networks is down-sampling layers that change the size of feature maps. To facilitate down-sampling in our architecture, we divide the network into multiple densely connected dense blocks; see Figure 2. We refer to the layers between the blocks as transition layers, which do convolution and pooling. The transition layers used in our experiments consist of a batch normalization layer and a 1 × 1 convolutional layer followed by a 2 × 2 average pooling layer.
[Figure 2: a deep DenseNet with three dense blocks; the layers between adjacent blocks are transition layers that change feature-map sizes via convolution and pooling.]
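A transition layer as described above might look like the following sketch (again my own illustration rather than the reference implementation; `out_channels` equals the number of input maps unless the compression factor introduced below is applied):

```python
import torch.nn as nn

def transition_layer(in_channels, out_channels):
    """BN -> 1x1 Conv -> 2x2 average pooling, halving the spatial size."""
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
        nn.AvgPool2d(kernel_size=2, stride=2),
    )
```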
Growth rate. If each function H_l produces k feature maps, it follows that the l-th layer has k_0 + k × (l - 1) input feature maps, where k_0 is the number of channels in the input layer. An important difference between DenseNet and existing network architectures is that DenseNet can have very narrow layers, e.g., k = 12. We refer to the hyper-parameter k as the growth rate of the network. In Section 4 we show that a relatively small growth rate is sufficient to obtain state-of-the-art results on the datasets we tested. One explanation for this is that each layer has access to all preceding feature maps in its block and, therefore, to the network's "collective knowledge". One can view the feature maps as the global state of the network. Each layer adds k feature maps of its own to this state. The growth rate regulates how much new information each layer contributes to the global state. Once written, the global state can be accessed from anywhere within the network and, unlike in traditional architectures, there is no need to replicate it from layer to layer.
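The channel bookkeeping implied by k_0 + k × (l - 1) can be worked through quickly; the values k_0 = 16 and k = 12 below are illustrative assumptions, not taken from a specific configuration in the paper:

```python
# Input feature maps seen by each layer inside one dense block.
k0, k = 16, 12
for l in range(1, 7):
    print(f"layer {l}: {k0 + k * (l - 1)} input maps -> {k} output maps")
# layer 1: 16 input maps -> 12 output maps
# ...
# layer 6: 76 input maps -> 12 output maps
```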

Bottleneck layers. Although each layer only produces k output feature maps, it typically has many more inputs. It has been noted in [37,11] that a 1 × 1 convolution can be introduced as a bottleneck layer before each 3 × 3 convolution to reduce the number of input feature maps and thus improve computational efficiency. We find this design especially effective for DenseNet, and we refer to our network with such a bottleneck layer, i.e., the BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3) version of H_l, as DenseNet-B. In our experiments, we let each 1 × 1 convolution produce 4k feature maps.
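A sketch of the DenseNet-B variant of H_l, following the BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3) recipe above (the function name is mine):

```python
import torch.nn as nn

def bottleneck_function(in_channels, growth_rate):
    """DenseNet-B: the 1x1 conv produces 4k maps, the 3x3 conv produces k."""
    inter_channels = 4 * growth_rate
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, inter_channels, kernel_size=1, bias=False),
        nn.BatchNorm2d(inter_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(inter_channels, growth_rate, kernel_size=3,
                  padding=1, bias=False),
    )
```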

Compression. To further improve model compactness, we can reduce the number of feature maps at transition layers. If a dense block contains m feature maps, we let the following transition layer generate ⌊θm⌋ output feature maps, where 0 < θ ≤ 1 is referred to as the compression factor. When θ = 1, the number of feature maps across transition layers remains unchanged. We refer to a DenseNet with θ < 1 as DenseNet-C, and we set θ = 0.5 in our experiments. When both the bottleneck layers and transition layers with θ < 1 are used, we refer to the model as DenseNet-BC.
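A small worked example of the compression rule, with θ = 0.5 as in the experiments (the value of m here is made up purely for illustration):

```python
import math

theta = 0.5     # compression factor used for DenseNet-C / DenseNet-BC
m = 456         # hypothetical number of feature maps leaving a dense block
out_maps = math.floor(theta * m)
print(out_maps)  # 228 feature maps produced by the following transition layer
```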

Implementation details. On all datasets except ImageNet, the DenseNet used in our experiments has three dense blocks, each with an equal number of layers. Before entering the first dense block, a convolution with 16 output channels (or twice the growth rate for DenseNet-BC) is performed on the input images. For convolutional layers with kernel size 3 × 3, each side of the input is zero-padded by one pixel to keep the feature-map size fixed. We use a 1 × 1 convolution followed by 2 × 2 average pooling as transition layers between two contiguous dense blocks. At the end of the last dense block, a global average pooling is performed and then a softmax classifier is attached. The feature-map sizes in the three dense blocks are 32 × 32, 16 × 16, and 8 × 8, respectively. We experiment with the basic DenseNet structure with configurations {L = 40, k = 12}, {L = 100, k = 12}, and {L = 100, k = 24}. For DenseNet-BC, the networks with configurations {L = 100, k = 12}, {L = 250, k = 24}, and {L = 190, k = 40} are evaluated.
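My reading of these configurations (an assumption, not something stated explicitly in the text) is that the initial convolution, the two transition layers, and the classifier account for 4 of the L layers, and each DenseNet-BC layer is a 1×1/3×3 pair, which gives the following per-block layer counts:

```python
def layers_per_block(L, bottleneck=False, num_blocks=3):
    # Assumption: 4 of the L layers are the stem, transitions, and classifier;
    # the remainder is split evenly across the dense blocks.
    n = (L - 4) // num_blocks
    return n // 2 if bottleneck else n  # a bottleneck layer is a 1x1/3x3 pair

print(layers_per_block(40))                    # 12 layers per block for {L=40, k=12}
print(layers_per_block(100))                   # 32 layers per block for {L=100, k=12 or 24}
print(layers_per_block(100, bottleneck=True))  # 16 bottleneck pairs for DenseNet-BC {L=100}
```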

In our experiments on ImageNet, we use a DenseNet-BC structure with four dense blocks on 224 × 224 input images. The initial convolution layer comprises 2k convolutions of size 7 × 7 with stride 2; the number of feature maps in all other layers also follows from setting k. The exact network configurations we use on ImageNet are shown in Table 1.
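A convenient way to experiment with an ImageNet-style configuration, assuming `torchvision` is installed, is to load its DenseNet-121, which follows this four-block, 7 × 7 stride-2 stem design:

```python
import torch
import torchvision

# DenseNet-BC style network with four dense blocks for 224x224 inputs.
model = torchvision.models.densenet121()
logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```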

4. Experiments

We empirically demonstrate DenseNet's effectiveness on several benchmark datasets and compare it with state-of-the-art architectures, especially ResNet and its variants.

4.1. Datasets

CIFAR. The two CIFAR datasets [15] consist of colored natural images of 32 × 32 pixels. CIFAR-10 (C10) contains images from 10 classes and CIFAR-100 (C100) from 100 classes. The training and test sets contain 50,000 and 10,000 images respectively, and we hold out 5,000 training images as a validation set. We adopt a standard data augmentation scheme (mirroring/shifting) that is widely used for these two datasets [11,13,17,22,28,20,32,34]. We denote this augmentation scheme by a "+" mark at the end of the dataset name (e.g., C10+). For preprocessing, we normalize the data using the channel means and standard deviations. For the final run we use all 50,000 training images and report the final test error at the end of training.
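The mirroring/shifting augmentation and channel normalization described here are commonly implemented roughly as follows (a sketch using `torchvision.transforms`; the mean/std values are the usual published CIFAR-10 channel statistics, not numbers quoted in this paper):

```python
from torchvision import transforms

# "+" setting: pad by 4 and take a random 32x32 crop (shifting), random
# horizontal flip (mirroring), then per-channel normalization.
cifar10_mean = (0.4914, 0.4822, 0.4465)
cifar10_std = (0.2470, 0.2435, 0.2616)

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(cifar10_mean, cifar10_std),
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(cifar10_mean, cifar10_std),
])
```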

SVHN. The Street View House Numbers (SVHN) dataset [24] contains 32 × 32 colored digit images. There are 73,257 images in the training set, 26,032 images in the test set, and 531,131 additional images for training. Following common practice [7,13,20,22,30], we use all the training data without any data augmentation, and a validation set with 6,000 images is split from the training set. We select the model with the lowest validation error during training and report its test error. We follow [42] and divide the pixel values by 255 so that they lie in the range [0, 1].

ImageNet. The ILSVRC 2012 classification dataset [2] consists of 1.2 million images for training and 50,000 for validation, from 1,000 classes. We adopt the same data augmentation scheme for the training images as in [8,11,12], and apply a single crop or 10-crop of size 224 × 224 at test time. Following [11,12,13], we report classification errors on the validation set.
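The single-crop test protocol mentioned here is typically implemented as a resize followed by a 224 × 224 center crop; the sketch below assumes the common resize-to-256 convention and the standard ImageNet channel statistics, neither of which is spelled out in the text:

```python
from torchvision import transforms

# Single-crop evaluation: resize the shorter side to 256, take a 224x224
# center crop, then normalize with the ImageNet channel statistics.
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```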

4.2. Training

All the networks are trained using stochastic gradient descent (SGD). On CIFAR and SVHN we train with a batch size of 64 for 300 and 40 epochs, respectively. The initial learning rate is set to 0.1 and is divided by 10 at 50% and 75% of the total number of training epochs. On ImageNet, we train the models for 90 epochs with a batch size of 256. The learning rate is set to 0.1 initially and is lowered by a factor of 10 at epochs 30 and 60. Note that a naive implementation of DenseNet can be memory-inefficient; to reduce memory consumption on GPUs, please refer to our technical report on memory-efficient implementations of DenseNets [26].
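For the 300-epoch CIFAR schedule, this corresponds to something like the following sketch (the momentum and weight decay values come from the next paragraph, and the `model` here is just a placeholder standing in for a DenseNet module):

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder for a DenseNet model

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            weight_decay=1e-4, nesterov=True)
# Divide the learning rate by 10 at 50% and 75% of 300 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150, 225], gamma=0.1)

for epoch in range(300):
    # ... one training epoch over CIFAR with batch size 64 ...
    scheduler.step()
```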

Following [8], we use a weight decay of 10^-4 and a Nesterov momentum [35] of 0.9 without dampening. We adopt the weight initialization introduced by [10]. For the three datasets without data augmentation, i.e., C10, C100, and SVHN, we add a dropout layer [33] after each convolutional layer (except the first one) and set the dropout rate to 0.2. The test errors are evaluated only once for each task and model setting.
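The weight initialization of [10] is the He (Kaiming) scheme; wired up in PyTorch it might look like this sketch (my own illustration, with the BN constants set to the conventional 1/0 rather than anything stated in the paper):

```python
import torch.nn as nn

def init_weights(module):
    """He initialization [10] for conv layers; conventional BN initialization."""
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, mode='fan_out',
                                nonlinearity='relu')
    elif isinstance(module, nn.BatchNorm2d):
        nn.init.constant_(module.weight, 1.0)
        nn.init.constant_(module.bias, 0.0)

# model.apply(init_weights)  # apply recursively to every sub-module

# On C10, C100 and SVHN (no augmentation), a dropout layer with rate 0.2
# follows each convolution except the first one:
dropout = nn.Dropout(p=0.2)
```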

6. Conclusion

We proposed a new convolutional network architecture, which we refer to as the Dense Convolutional Network (DenseNet). It introduces direct connections between any two layers with the same feature-map size. We showed that DenseNets scale naturally to hundreds of layers while exhibiting no optimization difficulties. In our experiments, DenseNets tend to yield consistent improvements in accuracy with a growing number of parameters, without any signs of performance degradation or overfitting. Under multiple settings, they achieved state-of-the-art results across several highly competitive datasets; moreover, DenseNets require substantially fewer parameters and less computation to achieve state-of-the-art performance. Because we adopted hyper-parameter settings optimized for residual networks in our study, we believe that further gains in accuracy may be obtained by more detailed tuning of hyper-parameters and learning-rate schedules.

While following a simple connectivity rule, DenseNets naturally integrate the properties of identity mappings, deep supervision, and diversified depth. They allow feature reuse throughout the network and can consequently learn more compact and, according to our experiments, more accurate models. Because of their compact internal representations and reduced feature redundancy, DenseNets may be good feature extractors for various computer vision tasks that build on convolutional features, e.g., [4,5]. We plan to study such feature transfer with DenseNets in future work.

Acknowledgments. The authors are supported in part by NSF grants III-1618134, III-1526012, and IIS-1149882, the Office of Naval Research grant N00014-17-1-2175, and the Bill and Melinda Gates Foundation. GH is supported by the International Postdoctoral Exchange Fellowship Program of the China Postdoctoral Council (No. 20150015). ZL is supported by the National Basic Research Program of China grants 2011CBA00300 and 2011CBA00301, and NSFC grant 61361136003. We also thank Daniel Sedra, Geoff Pleiss, and Yu Sun for many insightful discussions.

Source: blog.csdn.net/qq_18315295/article/details/103569061