Super detailed interpretation of classic neural network papers (6) - DenseNet study notes (translation + intensive reading + code reproduction)

 

Preface

We introduced ResNet in the previous article: Super detailed interpretation of classic neural network papers (5) - ResNet (residual network) study notes (translation + intensive reading + code reproduction)

ResNet can train deeper CNN models through shortcut connections and thereby achieve higher accuracy. Today we introduce the DenseNet model ("Densely Connected Convolutional Networks"). Its basic idea is similar to ResNet, but it achieves better performance than ResNet with fewer parameters and lower computational cost, and it won the CVPR 2017 Best Paper Award. Let's learn it together!

Original paper: https://arxiv.org/pdf/1608.06993.pdf


Table of contents

Preface

Abstract

1. Introduction

2. Related Work

3. DenseNets

4. Experiments

4.1. Datasets

4.2. Training

4.3. Classification Results on CIFAR and SVHN

4.4. Classification Results on ImageNet

5. Discussion

6. Conclusion

Ten questions about the paper


Abstract

Translation

Recent work shows that convolutional networks can be substantially deeper, more accurate, and more efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and connect every layer directly with every other layer. Whereas a traditional convolutional network with L layers has L connections, our network has L(L+1)/2 direct connections. Each layer takes the feature maps of all preceding layers as input, and its own feature maps are used as input for all subsequent layers. This network has the following advantages: it alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces the number of parameters. We evaluate on four benchmark datasets (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNet achieves state-of-the-art performance while requiring less memory and computation. Code and pre-trained models are available at: https://github.com/liuzhuang13/DenseNet.

Intensive reading

Main content

Background: Recent research shows that CNNs can be trained deeper, more accurately and more efficiently if they contain shorter connections between layers close to the input and layers close to the output (e.g., ResNet).

Problems raised: existing deep networks have very large numbers of parameters, and their layers are not fully utilized (stochastic depth shows that some layers can be dropped at random).

Traditional approach: a network with L layers has L connections, one between each layer and the next.

Improvement in this paper: the Dense Convolutional Network (DenseNet), which connects every layer to every other layer in a feed-forward fashion, giving L(L+1)/2 connections. Each layer receives the feature maps of all preceding layers as additional input and passes its own feature maps on to all subsequent layers.

Comparison between ResNet and DenseNet

 Advantages of DenseNet

(1) Alleviating the vanishing gradient problem

(2) Enhanced feature propagation

(3) Greatly reduce the number of parameters


1. Introduction

Translation

Convolutional neural networks (CNNs) have become the dominant machine learning approach for visual object recognition. Although they were originally introduced more than 20 years ago [18], improvements in computer hardware and network structure have made the training of truly deep CNNs possible. The original LeNet5 [19] consisted of 5 layers, VGG had 19 [29], and only last year Highway Networks [34] and Residual Networks (ResNets) [11] surpassed the 100-layer barrier. As CNNs become increasingly deep, a new research problem emerges: as information about the input or gradient passes through many layers, it can vanish and "wash out" by the time it reaches the end (or beginning) of the network. Many recent publications address this or related problems. ResNets [11] and Highway Networks [34] bypass signal from one layer to the next via identity connections. Stochastic depth [13] shortens ResNets by randomly dropping layers during training to allow better information and gradient flow. FractalNets [17] repeatedly combine several parallel layer sequences with different numbers of convolutional blocks to obtain a large nominal depth while maintaining many short paths in the network. Although these different approaches vary in network topology and training procedure, they all share a key characteristic: they create short paths from earlier layers to later layers.

In this paper, we propose an architecture that distills this insight into a simple connectivity pattern: to ensure maximum information flow between layers in the network, we connect all layers (with matching feature-map sizes) directly with each other. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own feature maps to all subsequent layers. Figure 1 illustrates this layout. Crucially, in contrast to ResNets, we never combine features through summation before they are passed into a layer; instead, we combine features by concatenating them. Hence, the ℓ-th layer has ℓ inputs, consisting of the feature maps of all preceding convolutional blocks. Its own feature maps are passed on to all subsequent layers. In this way, an L-layer network has L(L+1)/2 connections, instead of just L as in traditional architectures. Because of its dense connectivity pattern, we refer to our approach as the Dense Convolutional Network (DenseNet).

A possibly counter-intuitive effect of this dense connectivity pattern is that it requires fewer parameters than traditional convolutional networks, as there is no need to relearn redundant feature maps. Traditional feed-forward architectures can be viewed as algorithms with a state that is passed on from layer to layer. Each layer reads the state from its preceding layer and writes to the subsequent layer. It changes the state but also passes on information that needs to be preserved. ResNets [11] make this information preservation explicit through additive identity transformations. Recent variants of ResNets [13] show that many layers contribute very little and can in fact be randomly dropped during training. This makes the state of ResNets similar to (unrolled) recurrent neural networks [21], but the number of parameters of ResNets is substantially larger because each layer has its own weights. Our proposed DenseNet architecture explicitly differentiates between information that is added to the network and information that is preserved. DenseNet layers are very narrow (e.g., 12 filters per layer), adding only a small set of feature maps to the "collective knowledge" of the network and keeping the remaining feature maps unchanged, and the final classifier makes a decision based on all feature maps in the network.

In addition to better parameter efficiency, a big advantage of DenseNets is that they improve information flow and gradients throughout the network, which makes them easy to train. Each layer has direct access to gradients from the loss function and the original input signal, leading to implicit deep supervision [20]. This helps in training deeper network architectures. Additionally, we observe that dense connections have a regularizing effect, reducing overfitting for tasks with smaller training set sizes.

We evaluate DenseNet on four highly competitive benchmark datasets (CIFAR-10, CIFAR-100, SVHN and ImageNet). Our models tend to require far fewer parameters than existing algorithms while achieving comparable accuracy. Furthermore, we significantly outperform the current state-of-the-art results on most of the benchmark tasks.
Intensive reading

Current problem: as the input or gradient information passes through many layers, it may "vanish" by the time it reaches the end of the network.

Inspiration: Highway Networks and ResNet both pass the signal to later layers through shortcut (bypass) connections. Many other network forms exist, but they all share one key point: there are short paths from earlier layers to later layers.

Improvements to this article

Connection mode:

(1) To ensure maximum information flow between layers in the network, we connect all layers (with matching feature map sizes) directly to each other.

(2) To preserve feed-forward properties, each layer takes additional inputs from all previous layers and passes its own feature map to all subsequent layers.

(3) The ℓ-th layer has ℓ inputs, consisting of the feature maps of all preceding convolutional blocks, and its own feature maps are passed to all subsequent layers. In this way, an L-layer network has L(L+1)/2 connections.
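A quick worked example (ours, not from the paper): for a 5-layer dense block, the number of direct connections is L(L+1)/2 = 5·6/2 = 15, whereas a plain feed-forward chain of the same depth has only 5 connections.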

Structure diagram:


2. Related Work

Translation

The increasing number of layers in modern networks amplifies the differences between architectures and motivates the exploration of different connectivity patterns and the revisiting of old research ideas.

Cascade structures similar to our proposed dense network layout were already studied in the neural network literature of the 1980s [3]. That pioneering work focused on fully connected multi-layer perceptrons trained in a layer-by-layer fashion. More recently, fully connected cascade networks trained with batch gradient descent were proposed [40]. Although effective on small datasets, this approach only scales to networks with a few hundred parameters. In [9, 23, 31, 41], utilizing multi-level features in CNNs through skip connections has been found effective for various vision tasks. Parallel to our work, [1] derived a purely theoretical framework for networks with cross-layer connections similar to ours.

Highway Networks [34] were among the first architectures to provide a means of effectively training end-to-end networks with more than 100 layers. Using bypass paths together with gating units, Highway Networks with hundreds of layers can be optimized without difficulty. The bypass paths are presumed to be the key factor that eases the training of these very deep networks. This point is further supported by ResNets [11], in which pure identity mappings are used as bypass paths. ResNets have achieved impressive, record-breaking performance on many challenging image recognition, localization and detection tasks, such as ImageNet and COCO object detection [11]. Recently, stochastic depth was proposed as a way to successfully train a 1202-layer ResNet [13]. Stochastic depth improves the training of deep residual networks by randomly dropping layers during training. This shows that not all layers may be needed and highlights that there is a great amount of redundancy in deep (residual) networks. Our paper is partly inspired by this observation. ResNets with pre-activation also facilitate the training of state-of-the-art networks with more than 1000 layers [12].

An orthogonal approach to making networks deeper (e.g., with the help of skip connections) is to increase the network width. GoogLeNet [36, 37] uses an "Inception module" which concatenates feature maps produced by filters of different sizes. In [38], a variant of ResNets with wide generalized residual blocks was proposed. In fact, simply increasing the number of filters in each layer of ResNets can improve their performance provided the depth is sufficient [42]. FractalNets also achieve competitive results on several datasets using a wide network structure [17].

Instead of drawing representational power from extremely deep or wide architectures, DenseNets exploit the potential of the network through feature reuse, yielding condensed models that are easy to train and highly parameter-efficient. Concatenating feature maps learned by different layers increases variation in the input of subsequent layers and improves efficiency. This constitutes a major difference between DenseNet and ResNet. Compared with Inception networks [36, 37], which also concatenate features from different layers, DenseNets are simpler and more efficient.

There are other notable network architecture innovations that have produced competing results. The Network in Network (NIN) [22] structure includes micro-multilayer perceptrons into the filters of the convolutional layer to extract more complex features. In deep supervised networks (DSN) [20], inner layers are directly supervised by auxiliary classifiers, which can enhance the gradients received by early layers. Ladder networks [27, 25] introduce lateral connections into autoencoders, resulting in impressive accuracy on semi-supervised learning tasks. In [39], Deep Fusion Networks (DFN) are proposed to improve information flow by combining intermediate layers of different underlying networks. Network extensions with pathways that minimize reconstruction loss have also been shown to improve image classification models [43].

Intensive reading

Sources of inspiration

Depth:

  • Highway Networks and ResNets demonstrate that bypass paths are a key factor in easing the training of very deep networks
  • Stochastic depth improves the training of deep residual networks by randomly discarding layers during training. This suggests that not all layers may be needed and highlights the large amount of redundancy present in deep (residual) networks.

Width:

  • GoogLeNet uses the "Inception module" (see also Inception-v4, Inception-ResNet-v1 and Inception-ResNet-v2), which concatenates feature maps produced by convolution kernels of different sizes; a variant of ResNets with wide generalized residual blocks has also been proposed
  • FractalNets also achieve competitive results on multiple datasets using wide network structures

DenseNet core:   

Feature reuse

Introduction to other network innovations

  • The Network in Network (NIN) structure embeds micro multi-layer perceptrons in the filters of convolutional layers to extract more complex features.
  • Deeply Supervised Networks (DSN): internal layers are directly supervised by auxiliary classifiers, which strengthens the gradients received by early layers.
  • Ladder Networks introduce lateral connections into autoencoders, producing impressive accuracy on semi-supervised learning tasks.
  • Deep Fusion Networks (DFN) improve information flow by combining intermediate layers of different base networks.


3. DenseNets

Translation

Consider a single image x_0 that is passed through a convolutional neural network. The network comprises L layers, each of which implements a non-linear transformation H_ℓ(·), where ℓ indexes the layer. H_ℓ(·) can be a composite function of operations such as convolution, pooling, ReLU activation, and batch normalization. We denote the output of the ℓ-th layer as x_ℓ.

ResNets. Traditional convolutional feed-forward networks connect the output of the ℓ-th layer as the input to the (ℓ+1)-th layer [16], which gives rise to the layer transition x_ℓ = H_ℓ(x_{ℓ−1}). ResNets [11] add a skip connection that bypasses the non-linear transformation with an identity function: x_ℓ = H_ℓ(x_{ℓ−1}) + x_{ℓ−1}.

An advantage of ResNets is that the gradient can flow directly from later layers to earlier layers through the identity function. However, the identity function and the output of H_ℓ are combined by summation, which may impede the information flow in the network.

Dense connectivity. To further improve the information flow between layers, we propose a different connectivity pattern: we introduce direct connections from any layer to all subsequent layers. Figure 1 illustrates the layout of the resulting DenseNet. Consequently, the ℓ-th layer receives the feature maps of all preceding layers, x_0, …, x_{ℓ−1}, as input: x_ℓ = H_ℓ([x_0, x_1, …, x_{ℓ−1}]) (Equation 2).

Here [x_0, x_1, …, x_{ℓ−1}] denotes the concatenation of the feature maps produced in layers 0, 1, …, ℓ−1. Because of its dense connectivity, we refer to this network architecture as the Dense Convolutional Network (DenseNet). For ease of implementation, we concatenate the multiple inputs of H_ℓ(·) in Equation (2) into a single tensor.

Composite function. H_ℓ(·) is defined as a composite function of three consecutive operations: batch normalization (BN) [14], followed by ReLU [6] and a 3×3 convolution (Conv).

Pooling layers. The concatenation operation used in Equation (2) is not viable when the size of the feature maps changes. However, an essential part of convolutional networks is the downsampling layers that reduce the feature-map size. To facilitate downsampling in our architecture, we divide the network into multiple densely connected dense blocks; see Figure 2. We refer to the layers between blocks as transition layers, which perform convolution and pooling. The transition layers used in our experiments consist of a batch normalization layer and a 1×1 convolutional layer followed by a 2×2 average pooling layer.

Growth rate. If each function H_ℓ(·) produces k feature maps, it follows that the ℓ-th layer has k_0 + k·(ℓ−1) input feature maps, where k_0 is the number of channels in the input layer. An important difference between DenseNet and existing network architectures is that DenseNet can have very narrow layers, e.g., k = 12. We refer to the hyperparameter k as the growth rate of the network. We show in Section 4 that a relatively small growth rate is sufficient to obtain state-of-the-art results on the datasets we tested. One explanation for this is that each layer has access to all the preceding feature maps in its block and, therefore, to the network's "collective knowledge". One can view the feature maps as the global state of the network. Each layer adds k feature maps of its own to this state. The growth rate regulates how much new information each layer contributes to the global state. The global state, once written, can be accessed from everywhere within the network and, unlike in traditional network architectures, there is no need to replicate it from layer to layer.

Bottleneck layers. Although each layer only produces k output feature maps, it typically has many more inputs. It has been noted in [37, 11] that a 1×1 convolution can be introduced as a bottleneck layer before each 3×3 convolution to reduce the number of input feature maps and thus improve computational efficiency. We find this design especially effective for DenseNet, and we refer to a network with such a bottleneck layer, i.e., the BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3) version of H_ℓ(·), as DenseNet-B. In our experiments, we let each 1×1 convolution produce 4k feature maps.

Compression. To further improve model compactness, we can reduce the number of feature maps at transition layers. If a dense block contains m feature maps, we let the following transition layer generate ⌊θm⌋ output feature maps, where 0 < θ ≤ 1 is referred to as the compression factor. When θ = 1, the number of feature maps across transition layers remains unchanged. We refer to the DenseNet with θ < 1 as DenseNet-C, and we set θ = 0.5 in our experiments. When both the bottleneck layers and transition layers with θ < 1 are used, we refer to the model as DenseNet-BC.

Implementation details On all datasets except ImageNet, the DenseNet used in our experiments has three dense blocks, each with an equal number of layers. Before entering the first dense block, a convolution with 16 (or twice the growth rate of DenseNet-BC) output channels is performed on the input image. For convolutional layers with a kernel size of 3 × 3, each side of the input is zero-padded with one pixel to keep the feature map size fixed. We use 1×1 convolution followed by 2×2 average pooling as a transition layer between two consecutive dense blocks. At the end of the last dense block, global average pooling is performed and then a softmax classifier is attached. The feature map sizes in the three dense blocks are 32×32, 16×16 and 8×8 respectively. We conduct experiments using the basic DenseNet structure configured as {L = 40, k = 12}, {L = 100, k = 12} and {L = 100, k = 24}. For DenseNet-BC, networks configured as {L = 100, k = 12}, {L = 250, k = 24} and {L = 190, k = 40} are evaluated.

In experiments on ImageNet, we use DenseNet-BC structure with 4 dense blocks on 224×224 input images. The initial convolutional layer consists of 2k convolutions of size 7 × 7 with a stride of 2; the number of feature maps for all other layers is also taken from the setting k. Table 1 shows the exact network configuration we used on ImageNet.

Intensive reading

Now assume that there is an input image X0, the network has L layers, each layer implements nonlinear transformation Hℓ(⋅), and ℓ indicates which layer it is. Hℓ(⋅) here is a composite function, which can include operations such as BN layer, ReLU, Pooling or Conv. We denote the output of layer ℓ as Xℓ.

(1) Calculation method

traditional network

 The most original feedforward convolutional neural network uses the output of layer ℓ as the input of layer ℓ+1.

  Formula: Xℓ = Hℓ(Xℓ−1)

ResNet

  ResNet adds a shortcut (skip) connection on top of the ordinary layer-to-layer connection: the output of layer ℓ−1 is added to H_ℓ's transformation of it, and the sum becomes the output of layer ℓ (the input of layer ℓ+1).

  Formula: Xℓ = Hℓ(Xℓ−1) + Xℓ−1

  Advantages: Gradient can flow directly from the later layer to the previous layer through the Identity function.

  Disadvantages: The output of the Identity function and Hℓ are combined by summation, which may hinder the flow of information in the network.

Dense connectivity

   DenseNet lets the output of the ℓ-th layer feed directly into all subsequent layers, and since each layer receives the outputs of all preceding layers, it only needs to produce a small number of new feature maps. This is why DenseNet has far fewer parameters than other models.

  Formula: Xℓ = Hℓ([X0, X1, ⋯, Xℓ−1])

  [X0, X1, ⋯, Xℓ−1] denotes the channel-wise concatenation of the feature maps of layers 0, 1, …, ℓ−1.
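To make the concatenation rule concrete, here is a tiny PyTorch illustration (ours, not the authors' code) of what the ℓ-th layer receives as input; the tensor shapes are arbitrary example values.

```python
import torch

# Toy illustration of Xℓ = Hℓ([X0, X1, ..., Xℓ−1]) with channel-wise concatenation.
x0 = torch.randn(1, 16, 32, 32)   # block input (k0 = 16 channels)
x1 = torch.randn(1, 12, 32, 32)   # output of layer 1 (k = 12 channels)
x2 = torch.randn(1, 12, 32, 32)   # output of layer 2
h3_input = torch.cat([x0, x1, x2], dim=1)   # what H3 receives: 16 + 2*12 = 40 channels
print(h3_input.shape)             # torch.Size([1, 40, 32, 32])
```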


(2) Network structure

DenseNet’s network structure mainly consists of Dense Block + Transition layers

DenseBlock (defines how input and output are connected) is a module containing many layers. The feature map of each layer is the same size, and dense connections are used between layers.

Transition module (controlling the number of channels) connects two adjacent Dense Blocks and reduces the size of the feature map through Pooling.

Dense Block

Composite function

For convenience, Hℓ(⋅) is defined as the composite function of three consecutive operations: BN + ReLU + a 3×3 Conv

Growth rate

Assume the input layer's feature map has k0 channels and that each layer in a Dense Block outputs k feature maps after its convolution. Then the number of channels fed into the ℓ-th layer is k0 + (ℓ−1)·k; we call k the growth rate of the network.

Explanation: each layer has access to all the preceding feature maps in its block, i.e., to the "collective knowledge" of the network. The feature maps can be viewed as the global state of the network, and each layer adds its own k feature maps to this state. The growth rate controls how much new information each layer contributes to the global state. Once written, the global state can be accessed from anywhere in the network and, unlike in traditional architectures, does not need to be replicated layer by layer.

Q: Why does DenseNet have so few parameters, yet obtain a large number of features through feature reuse with only a limited number of convolutions?

Because of the dense connections, each layer only needs to output a fixed, small number of feature maps (k), while the number of channels available to later layers keeps growing through concatenation. The number of convolution kernels in each layer can therefore stay at k, instead of being set to 128, 256, 512, 1024, etc. as in other networks.
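To see how the growth rate controls the channel bookkeeping, here is a minimal PyTorch sketch of a basic dense block (our illustration with our own class name `TinyDenseBlock`; real implementations differ in details): each layer is the composite function BN + ReLU + 3×3 Conv and adds exactly k feature maps, so the ℓ-th layer sees k0 + (ℓ−1)·k input channels.

```python
import torch
import torch.nn as nn

class TinyDenseBlock(nn.Module):
    """Basic dense block: each layer is BN-ReLU-Conv3x3 and emits k new feature maps."""
    def __init__(self, k0, k=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(k0 + i * k),                 # layer i sees k0 + i*k channels
                nn.ReLU(inplace=True),
                nn.Conv2d(k0 + i * k, k, kernel_size=3, padding=1, bias=False),
            )
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]                                       # X0
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))  # Xℓ = Hℓ([X0 .. Xℓ−1])
        return torch.cat(features, dim=1)                    # all feature maps of the block

# A 4-layer block on a 16-channel input ends with 16 + 4*12 = 64 channels.
out = TinyDenseBlock(k0=16, k=12, num_layers=4)(torch.randn(1, 16, 32, 32))
print(out.shape)   # torch.Size([1, 64, 32, 32])
```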

Bottleneck layer

Bottleneck layers

Purpose: although each layer H(⋅) only produces k output feature maps, its input accumulates the outputs of all preceding layers, so the input to later layers becomes very large. A bottleneck layer can be used inside the Dense Block to reduce this computation.

Method: a 1×1 convolution is introduced as a bottleneck before each 3×3 convolution to reduce the number of input feature maps and thereby improve computational efficiency. That is, Hℓ(⋅) becomes BN + ReLU + 1×1 Conv + BN + ReLU + 3×3 Conv (with the 1×1 convolution producing 4k feature maps). This network structure is called DenseNet-B.
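As a hedged sketch of a single DenseNet-B layer (class and argument names are ours), following the BN + ReLU + 1×1 Conv + BN + ReLU + 3×3 Conv order above, with the 1×1 convolution producing 4k feature maps:

```python
import torch
import torch.nn as nn

class BottleneckDenseLayer(nn.Module):
    """One DenseNet-B layer: BN-ReLU-Conv1x1(4k) followed by BN-ReLU-Conv3x3(k)."""
    def __init__(self, in_channels, k=12):
        super().__init__()
        inter = 4 * k                                   # bottleneck width = 4k
        self.h = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, inter, kernel_size=1, bias=False),   # shrink the wide input
            nn.BatchNorm2d(inter),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter, k, kernel_size=3, padding=1, bias=False),  # produce k new maps
        )

    def forward(self, x):
        # x is the concatenation of all earlier feature maps in the block.
        return self.h(x)

# A wide 256-channel input is first reduced to 4k = 48 channels, then to k = 12.
out = BottleneckDenseLayer(256, k=12)(torch.randn(1, 256, 32, 32))
print(out.shape)   # torch.Size([1, 12, 32, 32])
```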

Transition layer

Pooling layers

Reason: feature maps of different sizes cannot be concatenated directly, so dense connections cannot span downsampling operations across the whole network.

Method: DenseNet is divided into multiple Dense Blocks, and the layers between blocks are called transition layers, used for convolution and pooling. In this paper, transition layer = BN + 1×1 Conv + 2×2 average pooling.

Compression

Purpose: to further improve the compactness of the model, the number of feature maps produced by the transition layer can be reduced.

Method: introduce a compression factor θ (0 < θ ≤ 1). When θ = 1, the number of feature maps passing through the transition layer remains unchanged; when θ < 1 and the number of input feature maps is m, the transition layer outputs ⌊θm⌋ feature maps. A DenseNet with θ < 1 is called DenseNet-C (the experiments use θ = 0.5).

The model that combines Dense Blocks with both bottleneck layers and compressed transition layers is called DenseNet-BC.

Q: DenseNet-B already compresses the computation, so why propose DenseNet-C as well?

The bottleneck layer compresses inside the dense block (within H(⋅)), while the compression factor θ acts in the transition layers between blocks (a code sketch of such a transition layer follows below).
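A minimal sketch of the transition layer with compression (again with our own naming), following the BN + 1×1 Conv + 2×2 average pooling recipe and the paper's θ = 0.5:

```python
import math
import torch
import torch.nn as nn

class TransitionLayer(nn.Module):
    """BN -> 1x1 Conv (floor(theta*m) channels) -> 2x2 average pooling."""
    def __init__(self, in_channels, theta=0.5):
        super().__init__()
        out_channels = math.floor(theta * in_channels)   # compression: m -> floor(theta*m)
        self.trans = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.AvgPool2d(kernel_size=2, stride=2),        # halves the spatial resolution
        )

    def forward(self, x):
        return self.trans(x)

# A 168-channel, 32x32 block output becomes 84 channels at 16x16.
out = TransitionLayer(168, theta=0.5)(torch.randn(1, 168, 32, 32))
print(out.shape)   # torch.Size([1, 84, 16, 16])
```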


(3) Implementation Details

Experimental operation

  • Before entering the first Dense Block, the input image passes through a convolution with 16 output channels (or twice the growth rate, for DenseNet-BC)
  • For the 3×3 convolutions, one pixel of zero-padding on each side keeps the feature-map size unchanged
  • A 1×1 convolution followed by 2×2 average pooling is used as the transition layer between two Dense Blocks
  • After the last Dense Block, global average pooling is applied, followed by a softmax classifier.

Experimental data

DenseNet contains three Dense Blocks in total. The feature-map sizes in the three blocks are 32 × 32, 16 × 16 and 8 × 8 respectively, and each Dense Block has the same number of layers (a Dense Block itself does not change the feature-map size; the size reductions come from the transition layers).

Experiment on these basic DenseNets:

  • L = 40 , K = 12
  • L = 100 ,K = 12
  • L = 100 ,K = 24

For DenseNet-BC:

  • L = 100 , K = 12
  • L = 250 , K = 24
  • L = 190 , K = 40

Experimental illustration

In the experiments on ImageNet, the DenseNet-BC structure with 4 Dense Blocks is used on 224×224 inputs. The initial convolutional layer consists of 2k convolutions of size 7×7 with stride 2. The detailed network configuration is given in Table 1 of the paper.
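Putting the pieces together, below is a self-contained, hedged sketch of a CIFAR-style DenseNet-BC (the class names and the (L − 4)/6 layers-per-block arithmetic are our own reading of the configuration above: besides the three blocks there are one initial convolution, two transition convolutions and the final classifier, and each bottleneck layer contains two convolutions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def bottleneck_layer(in_ch, k):
    # BN-ReLU-Conv1x1(4k) -> BN-ReLU-Conv3x3(k)
    return nn.Sequential(
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, 4 * k, kernel_size=1, bias=False),
        nn.BatchNorm2d(4 * k), nn.ReLU(inplace=True),
        nn.Conv2d(4 * k, k, kernel_size=3, padding=1, bias=False),
    )

class DenseBlock(nn.Module):
    def __init__(self, in_ch, k, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            [bottleneck_layer(in_ch + i * k, k) for i in range(num_layers)])

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))   # dense connectivity
        return torch.cat(feats, dim=1)

class DenseNetBC(nn.Module):
    """Sketch of DenseNet-BC for 32x32 inputs: 3 dense blocks and 2 transition layers."""
    def __init__(self, depth=100, k=12, theta=0.5, num_classes=10):
        super().__init__()
        n = (depth - 4) // 6                   # bottleneck layers per dense block
        ch = 2 * k                             # initial conv outputs 2k channels (BC variant)
        layers = [nn.Conv2d(3, ch, kernel_size=3, padding=1, bias=False)]
        for i in range(3):                     # feature maps: 32x32 -> 16x16 -> 8x8
            layers.append(DenseBlock(ch, k, n))
            ch += n * k
            if i < 2:                          # transition: BN + 1x1 conv + 2x2 avg pool
                out_ch = int(theta * ch)
                layers += [nn.BatchNorm2d(ch),
                           nn.Conv2d(ch, out_ch, kernel_size=1, bias=False),
                           nn.AvgPool2d(2)]
                ch = out_ch
        layers += [nn.BatchNorm2d(ch), nn.ReLU(inplace=True)]
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Linear(ch, num_classes)   # softmax is applied inside the loss

    def forward(self, x):
        x = self.features(x)
        x = F.adaptive_avg_pool2d(x, 1).flatten(1)     # global average pooling
        return self.classifier(x)

# DenseNet-BC {L = 100, k = 12} on a CIFAR-sized batch.
logits = DenseNetBC(depth=100, k=12)(torch.randn(2, 3, 32, 32))
print(logits.shape)   # torch.Size([2, 10])
```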


 (4) Advantages of DenseNet

(1) Fewer parameters are required, and the DenseNet structure clearly distinguishes the information added to the network and the information retained, and there is no need to relearn redundant feature maps.

 (2) Improved the information flow and gradient in the entire network. Due to the dense connection method, DenseNet improves the back propagation of gradients, making the network easier to train. Each layer can directly access the loss function and the gradient of the initial input, reducing the vanishing gradient phenomenon.

(3) Dense links have a regularization effect, preserve low-dimensional features, and can reduce the risk of over-fitting for tasks with small training set sizes.


4. Experiments

4.1. Datasets

Translation

CIFAR. The two CIFAR datasets [15] consist of 32×32-pixel color natural images. CIFAR-10 (C10) contains images drawn from 10 classes and CIFAR-100 (C100) from 100 classes. The training and test sets contain 50,000 and 10,000 images respectively, and we hold out 5,000 training images as a validation set. We adopt a standard data augmentation scheme (mirroring/shifting) that is widely used for these two datasets [11, 13, 17, 22, 28, 20, 32, 34]. We denote this data augmentation scheme with a "+" mark at the end of the dataset name (e.g., C10+). For preprocessing, we normalize the data using the channel means and standard deviations. For the final run, we use all 50,000 training images and report the final test error at the end of training.

SVHN. The Street View House Numbers (SVHN) dataset [24] contains 32×32 color digit images. There are 73,257 images in the training set, 26,032 images in the test set, and 531,131 additional images for training. Following common practice [7, 13, 20, 22, 30], we use all the training data without any data augmentation, and a validation set with 6,000 images is split from the training set. We select the model with the lowest validation error during training and report the test error. We follow [42] and divide the pixel values by 255 so they are in the range [0, 1].

ImageNet. ILSVRC 2012 classification dataset [2] contains 1.2 million images for training and 50,000 images for validation from 1,000 categories. We adopt the same data augmentation scheme as in [8, 11, 12] to train the images and apply a single crop or 10 crops of size 224 × 224 at test time. Following [11, 12, 13], we report the classification errors on the validation set.

Intensive reading

CIFAR

Introduction: 32×32 color images. CIFAR-10 (C10) has 10 classes and CIFAR-100 (C100) has 100 classes. The training set has 50,000 images and the test set 10,000; 5,000 training images are held out as a validation set.

Data augmentation: the standard scheme widely used on these two datasets (horizontal flip + random crop).

Preprocessing: normalize the data using the per-channel mean and standard deviation.

SVHN

Introduction: a street-number recognition dataset similar to MNIST, with 32×32 color images. The training set has 73,257 images, the test set 26,032, and there are 531,131 additional images for training. 6,000 images are taken from the training set as a validation set.

Preprocessing: divide the pixel values by 255 so that they lie in the range [0, 1].

ImageNet

Introduction: the largest image recognition database in the world. The ILSVRC 2012 classification dataset has 1.2 million training images and 50,000 validation images, covering 1,000 classes.

Test images: single-crop and 10-crop evaluation at size 224×224.
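As a hedged sketch of the CIFAR-10 input pipeline described above, using torchvision (the flip + pad-and-crop augmentation matches the "C10+" scheme; the normalization statistics are commonly quoted approximations of the per-channel mean/std, not values taken from the paper):

```python
import torchvision
import torchvision.transforms as T

# Approximate per-channel mean/std of CIFAR-10; recompute from the training split if needed.
mean, std = (0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)

train_tf = T.Compose([
    T.RandomCrop(32, padding=4),      # "shift": pad by 4 pixels, then take a random 32x32 crop
    T.RandomHorizontalFlip(),         # "mirror"
    T.ToTensor(),
    T.Normalize(mean, std),           # channel-wise normalization
])
test_tf = T.Compose([T.ToTensor(), T.Normalize(mean, std)])

train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=train_tf)
test_set = torchvision.datasets.CIFAR10("./data", train=False, download=True, transform=test_tf)
print(len(train_set), len(test_set))   # 50000 10000
```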


4.2. Training

Translation

All networks are trained with stochastic gradient descent (SGD). On CIFAR and SVHN, we use batch size 64 and train for 300 and 40 epochs respectively. The initial learning rate is set to 0.1 and is divided by 10 at 50% and 75% of the total number of training epochs. On ImageNet, we train the models for 90 epochs with a batch size of 256. The learning rate is initially set to 0.1 and is lowered by a factor of 10 at epochs 30 and 60. Note that a naive implementation of DenseNet may be memory-inefficient; to reduce GPU memory consumption, please refer to our technical report on memory-efficient implementations of DenseNets [26].

Following [8], we use a weight decay of 10^{-4} and Nesterov momentum [35] of 0.9 without dampening. We adopt the weight initialization introduced in [10]. For the three datasets without data augmentation, i.e., C10, C100 and SVHN, we add a dropout layer [33] after each convolutional layer (except the first one) and set the dropout rate to 0.2. The test error is evaluated only once for each task and model setting.
Intensive reading

CIFAR and SVHN

Batch size: 64

Epochs: 300 (CIFAR) and 40 (SVHN)

Initial learning rate: 0.1, divided by 10 at 50% and 75% of the total training epochs

ImageNet

Batch size: 256

Epochs: 90

Initial learning rate: 0.1, divided by 10 at epochs 30 and 60

In addition: due to GPU memory constraints, the largest model (DenseNet-161) is trained with batch size 128 for 100 epochs, with the learning rate divided by 10 at epoch 90

Optimizer: SGD

Weight decay: 10^{-4}

Momentum: 0.9 (Nesterov)

Dropout: for the three settings without data augmentation (C10, C100 and SVHN), a dropout layer with rate 0.2 is added after each convolutional layer except the first one
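A hedged sketch of the CIFAR training loop implied by these settings (the model and data loader below are placeholders; with 300 total epochs, the 50%/75% milestones are epochs 150 and 225):

```python
import torch
import torch.nn as nn

# Placeholders: swap in a real DenseNet and a DataLoader over the CIFAR-10 training set.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
train_loader = [(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))]

epochs = 300
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            weight_decay=1e-4, nesterov=True)
# Divide the learning rate by 10 at 50% and 75% of the total number of epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[int(0.5 * epochs), int(0.75 * epochs)], gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(epochs):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()   # one scheduler step per epoch
```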


4.3. Classification Results on CIFAR and SVHN

Translation

We train DenseNets with different depths L and growth rates k. Table 2 shows the main results on CIFAR and SVHN. To highlight the general trends, all results that outperform the existing state of the art are marked in bold, and the overall best result is shown in blue.

Accuracy. Possibly the most noticeable trend originates from the bottom row of Table 2, which shows that DenseNet-BC with L = 190 and k = 40 outperforms the existing state of the art consistently on all the CIFAR datasets. It achieves an error rate of 3.46% on C10+ and 17.18% on C100+, significantly lower than the error rates achieved by the wide ResNet architecture [42]. Our best results on C10 and C100 (without data augmentation) are even more encouraging: both are close to 30% lower than FractalNet with drop-path regularization [17]. On SVHN, with dropout, the DenseNet with L = 100 and k = 24 also surpasses the current best result achieved by wide ResNets. However, the 250-layer DenseNet-BC does not further improve performance over its shorter counterpart. This may be explained by the fact that SVHN is a relatively easy task, and extremely deep models may overfit the training set.

Capacity. Without compression or bottleneck layers, the general trend of DenseNets is to perform better as L and k increase. We attribute this primarily to the corresponding increase in model capacity. This is better demonstrated with the C10+ and C100+ columns. On C10+, as the number of parameters increases from 1.0M to 7.0M to 27.2M, the error decreases from 5.24% to 4.10% and finally to 3.74%. On the C100+ we observed a similar trend. This shows that DenseNets can take advantage of the enhanced representation capabilities of larger and deeper models. This also shows that they do not have the overfitting or optimization difficulties of residual networks [11].

Parameter efficiency. The results in Table 2 indicate that DenseNets utilize parameters more efficiently than alternative architectures, in particular ResNets. The DenseNet-BC with bottleneck structure and dimension reduction at transition layers is particularly parameter-efficient. For example, our 250-layer model only has 15.3M parameters, but it consistently outperforms other models such as FractalNet and Wide ResNets that have more than 30M parameters. We also highlight that DenseNet-BC with L = 100 and k = 12 achieves performance comparable to the 1001-layer pre-activation ResNet (e.g., 4.51% vs 4.62% error on C10+, 22.27% vs 22.71% error on C100+) while using 90% fewer parameters.

Overfitting. One positive side effect of the more efficient use of parameters is that DenseNets are less prone to overfitting. We observe that on the datasets without data augmentation, the improvements of the DenseNet architectures over prior work are particularly pronounced. On C10, the improvement denotes a 29% relative reduction in error, from 7.33% to 5.19%. On C100, the reduction is about 30%, from 28.20% to 19.64%. In our experiments, we observed potential overfitting in a single setting: on C10, a 4× growth of parameters produced by increasing k from 12 to 24 leads to a modest increase in error, from 5.77% to 5.83%. The DenseNet-BC bottleneck and compression layers appear to be an effective way to counter this trend.

Intensive reading

We train DenseNet using different depths L and growth rates k. Table 2 shows the main results of CIFAR and SVHN. To highlight the overall trend, bold represents an improvement over the previous best, and blue represents the best result.

Accuracy

  • [CIFAR] The last row of the table, the results of the DenseNet-BC network with L=190 and k=40, outperform all existing models.
  • [SVHN] DenseNet (with dropout) with L=100 and k=24 also far exceeds the best result of wide ResNet. However, the 250-layer DenseNet-BC does not further improve performance; the likely reason is that SVHN is a relatively easy task and an overly deep model may overfit.

Capacity

Without compression or bottleneck layers, the general trend for DenseNets is to perform better as L and k increase. We attribute this primarily to the corresponding increase in model capacity.

This shows that DenseNets can take advantage of the enhanced representation capabilities of larger and deeper models. This also shows that they do not have the overfitting or optimization difficulties of residual networks

Parameter efficiency

DenseNets utilize parameters more efficiently than other architectures, especially ResNets. DenseNet-BC, which has a bottleneck structure and reduces dimensions at the transition layers, is particularly effective.

Overfitting

The compression and bottleneck layers of DenseNet-BC are effective ways to suppress the overfitting trend.


4.4. Classification Results on ImageNet

Translation

We evaluate DenseNet-BC with different depths and growth rates on the ImageNet classification task and compare it with the state-of-the-art ResNet architecture. To ensure a fair comparison between the two architectures, we adopted the publicly available Torch implementation of ResNet from [8], thus eliminating all other factors such as differences in data preprocessing and optimization settings. We only replace the ResNet model with the DenseNet-BC network and make all experimental settings exactly the same as those used for ResNet.

We report the single-crop and 10-crop validation errors of DenseNets on ImageNet in Table 3. Figure 3 shows the top-1 validation error as a function of the number of parameters (left) and FLOPs (right) for DenseNets and ResNets. The results show that DenseNets perform on par with the state-of-the-art ResNets while requiring significantly fewer parameters and less computation to achieve comparable performance. For example, a DenseNet-201 with 20M parameters yields similar validation error to a 101-layer ResNet with more than 40M parameters. A similar trend can be observed in the right panel, which plots the validation error as a function of FLOPs: a DenseNet requiring as much computation as a ResNet-50 performs on par with a ResNet-101, which requires twice as much computation.

It is worth noting that our experimental setup implies that we use hyperparameter settings optimized for ResNets but not for DenseNets. It is conceivable that a broader hyperparameter search could further improve the performance of DenseNet on ImageNet.

Intensive reading

The publicly available Torch implementation of ResNet is adopted, thus eliminating all other factors such as differences in data preprocessing and optimization settings. We only replace the ResNet model with the DenseNet-BC network and make all experimental settings exactly the same as those used for ResNet.

As can be seen from the line chart (Figure 3), at the same error rate DenseNet-BC needs fewer parameters and less computation than ResNet.


5. Discussion

Translation

Superficially, DenseNets are quite similar to ResNets: Equation (2) differs from Equation (1) only in that the inputs to H_ℓ(·) are concatenated instead of summed. However, the implications of this seemingly small modification lead to substantially different behaviors of the two network architectures.

Model compactness. A direct consequence of the input concatenation is that the feature maps learned by any of the DenseNet layers can be accessed by all subsequent layers. This encourages feature reuse throughout the network and leads to more compact models. The left two panels of Figure 4 show the result of an experiment that aims to compare the parameter efficiency of all variants of DenseNets (left) and a comparable ResNet architecture (middle). We train multiple small networks with varying depths on C10+ and plot their test accuracy as a function of network parameters. In comparison with other popular network architectures such as AlexNet [16] or VGG-net [29], ResNets with pre-activation use fewer parameters while typically achieving better results [12]. Hence, we compare DenseNet (k = 12) against this architecture. The training setting for DenseNet is kept the same as in the previous section.

Implicit deep supervision One explanation for the improved accuracy of dense convolutional networks could be that individual layers are subject to additional supervision by a loss function through shorter connections. DenseNets can be interpreted to perform a kind of "deep supervision". The benefits of deep supervision have been previously demonstrated in deeply supervised networks (DSNs; [20]), which have classifiers in each hidden layer, forcing intermediate layers to learn discriminative features.

DenseNets perform a similar deep supervision in an implicit fashion: a single classifier on top of the network provides direct supervision to all layers through at most two or three transition layers. However, the loss function and gradient of DenseNets are substantially less complicated, as the same loss function is shared between all layers.

Stochastic vs. deterministic connections. There is an interesting connection between dense convolutional networks and the stochastic depth regularization of residual networks [13]. In stochastic depth, layers in the residual network are randomly dropped, which creates direct connections between the surrounding layers. As the pooling layers are never dropped, the network results in a similar connectivity pattern to DenseNet: there is a small probability for any two layers between the same pooling layers to be directly connected, if all intermediate layers are randomly dropped. Although the methods are ultimately quite different, the DenseNet interpretation of stochastic depth may provide insights into the success of this regularizer.

Feature reuse. By design, DenseNets allow layers access to the feature maps of all of their preceding layers (although sometimes through transition layers). We conducted an experiment to investigate whether a trained network takes advantage of this opportunity. We first train a DenseNet on C10+ with L = 40 and k = 12. For each convolutional layer ℓ within a block, we compute the average (absolute) weight assigned to the connections with layer s. Figure 5 shows the heat maps for all three dense blocks. The average absolute weight serves as a surrogate for the dependency of a convolutional layer on its preceding layers. A red dot in position (ℓ, s) indicates that layer ℓ makes, on average, strong use of the feature maps produced s layers before. Several observations can be made from the figure:

  1. All layers distribute weights over many inputs within the same block. This shows that features extracted by very early layers are indeed used directly by deeper layers in the same dense block.
  2. The transition layers also spread their weights across all layers within the preceding dense block, indicating that information flows from the first to the last layers of the DenseNet through few indirections.
  3. The layers within the second and third dense blocks always assign the smallest weight to the output of the transition layer (top row of triangles), indicating that the transition layer outputs many redundant features (lower average weight). This is consistent with the excellent results of DenseNet-BC, where these outputs are compressed.
  4. Although the final classification layer shown on the far right also uses the weights of the entire dense block, there seems to be a high degree of attention paid to the final feature map, suggesting that some more advanced features may be generated later in the network

Intensive reading

Model compactness

A direct consequence of the input connections is that the feature map learned by any DenseNet layer can be accessed by all subsequent layers. This encourages feature reuse across the network and leads to more compact models.

The left two panels of Figure 4 show the results of an experiment designed to compare the parameter efficiency of all variants of DenseNets (left) and a comparable ResNet architecture (middle).

As can be seen from the figure, in order to achieve the same level of accuracy, DenseNet-BC only requires about 1/3 of the parameters of ResNets (middle figure).

The right panel in Figure 4 shows that DenseNet-BC with only 0.8M trainable parameters is able to achieve comparable accuracy to a 1001-layer (pre-activated) ResNet with 10.2M parameters.

Implicit deep supervision

Dense connections give each layer a short path to the final output (passing through at most a few transition layers). Every layer can therefore receive supervision from the loss function almost directly (a kind of implicit "deep supervision"), which pushes the intermediate layers to learn discriminative features as well.

Stochastic vs. deterministic connections

Stochastic depth regularization of residual networks: layers of the residual network are randomly dropped during training, so the layers before and after a dropped layer become directly connected; pooling layers are always retained.

DenseNet: dense connections inside a block create a direct path between every pair of layers. Both methods have a regularizing effect.

Feature reuse

Principle: DenseNets allow each layer to access the feature maps of all preceding layers (sometimes through transition layers).

Experiment design: to check whether a trained network really exploits this, a DenseNet with L = 40 and k = 12 is trained on C10+. For each convolutional layer ℓ inside a block, the average (absolute) weight of its connections to layer s is computed. The heat maps of the three dense blocks are shown in Figure 5 of the paper (a small sketch of this computation follows the conclusions below).

Conclusions:

  1. Within the same block, each layer spreads its weights over many preceding inputs. This shows that, inside a block, features extracted by early layers are indeed used directly by deeper layers, to varying degrees;
  2. The weights of the transition layers also spread over all layers of the preceding dense block. This shows that information flows from the first layer to the last layer of DenseNet through very few indirections;
  3. The layers within the second and third dense blocks assign the least weight to the outputs of the transition layer, which shows that the transition layer outputs many redundant features;
  4. The final classifier also uses weights from the entire last dense block, but concentrates on the final feature maps, which suggests that some high-level features are produced late in the network.
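The statistic behind the heat map is straightforward to compute. Below is a hedged sketch of the procedure (our own reading, not the authors' script): the weight tensor of one convolutional layer inside a block is split along its input-channel axis into the slices contributed by each earlier source layer, and the mean absolute value of each slice measures how strongly that source is used.

```python
import torch
import torch.nn as nn

k0, k = 16, 12                      # block input channels and growth rate
ell = 4                             # inspect the 4th layer inside the block
in_ch = k0 + (ell - 1) * k          # channels this layer receives (here 52)
conv = nn.Conv2d(in_ch, k, kernel_size=3, padding=1, bias=False)   # stands in for a trained layer

# Split the (out_ch, in_ch, 3, 3) weight into per-source slices: [k0, k, k, ...].
slice_sizes = [k0] + [k] * (ell - 1)
strengths = [w.abs().mean().item() for w in conv.weight.split(slice_sizes, dim=1)]
print(strengths)   # one value per source s = 0 .. ell-1, i.e., one row of the Figure 5 heat map
```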

6. Conclusion

Translation

We propose a new convolutional network architecture, which we call dense convolutional network (DenseNet). It introduces a direct connection between any two layers with the same feature map size. We show that DenseNets naturally scale to hundreds of layers without any optimization difficulties. In our experiments, DenseNets tended to continuously improve accuracy as the number of parameters increased without showing signs of performance degradation or overfitting. Across multiple environments, it achieves state-of-the-art results on multiple highly competitive datasets. Additionally, DenseNets require fewer parameters and less computation to achieve state-of-the-art performance. Since we adopted hyperparameter settings optimized for residual networks in our study, we believe that the accuracy of DenseNets can be further improved by tuning the hyperparameters and learning rate schedules in more detail.

While following simple connection rules, DenseNets naturally integrate the properties of identity mapping, deep supervision, and diverse depth. They allow features to be reused throughout the network, so more compact models can be learned and, according to our experiments, more accurate models. Due to their compact internal representation and reduced feature redundancy, DenseNets may be good feature extractors for various computer vision tasks based on convolutional features (e.g. [4, 5]). We plan to study this feature transfer using DenseNets in future work.

Intensive reading

This article proposes a new convolutional neural network structure named DenseNet. It is proposed to directly connect any two layers of the same feature map size in the network, achieving the best results on multiple data sets.

DenseNet mainly has the following advantages:

  1. Parameter saving: fewer parameters are needed to reach the same accuracy;
  2. Computation saving: fewer parameters mean higher computational efficiency;
  3. Resistance to overfitting: thanks to extensive feature reuse, it shows some robustness against overfitting;
  4. The dense connections alleviate the vanishing-gradient problem and enhance the propagation of information;
  5. Disadvantage: it consumes a lot of GPU memory during training!

Ten questions about the paper

Q1: What problem does the paper try to solve?

Existing deep networks have very large numbers of parameters, and their layers are not fully utilized. This paper proposes a new convolutional network structure, DenseNet, which directly connects any two layers with the same feature-map size and achieves the best results on multiple datasets.

Q2: Is this a new question?

No, this is further optimization based on ResNet

Q3: What scientific hypothesis does this article want to test?

It establishes a dense connection between all previous layers and subsequent layers.

Feature reuse is achieved through the connection of features on channels. These features allow DenseNet to achieve better performance than ResNet with fewer parameters and computational costs.

Q4: What relevant research is there? How to classify? Who are the noteworthy researchers in the field on this topic?

ResNet

Q5: What is the key to the solution mentioned in the paper?

Connect each layer to each layer in a feed-forward manner

Feature reuse

Q6: How were the experiments in the paper designed?

DenseNet is compared with existing network architectures on multiple datasets, and the classification results are compared.

Q7: What is the dataset used for quantitative evaluation? Is the code open source?

CIFAR-10 and CIFAR-100

SVHN

ImageNet

Yes, the code is open source.

Q8: Do the experiments and results in the paper well support the scientific hypothesis that needs to be verified?

Yes: the classification results on the evaluated datasets improve on the previous best results.

Q9: What contribution does this paper make?

Rather than drawing representational power from extremely deep or wide architectures, DenseNet exploits the potential of the network through feature reuse. It introduces direct connections between any two layers with the same feature-map size, yielding condensed models that are easy to train and parameter-efficient. Concatenating feature maps learned by different layers increases variation in the input of subsequent layers and improves efficiency.

Q10: What’s next? Is there any work that can be further developed?

(1) The accuracy of DenseNets can be further improved by tuning the hyperparameters and learning rate schedule in more detail.

(2) Reducing the memory consumed during training: because later layers need the feature maps of earlier layers, those feature maps must be kept in memory, so a naive implementation uses a large amount of GPU memory.

(3) Due to their compact internal representation and reduced feature redundancy, DenseNets may be good feature extractors for various computer vision tasks based on convolutional features. We plan to study this feature transfer using DenseNets in future work.


Code reproduction: DenseNet code reproduction + super detailed comments (PyTorch)

Next issue preview: SENet


Origin blog.csdn.net/weixin_43334693/article/details/128478420