Super detailed interpretation of classic neural network papers (5) - ResNet (residual network) study notes (translation + intensive reading + code reproduction)

Preface

The paper "Deep Residual Learning for Image Recognition" was written by He Yuming and other big guys. It is quite classic in the field of deep learning and won the best paper in 2016CVPR. Let us learn together today!

Original text: https://arxiv.org/abs/1512.03385


Past review:

Super detailed interpretation of classic neural network papers (1) - AlexNet study notes (translation + intensive reading)
Super detailed interpretation of classic neural network papers (2) - VGGNet study notes (translation + intensive reading)
Super detailed interpretation of classic neural network papers (3) - GoogLeNet InceptionV1 study notes (translation + intensive reading + code reproduction)
Super detailed interpretation of classic neural network papers (4) - InceptionV2-V3 study notes (translation + intensive reading + code reproduction)



Table of contents

Abstract

1. Introduction

2. Related Work

2.1 Residual Representations

2.2 Shortcut Connections

3. Deep Residual Learning

3.1. Residual Learning

3.2. Identity Mapping by Shortcuts

3.3. Network Architectures

3.4. Implementation

4. Experiments

4.1. ImageNet Classification

4.2. CIFAR-10 and Analysis

4.3. Object Detection on PASCAL and MS COCO

Ten questions about the paper


Abstract

translate

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive experimental evidence showing that these residual networks are easier to optimize and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate a residual network with a depth of 152 layers (8x deeper than VGG nets) that still has lower complexity than VGG. An ensemble of these residual networks achieves a top-5 error of 3.57% on the ImageNet test set. This result won first place in the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with networks of 100 and 1000 layers.

The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual networks are the basis of our submissions to the ILSVRC & COCO 2015 competitions, where we also won first place on the tasks of ImageNet object detection, ImageNet object localization, COCO object detection and COCO image segmentation.


Intensive reading 

main content

Background: The deeper the neural network, the harder it is to train.

Contribution of this article: The paper presents a residual learning framework that eases the training of very deep networks. The layers are reformulated to learn residual functions with reference to the layer inputs, instead of learning unreferenced functions.

Results: The paper provides comprehensive evidence that these residual networks are easier to optimize and can gain accuracy as depth increases.

Achievements: Won first place in the ILSVRC 2015 classification task, and later also first place in ImageNet detection, ImageNet localization, COCO detection and COCO segmentation.


1. Introduction

translate

Deep convolutional neural networks have achieved a series of breakthroughs in the field of image classification. Deep networks well integrate low/medium/high-level features and classifiers in an end-to-end multi-layer model. The level of features can be enriched by the number (depth) of stacked layers. Recent results show that the depth of the model plays a crucial role, which has led to the fact that the participating models in the ImageNet competition tend to be "very deep" - 16 layers to 30 layers. Many other visual recognition tasks benefit from very deep models.

Driven by the importance of depth, a new question arises: is training a better network as simple as stacking more layers? An obstacle to answering this question is the long-standing problem of vanishing/exploding gradients, which hinders convergence from the beginning. This problem has been largely addressed by normalized initialization and intermediate normalization layers, which enable networks with dozens of layers to start converging under stochastic gradient descent (SGD) with backpropagation.

When deep networks are able to converge, a degradation problem arises: as the depth of the network increases, the accuracy reaches saturation (not surprisingly) and then degrades rapidly. Surprisingly, this degradation is not caused by overfitting, and adding more layers to a reasonably deep model leads to higher error rates, as our experiments demonstrate.

The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Consider a shallower architecture and its deeper counterpart that adds more layers onto it. A solution by construction exists for the deeper model: the added layers are identity mappings, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers are unable to find solutions that are as good as or better than the constructed solution (or are unable to do so in feasible time).

In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each stacked layer directly fits a desired underlying mapping, we explicitly let these layers fit a residual mapping. Denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping F(x) := H(x) − x. The original mapping is then recast as F(x) + x. We hypothesize that the residual mapping is easier to optimize than the original, unreferenced mapping. In the extreme case, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.

The formulation F(x) + x can be realized by feedforward neural networks with "shortcut connections" (Fig.2). A shortcut connection skips one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig.2). Identity shortcut connections add neither extra parameters nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented with common libraries (e.g., Caffe) without modifying the solvers.

We conduct comprehensive experiments on the ImageNet dataset to demonstrate this degradation problem and evaluate our proposed method. This paper shows that: 1) Our extremely deep residual network is easy to optimize, but the corresponding "plain" network (just stacked layers) suffers from higher error rates as the depth increases. 2) Our deep residual network can easily improve accuracy by adding layers, and the results are much better than previous networks.

A similar phenomenon is also observed on the CIFAR-10 dataset, suggesting that the optimization difficulty and the effect of our method are not specific to a particular dataset. We successfully train models with more than 100 layers on this dataset and explore models with more than 1000 layers.

On the ImageNet classification dataset, extremely deep residual networks achieve excellent results. Our 152-layer residual network is the deepest network ever presented on ImageNet, while still being less complex than the VGG networks. On the ImageNet test set, our ensemble achieves a top-5 error rate of only 3.57% and won first place in the ILSVRC 2015 classification competition. The extremely deep representations also generalize very well to other recognition tasks, which allowed us to win first place in ImageNet detection, ImageNet localization, COCO detection and COCO segmentation in the ILSVRC & COCO 2015 competitions. This strong evidence shows that the residual learning principle is generic, and we expect it to be applicable to other vision and even non-vision problems.

intensive reading

background

The depth of the model plays a crucial role, which leads to the fact that the participating models in the ImageNet competition tend to be "very deep" - 16 to 30 layers.

Problem 1: When the depth of the model is too large, there will be problems of gradient disappearance/explosion.

Vanishing/exploding gradients: both problems arise because the network is too deep and the weight updates become unstable. Essentially they come from the repeated multiplication in gradient backpropagation (factors smaller than 1 multiplied many times shrink the gradient, factors larger than 1 blow it up). When gradients vanish, the weights w close to the input layer barely move; when gradients explode, those weights jump up and down.

Solution: normalized initialization and intermediate normalization layers such as BN, which allow deep networks to start converging.
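
Below is a minimal PyTorch-style sketch (not from the paper's code) of the two remedies mentioned above: He/Kaiming-normalized initialization plus a BatchNorm layer after each convolution.

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, stride=1):
    """3x3 convolution with normalized (He/Kaiming) init, followed by BN and ReLU."""
    conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False)
    nn.init.kaiming_normal_(conv.weight, mode='fan_out', nonlinearity='relu')  # normalized initialization
    return nn.Sequential(conv, nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))  # intermediate normalization
```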

Problem 2: As the depth of the network increases, the accuracy reaches saturation and then degrades rapidly.

Concept of network degradation: as the number of layers increases, training accuracy first gradually saturates; if the depth keeps increasing, training accuracy then drops and the network performs worse. This drop is not caused by overfitting (if it were overfitting, the error would be low during training and high during testing).

Q: Why does network degradation occur?

Because of the nonlinear activation function ReLU, each transformation from input to output is almost irreversible, which causes irreversible information loss. If useful information about a feature is lost, the final result will inevitably suffer. Loosely speaking, every intermediate layer takes its cut: as the number of layers grows, more information is lost along the way.

Solution: deep residual learning

(The specific method is explained in Section 3.1.)

result:

(1) The structure of the residual network is more conducive to optimization convergence

(2) Solve the degradation problem

(3) The residual network can improve network performance while expanding the network depth.


2. Related Work

2.1 Residual Representations

translate

Residual representations
In image recognition, VLAD is a representation that encodes residual vectors with respect to a dictionary, and the Fisher Vector can be regarded as a probabilistic version of VLAD. Both are powerful shallow representations for image retrieval and classification. For vector quantization, encoding residual vectors is shown to be more effective than encoding original vectors.

In low-level vision and computer graphics, partial differential equations (PDEs) are commonly solved with the Multigrid method, which reformulates the system as sub-problems at multiple scales, where each sub-problem is responsible for the residual solution between a coarser and a finer scale. An alternative to Multigrid is hierarchical basis preconditioning, which relies on variables that represent residual vectors between two scales. It has been shown that these solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions. These methods suggest that a good reformulation or preconditioning can simplify the optimization.

intensive reading

main content

(1) For vector quantization, encoding residual vectors is more effective than encoding original vectors.

(2) The residual nature of Multigrid-style solvers makes them converge much faster than standard solvers, indicating that a good reformulation or preconditioning can simplify the optimization problem.


2.2 Shortcut Connections

translate 

Shortcut connections
Practices and theories about shortcut connections have been studied for a long time. An early practice when training multilayer perceptrons (MLPs) was to add a linear layer connecting the network input to the output. In [Szegedy2015Going] and [Lee2015deeply], some intermediate layers are directly connected to auxiliary classifiers to address vanishing/exploding gradients. In [Szegedy2015Going], an "inception" layer is composed of a shortcut branch and a few deeper branches.

Meanwhile, "highway networks" combine shortcut connections with gating functions. These gates are data-dependent and have parameters, while our identity shortcuts are parameter-free. When a gated shortcut is "closed" (approaching 0), the layers in a highway network represent non-residual functions. In contrast, our formulation always learns residual functions; our identity shortcuts are never closed, and all information is always passed through, with additional residual functions to be learned. Furthermore, highway networks have not demonstrated accuracy gains from extremely increased depth (e.g., over 100 layers).

intensive reading

main content

(1) Shortcut connection has gone through a long process of practical and theoretical research and has been proven to be effective.

(2) Comparison with highway networks (gating functions): when a gated shortcut is "closed" (approaching 0), the layers in a highway network represent non-residual functions. In contrast, our model always learns residual functions; our identity shortcuts are parameter-free and never closed, so all information is always passed through while additional residual functions are learned. Furthermore, highway networks have not demonstrated accuracy gains from extremely increased depth (e.g., over 100 layers). (A comparison sketch in code follows below.)
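
To make the comparison concrete, here is a hedged PyTorch sketch (my own illustration, not code from either paper) contrasting a gated highway layer with a residual block: the highway gate T(x) is data-dependent and adds parameters, and when it saturates near 0 the input is passed through without a residual being learned, whereas the identity shortcut has no parameters and is never "closed".

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """Highway layer: y = T(x) * H(x) + (1 - T(x)) * x, with a learned gate T(x)."""
    def __init__(self, dim):
        super().__init__()
        self.h = nn.Linear(dim, dim)   # transform branch H(x)
        self.t = nn.Linear(dim, dim)   # gate branch T(x): extra, data-dependent parameters

    def forward(self, x):
        t = torch.sigmoid(self.t(x))                     # gate in [0, 1]; near 0 the transform is shut off
        return t * torch.relu(self.h(x)) + (1 - t) * x

class ResidualLayer(nn.Module):
    """Residual counterpart: y = F(x) + x, identity shortcut with no gate and no extra parameters."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))  # residual branch F(x)

    def forward(self, x):
        return self.f(x) + x                             # the shortcut always passes all information through
```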


3. Deep Residual Learning

3.1. Residual Learning

translate

We regard H(x) as an underlying mapping to be fitted by a few stacked layers (not necessarily the entire network), where x is the input to these layers. If one assumes that multiple nonlinear layers can approximate complicated functions, then it is equivalent to assume that they can approximate the residual function, e.g., H(x) − x (assuming the input and output have the same dimensions). So we explicitly let these layers approximate a residual function F(x) = H(x) − x rather than H(x) itself. The original function thus becomes F(x) + x. Although both forms should be able to approximate the desired function (as hypothesized), the ease of learning might be different.

This reformulation is motivated by the counterintuitive phenomenon of the degradation problem (Fig.1, left). As discussed in the Introduction, if the added layers could be constructed as identity mappings, a deeper model should have a training error no greater than its shallower counterpart. The degradation problem suggests that solvers may have difficulty approximating identity mappings with multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solver can simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.

In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find perturbations with reference to an identity mapping than to learn the function anew. Experiments (Fig.7) show that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.

intensive reading

ResNet purpose

We deepen the network because we hope a deep network will perform better than a shallow one, or at least no worse (equivalent to directly copying the features of the shallow network).

previous method

In a plain network, the output passed to the next layer is H(x) = F(x); that is, the stacked layers directly fit H(x).

Improvements to this article

In ResNet, the output passed to the next layer becomes H(x) = F(x) + x; that is, the stacked layers fit the residual F(x) = H(x) − x.

Residual module: one path keeps the input unchanged (identity mapping); the other path fits the residual relative to the original network to correct its deviation. Instead of making the stacked layers fit the entire underlying mapping, the network only needs to learn the correction. (A minimal code sketch is given at the end of this subsection.)

Nature

(1) With the residual structure, the input x has more choices. If the network learns that a layer's parameters are redundant, it can simply take the "shortcut connection" path and skip the redundant layer, instead of having to fit parameters so that H(x) = x.

(2) After adding identity mapping, the deep network will at least not be worse than the shallow network.

(3) In ResNet, it suffices to drive F(x) toward 0, so the output becomes F(x) + x = 0 + x = x. Clearly, optimizing a branch toward zero is much easier than fitting an identity transformation.

Q: Why is it valid when F(x) is 0 in H(x)=F(x)+x?

During training, F(x) is what gets learned. If F(x) contributes nothing to improving training accuracy, gradient descent will naturally push the parameters of this branch toward 0. In this way, the model avoids the situation where deeper means worse.
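
A minimal PyTorch sketch of this idea (an illustration assuming the common basic-block layout, not the authors' original code): the stacked layers fit F(x) = H(x) − x, the block outputs F(x) + x, and if the layers are redundant the weights of F can simply be driven toward zero.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Two 3x3 conv layers plus an identity shortcut (input/output dimensions assumed equal)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))  # residual branch F(x)
        return self.relu(f + x)                                       # H(x) = F(x) + x

y = BasicResidualBlock(64)(torch.randn(1, 64, 56, 56))  # shape is preserved: [1, 64, 56, 56]
```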


3.2. Identity Mapping by Shortcuts

translate

We adopt the residual learning algorithm on the stacked layer. A building block is shown in Fig.2. The building blocks in this article are defined as follows (Eq.1): y=F(x,{Wi})+x.
where x and y are the input and output of the layers considered. The function F(x,{Wi}) represents the residual mapping to be learned. The example in Fig.2 has two layers: F = W2 σ(W1 x), where σ denotes ReLU and biases are omitted for simplicity. The operation F + x is performed by a shortcut connection and element-wise addition. After the addition we apply a second nonlinearity (i.e., σ(y), as shown in Fig.2).

The shortcut connection in Eq.1 does not add additional parameters and computational complexity. Not only is this an attractive approach, it is also very important when comparing “plain” networks with residual networks. We can make a fair comparison between the two networks based on the same parameters, depth, width, and computational cost (except for negligible element-level addition).

In Eq.1, the dimensions of x and F must be equal. If they are not (e.g., when the input/output channels change), we can perform a linear projection Ws by the shortcut connection to match the dimensions (Eq.2): y = F(x,{Wi}) + Ws x.

The square matrix Ws can also be used in Eq.1. But our experiments show that identity mapping is sufficient to solve the degradation problem and is economical, so Ws is only used to solve the dimension mismatch problem.

The form of the residual function F is flexible. The function F in the experiments of this paper has two or three layers (Fig.5); more layers are also possible. But if F has only one layer, Eq.1 is similar to a linear layer, y = W1 x + x, for which we have not observed advantages.

We also note that this applies not only to fully connected layers but also to convolutional layers. The function F(x,{Wi}) can represent multiple convolutional layers, and the element-wise addition is performed on two feature maps, channel by channel.

intensive reading

Two ways of Shortcuts Connection:

(1) Shortcuts have the same dimension mapping. The addition of F(x) and x means element-by-element addition.

  • y=F(x,Wi)+x
  • F=W2σ(W1x)

where x and y represent the input and output of the layer respectively. The function F(x,Wi) represents the learned residual mapping, and σ represents ReLU

This method directly passes the input x through shortcuts, does not introduce additional parameters and does not increase the computational complexity of the module, so the residual network and the plain network can be compared fairly.

(2) If the dimensions of the two are different (the input/output channels are changed), a linear mapping needs to be performed on x to match the dimensions.

  • y=F(x,Wi)+Wsx.
  • F=W2σ(W1x)

The purpose of this method is only to keep the dimensions between x and F(x) consistent, so it is usually only used when the number of channels changes between adjacent residual blocks. In most cases, only the first method is used.

Use convolutional layers for residual learning: For simplicity, the above formulas are based on fully connected layers. In fact, they can of course be used for convolutional layers. The addition then becomes the element-by-element addition of the two feature maps between the corresponding channels.
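
The two cases above can be captured in one block; the following PyTorch sketch (my illustration, with my own parameter names) uses the identity shortcut of Eq.1 when dimensions match and the projection Ws of Eq.2 (a 1*1 convolution) only when the spatial size or channel count changes.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.f = nn.Sequential(                       # residual branch F(x, {W_i})
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        if stride != 1 or in_ch != out_ch:
            # Eq.2: y = F(x) + Ws x  (projection shortcut, used only to match dimensions)
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            # Eq.1: y = F(x) + x  (identity shortcut, no extra parameters)
            self.shortcut = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + self.shortcut(x))  # element-wise addition, then ReLU
```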


3.3. Network Architectures

translate

We tested on multiple plain networks and residual networks, and observed consistent phenomena. Next we will discuss the two models on ImageNet.

Plain Network
Our plain network structure (Fig.3, middle) is mainly inspired by the VGG network (Fig.3, left).
The convolutional layers mostly use 3*3 filters and follow two rules: (i) layers with the same output feature-map size have the same number of filters; (ii) if the feature-map size is halved, the number of filters is doubled, so as to preserve the per-layer time complexity. We downsample directly with convolutional layers that have a stride of 2. The network ends with a global average pooling layer and a 1000-way fully connected layer with softmax. The total number of weighted layers is 34, as shown in Fig.3 (middle).

Notably, our model has fewer filters and lower computational complexity than the VGG network (Fig. 3, left). Our 34-layer structure contains 3.6 billion FLOPs (multiply-add), which is only 18% of VGG-19 (19.6 billion FLOPs).

Residual network
Based on the above plain network, we insert shortcut connections (Fig.3, right) to turn the network into the corresponding residual version. If the input and output dimensions are the same, identity shortcuts (Eq.1) can be used directly (the solid line in Fig.3). When the dimension increases (dotted line in Fig.3), consider two options:
(A) The shortcut still performs identity mapping, with extra zeros padded for the increased dimensions. This option introduces no additional parameters;
(B) Use the mapping shortcut of Eq.2 to keep the dimensions consistent (through 1*1 convolution).
For both options, when the shortcut spans feature maps of two sizes, a convolution with stride of 2 is used.
Fig.3 Example of network framework corresponding to ImageNet. Left: VGG-19 model (19.6 billion FLOPs) as reference. Center: plain network, containing 34 parameter layers (3.6 billion FLOPs). Right: Residual network with 34 parameter layers (3.6 billion FLOPs). The shortcuts indicated by dashed lines add dimension. Table 1 shows more details and other variants.

Table 1

Table 1 corresponds to the structural framework of ImageNet. The parameters of the building blocks are in parentheses (also see Fig.5), and several building blocks are stacked. Downsampling is implemented by conv3_1, conv4_1 and conv5_1 with stride of 2.

intensive reading

The following will take the ImageNet data set as an example to compare and discuss the plain network and the residual network.

Plain network

The plain network structure is mainly inspired by the VGG network. The convolutional layer is mainly a 3*3 convolution kernel, which is directly downsampled through the convolutional layer with a stride of 2. At the end of the network there is a global average pooling layer and a 1000-class fully connected layer containing softmax. The number of weighted layers is 34.

Two design principles:

(i) Feature maps of the same output size have the same number of convolution kernels;

(ii) If the size of the feature map is halved, in order to ensure the same time complexity, the number of convolution kernels is doubled.
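
A quick sanity check of rule (ii) (my own arithmetic, not from the paper): the cost of a 3*3 convolutional layer is roughly proportional to H · W · C_in · C_out. After downsampling, the spatial size becomes (H/2) · (W/2) while both channel counts double, giving (H/2)(W/2)(2·C_in)(2·C_out) = H · W · C_in · C_out, so the per-layer time complexity is indeed preserved.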

Compare with VGG:

Our model has fewer convolution kernels and lower computational complexity than VGG. Our 34-layer structure contains 3.6 billion FLOPs (multiply-add), which is only 18% of VGG-19 (19.6 billion FLOPs).

residual network

On the basis of the plain network, adding shortcuts connection becomes the corresponding residual network.

As shown in the figure above, solid lines indicate that the dimensions are the same, so the two can be added directly. Dotted lines indicate that the dimensions differ (downsampling occurs, i.e., convolution with a stride of 2), so the shortcut must adjust the dimensions.

There are two ways to adjust dimensions:

(1) Zero-padding: the extra channels are padded with zeros. This method introduces no extra parameters;

(2) Linear projection: a 1*1 convolution is used to increase the dimension; this introduces parameters to be learned. Accuracy is slightly better than zero-padding, but it costs more time and memory.

Both methods use convolution with stride 2.
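
For reference, here is a hedged sketch of option A, the parameter-free shortcut: spatial subsampling by stride 2 plus zero-padding of the extra channels (option B is the 1*1-convolution projection already shown in the sketch for Section 3.2). This follows a common ResNet reimplementation pattern, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def shortcut_option_a(x, out_channels):
    """Option A: subsample spatially with stride 2 and zero-pad the new channels (no parameters)."""
    x = x[:, :, ::2, ::2]                     # stride-2 subsampling of the identity path
    pad = out_channels - x.size(1)            # number of channels to add
    return F.pad(x, (0, 0, 0, 0, 0, pad))     # zero-padding along the channel dimension

x = torch.randn(1, 64, 56, 56)
print(shortcut_option_a(x, 128).shape)        # torch.Size([1, 128, 28, 28])
```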


3.4. Implementation

translate

The ImageNet implementation follows [Krizhevsky2012ImageNet] and [Simonyan2014Very]. The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation. A 224*224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted. Standard color augmentation is used. We adopt batch normalization (BN) right after each convolution and before activation. We initialize the weights as in [He2014spatial] and train all plain/residual networks from scratch.
We use a mini-batch size of 256. The learning rate starts from 0.1 and is divided by 10 when the error plateaus; the models are trained for up to 60 × 10^4 iterations. We use a weight decay of 0.0001 and a momentum of 0.9. Following [Ioffe2015Batch], we do not use Dropout.

In testing, for comparison studies we adopt the standard 10-crop testing.
For best results, we adopt the fully convolutional form as in [Simonyan2014Very] and [He2014spatial], and average the scores at multiple scales (images are resized so that the shorter side is in {224, 256, 384, 480, 640}).

intensive reading

method

(1) Images are randomly resized so that the shorter side is in [256, 480], then image augmentation is applied.

(2) Input processing: a 224*224 crop is randomly sampled from the image or its horizontal flip; at test time the standard 10-crop evaluation is used, and results can be fused across multiple scales.

(3) BN is used after each convolution and before the activation.

Parameters: mini-batch size 256, initial learning rate 0.1, 60 × 10^4 training iterations, weight decay 0.0001, momentum 0.9. Dropout is not used (BN and dropout do not mix well, each working better alone; reason: variance shift). (A configuration sketch in code follows below.)
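
A hedged PyTorch sketch of these settings (the model and milestone epochs here are placeholders; the paper divides the learning rate by 10 when the error plateaus and trains for 60 × 10^4 iterations):

```python
import torch
import torchvision

model = torchvision.models.resnet34()                      # stand-in 34-layer ResNet
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,                        # initial learning rate 0.1
                            momentum=0.9,                  # momentum 0.9
                            weight_decay=1e-4)             # weight decay 0.0001
# the paper divides the LR by 10 when the error plateaus; a fixed-step schedule is a simple stand-in
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)
```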


4. Experiments

4.1. ImageNet Classification

This paper evaluates our method on the 1000-class ImageNet2012 dataset. The training set contains 1.28 million images, and the validation set contains 50,000 images. We test on 100,000 test images and evaluate the top-1 and top-5 error rates.

Plain network

translate

We first evaluate 18-layer and 34-layer plain networks. The 34-layer network is shown in Fig.3 (middle). The 18-layer network has a similar form; see Table 1 for details.

The results in Table 2 show that the 34-layer network has a higher validation error rate than the 18-layer network. To reveal the reason, in Fig.4 (left) we compare their training/validation error rates during the whole training process. We observe an obvious degradation problem: the 34-layer network has a higher training error rate throughout training, even though the solution space of the 18-layer network is a subspace of that of the 34-layer network.

We argue that this optimization difficulty is unlikely to be caused by vanishing gradients. These plain networks are trained with BN, which ensures the forward signals have non-zero variances. We also verify that the backward-propagated gradients exhibit healthy norms with BN, so neither forward nor backward signals vanish. In fact, the 34-layer plain network still achieves competitive accuracy (Table 3), suggesting the solver works to some extent. We conjecture that deep plain networks may have exponentially low convergence rates, which impacts the reduction of the training error. The reason for this optimization difficulty will be studied in future work.

intensive reading

The first experiment is conducted on 18-layer and 34-layer plain networks. The results in the table below show a degradation phenomenon: throughout training, the 34-layer network has a higher training error rate than the 18-layer network.

(Thin line: error on the training set; thick line: error on the test set)


 Residual network

translate

Next we evaluate 18-layer and 34-layer residual networks ResNets. As shown in Fig.3 (right), the basic framework of ResNets is basically the same as that of the plain network, except that a shortcut connection is added to each pair of 3*3 filters. In the comparison between Table 2 and Fig. 4 (right), all shortcuts are identity maps and padded with zeros for the added dimension (option A). Therefore they did not add additional parameters.

We observe the following three points from Table 2 and Fig.4:

First, the situation is reversed compared with the plain networks: the 34-layer ResNet is better than the 18-layer ResNet (by 2.8%). More importantly, the 34-layer ResNet exhibits considerably lower error on both the training and validation sets. This indicates that the degradation problem is well addressed in this setting and that accuracy can be gained from increased depth.

Second, compared with the corresponding plain network, the 34-layer ResNet reduces the top-1 error rate by 3.5% (Table 2), which benefits from the reduction in training error rate (Fig. 4 right vs left). This also verifies the effectiveness of residual learning in extremely deep networks.

Finally, we also noticed that the accuracy of the 18-layer plain network and the residual network are very close (Table 2), but the convergence speed of ResNet is much faster. (Fig.4 Right vs Left).
If the network is "not particularly deep" (such as 18 layers), the existing SGD can solve the plain network very well, and ResNet can make the optimization converge faster.

intensive reading

Then the 18-layer and 34-layer residual networks are evaluated. To keep the comparison controlled, the backbone is the same as the plain networks, except that shortcut connections are added to each pair of convolutional layers to form the residual structure. Where dimensions do not match, zero-padding is used (method (1) introduced in 3.3), so no extra parameters are added. The training results are shown in the figure below.

[Table 2 Top-1 error rate on ImageNet validation set (%, 10-crop testing)]

in conclusion

(1) The situation is the opposite of the plain networks: the 34-layer ResNet has a lower error rate than the 18-layer one, which shows that accuracy can be gained by increasing depth, i.e., the degradation problem is addressed.

(2) Compared with the plain network of the same depth, the ResNet has a lower error rate, indicating that residual learning remains effective for deep networks.

(3) For the 18-layer plain network, its accuracy is very close to that of the residual network, but the convergence speed of the residual network is faster.


Identity vs. Projection Shortcuts

translate

We have verified that parameterless identity shortcuts are helpful for training. Next we study mapping shortcut (Eq.2). In Table 3, we compare three options:
(A) Zero-padding is used for increasing dimensions, and all shortcuts are parameter-free (the same as Table 2 and Fig.4, right);
(B) Use mapping shortcuts for increased dimensions, and use identity shortcuts for others;
(C) All are mapping shortcuts.

Table 3 shows that all three options are considerably better than the plain counterpart. B is slightly better than A; we argue this is because the zero-padded dimensions in A carry no residual learning. C is marginally better than B, which we attribute to the extra parameters introduced by the many (thirteen) projection shortcuts. The small differences among A, B and C also indicate that projection shortcuts are not essential for addressing the degradation problem. So, in the rest of this paper, we do not use option C, to reduce complexity and model size. Identity shortcuts are particularly important for not increasing the complexity of the bottleneck architectures introduced below.

intensive reading

Parameter-free identity shortcuts have been verified to help training. For the shortcuts that must change dimensions (projection shortcuts), there are three options to compare:

(1) ResNet - 34 A: All shortcuts use identity mapping, that is, the extra channels are filled with 0, and there are no additional parameters

(2) ResNet-34 B: projection shortcuts (1*1 convolution) are used only where the dimensions change; identity shortcuts are used everywhere else.

(3) ResNet - 34 C: All shortcuts use 1 * 1 convolution (the best effect, but introduces more parameters, which is not economical)

The table below shows that the models of the three options are better than the plain model. The order of effect is C>B>A.

[Table 3 Error rate on ImageNet validation set (%, 10-crop testing)]

B is better than A because the zero-padded dimensions in A carry no residual learning.

C is better than B because the shortcuts of C's 13 non-subsampled residual modules all have parameters, and the model capability is relatively strong.

However, the results of A, B and C are very close, which shows that identity-mapping shortcuts are already sufficient to address the degradation problem; projection shortcuts are not essential.


Deeper Bottleneck Architectures

translate

Next we introduce deeper models. Because of concerns about training time, we modify the building block into a bottleneck design. For each residual function F, we use a stack of three layers instead of two (Fig.5). The three layers are 1*1, 3*3 and 1*1 convolutions; the 1*1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3*3 layer a bottleneck with smaller input/output dimensions. Fig.5 shows an example where both designs have similar time complexity.

Parameter-free identity shortcuts are particularly important for the bottleneck architectures. If the identity shortcut in Fig.5 (right) is replaced with a projection, the time complexity and model size are doubled, because the shortcut connects two high-dimensional ends. So identity shortcuts lead to more efficient bottleneck designs.

50-layer ResNet: We replace each 2-layer block in the 34-layer network with the 3-layer bottleneck block, resulting in a 50-layer ResNet (Table 1). We use option B for increasing dimensions. This model has 3.8 billion FLOPs.

101-layer and 152-layer ResNets: We construct 101-layer and 152-layer ResNets by using more 3-layer blocks (Table 1). Remarkably, although the depth is significantly increased, the computational complexity of the 152-layer ResNet (11.3 billion FLOPs) is still much lower than that of VGG-16 (15.3 billion FLOPs) and VGG-19 (19.6 billion FLOPs).

The accuracy of 50/101/152-layer ResNets is much higher than that of 34-layer ResNet (Table 3 and 4). And we did not observe degradation problems. All metrics confirm the benefits of depth. (Tables 3 and 4).

intensive reading

Next, a deeper model is introduced. Each residual block is no longer implemented with two convolutional layers but with three, as shown in the figure below.

50-layer residual network: replacing the 2-layer blocks of the 34-layer residual network with 3-layer bottleneck blocks gives a 50-layer residual network; dimension increase (downsampling) uses the 1*1-convolution projection (option B).

[Table 4: Error rates (%) of single-model results on the ImageNet validation set (except the marked entry, which is reported on the test set)]

in conclusion

The accuracy of 50/101/152-layer resnet is much higher than that of 34-layer resnet, which solves the deep degradation problem. At the same time, even the computational complexity of 152-layer resnet is still smaller than VGG-16 and VGG-19.
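
A minimal sketch of the bottleneck block of Fig.5 (right), assuming PyTorch (channel numbers follow the 256→64→64→256 example in the figure; this is an illustration, not the authors' code):

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 conv reduces channels, 3x3 conv works at the reduced width, 1x1 conv restores channels."""
    def __init__(self, channels, reduction=4):             # e.g. channels=256, width=64
        super().__init__()
        width = channels // reduction
        self.f = nn.Sequential(
            nn.Conv2d(channels, width, 1, bias=False), nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1, bias=False), nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, 1, bias=False), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)                     # identity shortcut: no extra parameters
```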


Comparisons with State-of-the-art Methods

translate

In Table 4 we compare with the previous best single-model results. Our baseline 34-layer ResNets already achieve very competitive accuracy. The 152-layer ResNet has a single-model top-5 validation error of only 4.49%, outperforming even previous ensemble results (Table 5). We combine six models of different depths into an ensemble (only two 152-layer models were included at the time of submission). This gives a top-5 error of 3.57% on the test set (Table 5), winning first place in ILSVRC 2015.

intensive reading

Six ResNets of different depths were combined into an ensemble (only two 152-layer models were included at the time of submission). The ensemble achieves a top-5 error rate of only 3.57% on the test set (Table 5), winning first place in ILSVRC 2015.

[Table 5: Ensemble top-5 error rate on the ImageNet test set]


4.2. CIFAR-10 and Analysis

translate

We conduct more studies on the 10-class CIFAR-10 dataset containing 50,000 training images and 10,000 test images. We train on the training set and validate on the test set. We focus on verifying the effect of extremely deep models rather than pursuing the best results, so we only use a simple framework as follows.

The plain/residual architectures follow the form in Fig.3 (middle/right). The network input is a 32*32 image with the per-pixel mean subtracted. The first layer is a 3*3 convolution. We then use a stack of 6n 3*3 convolutional layers on feature maps of sizes {32, 16, 8}, with 2n layers for each feature-map size and {16, 32, 64} filters respectively. Downsampling is performed by convolutions with a stride of 2. The network ends with a global average pooling layer and a 10-way fully connected layer with softmax. In total there are 6n+2 stacked weighted layers.

The weight decay is set to 0.0001 and the momentum to 0.9, with the weight initialization of [He2015Delving] and BN, but no Dropout. The mini-batch size is 128, trained on two GPUs. The learning rate starts at 0.1 and is divided by 10 at 32,000 and 48,000 iterations; training stops at 64,000 iterations, determined on a 45,000/5,000 train/validation split. We follow the data augmentation in [Lee2015deeply] for training: 4 pixels are padded on each side, and a 32*32 crop is randomly sampled from the padded image or its horizontal flip. At test time, we only evaluate the single view of the original 32*32 image.

We compared n={3,5,7,9}, that is, 20, 32, 44 and 56-layer networks. Fig.6 (left) shows the results of the plain network. As the number of layers of a deep plain network increases, the training error rate also increases. This phenomenon is very similar to the results on ImageNet (Fig. 4, left) and MNIST, indicating that the difficulty of optimization is indeed an important issue.

Fig.6 (middle) shows the effect of ResNets. Similar to ImageNet (Fig.4, right), our ResNets can overcome optimization problems well, and as the depth deepens, the accuracy also improves.

We further explore n=18, i.e., a 110-layer ResNet. In this case, we find the initial learning rate of 0.1 slightly too large to start converging, so we use 0.01 to warm up training until the training error is below 80% (about 400 iterations), then go back to 0.1 and continue training. The rest of the schedule is as before. The 110-layer ResNet converges well (Fig.6, middle). It has fewer parameters than other deep and thin networks such as FitNet and Highway (Table 6), yet achieves the best results (6.43%, Table 6).

intensive reading

CIFAR-10 dataset: 50,000 training images, 10,000 test images, 10 classes in total.

Compare the practices of plain network and residual network

(1) The input image is 32*32 pixels, and the image at this time has been preprocessed (the mean value is subtracted from each pixel)

(2) The first layer is a 3*3 convolution, followed by a stack of 6n 3*3 convolutional layers on feature maps of sizes 32*32 / 16*16 / 8*8 (2n layers per size). In total there are 6n+2 weighted layers (1 first convolution + 2n + 2n + 2n convolutions + 1 fully connected layer).

(3) The numbers of convolution kernels are 16/32/64 respectively; each time the feature-map size is halved, the number of channels is doubled.

Q: Why is the feature map size halved and the number of channels doubled after downsampling?

Because downsampling halves the height and width of the feature map, doubling the number of convolution kernels (and hence channels) keeps the per-layer computation roughly constant (see the notes on MobileNet for details).

Downsampling uses convolution with a stride of 2, and finally adds a global pooling, a fully connected layer of 10 neurons and softmax.

(1) Each residual function is fitted by 2 convolutional layers (3*3 each). With 6n convolutional layers in total, there are 3n shortcuts.

(2) When downsampling, dimensions are matched by zero-padding, so the residual network has exactly the same amount of computation and parameters as its plain counterpart.

(3) During training: weight decay 0.0001, momentum 0.9, weights initialized as proposed in the paper, BN used without dropout, mini-batch size 128, initial learning rate 0.1, divided by 10 at 32,000 and 48,000 iterations, training terminated at 64,000 iterations.

(4) The training set is split into 45,000 training and 5,000 validation images. For augmentation, 4 pixels are padded on each side of the image and a 32*32 crop is randomly sampled from the padded image or its horizontal flip. At test time, only the original 32*32 images are used (a sketch of this augmentation follows below).
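
A sketch of this CIFAR-10 pipeline with torchvision transforms (the per-channel means are commonly used CIFAR-10 statistics, an approximation of the paper's per-pixel mean subtraction):

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),        # pad 4 pixels on each side, then sample a 32x32 crop
    T.RandomHorizontalFlip(),           # horizontal-flip augmentation
    T.ToTensor(),
    T.Normalize(mean=(0.4914, 0.4822, 0.4465), std=(1.0, 1.0, 1.0)),  # mean subtraction only
])
test_transform = T.Compose([            # test time: only the original 32x32 view
    T.ToTensor(),
    T.Normalize(mean=(0.4914, 0.4822, 0.4465), std=(1.0, 1.0, 1.0)),
])
```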

[Figure 6: Training on CIFAR-10. Dashed lines denote training error, bold lines denote test error. Left: plain networks; the error of plain-110 is higher than 60% and not displayed. Middle: ResNets. Right: ResNets with 110 and 1202 layers.]


Analysis of Layer Responses

translate

Fig.7 shows the standard deviations (std) of the layer responses. The responses are the outputs of each 3*3 convolutional layer, after BN and before the nonlinearity (ReLU/addition). For ResNets, this analysis reveals the response strength of the residual functions. Fig.7 shows that ResNets generally have smaller responses than their plain counterparts. These results support our basic motivation (Sec. 3.1) that the residual functions are generally closer to zero than the non-residual functions. We also notice from ResNet-20, 56 and 110 in Fig.7 that deeper ResNets have smaller magnitudes of responses: when there are more layers, an individual layer of ResNets tends to modify the signal less.

intensive reading

The residual branch learns a correction to the input. The standard deviations of the layer responses are shown below:

[Figure 7: Standard deviations (std) of layer responses on CIFAR-10. The responses are the outputs of each 3*3 convolutional layer, after BN and before the nonlinearity. Top: layers shown in their original order. Bottom: responses ranked in descending order.]

method

After BN, the mean is roughly 0, so the standard deviation measures the dispersion (strength) of the response: the larger the std, the stronger the response. The response is taken at the output of each 3*3 convolutional layer, after BN and before the activation.
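
One way to reproduce this measurement (a hedged sketch using forward hooks on the BN layers; the model here is a torchvision stand-in, not the paper's CIFAR networks):

```python
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet34().eval()
stds = {}

def make_hook(name):
    def hook(module, inputs, output):
        stds[name] = output.detach().std().item()   # std of the response: after BN, before ReLU
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.BatchNorm2d):
        module.register_forward_hook(make_hook(name))

with torch.no_grad():
    model(torch.randn(8, 3, 224, 224))              # a dummy batch just to trigger the hooks
print(sorted(stds.values(), reverse=True)[:5])      # the largest responses, as in Fig.7 (bottom)
```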

in conclusion

(1) The response of ResNets is smaller than the response of its corresponding plain network

(2) The residual function is closer to 0 than the non-residual function

(3) The deeper the ResNet, the smaller the response amplitude.

(4) The closer to the starting layer, the greater the output


Exploring Over 1000 Layers

translate

We explore an aggressively deep model of over 1000 layers. We set n = 200, which gives a 1202-layer network, trained as described above. Our method shows no optimization difficulty for this 10^3-layer network, which achieves a training error of <0.1% (Fig.6, right). Its test error is still fairly good (7.93%, Table 6).

But there are still open problems on such aggressively deep models. The test result of the 1202-layer network is worse than that of the 110-layer network, although their training errors are similar. We argue that this is because of overfitting: the 1202-layer network (19.4M parameters) may be unnecessarily large for this small dataset. Strong regularization such as maxout or dropout has been applied to obtain the best results on this dataset.

In this paper, we use no maxout/dropout and simply impose regularization through deep and thin architectures by design, without distracting from the focus on optimization difficulty. But combining with stronger regularization may improve results, which we will study in the future.

intensive reading

Setting n = 200 gives a 1202-layer (6*200+2) residual network. Trained with the same schedule as before, its training error is below 0.1%, indicating no degradation and no optimization difficulty.

However, its performance on the test set is worse than that of the 110-layer network. The paper attributes this to overfitting (the model is too deep and has too many parameters, which is unnecessary for such a small dataset).

This paper does not use maxout or dropout for regularization because the core task is to solve the degradation problem.


4.3. Object Detection on PASCAL and MS COCO

translate

Our method shows good generalization on other recognition tasks. Tables 7 and 8 show the object detection results on PASCAL VOC 2007/2012 and COCO, using Faster R-CNN as the detection method. Here we are interested in the improvement from replacing VGG-16 with ResNet-101; the detection implementation is the same for both networks, so the gains can only be attributed to better networks. Most remarkably, on the challenging COCO dataset we obtain a 6.0% increase in COCO's standard metric (mAP@[.5, .95]), a 28% relative improvement, which is solely due to the learned representations.

Based on the deep residual network, we won first place in ImageNet detection, ImageNet localization, COCO detection and COCO segmentation in the ILSVRC & COCO 2015 competition.

intensive reading

[Table 7: Object detection mAP (%) on the PASCAL VOC 2007/2012 test sets, using Faster R-CNN. See the appendix for better results.]

[Table 8: Object detection mAP (%) on the COCO validation set, using Faster R-CNN.]

Based on the deep residual network, we won first place in ImageNet detection, ImageNet localization, COCO detection and COCO segmentation in the ILSVRC & COCO 2015 competition.


Ten questions about the paper

Q1: What problem does the paper try to solve?

This paper mainly addresses the difficulty of training deep neural networks: as network depth increases, accuracy degrades. The paper proposes residual learning to train very deep networks.

Q2: Is this a new question?

It’s not a new problem, it’s an optimization problem

Q3: What scientific hypothesis does this article want to test?

It studies the degradation problem in deep models: stacked nonlinear layers may have difficulty approximating identity mappings.

Q4: What relevant research is there? How to classify? Who are the noteworthy researchers in the field on this topic?

  • To solve partial differential equations (PDEs), the Multigrid method is commonly used to reformulate the system into multi-scale sub-problems, a mathematical precedent for residual representations.
  • In the appendix, the authors study the application of ResNet to object detection and object localization.

Q5: What is the key to the solution mentioned in the paper?

ResNet actually passes x directly to subsequent layers through shortcut connections, so that the network can easily learn the identity transformation, thus solving the problem of network degradation and making learning more efficient.

Q6: How were the experiments in the paper designed?

1.ImageNet2012:

  • First, train the plain network and the residual network separately, and compare the errors of their different layers in the training set and test set, and whether they are degraded.
  • Then Identity vs Mapping Shortcuts
  • Then deepen the depth, train the improved residual network and evaluate the error rate
  • Finally, compare with excellent methods such as VGG and GoogLeNet.

2.CIFAR-10:

  • First, train the plain network and the residual network separately, and compare the errors of their different layers in the training set and test set, and whether they are degraded.
  • Then study deeper models

3.PASCAL and MS COCO:

Compare with VGG

Q7: What is the dataset used for quantitative evaluation? Is the code open source?

ImageNet 2012, CIFAR-10, PASCAL VOC 2007/2012, COCO

Open source

Q8: Do the experiments and results in the paper well support the scientific hypothesis that needs to be verified?

It supported it, solved the degradation problem, and achieved first place.

Q9: What contribution does this paper make?

1. Study the degradation problem in deep models and propose the ResNet network

2. Residual learning is proposed to assist the learning of deep models without adding learning parameters.

3. ResNet provides a strong basis and new ideas for object detection and object localization.

Q10: What’s next? Is there any work that can be further developed?

1. The convergence rate of deep networks needs further exploration. The authors conjecture that deep plain networks may have exponentially low convergence rates, which affects the reduction of the training error; the reason for this optimization difficulty is left to future work.

2. The degradation problem is addressed for very deep networks, but the benefit of ResNet is no longer evident at 1202 layers, where test accuracy degrades (likely overfitting); stronger regularization is worth exploring.


This concludes our study of the paper "Deep Residual Learning for Image Recognition". The appendix of the paper describes ResNet's application to object detection and object localization; interested readers can take a look~

For code reproduction, please see: ResNet code reproduction + super detailed comments (PyTorch)

Next article preview: DenseNet
