ResNet: Deep Residual Learning for Image Recognition

Deeper neural networks are harder to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence that these residual networks are easier to optimize and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8x deeper than VGG nets [41], while still having lower complexity. An ensemble of these residual nets achieves a 3.57% error on the ImageNet test set. This result won first place in the ILSVRC 2015 classification task. We also present analyses of networks with 100 and 1000 layers on the CIFAR-10 dataset.

The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are the foundation of the models we used in the ILSVRC & COCO 2015 competitions, where we also won first place on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

 

Driven by the importance of depth, a question arises: is learning better networks as easy as stacking more layers? One obstacle to answering this question is the notorious problem of vanishing/exploding gradients [1, 9], which hampers convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging under stochastic gradient descent (SGD) with backpropagation.

When deeper networks are able to start converging, a degradation problem is exposed: as the network depth increases, accuracy becomes saturated (which might be unexpected) and then degrades rapidly. Surprisingly, this degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Figure 1 shows a typical example.

 

The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction for the deeper model: the added layers are identity mappings, and the other layers are copied directly from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers are unable to find solutions that are comparably good or better than this constructed solution (or are unable to do so in feasible time).

In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping F(x) := H(x) - x. The original mapping is thus recast as F(x) + x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. In the extreme case, if an identity mapping were optimal, it would be easier to push the residual toward zero than to fit an identity mapping by a stack of nonlinear layers.

The formulation F(x) + x can be realized by feedforward neural networks with "shortcut connections" (Fig. 2). Shortcut connections [2, 34, 49] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameters nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be implemented using common libraries (e.g., Caffe [19]) without modifying the solvers.
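To make the building block concrete, here is a minimal sketch of a two-layer residual block with an identity shortcut. PyTorch is assumed purely for illustration (the post only mentions Caffe); the class name and the single "channels" width are illustrative, not taken from the paper.

import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Two stacked 3x3 conv layers F(x) plus an identity shortcut: y = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))   # first layer of F
        residual = self.bn2(self.conv2(residual))       # second layer of F (no ReLU yet)
        out = residual + x                              # identity shortcut: element-wise addition
        return self.relu(out)                           # second nonlinearity after the addition

Note that the shortcut contributes no parameters; only the two convolutions (and their batch-norm layers) are learned.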

 

On the ImageNet classification set [36], we obtain excellent results with deep residual nets. Our 152-layer residual network is the deepest network ever presented on ImageNet to date, while still having lower complexity than VGG nets [41]. Our ensemble achieves a 3.57% top-5 error on the ImageNet test set and won first place in the ILSVRC 2015 classification competition. The extremely deep representations also generalize well to other recognition tasks, leading us to win first place on ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. This strong evidence suggests that the residual learning principle is generic, and we expect it to be applicable to other vision and non-vision problems.

 

Deep Residual Learning

Let us consider H(x) as an underlying mapping to be fit by a few stacked layers (not necessarily the entire network), with x denoting the inputs to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions [2] (this hypothesis, however, is still an open question; see [28]), then it is equivalent to hypothesize that they can asymptotically approximate the residual function, i.e. H(x) - x (assuming that the input and output are of the same dimensions). So rather than expecting the stacked layers to approximate H(x), we explicitly let these layers approximate a residual function F(x) := H(x) - x. The original function thus becomes F(x) + x. Although both forms should be able to asymptotically approximate the desired function (as hypothesized), the ease of learning might be different.

If the added layers can be constructed as identity mappings, a deeper model should have a training error no greater than its shallower counterpart. The degradation problem suggests that the solvers might have difficulty in approximating identity mappings with multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.

In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping than to learn the function as a new one. We show by experiments (Fig. 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.

We adopt residual learning to every few stacked layers. A building block is shown in Fig. 2. Formally, in this paper we consider a building block defined as:

y = F(x, {W_i}) + x.    (1)

Here x and y are the input and output vectors of the layers considered, and the function F(x, {W_i}) represents the residual mapping to be learned. For the example in Fig. 2 that has two layers, F = W_2 σ(W_1 x), in which σ denotes the nonlinearity. The operation F + x is performed by a shortcut connection and element-wise addition. We adopt a second nonlinearity after the addition (i.e., σ(y), see Fig. 2).

 

In Eqn. (1), the dimensions of F and x must be equal. If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection W_s by the shortcut connection to match the dimensions:

y = F(x, {W_i}) + W_s x.    (2)

We can also use a square matrix W_s in Eqn. (1). But we will show by experiments that the identity mapping is sufficient for addressing the degradation problem and is economical, so W_s is only used when matching dimensions.
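Continuing the sketch above (same PyTorch assumption), a dimension-changing block can use a 1x1 convolution as the projection W_s of Eqn. (2); the class name and the default stride here are illustrative choices.

import torch.nn as nn

class ProjectionShortcutBlock(nn.Module):
    """Residual block where dimensions change: y = F(x) + W_s x, with W_s a 1x1 convolution."""
    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # W_s: 1x1 convolution matching both the channel count and the spatial size
        self.projection = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        return self.relu(residual + self.projection(x))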

The form of the residual function F is flexible. The experiments in this paper involve a function F that has two or three layers (Fig. 5), although more layers are possible. But if F has only a single layer, Eqn. (1) is similar to a linear layer, y = W_1 x + x, for which we have not observed any advantage.

We also note that although the above notation is about fully-connected layers for simplicity, it is applicable to convolutional layers. The function F(x, {W_i}) can represent multiple convolutional layers. The element-wise addition is then performed on two feature maps, channel by channel.

We tested various plain/residual networks and observed consistent phenomena. To provide instances for discussion, we describe two models for ImageNet as follows.

 

Plain network. Our plain baseline (Fig. 3, middle) is mainly inspired by the VGG nets (Fig. 3, left). The convolutional layers mostly have 3x3 filters and follow two simple design rules: (i) for the same output feature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled so as to preserve the per-layer complexity. We perform downsampling directly by convolutional layers that have a stride of 2. The network ends with a global average pooling layer and a 1000-way fully-connected layer with softmax. The total number of weighted layers is 34 in Fig. 3 (middle).
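The two design rules can be captured by a small stage-building helper. This is a sketch under the same PyTorch assumption; make_plain_stage is an invented helper name, and the stage depths in the usage comment follow the paper's 34-layer configuration rather than anything shown explicitly in this post.

import torch.nn as nn

def make_plain_stage(in_channels, out_channels, num_layers):
    """Rule (i): same feature-map size -> same filter count.
    Rule (ii): when the channel count doubles, the first conv halves the map with stride 2."""
    layers = []
    stride = 2 if out_channels != in_channels else 1   # downsample by a strided convolution
    channels = in_channels
    for i in range(num_layers):
        layers += [
            nn.Conv2d(channels, out_channels, kernel_size=3,
                      stride=stride if i == 0 else 1, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        ]
        channels = out_channels
    return nn.Sequential(*layers)

# Example: the four 3x3 stages of the 34-layer plain net (between the initial 7x7 conv
# and the final global-average-pool / fully-connected layer):
# make_plain_stage(64, 64, 6), make_plain_stage(64, 128, 8),
# make_plain_stage(128, 256, 12), make_plain_stage(256, 512, 6)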

Notably, our model has fewer filters and lower complexity than VGG nets [41] (Fig. 3, left). Our 34-layer baseline has 3.6 billion FLOPs (multiply-adds), which is only 18% of VGG-19 (19.6 billion FLOPs).

 

 

 

Residual network. Based on the above plain network, we insert shortcut connections (Fig. 3, right), which turn the network into its counterpart residual version. The identity shortcuts (Eqn. 1) can be directly used when the input and output are of the same dimensions (solid-line shortcuts in Fig. 3). When the dimensions increase (dotted-line shortcuts in Fig. 3), we consider two options: (A) the shortcut still performs identity mapping, with extra zero entries padded for the increasing dimensions; this option introduces no extra parameters; (B) the projection shortcut in Eqn. (2) is used to match dimensions (done by 1x1 convolutions). For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.
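Option (A) can be implemented as a parameter-free shortcut that subsamples spatially and pads the new channels with zeros. The sketch below is one common way to realize this under the PyTorch assumption, not the authors' code; the function name is illustrative.

import torch.nn.functional as F

def zero_pad_identity_shortcut(x, out_channels, stride=2):
    """Option (A): identity shortcut across a dimension increase, with no extra parameters.
    Subsample spatially with the given stride, then pad the added channels with zeros."""
    if stride > 1:
        x = x[:, :, ::stride, ::stride]        # spatial subsampling with stride 2
    extra = out_channels - x.size(1)           # number of zero channels to append
    # pad tuple covers (W_left, W_right, H_top, H_bottom, C_front, C_back) for NCHW input
    return F.pad(x, (0, 0, 0, 0, 0, extra))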

We follow the practice in [21, 41] for our ImageNet implementation. The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation [41]. A 224x224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted [21]. The standard color augmentation in [21] is used. Following [16], we adopt batch normalization (BN) right after each convolution and before activation. We initialize the weights as in [13] and train all plain/residual nets from scratch. We use SGD with a mini-batch size of 256. The learning rate starts from 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to 60 x 10^4 iterations. We use a weight decay of 0.0001 and a momentum of 0.9. We do not use dropout [14], following the practice in [16].
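These hyperparameters map directly onto a standard SGD setup. A minimal sketch is shown below (PyTorch assumed; the paper's training used Caffe), with the "divide by 10 when the error plateaus" rule approximated by ReduceLROnPlateau and a placeholder module standing in for the actual network.

import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)  # placeholder for a plain/residual net
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,             # initial learning rate
                            momentum=0.9,
                            weight_decay=1e-4)  # weight decay 0.0001
# Divide the learning rate by 10 when the monitored error plateaus,
# approximating the schedule described above.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1)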

Next we evaluate 18-layer and 34-layer residual nets (ResNets). The baseline architectures are the same as the plain networks above, except that a shortcut connection is added to each pair of 3x3 filters, as in Fig. 3 (right). In the first comparison (Table 2 and Fig. 4, right), we use identity mapping for all shortcuts and zero-padding for increasing dimensions (option A), so they have no extra parameters compared to the plain counterparts.

We have three major observations from Table 2 and Fig. 4. First, the situation is reversed with residual learning compared to the plain networks: the 34-layer ResNet is better than the 18-layer ResNet (by 2.8%). More importantly, the 34-layer ResNet exhibits considerably lower training error and generalizes to the validation data. This indicates that the degradation problem is well addressed in this setting, and we manage to obtain accuracy gains from increased depth.

Second, compared to its plain counterpart, the 34-layer ResNet reduces the top-1 error by 3.5% (Table 2), resulting from the successfully reduced training error (Fig. 4, right vs. left). This comparison verifies the effectiveness of residual learning on extremely deep systems.

Finally, we also note that the 18-layer plain and residual nets are comparably accurate (Table 2), but the 18-layer ResNet converges faster (Fig. 4, right vs. left). When the network is "not overly deep" (18 layers here), the current SGD solver is still able to find good solutions for the plain net. In this case, the ResNet eases optimization by providing faster convergence at the early stage.

 

Deeper bottleneck architectures. Next we describe our deeper networks for ImageNet. Because of concerns about affordable training time, we modify the building block into a bottleneck design. For each residual function F, we use a stack of three layers instead of two (Fig. 5). The three layers are 1x1, 3x3, and 1x1 convolutions, where the 1x1 layers are responsible for reducing and then increasing (restoring) the dimensions, leaving the 3x3 layer a bottleneck with smaller input/output dimensions. Fig. 5 shows an example, where both designs have similar time complexity.
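A sketch of such a bottleneck block under the same PyTorch assumption; the class name is illustrative, and the widths in the usage comment (a 256-d input reduced to 64-d for the 3x3 layer) mirror the Fig. 5 example rather than anything defined in this post.

import torch.nn as nn

class BottleneckBlock(nn.Module):
    """Bottleneck residual function: 1x1 (reduce) -> 3x3 -> 1x1 (restore), plus identity shortcut."""
    def __init__(self, channels, bottleneck_channels):
        super().__init__()
        self.stack = nn.Sequential(
            nn.Conv2d(channels, bottleneck_channels, kernel_size=1, bias=False),   # reduce dimensions
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, bottleneck_channels, kernel_size=3,
                      padding=1, bias=False),                                      # 3x3 bottleneck layer
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, channels, kernel_size=1, bias=False),   # restore dimensions
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.stack(x) + x)   # identity shortcut and final nonlinearity

# Example roughly matching Fig. 5: block = BottleneckBlock(channels=256, bottleneck_channels=64)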

 

We explore a model with more than 1000 layers. Setting n = 200 (the CIFAR networks described have 6n+2 weighted layers) yields a 1202-layer network, which is trained as described above. Our method shows no optimization difficulty, and this 1000-layer network achieves a training error of less than 0.1% (Fig. 6, right). Its test error is still fairly good (7.93%, Table 6).

But there are still open problems with such aggressively deep models. The test result of this 1202-layer network is worse than that of our 110-layer network, although both have similar training error. We argue that this is because of overfitting: the 1202-layer network may be unnecessarily large (19.4M parameters) for this small dataset. Strong regularization such as maxout [10] or dropout [14] is applied to obtain the best results on this dataset [10, 25, 24, 35]. In this paper, we use no maxout/dropout and simply impose regularization via deep and thin architectures by design, without distracting from the focus on the difficulties of optimization. But combining with stronger regularization may improve results, which we will study in the future.

 
