Classic classification models (D): ResNet (2015 CVPR)

Deep Residual Learning for Image Recognition-------ResNet_2015CVPR

Deep residual learning for image recognition

Abstract

Deeper neural networks are more difficult to train. This paper presents a residual learning framework that eases the training of networks that are substantially deeper than those used before. The framework explicitly reformulates the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. The paper provides comprehensive empirical evidence showing that these residual networks are easier to optimize and gain accuracy from considerably increased depth. On the ImageNet dataset, residual networks of up to 152 layers are evaluated; although eight times deeper than VGG nets, they still have lower complexity. An ensemble of these residual networks achieves a 3.57% error rate on the ImageNet test set. This result won first place in the ILSVRC 2015 classification task. Residual networks with 100 and 1000 layers are also analyzed on CIFAR-10.

The depth of representations is of central importance for many visual recognition tasks. Solely due to the extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual networks are the foundation of our submissions to the ILSVRC & COCO 2015 competitions, where we also won first place on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

1. Introduction

Deep convolutional neural networks have led to a series of breakthroughs in image classification. Deep networks naturally integrate low-/mid-/high-level features and classifiers in an end-to-end multilayer fashion, and the level of the features can be enriched by the number of stacked layers (the depth). Recent results show that network depth is of crucial importance: the leading entries on the ImageNet competition all exploit "very deep" models, with 16 to 30 layers. Many other visual recognition tasks have also greatly benefited from very deep models.

Driven by the significance of depth, a new question arises: is learning better networks as simple as stacking more layers? An obstacle to answering this question is the notorious problem of vanishing/exploding gradients, which hampers convergence from the beginning. This problem has been largely addressed by normalized initialization and intermediate normalization layers, which enable networks with tens of layers to start converging under stochastic gradient descent (SGD) with backpropagation.

When deeper networks are able to start converging, a degradation problem emerges: as the network depth increases, accuracy becomes saturated (which is perhaps unsurprising) and then degrades rapidly. Surprisingly, this degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as our experiments also verify. Fig.1 shows a typical example.
Fig.1 Training error (left) and test error (right) on CIFAR-10 for 20-layer and 56-layer "plain" networks. The deeper network has higher error on both the training and the test set.

This degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction for the deeper model: the added layers perform **identity mappings**, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that current solvers are unable to find solutions that are comparably good or better than the constructed solution (or are unable to do so in feasible time).

In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each stack of layers directly fits a desired underlying mapping, we explicitly let these layers fit a residual mapping. Denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping F(x) := H(x) - x. The original mapping is then recast as F(x) + x. We hypothesize that it is easier to optimize the residual mapping than the original, unreferenced mapping. In the extreme case, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping with a stack of nonlinear layers.

The formulation F(x) + x can be realized by feedforward neural networks with "shortcut connections" (Fig.2). Shortcut connections are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig.2). Identity shortcut connections add neither extra parameters nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can easily be implemented using common libraries (e.g., Caffe) without modifying the solvers.
Fig.2 Residual learning: a building block.
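
To make Fig.2 concrete, the following is a minimal PyTorch sketch of a two-layer residual building block; the specific layer choices (3×3 convolutions, batch normalization, 64 channels) are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of the two-layer building block drawn in Fig.2.
# Layer choices (3x3 convolutions, batch norm, 64 channels) are assumptions
# for illustration, not the paper's exact configuration.
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # the residual branch F(x): two stacked layers with a ReLU in between
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        # identity shortcut: element-wise addition, then the second nonlinearity
        return self.relu(residual + x)

# same input/output shape, so the identity shortcut adds no parameters
y = BasicResidualBlock(64)(torch.randn(1, 64, 56, 56))
```
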
We present comprehensive experiments on ImageNet to show the degradation problem and to evaluate our method. This paper shows that: 1) our extremely deep residual networks are easy to optimize, whereas the counterpart "plain" networks (that simply stack layers) exhibit higher training error as the depth increases; 2) our deep residual networks can easily gain accuracy from greatly increased depth, producing results substantially better than previous networks.

Similar phenomena are also observed on the CIFAR-10 dataset, suggesting that the optimization difficulties and the effectiveness of our method are not specific to a particular dataset. On this dataset we successfully train models with more than 100 layers and explore models with more than 1000 layers.

On the ImageNet classification dataset, deep residual networks obtain excellent results. Our 152-layer residual network is the deepest network ever presented on ImageNet, while still having lower complexity than VGG nets. On the ImageNet test set, our ensemble has a 3.57% top-5 error rate and won first place in the ILSVRC 2015 classification competition. The extremely deep representations also generalize very well to other recognition tasks, **which led us to further win first place on ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation in the ILSVRC & COCO 2015 competitions.** This strong evidence shows that the residual learning principle is generic, and we expect it to be applicable to other vision and even non-vision problems.

2. Related Work

**Residual Representations.** In image recognition, VLAD [18] is a representation that encodes residual vectors with respect to a dictionary, and the Fisher Vector [30] can be formulated as a probabilistic version of VLAD [18]. Both are powerful shallow representations for image retrieval and classification [4,48]. For vector quantization, encoding residual vectors [17] is shown to be more effective than encoding original vectors.

In low-level vision and computer graphics, for solving partial differential equations (PDEs), the widely used Multigrid method [3] reformulates the system as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale. An alternative to Multigrid is hierarchical basis preconditioning [45,46], which relies on variables that represent residual vectors between two scales. It has been shown [3,45,46] that these solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions. These methods suggest that a good reformulation or preconditioning can simplify optimization.

**Shortcut Connections.** Practices and theories that lead to shortcut connections [2,34,49] have been studied for a long time. An early practice for training multilayer perceptrons (MLPs) was to add a linear layer connected from the network input to the output [34,49]. In [44,24], a few intermediate layers are directly connected to auxiliary classifiers for addressing vanishing/exploding gradients. The papers of [39,38,31,47] propose methods for centering layer responses, gradients, and propagated errors, implemented by shortcut connections. In [44], an "inception" layer is composed of a shortcut branch and a few deeper branches.

Concurrent with our work, "highway networks" [42,43] present shortcut connections with gating functions [15]. In contrast to our parameter-free identity shortcuts, these gates are data-dependent and have parameters. When a gated shortcut is "closed" (approaching zero), the layers in highway networks represent non-residual functions. Our formulation, on the contrary, always learns residual functions; our identity shortcuts are never closed, and all information is always passed through, with additional residual functions to be learned. Moreover, highway networks have not demonstrated accuracy gains with extremely increased depth (e.g., over 100 layers).

3. Deep Residual Learning

3.1. Residual Learning

Let us consider H(x) as an underlying mapping to be fit by a few stacked layers (not necessarily the entire network), with x denoting the input to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual function, i.e., H(x) - x (assuming the input and output are of the same dimensions). So rather than expecting the stacked layers to approximate H(x), we explicitly let these layers approximate a residual function F(x) := H(x) - x. The original function thus becomes F(x) + x. Although both forms should be able to asymptotically approximate the desired function (as hypothesized), the ease of learning might be different.

This reformulation is motivated by the counterintuitive phenomena about the degradation problem (Fig.1, left). As discussed in the introduction, if the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart. The degradation problem suggests that the solvers might have difficulties approximating identity mappings by multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.
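
To spell out that last step in symbols (a small sketch using the block notation introduced in Section 3.2, not an equation from the paper):

```latex
% If the identity mapping is optimal, the residual branch only has to vanish:
\mathcal{F}(\mathbf{x}, \{W_i\}) \rightarrow 0
\quad\Longrightarrow\quad
\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x} \rightarrow \mathbf{x}
% whereas a plain stack of nonlinear layers would have to fit the identity itself.
```
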

In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping than to learn the function as a new one. We show by experiments (Fig.7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.

3.2. Identity Mapping by Shortcuts

We adopt residual learning for every few stacked layers. A building block is shown in Fig.2. Formally, in this paper we consider a building block defined as:
y = F(x, {W_i}) + x        (1)
Here x and y are the input and output vectors of the layers considered. The function F(x, {W_i}) represents the residual mapping to be learned. For the example in Fig.2 with two layers, F = W_2 σ(W_1 x), in which σ denotes ReLU [29] and the biases are omitted for simplifying notation. The operation F + x is performed by a shortcut connection and element-wise addition. We adopt the second nonlinearity after the addition (i.e., σ(y), see Fig.2).

The shortcut connection in Equation (1) introduces neither extra parameters nor computational complexity. This is not only attractive in practice but also important in our comparisons between plain and residual networks. We can fairly compare plain/residual networks that simultaneously have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition).

The dimensions of x and F must be equal in Equation (1). If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection W_s by the shortcut connection to match the dimensions:
y = F(x, {W_i}) + W_s x        (2)
We can also use a square matrix W_s in Equation (1). But our experiments show that the identity mapping is sufficient for addressing the degradation problem and is economical, so W_s is only used when matching dimensions.
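
For illustration, here is a minimal sketch of such a projection shortcut, realized as a 1×1 convolution (the option described again in Section 3.3); the channel counts and stride are assumed values.

```python
# A sketch of the projection shortcut W_s in Equation (2): a 1x1 convolution
# that matches the shortcut output to the size of F(x, {W_i}).
# The channel counts and stride below are illustrative assumptions.
import torch
import torch.nn as nn

in_channels, out_channels, stride = 64, 128, 2
projection = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                       stride=stride, bias=False)

x = torch.randn(1, in_channels, 56, 56)
shortcut = projection(x)   # W_s x, shape (1, 128, 28, 28)
# the residual branch F(x, {W_i}) must produce the same shape, so that
# y = F(x, {W_i}) + W_s x can be added element-wise
```
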

The form of the residual function F is flexible. The experiments in this paper involve a function F that has two or three layers (Fig.5), while more layers are possible. But if F has only a single layer, Equation (1) is similar to a linear layer, y = W_1 x + x, for which we have not observed advantages.

We also note that although the above notations are about fully-connected layers for simplicity, they are applicable to convolutional layers as well. The function F(x, {W_i}) can represent multiple convolutional layers. The element-wise addition is then performed on two feature maps, channel by channel.

3.3. Network Architectures

Fig.3 Example network architectures for ImageNet. Left: the VGG-19 model [41] as a reference (19.6 billion FLOPs). Middle: a plain network with 34 parameter layers (3.6 billion FLOPs). Right: a residual network with 34 parameter layers (3.6 billion FLOPs). The dotted shortcuts increase dimensions. Table 1 shows more details and other variants.

We have tested various plain/residual networks and observed consistent phenomena. To provide instances for discussion, we describe two models for ImageNet as follows.

Plain network. Our plain baseline (Fig.3, middle) is mainly inspired by the philosophy of VGG nets [41] (Fig.3, left). The convolutional layers mostly have 3×3 filters and follow two simple design rules: (i) for the same output feature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. We perform downsampling directly by convolutional layers that have a stride of 2. The network ends with a global average pooling layer and a 1000-way fully-connected layer with softmax. The total number of weighted layers in Fig.3 (middle) is 34.
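
The two design rules can be sketched in a few lines of PyTorch; the stage length and channel counts below are illustrative assumptions and do not reproduce the exact 34-layer configuration.

```python
# A sketch of the two plain-network design rules described above.
import torch.nn as nn

def plain_stage(in_channels, out_channels, num_layers, downsample):
    """Stack of 3x3 convolutions. Rule (i): same number of filters within a
    stage. Rule (ii): when the feature map size is halved (stride-2 conv),
    the number of filters is doubled to keep the per-layer time complexity."""
    layers = []
    channels = in_channels
    for i in range(num_layers):
        stride = 2 if (downsample and i == 0) else 1
        layers += [
            nn.Conv2d(channels, out_channels, 3, stride=stride,
                      padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        ]
        channels = out_channels
    return nn.Sequential(*layers)

# e.g. moving from 64-filter to 128-filter feature maps at half the resolution
stage = plain_stage(64, 128, num_layers=8, downsample=True)
```
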

It is worth noticing that our model has fewer filters and lower complexity than VGG nets [41] (Fig.3, left). Our 34-layer baseline has 3.6 billion FLOPs (multiply-adds), which is only 18% of VGG-19 (19.6 billion FLOPs).

Residual network. Based on the above plain network, we insert shortcut connections (Fig.3, right), which turn the network into its counterpart residual version. The identity shortcuts (Equation (1)) can be used directly when the input and output are of the same dimensions (solid-line shortcuts in Fig.3). When the dimensions increase (dotted-line shortcuts in Fig.3), we consider two options: (A) the shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions; this option introduces no extra parameters. (B) The projection shortcut in Equation (2) is used to match dimensions (done by 1×1 convolutions). For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.
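
A minimal sketch of option (A), the parameter-free shortcut (stride-2 subsampling plus zero-padded channels); this is an illustration of the idea, not the authors' exact implementation.

```python
# Option (A): identity shortcut with stride-2 spatial subsampling and
# zero-padding of the extra channels. Illustrative sketch only.
import torch
import torch.nn.functional as F

def zero_pad_shortcut(x, out_channels, stride=2):
    x = x[:, :, ::stride, ::stride]          # parameter-free spatial subsampling
    extra = out_channels - x.size(1)
    # pad the channel dimension with `extra` zero feature maps
    return F.pad(x, (0, 0, 0, 0, 0, extra))

x = torch.randn(1, 64, 56, 56)
shortcut = zero_pad_shortcut(x, out_channels=128)   # shape (1, 128, 28, 28)
```
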

3.4. Implementation

Our implementation for ImageNet follows the practice in [21,41]. The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation [41]. A 224×224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted [21]. The standard color augmentation in [21] is used. We adopt batch normalization (BN) [16] right after each convolution and before activation, following [16]. We initialize the weights as in [13] and train all plain/residual networks from scratch. We use SGD with a mini-batch size of 256. The learning rate starts from 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to 60×10^4 iterations. We use a weight decay of 0.0001 and a momentum of 0.9. Following the practice in [16], we do not use dropout [14].
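
This training recipe maps onto standard components; below is a minimal sketch in which `model`, `train_one_epoch`, and `evaluate` are assumed helpers that are not defined here, and the epoch count is a placeholder.

```python
# A sketch of the training setup above: SGD with mini-batch 256, lr 0.1
# divided by 10 when the error plateaus, weight decay 1e-4, momentum 0.9.
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# divide the learning rate by 10 when the error plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
                                                       factor=0.1, patience=5)

num_epochs = 90  # placeholder; the paper trains for up to 60e4 iterations
for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)   # assumed: one pass with mini-batch 256
    val_error = evaluate(model)         # assumed: returns the validation error
    scheduler.step(val_error)
```
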

In testing, for comparison studies we adopt the standard 10-crop testing [21]. For best results, we adopt the fully-convolutional form as in [41,13] and average the scores at multiple scales (images are resized such that the shorter side is in {224, 256, 384, 480, 640}).
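
A minimal sketch of the standard 10-crop protocol (four corner crops plus the center crop and their horizontal flips, scores averaged) using torchvision's `TenCrop`; `image` (a PIL image) and `model` are assumed to exist, and the fully-convolutional multi-scale evaluation is not reproduced here.

```python
# Standard 10-crop testing: crop, stack, and average the predictions.
import torch
from torchvision import transforms

ten_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),  # returns a tuple of 10 cropped PIL images
    transforms.Lambda(lambda crops: torch.stack(
        [transforms.ToTensor()(c) for c in crops])),
])

crops = ten_crop(image)                  # shape (10, 3, 224, 224)
with torch.no_grad():
    scores = model(crops).mean(dim=0)    # average the predictions of the 10 crops
```
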

