1、Introduction
1) network depth is very important
Deep networks naturally integrate low/mid/high-level features and classifiers in an end-to-end multilayer fashion, and the "levels" of features can be enriched by the number of stacked layers (depth).
2) increasing the number of layers causes vanishing/exploding gradients, which normalized initialization and normalization of intermediate layers can address
An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning.
3) Plain and residual networks
Plain networks simply stack layers, similar in style to VGG.
4) degradation problem
When deeper networks are able to start converging, a degradation problem is exposed: as network depth increases, accuracy saturates (which may be unsurprising) and then degrades rapidly. Unexpectedly, this degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error. If the added layers could be constructed as identity mappings, the deeper model should have training error no higher than its shallower counterpart. The observed degradation suggests that solvers may have difficulty approximating identity mappings with multiple stacked nonlinear layers.
5) This paper addresses degradation by introducing deep residual learning
If an identity mapping were optimal, the solver could simply drive the weights of the stacked nonlinear layers to zero to obtain it. If the optimal function is closer to an identity mapping than to a zero mapping, it is easier for the solver to find small perturbations with reference to the identity than to learn the function from scratch as a new one.
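The point above can be demonstrated concretely: if the residual branch's weights are all zero, the block computes an identity mapping. A minimal PyTorch sketch (the `ResidualBlock` class and its layer sizes are my own illustration, not the paper's exact block):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: y = relu(F(x) + x). Illustrative only."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.conv2(self.relu(self.conv1(x)))  # F(x)
        return self.relu(residual + x)                   # F(x) + x, then ReLU

block = ResidualBlock(8)
# Drive the residual branch to zero: the block then computes relu(x),
# i.e. an identity mapping for non-negative inputs.
nn.init.zeros_(block.conv1.weight)
nn.init.zeros_(block.conv2.weight)
x = torch.rand(1, 8, 4, 4)        # non-negative input
assert torch.equal(block(x), x)   # identity recovered by zeroing F
```

A plain stack of layers has no such shortcut, so recovering an identity would require the solver to fit it with the nonlinear layers themselves.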
2、Deep Residual Learning
1)Identity Mapping by Shortcuts
y = F(x) + x: the first term F(x) is called the residual mapping, and the second term x is the shortcut connection. F(x) and x are summed element-wise, and the sum then passes through a ReLU nonlinearity. If the dimensions of F(x) and x are not equal, the following form may be employed:
y = F(x) + Ws·x, where Ws is a linear projection. As mentioned later in the paper, Ws can be handled in two ways: (A) zero padding for the increased dimensions (channels), keeping the shortcut parameter-free; (B) a 1×1 convolution, i.e., a projection shortcut. Both use a stride of 2 when the feature-map size is halved.
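The projection form can be sketched as follows, with Ws implemented as a 1×1 convolution (option B). The `DownsampleBlock` name and the 64→128 channel sizes are my own illustration; BN is omitted for brevity:

```python
import torch
import torch.nn as nn

class DownsampleBlock(nn.Module):
    """Residual block across a dimension change: y = relu(F(x) + Ws*x)."""
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.relu = nn.ReLU()
        # Ws: 1x1 convolution with the same stride, matching both the
        # channel count and the halved spatial size of F(x).
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))   # residual mapping F(x)
        return self.relu(f + self.shortcut(x))     # element-wise sum, then ReLU

x = torch.rand(1, 64, 16, 16)
y = DownsampleBlock(64, 128)(x)
assert y.shape == (1, 128, 8, 8)   # channels doubled, spatial size halved
```

When the shapes already match, the `shortcut` module is simply dropped and `x` is added directly.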
2)Network Architectures
The plain baseline network used for comparison is similar in style to VGG; the residual network is the same plain network with shortcut connections added.
3) Implementation
Image augmentation is used; each convolution is followed by BN and then ReLU. Training uses a weight decay of 0.0001 and a momentum of 0.9.
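The conv→BN→ReLU ordering and the optimizer hyperparameters above can be sketched in PyTorch (layer sizes follow the first 7×7 stage of ResNet; the learning rate 0.1 is the paper's starting value, and the variable names are mine):

```python
import torch
import torch.nn as nn

# conv -> BN -> ReLU ordering; bias is redundant before BN, so it is disabled.
layer = nn.Sequential(
    nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)

# SGD with the stated hyperparameters: weight decay 0.0001, momentum 0.9.
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

out = layer(torch.rand(1, 3, 224, 224))
assert out.shape == (1, 64, 112, 112)   # stride 2 halves the 224x224 input
```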
3、Experiments
1) Comparison with plain networks (parameter counts and operation counts are equal, not counting the element-wise additions of the shortcuts)
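The parameter equality holds because an identity shortcut adds no parameters at all; a quick sketch verifying this on a two-conv stack (the `Residual` wrapper and channel sizes are my own illustration):

```python
import torch.nn as nn

def count_params(m):
    return sum(p.numel() for p in m.parameters())

# Plain counterpart: two stacked 3x3 convolutions.
plain = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1, bias=False),
                      nn.Conv2d(64, 64, 3, padding=1, bias=False))

class Residual(nn.Module):
    """Same two convolutions, plus an identity shortcut."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1, bias=False),
                                  nn.Conv2d(64, 64, 3, padding=1, bias=False))
    def forward(self, x):
        return self.body(x) + x  # identity shortcut: no extra parameters

assert count_params(plain) == count_params(Residual())
```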
2) Comparison of identity shortcut and projection shortcut
ResNet-34 A uses zero padding where dimensions increase; ResNet-34 B uses projection shortcuts where dimensions increase and identity shortcuts elsewhere; ResNet-34 C uses projection shortcuts everywhere.
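Option A's parameter-free shortcut can be sketched as stride-2 subsampling followed by zero padding on the channel axis (the `pad_shortcut` helper name and sizes are my own illustration):

```python
import torch
import torch.nn.functional as F

def pad_shortcut(x, out_channels):
    """Parameter-free shortcut across a dimension increase (option A sketch)."""
    x = x[:, :, ::2, ::2]                    # spatial subsampling with stride 2
    extra = out_channels - x.shape[1]
    # F.pad's tuple pads the last dims first: (W_l, W_r, H_l, H_r, C_l, C_r).
    return F.pad(x, (0, 0, 0, 0, 0, extra))  # zero-pad the new channels

x = torch.rand(1, 64, 16, 16)
s = pad_shortcut(x, 128)
assert s.shape == (1, 128, 8, 8)
assert s[:, 64:].abs().sum().item() == 0.0   # padded channels are all zeros
```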
3) Deeper Bottleneck Architectures
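The bottleneck design replaces two 3×3 convolutions with a 1×1 reduce, a 3×3, and a 1×1 restore, keeping cost low at large depth. A minimal sketch with ResNet-50-style 256→64→256 channels (the `Bottleneck` class is my own simplification; BN is omitted for brevity):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 restore, with an identity shortcut."""
    def __init__(self, channels=256, mid=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, mid, 1, bias=False)   # 1x1: shrink channels
        self.conv = nn.Conv2d(mid, mid, 3, padding=1, bias=False)
        self.expand = nn.Conv2d(mid, channels, 1, bias=False)   # 1x1: restore channels
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.expand(self.relu(self.conv(self.relu(self.reduce(x)))))
        return self.relu(f + x)

x = torch.rand(1, 256, 8, 8)
assert Bottleneck()(x).shape == x.shape   # shape preserved by the block
```

The 3×3 convolution operates only on the narrow `mid` channels, which is what makes ResNet-50/101/152 affordable.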