Paper - Deep Residual Learning for Image Recognition

1、Introduction

  1) network depth is very important

  Deep networks naturally integrate low/mid/high-level features and classifiers in an end-to-end multilayer fashion, and the “levels” of features can be enriched by the number of stacked layers (depth).

  2) Increasing the number of layers leads to vanishing/exploding gradients; normalized initialization and intermediate normalization layers largely solve this problem

  An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning.

  3) Plain and residual networks

  Plain networks simply stack layers; their design is similar to VGG.

  4) degradation problem

  When deeper networks are able to start converging, a degradation problem is exposed: as network depth increases, accuracy first saturates (which may be unsurprising) and then degrades rapidly. Unexpectedly, this degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error. If the added nonlinear layers could be constructed as identity mappings, the deeper model should have a training error no higher than its shallower counterpart. The observed degradation suggests that it may be difficult for solvers to approximate identity mappings with multiple nonlinear layers.

      

  5) This paper addresses degradation by introducing deep residual learning

  If an identity mapping were optimal, the solver could simply drive the weights of the stacked nonlinear layers toward zero to approach the identity mapping. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to the identity mapping than to learn the function as an entirely new one.

     

 

2、Deep Residual Learning

  1) Identity Mapping by Shortcuts

    The building block is defined as

    y = F(x, {Wi}) + x

    The first term F(x, {Wi}) is the residual mapping to be learned, and the second term x is the shortcut connection. F(x) and x are summed element-wise, and the sum then passes through a ReLU nonlinearity. When the dimensions of x and F(x) are not equal, the following form can be used instead:

    y = F(x, {Wi}) + Ws x

    Here Ws is a linear projection. As mentioned later in the paper, Ws can be realized in two ways: (A) zero-padding the extra dimensions (channels), i.e. an identity shortcut, or (B) a 1 * 1 convolution, i.e. a projection shortcut. For both options, when the shortcut goes across feature maps of two sizes, it is performed with a stride of 2.
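    Below is a minimal sketch of this building block, assuming PyTorch; it is not the authors' released code, and the class and argument names are my own. The shortcut is the identity when the shapes of F(x) and x match, and a 1x1 projection (option B) otherwise:

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 conv layers as the residual mapping F(x), plus a shortcut."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # residual branch F(x): conv -> BN -> ReLU -> conv -> BN
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # shortcut: identity if shapes match, else a 1x1 projection (Ws)
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.shortcut(x)  # element-wise sum: y = F(x) + x (or + Ws x)
        return self.relu(out)         # ReLU applied after the addition
```

    For example, BasicBlock(64, 128, stride=2) halves the spatial size and doubles the channels, so the projection shortcut is used; BasicBlock(64, 64) uses the identity shortcut.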

   2) Network Architectures

  A plain network with a design similar to VGG is used for comparison.

  3) Implementation

  Image augmentation is used; BN is applied after each convolution and before the activation (ReLU). Weight decay is 0.0001 and momentum is 0.9.
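  A minimal sketch of this setup, assuming PyTorch; the model here is only a placeholder, and just the conv-BN-ReLU ordering and the SGD hyper-parameters (weight decay 0.0001, momentum 0.9, initial learning rate 0.1) come from the paper:

```python
import torch.nn as nn
import torch.optim as optim

def conv_bn_relu(in_ch, out_ch, stride=1):
    # BN is placed right after each convolution and before the ReLU activation
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

model = nn.Sequential(conv_bn_relu(3, 64), conv_bn_relu(64, 64))  # placeholder stack
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
```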

3、Experiments

  1) Comparison with plain networks (parameter counts and computation are equal, not counting the element-wise additions introduced by the shortcuts)


   2) Comparison of identity shortcut and projection shortcut

  

  ResNet-34 A uses zero padding where the dimensions increase; ResNet-34 B uses projection shortcuts where the dimensions increase and identity shortcuts where the dimensions stay the same; ResNet-34 C uses projection shortcuts everywhere.
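  A hedged sketch of the parameter-free option A shortcut, assuming PyTorch; the function name is hypothetical. When the residual branch doubles the channels and halves the feature map, the input is subsampled with stride 2 and the missing channels are filled with zeros:

```python
import torch
import torch.nn.functional as F

def zero_pad_shortcut(x, out_channels, stride=2):
    # spatial subsampling to match the stride of the residual branch
    x = x[:, :, ::stride, ::stride]
    # pad the channel dimension with zeros up to out_channels (no extra parameters)
    missing = out_channels - x.size(1)
    return F.pad(x, (0, 0, 0, 0, 0, missing))  # last pair pads the channel dim

y = zero_pad_shortcut(torch.randn(1, 64, 56, 56), 128)  # -> shape (1, 128, 28, 28)
```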

  3) Deeper Bottleneck Architectures
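  For ResNet-50/101/152 the paper replaces each 2-layer block with a 3-layer bottleneck: a 1x1 convolution that reduces the channel dimension, a 3x3 convolution, and a 1x1 convolution that restores it. A minimal sketch, assuming PyTorch; class and argument names are my own, not from the paper:

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    expansion = 4  # output channels = 4 * bottleneck width

    def __init__(self, in_channels, width, stride=1, shortcut=None):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(in_channels, width, 1, bias=False),                      # 1x1 reduce
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, stride=stride, padding=1, bias=False),  # 3x3
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width * self.expansion, 1, bias=False),           # 1x1 restore
            nn.BatchNorm2d(width * self.expansion),
        )
        # a projection shortcut is passed in when the shapes change; identity otherwise
        self.shortcut = shortcut if shortcut is not None else nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.branch(x) + self.shortcut(x))
```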

