【WRNs】《Wide Residual Networks》

Disclaimer: This is an original post by the blogger, released under the CC 4.0 BY-SA license. Please include the original source link and this statement when reposting.
Original link: https://blog.csdn.net/bryant_meng/article/details/95754575


arXiv-2016

Keras code:https://github.com/titu1994/Wide-Residual-Networks



1 Background and Motivation

ResNet can train networks with over a thousand layers; however, each fraction of a percent of improved accuracy costs nearly doubling the number of layers, and training very deep residual networks suffers from the problem of diminishing feature reuse, which makes these networks very slow to train.

Previous exploration of residual networks focused on:

  • the order of activations inside a ResNet block (bottleneck)
  • the depth of residual networks

The authors instead start their story from the point of view of width...

The authors point out the pros and cons of the identity connection:

  • Advantage: it makes it possible to train very deep convolutional networks
  • Disadvantage: as the gradient flows through the network there is nothing to force it to go through the residual block weights, ha ha, so a residual block can easily end up slacking off and contributing little!

This disadvantage is not just wild speculation: the paper "Deep networks with stochastic depth" randomly disables residual blocks during training, and the effectiveness of that approach confirms the authors' hypothesis.

2 Advantages / Contributions

  • Proposed wide residual networks, the result of exploring residual architectures along the width dimension
  • Proposed a new way of using dropout: it is inserted between the convolutional layers inside the residual block, rather than on the identity connection (the latter was tried in the "Identity Mappings in Deep Residual Networks" paper and did not work well, as shown below),
    [Figure: dropout placement in "Identity Mappings in Deep Residual Networks"]
  • Achieved state-of-the-art results on CIFAR-10, CIFAR-100, SVHN and COCO, with significant improvements on ImageNet as well!

3 Method

Let us first review the original ResNet structure ("Deep Residual Learning for Image Recognition"):
[Figure: original ResNet block structure]
and then look back at the pre-activation version of ResNet ("Identity Mappings in Deep Residual Networks"):
[Figure: activation-order variants (a)-(e) from the pre-activation paper]
Full pre-activation gives the best results; the paper discussed in this blog uses structure (e) from that figure, i.e. the BN-ReLU-conv ordering.

Haha, speaking with hindsight (evaluating 2016 results from the second half of 2019), this is essentially a hyper-parameter tuning paper that does not change the nature of the structure; of course, you could also say it turns something complex into something simple (and there is still a lot to take away from it)!
[Figure: the residual blocks studied in this paper]

There are three simple ways to increase the representational power of residual blocks (the figure above shows the residual blocks):

  • more convolutional layers per block (deepening factor l)
  • widen the layers by adding more feature planes (widening factor k; see the sketch after this list)
  • increase the filter size (e.g. 3×3 → 5×5)
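To make the widening idea concrete, here is a minimal PyTorch sketch of a pre-activation B(3,3) block whose number of feature planes is scaled by the widening factor k. This is my own illustration, not the authors' code (nor the Keras repository linked at the top):

    import torch
    import torch.nn as nn

    class WideBasicBlock(nn.Module):
        """Pre-activation B(3,3) block: BN-ReLU-conv, BN-ReLU-conv, plus shortcut."""
        def __init__(self, in_planes, planes, stride=1):
            super().__init__()
            self.bn1 = nn.BatchNorm2d(in_planes)
            self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=3,
                                   stride=stride, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(planes)
            self.conv2 = nn.Conv2d(planes, planes, kernel_size=3,
                                   stride=1, padding=1, bias=False)
            # 1x1 projection when the shape changes, identity otherwise
            self.shortcut = nn.Identity()
            if stride != 1 or in_planes != planes:
                self.shortcut = nn.Conv2d(in_planes, planes, kernel_size=1,
                                          stride=stride, bias=False)

        def forward(self, x):
            out = self.conv1(torch.relu(self.bn1(x)))
            out = self.conv2(torch.relu(self.bn2(out)))
            return out + self.shortcut(x)

    # Widening just multiplies the number of feature planes, e.g. k = 2:
    k = 2
    block = WideBasicBlock(in_planes=16, planes=16 * k)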

1) Type of convolutions in residual block

B(M) denotes a residual block, where M is the list of kernel sizes of the convolutional layers in the block; for example, B(3,3) is a block with two 3×3 convolutions, and B(3,1,3) has a 1×1 convolution between two 3×3 convolutions.
[Figure: the residual block variants B(M) considered in the paper]

2) Number of convolutional layers per residual block

To keep the total number of parameters the same, d (the number of residual blocks) should decrease whenever l increases.
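A rough back-of-the-envelope calculation (my own, ignoring BN parameters and shortcut projections) shows why: with d blocks, each containing l 3×3 convolutions of width C, the parameter count is roughly d · l · 9 · C², so keeping the product d · l fixed keeps the budget roughly constant.

    # Rough parameter budget for d blocks, each with l 3x3 convolutions of width C
    # (my own estimate; ignores BN parameters and shortcut projections)
    def approx_params(d, l, C):
        return d * l * 9 * C * C

    print(approx_params(d=12, l=2, C=32))  # two convs per block
    print(approx_params(d=8, l=3, C=32))   # three convs per block: same d*l, same budget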

3) Width of residual blocks

WRN-n-k: n is the total depth and k is the widening factor.
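For example, WRN-40-2 means a total depth of n = 40 with widening factor k = 2. Widening multiplies every channel count by k (as I read the CIFAR architecture, the per-group widths 16, 32, 64 become 16k, 32k, 64k), so the parameter count of each convolution grows roughly as k², as this small check illustrates:

    # Widening multiplies both the input and output channels of a convolution by k,
    # so its parameter count grows roughly as k**2 (extra depth only grows it linearly).
    def conv3x3_params(c_in, c_out):
        return c_in * c_out * 3 * 3

    base = conv3x3_params(16, 16)
    wide = conv3x3_params(16 * 2, 16 * 2)   # k = 2
    print(wide / base)                      # 4.0, i.e. k**2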

4) Dropout in residual blocks

Since widening adds many more parameters, the authors want to study ways of regularization, and they do so in the form of dropout:
[Figure: dropout inserted inside the residual block, between the convolutions]
Dropout can also help deal with the diminishing feature reuse problem by enforcing learning in different residual blocks.

Contrast this with the dropout usage in the "Identity Mappings in Deep Residual Networks" paper:
[Figure: dropout usage in "Identity Mappings in Deep Residual Networks"]

Here the authors explain why they still use dropout when BN is already present. I read it, but was not particularly convinced (I don't really agree! There is also an argument that BN itself plays the role of dropout; if you are interested we can discuss this. Both have a regularization effect, so I won't go into it here.)

Residual networks already have batch normalization that provides a regularization effect, however it requires heavy data augmentation, which we would like to avoid, and it’s not always possible.
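To pin down the placement described above: dropout goes inside the residual branch, between the two convolutions, not on the identity path. A minimal sketch, reusing the WideBasicBlock from Section 3 (my own illustration; 0.3 is the dropout probability the experiments below use on CIFAR):

    # Dropout inserted inside the residual branch, between the two convolutions
    # (sketch only; the placement is the point, the rest follows the earlier block).
    class WideBasicBlockWithDropout(WideBasicBlock):
        def __init__(self, in_planes, planes, stride=1, drop_rate=0.3):
            super().__init__(in_planes, planes, stride)
            self.dropout = nn.Dropout(p=drop_rate)

        def forward(self, x):
            out = self.conv1(torch.relu(self.bn1(x)))
            out = self.dropout(out)                      # between conv1 and conv2
            out = self.conv2(torch.relu(self.bn2(out)))
            return out + self.shortcut(x)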

4 Experiments

4.1 Datasets

  • CIFAR-10
  • CIFAR-100
  • SVHN
  • ImageNet
  • COCO

4.2 Type of convolutions in a block

[Table: test error for different residual block types B(M)]
(Recall: WRN-n-k, where n is the total depth and k is the widening factor.)

B(3,3) has the best accuracy; B(3,1) and B(3,1,3) have accuracy similar to it but with far fewer parameters, and B(3,1,3) is the fastest. The authors use B(3,3) from here on!

4.3 Number of convolutions per block

deepening factor l (the number of convolutional layers per block)
[Table: test error for different deepening factors l]
WRN-40-2, same number of parameters (2.2M), stacks of 3×3 convolutions.

B(3,3) is the best, better than both B(3,3,3) and B(3,3,3,3). The authors attribute this to increased optimization difficulty: with the same total depth of 40, putting more convolutions into each residual block means fewer blocks and therefore fewer identity connections, which the authors believe is unfavorable for optimization.

Again, the authors stick with B(3,3) from here on!

4.4 Width of residual blocks

[Tables: test error of WRNs with different depths and widening factors]
Comparing the two red boxes, WRN-40-4 is 8 times faster to train, so evidently the depth-to-width ratio in the original thin residual networks is far from optimal.

[Figure: training curves on CIFAR]
Solid lines are test error, dashed lines are training error, and the x-axis is epochs; the legend feels mislabeled, ha. The yellow-highlighted part is the key point.

4.5 Dropout in residual blocks

Dropout probability: 0.3 on CIFAR and 0.4 on SVHN.
[Table: effect of dropout on CIFAR and SVHN]
WRN-16-4 gets worse with dropout; the authors speculate this is due to its relatively small number of parameters.
[Figure: training curves with and without dropout]
The authors found that when training ResNets, the training loss sometimes suddenly jumps up; this is caused by weight decay, but turning the weight decay down costs a lot of accuracy. The authors found that dropout can solve this problem! Compare the red and blue dashed lines in the right-hand plot of the figure above!

The authors believe this is probably due to the fact that no data augmentation is used and batch normalization overfits, so dropout adds a regularization effect; without dropout, the training loss drops especially low!

Overall, despite the arguments about combining it with batch normalization, dropout shows itself to be an effective technique for regularization of thin (k=1) and wide (k>1) networks.

4.6 ImageNet and COCO experiments

[Tables: ImageNet results for widened residual networks]
Here WRN-50-2-bottleneck is ResNet-50 with its width doubled; since ResNet-50 uses bottleneck blocks, the block in question is the B(1,3,1) bottleneck rather than B(3,3).
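My reading of WRN-50-2-bottleneck (an assumption on my part, not taken from the authors' code) is that it keeps ResNet-50's B(1,3,1) bottleneck and widens its inner convolutions by k = 2, while the final 1×1 still expands to the usual output width; a sketch in the original post-activation ordering of ResNet-50:

    # Widened B(1,3,1) bottleneck: the inner convolutions are k times wider,
    # the final 1x1 still expands to 4*planes (my reading of WRN-50-2).
    class WideBottleneck(nn.Module):
        expansion = 4

        def __init__(self, in_planes, planes, stride=1, k=2):
            super().__init__()
            width = planes * k                      # e.g. 64 -> 128 in the first stage
            out_planes = planes * self.expansion
            self.conv1 = nn.Conv2d(in_planes, width, kernel_size=1, bias=False)
            self.bn1 = nn.BatchNorm2d(width)
            self.conv2 = nn.Conv2d(width, width, kernel_size=3,
                                   stride=stride, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(width)
            self.conv3 = nn.Conv2d(width, out_planes, kernel_size=1, bias=False)
            self.bn3 = nn.BatchNorm2d(out_planes)
            self.shortcut = nn.Identity()
            if stride != 1 or in_planes != out_planes:
                self.shortcut = nn.Sequential(
                    nn.Conv2d(in_planes, out_planes, kernel_size=1,
                              stride=stride, bias=False),
                    nn.BatchNorm2d(out_planes))

        def forward(self, x):
            out = torch.relu(self.bn1(self.conv1(x)))
            out = torch.relu(self.bn2(self.conv2(out)))
            out = self.bn3(self.conv3(out))
            return torch.relu(out + self.shortcut(x))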

[Table: COCO results]

4.7 Computational efficiency

[Figures: training-time benchmarks of thin vs. wide networks]

5 Conclusion (my own takeaways)

  • The diminishing feature reuse problem is mentioned several times and is one of the important problems the authors want to solve; my feeling is that it refers to the fact that, with identity connections, some residual blocks may end up just coasting along and contributing little!

  • wide residual networks are several times faster to train.

  • The authors repeatedly refer to regularization effects, e.g. BN, dropout, width, and depth can all add a regularization effect. The first two are easy to understand, but I don't really get the latter two. Does having a regularization effect mean better generalization performance? Better performance overall? Is that the right way to understand it?

  • This paper only starts from the width dimension; "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" later scales depth, width, and input resolution together.
