[Deep Learning] Understand the architecture of residual network ResNet and ResNeXt

1. Description

        Understanding and implementing the architecture of ResNet and ResNeXt for state-of-the-art image classification: from Microsoft to Facebook [Part 1]. In this two-part blog post we explore residual networks. More specifically, we will discuss three papers from Microsoft Research and Facebook AI Research that introduce the state-of-the-art image classification architectures ResNet and ResNeXt, and try to implement them in PyTorch.

2. Residual Network History Review

        This is Part 1 of a two-part series of blog posts exploring residual networks.

We will review the following three papers that introduce and improve residual networks:

  • Deep Residual Learning for Image Recognition (the original ResNet paper)
  • Identity Mappings in Deep Residual Networks
  • Aggregated Residual Transformations for Deep Neural Networks (the ResNeXt paper)

2.1 Has ResNet succeeded?

  • Won 1st place in the ILSVRC 2015 classification competition with a top-5 error rate of 3.57% (ensemble model).
  • Won first place in ImageNet detection, ImageNet localization, COCO detection and COCO segmentation in the ILSVRC & COCO 2015 competitions.
  • Replacing VGG-16 with ResNet-101 in Faster R-CNN gave a relative improvement of 28%.
  • Efficiently trained networks with 100 and 1000 layers.

2.2 What problem does ResNet solve?

        When deep networks start to converge, a degradation problem is exposed: as the depth of the network increases, the accuracy becomes saturated and then drops rapidly.

3. What can the extra layers do: redundant or beneficial?

        Take a shallow network and build a deeper one by adding more layers to it. How can those extra layers behave?

3.1 Worst case:

        The initial layers of the deeper model can be copied from the shallow network, and the remaining (added) layers can simply act as identity functions (input equals output).

Both the shallow network and its deeper variant then give the same output.

        

3.2 Beneficial scenarios 

        In deeper networks, additional layers approximate the mapping better than their shallower counterparts and significantly reduce error.

3.3 Experiment 

         In the worst case, the shallow network and its deep variant should give the same accuracy. In the beneficial scenario, the deeper model should give better accuracy than its shallower counterpart. But experiments with current solvers show that the deeper plain models actually perform worse: simply making the network deeper degrades its performance. The paper addresses this degradation problem with a deep residual learning framework.

4. How ResNet solves the "deeper network, lower accuracy" problem

        In a conventional network, each stack of layers directly learns the mapping from its input x to its output. Let's make a change here: let H(x) be the desired underlying (nonlinear) mapping, define the residual F(x) = H(x) - x, and rewrite it as H(x) = F(x) + x, where F(x) is realized by the stacked nonlinear layers and x passes through the identity function (input = output).

The authors' hypothesis is that it is easier to optimize the residual mapping function F(x) than the original, unreferenced mapping H(x).

4.1 Intuition behind residual blocks 

        If the identity mapping is optimal, it is easy to push the residual to zero (F(x) = 0), which is much easier than fitting the identity mapping itself with a stack of nonlinear layers. In plain language, a stack of nonlinear CNN layers finds it easier to learn F(x) = 0 than F(x) = x. This function F(x) is what the authors call the residual function.
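To make the reformulation concrete, here is a minimal PyTorch sketch of a residual block computing H(x) = F(x) + x. The layer sizes and the BN/ReLU placement are illustrative assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal sketch of H(x) = F(x) + x: F is a small stack of nonlinear
    layers, and the input x is added back through the identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(                       # F(x): stacked nonlinear layers
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)               # H(x) = F(x) + x, then ReLU
```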

  Identity Mapping in Residual Blocks

      

        The authors performed several tests to verify their hypothesis. Now let's take a look at them one by one.

4.2 Test cases:

        Take a plain network (an 18-layer network of VGG type, Network 1) and its deeper variant (34 layers, Network 2), and add residual connections to Network 2 to obtain Network 3 (34 layers with residual connections).

        Network design:

  1. Mainly 3×3 convolutional filters.
  2. Downsampling is performed by convolutional layers with a stride of 2.
  3. A global average pooling layer and a 1000-way fully connected layer, followed by Softmax (a minimal sketch of this head follows the list).
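Here is a minimal PyTorch sketch of that classification head; the input width of 512 is an illustrative assumption (the final feature width of the smaller ResNets), and the Softmax is usually folded into the loss function during training.

```python
import torch.nn as nn

# Global average pooling + 1000-way fully connected classifier head
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # global average pooling -> (N, C, 1, 1)
    nn.Flatten(),              # -> (N, C)
    nn.Linear(512, 1000),      # 1000-way classifier (512 = assumed final width)
)
```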

        Ordinary VGG and VGG with residual blocks

        There are two types of residual connections:

 residual block

       

  1. When the input and output have the same dimension, the identity shortcut can be used directly: y = F(x, {W_{i}}) + x.

        Residual function with same input and output dimensions

        2. When the dimensions change, there are two options: (A) the shortcut still performs the identity mapping, with extra zero entries padded for the increased dimension; (B) a projection shortcut is used to match dimensions, implemented with a 1×1 convolution: y = F(x, {W_{i}}) + W_{s}x.

        Residual block function when input and output dimensions are different.

        In the first case no additional parameters are added; in the second, the extra parameters come from W_{s}. A sketch of the projection shortcut is shown below.
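The following is a rough PyTorch sketch of option (B): when the residual branch halves the spatial size and changes the channel count, a strided 1×1 convolution plays the role of W_{s}. The specific channel numbers are assumptions for illustration.

```python
import torch.nn as nn

class DownsampleBlock(nn.Module):
    """Residual block whose shortcut is a projection (1x1 conv, W_s) because
    the residual branch changes both the spatial size and the channel count."""
    def __init__(self, in_ch=64, out_ch=128, stride=2):
        super().__init__()
        self.f = nn.Sequential(                                   # F(x)
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.proj = nn.Sequential(                                # W_s x
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + self.proj(x))                # F(x) + W_s x
```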

        Results:

        Although the 18-layer plain network is only a subspace of the 34-layer plain network, it still performs better: the deeper plain network degrades. With residual connections the picture flips, and the 34-layer ResNet significantly outperforms the 18-layer one.

Comparison of the ResNet models with their plain-network counterparts

4.3 Further research

        The following networks were studied:

ResNet architecture

        Each ResNet block is either 2 layers deep (for small networks like ResNet 18, 34) or 3 layers deep (ResNet 50, 101, 152).

ResNet 2-layer and 3-layer blocks

The PyTorch implementation can be seen here:

        The Bottleneck class implements the 3-layer blocks and the BasicBlock class implements the 2-layer blocks. It also implements all the ResNet architectures, with weights pretrained on ImageNet.
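For example, a pretrained torchvision ResNet can be loaded and run as below; note that the argument for pretrained weights depends on the torchvision version (newer releases use weights=, older ones use pretrained=True).

```python
import torch
from torchvision import models

# torchvision >= 0.13 uses the `weights=` argument; older versions use `pretrained=True`
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()

x = torch.randn(1, 3, 224, 224)    # dummy 224x224 RGB image
with torch.no_grad():
    logits = model(x)              # shape (1, 1000): ImageNet class scores
print(logits.shape)
```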

4.4 Observation:

  1. ResNet networks converge faster than their vanilla counterparts.
  2. Identity vs. projection shortcuts: the incremental gain from using projection shortcuts (formula 2) in all layers is very small. Therefore ResNet blocks use identity shortcuts everywhere, and projection shortcuts only where the dimensions change.
  3. ResNet-34 achieves a top-5 validation error of 5.71%, better than BN-Inception and VGG. ResNet-152 achieves a top-5 validation error of 4.49%. An ensemble of 6 models of different depths achieves a top-5 validation error of 3.57%, winning first place in ILSVRC-2015.
ResNet ImageNet Results-2015

5. Implementation using Pytorch

        I have detailed implementations of almost all image classification networks here. A quick read will let you implement and train a ResNet in a fraction of the time. PyTorch already has its own implementation; mine simply takes into account the different situations that come up when doing transfer learning.

        I wrote a detailed blog post about transfer learning. Although the code there is implemented in Keras, the ideas are more abstract and may be useful to you when prototyping.

6. About the series 

        This is Part 2 of the two-part series of blog posts exploring residual networks.

  • Understanding and Implementing the ResNet Architecture [ Part 1 ]
  • Understanding and Implementing the ResNeXt Architecture [Part 2]

        For someone who already understood Part 1, this will be a fairly easy read. I'll follow the same approach as part 1.

  1. A brief discussion of Identity Mappings in Deep Residual Networks (link to paper) [an important case study]
  2. ResNeXt architecture review (link to paper)
  3. Experimental Research on ResNeXt
  4. Implementation of ResNeXt in PyTorch

7. Identity Mappings in Deep Residual Networks: a brief review

        This paper develops a theoretical understanding of why the vanishing gradient problem does not occur in residual networks, and of the role of the skip connection, by replacing the identity mapping h(x_{l}) = x_{l} with different functions.

        Residual network equation:

        y_{l} = h(x_{l}) + F(x_{l}, W_{l}),        x_{l+1} = f(y_{l})

        Here F is the stacked nonlinear layers, h is the shortcut function (the identity, h(x_{l}) = x_{l}, in the original ResNet), and f is the ReLU activation applied after the addition.

        They found that when both f(y_{l}) and h(x_{l}) are identity mappings, the signal can propagate directly from any unit to any other unit, both forward and backward. Furthermore, the lowest error rates are achieved when both are identity mappings. Let's look at each case individually.
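To see why, unroll the recursion when both h and f are identities (following the paper's derivation; E denotes the loss):

        x_{L} = x_{l} + Σ_{i=l}^{L-1} F(x_{i}, W_{i})

        ∂E/∂x_{l} = ∂E/∂x_{L} · ( 1 + ∂/∂x_{l} Σ_{i=l}^{L-1} F(x_{i}, W_{i}) )

The constant "1" term means the gradient at a deep unit x_{L} reaches every earlier unit x_{l} unattenuated, no matter how small the residual-branch gradients are, which is why the vanishing gradient problem does not appear.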

7.1. Finding the optimal h(x_{l}) function

   Best features for skip connections in residual networks

     

 Backpropagation for the ResNet module

       
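For the analysis below, the shortcut is replaced by a scaled version, h(x_{l}) = λ_{l}·x_{l} (with f still the identity). Following the paper, the signal and gradient then become

        x_{L} = ( Π_{i=l}^{L-1} λ_{i} ) x_{l} + Σ_{i=l}^{L-1} F̂(x_{i}, W_{i})

        ∂E/∂x_{l} = ∂E/∂x_{L} · ( Π_{i=l}^{L-1} λ_{i} + ∂/∂x_{l} Σ_{i=l}^{L-1} F̂(x_{i}, W_{i}) )

where F̂ absorbs the λ factors into the residual branch. How the product of the λ_{i} terms behaves with depth is exactly what separates the four cases below.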

        Case-1, λ = 0: this reduces to a plain network. Since the weights w_{2}, w_{1}, w_{0} all lie between -1 and 1, the gradient shrinks as the depth of the network increases. This clearly shows the vanishing gradient problem.

        Case-2, λ > 1: in this case the backpropagated value grows steadily, and the gradient explodes.

        Case-3, λ < 1: for shallow networks this may not be a problem, but for very deep networks the product of the λ terms is still well below 1 in most cases, so it runs into the same vanishing gradient problem as case 1.

        Case-4, λ = 1: in this case the shortcut contributes a constant 1 to the gradient, which avoids both the products of large numbers from case 2 and of small numbers from cases 1 and 3, and lets the gradient flow through unharmed.

        The paper also experiments with modifying the skip connection (scaling it, gating it, replacing it with a 1×1 convolution, adding dropout), and finds that the performance of the network degrades. Below are the experimental variants they tried, of which only the first one (a), the identity shortcut, gives the smallest error rate.

Different skip connections in residual networks.

Deep Residual Network Results

7.2. Finding the optimal f(y_{l}) function

different residual blocks

        

        The 5 architectures above were studied on ResNet-110 and ResNet-164, and the following results were obtained. In both networks, the full pre-activation variant outperforms all the others. It is therefore better to keep the after-addition function f as a simple identity (plain addition) rather than a ReLU, and to place the BN and ReLU layers before the convolutions inside the residual branch; this helps the network optimize faster and regularizes it better (lower test error), reducing overfitting.
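A rough PyTorch sketch of the winning full pre-activation block (BN → ReLU → conv inside the residual branch, nothing applied after the addition); the channel count is an illustrative assumption.

```python
import torch.nn as nn

class PreActBlock(nn.Module):
    """Sketch of a full pre-activation residual block: BN and ReLU come before
    each convolution, and nothing is applied after the addition (f = identity)."""
    def __init__(self, channels=64):
        super().__init__()
        self.f = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return x + self.f(x)    # pure addition; no post-addition ReLU
```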

Residual Network Error Metrics

                                                        

8. Conclusion

        In summary, keeping the shortcut an identity connection (variant (a) above) and keeping the after-addition activation an identity is crucial for smooth information propagation. The ablation experiments are consistent with the derivation discussed above.

8.1 ResNeXt Architecture Review

        ResNeXt finished 2nd in the ILSVRC 2016 classification task, and also outperforms its ResNet counterparts on COCO detection and on the ImageNet-5K set.

        It is a fairly simple paper that introduces a new term, "cardinality". The paper explains the term and applies it to ResNet-style networks, backed by extensive ablation studies.

        The paper spends some effort describing the complexity of Inception networks and why the ResNeXt architecture is simpler. I won't do that here, since it requires the reader to understand Inception networks; I will only talk about the architecture.

ResNet (left) and ResNeXt (right) architectures.

  • The diagram above contrasts a plain ResNet block with a ResNeXt block.
  • ResNeXt follows a split-transform-aggregate strategy.
  • The number of paths inside a ResNeXt block is called the cardinality. In the figure above, C = 32.
  • All paths share the same topology.
  • Rather than going deeper or wider, it is a high cardinality that helps reduce validation error.
  • ResNeXt tries to embed more subspaces than its ResNet counterpart.
  • The two architectures have different widths. Layer 1 in ResNet has a single convolutional path of width 64, while layer 1 in ResNeXt has 32 convolutional paths of width 4 (32×4d). Despite the larger total width, both blocks have roughly the same number of parameters (~70k): ResNet 256·64 + 3·3·64·64 + 64·256, ResNeXt C·(256·d + 3·3·d·d + d·256) with C = 32 and d = 4.

Following are the architectural differences between ResNet and ResNeXt

ResNet vs ResNeXt structure.

        Thus, resnext_32*4d denotes a network whose bottleneck blocks [one block in the diagram above] have a cardinality of 32 and a bottleneck width of d = 4. Later we will look at the resnext_32*4d and resnext_64*4d implementations in PyTorch; a grouped-convolution sketch of such a block is given below.
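The following is a rough PyTorch sketch of one ResNeXt bottleneck block using a grouped 3×3 convolution, which is the equivalent form of the 32 parallel paths (C = 32, d = 4 as in the 32×4d configuration above; channel counts are illustrative).

```python
import torch.nn as nn

class ResNeXtBottleneck(nn.Module):
    """ResNeXt bottleneck block: the split-transform-aggregate paths are
    implemented as a single grouped convolution with groups = cardinality."""
    def __init__(self, in_ch=256, cardinality=32, width=4):
        super().__init__()
        inner = cardinality * width                       # 32 * 4 = 128
        self.f = nn.Sequential(
            nn.Conv2d(in_ch, inner, 1, bias=False),       # reduce
            nn.BatchNorm2d(inner), nn.ReLU(inplace=True),
            nn.Conv2d(inner, inner, 3, padding=1,
                      groups=cardinality, bias=False),    # 32 parallel width-4 paths
            nn.BatchNorm2d(inner), nn.ReLU(inplace=True),
            nn.Conv2d(inner, in_ch, 1, bias=False),       # expand back (aggregate)
            nn.BatchNorm2d(in_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)                   # identity shortcut
```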

8.2 Research:

        1. Cardinality vs. width: as C is increased from 1 to 32 (keeping the overall complexity fixed), the top-1 error rate clearly decreases. So increasing the cardinality while reducing the width improves the performance of the model.

Cardinality and Width

                                                                       

        2. Increasing cardinality vs. going deeper/wider: basically three cases were studied: 1) going deeper, increasing the number of layers from 101 to 200; 2) going wider, increasing the bottleneck width; 3) increasing the cardinality by doubling C.

They observed that increasing C provided even better performance improvements. Below are the results.

Cardinality vs. Deeper/Broader Networks

9. Conclusion 

        An ensemble of different ResNeXt architectures achieved a top-5 error rate of 3.03%, ranking second in the ILSVRC 2016 classification task.

Origin blog.csdn.net/gongdiwudu/article/details/131771594