1. Description
Understanding and implementing the architecture of ResNet and ResNeXt for state-of-the-art image classification: From Microsoft to Facebook [Part 1]. In this two-part blog post, we explore residual networks. More specifically, we will discuss three papers published by Microsoft Research and Facebook AI Research, walk through the state-of-the-art ResNet and ResNeXt image classification architectures, and implement them in PyTorch.
2. Residual Network History Review
This is part 1 of a two-part blog series exploring residual networks.
We will review the following three papers that introduce and improve residual networks:
- [Part 1] Deep Residual Learning for Image Recognition (link to the Microsoft Research paper)
- [Part 2] Identity Mappings in Deep Residual Networks (link to the Microsoft Research paper)
- [Part 2] Aggregated Residual Transformations for Deep Neural Networks (link to the Facebook AI Research paper)
2.1 How successful is ResNet?
- Won 1st place in the ILSVRC 2015 classification competition with a top-5 error rate of 3.57% (ensemble model)
- Won 1st place in ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation in the ILSVRC & COCO 2015 competitions.
- Replacing VGG-16 layers in Faster R-CNN with ResNet-101 yielded a relative improvement of 28%.
- Efficiently trained networks with 100 and 1000 layers.
2.2 What problem does ResNet solve?
When deep networks start to converge, a degradation problem is exposed: as the depth of the network increases, the accuracy becomes saturated and then drops rapidly.
3. What happens when we add more layers?
Let's take a shallow network and make it deeper by adding more layers to it. How should this affect performance?
3.1 Worst case:
The early layers of the deeper model can be replaced by the shallow network, and the remaining layers can simply act as identity functions (input equals output), so the deeper model should perform at least as well as the shallow one.
3.2 Beneficial scenarios
In deeper networks, additional layers approximate the mapping better than their shallower counterparts and significantly reduce error.
3.3 Experiment
In the worst case, the shallow network and its deeper variant should give the same accuracy. In the beneficial scenario, the deeper model should give better accuracy than its shallower counterpart. But experiments with our present solvers reveal that deeper models perform poorly: simply using a deeper network degrades the model's performance. The paper attempts to solve this problem using a deep residual learning framework.
4. How to solve the "deeper networks, lower accuracy" problem
In a conventional neural network, a layer maps its input x directly to an output y = H(x), where H is a nonlinear function. Let's make a change here: instead of learning H(x) directly, define the residual function F(x) = H(x) - x, so the original mapping can be recast as H(x) = F(x) + x, where F(x) represents the stacked nonlinear layers and x the identity function (input = output).
The authors' hypothesis is that it is easier to optimize the residual mapping function F(x) than the original, unreferenced mapping H(x).
4.1 Intuition behind residual blocks
If the identity mapping is optimal, we can easily push the residual to zero (F(x) = 0) instead of fitting an identity mapping (x, input = output) with a stack of nonlinear layers. In plain language, it is easier for a stack of nonlinear CNN layers to learn a solution like F(x) = 0 than to learn F(x) = x (think about it). This function F(x) is what the authors call the residual function.
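The intuition above can be sketched as a minimal residual block in PyTorch. This is an illustrative sketch (the class name and layer sizes are my own choices, not the paper's code); the key line is the addition `out + residual`, which implements H(x) = F(x) + x:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = F(x) + x, with F = two 3*3 conv layers."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x                               # identity shortcut
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))            # F(x)
        return self.relu(out + residual)           # H(x) = F(x) + x

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 8, 8))
print(y.shape)  # torch.Size([1, 64, 8, 8]) -- same shape as the input
```

If the conv weights were pushed to zero, F(x) = 0 and the block simply passes its input through, which is exactly the "easy" solution the argument relies on.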
The authors performed several tests to verify their hypothesis. Now let's take a look at them one by one.
4.2 Test cases:
Take a plain network (an 18-layer VGG-style network, Network 1) and its deeper variant (34 layers, Network 2), and add residual connections to Network 2 (34 layers with residual connections, Network 3).
Network design:
- Mainly use 3*3 filters.
- Downsampling is performed using a CNN layer with a stride of 2.
- A global average pooling layer and a 1000-way fully connected layer, followed by Softmax.
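The design rules above can be sketched in a few lines of PyTorch (layer names and channel counts here are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

# Sketch of the design rules: stride-2 conv for downsampling,
# then global average pooling and a 1000-way classifier.
down = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)  # halves spatial size
gap = nn.AdaptiveAvgPool2d(1)                                  # global average pooling
fc = nn.Linear(128, 1000)                                      # 1000-way fully connected

x = torch.randn(1, 64, 56, 56)
h = down(x)                          # -> (1, 128, 28, 28): downsampled by stride 2
logits = fc(gap(h).flatten(1))       # -> (1, 1000), fed to Softmax for classification
print(h.shape, logits.shape)
```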
Ordinary VGG and VGG with residual blocks
There are two types of residual connections:
1. When the input and output have the same dimensions, the identity shortcut (x) can be used directly.
Residual function with same input and output dimensions
2. When the dimensions change: (A) the shortcut still performs identity mapping, with extra zero entries padded for the increased dimension; or (B) a projection shortcut is used to match dimensions (done by a 1*1 conv), giving y = F(x, {W_i}) + W_s x.
Residual block function when input and output dimensions are different.
The first option adds no extra parameters; the second adds parameters in the form of W_s.
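The two dimension-matching options can be sketched in PyTorch as follows (a sketch under my own choice of shapes: 64 input channels going to 128, spatial size halved):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 56, 56)

# Option A: identity shortcut with zero-padding -- subsample spatially,
# then pad the channel dimension from 64 to 128. No new parameters.
x_down = x[:, :, ::2, ::2]                       # stride-2 subsample: (1, 64, 28, 28)
shortcut_a = F.pad(x_down, (0, 0, 0, 0, 0, 64))  # pad channels 64 -> 128 with zeros

# Option B: projection shortcut W_s -- a 1*1 conv with stride 2. Adds parameters.
proj = nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False)
shortcut_b = proj(x)

print(shortcut_a.shape, shortcut_b.shape)  # both are (1, 128, 28, 28)
```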
Results:
Although the plain 18-layer network is only a subspace of the plain 34-layer network, it still performs better. With residual connections, the deeper network outperforms by a significant margin.
4.3 Further research
The following networks were studied:
Each ResNet block is either 2 layers deep (for small networks like ResNet 18, 34) or 3 layers deep (ResNet 50, 101, 152).
The PyTorch implementation can be seen here:
The Bottleneck class implements the 3-layer blocks and the BasicBlock class implements the 2-layer blocks. It also implements all ResNet architectures, with weights pretrained on ImageNet.
4.4 Observations:
- ResNet networks converge faster than their vanilla counterparts.
- Identity vs. projection shortcuts: the incremental gain from using projection shortcuts (Equation 2) in all layers is very small. Therefore, all ResNet blocks use identity shortcuts, and projection shortcuts are used only when the dimensions change.
- ResNet-34 achieved a top-5 validation error of 5.71%, better than BN-Inception and VGG. ResNet-152 achieves a top-5 validation error of 4.49%. An ensemble of 6 models of different depths achieves a top-5 validation error of 3.57%, winning 1st place in ILSVRC-2015.
5. Implementation in PyTorch
I have detailed implementations of almost all image classification networks here. A quick read will let you implement and train a ResNet in no time. PyTorch already has its own implementation; mine simply considers the different cases that arise when doing transfer learning.
I wrote a detailed blog post about transfer learning. While the code there is implemented in Keras, the ideas are more abstract and may be useful to you in prototyping.
6. About the series
This is part 2 of the two-part blog series exploring residual networks.
- Understanding and Implementing the ResNet Architecture [ Part 1 ]
- Understanding and Implementing the ResNeXt Architecture [Part 2]
For anyone who has already read Part 1, this will be a fairly easy read. I'll follow the same approach as in Part 1.
- A brief discussion of Identity Mappings in Deep Residual Networks (link to paper) [important case study]
- ResNeXt Architecture Review ( link to paper )
- Experimental Research on ResNeXt
- Implementation of ResNeXt in PyTorch
6. Brief discussion of identity mappings in deep residual networks
This paper gives a theoretical understanding of why the vanishing gradient problem does not occur in residual networks, and of the role of skip connections, by replacing the identity mapping (x) with different functions.
The residual network equations are y_l = h(x_l) + F(x_l, W_l) and x_{l+1} = f(y_l), where F is the stacked nonlinear layers, h(x_l) = x_l is the identity shortcut, and f is the ReLU activation function.
They found that when both f(y_l) and h(x_l) are identity mappings, the signal can propagate directly from any unit to any other, both forward and backward. Furthermore, both achieve the lowest error rate when they are identity mappings. Let's look at each case individually.
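This direct propagation follows from unrolling the recursion when both h and f are identities (so x_{l+1} = x_l + F(x_l, W_l)), as derived in the paper:

```latex
x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, W_i)
\qquad\Rightarrow\qquad
\frac{\partial \mathcal{E}}{\partial x_l}
  = \frac{\partial \mathcal{E}}{\partial x_L}
    \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \mathcal{F}(x_i, W_i) \right)
```

The additive 1 means the gradient at any deep layer x_L reaches every shallower layer x_l directly, so it cannot vanish no matter how small the weights inside F are.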
7.1. Finding the optimal h(x_{l}) function
Case 1, λ = 0: This reduces to a plain network. Since the weights typically lie between -1 and 1, the gradient shrinks as the network depth increases. This clearly exhibits the vanishing gradient problem.
Case 2, λ > 1: In this case, the backpropagated value grows steadily and the gradient explodes.
Case 3, λ < 1: For shallow networks this may not be a problem, but for very deep networks the product of the weights and λ is still mostly less than 1, so it hits the same problem as Case 1.
Case 4, λ = 1: In this case, the signal on the shortcut is multiplied by exactly 1, which avoids multiplying long chains of large numbers (Case 2) or small numbers (Cases 1 and 3), and lets gradients propagate well.
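A toy calculation makes the four cases concrete. The backward signal that travels purely along the shortcuts is scaled by the product of the λ values, one per layer (weights inside the residual branches are ignored here for clarity):

```python
# Toy illustration: scale of the shortcut-path gradient after `depth` layers,
# when each shortcut multiplies the signal by a constant lambda.
def shortcut_gradient_scale(lam, depth):
    scale = 1.0
    for _ in range(depth):
        scale *= lam
    return scale

for lam in (0.0, 0.9, 1.0, 1.1):
    print(lam, shortcut_gradient_scale(lam, depth=100))
# lambda = 0.9 collapses toward zero (vanishing), 1.1 blows up (exploding),
# and only lambda = 1 keeps the signal intact at any depth.
```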
The paper also examined adding other operations (scaling, gating, 1*1 convolutions, dropout) to the skip connections and found that the network's performance degrades. Below are the 5 experimental networks they tried, of which only the first one (a), the plain identity shortcut, gave the lowest error rate.
7.2. Finding the optimal f(y_{l}) function
The above 5 architectures were studied on ResNet-110 and ResNet-164, with the following results. In both networks, full pre-activation outperforms all the other variants. Therefore, it is better to use simple addition with identity mapping rather than applying ReLU as the function f. Moving the ReLU and BN layers inside the residual branch (pre-activation) helps the network optimize quickly and regularize better (lower test error), thus reducing overfitting.
8. Conclusion
Therefore, having an identity shortcut connection (Case 1) and keeping the after-addition path clean (i.e., moving the activation before the weight layers) is crucial for smooth information propagation. The ablation experiments are consistent with the derivation discussed above.
8.1 ResNeXt Architecture Review
ResNeXt achieved 2nd place in the ILSVRC 2016 classification task and also outperformed its ResNet counterparts on COCO detection and the ImageNet-5k set.
It's a very simple paper that introduces a new term, "cardinality". The paper simply explains the term and applies it to ResNet networks with various ablation studies.
The paper makes several attempts to describe the complexity of Inception networks and why the ResNeXt architecture is simple. I won't do that here, because it requires the reader to understand Inception networks; I'll only cover the architecture.
- The diagram above contrasts a simple ResNet block with a ResNeXt block.
- It follows a split-transform-aggregate strategy.
- The number of paths inside a ResNeXt block is defined as the cardinality. In the figure above, C = 32.
- All paths contain the same topology.
- Rather than going deeper or wider, increasing the cardinality helps reduce validation error.
- ResNeXt tries to embed more subspaces than its ResNet counterpart.
- The two architectures have different widths. Layer-1 in ResNet has one convolutional path of width 64, while layer-1 in ResNeXt has 32 different convolutional paths of width 4 (32*4 width). Despite the larger overall width of ResNeXt, both architectures have roughly the same number of parameters (~70k) (ResNet: 256*64 + 3*3*64*64 + 64*256) (ResNeXt: C*(256*d + 3*3*d*d + d*256), with C = 32 and d = 4).
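The parameter-count equality above is easy to verify by plugging in the numbers:

```python
# One ResNet bottleneck block (256 -> 64 -> 64 -> 256):
# 1*1 reduce, 3*3 conv, 1*1 expand.
resnet_block = 256*64 + 3*3*64*64 + 64*256

# One ResNeXt block: C = 32 paths, each of bottleneck width d = 4.
C, d = 32, 4
resnext_block = C * (256*d + 3*3*d*d + d*256)

print(resnet_block, resnext_block)  # 69632 70144 -- both roughly 70k parameters
```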
Following are the architectural differences between ResNet and ResNeXt
Thus, resnext_32x4d denotes a network whose bottleneck blocks [one block in the diagram above] have cardinality C = 32 and bottleneck width d = 4. Later we will look at the resnext_32x4d and resnext_64x4d implementations in PyTorch.
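In practice, the C parallel paths of a ResNeXt block are implemented not as a Python loop over 32 branches but as a single grouped convolution, which PyTorch supports via the `groups` argument of `nn.Conv2d`. A minimal sketch (channel sizes are mine, chosen to match the C = 32, d = 4 example):

```python
import torch.nn as nn

# groups=32 splits the 128 input channels into 32 independent groups of width 4,
# exactly the split-transform-aggregate structure of a ResNeXt block.
grouped = nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=32, bias=False)
plain   = nn.Conv2d(128, 128, kernel_size=3, padding=1, bias=False)

n_grouped = sum(p.numel() for p in grouped.parameters())
n_plain   = sum(p.numel() for p in plain.parameters())
print(n_grouped, n_plain)  # 4608 147456 -- the grouped conv has 32x fewer parameters
```

This is why raising cardinality is cheap: for the same in/out channels, a grouped 3*3 conv costs 1/C the parameters of a dense one, and the savings are spent on more paths.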
8.2 Research:
1. Cardinality vs. width: As C increases from 1 to 32, we see a clear decrease in the top-1 error rate. Therefore, increasing C while reducing the width (keeping complexity fixed) improves the model's performance.
2. Increasing cardinality vs. going deeper/wider: basically 3 cases were studied. 1) Going deeper by increasing the number of layers from 101 to 200. 2) Going wider by increasing the bottleneck width. 3) Increasing the cardinality by doubling C.
They observed that increasing C provided even better performance improvements. Below are the results.
9. Conclusion
An ensemble of different ResNeXt architectures gives a top-5 error rate of 3.03%, ranking 2nd in the ILSVRC 2016 competition.