ResNet
References
ResNet paper:
https://arxiv.org/abs/1512.03385
Main reference video for this article: https://www.bilibili.com/video/BV1T7411T7wa
https://www.bilibili.com/video/BV14E411H7Uw
Structure overview
The network structure diagram of ResNet is shown in the figure below, which lists the configurations at different depths.
It is easy to see that the overall layout is the same whether the network has 18, 34, 50, 101, or 152 layers.
At the top there is a 7x7 convolutional layer, followed by a 3x3 max-pooling downsampling layer.
Then come the residual stages labeled conv2_x, conv3_x, conv4_x, and conv5_x in the figure.
Finally, there is an average-pooling downsampling layer, a fully connected layer, and a softmax output.
conv1 and pooling layer
Let's look at the first two layers first.
ResNet is trained on the ImageNet dataset, so the default input is a 224x224 RGB image with three channels.
From the table we can see that the input first passes through a "7x7, 64, stride 2" layer,
that is, a convolutional layer with a 7x7 kernel, 64 output channels (i.e., 64 convolution kernels), and stride=2.
The padding is not given, so we have to work it out ourselves; the table lists the output of this layer as 112x112.
A bit of background:
Suppose the input image is W x W, the kernel size is F x F, the stride is S, and the padding is P (the number of padded pixels). Then the output size is W_out = (W - F + 2P)/S + 1.
Note the division in this formula: for convolution, when the result is not an integer we generally round down.
You can refer to the official pytorch documentation: https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html#torch.nn.Conv2d
For pooling, however, rounding up is also an option.
See the official PyTorch documentation: https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html#torch.nn.MaxPool2d
There is a ceil_mode parameter: the default (False) rounds down, and setting it to True rounds up.
"ceil_mode – when True, will use ceil instead of floor to compute the output shape"
So pooling sometimes rounds up (max pooling and average pooling may even use different rounding rules).
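The two rounding rules above can be captured in a small helper. This is just an illustrative sketch; the function name `conv_out_size` is my own, not a PyTorch API:

```python
import math

def conv_out_size(w, f, s, p, ceil_mode=False):
    """Output size of a conv/pooling layer: (W - F + 2P)/S + 1,
    rounded down by default, or up when ceil_mode=True (as in MaxPool2d)."""
    q = (w - f + 2 * p) / s
    q = math.ceil(q) if ceil_mode else math.floor(q)
    return q + 1

print(conv_out_size(224, 7, 2, 3))                  # ResNet conv1: 224 -> 112
print(conv_out_size(112, 3, 2, 1))                  # max pool:     112 -> 56
print(conv_out_size(112, 3, 2, 0, ceil_mode=True))  # same layer, rounding up: 56
```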
That is to say, 112 = (224 - 7 + 2P)/2 + 1.
Simplifying gives 111 = (217 + 2P)/2 = 108.5 + P, i.e. P = 2.5, which is not an integer.
Taking the rounding-down into account, P = 3 works: floor((224 - 7 + 6)/2) + 1 = floor(111.5) + 1 = 112. So the padding is 3.
So the first layer applied to the input image is in_channel=3, out_channel=64, kernel_size=7, stride=2, padding=3,
with no bias. After this layer we get a 112x112 feature map with 64 channels.
Then comes a 3x3 max-pooling downsampling layer with stride=2.
Using the same formula, W_out = (W_in - F + 2P)/S + 1,
and rounding down as before: 56 = (112 - 3 + 2P)/2 + 1 gives P = 1.
So the second layer, the max pool, has kernel_size=3, stride=2, padding=1 (pooling does not change the channel count, so both input and output have 64 channels).
After the pooling layer we get a 56x56, 64-channel output, which then feeds into the series of residual structures in conv2_x.
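In PyTorch, this stem (conv1 plus the max pool) can be sketched as follows. This is an illustration of the shape arithmetic above, not the torchvision source:

```python
import torch
import torch.nn as nn

# ResNet stem: 7x7/2 conv (64 channels, no bias, padding 3)
# followed by a 3x3/2 max pool with padding 1
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

x = torch.randn(1, 3, 224, 224)  # one 224x224 RGB image
out = stem(x)
print(out.shape)                 # -> torch.Size([1, 64, 56, 56])
```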
residual structure
After the first two layers we have a 56x56, 64-channel output, which is followed by conv2_x.
This is where the residual blocks begin. There are roughly two kinds of residual block: one made of two convolutional layers and one made of three, as drawn in the red boxes in the figure.
For example, in ResNet34 the pooling layer is followed by two 3x3, 64-channel convolutional layers, which together form one residual block.
In ResNet50 the pooling layer is followed by a 1x1, 64-channel convolutional layer + a 3x3, 64-channel convolutional layer + a 1x1, 256-channel convolutional layer, and these three form one residual block.
The "x3" written after it means that three such residual blocks are stacked in sequence.
The principle of the residual structure is shown in the figure below; there isn't much to say about it, so here we mainly discuss the implementation.
There are a few points to note:
Take ResNet34 as an example. According to the table, the size after pooling is 56x56x64, and after passing through conv2_x the output is still 56x56x64.
Since the input and output feature maps have the same size, stride=1 and padding=1.
This follows from the formula W_out = (W_in - F + 2P)/S + 1:
to keep the size unchanged we need S=1 (any larger stride would shrink it), and then solving for the padding gives P=1.
So after these two layers the feature map is still 56x56x64, and it can be added directly to the shortcut branch of the residual block.
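As a sketch (omitting the BN layers, which are discussed further below), such a solid-line ResNet34 block with stride=1, padding=1 could look like this; the class name is my own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Two 3x3 convs with stride=1, padding=1: the 56x56x64 input keeps
    its shape, so it can be added directly to the shortcut (BN omitted)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, stride=1, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, 3, stride=1, padding=1, bias=False)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = self.conv2(out)
        return F.relu(out + x)  # shapes match, so plain addition works

block = BasicBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # -> torch.Size([1, 64, 56, 56])
```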
But note what happens at conv3_x: its layers are 3x3 with 128 channels, while the output of conv2_x is 56x56 with 64 channels.
Inside conv3_x the channel count is increased, and the spatial size also changes, becoming 28x28.
Now there is a problem: the shortcut branch and the main branch can no longer be added, because their sizes and channel counts don't match.
The solution is the residual block with a dotted-line shortcut, as shown in the figure.
A 1x1 convolutional layer is inserted in the shortcut branch; 1x1 convolutions are mainly used to increase or reduce the channel count, and by setting their stride the spatial size can be changed as well.
Passing the shortcut through this convolutional layer yields a 28x28, 128-channel feature map that can be added directly to the main branch.
Working through the formula: the first 3x3 layer of the main branch uses stride=2, padding=1, and the 1x1 shortcut convolution uses stride=2, padding=0; both give a 28x28 output.
ResNet50 and ResNet101, the deeper ResNets, handle this in the same way.
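A sketch of the dotted-line variant (again with BN omitted, class name my own): the first 3x3 conv of the main branch downsamples with stride=2, and the 1x1 shortcut conv with stride=2, padding=0 matches its output shape:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownBlock(nn.Module):
    """First block of conv3_x in ResNet34: 56x56x64 -> 28x28x128.
    The 1x1 shortcut conv (stride 2, padding 0) matches the main branch."""
    def __init__(self, in_c, out_c):
        super().__init__()
        self.conv1 = nn.Conv2d(in_c, out_c, 3, stride=2, padding=1, bias=False)
        self.conv2 = nn.Conv2d(out_c, out_c, 3, stride=1, padding=1, bias=False)
        self.shortcut = nn.Conv2d(in_c, out_c, 1, stride=2, padding=0, bias=False)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = self.conv2(out)
        return F.relu(out + self.shortcut(x))  # both branches are 28x28x128

blk = DownBlock(64, 128)
print(blk(torch.randn(1, 64, 56, 56)).shape)  # -> torch.Size([1, 128, 28, 28])
```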
But note that this dotted-line structure exists only to handle the case where the shapes differ and cannot be added.
So when exactly is it needed?
For ResNet18 and ResNet34, only the first block of conv3_x, conv4_x, and conv5_x requires the dotted-line structure.
You can see from the table that the first block of conv3_x in ResNet34 outputs 28x28 with 128 channels while its input is 56x56 with 64 channels, so the dotted-line structure is needed; conv2_x does not need it, because its input and output shapes are the same.
For ResNet50, ResNet101, and ResNet152, the first block of conv2_x, conv3_x, conv4_x, and conv5_x all require the dotted-line structure.
This is because in ResNet50 the input to conv2_x is 56x56 with 64 channels but its output is 56x56 with 256 channels, so conv2_x also needs the dotted-line structure; here, though, the shortcut only has to adjust the number of channels.
Accordingly, the table shows that conv3_x in ResNet50 has four residual blocks, and only the first one runs into the shape mismatch and needs the dotted-line structure; the others do not.
Similarly, the first block of conv4_x and conv5_x needs it as well.
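The rule can be stated compactly: the dotted-line (projection) shortcut is needed exactly when the identity path cannot match the main path's output shape. A small sketch, with a function name of my own choosing:

```python
def needs_projection(in_c, out_c, stride):
    """A dotted-line 1x1 shortcut is needed whenever the spatial size
    changes (stride != 1) or the channel count changes."""
    return stride != 1 or in_c != out_c

# ResNet34 (basic blocks): conv2_x keeps 64 -> 64 at stride 1 -> identity works
assert not needs_projection(64, 64, 1)   # conv2_x, first block
assert needs_projection(64, 128, 2)      # conv3_x, first block: 56x56 -> 28x28

# ResNet50 (bottleneck blocks): conv2_x outputs 256 channels at stride 1,
# so even without downsampling the channels differ and a projection is needed
assert needs_projection(64, 256, 1)      # conv2_x, first block
```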
Also note another improvement:
Batch Normalization
The purpose of Batch Normalization is to make the feature maps across a batch follow a distribution with mean 0 and variance 1.
I won't go into the principle of the BN layer here; you can refer to this video and blog post:
https://www.bilibili.com/video/BV1T7411T7wa
https://blog.csdn.net/qq_37541097/article/details/104434557
What we need to know is this:
the BN layer should be placed between the convolutional layer (Conv) and the activation layer (e.g., ReLU), and the convolutional layer should then not use a bias.
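A minimal sketch of this ordering in PyTorch; since BN has its own learnable shift, the conv bias would be redundant, hence bias=False:

```python
import torch
import torch.nn as nn

# Recommended ordering: Conv (bias=False) -> BN -> ReLU
conv_bn_relu = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)

y = conv_bn_relu(torch.randn(1, 64, 56, 56))
print(y.shape)  # -> torch.Size([1, 64, 56, 56])
```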
So our actual residual block should look like the figure below.
Summary
At this point, the network structure of the entire ResNet is clear. The main points to note are:
- Remember the output-size formula for convolution: W_out = (W_in - F + 2P)/S + 1
- In ResNet18 and ResNet34, the first block of conv3_x, conv4_x, and conv5_x needs the dotted-line structure
- In ResNet50, ResNet101, and ResNet152, the first block of conv2_x, conv3_x, conv4_x, and conv5_x all need the dotted-line structure
- The BN layer goes between the convolutional layer (Conv) and ReLU, and the convolutional layer does not use a bias