[Deep Learning] ResNet Network Detailed Explanation

ResNet

References

ResNet paper: https://arxiv.org/abs/1512.03385
Main reference videos for this article:
https://www.bilibili.com/video/BV1T7411T7wa
https://www.bilibili.com/video/BV14E411H7Uw

Structure overview

The network structure diagram of ResNet is shown in the figure:

[Figure: ResNet architecture table for the 18-, 34-, 50-, 101-, and 152-layer variants]

This table shows the network structure of ResNet at different depths.

It can be seen that the structure is quite regular, whether the network has 18, 34, 50, 101, or 152 layers.

At the top is a 7x7 convolutional layer, followed by a 3x3 max-pooling downsampling layer.

Then come the residual structures, labeled conv2_x, conv3_x, conv4_x, and conv5_x in the figure.

Finally, there is an average-pooling downsampling layer and a fully connected layer with a softmax output.

conv1 and pooling layer

Let's look at the first two layers first.

[Figure: the conv1 and max-pooling rows of the architecture table]

First of all, ResNet uses the ImageNet dataset; the default input is a 224x224 RGB image with three channels.

According to the table, after the image is input, the first layer is 7x7, 64, stride 2.

That is, a convolutional layer with a 7x7 kernel, 64 output channels (that is, 64 convolution kernels), and stride=2.

The table does not mention the padding, so we need to work it out ourselves; the output of this layer is listed in the table as 112x112.

A bit of background knowledge:

Assuming the input image is W x W, the convolution kernel is F x F, the stride is S, and the padding is P (the number of filled pixels), then the output size is W_out = (W - F + 2P)/S + 1.

Note that this formula involves a division. For convolution, when the division does not come out even, we round down.
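To make the rounding behavior concrete, here is a minimal helper implementing this formula with floor division (the name `conv_out_size` is my own, not from any library):

```python
def conv_out_size(w_in, f, s, p):
    """Output spatial size of a conv layer: floor((w_in - f + 2p) / s) + 1."""
    return (w_in - f + 2 * p) // s + 1
```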

You can refer to the official pytorch documentation: https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html#torch.nn.Conv2d

[Figure: screenshot of the output-shape formula from the PyTorch Conv2d documentation]

But for pooling, we can also choose to round up.

See the official pytorch documentation: https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html#torch.nn.MaxPool2d

There is a parameter ceil_mode; the default is False, which rounds down, and it can be set to True to round up:

ceil_mode – when True, will use ceil instead of floor to compute the output shape

So pooling sometimes rounds up (max pooling and average pooling sometimes use different rounding methods).
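As a quick illustration of ceil_mode, here is a sketch applying a 3x3, stride-2 pool to a 112x112 input; the only difference between the two layers is the rounding mode:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 112, 112)

floor_pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)                 # default: round down
ceil_pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1, ceil_mode=True)  # round up

print(floor_pool(x).shape)  # torch.Size([1, 64, 56, 56])
print(ceil_pool(x).shape)   # torch.Size([1, 64, 57, 57])
```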

That is to say, 112 = (224 - 7 + 2P)/2 + 1.

After simplification, 111 = (217 + 2P)/2 = 108.5 + P, which would give P = 2.5. Since padding must be an integer and the division rounds down, we take P = 3: floor((224 - 7 + 2x3)/2) + 1 = floor(111.5) + 1 = 112.

So P = 3, i.e. the padding is 3.
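Plugging the numbers into the helper defined above confirms this:

```python
print(conv_out_size(224, 7, 2, 3))  # 112
```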

So for the input image, the first layer is in_channel=3, out_channel=64, kernel_size=7, stride=2, padding=3.

There is no bias. After this layer we get a 112x112 feature map with 64 channels.
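In PyTorch this first layer could be written as follows; a minimal sketch, along the lines of what torchvision's ResNet uses for its stem:

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(in_channels=3, out_channels=64,
                  kernel_size=7, stride=2, padding=3, bias=False)

x = torch.randn(1, 3, 224, 224)  # a dummy batch: one 224x224 RGB image
print(conv1(x).shape)            # torch.Size([1, 64, 112, 112])
```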

Then comes a 3x3 max-pooling downsampling layer with stride=2.

W_out = (W_in - F + 2P)/S + 1

The pooling layer here also rounds down, so 56 = (112 - 3 + 2P)/2 + 1 gives P = 1.

So this second layer, the max pooling, is kernel_size=3, stride=2, padding=1 (pooling does not change the number of channels, so in_channel=out_channel=64).

After the pooling layer, we get a 56x56, 64-channel output, followed by the series of residual structures corresponding to conv2_x.
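Continuing the snippet above (reusing `conv1` and `x`), the pooling layer and the resulting shape look like this:

```python
maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

y = maxpool(conv1(x))  # conv1 and x from the previous snippet
print(y.shape)         # torch.Size([1, 64, 56, 56])
```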

Residual structure

After the first two layers, we get a 56x56, 64-channel output

followed by conv2_x.

This is the residual block. There are roughly two kinds: one with two convolutional layers and one with three, as drawn in the red boxes in the figure.

[Figure: the two residual block types, outlined in red in the architecture table]

For example, in ResNet34, the pooling layer is followed by two 3x3, 64-channel convolutional layers, which together form one residual block.

In ResNet50, the pooling layer is followed by a 1x1, 64-channel convolutional layer + a 3x3, 64-channel convolutional layer + a 1x1, 256-channel convolutional layer; the three together form one residual block.

The x3 written after it means that three such residual blocks are stacked in sequence.

As shown in the figure below. There is not much to say about the principle of the residual structure; here we mainly discuss the implementation.

[Figure: the residual structure]

There are a few points to note:

Taking ResNet34 as an example, according to the table, after pooling our feature map is 56x56x64, and after passing through conv2_x the output is still 56x56x64.

The input and output feature matrices have the same size, which means stride=1 and padding=1.

Because of the formula W_out = (W_in - F + 2P)/S + 1:

to keep the size unchanged, S must be 1 (anything greater would shrink the output); working out the padding then gives 1 as well.

So after these two layers, the feature map is still 56x56x64 and can be added directly to the shortcut branch of the residual block.
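A minimal sketch of this solid-line block (BN layers are left out here for clarity; they are added in the Batch Normalization section below):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Solid-line residual block: two 3x3 convs plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = self.conv2(out)
        return F.relu(out + x)  # same shape, so the addition works directly

block = BasicBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```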

But note what happens at conv3_x: its convolutions are 3x3 with 128 channels. That is, the 56x56, 64-channel output of conv2_x has its channels raised to 128 inside conv3_x, and the spatial size also changes, becoming 28x28.

Now there is a problem: the results of the shortcut branch and the main branch cannot be added, because their sizes and channel counts no longer match.

What to do? This is why there is also a residual block with a dotted-line shortcut, as shown in the figure:

[Figure: residual block with a dotted-line (downsampling) shortcut]

A 1x1 convolutional layer is placed on the shortcut branch; 1x1 convolutions are mainly used to raise or lower the channel dimension, and at the same time the spatial size can be changed by setting the stride.

So a 28x28, 128-channel feature map is obtained through this convolutional layer, which can then be added directly to the main branch.

Working through the formula, the 3x3 convolutions on the main branch use padding=1 (the first one with stride=2), while the 1x1 convolution on the shortcut branch uses stride=2 with padding=0.

ResNet50, ResNet101, and the other deep ResNets handle this in the same way.
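A minimal sketch of this dotted-line block in the ResNet18/34 style, with the parameter choices following the calculation above (again with BN omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsampleBlock(nn.Module):
    """Dotted-line residual block: the first 3x3 conv halves the spatial size
    and raises the channels; a 1x1 conv on the shortcut branch does the same
    so that the addition still works."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=2, padding=0)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = self.conv2(out)
        return F.relu(out + self.shortcut(x))

blk = DownsampleBlock(64, 128)
print(blk(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 128, 28, 28])
```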

[Figure: the dotted-line residual block in ResNet50/101]

But pay attention: this dotted-line structure exists only to solve the case where the sizes differ and the tensors cannot be added.

Under what circumstances is it needed?

For ResNet18 and ResNet34, only the first block of conv3_x, conv4_x, and conv5_x needs the dotted-line structure.

You can see in the table that the first block of conv3_x in ResNet34 outputs 28x28 with 128 channels while its input is 56x56 with 64 channels, so the dotted-line structure is needed; conv2_x does not need it, because its input and output are the same.

For ResNet50, ResNet101, and ResNet152, the first block of conv2_x, conv3_x, conv4_x, and conv5_x all need the dotted-line structure.

This is because in ResNet50 the input to conv2_x is 56x56 with 64 channels but its output is 56x56 with 256 channels, so conv2_x also needs the dotted-line structure; here, however, the shortcut only has to adjust the number of channels, not the spatial size.
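For this channels-only case, the shortcut convolution might look like the following sketch; note the stride of 1:

```python
import torch
import torch.nn as nn

# First block of conv2_x in ResNet50: the spatial size stays 56x56, so the
# shortcut 1x1 conv uses stride=1 and only raises the channels from 64 to 256.
shortcut = nn.Conv2d(64, 256, kernel_size=1, stride=1, bias=False)
print(shortcut(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 256, 56, 56])
```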

[Figure: the architecture table, showing the block counts per stage]

Therefore, according to the table, conv3_x of ResNet50 contains four residual blocks, and only the first one encounters the size mismatch and needs the dotted-line structure; the others do not.

Similarly, the first block of conv4_x and conv5_x also needs it.

Also note another improvement:

[Figure: the Batch Normalization improvement]

Batch Normalization

The purpose of Batch Normalization is to make the feature maps of a batch satisfy a distribution with mean 0 and variance 1.
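A quick sketch checking this with nn.BatchNorm2d: in training mode, with the learnable scale and shift at their initial values of 1 and 0, each channel of the output is normalized over the batch and spatial dimensions:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(64)                  # one set of statistics per channel
x = torch.randn(8, 64, 56, 56) * 5 + 3   # deliberately not mean-0 / variance-1
y = bn(x)                                # training mode: normalizes over batch, H, W

print(y.mean(dim=(0, 2, 3))[:3])  # per-channel means, all close to 0
print(y.std(dim=(0, 2, 3))[:3])   # per-channel stds, all close to 1
```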

I won't cover the principle of the BN layer here; you can watch this video and read this blog post:

https://www.bilibili.com/video/BV1T7411T7wa

https://blog.csdn.net/qq_37541097/article/details/104434557

What we need to know is this: the BN layer should be placed between the convolutional layer (Conv) and the activation layer (such as ReLU), and the convolutional layer should then not use a bias.
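Putting that together, a Conv-BN-ReLU unit might look like this sketch:

```python
import torch.nn as nn

# Conv (bias=False) -> BN -> ReLU: BN subtracts the batch mean right after the
# convolution, so a conv bias would be cancelled out and only wastes parameters.
conv_bn_relu = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
```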

So, our actual residual network block should look like this

[Figure: residual block with the Conv-BN-ReLU ordering]

Summary

At this point, the network structure of the entire ResNet is clear. The main points to note are:

  • Use the formula W_out = (W_in - F + 2P)/S + 1 to work out the parameters of each convolution
  • In ResNet18 and ResNet34, the first block of conv3_x, conv4_x, and conv5_x needs the dotted-line structure
  • In ResNet50, ResNet101, and ResNet152, the first block of conv2_x, conv3_x, conv4_x, and conv5_x all need the dotted-line structure
  • The BN layer is placed between the convolutional layer (Conv) and ReLU, and the convolutional layer does not use a bias
