In-depth understanding of ResNet principle analysis and code implementation

Table of Contents

1. The problem

2. How to solve

3. Why it works

4. Network implementation

5. Network input part

6. The intermediate convolution part of the network

7. Residual block implementation

8. Network output part

9. Network characteristics

 

10. Summary


1. The problem

Vanishing and exploding gradients hinder convergence from the very start of training; this issue is largely addressed by proper weight initialization and intermediate normalization layers. Once convergence is no longer the obstacle, however, a new phenomenon appears: degradation. As the depth increases, accuracy first improves and then drops sharply, and this degradation is not caused by overfitting, because after the network is deepened not only the test error but also the training error becomes higher; adding more layers to a suitably deep model produces a larger training error. This phenomenon, where deepening the network degrades its performance and both the training error and the test error rise noticeably, is called the degradation problem. ResNet was born to solve it.

2. How to solve

So the authors propose a construction: take a shallower architecture and add more layers to build a deeper counterpart. For the deeper model, let the added layers be identity mappings and copy the other layers from the learned shallower model. In that case, the deeper model should produce a training error no higher than that of its shallower counterpart. The basic unit of residual learning:
[Figure: the basic building block of residual learning]

The original network takes input x and is expected to output H(x). Now let H(x) = F(x) + x; then the network only needs to learn to output the residual F(x) = H(x) - x.

More concretely, suppose the input is x and the mapping learned by two stacked (e.g. fully connected) layers is H(x), meaning the two layers can asymptotically approximate H(x). If H(x) and x have the same dimensions, then fitting H(x) is equivalent to fitting the residual function H(x) - x. Let the residual function be F(x) = H(x) - x; the original mapping then becomes F(x) + x, so a skip connection is simply added on top of the original network. The skip connection here is very simple: the identity mapping of x is passed directly forward.
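To make this concrete, here is a minimal, illustrative sketch (not from the original post) of a residual unit built from two fully connected layers; the class name and dimensions are arbitrary:

import torch
import torch.nn as nn

class TinyResidualUnit(nn.Module):
    """Toy residual unit: output = F(x) + x, with F made of two linear layers."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.f(x) + x  # H(x) = F(x) + x; the "+ x" is the identity shortcut

# Shapes are preserved, so the shortcut addition is well defined:
print(TinyResidualUnit(8)(torch.randn(2, 8)).shape)  # torch.Size([2, 8])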


3. Why it works

  1. Adaptive depth: the degradation problem shows how hard it is for a stack of nonlinear layers to fit an identity mapping, i.e. to make H(x) reproduce x. With the residual structure, fitting the identity becomes easy: simply drive the parameters of F(x) toward 0, leaving only the identity shortcut. So when the network does not need to be that deep, more of the intermediate blocks can behave as identity mappings; otherwise fewer of them do.
  2. "Differential amplifier": assuming the optimal H(x) is close to the identity mapping, the network only has to find the small fluctuations around the identity, which is easier than learning the whole mapping from scratch.
  3. Mitigating vanishing gradients: differentiating a residual block with respect to its input x shows that, because of the shortcut connection, the total gradient is the derivative of F(x) with respect to x plus 1, so it can never vanish completely (see the small autograd check after this list).
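A tiny autograd check of point 3 (illustrative only; here F is just a scalar multiplication standing in for a learned mapping):

import torch

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(0.1)        # stands in for the parameters of F
h = w * x + x                # H(x) = F(x) + x, with F(x) = w * x
h.backward()
print(x.grad)                # tensor(1.1000) = dF/dx + 1, never exactly zero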

The biggest difference between a plain network and a deep residual network is that the residual network has many bypass branches that connect the input of a block directly to a later layer, so that the subsequent layers only have to learn a residual. These branches are called shortcuts. In a traditional convolutional or fully connected layer, more or less information is lost or degraded as it is passed along. ResNet alleviates this problem to some extent: by routing the input directly to the output, it protects the integrity of the information, and the network only needs to learn the difference between input and output, which simplifies the learning target and reduces its difficulty.


4. Network implementation

ResNet has five main variants: ResNet-18, ResNet-34, ResNet-50, ResNet-101 and ResNet-152.
As shown in the figure below, each network consists of three main parts: the input part, the intermediate convolution part (Stage 1 to Stage 4 in the figure, four stages in total) and the output part. Although ResNet has many variants, they all follow this overall structure; the networks differ mainly in the block parameters and the number of blocks in the intermediate convolution part. Let's take ResNet-18 as an example and see how the whole network is implemented.
[Figure: overall ResNet structure, showing the input part, Stage 1 to Stage 4, and the output part]

import torch.nn as nn
import torch.utils.model_zoo as model_zoo

class ResNet(nn.Module):
    # Only forward() is shown here; conv1, bn1, maxpool, layer1-4, avgpool and fc
    # are created in __init__ (see torchvision.models.resnet for the full class).
    def forward(self, x):
        # Input part
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        # Intermediate convolution part
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        # Output part
        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)

        return x

# Build a ResNet-18 network (BasicBlock is defined in section 7; model_urls maps
# model names to pretrained weight URLs in torchvision's resnet module)
def resnet18(pretrained=False, **kwargs):
    model = ResNet(BasicBlock, [2, 2, 2, 2], **kwargs)
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet18']))
    return model

(1) After entering the network, the data first passes through the input part (conv1, bn1, relu, maxpool);
(2) it then goes through the intermediate convolution part (layer1, layer2, layer3, layer4; each "layer" here corresponds to one of the stages mentioned earlier);
(3) finally, the data passes through an average pooling and a fully connected layer (avgpool, fc) to produce the output.
Specifically, ResNet-18 differs from the other networks in the ResNet series mainly in layer1 to layer4; the other components are essentially the same.
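As a quick, illustrative sanity check of this overall flow, we can run a dummy input through the ready-made torchvision ResNet-18 (the snippet above only shows forward(), so the off-the-shelf model with the same architecture is used instead):

import torch
from torchvision.models import resnet18 as tv_resnet18

net = tv_resnet18()                      # randomly initialized ResNet-18
out = net(torch.randn(1, 3, 224, 224))   # one dummy 224x224 RGB image
print(out.shape)                         # torch.Size([1, 1000])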

5. Network input part

The input part is the same for all ResNet networks: a large 7x7 convolution with stride 2, followed by a 3x3 max pooling with stride 2. Through these two steps, a 224x224 input image becomes a 56x56 feature map, which greatly reduces the memory required by the subsequent layers.

self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)  # 224x224 -> 112x112
self.bn1 = nn.BatchNorm2d(64)
self.relu = nn.ReLU(inplace=True)
self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)                # 112x112 -> 56x56
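A short shape trace of the input part, rebuilt standalone here purely for illustration:

import torch
import torch.nn as nn

stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),  # 224 -> 112
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),                  # 112 -> 56
)
print(stem(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 64, 56, 56])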

6. The intermediate convolution part of the network

The intermediate convolution part is the blue-boxed region in the figure below; it extracts information through stacks of 3x3 convolutions. The [2, 2, 2, 2] and [3, 4, 6, 3] in the red box indicate how many times the block is repeated in each stage.

[Figure: ResNet architecture table, with the per-stage block counts ([2, 2, 2, 2] for ResNet-18, [3, 4, 6, 3] for ResNet-34) highlighted in red]

The resnet18() function above contains the line ResNet(BasicBlock, [2, 2, 2, 2], **kwargs). The [2, 2, 2, 2] here matches the red box in the figure; if you change this line to ResNet(BasicBlock, [3, 4, 6, 3], **kwargs), you get a ResNet-34 network, as sketched below.
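A ResNet-34 constructor would then look like this (a sketch in the same style as resnet18() above, assuming the same BasicBlock, model_zoo and model_urls are in scope):

def resnet34(pretrained=False, **kwargs):
    model = ResNet(BasicBlock, [3, 4, 6, 3], **kwargs)   # only the block counts change
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet34']))
    return model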

7. Residual block implementation

Let's take a closer look at how a residual block is implemented. In the basic block shown in the figure below, the input is split into two paths: one path passes through two 3x3 convolutions, and the other is the shortcut that carries the input across unchanged. The two paths are summed and passed through a ReLU to produce the output. Very simple.

[Figure: the BasicBlock structure, two 3x3 convolutions plus an identity shortcut]

def conv3x3(in_planes, out_planes, stride=1):
    # 3x3 convolution with padding (helper used by BasicBlock)
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=1, bias=False)


class BasicBlock(nn.Module):
    expansion = 1  # output channels = planes * expansion (1 for the basic block)

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = nn.BatchNorm2d(planes)
        self.downsample = downsample  # optional 1x1 conv to match shapes on the shortcut
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity          # F(x) + x: add the shortcut
        out = self.relu(out)

        return out
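How a whole stage (layer1 to layer4) is assembled from these blocks is not shown above; the sketch below mirrors the idea of torchvision's _make_layer: the first block of a stage may downsample and change the channel count, so its shortcut gets a 1x1 convolution, while the remaining blocks use pure identity shortcuts.

def make_layer(inplanes, planes, blocks, stride=1):
    """Stack `blocks` BasicBlocks; only the first one may downsample (stride > 1)."""
    downsample = None
    if stride != 1 or inplanes != planes * BasicBlock.expansion:
        # 1x1 conv on the shortcut so that identity and F(x) have matching shapes
        downsample = nn.Sequential(
            nn.Conv2d(inplanes, planes * BasicBlock.expansion,
                      kernel_size=1, stride=stride, bias=False),
            nn.BatchNorm2d(planes * BasicBlock.expansion),
        )
    layers = [BasicBlock(inplanes, planes, stride, downsample)]
    layers += [BasicBlock(planes * BasicBlock.expansion, planes) for _ in range(1, blocks)]
    return nn.Sequential(*layers)

# e.g. layer2 of ResNet-18: 64 -> 128 channels, spatial size halved, 2 blocks
# layer2 = make_layer(64, 128, blocks=2, stride=2)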


8. Network output part

The output part of the network is very simple: a global adaptive average pooling squeezes every feature map down to 1x1. For ResNet-18, the 1x512x7x7 input is pooled down to 1x512x1x1 and then fed into the fully connected layer, whose number of output nodes equals the number of predicted classes.

self.avgpool = nn.AdaptiveAvgPool2d((1, 1))              # global average pooling to 1x1
self.fc = nn.Linear(512 * block.expansion, num_classes)  # 512 channels for BasicBlock (expansion = 1)
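An illustrative shape trace of the output part for ResNet-18 (block.expansion is 1, and num_classes is assumed to be 1000 here):

import torch
import torch.nn as nn

feat = torch.randn(1, 512, 7, 7)                 # last feature map for a 224x224 input
pooled = nn.AdaptiveAvgPool2d((1, 1))(feat)      # -> 1x512x1x1
flat = pooled.view(pooled.size(0), -1)           # -> 1x512
logits = nn.Linear(512, 1000)(flat)              # -> 1x1000
print(pooled.shape, flat.shape, logits.shape)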

9. Network characteristics

[Figure: full ResNet network structure]

ResNet uses no dropout at all; batch normalization (BN) is used throughout. In addition:

  1. Inspired by VGG, the convolutional layers mainly use 3x3 kernels;
  2. layers that produce feature maps of the same size, i.e. layers within the same stage, use the same number of 3x3 filters;
  3. when the feature map size is halved, the number of filters is doubled to keep the per-layer time complexity roughly constant (see the shape trace after this list);
  4. each stage downsamples through a convolution with stride 2, and this downsampling happens only in the first convolution of each stage, exactly once;
  5. the network ends with a global average pooling layer and a 1000-way fully connected layer with softmax. In practice, adaptive global average pooling is generally used.
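The shape trace below (using the ready-made torchvision ResNet-18, purely for illustration) shows points 2 to 4 in action: each stage halves the spatial size, doubles the channel count, and downsamples exactly once at its start.

import torch
from torchvision.models import resnet18 as tv_resnet18

net = tv_resnet18()
x = torch.randn(1, 3, 224, 224)
x = net.maxpool(net.relu(net.bn1(net.conv1(x))))        # input part: -> (1, 64, 56, 56)
for name in ['layer1', 'layer2', 'layer3', 'layer4']:
    x = getattr(net, name)(x)
    print(name, tuple(x.shape))
# layer1 (1, 64, 56, 56)
# layer2 (1, 128, 28, 28)
# layer3 (1, 256, 14, 14)
# layer4 (1, 512, 7, 7)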

As the network structure in the figure shows, after the convolutional stages there is a global average pooling (GAP) layer before the fully connected layer.

10. Summary

  1. Compared with a traditional classification network, the convolutional part here is followed by global pooling rather than a large fully connected layer. Pooling needs no parameters, so compared with a fully connected layer it removes a huge number of them: for a 7x7 feature map, pooling first saves nearly 50x the parameters of flattening straight into a fully connected layer (see the quick count below). This has two effects: it saves computing resources, and it reduces overfitting and improves generalization;
  2. Global average pooling is used here. Experimental results in some papers show that average pooling is slightly better than max pooling, though max pooling is not much worse. In practice you can adjust this to your own needs; for example, multi-class problems may be better suited to global max pooling.
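A quick back-of-the-envelope count behind the "nearly 50x" figure in point 1 (bias terms ignored):

fc_on_flat_map = 512 * 7 * 7 * 1000   # flatten 7x7x512 straight into a 1000-way FC: ~25.1M weights
fc_after_gap   = 512 * 1000           # global average pool first, then FC: ~0.51M weights
print(fc_on_flat_map / fc_after_gap)  # 49.0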
     

 


Source: blog.csdn.net/weixin_36670529/article/details/105224997