Detailed explanation of ResNet principle

Residual Net

First, the figure below shows a simple residual connection. Before getting into the details, let's discuss why residual connections are needed at all.
[Figure: a simple residual (skip) connection]

Note: GoogLeNet concatenates feature maps at its merge points, whereas ResNet uses element-wise addition.

Why introduce residual connections?

It has become common wisdom that, up to a point, the deeper the network, the stronger its expressive power and the better its performance.

However, as network depth increases, several problems appear:

  1. Higher consumption of computing resources
  2. A model that is more prone to overfitting
  3. Vanishing/exploding gradients

Were there no attempts to address these problems before ResNet appeared? Of course there were. Better optimization methods, better initialization strategies, batch normalization (BN), activation functions such as ReLU, and even GPU clusters were all applied, but they only went so far. Moreover, as the number of layers grows, the network degrades: the training loss decreases at first and then saturates, and if the depth is increased further, the training loss actually rises. Note that this is not overfitting, because in overfitting the training loss keeps decreasing. This degradation problem was not resolved until residual connections came into wide use.

An example of backpropagation looks like this:

[Figure: backpropagation via the chain rule, with the gradient written as a product of per-layer derivatives]

The hidden danger is this: once one of the derivatives in the chain is very small, the gradient can shrink further with every multiplication. This is commonly called the vanishing gradient problem; for a deep network, by the time the gradient reaches the shallow layers it has almost disappeared. With a residual connection, however, an identity term is added to each derivative: dh/dx = d(f + x)/dx = 1 + df/dx. Even if the original derivative df/dx is very small, the error can still be backpropagated effectively. This is the core idea.
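As a quick illustration (a minimal sketch, not code from the post: the depth, width, tanh activation, and the deliberately small weight initialization are all assumptions chosen to make the effect visible), the snippet below compares the input gradient of a deep plain stack with the same stack wrapped in identity skip connections:

import torch
from torch import nn

torch.manual_seed(0)
depth, width = 50, 64
layers = [nn.Linear(width, width) for _ in range(depth)]
for layer in layers:
    nn.init.normal_(layer.weight, std=0.01)   # deliberately small weights
    nn.init.zeros_(layer.bias)

def plain_forward(x):
    for layer in layers:
        x = torch.tanh(layer(x))       # h = f(x): per-layer derivatives multiply together
    return x

def residual_forward(x):
    for layer in layers:
        x = x + torch.tanh(layer(x))   # h = x + f(x): each factor becomes 1 + df/dx
    return x

x = torch.randn(1, width, requires_grad=True)
plain_forward(x).sum().backward()
print('plain gradient norm:   ', x.grad.norm().item())    # collapses toward zero

x = torch.randn(1, width, requires_grad=True)
residual_forward(x).sum().backward()
print('residual gradient norm:', x.grad.norm().item())    # stays large; the identity term keeps it from vanishing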

This leads to the following idea:

When a network degrades, a shallow network can reach a better training result than a deeper one. If we could pass the low-level features forward to the higher layers, the result should be at least no worse than that of the shallow network. For example, if layer 98 of a hypothetical VGG-100 used exactly the same features as layer 14 of VGG-16, then VGG-100 should perform at least as well as VGG-16. So we can add a direct mapping between layer 14 and layer 98 of VGG-100 to achieve this effect.

From the perspective of information theory, because of the data processing inequality (DPI), the image information contained in the feature maps decreases layer by layer during forward propagation. With ResNet's direct mapping added, layer L+1 is guaranteed to contain no less image information than layer L.

Based on this idea of using a direct mapping to connect different layers of the network, the residual network was born.

Summary:

**The authors argue that network degradation, rather than vanishing gradients, is the root cause of the difficulty in training deep networks.** Even when the gradient norm is large, the network's available degrees of freedom may contribute to that norm very unevenly: only a small number of hidden units in each layer change their activations for different inputs, while most hidden units respond to different inputs in the same way, so the effective rank of the weight matrix is low. And as the number of layers increases, the rank of the product of these matrices becomes lower still.

Residual network structure analysis

ResNet follows the same overall framework as VGG and GoogLeNet, but replaces the building blocks with residual blocks.

The residual network is composed of a series of residual blocks (as follows). A residual block can be expressed as:
x_{l+1} = h(x_l) + F(x_l, W_l)
that is, the input of the next layer is obtained by adding the residual to the direct mapping of the current layer's input.

The residual block has two parts: a direct mapping and a residual part. h(x_l) is the direct mapping, corresponding to the straight line on the left of the figure below; F(x_l, W_l) is the residual part, which usually consists of two or three convolution operations, i.e., the branch containing the convolutions on the right of the figure.

[Figure: a residual block, with the direct mapping on the left and the weight (convolution) layers on the right, merged by addition]

Note: "weight" in the figure refers to a convolution operation in a convolutional network, and "addition" refers to element-wise addition.

From this we can see why ResNet can be stacked to 1000 layers: later layers receive the information of earlier layers directly, so little information is lost, which makes very deep network structures feasible.
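To make this concrete, here is a brief derivation (added for clarity; it assumes the direct mapping h is the identity, as in the standard residual block). Applying x_{l+1} = x_l + F(x_l, W_l) repeatedly gives

x_L = x_l + Σ_{i=l}^{L-1} F(x_i, W_i)

for any deeper layer L. The shallow feature x_l is carried to layer L unchanged, plus a sum of residual terms, and by the chain rule the gradient ∂loss/∂x_l contains the term ∂loss/∂x_L multiplied by 1, so the error signal reaches the shallow layer without passing through a long product of derivatives.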

In a convolutional network, the numbers of feature maps of x_l and x_{l+1} may differ. In that case a 1×1 convolution is needed to increase or decrease the dimension, and the residual block is expressed as (Equation 2):
x_{l+1} = h(x_l) + F(x_l, W_l),  where h(x_l) = W'_l x_l

Here W'_l denotes a 1×1 convolution. However, experiments show that the 1×1 convolution brings only a limited performance improvement, so it is generally used only when the dimension needs to be increased or decreased.

The 1×1 convolution reduces the amount of computation without losing too much information. We usually use multiple convolution kernels to extract features (different kernels extract different features). If the original image does not carry much information (few features), the feature maps produced by many kernels will be sparse, so a 1×1 convolution can be used for dimensionality reduction (because the matrix is sparse, little information is lost in the compression); after the subsequent operations, the dimension can be raised again.
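For reference, here is a tiny sketch (the channel counts, stride, and spatial size are arbitrary assumptions, not values from the post) of a 1×1 convolution adjusting the channel dimension, and, with a stride of 2, the resolution, so that the shortcut can be added to the residual branch:

import torch
from torch import nn

x = torch.rand(1, 64, 56, 56)                         # (batch, channels, height, width)
proj = nn.Conv2d(64, 128, kernel_size=1, stride=2)    # match channels and halve H and W
print(proj(x).shape)                                  # torch.Size([1, 128, 28, 28])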

Implementation of Residual Block

import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

class Residual(nn.Module):
    def __init__(self, input_channels, num_channels,
                 use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = nn.Conv2d(input_channels, num_channels,
                               kernel_size=3, padding=1, stride=strides)
        self.conv2 = nn.Conv2d(num_channels, num_channels,
                               kernel_size=3, padding=1)
        if use_1x1conv:  # optionally transform the shortcut with a 1x1 convolution
            self.conv3 = nn.Conv2d(input_channels, num_channels,
                                   kernel_size=1, stride=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm2d(num_channels)
        self.bn2 = nn.BatchNorm2d(num_channels)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        Y += X    # add the direct mapping to the residual branch: this skip connection makes it a residual block
        return F.relu(Y)

The code above can generate two types of blocks. One, with use_1x1conv=False, adds the input directly to the output before the final ReLU. The other, with use_1x1conv=True, first passes the input through a 1×1 convolution to adjust the channels and resolution before the addition.
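As a quick sanity check (a small usage sketch in the d2l style; the 4×3×6×6 input size is an arbitrary assumption), the block preserves the shape without the 1×1 convolution and changes channels and resolution with it:

blk = Residual(3, 3)
X = torch.rand(4, 3, 6, 6)
print(blk(X).shape)    # torch.Size([4, 3, 6, 6])

blk = Residual(3, 6, use_1x1conv=True, strides=2)
print(blk(X).shape)    # torch.Size([4, 6, 3, 3])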

ResNet model implementation

The first two layers of ResNet are the same as in the GoogLeNet introduced earlier: a 7×7 convolutional layer with 64 output channels and a stride of 2, followed by a 3×3 max-pooling layer with a stride of 2. The difference is that ResNet adds a batch normalization layer after each convolutional layer.

b1 = nn.Sequential(nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
                   nn.BatchNorm2d(64), nn.ReLU(),
                   nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

ResNet uses 4 modules composed of residual blocks, and each module uses several residual blocks with the same number of output channels. The number of channels of the first module is the same as the number of input channels. Since the maximum pooling layer with a stride of 2 has been used before, there is no need to reduce the height and width. Each subsequent module doubles the number of channels of the previous module in the first residual block and halves the height and width.

def resnet_block(input_channels, num_channels, num_residuals,
                 first_block=False):
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            # the first block of every module except the first halves
            # the height/width and changes the number of channels
            blk.append(Residual(input_channels, num_channels,
                                use_1x1conv=True, strides=2))
        else:
            blk.append(Residual(num_channels, num_channels))
    return blk

b2 = nn.Sequential(*resnet_block(64, 64, 2, first_block=True))
b3 = nn.Sequential(*resnet_block(64, 128, 2))
b4 = nn.Sequential(*resnet_block(128, 256, 2))
b5 = nn.Sequential(*resnet_block(256, 512, 2))

net = nn.Sequential(b1, b2, b3, b4, b5,
                    nn.AdaptiveAvgPool2d((1,1)),
                    nn.Flatten(), nn.Linear(512, 10))

Each module has 4 convolutional layers (not including the 1×1 convolutional layer for the identity map). Adding the first 7×7 convolutional layer and the last fully connected layer, there are 18 layers in total. Therefore, this model is often called ResNet-18.
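Before training, it helps to push a dummy input through the network and check the output shape after each stage (a quick sketch; the single-channel 1×224×224 input follows the nn.Conv2d(1, 64, ...) stem above and d2l's Fashion-MNIST setup):

X = torch.rand(size=(1, 1, 224, 224))
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape:\t', X.shape)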

Origin blog.csdn.net/gary101818/article/details/124570807