Deep Learning - Residual Network (ResNet)

With the development and popularity of convolutional neural networks, we have learned that increasing the number of layers can improve a model's training accuracy and generalization ability. However, simply increasing the depth of the network may cause problems such as vanishing gradients ("gradient dispersion") and exploding gradients. The traditional remedies are careful weight initialization (normalized initialization) and batch normalization. Although these techniques alleviate the gradient problem, deepening the network brings another issue: the degradation of network performance.

1. What is the network degradation problem?

As can be seen from the figure above, the 56-layer network performs worse than the 20-layer network on both the training set and the test set [Note: this is not overfitting, since overfitting means performing well on the training set but poorly on the test set]. This shows that simply increasing network depth can degrade the model, and the features the network has acquired may be lost.

Network degradation: as the number of network layers increases, the training accuracy gradually saturates; if the depth continues to increase, the training accuracy then declines, and this decline is not caused by overfitting. In effect, what the deeper model appends is not an identity mapping but additional nonlinear layers.

The essence of the degradation problem: it may be difficult to approximate an identity mapping with a stack of nonlinear layers (the identity mapping, also called the identity function, is the map that sends every element to itself). During backpropagation the network must keep propagating the gradient backwards, and as the depth grows the gradient gradually vanishes along the way [and vanishing gradients are an important factor behind network degradation], so the weights of the earlier layers cannot be adjusted effectively. (You can picture shaking one end of a very long rope: sometimes the motion never reaches the other end.)

2. The problem solved by the residual network

As a neural network gets deeper, the correlation between the gradients propagated backwards becomes weaker and weaker, eventually approaching white noise. Since images have local correlation, the gradients should exhibit a similar correlation for the updates to be meaningful. If the gradients are close to white noise, the gradient update may amount to nothing more than random perturbation.

Kaiming He proposed the concept of the residual network; the paper is available at: https://arxiv.org/abs/1512.03385

Starting from the degradation problem, the author of the paper ( https://arxiv.org/pdf/1702.08591.pdf ) constructs a deep model from a shallow network by adding identity mappings. It turns out that the resulting deep model does not achieve an error rate equal to or lower than that of the shallow network, from which it is inferred that the degradation problem may arise because deep networks are not so easy to train, that is, it is difficult for the solver to fit an identity mapping with multiple nonlinear layers.

If the later layers of a deep network were identity mappings, the model would simply reduce to a shallow network. The problem then becomes learning this identity mapping. But it is hard to make a stack of layers directly fit a potential identity mapping H(x) = x, which may be the reason deep networks are difficult to train. However, the network can instead be designed as H(x) = F(x) + x, so that the stacked layers only need to learn the residual F(x).

1. Principle of residual block:

The mathematical model of a residual block is shown below. The biggest difference between the residual network and earlier networks is the additional identity shortcut branch. Because of this branch, during backpropagation the loss can pass its gradient directly to the earlier layers through the shortcut, which alleviates the degradation problem. When analyzing the causes of network degradation in the second section, we saw that gradient correlation matters. Using gradient correlation as an indicator, the author analyzed a series of structures and activation functions and found that ResNet is excellent at preserving it. From the perspective of gradient flow, part of the gradient flows backwards unchanged through the shortcut, and the correlation of this part is very strong. In addition, the residual structure adds no new parameters, only one extra addition; with GPU acceleration this extra computation is almost negligible.
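A quick sketch of why the shortcut passes the gradient straight through, writing one residual block as y = F(x) + x and the loss as L:

\frac{\partial L}{\partial x}=\frac{\partial L}{\partial y}\left(\frac{\partial F(x)}{\partial x}+I\right)=\frac{\partial L}{\partial y}+\frac{\partial L}{\partial y}\,\frac{\partial F(x)}{\partial x}

The identity term I guarantees that one part of the gradient reaches the earlier layers unattenuated, no matter how small \partial F(x)/\partial x becomes.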

However, because the residual block ends with the operation F(x) + x, the shapes of F(x) and x must match. In practical network construction, a 1x1 convolution can be used on the shortcut to change the number of channels when they differ.

Figure 1: the residual learning building block (identity shortcut).

Instead, we can learn a residual function F(x) = H(x) - x. As long as F(x) = 0, we recover the identity mapping H(x) = x, and fitting the residual is much easier.

F is the network mapping before the summation, and H is the mapping from the input to the summation point. For example, suppose an input of 5 should be mapped to 5.1. Before the residual is introduced:

F'(5) = 5.1

After the residual is introduced: H(5) = 5.1, H(5) = F(5) + 5, so F(5) = 0.1.

Here F' and F both denote the learned network mapping, and the mapping with the residual is more sensitive to changes in the output. For example, when the desired output changes from 5.1 to 5.2, the plain mapping F' increases by about 2%, while in the residual structure F goes from 0.1 to 0.2, an increase of 100%. Clearly, the output change has a much larger effect on the weight adjustment in the latter case, so the training signal is better. The idea of the residual is to subtract away the identical main part and thereby highlight small changes.
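A two-line numeric check of this sensitivity argument (plain Python, just to spell out the arithmetic):

# relative change the plain mapping H must express when the target moves from 5.1 to 5.2
print((5.2 - 5.1) / 5.1)  # ~0.02, i.e. about 2%
# relative change of the residual F = H - x for the same target move
print((0.2 - 0.1) / 0.1)  # ~1.0, i.e. about 100%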

2. Residual block code example:

import torch
import torch.nn as nn

# Residual block
class Res_Block(nn.Module):
    def __init__(self, c):
        super(Res_Block, self).__init__()
        # The block keeps the same number of channels for input and output so the shortcut can be added
        self.layer = nn.Sequential(
            # padding=1 keeps the 3x3 convolutions from changing the spatial size, so F(x) and x can be added
            nn.Conv2d(in_channels=c, out_channels=c, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(c),  # batch normalization
            nn.ReLU(),
            nn.Conv2d(c, c, 3, 1, 1),
            nn.BatchNorm2d(c),
            nn.ReLU(),
        )

    def forward(self, x):
        # residual connection: add the input x to the block's output
        return self.layer(x) + x

if __name__ == '__main__':
    net = Res_Block(3)
    x = torch.randn(1, 3, 28, 28)
    y = net(x)
    print(y.shape)  # torch.Size([1, 3, 28, 28])
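A design note on the block above: it applies its final ReLU inside the Sequential, before the shortcut addition, whereas the building block in the original paper applies the last ReLU after the addition. A minimal sketch of that post-addition ordering (an illustrative variant, not the code from the original post):

import torch
import torch.nn as nn

class Res_Block_PostAct(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.layer = nn.Sequential(
            nn.Conv2d(c, c, 3, 1, padding=1, bias=False),
            nn.BatchNorm2d(c),
            nn.ReLU(),
            nn.Conv2d(c, c, 3, 1, padding=1, bias=False),
            nn.BatchNorm2d(c),
            # no ReLU here: it is applied after the shortcut addition
        )
        self.act = nn.ReLU()

    def forward(self, x):
        # add the identity shortcut first, then activate
        return self.act(self.layer(x) + x)

if __name__ == '__main__':
    print(Res_Block_PostAct(3)(torch.randn(1, 3, 28, 28)).shape)  # torch.Size([1, 3, 28, 28])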

3. Residual Network Discussion

As for why the input to the shortcut is x rather than x/2 or some other form: the author discusses this question in a follow-up paper, comparing six residual structures (Figure 2); the x/2 (constant-scaling) shortcut is the second of them. It turns out that the first one, the plain identity shortcut, works best.

This residual learning structure can be realized with a feed-forward network plus a shortcut connection, as shown in Figure 1. The shortcut connection simply performs the identity mapping, so it introduces no extra parameters and does not increase the computational complexity, and the whole network can still be trained end to end with backpropagation.

According to the theory that a multi-layer neural network can approximate any function, a few stacked layers could be used to fit either the desired mapping H(x) = x directly or the residual function F(x) = H(x) - x, and fitting the residual function is simpler. Although in theory both can be approximated, the latter is clearly easier to learn. The authors say this residual formulation is motivated by the degradation problem: as discussed earlier, if the added layers could be constructed as identity mappings, then in theory the training error of the deeper model should be no greater than that of the shallower one, but the observed degradation shows that it is difficult for the solver to fit an identity mapping with multiple nonlinear layers. The residual representation makes this much easier: if the identity mapping is optimal, the weights of the stacked layers simply need to be pushed toward zero, i.e. F(x) = 0.

In practice, the optimal mapping is unlikely to be exactly the identity, but with residual learning the solver can more easily find the small perturbations relative to the identity mapping; in short, this is much easier than learning an entirely new mapping directly. Experiments show that the learned residual functions usually have small response values, which suggests that the identity shortcut provides a reasonable preconditioning.

The identity mapping is realized through the shortcut connection:

y=F(x,W_{i})+x

F(x,W_{i}) is the residual mapping to be learned; for the two-layer block of Figure 1, F=W_{2}\sigma(W_{1}x), where \sigma denotes the ReLU activation (biases are omitted for simplicity).

The addition of F(x) and x is element-wise, but if their dimensions differ, a linear projection W_{s} must be applied to x to match the dimensions:

y=F(x,W_{i})+W_{s}x

The number of layers used to learn the residual should be greater than one; with a single layer the block degenerates into a linear mapping, y = W_{1}x + x = (W_{1} + I)x, which offers no advantage. The paper experiments with 2 or 3 layers per block, and more layers are also feasible.

Residual learning with convolutional layers: the formulas above are written with fully connected layers for simplicity, but they apply equally to convolutional layers. The addition then becomes an element-wise addition of the two feature maps, channel by channel.

1. Network structure

Starting from VGG-19, the author designed a plain network and a residual network, shown as the middle and right networks in Figure 3, and then compared the two experimentally.

Rules for designing a network:

1. Layers with the same output feature map size have the same number of filters, that is, the same number of channels;

2. When the feature map size is halved (downsampling), the number of filters is doubled.

For the residual network, shortcut connections whose dimensions match are drawn as solid lines, and the others as dashed lines. When the dimensions do not match, there are two options for the shortcut mapping: (A) keep the identity and increase the dimension (channels) by zero padding, or (B) project into the new space by multiplying by a matrix W_{s}, implemented as a 1x1 convolution whose number of filters is set to the new channel count; this adds parameters.
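A minimal sketch of the two options in PyTorch (the module names are illustrative; option A follows the common zero-padding implementation used for CIFAR-style ResNets, option B uses a 1x1 convolution as the projection W_{s}):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Option A: identity shortcut with zero padding, no extra parameters
class ZeroPadShortcut(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.extra = c_out - c_in  # number of channels to add
    def forward(self, x):
        x = x[:, :, ::2, ::2]  # subsample spatially (stride 2)
        # pad the channel dimension with zeros: (W_left, W_right, H_top, H_bottom, C_front, C_back)
        return F.pad(x, (0, 0, 0, 0, 0, self.extra))

# Option B: projection shortcut, a 1x1 convolution playing the role of W_s (adds parameters)
class ProjectionShortcut(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.proj = nn.Conv2d(c_in, c_out, kernel_size=1, stride=2, bias=False)
    def forward(self, x):
        return self.proj(x)

if __name__ == '__main__':
    x = torch.randn(1, 64, 56, 56)
    print(ZeroPadShortcut(64, 128)(x).shape)     # torch.Size([1, 128, 28, 28])
    print(ProjectionShortcut(64, 128)(x).shape)  # torch.Size([1, 128, 28, 28])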

Figure 3.

The recommended ResNet configurations are shown in the figure above. The author also replaced the fully connected layers with global average pooling. On the one hand, this greatly reduces the number of parameters, and the pooling itself plays a regularizing role, which helps prevent overfitting of the overall structure. Furthermore, global average pooling aggregates spatial information and is therefore more robust to spatial transformations of the input.
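As a small illustration of this design choice (a sketch, not code from the original post), a classification head that uses global average pooling instead of flattening into a large fully connected layer can be written as:

import torch
import torch.nn as nn

# (N, 512, H, W) -> (N, 512, 1, 1) -> (N, 512) -> (N, 10)
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),  # average each channel over its spatial dimensions
    nn.Flatten(),
    nn.Linear(512, 10),       # only 512*10 weights, independent of the input resolution
)

x = torch.randn(4, 512, 7, 7)
print(head(x).shape)  # torch.Size([4, 10])

Compare this with the nn.Linear(512*32*32, 10) head in the implementation below, which ties the classifier to a fixed 32x32 feature map and uses roughly a thousand times more weights in that layer.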

2. Residual network code implementation:

import torch
import torch.nn as nn

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Residual block
class Res_Block(nn.Module):
    def __init__(self, c):
        super(Res_Block, self).__init__()
        # The block keeps the channel count unchanged so the shortcut can be added
        self.layer = nn.Sequential(
            nn.Conv2d(c, c, 3, 1, padding=1),  # padding=1 keeps the spatial size unchanged
            nn.ReLU(),
            nn.Conv2d(c, c, 3, 1, padding=1),
            nn.ReLU(),
        )

    def forward(self, x):
        # residual connection: add the input x to the block's output
        return self.layer(x) + x

# Transition block: increases the number of channels
# (note: stride=1 everywhere, so it does not actually reduce the spatial size)
class Pool(nn.Module):
    def __init__(self, c_in, c_out):
        super(Pool, self).__init__()
        self.layer = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, 1, padding=1),
            nn.ReLU(),
            nn.Conv2d(c_out, c_out, 3, 1, padding=1),
            nn.ReLU()
        )

    def forward(self, x):
        return self.layer(x)

# Full network built from stacked residual blocks
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layer = nn.Sequential(
            nn.Conv2d(1, 64, 3, 1, padding=1),
            nn.ReLU(),

            Res_Block(64),  # instances of the residual block
            Res_Block(64),
            Res_Block(64),
            Pool(64, 128),

            Res_Block(128),
            Res_Block(128),
            Res_Block(128),
            Res_Block(128),
            Pool(128, 256),

            Res_Block(256),
            Res_Block(256),
            Res_Block(256),
            Res_Block(256),
            Res_Block(256),
            Pool(256, 512),

            Res_Block(512),
            Res_Block(512),
        )
        # classifier head: the spatial size is still 32x32, hence 512*32*32 input features
        self.layer2 = nn.Sequential(
            nn.Linear(512 * 32 * 32, 10)
        )

    def forward(self, x):
        OUT = self.layer(x)
        OUT = OUT.reshape(-1, 512 * 32 * 32)
        return self.layer2(OUT)

if __name__ == '__main__':
    res = Net()
    x = torch.randn(32, 1, 32, 32)
    y = res(x)
    print(y.shape)  # torch.Size([32, 10])
    # print(res)

The authors also explored deeper networks. Considering the training time required, the original building block (the two-layer residual learning structure) is changed to a bottleneck structure, as shown in Figure 4: a 1x1 convolution at the head and another at the end reduce and then restore the dimension, so that the middle 3x3 convolution operates on a bottleneck with fewer channels. The time complexity of the two structures is similar. In this design, the parameters introduced by a projection-type shortcut would become a non-negligible part (since the shortcut connects the high-dimensional ends), so the parameter-free identity shortcut should be used instead. Replacing the original residual blocks with bottleneck blocks and stacking more of them increases the network depth, producing ResNet-50, ResNet-101, and ResNet-152. As depth increases, performance keeps improving, because the degradation problem has been addressed.
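A minimal sketch of such a bottleneck block (the 256-64-64-256 channel sizes follow the example in the paper; the identity shortcut here assumes matching dimensions):

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    # 1x1 reduces the channels, 3x3 works at the reduced width, 1x1 restores the channels
    def __init__(self, channels=256, width=64):
        super().__init__()
        self.layer = nn.Sequential(
            nn.Conv2d(channels, width, kernel_size=1, bias=False),
            nn.BatchNorm2d(width),
            nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(width),
            nn.ReLU(),
            nn.Conv2d(width, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        # identity shortcut; dimensions are assumed to match
        return self.act(self.layer(x) + x)

if __name__ == '__main__':
    x = torch.randn(1, 256, 14, 14)
    print(Bottleneck()(x).shape)  # torch.Size([1, 256, 14, 14])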

Summary:

I am still learning about residual networks. Most of the content above comes from the Internet; I have only organized it and implemented it in code. I may study residual networks in more depth later. Bye.
