Deep Learning - Residual Network (ResNet)

1. Preliminary knowledge
In VGG, the convolutional network reached 19 layers, and GoogLeNet pushed this to an unprecedented 22 layers. So, does accuracy keep increasing as we add more layers? In deep learning, increasing the number of layers is generally accompanied by the following problems:

  • Increased consumption of computing resources
  • The model becomes prone to overfitting
  • Vanishing/exploding gradients

Problem 1 can be handled with GPU clusters, which is not a serious obstacle for an enterprise; problem 2 can be effectively mitigated by collecting massive amounts of data together with techniques such as dropout regularization; problem 3 can likewise be alleviated with Batch Normalization. It would seem that we could benefit from stacking layers without a second thought, but the experimental data deal this idea a blow.
The authors found that as the number of layers increases, the network exhibits a degradation phenomenon: the training loss first decreases gradually and then saturates, but when the depth is increased further, the training loss goes up instead. Note that this is not overfitting, because with overfitting the training loss keeps decreasing.
When a network degrades, a shallow network achieves a better training result than a deep one. In that case, if we could pass the low-level features directly to the higher layers, the result should be at least no worse than that of the shallow network. For example, if the 98th layer of a hypothetical VGG-100 used exactly the same features as the 14th layer of VGG-16, then VGG-100 should perform no worse than VGG-16. We could achieve this by adding a direct mapping (identity mapping) between the 14th and the 98th layer of VGG-100.
From the perspective of information theory, because of the data processing inequality (DPI), the amount of image information contained in the feature maps can only decrease layer by layer during forward propagation as the network deepens. The direct mapping added by ResNet guarantees that the network at layer $l+1$ contains at least as much image information as the network at layer $l$.
Based on this idea of using a direct mapping to connect different layers of the network, the residual network came into being.
2. Residual Network
2.1. Residual Block
The residual network is composed of a series of residual blocks (Figure 1). A residual block can be expressed as:
$$x_{l+1} = x_l + F(x_l, W_l) \qquad (1)$$
(Figure 1: structure of a residual block)
In a convolutional network, $x_l$ and $x_{l+1}$ may have different numbers of feature maps, in which case a $1 \times 1$ convolution is needed to increase or reduce the dimensionality (Figure 2). The residual block is then expressed as:
$$x_{l+1} = h(x_l) + F(x_l, W_l) \qquad (2)$$
where $h(x_l) = W_l' x_l$ and $W_l'$ is a $1 \times 1$ convolution. Experiments show, however, that this $1 \times 1$ convolution brings only limited improvement in model performance, so it is generally used only when the dimensionality has to be increased or reduced.
A residual block is divided into two parts: the direct mapping part and the residual part. $h(x_l)$ is the direct mapping, corresponding to the straight line on the left in Figure 1; $F(x_l, W_l)$ is the residual part, which generally consists of two or three convolution operations, i.e. the branch containing the convolutions on the right side of Figure 1.
(Figure 2: a residual block with a $1 \times 1$ convolution on the shortcut for dimensionality matching)
This version of the residual block is generally called resnet_v1, and it can be implemented in Keras as follows:

import keras
from keras.layers import Conv2D, BatchNormalization, Activation

def res_block_v1(x, input_filter, output_filter):
    # residual branch: two 3x3 convolutions, each followed by batch normalization
    res_x = Conv2D(kernel_size=(3,3), filters=output_filter, strides=1, padding='same')(x)
    res_x = BatchNormalization()(res_x)
    res_x = Activation('relu')(res_x)
    res_x = Conv2D(kernel_size=(3,3), filters=output_filter, strides=1, padding='same')(res_x)
    res_x = BatchNormalization()(res_x)
    # direct mapping branch
    if input_filter == output_filter:
        identity = x
    else:  # dimensionality needs to be increased or reduced: use a 1x1 convolution
        identity = Conv2D(kernel_size=(1,1), filters=output_filter, strides=1, padding='same')(x)
    # add the two branches, then apply the final activation
    x = keras.layers.add([identity, res_x])
    output = Activation('relu')(x)
    return output
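
As mentioned above, the residual part may also consist of three convolutions. Below is a minimal sketch of such a three-convolution ("bottleneck") block, written here purely as an illustration rather than taken from the original post; it reuses the Keras layers imported above, and the factor of 4 used for the inner channel reduction is an assumption.

def res_block_bottleneck(x, input_filter, output_filter):
    # residual branch: 1x1 reduce, 3x3 convolve, 1x1 restore (bottleneck pattern)
    res_x = Conv2D(kernel_size=(1,1), filters=output_filter // 4, strides=1, padding='same')(x)
    res_x = BatchNormalization()(res_x)
    res_x = Activation('relu')(res_x)
    res_x = Conv2D(kernel_size=(3,3), filters=output_filter // 4, strides=1, padding='same')(res_x)
    res_x = BatchNormalization()(res_x)
    res_x = Activation('relu')(res_x)
    res_x = Conv2D(kernel_size=(1,1), filters=output_filter, strides=1, padding='same')(res_x)
    res_x = BatchNormalization()(res_x)
    # direct mapping branch, with a 1x1 projection when the channel counts differ
    if input_filter == output_filter:
        identity = x
    else:
        identity = Conv2D(kernel_size=(1,1), filters=output_filter, strides=1, padding='same')(x)
    return Activation('relu')(keras.layers.add([identity, res_x]))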

2.2. Residual network
In the implementation, residual blocks are usually just stacked directly one after another:

from keras.layers import Flatten, Dense

def resnet_v1(x):
    # initial convolution before the residual blocks
    x = Conv2D(kernel_size=(3,3), filters=16, strides=1, padding='same', activation='relu')(x)
    # stacked residual blocks
    x = res_block_v1(x, 16, 16)
    x = res_block_v1(x, 16, 32)
    # classification head
    x = Flatten()(x)
    outputs = Dense(10, activation='softmax', kernel_initializer='he_normal')(x)
    return outputs
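
For reference, here is a minimal usage sketch (not part of the original post) showing how resnet_v1 might be wrapped into a trainable Keras model; the 32×32×3 input shape (e.g. CIFAR-10) and the training configuration are assumptions for illustration.

from keras.layers import Input
from keras.models import Model

# assumed input shape: 32x32 RGB images (e.g. CIFAR-10)
inputs = Input(shape=(32, 32, 3))
outputs = resnet_v1(inputs)
model = Model(inputs=inputs, outputs=outputs)
# assumed training configuration, purely illustrative
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()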

2.3. Why is it called a residual network
In statistics, residual and error are two easily confused concepts. The error measures the gap between the observed value and the true value, while the residual is the gap between the predicted value and the observed value. As for the naming of the residual network, the authors' explanation is that a layer of a network can usually be regarded as $y = H(x)$, while a residual block of the residual network can be expressed as $H(x) = F(x) + x$, that is, $F(x) = H(x) - x$. With the identity mapping, $y = x$ is the observed value and $H(x)$ is the predicted value, so $F(x)$ corresponds to the residual; hence the name residual network.
