Residual Network—ResNet

ResNet-34

[Figure: 34-layer ResNet architecture]

In the schematic of the 34-layer ResNet: the network begins with a convolutional layer followed by a pooling layer; the blocks with the curved connecting lines are residual structures, and the bulk of the 34-layer ResNet is a stack of such residual structures. The network ends with an average-pooling layer and a fully connected layer, which serves as the output layer. The overall structure is very simple, essentially a stack of residual blocks.


Some highlights of the ResNet structure:

  • Ultra-deep network structure (experiments exceeding 1,000 layers)
  • Introduces the residual module
  • Uses Batch Normalization (BN) to speed up training

Would a simple stack of convolutional and pooling layers work?

In the left plot in the paper, a plain 20-layer network reaches a training error of roughly 1%–2%, but when the depth increases to 56 layers the training error rises to 7%–8%. Clearly, simply adding more convolutional and pooling layers does not work.

What is the reason for the poor training results?

In the paper, the authors point to two problems. First, as the number of layers increases, vanishing and exploding gradients become more and more pronounced. Make an assumption: suppose each layer's gradient factor is a number less than 1. During backpropagation, each step backward multiplies the accumulated gradient by another factor less than 1. As the network gets deeper and deeper, this product of factors less than 1 gets closer and closer to 0. The gradient thus becomes smaller and smaller, causing it to vanish.

Conversely, if each layer's gradient factor is a number greater than 1, then every backpropagation step multiplies by a number greater than 1. As the network gets deeper and deeper, the product of factors greater than 1 grows without bound, the gradient becomes larger and larger, and a gradient explosion results.
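This multiplicative effect is easy to check numerically. A toy sketch (the 0.9 and 1.1 per-layer factors are made-up illustrative values, not numbers from the paper):

```python
# Hypothetical per-layer gradient factors (illustrative values only).
factor_small = 0.9   # each layer scales the gradient by a number < 1
factor_large = 1.1   # each layer scales the gradient by a number > 1

depth = 56  # depth of the deeper plain network discussed above

vanish = factor_small ** depth
explode = factor_large ** depth

print(f"0.9^56 = {vanish:.5f}")   # shrinks toward 0 -> vanishing gradient
print(f"1.1^56 = {explode:.1f}")  # grows without bound -> exploding gradient
```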

How can vanishing and exploding gradients be solved?

They are usually addressed by standardizing the input data, careful weight initialization, and Batch Normalization (BN).

In the paper, the authors also raise another problem: the degradation problem.

Even after vanishing and exploding gradients are handled, the error still grows as more layers are added. How does the paper solve this degradation problem?

The residual structure is proposed, and it solves the degradation problem. In the right-hand figure, the solid lines show the validation-set error rate and the dotted lines the training-set error rate. Looking at the validation error: as the number of layers increases, the error rate keeps decreasing, so deeper networks now perform better.

Residual module

It is precisely the residual module that makes much deeper networks possible to build.

The picture on the left shows the residual structure used in the shallower networks (such as ResNet-34), and the picture on the right shows the residual structure used in the deeper networks.

Look first at the structure on the left. On the main branch, the input features pass through two 3×3 convolutional layers. Beside this main branch runs a shortcut directly from input to output. The whole block works as follows: the feature matrix produced by the main branch's convolutional layers is added element-wise to the input feature matrix (the two branches are summed dimension by dimension), and the sum then passes through the activation function. Note: the output feature matrices of the main branch and the shortcut must have the same shape.
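A minimal PyTorch sketch of this left-hand (basic) residual block, assuming equal input and output shapes so the identity shortcut can be added directly:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Main branch: two 3x3 convs; shortcut: identity, added before the final ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                              # shortcut branch
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                      # shapes must match exactly
        return self.relu(out)                     # activation after the addition

x = torch.randn(1, 64, 56, 56)
y = BasicBlock(64)(x)
print(y.shape)  # torch.Size([1, 64, 56, 56])
```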

The structure on the right differs from the one on the left in that 1×1 convolutional layers are added at the input and output. What do these two 1×1 convolutional layers do?

As the figure shows, the input matrix has depth 256. After the first convolutional layer (64 kernels of size 1×1), the height and width of the input matrix are unchanged, but the number of channels drops from 256 to 64 (the first 1×1 convolution performs dimensionality reduction). The third layer raises the channel count back to 256. At that point the output and input have the same dimensions and can be added.
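A sketch of this right-hand bottleneck block under the same assumption (256 input channels, reduced to 64 in the middle), again in PyTorch:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduces 256 -> 64, the 3x3 works at 64 channels, 1x1 restores 64 -> 256."""
    def __init__(self, channels=256, mid=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1, bias=False)   # dimension reduction
        self.conv3x3 = nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False)
        self.restore = nn.Conv2d(mid, channels, kernel_size=1, bias=False)  # dimension restoration
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.reduce(x))
        out = self.relu(self.conv3x3(out))
        out = self.restore(out)
        return self.relu(out + x)  # input and output are both 256-d, so they can be added

x = torch.randn(1, 256, 56, 56)
y = Bottleneck()(x)
print(y.shape)  # torch.Size([1, 256, 56, 56])
```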

How many parameters does the right-hand structure save compared with the left-hand one?

With 256 input channels, the structure on the left would have 1,179,648 weight parameters while the one on the right has only 69,632, so the bottleneck structure allows deeper networks to be built with far fewer parameters.
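The two counts can be verified with simple arithmetic (convolution weights only, biases and BN ignored), assuming 256 input and output channels for both structures:

```python
# Left: two 3x3 convolutions, 256 channels in and 256 channels out each.
left = 2 * (3 * 3 * 256 * 256)

# Right (bottleneck): 1x1 reduce to 64, 3x3 at 64 channels, 1x1 restore to 256.
right = (1 * 1 * 256 * 64) + (3 * 3 * 64 * 64) + (1 * 1 * 64 * 256)

print(left, right)  # 1179648 69632
```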

The picture given in the article:

[Figure: detailed 34-layer ResNet with solid- and dashed-line shortcuts]

In the 34-layer ResNet diagram, some shortcuts between layers are drawn with solid lines and others with dashed lines. What is the difference?

For the solid-line shortcuts, the input and output matrices have the same shape, so they can be added directly; for the dashed-line shortcuts, the input and output shapes differ. Comparing the two figures on the right shows why: the first 3×3 convolution with 128 kernels has stride 2, so the spatial size of the input drops from 56×56 to 28×28 (halved), while its 128 kernels raise the output depth to 128. On the shortcut branch, a 1×1 convolution with 128 kernels and stride 2 is added; it likewise halves the height and width of the input. Together these operations ensure that the main branch's output matrix has the same shape as the shortcut's output matrix.

For the deeper networks (the 50-, 101-, and 152-layer variants), the input matrix to this block is [56, 56, 256] and the output matrix is [28, 28, 512]. How does this work?

The 1×1 kernel of the first convolutional layer performs dimensionality reduction: it changes the depth of the input matrix to 128 without changing the height and width of the feature matrix. The second layer's 3×3 convolution, with stride 2, halves the height and width, giving 28×28×128. Finally, the last 1×1 convolutional layer raises the depth from 128 to 512.

So the function of the dashed-line shortcut is to change the height, width, and depth of the input matrix, while the solid-line shortcut leaves them unchanged. This is why the first block of conv3_x, conv4_x, and conv5_x uses the dashed-line structure: that first block must adjust the feature matrix from the previous stage to the height, width, and depth required by the current stage.
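A sketch of such a dashed-line (downsampling) block at the start of conv3_x in ResNet-34, assuming PyTorch; the stride-2 layers on both branches keep the two outputs the same shape so they can be added:

```python
import torch
import torch.nn as nn

# Main branch: the first 3x3 conv has stride 2 and 128 kernels, halving H and W
# and raising the depth from 64 to 128.
main = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(128),
)

# Dashed-line shortcut: a stride-2 1x1 conv with 128 kernels, so the shortcut's
# output matches the main branch's output in height, width, and depth.
shortcut = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(128),
)

x = torch.randn(1, 64, 56, 56)          # output of conv2_x
out = torch.relu(main(x) + shortcut(x))
print(out.shape)  # torch.Size([1, 128, 28, 28])
```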

BN

Using BN speeds up training, and with it the Dropout method is no longer needed. The purpose of BN is to make each channel of the feature maps for a batch of data have mean 0 and variance 1. This accelerates the training of the network and improves its accuracy.

When building networks earlier, the image data was preprocessed to satisfy a certain distribution, which speeds up training. However, the feature maps produced by, say, the first convolutional layer do not necessarily follow the desired distribution. The BN method therefore adjusts the distribution of every layer's feature maps so that they satisfy a distribution with mean 0 and variance 1.

Note: BN adjusts the distribution of each layer's feature maps over an entire batch of input data, so that each layer's feature maps (taken across the batch) satisfy the mean-0, variance-1 rule. It does not normalize the distribution of a single feature map on its own.

The standardization formulas first compute the mean and variance for each channel, where "each channel" means all the data in the same channel across the whole batch: μ = (1/m) Σ xᵢ and σ² = (1/m) Σ (xᵢ − μ)². The third formula gives the standardized value: x̂ᵢ = (xᵢ − μ) / √(σ² + ε). The fourth formula is a further adjustment, yᵢ = γ·x̂ᵢ + β, where γ rescales the variance of the data and β shifts its mean. These two parameters are learned through backpropagation, while the mean and variance are computed batch by batch.

Next, a concrete example:

Suppose the batch size is 2, so two input images produce two feature matrices: feature1 and feature2. To apply BN to these two feature matrices, first compute the mean and variance: for channel 1, compute the mean and variance over channel 1 of the entire batch, and likewise for channel 2. The resulting mean and variance are vectors, not single numbers, with the same length as the channel depth; for example, a mean of 1 corresponds to channel1 and a mean of 0.5 corresponds to channel2. The values of the feature matrices after BN are then obtained from the BN formula.
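This per-channel computation can be reproduced directly with NumPy (a hypothetical 2-image batch with 2 channels; the numbers are made up, not the ones in the article's figure):

```python
import numpy as np

# Batch of 2 images, 2 channels, 2x2 spatial: shape (N, C, H, W).
x = np.arange(16, dtype=np.float64).reshape(2, 2, 2, 2)

# Mean and variance are taken per channel, over all N*H*W values of that channel.
mean = x.mean(axis=(0, 2, 3), keepdims=True)   # one value per channel
var = x.var(axis=(0, 2, 3), keepdims=True)

eps = 1e-5
x_hat = (x - mean) / np.sqrt(var + eps)        # standardized values

gamma, beta = 1.0, 0.0                         # learnable in practice; initial values here
y = gamma * x_hat + beta

print(y.mean(axis=(0, 2, 3)))  # per-channel means, ~0
print(y.var(axis=(0, 2, 3)))   # per-channel variances, ~1
```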


Origin blog.csdn.net/upupyon996deqing/article/details/124862680