ResNet model principle

The main differences between ResNet and VGG:

1. ResNet has a deeper network structure than VGG.

2. Compared with VGG, ResNet introduces residual (shortcut) connections.

3. ResNet introduces BatchNorm layers, which make it possible to train a much deeper network.

4. ResNet uses convolutional layers with stride=2 for downsampling instead of the pooling layers used in VGG (see the sketch after this list).

5. A distinctive design rule of ResNet is that whenever the feature map size is halved, the number of channels in the convolutional layers is doubled.
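As a quick illustration of points 4 and 5 (this PyTorch snippet and its layer sizes are my own example, not code from the original post), a stride-2 convolution halves the spatial size just like a pooling layer, while also doubling the channel count:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)

# VGG-style downsampling: a pooling layer halves the spatial size
vgg_down = nn.MaxPool2d(kernel_size=2, stride=2)

# ResNet-style downsampling: a stride-2 convolution halves the spatial size
# and, following point 5, doubles the number of channels at the same time
resnet_down = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)

print(vgg_down(x).shape)     # torch.Size([1, 64, 28, 28])
print(resnet_down(x).shape)  # torch.Size([1, 128, 28, 28])
```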


The ResNet network was proposed by Kaiming He and colleagues at Microsoft Research in 2015. It won first place in the classification and object detection tasks of that year's ImageNet competition, and first place in object detection and segmentation on the COCO dataset.

ResNet is built by modifying VGG: skip connections (shortcut connections) are introduced to avoid phenomena such as vanishing and exploding gradients.

1. What is a Skip Connection?

A Skip Connection is a way of connecting nodes across different layers of a deep neural network. In a traditional network, the signal travels from the input layer to the output layer, and the output of each hidden layer passes through an activation function before being fed to the next layer. With a Skip Connection, the signal of the current layer is additionally passed to a deeper layer, that is, it "skips" the intermediate layers. This cross-layer connection speeds up information transmission, avoids vanishing gradients, and retains more information.
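In code, a skip connection is simply an element-wise addition of a layer's input to its output. A minimal PyTorch sketch (my own toy example):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipBlock(nn.Module):
    """Toy block: the input x 'skips' over the convolution and is added back."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return F.relu(self.conv(x) + x)   # skip connection: output = f(x) + x

block = SkipBlock(16)
y = block(torch.randn(1, 16, 32, 32))    # same shape in, same shape out
```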

2. The advantages of Skip Connection

For deep neural networks, the advantages of Skip Connection are as follows:

1. Alleviating vanishing gradients
As the number of layers grows, the vanishing gradient problem becomes more severe, making it hard to update deep layers effectively; training may even stall completely. A Skip Connection retains more information and lets the gradient propagate across layers through the cross-layer connection, which effectively mitigates the vanishing gradient problem.

2. Accelerating model training
Because a Skip Connection passes the signal directly to a deeper layer without going through all the intermediate layers, it shortens the transmission path of the network, speeding up information flow and the training of the whole model.

3. Improving the generalization ability of the model
In training some deep neural networks, differences between the training set and the test set lead to overfitting. Adding Skip Connections retains more information, which enhances the generalization ability of the model and reduces the risk of overfitting.

The residual network is built on top of BN, and what each block fits is the residual (the difference from the identity) rather than the full mapping. The advantage is that the weights respond more sensitively to changes near the solution, making it easier to learn the optimal solution.

ResNet

Why is the depth of the network important?

Because a CNN extracts low-, mid-, and high-level features, the more layers the network has, the richer the features it can extract at different levels. Moreover, features extracted by deeper layers are more abstract and carry more semantic information.

Why can't we simply increase the number of network layers?

For a plain network, simply increasing the depth leads to vanishing or exploding gradients.

The standard remedies for this are normalized initialization and intermediate normalization layers (Batch Normalization), which make it possible to train networks with tens of layers.

Although the above techniques make training possible, another problem appears: degradation. As the number of layers increases, the accuracy on the training set saturates and then even drops. This cannot be explained by overfitting, because an overfitted model should perform better, not worse, on the training set.
The degradation problem shows that deep networks cannot simply be optimized well.
The authors ran an experiment: they built a deep model from a shallow network plus identity mappings y = x. The resulting deep model did not achieve an error rate equal to or lower than that of the shallow network, which suggests that degradation arises because deep networks are hard to train, that is, the solver has difficulty using many layers to fit even an equivalent (identity) function.

How to solve the degradation problem?

Deep residual networks. If the later layers of a deep network are identity mappings, the model degenerates into a shallow network, so what we need is a way to learn the identity mapping. However, it is difficult for a stack of layers to directly fit a potential identity mapping H(x) = x, which may be the reason why deep networks are hard to train. If instead the block is designed as H(x) = F(x) + x, as shown below, we can switch to learning the residual function F(x) = H(x) - x. As long as F(x) = 0, the block is an identity mapping H(x) = x, and fitting the residual is much easier.
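A tiny numerical check of this point (my own example, not from the original post): if the weights of the residual branch are zero, the block reduces exactly to the identity.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(8, 8, kernel_size=3, padding=1)
nn.init.zeros_(conv.weight)     # make the residual branch F(x) = 0
nn.init.zeros_(conv.bias)

x = torch.randn(1, 8, 16, 16)
h = conv(x) + x                 # H(x) = F(x) + x = x: an identity mapping
print(torch.allclose(h, x))     # True
```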

ResNet model principle

The VGG network has strong feature representation ability, but training a very deep network is difficult. To address this, researchers proposed a series of training techniques, such as Dropout and Batch Normalization.

In 2015, Kaiming He proposed the Residual Network (ResNet) to reduce the difficulty of training and to address the vanishing gradient problem.

[Figure 1: Gradient vanishing]

ResNet lets the CNN learn a residual mapping by introducing skip connections. The residual (Bottleneck) structure is shown in Figure 2.

[Figure 2: Residual structure]

In the residual structure of Figure 2, the input x first passes through a 1 x 1 convolution with 64 kernels, then a 3 x 3 convolution, and finally a 1 x 1 convolution with 256 kernels, so the number of channels first shrinks and then expands. The output of the block is H(x). If no skip branch is introduced, H(x) = F(x), and according to the chain rule the gradient with respect to x becomes smaller and smaller. After introducing the branch, H(x) = F(x) + x; differentiating with respect to x, the local gradient of the identity path is 1, so when the gradient is backpropagated it does not vanish.
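Writing out the chain rule mentioned above (this short derivation is mine, not from the original figure): for a residual block H(x) = F(x) + x and a loss L,

```latex
\frac{\partial H}{\partial x} = \frac{\partial F}{\partial x} + 1
\quad\Longrightarrow\quad
\frac{\partial L}{\partial x}
  = \frac{\partial L}{\partial H}\left(\frac{\partial F}{\partial x} + 1\right)
  = \frac{\partial L}{\partial H}\,\frac{\partial F}{\partial x}
  + \frac{\partial L}{\partial H}
```

Even if the first term becomes tiny in a very deep network, the second term passes the upstream gradient through unchanged, so the gradient flowing through the shortcut does not vanish.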

Figure 3 shows the structure of ResNet, with the architectural details of the 18-, 34-, 50-, 101-, and 152-layer variants. "x 2" and "x 23" in the figure indicate that the corresponding block is repeated 2 or 23 times. We can see that all of the networks are divided into 5 parts: conv1, conv2_x, conv3_x, conv4_x, and conv5_x.

[Figure 3: ResNet structure]

Conv1 in Figure 3 uses a 7 x 7 convolution kernel. With the same number of channels, a 7 x 7 kernel requires more parameters and computation than a 3 x 3 kernel; when the number of channels is small, however, a large kernel is still affordable.

When the input of the first convolutional layer has 3 channels, three stacked 3 x 3 convolutions have the same receptive field as one 7 x 7 convolution, but for equal channel counts the single 7 x 7 kernel has considerably more parameters than the three 3 x 3 kernels. Figure 4 compares the parameter computation of the 19-layer VGG and the 34-layer ResNet: the total computation of ResNet 34, which uses a 7 x 7 kernel only in its first layer, is still much smaller than that of VGG 19 with its stacks of 3 x 3 kernels.
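A quick back-of-the-envelope check of the kernel comparison (the channel count here is an arbitrary example of mine):

```python
# Weight count of a convolution: k * k * C_in * C_out (bias ignored)
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

c = 64
print(conv_params(7, c, c))       # one 7 x 7 conv:    49 * 64 * 64 = 200704
print(3 * conv_params(3, c, c))   # three 3 x 3 convs: 27 * 64 * 64 = 110592
```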

[Figure 4: Calculation of parameters]

The output sizes of conv2_x and conv3_x in Figure 3 are 56 x 56 and 28 x 28 respectively. If a skip connection goes from conv2_x to conv3_x, the feature maps cannot be added directly because their dimensions differ. In that case the shortcut uses a convolution to make the dimensions consistent, so that the feature maps can be added.
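A sketch of such a shortcut (the channel and stride values below are chosen to match the conv2_x to conv3_x transition in ResNet 18/34; this is my illustration, not the original code): a 1 x 1 convolution with stride 2 matches both the spatial size and the channel count.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)           # a conv2_x feature map
shortcut = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),  # 1 x 1 conv, stride 2
    nn.BatchNorm2d(128),
)
print(shortcut(x).shape)                 # torch.Size([1, 128, 28, 28]) - now addable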

FLOPs (floating-point operations) in the last row of Figure 3 is the number of floating-point operations, which measures the complexity of the model; it is determined by the weights and biases. Let the height, width, and number of channels of the input feature map be H_in, W_in, and D_in; the height, width, and number of channels of the output feature map be H_out, W_out, and D_out; and the height and width of the convolution kernel be F_h and F_w. N_p denotes the amount of computation for one point of the output feature map, and is given by:

N_p = F_h x F_w x D_in + 1 (one multiply-accumulate per weight, plus one operation for the bias)

The formula for calculating the FLOPs of a convolution is as follows:

FLOPs = N_p x H_out x W_out x D_out

For the fully connected layer, the input feature map will be stretched into a vector of 1 x N_in, and the output vector dimension will be 1 x N_out. The formula for calculating the FLOPs of a fully connected layer is as follows:

FLOPs = (N_in + 1) x N_out

In PyTorch, the complexity of a network can be calculated with a FLOPs-counting toolkit (for example, thop or ptflops).
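The formulas above can also be applied by hand. A minimal sketch (the layer sizes in the example are mine):

```python
# FLOPs for one point of the output feature map: N_p = F_h * F_w * D_in + 1 (weights + bias)
def conv_flops(f_h, f_w, d_in, h_out, w_out, d_out):
    n_p = f_h * f_w * d_in + 1
    return n_p * h_out * w_out * d_out

# Fully connected layer: each output needs N_in multiply-accumulates plus one bias
def fc_flops(n_in, n_out):
    return (n_in + 1) * n_out

# conv1 of ResNet: 7 x 7 kernel, 3 -> 64 channels, 224 x 224 input -> 112 x 112 output
print(conv_flops(7, 7, 3, 112, 112, 64))   # 118816768, about 1.2e8
# final fully connected layer of ResNet 18/34: 512 -> 1000 classes
print(fc_flops(512, 1000))                 # 513000
```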

[Figure 5: FLOPs of the ResNet 34 and VGG 16 networks]

ResNet code reproduction

The ResNet network takes VGG 19 as its reference and modifies it. The main changes are that ResNet uses stride=2 convolutions directly for downsampling and replaces the fully connected layers with a global average pooling layer.

ResNet uses two kinds of residual structures, as shown in Figure 6 below. The one on the left corresponds to the shallow networks: when the input and output dimensions match, the input can be added directly to the output. The one on the right corresponds to the deep networks: 1 x 1 convolutions are used to first reduce and then restore the number of channels, and when the dimensions do not match (for example, when the number of channels doubles), the shortcut itself also uses a 1 x 1 convolution.

[Figure 6: The two residual structures]

The two residual structures can be implemented as follows. class BasicBlock(nn.Module) is the residual unit of the shallow networks ResNet 18/34:

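A sketch of BasicBlock in the usual torchvision style (variable names are mine and may differ from the original listing):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Residual unit for ResNet 18/34: two 3 x 3 convolutions plus a shortcut."""
    expansion = 1

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsample = downsample  # 1 x 1 conv used when dimensions change

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity               # the skip connection
        return self.relu(out)
```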

class BottleNeck(nn.Module) is the residual unit of the deep networks ResNet 50/101/152:

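A corresponding sketch of BottleNeck, continuing from the imports in the BasicBlock sketch above (again my naming, not necessarily the author's):

```python
class BottleNeck(nn.Module):
    """Residual unit for ResNet 50/101/152: 1 x 1 reduce, 3 x 3, 1 x 1 expand."""
    expansion = 4  # the last 1 x 1 conv outputs 4x as many channels

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv3 = nn.Conv2d(out_channels, out_channels * self.expansion,
                               kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity
        return self.relu(out)
```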

The overall structure of ResNet is as follows:

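A sketch of the overall network, reusing the BasicBlock/BottleNeck sketches above (the layer names follow the conv2_x ... conv5_x stages of Figure 3; this is an illustrative reconstruction, not the author's exact code):

```python
class ResNet(nn.Module):
    """Overall network: conv1, four residual stages, global average pool, fc."""

    def __init__(self, block, layers, num_classes=1000):
        super().__init__()
        self.in_channels = 64
        # conv1: 7 x 7, stride 2, followed by a 3 x 3 max pool (stride 2)
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        # conv2_x .. conv5_x
        self.layer1 = self._make_layer(block, 64, layers[0], stride=1)
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        # global average pooling replaces VGG's large fully connected layers
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

    def _make_layer(self, block, out_channels, blocks, stride):
        downsample = None
        if stride != 1 or self.in_channels != out_channels * block.expansion:
            # 1 x 1 conv shortcut when spatial size or channel count changes
            downsample = nn.Sequential(
                nn.Conv2d(self.in_channels, out_channels * block.expansion,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels * block.expansion),
            )
        layers = [block(self.in_channels, out_channels, stride, downsample)]
        self.in_channels = out_channels * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.in_channels, out_channels))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.maxpool(self.relu(self.bn1(self.conv1(x))))   # conv1 + downsampling pool
        x = self.layer1(x)   # conv2_x
        x = self.layer2(x)   # conv3_x
        x = self.layer3(x)   # conv4_x
        x = self.layer4(x)   # conv5_x
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        return self.fc(x)
```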

The forward() method in the ResNet class defines how data flows through the network:

(1) After the data enters the network, it first goes through the convolution layer (conv1) and is then downsampled by a pooling layer (f1);

(2) it then passes through the intermediate convolution stages (conv2_x, conv3_x, conv4_x, conv5_x);

(3) finally, the data goes through average pooling (avgpool) and a fully connected layer (fc) to produce the result.

The intermediate convolution part corresponds to the stages conv2_x through conv5_x; [2, 2, 2, 2] and [3, 4, 6, 3] are the numbers of times the block is repeated in each of these stages (for ResNet 18 and ResNet 34, respectively).


The differences between ResNet 18 and the other networks in the ResNet family lie mainly in conv2_x ~ conv5_x; the other components are essentially the same.
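Putting the pieces together, the block counts from Figure 3 give each variant (continuing the sketch above; the helper names are mine):

```python
def resnet18():  return ResNet(BasicBlock, [2, 2, 2, 2])
def resnet34():  return ResNet(BasicBlock, [3, 4, 6, 3])
def resnet50():  return ResNet(BottleNeck, [3, 4, 6, 3])
def resnet101(): return ResNet(BottleNeck, [3, 4, 23, 3])
def resnet152(): return ResNet(BottleNeck, [3, 8, 36, 3])

# quick shape check
model = resnet18()
print(model(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 1000])
```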

