ResNet (Residual Network) | Convolutional Neural Networks | Principles | A Beginner's Summary

 

Preface: This article is a summary written while the author was a beginner learning CNNs and ResNet. It touches on as much of the relevant background as possible, but many topics are not covered in detail; readers can pull out keywords to search on their own, or follow the reference links. If you are completely new to this, the video lessons and articles in the references are highly recommended. If there are any mistakes in the article, please leave a comment.

1 Convolutional Neural Network Basics

1.1 Traditional neural network and convolutional neural network

In a regular (fully connected) neural network, the input is a vector that is transformed through a series of hidden layers. Each hidden layer consists of several neurons, and each neuron is connected to every neuron in the previous layer.

If the input is an RGB image of size 200x200x3, then a single neuron in the first hidden layer that is fully connected to it already needs 200x200x3 = 120,000 weights plus a bias. In practice a network has multiple hidden layers with many neurons each, so this quickly produces a huge number of parameters. Full connectivity is therefore inefficient for images, and the large number of parameters quickly leads to overfitting.

(Figure: a traditional fully connected neural network)

The structure of a convolutional neural network is built on the assumption that the input data are images. By adding a few image-specific properties to the architecture, the forward pass becomes more efficient to implement and the number of parameters in the network is greatly reduced.

(Figure: a convolutional neural network)

1.2 Convolutional Neural Network Structure

1.2.1 Input layer

For image classification tasks, the input layer is an image of size H x W x C, where H is the height of the image, W is its width, and C is the number of channels. A grayscale image has 1 channel, and a color (RGB) image has 3.

1.2.2 Convolution layer

The convolutional layer is the core of a convolutional neural network. It divides the image into many small regions and, region by region, applies a weighted sum to the original input to produce a feature map. Each element of a feature map is the result of the convolution at the corresponding position: the inner product of the kernel and the region (multiply corresponding positions and sum), plus a bias.

The figure shows the calculation performed by a single convolution kernel on an image with 3 channels (the +1 is the bias):
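A minimal sketch of this calculation in PyTorch (the tensor sizes below are illustrative assumptions, not taken from the figure):

```python
import torch
import torch.nn.functional as F

# A 3-channel 5x5 "image" and a single 3x3 kernel spanning all 3 channels.
x = torch.randn(1, 3, 5, 5)          # (batch, channels, height, width)
kernel = torch.randn(1, 3, 3, 3)     # (out_channels, in_channels, kH, kW)
bias = torch.tensor([1.0])           # the "+1" bias term

# Each output element is the inner product of the kernel with one 3x3x3 region, plus the bias.
feature_map = F.conv2d(x, kernel, bias=bias, stride=1, padding=0)
print(feature_map.shape)             # torch.Size([1, 1, 3, 3])

# The same value computed by hand for the top-left position:
manual = (x[0, :, 0:3, 0:3] * kernel[0]).sum() + bias
print(torch.allclose(manual, feature_map[0, 0, 0, 0]))  # True
```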

A convolutional layer has the following parameters (an output-size sketch follows the list):

Stride S: the step size of the sliding window. The smaller the stride, the more positions are visited, the larger the resulting feature map, and the finer the extracted features.

Kernel size K: the size of the region selected at each position, i.e. how large a patch of the input each output value is computed from.

Padding P: because of the stride, interior pixels fall inside more windows and therefore contribute more to the output, while pixels near the border contribute less. Adding a ring of zeros around the input (zero padding) compensates for this loss of boundary features.

Number of convolution kernels: equal to the number of feature maps (output channels) produced.
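Together these parameters determine the size of the output feature map. A small sketch, assuming the usual output-size formula O = (H - K + 2P) / S + 1 (the concrete numbers are illustrative):

```python
import torch
import torch.nn as nn

def conv_output_size(h, k, s, p):
    """Output height/width for input size h, kernel k, stride s, padding p."""
    return (h - k + 2 * p) // s + 1

# 3-channel 32x32 input, 16 kernels of size 3x3, stride 1, padding 1.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
x = torch.randn(1, 3, 32, 32)
print(conv(x).shape)                         # torch.Size([1, 16, 32, 32])
print(conv_output_size(32, k=3, s=1, p=1))   # 32 -> this padding preserves the spatial size
```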

1.2.3 Pooling layer

The pooling operation summarizes a neighborhood of the input with an overall statistic and uses it as the output for that position; in other words, it compresses the feature map. Reducing the amount of data passed forward in this way cuts the consumption of computing resources and helps control overfitting.

The main variants are max pooling and average pooling. The pooling layer's parameters are the pooling window size and the pooling stride.
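A minimal sketch of both variants in PyTorch (the 2x2 window with stride 2 is an illustrative choice):

```python
import torch
import torch.nn.functional as F

x = torch.arange(16.0).reshape(1, 1, 4, 4)       # one 4x4 feature map

# 2x2 window, stride 2: each output value summarizes one non-overlapping 2x2 region.
print(F.max_pool2d(x, kernel_size=2, stride=2))  # keeps the maximum of each region
print(F.avg_pool2d(x, kernel_size=2, stride=2))  # keeps the mean of each region
# Both outputs have shape (1, 1, 2, 2): the feature map is compressed to a quarter of its size.
```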

1.2.4 Batch Normalization Layer

During preprocessing we normalize the input images to speed up convergence. After the convolution, however, the feature maps follow a new distribution, and continuing to compute on them can cause gradients to vanish. Batch Normalization is therefore introduced so that each feature map follows a distribution with mean 0 and variance 1.
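A minimal sketch of what the layer does (the channel count of 16 is an illustrative assumption):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=16)    # one mean/variance pair per channel
x = torch.randn(8, 16, 32, 32) * 5 + 3  # feature maps with mean ~3 and std ~5

y = bn(x)                               # normalized per channel over the batch
print(y.mean().item(), y.std().item())  # close to 0 and 1 (before the learnable scale/shift move them)
```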

1.2.5 Activation function and fully connected layer

The activation function introduces non-linearity into the model; in convolutional neural networks ReLU is most common, i.e. f(u) = max(0, u). The fully connected layer acts as the classifier: the layers described above extract features from the input, and the fully connected layer maps those features to the final classification.
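A minimal sketch of a ReLU followed by a fully connected classifier head (the feature size and the 10 classes are illustrative assumptions):

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.ReLU(),                  # f(u) = max(0, u): zero out negative activations
    nn.Flatten(),               # flatten the feature map into a vector
    nn.Linear(64 * 7 * 7, 10),  # fully connected layer mapping features to 10 class scores
)

features = torch.randn(1, 64, 7, 7)   # e.g. the output of the convolutional part
print(head(features).shape)           # torch.Size([1, 10])
```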

2 ResNet solves the network depth problem

In computer vision, network depth matters: deeper networks can extract richer, higher-level features. Before ResNet, however, deep networks were hard to train and their results were disappointing; beyond a certain depth, making the network deeper actually made it perform worse.

The figure below shows the statistics the ResNet authors obtained with plain (non-residual) networks. Compared with the 20-layer network, the 56-layer network has higher error on both the training set and the test set. At these depths, batch normalization and careful gradient descent already largely solve the problems of vanishing gradients and slow convergence, so this is clearly not overfitting caused by vanishing gradients: the deeper model is worse even on the training data.

This phenomenon is called network degradation: as depth increases, accuracy saturates and then drops rapidly. To address it, Kaiming He and colleagues proposed the residual network, ResNet, and won the 2015 ImageNet image recognition challenge; on CIFAR-10 they even trained networks with more than 1000 layers.

2.1 Nested and non-nested functions

The blue five-pointed star in the figure marks the optimum. Each closed region Fi represents a class of functions; within a region, the best model that can be found is a point in the region, and its distance to the optimum measures how good that model class is.

The upper figure shows non-nested function classes: as complexity increases, the region grows, but the best model found inside it may end up farther from the optimum. Intuitively, the model has gone astray. The lower figure shows nested function classes: every time the complexity increases, the new region contains the region of the original function class. In other words, increasing complexity only enlarges the region around the original one, without moving away from it.

For deep neural networks, if the newly added layers can be trained to represent the identity function f(x) = x, then the new model is at least as effective as the original one; and because the new model may also find a better fit to the training data, adding layers should not make performance worse.

3 ResNet core idea and structure

Suppose the mapping we want to learn is f(x), where x is the output of the shallower layers and the dashed box marks the newly added layers. Traditionally, the new layers would learn f(x) directly. In ResNet, the new layers instead learn f(x) - x, and x is added back at the output so that the block still produces f(x). As a result, the deeper network's training error should be no worse than that of the shallower network.

In other words, f(x) is the ideal mapping and f(x) - x is the residual mapping. The residual mapping is easier to optimize in practice, and when the ideal mapping f(x) is very close to the identity, the residual mapping makes it easy to capture small fluctuations around the identity.

In addition, if the identity mapping f(x) = x is itself the ideal mapping to be learned, it suffices to drive the weights and biases of the operations inside the dashed box to zero. Within a residual block, the input can also propagate forward more directly through the cross-layer shortcut connection.
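Written as a formula (following the notation of the ResNet paper, where F is the residual mapping learned by the layers in the dashed box with weights {W_i}):

$$ y = F(x, \{W_i\}) + x, \qquad F(x, \{W_i\}) = f(x) - x $$

so driving F to zero recovers the identity mapping exactly.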

3.1 Residual block details

A residual block contains two 3x3 convolutional layers with the same number of output channels, each followed by a batch normalization layer and a ReLU activation. The cross-layer shortcut skips these two convolutions, and the input is added back just before the final ReLU.

This design requires the output of the two convolutional layers to have the same shape as the input, so that the result of the second convolution can be added to the original input element-wise.

If the number of channels needs to change, an extra 1x1 convolutional layer is introduced to transform the input into the required shape before the addition; this is the right-hand figure above. A 1x1 convolution does nothing in the spatial dimensions and works mainly on the channel dimension, so choosing one whose output channels are twice its input channels makes the shortcut's input and output match. In ResNet, whenever the number of channels is doubled, the height and width of the input are halved, so the stride of this convolution is set to 2 to make height, width, and channels all match. Besides this projection shortcut there is also a zero-padding option; the ResNet authors compared these schemes experimentally. A sketch of such a block follows below.
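A minimal sketch of this residual block in PyTorch, with the optional 1x1 projection shortcut for the channel-doubling, stride-2 case (an illustrative reimplementation, not the torchvision source):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Two 3x3 conv layers with BN; the input is added back before the final ReLU."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # 1x1 projection shortcut when the shape changes (stride 2 and/or more channels).
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))   # add the input before the final ReLU

x = torch.randn(1, 64, 56, 56)
print(BasicBlock(64, 64)(x).shape)              # unchanged shape: torch.Size([1, 64, 56, 56])
print(BasicBlock(64, 128, stride=2)(x).shape)   # channels doubled, H and W halved: [1, 128, 28, 28]
```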

3.2 Residual structure of various layers

Different ResNet variants are obtained by configuring different numbers of channels and different numbers of residual blocks in each stage.

In the table above, a residual block marked "× N" means that the residual structure is repeated N times.

The figure below shows the detailed structure of the 18-layer ResNet, where k, s, and p are the kernel size, stride, and padding of each convolutional layer. Solid-line shortcuts correspond to the left-hand block in Section 3.1, and dashed-line shortcuts to the right-hand block.
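For reference, the same 18-layer architecture is available in torchvision, which makes it easy to check shapes and parameter counts (this assumes torchvision is installed; the model below is randomly initialized):

```python
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=1000)     # the standard 18-layer ResNet
x = torch.randn(1, 3, 224, 224)        # one 224x224 RGB image
print(model(x).shape)                  # torch.Size([1, 1000])
print(sum(p.numel() for p in model.parameters()))  # about 11.7 million parameters
```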

3.2.1 Deeper bottleneck design

As the first table in this section shows, once the depth reaches 50 layers or more, a residual block with a slightly different structure is used, shown below.

To go deeper, one would like to increase the number of channels, but doing so raises the computational cost. The bottleneck design first projects the 256 input channels down to 64 with a 1x1 convolution, then applies the 3x3 convolution at this reduced width, and finally projects back up to 256 channels with another 1x1 convolution so that the input and output channel counts match and can be added. In effect, it reduces the feature dimension, does the spatial work in the reduced space, and then projects back. Although the channel count is four times that of the basic block, the two designs have similar computational complexity. In addition, 1x1 convolutions also serve to fuse features across channels.
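A minimal sketch of this bottleneck block (an illustrative reimplementation using the 256 -> 64 -> 64 -> 256 channel counts from the text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """1x1 conv to shrink channels, 3x3 conv in the reduced space, 1x1 conv to expand back."""

    def __init__(self, channels=256, reduced=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, reduced, kernel_size=1, bias=False)          # 256 -> 64
        self.bn1 = nn.BatchNorm2d(reduced)
        self.conv = nn.Conv2d(reduced, reduced, kernel_size=3, padding=1, bias=False)  # 3x3 at 64 channels
        self.bn2 = nn.BatchNorm2d(reduced)
        self.expand = nn.Conv2d(reduced, channels, kernel_size=1, bias=False)          # 64 -> 256
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.reduce(x)))
        out = F.relu(self.bn2(self.conv(out)))
        out = self.bn3(self.expand(out))
        return F.relu(out + x)   # input and output both have 256 channels, so they can be added

x = torch.randn(1, 256, 56, 56)
print(Bottleneck()(x).shape)     # torch.Size([1, 256, 56, 56])
```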

3.3 How ResNet handles vanishing gradients

First, for the parameter update (blue part of the figure), we want the derivative of y with respect to the weights w not to be too small; otherwise, even increasing the learning rate does not help.

For a plain network (purple part), the chain rule shows that the derivative of g(y) with respect to y is tied to the difference between the predicted value and the true value. Once the model's predictions are already fairly good, this factor becomes very small, and because the gradients of successive layers are multiplied together, the gradient vanishes and the network cannot be made deeper.

For ResNet (green part), even if the derivative of y' with respect to w is small, the gradient does not vanish, because the shortcut contributes its gradient through addition rather than multiplication. This is why ResNet can be trained up to 1000 layers. As for why degradation reappears beyond 1000 layers, the cause is no longer the difficulty of optimizing the structure but overfitting: the network is so deep and expressive that a small dataset can no longer support such a strong residual structure.
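A minimal sketch of this argument, assuming a single residual block y = x + F(x) and a loss L (the symbols here are generic, not taken from the figure):

$$ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial x} = \frac{\partial L}{\partial y}\left(1 + \frac{\partial F(x)}{\partial x}\right) = \frac{\partial L}{\partial y} + \frac{\partial L}{\partial y}\,\frac{\partial F(x)}{\partial x} $$

Even when the second term is close to zero, the first term passes the gradient through the shortcut unattenuated, so stacking many blocks does not multiply together many small factors.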

4 Summary

ResNet makes deep networks easier to train: the residual connection guarantees that the deeper network is at least no worse than the shallower one, and it keeps the gradient from vanishing so that training can keep making progress. In theory a plain network could learn the same mapping without residual connections, but in practice it cannot, because nothing guides it there. ResNet guides the whole network so that it does not go astray, which makes it possible to train deeper and better models.

References

  1. [What is a CNN?] Convolutional neural network CNN from entry level to practice (video), bilibili: https://www.bilibili.com/video/BV1zF411V7xu/
  2. Hands-on Deep Learning online course (Mu Li), course schedule: https://courses.d2l.ai/zh-v2/
  3. Paragraph-by-paragraph intensive reading of the ResNet paper (video), bilibili: https://www.bilibili.com/video/BV1P3411y7nn/
  4. ResNet paper translation and interpretation, Autumn's Feng'er's blog, CSDN
  5. An article explaining the principle of Batch Normalization and how it works, CSDN
  6. ResNet detailed explanation, qq_45649076's blog, CSDN: https://blog.csdn.net/qq_45649076/article/details/120494328
  7. [Official bilingual] Gradient descent in deep learning, Part 2 (video), bilibili
  8. Principles of convolutional neural networks, CSDN: https://blog.csdn.net/hjskj/article/details/123683095


Source: https://blog.csdn.net/qq_54499870/article/details/127011866