PyTorch Quick Start and Practice - 2, Classic Deep Learning Network Development

Column directory:
PyTorch Quick Start and Practice (image segmentation with UNet) - 0, Preface
PyTorch Quick Start and Practice - 1, Knowledge Preparation (introduction to the basics)
PyTorch Quick Start and Practice - 2, Classic Deep Learning Network Development
PyTorch Quick Start and Practice - 3, UNet Implementation
PyTorch Quick Start and Practice - 4, Network Training and Testing

This continues from PyTorch Quick Start and Practice - 1, Knowledge Preparation (introduction to the basics), Chapter 2.4.
If the network development history stayed there as a module, the four-level structure would be too large and the article far too long, which is inconvenient both to write and to read, so it gets its own article.

The tutorial modules are largely independent and can be read in any order, whether across different articles or different modules within one article.
Read however makes you happy. Either way, you'll go from "what is all this" to "heh, is that all".
The development history is not laid out in exhaustive detail, because it can be skipped even if you don't fully understand it.

You can read just ResNet and UNet.
Or even just UNet.

Reference: Summary of Xiaoqiang's DL column articles
Use the references below to skip over machine learning and get a feel for computer vision (doing image processing, I only care about CV!).
Some of the more important papers are covered below: AlexNet, VGG, GoogLeNet-1: Going Deeper with Convolutions, GoogLeNet-2: Batch Normalization, GoogLeNet-3, ResNet, FCN, UNet.
(I recommend getting hands-on with a network first and coming back to these papers afterwards; they are easier to understand that way. For example, I went straight for ResNet, reproduced a UNet, ran it end to end, and only then came back to this development history.) In any case, it is all convolution (the distribution of neighboring pixel values) extracting image features.

1 AlexNet

1.1 Significance

A milestone. It raised the curtain on convolutional neural networks dominating computer vision and accelerated the application of computer vision.

1.2 Structure

5 convolutional layers + 3 fully connected layers
[figure: AlexNet architecture]
An implementation-friendly network diagram is shown in the VGG chapter.
Reference: Must-read paper AlexNet
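To make the "5 convolutional layers + 3 fully connected layers" layout concrete, here is a minimal PyTorch sketch of an AlexNet-style network (my own simplification for illustration: LRN, dropout, and the original two-GPU channel grouping are omitted; channel sizes follow the common single-stream variant):

```python
import torch
import torch.nn as nn

class AlexNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(              # the 5 convolutional layers
            nn.Conv2d(3, 96, 11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
        )
        self.classifier = nn.Sequential(            # the 3 fully connected layers
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))

print(AlexNet()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```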

2 VGG

2.1 Significance

It opened the era of small convolution kernels: the 3×3 kernel became the mainstream, and VGG became a standard backbone for all kinds of image tasks.

2.2 Structure:

Replace 5×5 convolutional layers with stacks of 3×3 convolutional layers.
[figure: VGG architecture]

2.3 Evolution process:

[figure: VGG configurations A to E]

A: 11 layers
A-LRN: A plus an LRN layer
B: adds one 3×3 convolution to each of blocks 1 and 2
C: adds one 1×1 convolution to each of blocks 3, 4, and 5, showing that extra nonlinearity benefits the metrics
D: replaces the 1×1 convolutions in blocks 3, 4, and 5 with 3×3 convolutions
E: adds one more 3×3 convolution to each of blocks 3, 4, and 5

For reference, implementation-oriented network diagrams of AlexNet, VGG16, and VGG19 (from bottom to top):
[figure: AlexNet, VGG16, and VGG19 architectures]

2.4 Features

To increase the receptive field: a stack of two 3×3 convolutions is equivalent to one 5×5 convolution, and a stack of three 3×3 convolutions is equivalent to one 7×7.
Derive the equivalence yourself, or refer to the article above.
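A quick sanity check of the saving (a sketch; the channel count C=64 is an arbitrary assumption): two stacked 3×3 convolutions cover the same 5×5 receptive field with 2×3×3×C×C = 18C² weights instead of 25C², and the extra ReLU in between adds nonlinearity for free.

```python
import torch.nn as nn

C = 64
stack = nn.Sequential(                      # two 3x3 convs: 5x5 receptive field
    nn.Conv2d(C, C, 3, padding=1), nn.ReLU(),
    nn.Conv2d(C, C, 3, padding=1),
)
single = nn.Conv2d(C, C, 5, padding=2)      # one 5x5 conv: same receptive field

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(stack), count(single))          # 73856 vs 102464 (incl. biases)
```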

Reference: Must-read paper VGG

3 GoogleNet

Also from 2014, neck and neck with VGG; between them they took that year's championships.

  1. Uses 1×1 convolution to reduce dimensionality: the first convolutional neural network to use 1×1 convolutions and discard the fully connected layers, greatly reducing the parameter count. It opened the door to the now-widespread use of 1×1 convolution.
  2. Convolves at multiple scales simultaneously and then aggregates, opening the era of multi-scale convolution. (See the sketch after this list.)
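Here is a minimal sketch of the idea (the branch widths roughly follow the paper's inception(3a) block, but treat them as assumptions): 1×1 convolutions shrink the channel count before the expensive 3×3/5×5 paths, and the four branches are concatenated along the channel dimension.

```python
import torch
import torch.nn as nn

class MiniInception(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, 1)                      # 1x1 branch
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 96, 1), nn.ReLU(),
                                nn.Conv2d(96, 128, 3, padding=1))   # 1x1 reduce, then 3x3
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.ReLU(),
                                nn.Conv2d(16, 32, 5, padding=2))    # 1x1 reduce, then 5x5
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))            # pool, then 1x1 projection

    def forward(self, x):   # concatenate the multi-scale branches along channels
        return torch.cat([b(x) for b in (self.b1, self.b2, self.b3, self.b4)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(MiniInception(192)(x).shape)   # torch.Size([1, 256, 28, 28])
```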

Reference:
Going Deeper with Convolutions
In-depth understanding of GoogLeNet structure

4 GoogLeNet-V2 (the BN layer)

It is the BN layer mentioned in the previous article.

4.1 Significance

  1. Accelerated the development of deep learning
  2. Opened a new era of neural network design; the normalization layer became standard equipment in deep neural networks

4.2 Advantages

  1. A larger learning rate can be used, speeding up model convergence
  2. Weight initialization no longer needs to be designed so carefully
  3. Dropout can be dropped or reduced
  4. L2 regularization can be dropped, or a smaller weight decay used
  5. LRN (local response normalization) is no longer needed
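In PyTorch this shows up as the standard Conv-BN-ReLU pattern (a minimal sketch; the 64-channel width is arbitrary):

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1, bias=False),  # bias is redundant before BN
    nn.BatchNorm2d(64),   # per-channel normalization over the batch, then learned scale/shift
    nn.ReLU(inplace=True),
)
```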

Or, as the Batch Normalization study notes put it:

(1) You can choose a relatively large initial learning rate and let training speed soar. Previously you had to nudge the learning rate along slowly and, even halfway through training, keep wondering how much further to reduce it. Now you can start with a large learning rate and decay it aggressively, because this algorithm converges very fast. Even if you choose a small learning rate, it still converges faster than before, thanks to that fast-convergence property.
(2) You no longer have to fuss over the dropout and L2 regularization parameters used against overfitting. With BN you can remove both, or choose a smaller L2 constraint, because BN improves the network's generalization ability.
(3) You no longer need a local response normalization layer (LRN, the method used by AlexNet), because BN is itself a normalization layer.
(4) The training data can be shuffled completely (preventing particular samples from being selected over and over in each batch; the paper says this improves accuracy by 1%, a sentence that also puzzles me).

Reference:
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

5 GoogleNet-V3

I didn't study this one carefully; I'm just carrying it over here:
Rethinking the Inception Architecture for Computer Vision

5.1 Introduction to VGG and GoogLeNet V1 and V2

The VGG model is large, with many parameters and heavy computation, so it is not well suited to real-world scenarios.
GoogLeNet-V1 introduced multi-scale convolution kernels and the 1×1 convolution operation.
GoogLeNet-V2 added a BN layer on top of V1 and fully replaced 5×5 convolutions with stacks of two 3×3 convolutions, further improving model performance.

5.2 Comparison with V2

  1. Uses the RMSProp optimization method
  2. Uses the Label Smoothing regularization method
  3. Uses asymmetric convolutions on the 17×17 feature maps
  4. Uses an auxiliary classifier with BN

5.3 Innovation points

Key points and innovations:

• Asymmetric convolution factorization: reduces parameter count and computation, and offers a new idea for convolution structure design (see the sketch after this list)
• Efficient feature-map downsampling strategy: uses stride-2 convolution together with pooling to avoid an information representation bottleneck
• Label smoothing: keeps the network from becoming overconfident and reduces overfitting
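A sketch of the asymmetric factorization (channel count and kernel size n=7 are my assumptions for illustration): an n×n convolution is replaced by a 1×n convolution followed by an n×1 convolution, cutting the weights per channel pair from n² to 2n.

```python
import torch.nn as nn

C, n = 64, 7
full = nn.Conv2d(C, C, n, padding=n // 2)            # 7x7: 7*7*C*C weights
asym = nn.Sequential(                                # 1x7 then 7x1: 2*7*C*C weights
    nn.Conv2d(C, C, (1, n), padding=(0, n // 2)),
    nn.Conv2d(C, C, (n, 1), padding=(n // 2, 0)),
)
count = lambda m: sum(p.numel() for p in m.parameters() if p.dim() > 1)  # weights only
print(count(full), count(asym))                      # 200704 vs 57344
```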

6 ResNet

We have finally arrived at ResNet. Honestly, so far I have only properly read ResNet and UNet; I only looked at the earlier networks while writing these articles.

6.1 Significance

Another milestone in the history of modern convolutional neural networks: it broke through to thousand-layer-deep networks, and skip connections became standard equipment.

Personal understanding: a crutch!
It can prop up practically any network, making deep learning worthy of the name.
Like the BN layer, it strengthens whatever network it is added to. I think of the BN layer as a way of setting things right: if the distribution goes crooked during training, what can you do? Normalization straightens it back out.

6.2 Background

In a CNN, the input is the image matrix, which is also the most basic feature. The entire CNN is an information-extraction process, working step by step from low-level features up to highly abstract ones. The more layers the network has, the richer the hierarchy of abstract features it can extract; and the deeper the network, the more abstract and semantically rich those features become.
So we generally prefer a deeper network structure in order to obtain higher-level features.

6.3 Problems and Analysis

But deeper networks do not always give better results.

As network depth increases, accuracy saturates and then degrades: a 56-layer network performs even worse than a 20-layer one. This cannot be an overfitting problem, because the 56-layer network's training error is just as high.

And there are plenty of remedies for overfitting anyway, such as data augmentation, dropout, regularization, etc.
Beyond overfitting, deep networks also bring vanishing and exploding gradients, but that problem too can be solved by the BN layer. Moreover, with the common ReLU activation the derivative is exactly 1 for positive inputs, i.e. the gradient passes through unchanged, so ReLU itself does not cause gradients to vanish or explode.
Yet even with BN layers added, the nonlinear activation ReLU makes each input-to-output mapping nearly irreversible, causing a great deal of irretrievable information loss, and deep networks still exhibit the degradation problem.
Another possible reason: stochastic gradient descent usually finds a local optimum rather than the global one, and since a deeper network has a more complex structure, gradient descent is more likely to settle into a poor local optimum.

6.4 Solution Idea

To solve the degradation problem, we need to guarantee that, as the network deepens, the deep network performs at least as well as the shallow one. That is the identity map:

Take a shallow network and build a deep one by stacking new layers on top. In the extreme case, the added layers learn nothing and simply copy the shallow network's features, i.e. the new layers are identity mappings. In that case the deep network should perform at least as well as the shallow network and show no degradation.

6.5 Solution: Residual Learning

Based on this, Dr. He proposed the idea of residual learning:
if the last few layers of a deep network learn an identity map h(x)=x, the model degenerates into a shallow network. But learning this identity mapping directly is very difficult, so we come at it another way and design the network as:
H(x) = F(x) + x  =>  F(x) = H(x) - x
As long as F(x)=0, we get the identity map H(x)=x, where F(x) is the residual.
[figure: the residual learning block]

ResNet thus tackles the degradation problem with two mappings: the identity mapping and the residual mapping. The identity mapping is the curved shortcut in the figure; the residual mapping is everything outside the curve. F(x) is the network mapping before the summation, and H(x) is the mapping from the input to the summation output.

In fact, compared with the original structure, all that has been added is an identity map.

The blog author gave this example. Suppose a network mapping g(x) is to map 5 to 5.1, i.e. g(5)=5.1; in residual form, H(5)=5.1=F(5)+5, so F(5)=0.1. The residual form is far more sensitive to output changes: if the target output moves from 5.1 to 5.2, the plain mapping g(x) changes by 0.1/5.1 ≈ 2%, while the residual F(x) goes from 0.1 to 0.2, a 100% change. The latter's larger output change drives a much bigger weight adjustment, hence the better training effect.
This residual learning structure is implemented as a feed-forward network plus a shortcut connection, where the shortcut simply performs the identity mapping: it introduces no extra parameters and no extra computational complexity, and the whole network can still be trained end to end with backpropagation.
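As a minimal PyTorch sketch (my own version, assuming the input and output shapes match so the shortcut is a pure identity):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Residual block: output is H(x) = F(x) + x, F being two 3x3 convs."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)   # F(x) + identity shortcut

print(BasicBlock(64)(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```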

The shortcut connection lets information propagate effectively through the network and keeps gradient backpropagation smooth, which is what allows convolutional neural networks with thousands of layers to converge.
[figure: gradient flow through the residual block]
As the figure shows, the residual block computes H(x)=F(x)+x, so the derivative (gradient) of the output H(x) with respect to the input x during backpropagation is H'(x)=F'(x)+1. The constant 1 contributed by the shortcut guarantees that the gradient cannot vanish.

6.6 Network structure:

I think of the residual as a recombination of input and output that keeps the output from straying too far from the input. Taking away just that idea is perfectly fine.
[figure: the full ResNet network structure]

6.7 Implementation Details

Some shortcut lines in ResNet are solid and some are dashed.
A dashed line means the dimensions before and after that module differ. This is because the plain network (ResNet with the residual shortcuts removed) still follows the VGG pattern: every n layers the feature map is downsampled while the depth doubles (VGG downsamples with pooling layers, ResNet with stride-2 convolutions).
When the dimensions differ there are two options: add a 1×1 convolutional layer on the shortcut to raise the dimension, or simply zero-pad (after downsampling first).
For deeper networks, to keep the computation manageable, a 1×1 convolution first reduces the 256-dimensional input to 64 dimensions and another 1×1 convolution restores it at the end; the purpose is to cut the parameter count and computation.

[figure: basic block (left) vs. bottleneck block (right)]

The two structures target ResNet-34 (left) and ResNet-50/101/152 (right) respectively, and the main purpose of the right one is to reduce the number of parameters.
The left is two 3×3, 256-channel convolutions: 3×3×256×256×2 = 1,179,648 parameters.
The right first uses a 1×1 convolution to reduce the 256 channels to 64, then restores them with a 1×1 convolution at the end, for a total of 1×1×256×64 + 3×3×64×64 + 1×1×64×256 = 69,632 parameters.
That is about 16.94 times fewer parameters than the left, so the right structure's main purpose is to reduce the parameter count and thereby the computation.

Note that the left count assumes 256 input channels (hence ×256 rather than ×64) to make the comparison fair; in the actual networks the basic block on the left is used at low channel counts and the bottleneck on the right at high channel counts.
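You can verify both counts directly (a sketch; bias terms are disabled so only convolution weights are counted):

```python
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())

left = nn.Sequential(                               # two 3x3, 256-channel convs
    nn.Conv2d(256, 256, 3, padding=1, bias=False),
    nn.Conv2d(256, 256, 3, padding=1, bias=False),
)
right = nn.Sequential(                              # bottleneck: 1x1 down, 3x3, 1x1 up
    nn.Conv2d(256, 64, 1, bias=False),
    nn.Conv2d(64, 64, 3, padding=1, bias=False),
    nn.Conv2d(64, 256, 1, bias=False),
)
print(count(left), count(right))                    # 1179648 vs 69632
```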

References:
You must know the CNN model: ResNet
Must-read papers for image processing 6: ResNet
Deep learning 16: residual networks (ResNet)
CVPR 2016: ResNet fundamentally solves the problem of deep network degradation

7 FCN

(Don't ask why there is no network structure diagram; this thing mainly exists to pave the way for UNet.) Just look at UNet.

CNN performs image-level recognition: from an image to a single result.
FCN performs pixel-level recognition: it labels which category each pixel of the input image most likely belongs to.

7.1 Background

Starting with AlexNet, CNNs achieved good results on image classification and localization tasks, largely in the context of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), founded by Fei-Fei Li and others in 2010.
The competition tracks cover image classification, object localization, object detection, object detection from video, scene classification, and scene parsing.
But as the applications of deep learning expanded, neither semantic tasks in NLP (natural language processing) nor image segmentation in CV (computer vision) could be solved by the original CNN form. Ordinary classification CNNs such as VGG and ResNet append fully connected layers at the end of the network and obtain class probabilities after softmax. But that probability information is one-dimensional: it can only identify the category of the whole picture, not of each pixel, so the fully connected approach is unsuitable for image segmentation.
The Fully Convolutional Network (FCN) proposes replacing those final fully connected layers with convolutions, which yields a 2-dimensional feature map; following it with softmax then gives per-pixel classification information, and this pixel-wise prediction solves the segmentation problem.

7.2 Technical realization:

Regarding the following three points (and some of the concepts and motivations behind them): if they don't quite land, I suggest coming back after reading UNet, which embodies all of them perfectly; otherwise the explanation here may feel redundant and messy.

  1. Fully convolutional

A fully convolutional (fully conv) network has no fully connected (fc) layers.

Replace the last fully connected layers of a traditional CNN with convolutional layers.

  • It can adapt to inputs and outputs of any size (size and channel count can be adjusted via the convolution kernel, padding, and stride); see the sketch after this point.
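A tiny sketch of the difference (the 512 input channels and 21 classes are assumptions, 21 being the usual PASCAL VOC count): the 1×1 convolution head keeps the 2D layout, so it accepts any input size and emits a per-location score map rather than a single vector.

```python
import torch
import torch.nn as nn

fc_head   = nn.Linear(512, 21)       # needs a fixed-size flattened input vector
conv_head = nn.Conv2d(512, 21, 1)    # works on a feature map of any HxW

feat = torch.randn(1, 512, 16, 16)
print(conv_head(feat).shape)         # torch.Size([1, 21, 16, 16]) -> per-pixel class scores
```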
  2. Deconvolution (Deconvolution)

Instead of the interpolation-based upsampling (Interpolation) used before, FCN proposes a new upsampling method: deconvolution (Deconvolution). Deconvolution can be understood as the inverse of the convolution operation in shape only: it cannot recover the values lost by convolution, it merely reverses the steps of the convolution process once, which is why it can also be called transposed convolution.

  • Deconvolution can produce fine-grained outputs (all of its convolution weights participate in backpropagation!)

Personally I think "deconvolution" is a poor name, while "transposed convolution" fits very well. Why? Transposed convolution is still a kind of convolution, just with different parameters, which is also why the PyTorch module is called ConvTranspose2d.
Let me think about how to describe the difference: the stride. A transposed convolution's actual step size is always 1; its stride parameter does not mean the step size.
For an ordinary convolution, the parameter stride=n means the kernel moves n pixels at a time.
For a transposed convolution, the parameter stride=n means that n-1 zeros are interpolated between every two input pixels; the step size itself is always 1. OK, look at the examples:
When the parameter stride=1, a transposed convolution looks no different from a convolution.
The transposed convolution below: blue input 2×2, moving gray kernel_size 3×3, white border padding 2, stride 1; the green output is 4×4.
[figure: transposed convolution with stride=1]
And when the stride parameter is not 1? As said above, it tells how many zeros get inserted in between.
The transposed convolution below: input 3×3, kernel_size 3×3, padding 1, stride = ?, green output 5×5.
The answer is 2: as the figure shows, one white pixel is inserted between every two pixels of the blue input, so the parameter stride = 1+1 = 2.
The kernel's actual step size is still 1.
[figure: transposed convolution with stride=2]
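You can reproduce both figures with nn.ConvTranspose2d. One caveat from my reading of these standard convolution-arithmetic illustrations: the white border drawn in the pictures equals kernel_size - 1 - padding, not the padding argument itself, and the output size follows out = (in - 1) × stride - 2 × padding + kernel_size.

```python
import torch
import torch.nn as nn

# Figure 1: input 2x2, kernel 3x3, stride 1; drawn border 2 means padding=0
up1 = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=1, padding=0)
print(up1(torch.randn(1, 1, 2, 2)).shape)   # torch.Size([1, 1, 4, 4])

# Figure 2: input 3x3, kernel 3x3, stride 2; drawn border 1 means padding=1
up2 = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, padding=1)
print(up2(torch.randn(1, 1, 3, 3)).shape)   # torch.Size([1, 1, 5, 5])
```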

  3. Skip layer structure (Skip Layer)

Combining features from different depths lets the earlier and later feature maps compensate for each other.

(My understanding: it keeps reminding the network to move back toward reality and not drift away.) It is similar to the residual idea; don't overthink it at this stage.

  • It can enhance the robustness and accuracy of the network.

References:
Image segmentation: fully convolutional networks (FCN) in semantic segmentation
FCN learning and understanding
Fully convolutional networks (FCN) explained in detail

7.3 Historical significance:

• Pioneering work in the field of semantic segmentation
• End-to-end training paved the way for the development of subsequent semantic segmentation algorithms

7.4 Other related concepts

7.4.1 Local Information

  • Extraction location: extract local information in shallow network
  • Features: The geometric information of the object is relatively rich, and the corresponding receptive field is small
  • Purpose: Helps to segment targets with smaller sizes, and helps to improve the accuracy of segmentation

7.4.2 Global Information

  1. Extraction position: Extract global information in deep network
  2. Features: The spatial information of the object is relatively rich, and the corresponding receptive field is relatively large
  3. Purpose: Helps segment larger targets and improves segmentation accuracy

7.4.3 Receptive field

In a convolutional neural network, the receptive field is the region of the input layer that determines one element of a given layer's output.
Generally speaking, a large receptive field works better than a small one.
From the formula (see the sketch below), the larger the stride, the larger the receptive field; but too large a stride leaves the feature map with less information.
So how to enlarge the receptive field, or at least keep it unchanged, while reducing the stride has become a major problem in segmentation.
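The formula alluded to above is the standard receptive-field recurrence; here is a small sketch of it (my own helper, not from the article): each layer grows the receptive field by (kernel - 1) times the product of all earlier strides.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, input to output."""
    rf, jump = 1, 1                 # jump = product of strides so far
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1), (3, 1)]))          # 5: two 3x3 convs ~ one 5x5
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7: three 3x3 convs ~ one 7x7
print(receptive_field([(3, 2), (3, 2)]))          # 7: larger stride -> larger RF
```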

7.4.4 End-to-end

In computer vision, end-to-end can be simply understood as: the input is the raw image, the output is the predicted image, and the specific process in between is left to the learning ability of the algorithm itself.
Inside the network, the original image undergoes dimensionality reduction and feature extraction, and in the later stages the smaller feature maps are gradually restored to a predicted image of the same size as the original.
The quality of the feature extraction directly affects the final prediction. The defining characteristic of an end-to-end network is that it learns features according to the designed algorithm, without human intervention.

The reduce-and-extract process is downsampling; the size-restoring process is upsampling.
Just look at UNet and you will understand.

8 UNet

8.1 Research results and significance:

  1. Fast: segmenting a 512×512 image takes less than a second on a GPU
  2. It has become the baseline for most medical image semantic segmentation tasks, and has inspired a great many researchers to think about U-shaped semantic segmentation networks
  3. U-Net combines low-resolution information (the basis for recognizing object categories) with high-resolution information (the basis for precise segmentation and localization), a perfect fit for medical image segmentation

Basically, for any segmentation problem, you can first run U-Net to see a baseline result and then start making "magic modifications".

8.2 Network structure

Let's put up the structure diagram first. The diagram I found on Baidu feels nicer than the official one, mainly because it includes a comparison with a classification network, which is friendlier to newcomers.
[figure: U-Net structure, with a classification network boxed in the lower right for comparison]
About this picture: the boxed part in the lower right corner is the structure diagram of a classification model. U-Net does not include that part; it is drawn only to show the contrast.
Borrowing from Zhihu @陈义新, a brief description of the picture:

Encoder (left half):
composed of repeated downsampling modules, each being two 3×3 convolution layers (ReLU) plus one 2×2 max-pooling layer (you can see this in the code later);
Decoder (right half):
composed of repeated modules of one upsampling convolution layer (a deconvolution layer) + feature concatenation (concat) + two 3×3 convolution layers (ReLU);

OK, start parsing.

  • First, as a whole, it is a U-shaped symmetric structure. Disassembled, it follows the same structural idea as FCN:
    the left side is downsampling (the Encoder), the right side is upsampling (the Decoder).

Before disassembling it, let's treat "3×3 convolution layer Conv + activation function ReLU + BN layer" as one unit and call it a block.

  • Starting from the top level: the blue rectangle on the left is the input, and after two blocks we get an output (the rectangle with channel=64). Then, going down, there is a 2×2 max pooling maxpool (pick the maximum value in each 2×2 window, so 2×2 becomes one number and the height and width halve), and then the next level is again block + maxpool.
    [The picture shows 3 downsamplings, 4 levels in total. The original U-Net paper uses 4 downsamplings, 5 levels, but this is not fixed; no harm either way.]
  • After looping down to the bottom level, downsampling ends and upsampling is ready to begin.
  • Upsampling, as shown in the structure, is block + deconvolution (the deconvolution process was described under FCN; upsampling here uses deconvolution, not max pooling).
    Upsampling 3 times brings us back to an output the same size as the original input.
  • Done!
  • Note that the left and right sides are connected by gray lines. Each line is a skip connection, called "copy and crop" in the original U-Net paper. It is really the residual idea: concatenate the output from the left with the output from below and feed the result as input to the layer on the right.

A couple more details:
FCN's decoder is comparatively crude: it uses only a single deconvolution operation, with no convolution structure following it.
For the skip connections, FCN uses summation (add), while U-Net uses concatenation (concat).

For the add and concat operations, you can refer to: Deep feature fusion - understanding the multi-layer feature fusion of add and concat.
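Putting the pieces together, here is a minimal sketch of one encoder/decoder level (channel sizes are my own; it uses padded "same" convolutions, so unlike the original paper no cropping is needed before the concat):

```python
import torch
import torch.nn as nn

def block(in_ch, out_ch):   # 3x3 Conv + BN + ReLU, twice
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

down = nn.MaxPool2d(2)                            # halves H and W
up   = nn.ConvTranspose2d(128, 64, 2, stride=2)   # doubles H and W

enc    = block(3, 64)(torch.randn(1, 3, 256, 256))  # encoder output, 64 channels
bottom = block(64, 128)(down(enc))                  # one level down, 128 channels
dec_in = torch.cat([enc, up(bottom)], dim=1)        # skip connection: concat -> 128 channels
out    = block(128, 64)(dec_in)
print(out.shape)                                    # torch.Size([1, 64, 256, 256])
```

Note that torch.cat joins along dim=1, the channel dimension, which is exactly U-Net's concatenation; FCN's add would instead be enc + up(bottom), which requires matching channel counts and keeps the width unchanged.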

8.3 Module function

  • Convolution (Conv) is used for feature extraction;
  • Pooling (maxpool) is used to reduce the spatial size;
  • Concatenation (skip-connection) is used for feature fusion; [so channels are the features]
  • Upsampling (Upsample) is used to restore the spatial size;

From a bigger perspective:

  • Downsampling (the Encoder): the model understands the content of the image but discards its positional information.
  • Upsampling (the Decoder): the model combines the Encoder's understanding of the content to restore the image's positional information.

Then we know the difference between different layers:

  • The shallow layers carry mainly geometric information, what the image's content is, seen with a small field of view (what you can see when zoomed in, e.g. the color of a cat's eyes)
  • The deep layers carry mainly positional information, where things are in the image, seen with a big field of view (what you can see when zoomed out, e.g. where the cat is in the picture)

8.4 Code implementation

Heh, we have finally made it here. Although... well, there are still plenty of details. They get an article of their own; let's push on together.
PyTorch Quick Start and Practice - 3, UNet Implementation
