CV—BaseLine Summary (Development from AlexNet to SENet)

1. Original intention

Deep learning has been developing rapidly since 2012, and models are constantly being iterated and optimized;

Nowadays, new models often stand on the shoulders of giants. Here I want to record the development of the baseline models and how they have been continuously improved; rather than dissecting every layer of each model, I record the key points of innovation, along with some personal understanding and thinking. If anything is wrong or incomplete, feel free to reach out, and I hope to keep improving my understanding of these models in future work;

What does BaseLine refer to?

We usually also call BaseLine models classification models, and they are often used for a classification task when getting started. But these models do not appear only in classification tasks; they are the cornerstone of the entire CV field. The detection, segmentation, and keypoint regression tasks encountered later are all inseparable from one key step: feature extraction. Subsequent operations are built on the feature map produced by this step, so a BaseLine model is essential in all of these tasks, and the choice of model has a large impact on the performance of the whole task (of course, you can also design your own feature-extraction model)



2. AlexNet

Paper address

Significance: as the pioneering work that opened this chapter, it launched the era in which convolutional networks dominate computer vision;

This model is relatively simple; just look at the network structure diagram:

[Figure: AlexNet network architecture]

One characteristic of this network is that, in order to use more computing resources, it was trained in parallel on two GPUs and the results were fused at the end. This idea of splitting the network shows up again in later architectures, so it is worth noting here;

key concept


1. What advantages does the ReLU activation function bring?

First of all, ReLU and Sigmoid are nonlinear activation functions. If a linear activation function were used, a neural network with multiple hidden layers would be no different from a single-layer one, and the expressive power of the convolutional network would be lost;

Here we compare the Sigmoid function and the ReLU function; first look at the formulas, which are worth memorizing;

Sigmoid calculation formula:

$$y = \frac{1}{1 + e^{-x}}$$

Gradient formula:

$$y' = y(1 - y)$$

ReLU calculation formula:

$$y = \max(0, x)$$

Gradient formula:

$$y' = \begin{cases} 1, & x > 0 \\ \text{undefined}, & x = 0 \\ 0, & x < 0 \end{cases}$$
[Figure: Sigmoid and ReLU activation curves]

As can be seen from the two figures above, when x is very large or very small the gradient of the sigmoid function is almost 0, which causes the gradient to vanish;

ReLU has the following advantages:

1. Makes network training faster (the computation is simple, and the gradient does not decay when x is greater than 0)

2. Helps prevent vanishing gradients;

3. Makes the network sparse (when x is negative, the output and gradient are 0, so those neurons do not participate in training)

Think:

When the input to ReLU is less than 0, the gradient is also 0. Is there room for improvement?

Leaky ReLU is an improvement on ReLU that addresses this problem;

Calculation formula:

$$y = \max(0.01x, x)$$
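As a quick illustration of the gradient behavior described above, here is a minimal PyTorch sketch (my own addition, not from the original papers) comparing the gradients of sigmoid, ReLU, and Leaky ReLU at a few input values:

```python
import torch

x = torch.tensor([-5.0, -1.0, 0.5, 5.0], requires_grad=True)

for name, fn in [("sigmoid", torch.sigmoid),
                 ("relu", torch.relu),
                 ("leaky_relu", lambda t: torch.nn.functional.leaky_relu(t, 0.01))]:
    y = fn(x).sum()                      # sum to get a scalar for backprop
    grad, = torch.autograd.grad(y, x)
    print(name, grad.tolist())

# Expected behavior: sigmoid gradients are close to 0 when |x| is large (vanishing gradient),
# ReLU gradients are exactly 0 for x < 0, and Leaky ReLU keeps a small slope (0.01) there.
```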


2. How is the output size of a convolutional layer calculated?

The parameters passed to the convolution are: input image size I x I, convolution kernel size K x K, stride S, and padding P

Output size calculation formula:
$$O = (I - K + 2P)/S + 1$$
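A small sketch of the formula on a concrete case (the 11x11, stride-4, padding-2 first layer of AlexNet as it is commonly implemented), cross-checked against an actual convolution layer:

```python
import torch
import torch.nn as nn

def conv_output_size(i, k, s, p):
    """O = (I - K + 2P) / S + 1, with floor division when it does not divide evenly."""
    return (i - k + 2 * p) // s + 1

print(conv_output_size(224, 11, 4, 2))  # -> 55

# Cross-check with an actual layer
x = torch.randn(1, 3, 224, 224)
conv = nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2)
print(conv(x).shape)                    # -> torch.Size([1, 96, 55, 55])
```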


3. The concept of Dropout

Neurons are randomly deactivated during training to reduce overfitting and improve the generalization ability of the network;

For a more detailed explanation, please refer to this article: https://zhuanlan.zhihu.com/p/77609689

[Figure: standard network vs. network with dropout applied]

Note: in the original formulation, neuron outputs are multiplied by the retention probability at test time (modern frameworks instead use inverted dropout, scaling by 1/(1-p) during training so that nothing needs to change at test time);
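A minimal PyTorch sketch of the train/test behavior; nn.Dropout implements inverted dropout, so the scaling happens during training and the layer is an identity at test time:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # p is the probability of zeroing an element
x = torch.ones(1, 8)

drop.train()
print(drop(x))   # roughly half the entries are zeroed; survivors are scaled by 1/(1-p) = 2

drop.eval()
print(drop(x))   # identity at test time, thanks to the inverted-dropout scaling above
```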



3. VGG

Paper address

Significance: VGG is still used today as the backbone of other networks, and its proposal opened the era of small convolution kernels and deep networks;

[Figure: VGG configurations A-E from the paper]

The figure shows the progression of VGG configurations given in the paper; VGG16 (configuration D) and VGG19 (configuration E) are the ones mainly used today;


key concept


1. What is the role of stacked 3x3 convolutions? Why are small convolution kernels better than large ones?

The function of convolution is to continuously extract features from the input image. AlexNet uses large convolution kernels, so downsampling happens relatively quickly and not many layers are stacked. VGG uses 3x3 kernels and usually downsamples by a factor of 2 at a time, which makes it easier to deepen the network;

advantage:

1. Increase the receptive field

Two stacked 3x3 convolutions are equivalent to one 5x5 convolution (and three stacked 3x3 convolutions to one 7x7): the receptive fields are the same;

2. Reduce the amount of parameters and computation

Three 3x3 convolution kernels require about 27C² parameters (assuming C input and C output channels)

A single 7x7 convolution kernel requires about 49C² parameters

The receptive field is the same, but the stacked 3x3 kernels use roughly 45% fewer parameters (see the sketch below)
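A quick sketch (my own, with C = 256 as an arbitrary channel count) counting the parameters of three stacked 3x3 convolutions versus a single 7x7 convolution:

```python
import torch.nn as nn

C = 256  # arbitrary channel count for illustration

stacked_3x3 = nn.Sequential(
    nn.Conv2d(C, C, 3, padding=1, bias=False),
    nn.Conv2d(C, C, 3, padding=1, bias=False),
    nn.Conv2d(C, C, 3, padding=1, bias=False),
)
single_7x7 = nn.Conv2d(C, C, 7, padding=3, bias=False)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(stacked_3x3))                           # 3 * 9 * C^2 = 1,769,472
print(count(single_7x7))                            # 49 * C^2   = 3,211,264
print(1 - count(stacked_3x3) / count(single_7x7))   # ~0.449, i.e. roughly 45% fewer parameters
```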

Extended thinking: **What is the role of the 1x1 convolution kernel?** This will be explained with the next network;


2. How can the network accept inputs of arbitrary size?

When there is a fully connected layer in the model, the input size of the model is fixed. Can the fully connected layer be replaced?

Replace the last fully connected layer with a convolutional layer, as shown in the following figure:

[Figure: fully connected layers converted to convolutional layers]

The idea here is important: as networks developed, fully connected layers were gradually replaced by convolutions because of their huge parameter count and the restriction they place on input size;

Most importantly, the later FCN also uses a fully convolutional form to realize an encoder-decoder style network structure;
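A small sketch of the idea (my own channel/class counts, loosely following the VGG-style conversion, with the whole classifier collapsed into a single 7x7 convolution for brevity): the convolutional head runs on feature maps of any spatial size and produces a score map instead of a single vector.

```python
import torch
import torch.nn as nn

# A classifier head with a fully connected layer only works for one fixed input size:
fc_head = nn.Sequential(nn.Flatten(), nn.Linear(512 * 7 * 7, 1000))

# The equivalent convolutional head works for any spatial size:
conv_head = nn.Conv2d(512, 1000, kernel_size=7)

feat_small = torch.randn(1, 512, 7, 7)    # feature map from a 224x224 input
feat_large = torch.randn(1, 512, 14, 14)  # feature map from a larger input

print(conv_head(feat_small).shape)  # torch.Size([1, 1000, 1, 1]) -> same role as the FC output
print(conv_head(feat_large).shape)  # torch.Size([1, 1000, 8, 8]) -> a dense score map
```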



4. GoogleNet

Paper address

GoogleNet has four versions: V1, V2, V3, and V4. Since it is not commonly used today, the focus here is on its key ideas rather than the differences between versions; what matters most in GoogleNet are its tricks;

The first thing to understand is the Inception module

[Figure: Inception module with parallel 1x1, 3x3, 5x5 convolution and pooling branches]

Features: it improves the utilization of computing resources and increases the depth and width of the network while only slightly increasing the number of parameters;
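A minimal sketch of an Inception-style block (my own simplification; the channel counts roughly follow the first Inception module in the paper, and 1x1 convolutions are used to reduce channels before the expensive branches):

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Simplified Inception-style block: parallel branches concatenated along the channel axis."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 96, 1),        # 1x1 reduces channels first
                                nn.Conv2d(96, 128, 3, padding=1))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),
                                nn.Conv2d(16, 32, 5, padding=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionBlock(192)(x).shape)  # torch.Size([1, 256, 28, 28]) = 64+128+32+32 channels
```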


key concept


1. What is the role of 1x1 convolution?

This is a very important concept; it is used in the Inception module and also in ResNet;

Features: it only changes the number of output channels, not the width and height of the output;

effect:

1. It raises or lowers the channel dimension, so it can compress the "thickness" of the feature map and can also produce the final classification output;

2. It adds nonlinearity: a 1x1 convolution (with its activation) can be inserted while keeping the number of channels unchanged;

3. It reduces the amount of computation: once the number of channels is reduced, the computation naturally decreases (see the sketch below);
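A small sketch (arbitrary channel counts of my choosing) showing how a 1x1 convolution reduces channels, and therefore parameters, around an expensive 3x3 convolution; this is also the bottleneck pattern that reappears in ResNet:

```python
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())

# Direct 3x3 convolution on a 256-channel feature map
direct = nn.Conv2d(256, 256, 3, padding=1, bias=False)

# 1x1 reduction to 64 channels, 3x3 convolution, then 1x1 expansion back to 256
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, 1, bias=False),
    nn.Conv2d(64, 64, 3, padding=1, bias=False),
    nn.Conv2d(64, 256, 1, bias=False),
)

print(count(direct))      # 589,824
print(count(bottleneck))  # 69,632 -> roughly 8x fewer parameters (FLOPs scale the same way)
```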


2. What is the concept of auxiliary loss?

This concept is not common in later networks; its main purpose is to use intermediate-layer information for classification;

Implementation: an extra loss is computed from an intermediate-layer output and weighted together with the loss of the final output;

Personal understanding:

This trick is actually not very necessary. If you want to use intermediate-layer information in later networks, feature fusion with concat or add is more effective than an auxiliary loss;
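For completeness, a tiny schematic sketch of how an auxiliary loss is combined with the main loss (the 0.3 weight matches the value used in the GoogleNet paper; the logits here are random placeholders):

```python
import torch
import torch.nn.functional as F

main_logits = torch.randn(8, 1000, requires_grad=True)  # final classifier output
aux_logits = torch.randn(8, 1000, requires_grad=True)   # intermediate-layer classifier output
target = torch.randint(0, 1000, (8,))

# Auxiliary loss is added with a small weight (0.3 in the GoogleNet paper)
loss = F.cross_entropy(main_logits, target) + 0.3 * F.cross_entropy(aux_logits, target)
loss.backward()  # in a real network this also pushes gradients into the intermediate layers
```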


3. What is a sparse matrix?

We often hear about making a network sparse. What exactly is sparsity, and what are its advantages?

Sparse matrix:

A matrix in which zero elements far outnumber non-zero elements, distributed irregularly;

Dense matrix:

A matrix in which non-zero elements far outnumber zero elements;

Advantages: sparse computations can be decomposed into dense-matrix operations to speed up convergence. In short, sparsity can both reduce memory usage and improve training speed;


4. What is the function of the BN layer?

Reference article: https://blog.csdn.net/weixin_42080490/article/details/108849715

The full name is Batch Normalization; it mainly addresses the problem that, as the network deepens, training becomes slower and slower and convergence becomes harder;

The cause of the problem:

As network layers are stacked, parameter updates in each layer change the input distribution of the layer above, and these shifts accumulate so that upper layers have to keep re-adapting to the parameter updates of the lower layers. This situation is known as internal covariate shift, or ICS;

BN's goal is to keep the input distribution of each layer roughly stable throughout training;

Practical usage: the inputs are normalized to a distribution with mean 0 and standard deviation 1 (and then scaled and shifted by learnable parameters), so that the inputs to the activation function fall in its sensitive region. The activations do not blow up, larger gradients can be obtained, the vanishing-gradient problem is avoided, and training is sped up;

main effect:

1. Accelerates network training and convergence (GoogleNet V2, which added BN, reportedly reached the same accuracy with roughly an order of magnitude fewer training steps than the previous version)

2. Controls gradient explosion and prevents vanishing gradients;

3. Has a regularizing effect that helps prevent overfitting;

Notice:

After adding a BN layer we can often drop the dropout operation, we no longer need to pay much attention to weight initialization, and we can use a larger learning rate to accelerate convergence. The benefits are substantial, and basically every modern network includes this kind of structure;
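A minimal sketch of what BN computes on one mini-batch during training (per channel, over the batch and spatial dimensions); the learnable scale/shift and the running statistics used at inference are only hinted at in the comments:

```python
import torch
import torch.nn as nn

x = torch.randn(32, 64, 56, 56)              # (batch, channels, height, width)

# Manual batch norm: per-channel statistics over the batch and spatial dimensions
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_hat = (x - mean) / torch.sqrt(var + 1e-5)  # then y = gamma * x_hat + beta (learnable)

# The built-in layer does the same thing (plus running stats and gamma/beta)
bn = nn.BatchNorm2d(64)
print(torch.allclose(bn(x), x_hat, atol=1e-4))  # True: gamma=1, beta=0 at initialization
```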


5. What is the convolution factorization strategy?

1. The large convolution kernel is decomposed into a stack of small convolution kernels;

This is the previously introduced idea of replacing one 5x5 convolution with two stacked 3x3 convolutions;

2. Decompose into asymmetric convolutions;

That is, an nxn convolution is decomposed into a stack of a 1xn convolution and an nx1 convolution;

The main purpose is to reduce network parameters, but this strategy is only useful when the feature-map resolution is relatively small, so it is not common in later networks
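A small sketch comparing a 7x7 convolution with its asymmetric 1x7 + 7x1 factorization (the channel count is arbitrary, chosen by me for illustration):

```python
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())

full = nn.Conv2d(128, 128, kernel_size=7, padding=3, bias=False)
asymmetric = nn.Sequential(
    nn.Conv2d(128, 128, kernel_size=(1, 7), padding=(0, 3), bias=False),
    nn.Conv2d(128, 128, kernel_size=(7, 1), padding=(3, 0), bias=False),
)

print(count(full))        # 49 * 128 * 128 = 802,816
print(count(asymmetric))  # 2 * 7 * 128 * 128 = 229,376 -> about 71% fewer parameters
```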


6. What is the label smoothing strategy?

Problem:

Traditional one-hot encoding has a problem of overconfidence, which can easily lead to overfitting;

Solution:

Label smoothing attenuates the entry with confidence 1 in the one-hot encoding to avoid overconfidence, and spreads the removed confidence evenly across all categories;

Example:

With a smoothing factor of 0.001 and 4 classes, the original label (0, 1, 0, 0) becomes (0.00025, 0.99925, 0.00025, 0.00025)

The smoothed label is then passed into the loss function;
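A small sketch of the smoothing formula used in the example above (epsilon = 0.001, K = 4 classes):

```python
import torch

def smooth_labels(one_hot, eps=0.001):
    """(1 - eps) * one_hot + eps / K: spreads eps of the confidence evenly over all classes."""
    k = one_hot.size(-1)
    return one_hot * (1 - eps) + eps / k

print(smooth_labels(torch.tensor([0., 1., 0., 0.])))
# -> 0.00025, 0.99925, 0.00025, 0.00025

# In practice this is usually built into the loss, e.g.:
# torch.nn.CrossEntropyLoss(label_smoothing=0.1)
```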



5. ResNet

Paper address

ResNet, like VGG, is among the most widely used convolutional network structures in industry; its key module is shown in the figure below:

[Figure: residual learning building block]

Its significance lies in pushing networks deeper: for the first time, networks with hundreds or even a thousand layers were trained successfully;


key concept


1. What is the above structure called? What role does it play?

Reference article: https://zhuanlan.zhihu.com/p/80226180

The structure in the above figure is called the residual structure (Residual learning)

What problems arise when the network is made deeper?

1. Vanishing or exploding gradients occur easily (this can be mitigated by adding BN layers and regularization)

2. Network degradation: as the number of layers increases, the network's performance gradually saturates and then degrades rapidly (the deeper network becomes difficult to optimize)

The residual structure makes it easy for the network to learn an identity mapping. The specific formulas are as follows:

$$F(x) = W_2 \cdot \mathrm{ReLU}(W_1 \cdot x)$$

$$H(x) = F(x) + x = W_2 \cdot \mathrm{ReLU}(W_1 \cdot x) + x$$

When F(x) is 0, H(x) = x, which realizes the identity mapping of the network;

effect:

1. It is conducive to gradient propagation, so gradients neither vanish nor explode and networks can be stacked to a thousand layers;

2. It introduces skip connections, which carry information from earlier layers forward; there is also a flavor of ensemble learning in this;

Of course, to reduce the amount of computation, the 1x1 convolution introduced earlier is also used here for dimensionality reduction (the bottleneck structure):

[Figure: ResNet bottleneck block using 1x1 convolutions for dimensionality reduction]
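A minimal sketch of the basic two-layer residual block described by the formulas above (my own simplification: BN layers and the projection shortcut used when shapes change are omitted):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """H(x) = F(x) + x, with F(x) = conv(relu(conv(x))); BN omitted for brevity."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))  # F(x)
        return self.relu(out + x)                   # H(x) = F(x) + x

x = torch.randn(1, 64, 56, 56)
print(BasicResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```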



6. ResNeXt

Paper address

Main idea: based on the ResNet network, it introduces the concept of aggregated transformations from Inception;

[Figure: ResNeXt building block with aggregated parallel paths]


key concept


1. What is group convolution? What is its effect?

[Figure: group convolution]

The structure shown in the figure above is group convolution. The idea comes from AlexNet, where the convolutions were split across two GPUs;

Implementation strategy: split the input feature map into groups along the channel dimension, perform a normal convolution within each group, and then concatenate the group outputs along the channel dimension to obtain the output feature map;

effect:

1. Uses fewer parameters to obtain feature maps of the same size;

2. Lets the network learn more varied features and obtain richer information;

Note: although group convolution is similar to depthwise separable convolution, they are not the same; this will be introduced later;
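A small sketch using the groups argument of nn.Conv2d to show the parameter saving (64 channels and 4 groups are arbitrary choices for illustration):

```python
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())

normal = nn.Conv2d(64, 64, 3, padding=1, bias=False)             # every output sees all 64 inputs
grouped = nn.Conv2d(64, 64, 3, padding=1, groups=4, bias=False)  # 4 groups of 16 channels each

print(count(normal))   # 64 * 64 * 9 = 36,864
print(count(grouped))  # 4 * (16 * 16 * 9) = 9,216 -> 1/4 of the parameters
```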



7. DenseNet

Paper address

Significance: it rethinks short paths and feature reuse in convolutional networks to achieve better results;

The entire DenseNet network is divided into three parts: a head convolution, stacked Dense Blocks, and a fully connected layer that outputs the classification probabilities;

The head convolution uses convolution and pooling to control the output dimensions; here we mainly explain the Dense Block structure;

[Figure: a Dense Block with densely connected layers]


key concept


1. What is a dense connection? What are the advantages?

The most important structure in DenseNet, shown in the figure above, is the dense connection;

Within a Block, the input of each layer comes from the feature maps of all preceding layers, and the output of each layer is fed directly into the input of all subsequent layers;

effect:

1. Obtains more features with fewer parameters; DenseNet reaches accuracy similar to ResNet with only about 1/3 of the parameters;

2. Low-level features are reused, making the features richer;

3. Stronger gradient flow: the many skip connections make gradient propagation easier;

Notice:

On small datasets, dense connections actually act as a form of regularization; it is sometimes said that dense connections are better suited to small datasets;

shortcoming:

Training DenseNet consumes a great deal of memory, because a large number of feature maps must be kept for the backpropagation computation;
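A minimal sketch of a dense block (growth rate and layer count chosen arbitrarily by me; the BN/ReLU/1x1 bottleneck layers of the real DenseNet are omitted):

```python
import torch
import torch.nn as nn

class TinyDenseBlock(nn.Module):
    """Each layer takes the concatenation of all previous feature maps as its input."""
    def __init__(self, in_ch, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(in_ch + i * growth_rate, growth_rate, 3, padding=1)
            for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # dense connection: concat everything so far
            features.append(out)
        return torch.cat(features, dim=1)

x = torch.randn(1, 64, 28, 28)
print(TinyDenseBlock(64)(x).shape)  # torch.Size([1, 192, 28, 28]) = 64 + 4*32 channels
```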


2. What is the difference between add and concat in feature fusion? Which is better?

add: the values of the feature maps are added element-wise, and the number of channels stays the same;

concat: the feature maps are concatenated along the channel dimension, and the number of channels increases;

Personal understanding: add sums two feature maps and therefore risks losing information, while concat stacks the feature maps and keeps all the original information, so concat is generally better; however, add needs less memory and fewer downstream parameters than concat;
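A two-line sketch of the shape difference between the two fusion methods:

```python
import torch

a = torch.randn(1, 64, 28, 28)
b = torch.randn(1, 64, 28, 28)

print((a + b).shape)                   # torch.Size([1, 64, 28, 28])  -> add keeps the channel count
print(torch.cat([a, b], dim=1).shape)  # torch.Size([1, 128, 28, 28]) -> concat stacks the channels
```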



8. SENet

Paper address

Significance: it was among the first works to introduce the attention mechanism into convolutional neural networks, and the mechanism is a plug-and-play module;

[Figure: the SE block: squeeze, excitation, and channel-wise rescaling of the feature map]

The figure above shows the most important module in the network, the SE block, which reflects a process of compression and recalibration;

key concept


1. What is the core idea of ​​the attention mechanism?

The attention mechanism can be understood as having the network produce a set of weights, which are then combined with the feature maps to transform them, so that the model emphasizes the features of the regions (or channels) it attends to;


2. The principle of SE-block structure, how to realize the attention mechanism?

As can be seen from the above figure, SE-block is mainly divided into three parts: Squeeze, Excitation, Scale;

Squeeze: compress the feature map into a vector of shape 1x1xC through global average pooling (GAP);

Excitation: two fully connected layers map and transform this vector, and a sigmoid finally limits its range to [0, 1];

Scale: multiply the resulting weight vector channel-wise with the original feature map to obtain the re-weighted feature map (a minimal sketch follows);
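A minimal sketch of an SE block (reduction ratio 16 as in the paper; BN and other framework details omitted):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: GAP -> FC -> ReLU -> FC -> sigmoid -> channel-wise rescale."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))            # Squeeze: global average pooling -> (b, c)
        w = self.fc(w).view(b, c, 1, 1)   # Excitation: per-channel weights in [0, 1]
        return x * w                      # Scale: re-weight each channel of the input

x = torch.randn(2, 256, 14, 14)
print(SEBlock(256)(x).shape)  # torch.Size([2, 256, 14, 14])
```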

Summarize:

That is how the attention mechanism is implemented here. Of course, attention can act not only on channels but also on the spatial dimensions (HxW), and attention is also at the core of more recent work such as ViT. If you are interested, you can look into the development and applications of attention;



Finally

This concludes the overview of the basic CV baseline models. It only summarizes the main characteristics of each network; some small tricks are not covered in this article, so read the papers if you are interested;

Of course, understanding the principles of a model does not mean you have mastered it; implementing the code is also a key step. I hope every module and structure can be implemented in a framework you are familiar with, especially by studying the official open-source code. Only when you are proficient in both the principles and the code can you say you have truly mastered a model;
