Dahua CNN classic models: GoogLeNet (the evolution from Inception v1 to v4)

In 2014, GoogLeNet and VGG were the twin stars of that year's ImageNet challenge (ILSVRC14): GoogLeNet took first place and VGG took second. What the two architectures have in common is that both are deeper than their predecessors. VGG inherits much of the framework of LeNet and AlexNet (see the earlier post in this series, Dahua CNN classic model: VGGNet), while GoogLeNet makes a bolder structural attempt: although it is only 22 layers deep, it is much smaller than AlexNet and VGG. GoogLeNet has about 5 million parameters, AlexNet has roughly 12 times as many, and VGGNet about 3 times as many as AlexNet. So when memory or computing resources are limited, GoogLeNet is the better choice, and in terms of results its performance is also superior.

Trivia: GoogLeNet is a deep network architecture developed by Google. Why is it spelled "GoogLeNet" rather than "GoogleNet"? Reportedly it is a tribute to LeNet, hence the name "GoogLeNet".

So, how does GoogLeNet further improve performance?
Generally speaking, the most direct way to improve a network's performance is to increase its depth and width, where depth is the number of layers and width is the number of neurons per layer. However, this approach has the following problems:
(1) more parameters make overfitting likely when the training data set is limited;
(2) a larger network with more parameters costs more computation and is harder to deploy;
(3) the deeper the network, the more it suffers from vanishing gradients (gradients shrink as they propagate backward), making the model difficult to optimize.
Hence the joke that "deep learning" is really "deep parameter tuning".
The way out is of course to reduce parameters while still increasing depth and width. A natural idea for reducing parameters is to replace full connections with sparse connections. In practice, however, sparsifying the connections does not bring a qualitative reduction in actual computation, because most hardware is optimized for dense matrix computation: a sparse matrix holds less data, but its computation time is very hard to reduce.

So, is there a way to keep the sparsity of the network structure while still exploiting the high computational efficiency of dense matrices? A large body of literature shows that sparse matrices can be clustered into denser sub-matrices to improve computing performance, much as the human brain can be seen as a repeated stacking of neurons. The GoogLeNet team therefore proposed the Inception network structure: a "basic neuron" building block used to assemble a network that is sparse in structure yet dense in computation.

[Question] What is Inception?
Inception has evolved through versions V1, V2, V3 and V4, improving continuously. They are introduced one by one below.

1. Inception V1
By designing a sparse network structure that can nevertheless generate dense data, Inception V1 increases the performance of the neural network while keeping the use of computing resources efficient. Google's original, most basic Inception structure is shown below:
(Figure: the original "naive" Inception module)
This structure stacks in parallel the convolutions (1x1, 3x3, 5x5) and pooling operation (3x3) commonly used in CNNs; all branches produce outputs of the same spatial size, so they are concatenated along the channel dimension. On the one hand this increases the width of the network, and on the other it increases the network's adaptability to scale.
The 1x1 convolutions in the module extract fine detail from the input, while the 5x5 filters cover most of the input of the receiving layer. A pooling branch is included as well, reducing spatial size and overfitting. A ReLU follows each of these convolutional layers, increasing the nonlinearity of the network.
In this original version of Inception, however, every kernel is applied to all outputs of the previous layer, and the 5x5 convolutions in particular are very expensive, producing very thick feature maps. To avoid this, 1x1 convolutions are inserted before the 3x3 convolution, before the 5x5 convolution, and after max pooling to reduce the channel depth of the feature maps. This yields the network structure of Inception v1, as shown in the following figure:
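To make the module concrete, here is a minimal PyTorch sketch of the Inception v1 block just described; the class and parameter names are my own, not from the paper, and the per-branch channel counts are passed in so the same block can reproduce the 3a/3b configurations later in this post:

```python
import torch
import torch.nn as nn

class InceptionV1Block(nn.Module):
    def __init__(self, in_ch, c1x1, c3x3_red, c3x3, c5x5_red, c5x5, pool_proj):
        super().__init__()
        # Branch 1: plain 1x1 convolution
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1x1, 1), nn.ReLU(inplace=True))
        # Branch 2: 1x1 dimensionality reduction, then 3x3 (padding keeps size)
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, c3x3_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3x3_red, c3x3, 3, padding=1), nn.ReLU(inplace=True))
        # Branch 3: 1x1 dimensionality reduction, then 5x5
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c5x5_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5x5_red, c5x5, 5, padding=2), nn.ReLU(inplace=True))
        # Branch 4: 3x3 max pooling, then 1x1 projection
        self.b4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Every branch preserves the spatial size, so the four outputs are
        # concatenated along the channel dimension
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
```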

What is the use of the 1x1 convolution kernel?
The main purpose of the 1x1 convolution is dimensionality reduction (it is also followed by a rectified linear activation, ReLU). For example, suppose the previous layer outputs 100x100x128. Passing it through a 5x5 convolutional layer with 256 channels (stride=1, pad=2) yields 100x100x256, and the convolutional layer has 128x5x5x256 = 819,200 parameters. If the output instead first passes through a 1x1 convolutional layer with 32 channels and then through a 5x5 convolutional layer with 256 outputs, the result is still 100x100x256, but the convolution parameter count drops to 128x1x1x32 + 32x5x5x256 = 208,896, roughly a 4x reduction.
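The arithmetic can be checked directly (weight counts only, biases ignored):

```python
direct = 128 * 5 * 5 * 256                          # 5x5 conv straight from 128 channels
reduced = 128 * 1 * 1 * 32 + 32 * 5 * 5 * 256       # 1x1 bottleneck to 32, then 5x5
print(direct, reduced, round(direct / reduced, 2))  # 819200 208896 3.92
```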

The GoogLeNet network built from Inception modules is structured as follows (22 layers in total):

The above figure is explained as follows:
(1) GoogLeNet adopts a modular structure (Inception structure), which is convenient for addition and modification;
(2) The network replaces the fully connected layer at the end with average pooling, an idea from NIN (Network in Network) that was shown to improve top-1 accuracy by about 0.6%. A fully connected layer is actually still added after the pooling, mainly to allow flexible adjustment of the output;
(3) Although the full connections are removed, Dropout is still used in the network;
(4) To mitigate vanishing gradients, the network adds two auxiliary softmax branches (auxiliary classifiers) that conduct gradients toward earlier layers. Each auxiliary classifier performs a classification from the output of an intermediate layer and contributes to the final result with a small weight (0.3). This amounts to a kind of model fusion, adds extra back-propagated gradient signal to the network, and provides additional regularization, all of which benefits training. In actual testing, these two extra softmax branches are removed. A rough sketch of such a head follows below.
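For illustration, here is a rough PyTorch sketch of one auxiliary head, following the paper's description (5x5 average pooling with stride 3, a 1x1 convolution, then two fully connected layers with dropout); the 4x4 flattened size assumes the 14x14 feature maps that GoogLeNet's two auxiliary heads attach to, so treat this as illustrative rather than a faithful port:

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    def __init__(self, in_ch, num_classes=1000):
        super().__init__()
        self.pool = nn.AvgPool2d(5, stride=3)   # 14x14 -> 4x4
        self.conv = nn.Conv2d(in_ch, 128, 1)    # 1x1 dimensionality reduction
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)
        self.drop = nn.Dropout(0.7)
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        x = torch.relu(self.conv(self.pool(x)))
        x = self.drop(torch.relu(self.fc1(torch.flatten(x, 1))))
        return self.fc2(x)

# During training the auxiliary losses are mixed in with weight 0.3, e.g.
#   loss = main_loss + 0.3 * (aux1_loss + aux2_loss)
# and at test time the two heads are simply discarded.
```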

The details of the network structure diagram of GoogLeNet are as follows:
(Table: the layer-by-layer configuration of GoogLeNet)
Note: "#3x3 reduce" and "#5x5 reduce" in the table indicate the number of 1x1 convolutions applied before the 3x3 and 5x5 convolution operations, respectively.

The GoogLeNet structure table is analyzed in detail below:
0. Input
The original input image is 224x224x3, preprocessed by zero-centering (the mean is subtracted from each pixel of the image).
1. First layer (convolutional layer)
A 7x7 convolution (stride 2, padding 3) with 64 channels gives a 112x112x64 output, followed by ReLU.
A 3x3 max pooling (stride 2) then gives ((112 - 3 + 1) / 2) + 1 = 56, i.e. 56x56x64, again followed by ReLU.
2. Second layer (convolutional layer)
A 3x3 convolution (stride 1, padding 1) with 192 channels gives 56x56x192, followed by ReLU.
A 3x3 max pooling (stride 2) then gives ((56 - 3 + 1) / 2) + 1 = 28, i.e. 28x28x192, again followed by ReLU. The size arithmetic is made explicit in the helper sketched below.
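The sizes in this walkthrough follow the standard output-size formula, sketched here as a small helper; note that GoogLeNet's pooling layers round up (ceil mode), which is what makes 112 -> 56 and 56 -> 28 come out exactly:

```python
import math

def out_size(n, k, s, p=0, ceil_mode=False):
    """Output size of a conv/pool layer: floor-or-ceil((n + 2p - k) / s) + 1."""
    v = (n + 2 * p - k) / s
    return (math.ceil(v) if ceil_mode else math.floor(v)) + 1

print(out_size(224, 7, 2, p=3))             # 112: first 7x7 conv, stride 2
print(out_size(112, 3, 2, ceil_mode=True))  # 56:  3x3 max pool, stride 2
print(out_size(56, 3, 2, ceil_mode=True))   # 28:  3x3 max pool, stride 2
```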
3a. Third layer (Inception 3a)
The module splits into four branches that process the input with kernels of different scales:
(1) 64 1x1 convolutions, then ReLU, output 28x28x64;
(2) 96 1x1 convolutions as dimensionality reduction before the 3x3 convolution, giving 28x28x96, then ReLU, then 128 3x3 convolutions (padding 1), output 28x28x128;
(3) 16 1x1 convolutions as dimensionality reduction before the 5x5 convolution, giving 28x28x16, then ReLU, then 32 5x5 convolutions (padding 2), output 28x28x32;
(4) a 3x3 max pooling layer (padding 1), output 28x28x192, then 32 1x1 convolutions, output 28x28x32.
The four results are concatenated along the channel (third) dimension: 64 + 128 + 32 + 32 = 256, for a final output of 28x28x256.
3b. Third layer (Inception 3b)
(1) 128 1x1 convolutions, then ReLU, output 28x28x128;
(2) 128 1x1 convolutions as dimensionality reduction before the 3x3 convolution, giving 28x28x128, then ReLU, then 192 3x3 convolutions (padding 1), output 28x28x192;
(3) 32 1x1 convolutions as dimensionality reduction before the 5x5 convolution, giving 28x28x32, then ReLU, then 96 5x5 convolutions (padding 2), output 28x28x96;
(4) a 3x3 max pooling layer (padding 1), output 28x28x256, then 64 1x1 convolutions, output 28x28x64.
The four results are concatenated along the channel dimension: 128 + 192 + 96 + 64 = 480, for a final output of 28x28x480. A shape check using the module sketched earlier follows.
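As a sanity check, the InceptionV1Block sketch from the Inception V1 section reproduces these shapes when fed the channel counts from the 3a/3b breakdown above:

```python
import torch

# Assumes the InceptionV1Block class sketched earlier in this post.
x = torch.randn(1, 192, 28, 28)                      # output of the second layer
inc3a = InceptionV1Block(192, 64, 96, 128, 16, 32, 32)
inc3b = InceptionV1Block(256, 128, 128, 192, 32, 96, 64)
y = inc3a(x)
print(y.shape)         # torch.Size([1, 256, 28, 28])
print(inc3b(y).shape)  # torch.Size([1, 480, 28, 28])
```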

The fourth layer (4a, 4b, 4c, 4d, 4e) and the fifth layer (5a, 5b) follow the same pattern as 3a and 3b and are not repeated here.

GoogLeNet's experimental results show a clear effect: its error rate is lower than that of MSRA, VGG and other models, as the following comparison table shows:

2. Inception V2
Thanks to its excellent performance, GoogLeNet was studied and used by many researchers, and the GoogLeNet team explored further improvements, producing an upgraded version.
GoogLeNet was designed to be both accurate and fast. Simply stacking more Inception modules would raise accuracy but significantly reduce computational efficiency, so the question became how to increase the network's expressiveness without increasing the amount of computation.
Inception V2's answer is to modify the internal computation logic of the Inception module, proposing a special factorized "convolution" structure.

1. Factorizing Convolutions
Large convolution kernels bring a larger receptive field but also many more parameters: a 5x5 kernel has 25 parameters while a 3x3 kernel has 9, so the former costs 25/9 ≈ 2.78 times as much. The GoogLeNet team therefore proposed that a single 5x5 convolutional layer can be replaced by a small network of two stacked 3x3 convolutional layers, reducing the parameter count while keeping the same receptive field, as shown in the figure below:
(Figure: a 5x5 convolution replaced by two stacked 3x3 convolutions)
Does this substitution cause a drop in expressiveness? Extensive experiments showed no loss of expressive power.
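Counting weights per input/output channel pair confirms the saving (both options cover a 5x5 receptive field):

```python
print(5 * 5)      # 25 weights for one 5x5 kernel
print(2 * 3 * 3)  # 18 weights for two stacked 3x3 kernels, a ~28% reduction
```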
Since a large kernel can be completely replaced by a stack of 3x3 convolutions, can the decomposition go further, to even smaller kernels? The GoogLeNet team considered asymmetric nx1 kernels; as shown in the figure below, a 3x3 convolution can be replaced by a 3x1 convolution followed by a 1x3 convolution:
(Figure: a 3x3 convolution factorized into one-dimensional convolutions)
More generally, any nxn convolution can be replaced by a 1xn convolution followed by an nx1 convolution. The GoogLeNet team found that this decomposition does not work well in the early layers of the network; it is better applied to medium-sized feature maps (sizes between 12 and 20 are recommended).
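A minimal PyTorch sketch of the asymmetric factorization for n = 7; the 192 channels and the 17x17 map are illustrative choices on my part, picked so the map size is in the recommended 12 to 20 range:

```python
import torch
import torch.nn as nn

# 1x7 followed by 7x1, with padding chosen so the spatial size is preserved
asym = nn.Sequential(
    nn.Conv2d(192, 192, kernel_size=(1, 7), padding=(0, 3)), nn.ReLU(inplace=True),
    nn.Conv2d(192, 192, kernel_size=(7, 1), padding=(3, 0)), nn.ReLU(inplace=True))

x = torch.randn(1, 192, 17, 17)
print(asym(x).shape)  # torch.Size([1, 192, 17, 17])
# Weights: 2 * 7 * 192 * 192 versus 7 * 7 * 192 * 192 for a full 7x7, a 3.5x saving
```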

2. Reducing the size of the feature map
In general, there are two ways to shrink an image:
(Figure: pooling before Inception vs. Inception before pooling)
Either pool first and then apply the Inception convolutions, or apply the Inception convolutions first and then pool. Method 1 (left) pools first, which creates a representational bottleneck (features are lost); method 2 (right) reduces normally but is computationally very expensive. To preserve the feature representation and reduce the computation at the same time, the structure was changed to the one shown below, using two parallelized modules to cut the computation (convolution and pooling are executed in parallel and then concatenated).
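A minimal sketch of the parallel reduction: a stride-2 convolution branch and a stride-2 pooling branch run side by side, then concatenate (channel counts here are illustrative, not values from the paper's tables):

```python
import torch
import torch.nn as nn

class ReductionBlock(nn.Module):
    def __init__(self, in_ch, conv_ch):
        super().__init__()
        # Convolution branch: 3x3, stride 2
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, conv_ch, 3, stride=2), nn.ReLU(inplace=True))
        # Pooling branch: 3x3 max pool, stride 2
        self.pool = nn.MaxPool2d(3, stride=2)

    def forward(self, x):
        # Both branches halve the grid; concatenation keeps the representation rich
        return torch.cat([self.conv(x), self.pool(x)], dim=1)

x = torch.randn(1, 320, 35, 35)
print(ReductionBlock(320, 320)(x).shape)  # torch.Size([1, 640, 17, 17])
```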

The network structure of Inception V2, the improved GoogLeNet, is as follows:
(Table: the Inception V2 network configuration)
Note: in the table above, Figure 5 refers to the unevolved Inception module, Figure 6 to the small-convolution version (3x3 kernels replacing the 5x5 kernel), and Figure 7 to the asymmetric version (1xn and nx1 kernels replacing the nxn kernel).

Experiments showed a large improvement in model results over the old GoogLeNet, as the following table shows:

3. Inception V3
One of the most important improvements in Inception V3 is factorization: the 7x7 convolution is decomposed into two one-dimensional convolutions (1x7 and 7x1), and likewise the 3x3 convolution into (1x3 and 3x1). This not only accelerates computation; splitting one convolution into two also deepens the network and increases its nonlinearity (each added layer comes with its own ReLU).
In addition, the network input size changed from 224x224 to 299x299.

4. Inception V4
Inception V4 studies the combination of the Inception module with residual connections. The ResNet structure greatly deepens networks, greatly speeds up training, and also improves performance (for the technical principles of ResNet, see the earlier post on this blog: Dahua Deep Residual Network ResNet).
Inception V4 mainly uses residual connections (Residual Connection) to improve the V3 structure, producing the Inception-ResNet-v1, Inception-ResNet-v2 and Inception-v4 networks.
The residual structure of ResNet is as follows:
(Figure: the ResNet residual block)
Combining this structure with Inception yields the module shown below:
(Figure: an Inception module with a residual connection)
By combining 20 or so such modules, Inception-ResNet is constructed as follows:

(Figure: the overall Inception-ResNet architecture)
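To make the residual combination concrete, here is a minimal sketch of an Inception-ResNet-style block: an Inception-like branch whose output is scaled down and added onto the shortcut. The branch widths and the 0.1 scale are illustrative; the real modules in the figure above are wider and vary by stage:

```python
import torch
import torch.nn as nn

class InceptionResBlock(nn.Module):
    def __init__(self, ch, scale=0.1):
        super().__init__()
        self.scale = scale
        # Simplified Inception-style branch; the final 1x1 projection restores
        # ch channels so the output can be added back to the input
        self.branch = nn.Sequential(
            nn.Conv2d(ch, ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 4, ch // 4, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 4, ch, 1))

    def forward(self, x):
        # Residual connection: shortcut plus scaled branch output
        return torch.relu(x + self.scale * self.branch(x))

x = torch.randn(1, 256, 35, 35)
print(InceptionResBlock(256)(x).shape)  # torch.Size([1, 256, 35, 35])
```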

Strongly recommended

From 2014 to 2016, the GoogLeNet team published several classic papers on GoogLeNet: "Going deeper with convolutions", "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", "Rethinking the Inception Architecture for Computer Vision", and "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning". These papers describe the ideas and technical principles of Inception v1, v2, v3 and v4 in detail, and reading them is recommended for a thorough understanding of GoogLeNet.

Follow my official account "Big Data and Artificial Intelligence Lab" (BigdataAILab) and reply with the keyword "paper" to read these classic papers online.

 
