[PyTorch Neural Network Theory] 47 Image Classification Model: Inception Models

1 Inception series model

The Inception series of models includes V1, V2, V3, V4, and other versions, which mainly address three problems of deep networks:

  • The training data set is limited while the number of parameters is large, so the network overfits easily;
  • The larger the network, the greater the computational cost, making it difficult to apply in practice;
  • The deeper the network, the more the gradient attenuates as it propagates backward (gradient vanishing), making the model difficult to optimize.

1.1 Multi-branch structure

The original Inception model adopts a multi-branch structure (see Figure 1-1), which stacks 1×1 convolution, 3×3 convolution, 5×5 convolution, and max pooling in parallel. This structure not only increases the width of the network but also enhances its adaptability to features of different scales.

The Inception model contains three convolutions of different sizes and one max pooling, which enhances the network's adaptability to features of different scales.

The Inception model can efficiently expand the depth and width of the network and improve the accuracy.

The Inception model itself acts like a small network inside a larger one, and its structure can be stacked repeatedly to form a large network.
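The multi-branch idea described above can be sketched as a small PyTorch module. This is an illustrative sketch, not the paper's exact block: the class name and channel counts are assumptions, and each branch keeps the spatial size so the outputs can be concatenated.

```python
import torch
import torch.nn as nn

class NaiveInception(nn.Module):
    """Sketch of the original multi-branch Inception block: parallel 1x1,
    3x3, and 5x5 convolutions plus a 3x3 max pooling, concatenated along
    the channel dimension. Names and channel counts are illustrative."""
    def __init__(self, in_ch, c1, c3, c5):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, c3, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, c5, kernel_size=5, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        # Every branch preserves the spatial size, so the four outputs
        # can be concatenated along the channel dimension (dim=1).
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)
```

Because the block maps `in_ch` channels to `c1 + c3 + c5 + in_ch` channels while keeping height and width, such blocks can be stacked repeatedly, which is exactly the "small network in a large network" property.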

1.2 Global average pooling (source: Network in Network)

Global average pooling applies an average pooling layer whose filter size equals the entire feature map. It is generally used to replace the final fully connected layer in a deep network.

1.2.1 Specific usage of global average pooling

After the convolutional stages, global average pooling reduces each entire feature map to a single value, so each feature map corresponds to one output feature, and that feature represents a score for one output class.

That is, for a 1000-class classification task, the last layer should produce 1000 feature maps, so the class scores can be read off directly.
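A minimal sketch of this classification head in PyTorch: the last convolution emits one feature map per class, and global average (mean) pooling collapses each map to a single score. The 512 input channels and 7×7 spatial size here are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Classification head for a 1000-class task: 1000 feature maps, then
# global average pooling instead of a fully connected output layer.
head = nn.Sequential(
    nn.Conv2d(512, 1000, kernel_size=1),   # 512 input channels is an assumption
    nn.AdaptiveAvgPool2d(1),               # global average pooling -> (N, 1000, 1, 1)
    nn.Flatten(),                          # (N, 1000, 1, 1) -> (N, 1000)
)
logits = head(torch.randn(2, 512, 7, 7))
print(logits.shape)  # torch.Size([2, 1000])
```

`nn.AdaptiveAvgPool2d(1)` is convenient here because it averages over the whole feature map regardless of the input's spatial size.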

1.2.2 Implementation in Network in Network

The author used it for 1000-class classification, finally designing a 4-layer NIN (Network In Network) followed by global average pooling, as shown in Figure 1-2.

2 Inception V1 model

The Inception V1 model improves on the original Inception model. The reason is that in the original Inception model, every convolution kernel operates on the full output of the previous layer, so the 5×5 convolution requires a very large amount of computation and produces a very thick feature map.

To avoid this, the Inception V1 model adds a 1×1 convolution before the 3×3 convolution, before the 5×5 convolution, and after the 3×3 max pooling to reduce the thickness of the feature maps (the 1×1 convolution is mainly used for dimensionality reduction).

2.1 InceptionV1 model diagram

There are 4 branches in the InceptionV1 model.

2.1.1 Branch introduction

The first branch performs a 1×1 convolution on the input. A 1×1 convolution can organize information across channels and improve the expressive power of the network, and it can also increase or reduce the number of output channels.

The second branch applies a 1×1 convolution first and then a 3×3 convolution, which amounts to two feature transformations.

The third branch applies a 1×1 convolution first and then a 5×5 convolution.

The fourth branch applies a 1×1 convolution directly after a 3×3 max pooling.

2.1.2 InceptionV1 model diagram

2.1.3 Features of InceptionV1

All four branches use a 1×1 convolution: some branches use only the 1×1 convolution, while others follow the 1×1 convolution with a convolution of another size.

Because the 1×1 convolution is very cheap, a layer of feature transformation and nonlinearity can be added at a small computational cost. The four branches of the InceptionV1 model are finally merged by a concatenation operation (using the torch.cat function along the output-channel dimension).
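The four branches above can be sketched as one PyTorch module. This is a sketch with illustrative channel counts (not the exact numbers used in GoogLeNet); the key points are the 1×1 reductions and the final `torch.cat` along the channel dimension.

```python
import torch
import torch.nn as nn

class InceptionV1Block(nn.Module):
    """Sketch of the four InceptionV1 branches; channel counts are
    illustrative assumptions, not the paper's exact configuration."""
    def __init__(self, in_ch):
        super().__init__()
        # Branch 1: plain 1x1 convolution
        self.b1 = nn.Conv2d(in_ch, 64, 1)
        # Branch 2: 1x1 reduction, then 3x3 convolution
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 96, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(96, 128, 3, padding=1))
        # Branch 3: 1x1 reduction, then 5x5 convolution
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(16, 32, 5, padding=2))
        # Branch 4: 3x3 max pooling, then 1x1 convolution
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))

    def forward(self, x):
        # torch.cat merges the branches along the channel dimension (dim=1):
        # 64 + 128 + 32 + 32 = 256 output channels here.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
```

With these assumed channel counts, a 192-channel input comes out with 256 channels and an unchanged spatial size, so blocks can be chained.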

4 Inception V2 model

4.1 Improvement measures of Inception V2 model

The Inception V2 model adds a BN (batch normalization) layer after each convolution of the InceptionV1 model, so that the output of each layer is normalized, reducing the internal covariate shift problem; it also uses gradient clipping to increase training stability.

The InceptionV2 model also draws on the VGG model, replacing the 5×5 convolution in the InceptionV1 model with two 3×3 convolutions, which reduces the number of parameters and improves speed.
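Both V2 changes can be sketched in a few lines: a Conv-BN-ReLU building block, and a 5×5 convolution replaced by two stacked 3×3 convolutions (which cover the same receptive field with fewer weights). The 64→64 channel counts are illustrative assumptions.

```python
import torch.nn as nn

# Conv followed by BatchNorm, as Inception V2 adds after each convolution.
def conv_bn(in_ch, out_ch, k):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

five = conv_bn(64, 64, 5)                                           # one 5x5
two_threes = nn.Sequential(conv_bn(64, 64, 3), conv_bn(64, 64, 3))  # two 3x3

params = lambda m: sum(p.numel() for p in m.parameters())
# 5x5: 64*64*25 weights; two 3x3: 2 * 64*64*9 weights -- fewer parameters.
print(params(five), params(two_threes))
```

For these channel counts the 5×5 block has 102,528 parameters versus 73,984 for the two 3×3 blocks, a reduction of roughly 28%.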

4.2 Inception V2 Model 

5 Inception V3 model

The Inception V3 model does not introduce new techniques; it simply makes the convolution kernels of Inception V2 even smaller.

5.1 Specific practices of the Inception V3 model

It decomposes the 3×3 convolution of the Inception V2 model into two one-dimensional convolutions (1×3 and 3×1). This method follows a principle from linear algebra: an [n, n] matrix can be decomposed into the product of an [n, 1] matrix and a [1, n] matrix.
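The factorization can be checked directly in PyTorch: a 1×3 convolution followed by a 3×1 convolution produces the same output shape as a full 3×3 convolution, with fewer parameters. The 64-channel count here is an illustrative assumption.

```python
import torch
import torch.nn as nn

# V3 factorization sketch: a 3x3 convolution split into a 1x3 followed
# by a 3x1 (channel counts are illustrative assumptions).
full = nn.Conv2d(64, 64, kernel_size=3, padding=1)
factored = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=(1, 3), padding=(0, 1)),
    nn.Conv2d(64, 64, kernel_size=(3, 1), padding=(1, 0)),
)

x = torch.randn(1, 64, 17, 17)
assert full(x).shape == factored(x).shape  # same output size
# 9 weights per filter position vs 3 + 3 = 6: about a third fewer weights.
```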

5.2 Worked computation example for the Inception V3 model

Assume there are 256 input features and 256 output features, and suppose the Inception layer can only perform 3×3 convolutions: that is 256 × 256 × 3 × 3 convolutions (589824 multiply-accumulate operations).

Now suppose we reduce the number of features to be convolved to 64 (i.e. 256/4). In this case, a 1×1 convolution first maps the features from 256→64, the 3×3 convolution in the Inception layer is then applied to the 64 features, and finally a 1×1 convolution maps the features from 64→256:

256 × 64 × 1 × 1 = 16384

64 × 64 × 3 × 3 = 36864

64 × 256 × 1 × 1 = 16384

Compared with the previous 589824 operations, there are now 69632 (16384 + 36864 + 16384) in total, roughly an 8× reduction.
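The counts quoted above can be reproduced with a few lines of arithmetic (multiply-accumulates per spatial position, weights only):

```python
# Reproducing the multiply-accumulate counts from the text.
direct = 256 * 256 * 3 * 3      # plain 3x3 on 256 channels
reduce_ = 256 * 64 * 1 * 1      # 1x1 bottleneck: 256 -> 64
conv = 64 * 64 * 3 * 3          # 3x3 on the reduced features
expand = 64 * 256 * 1 * 1       # 1x1 expansion: 64 -> 256
print(direct, reduce_ + conv + expand)  # 589824 69632
```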

In actual tests, this structure does not work well in the first few layers, but it is clearly effective in the middle layers, where the feature map size is between 12 and 20, and it can also greatly increase computation speed. In addition, the network input is changed from 224×224 to 299×299, and modules for 35×35, 17×17, and 8×8 feature maps are designed.

6 Inception V4 model

6.1 Inception V4 Model Improvement Methods

On the basis of the InceptionV3 model, residual connections are incorporated and the structure is optimized and adjusted. The combination of the two yields two excellent network models.

6.2 Inception V4 model

The Inception V4 model merely extends the InceptionV3 model from 4 convolution branches to 6 convolution branches; it does not use residual connections.

6.3 Inception-ResNet V2 Model

The Inception-ResNet V2 model is essentially the InceptionV3 model with the residual connections of the ResNet model added; it is a combination of the InceptionV3 model and the ResNet model.

Residual connections have the effect of improving the accuracy of the network in the Inception model without increasing the amount of computation.
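The idea can be sketched as a small residual Inception block. This is an illustrative sketch, not the paper's exact block: the class name, branch layout, and channel counts are assumptions; the essential pattern is an Inception-style multi-branch transform whose output is projected back to the input width and added to the input.

```python
import torch
import torch.nn as nn

class InceptionResidualBlock(nn.Module):
    """Illustrative sketch of an Inception block with a ResNet-style
    residual connection; channel counts are assumptions."""
    def __init__(self, ch):
        super().__init__()
        self.b1 = nn.Conv2d(ch, 32, 1)
        self.b2 = nn.Sequential(nn.Conv2d(ch, 32, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(32, 32, 3, padding=1))
        # 1x1 projection restores the input channel count before the add.
        self.project = nn.Conv2d(64, ch, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        branches = torch.cat([self.b1(x), self.b2(x)], dim=1)
        return self.relu(x + self.project(branches))  # residual connection
```

Because input and output have identical shapes, the addition is essentially free, which matches the claim that residual connections improve accuracy without adding computation.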

6.4 Comparison between the Inception-ResNet V2 model and Inception V4

In the case of similar network complexity, the Inception-ResNetV2 model is slightly better than the Inception V4 model.

By combining three Inception-ResNet models with residual connections and one InceptionV4 model, an error rate of 3.08% can be obtained on ImageNet.


Origin blog.csdn.net/qq_39237205/article/details/123925124