Convolution-based image classification and recognition (4): GoogLeNet (V1~V4 & Xception)

This column introduces classic and cutting-edge deep learning models for image recognition and will be continuously updated, including but not limited to: AlexNet, ZFNet, VGG, GoogLeNet, ResNet, DenseNet, SENet, MobileNet, ShuffleNet, EfficientNet, Vision Transformer, Swin Transformer, Visual Attention Network, ConvNeXt, MLP-Mixer, AS-MLP, ConvMixer, MetaFormer



Foreword

In 2014, GoogLeNet and VGGNet were the two heroes of the ImageNet Challenge (ILSVRC14): GoogLeNet took first place in the image classification task, with VGG close behind. What the two architectures have in common is greater network depth. VGG inherited much of its framework from LeNet and AlexNet, while GoogLeNet attempted a bolder network design: although it is 22 layers deep, its size is much smaller than that of AlexNet and VGG. GoogLeNet has about 5 million parameters; AlexNet has roughly 12 times as many, and VGGNet in turn has about 3 times as many as AlexNet. Therefore, when memory or computing resources are limited, GoogLeNet is the better choice, and in terms of results its performance is also superior.

A bit of trivia: GoogLeNet is a deep network developed by Google. Why is it called "GoogLeNet" rather than "GoogleNet"? Reportedly, the capital "L" pays tribute to LeNet, hence the name "GoogLeNet".


GoogLeNetV1 paper: Going Deeper with Convolutions
GoogLeNetV1 paper link:
https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Szegedy_Going_Deeper_With_2015_CVPR_paper.pdf
GoogLeNetV1 PyTorch implementation: https://github.com/Arwin-Yu/Deep-Learning-Classification-Models-Based-CNN-or-Attention

GoogLeNetV2 paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
GoogLeNetV2 paper link: http://proceedings.mlr.press/v37/ioffe15.pdf

GoogLeNetV3 paper: Rethinking the Inception Architecture for Computer Vision
GoogLeNetV3 paper link:
https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Szegedy_Rethinking_the_Inception_CVPR_2016_paper.pdf

GoogLeNetV4 paper: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
GoogLeNetV4 paper link: https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/viewPDFInterstitial/14806/14311

GoogLeNetV5 paper: Xception: Deep Learning with Depthwise Separable Convolutions
GoogLeNetV5 paper link:
https://openaccess.thecvf.com/content_cvpr_2017/papers/Chollet_Xception_Deep_Learning_CVPR_2017_paper.pdf

1. GoogLeNet V1

1. Motivation

Generally speaking, the most direct way to improve network performance is to increase the depth and width of the network, or the size of the input data, but this approach has the following problems:
(1) Too many parameters: if the training set is limited, overfitting occurs easily;
(2) The larger the network and the more parameters, the greater the computational cost, making it hard to deploy;
(3) The deeper the network, the more prone it is to gradient dispersion (gradients tend to vanish as they propagate backward), making the model difficult to optimize.
Hence the quip that "deep learning" is really "deep parameter tuning".

The paper argues that the fundamental way to address these problems is to convert fully connected layers, and even general convolutions, into sparse connections. On the one hand, connectivity in real biological nervous systems is also sparse (when the nervous system transmits a piece of information, only a small number of neurons are activated while most remain unresponsive). On the other hand, a neural network could construct an optimal topology layer by layer by analyzing the statistical properties of activations and clustering highly correlated outputs. This suggests that bloated networks can be sparsified without loss of performance.

Earlier convolutional networks (of the LeNet era) used random sparse connections (only a randomly selected subset of the feature maps produced by each convolution was sent on to subsequent computation), partly to break network symmetry and improve learning ability, and partly because hardware capability was limited. However, computer hardware and software are very inefficient on non-uniform sparse data, so AlexNet re-enabled dense convolution (all feature maps produced by each convolution are sent to subsequent computation) in order to better exploit parallel computation.

The question, then, is whether there is a way to preserve the sparsity of the network structure while exploiting the high computational performance of dense matrices. A large body of literature shows that sparse matrices can be clustered into denser sub-matrices to improve computing performance. Based on this, the paper proposes a structure called Inception to achieve that goal.

2. Architectural Details

The Inception structure proposed by the authors is shown in the figure below; it groups convolutions with different kernel sizes (plus a pooling branch) into a single module.

inception

Specifically, the input is copied into four identical streams and sent to four different branches, each performing a convolution with a different kernel size. The feature maps computed by the four branches are then concatenated along the channel dimension into one set of feature maps and passed to subsequent operations, as shown below:

inception

Notes on the figure above:

  • The convolution kernel size is a hyperparameter; there is no rigorous mathematical theory proving which kernel size is best for extracting features, so GoogLeNet takes the pragmatic approach: use them all.
  • Using kernels of different sizes means computing with different receptive field sizes, and the final concatenation fuses features of different scales;
  • Kernel sizes of 1, 3, and 5 are chosen mainly for ease of alignment: with stride=1 and pad=0, 1, 2 respectively, the convolutions produce feature maps of identical spatial dimensions, which can then be concatenated directly along the channel dimension;
  • The paper notes that pooling has proven effective in many networks, so a pooling branch is also embedded in Inception;
  • The deeper the network, the more abstract the features and the larger the receptive field each feature covers, so as the number of layers increases, the ratio of 3x3 and 5x5 convolutions increases as well.

However, a 5x5 convolution kernel still incurs a relatively large amount of computation. To address this, the authors first apply a 1x1 convolution for dimensionality reduction.
For example: suppose the previous layer outputs a tensor of shape (100x100x128). Passing it through a 5x5 convolutional layer with 256 output channels (stride=1, pad=2) yields an output of shape (100x100x256), with 128x5x5x256 convolution parameters. If the output of the previous layer instead first passes through a 1x1 convolutional layer with 32 output channels and then through the 5x5 convolutional layer with 256 output channels, the final output shape is still (100x100x256), but the number of convolution parameters is reduced to 128x1x1x32 + 32x5x5x256, roughly a 4x reduction.
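As a quick sanity check, the arithmetic above can be reproduced in a few lines of Python (bias terms ignored):

# Parameter count of a direct 5x5 convolution: 128 input channels, 256 output channels
direct = 128 * 5 * 5 * 256                       # 819,200
# 1x1 reduction to 32 channels first, then the 5x5 convolution
reduced = 128 * 1 * 1 * 32 + 32 * 5 * 5 * 256    # 208,896
print(direct / reduced)                          # ~3.9, i.e. roughly a 4x reduction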
The improved network model substructure is as follows:

improved inception

The code implementation of the improved Inception module is shown below (BasicConv2d, a small Conv2d + ReLU wrapper, is defined in the full implementation in Section 6).

class Inception(nn.Module):
    def __init__(self, in_channels, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj):
        super(Inception, self).__init__()

        self.branch1 = BasicConv2d(in_channels, ch1x1, kernel_size=1)

        self.branch2 = nn.Sequential(
            BasicConv2d(in_channels, ch3x3red, kernel_size=1),
            BasicConv2d(ch3x3red, ch3x3, kernel_size=3, padding=1)   # keep output size equal to input size
        )

        self.branch3 = nn.Sequential(
            BasicConv2d(in_channels, ch5x5red, kernel_size=1),
            BasicConv2d(ch5x5red, ch5x5, kernel_size=5, padding=2)   # keep output size equal to input size
        )

        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            BasicConv2d(in_channels, pool_proj, kernel_size=1)
        )

    def forward(self, x):
        branch1 = self.branch1(x)
        branch2 = self.branch2(x)
        branch3 = self.branch3(x)
        branch4 = self.branch4(x)

        # concatenate the four branches along the channel dimension
        outputs = [branch1, branch2, branch3, branch4]
        return torch.cat(outputs, 1)
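A minimal usage sketch (the channel configuration below is the network's own inception3a block; it assumes the imports and BasicConv2d from Section 6 are in scope):

block = Inception(192, 64, 96, 128, 16, 32, 32)
x = torch.randn(1, 192, 28, 28)
out = block(x)
print(out.shape)   # torch.Size([1, 256, 28, 28]): 64 + 128 + 32 + 32 = 256 channels, spatial size unchanged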

The complete GoogLeNet is built by stacking such Inception blocks, as shown in the following figure:

GoogLeNet architecture
Notes on the figure above:
  • 1. GoogLeNet clearly adopts a modular structure (stage, block, layer). The boxes circled in red, yellow, and green in the figure are different stages; each stage is built by stacking blocks (here, the blocks are Inception modules), and each block in turn consists of several neural network layers. The advantage of such a modular structure is that it makes the model easy to extend and modify;
  • 2. The network ends with average pooling in place of a fully connected layer, which was shown to improve TOP-1 accuracy by 0.6%. Average pooling also allows the network to accept input images of different sizes. A fully connected layer is still added at the very end, mainly for the convenience of later fine-tuning;
  • 3. Dropout is still used before the final fully connected layer;
  • 4. To mitigate vanishing gradients, the network adds two auxiliary softmax classifiers (attached after the second and fourth blocks of the yellow stage in the figure; described in detail later) to help gradients propagate backward through the model. At inference time, these two auxiliary classifiers are removed.

The following figure shows the detailed parameters of the network; the red, yellow, and green stages correspond to the figure above.

GoogLeNetV1 configuration

Summary

The defining feature of the GoogLeNetV1 model is the Inception module, which extracts features with convolution kernels and pooling layers of several sizes simultaneously, increasing the expressiveness and accuracy of the network while reducing the number of parameters. The Inception module contains multiple convolution and pooling branches, each with its own kernel size and stride; during training, the network automatically learns how to weight and combine these branches.

In addition, GoogLeNetV1 uses global average pooling in place of the flattening operation, reducing the number of parameters and helping prevent overfitting. Global average pooling averages each feature map to obtain a feature vector as the final output.

GoogLeNetV1 also introduced the "auxiliary classifier" technique to help the network converge faster. Two auxiliary classifiers are attached to intermediate layers of the network, providing additional supervision signals during training and thereby aiding optimization.

GoogLeNetV1 achieved excellent results in the ImageNet image classification competition, with an accuracy of 74.8%, while its parameter count is only 1/12 that of AlexNet. This shows that GoogLeNetV1 strikes a good balance between parameter count and classification accuracy.

2. GoogLeNetV2

The biggest contribution of GoogLeNetV2 is the proposed Batch Normalization method for normalizing intermediate activations.

1. Motivation

During the period when GoogLeNet V1 appeared, essentially only VGGNet matched its performance, and both were successfully applied in many domains beyond image classification. By contrast, GoogLeNet's computational efficiency is significantly higher than VGGNet's: it has only about 5 million parameters, roughly 1/12 of AlexNet's (GoogLeNet's caffemodel is about 50 MB, while VGGNet's caffemodel is more than 600 MB).

Moreover, the two evolved in different directions; from a certain perspective, VGG pursues network depth, while GoogLeNet pursues network width.

GoogLeNet performs very well, but simply enlarging the Inception structure to build a bigger network leads to numerical instability during training. To improve the training speed and robustness of GoogLeNetV1, the authors proposed normalizing activations inside the model: each training mini-batch is normalized, a technique called Batch Normalization (BN). BN appeared frequently in later network models and became an indispensable component of neural networks. Its main benefits are as follows:

  • BN allows the model to use a larger learning rate without special concern for optimization issues such as exploding or vanishing gradients;
  • BN reduces the model's dependence on weight initialization;
  • BN not only accelerates convergence but also acts as a regularizer, improving the model's generalization;

The authors argue that as parameters change during training, the input distribution of every subsequent layer keeps shifting, and the learning process must make each layer adapt to a new input distribution, forcing us to lower the learning rate and initialize carefully. They call this change in distribution internal covariate shift.

One way to alleviate this problem is to subtract the mean from the network's input during training, which speeds up training. Why does subtracting the mean help? A brief explanation follows:

First, image data are highly correlated, and similar images map to nearby points when abstracted into a high-dimensional space. Assume the distribution is as shown in figure (a) below (each point represents an image, simplified to two dimensions). Since parameters are generally initialized with zero mean, the initial fit y=Wx+b passes near the origin, as shown by the red dashed line in figure (b). The network therefore needs many iterations to gradually reach the fit shown by the purple solid line, i.e., convergence is slow. If we first subtract the mean of the input data, as in figure (c), learning is clearly faster.

internal covariate shift

Another intuitive explanation: BN normalizes the distribution of the input data at each layer, turning the irregular distribution on the left of the figure below into the regular distribution on the right. The arrow indicates the model's search for the optimal solution; the path on the right is clearly more direct and easier.

internal covariate shift

Finally, the formula for BN is as follows:

Input: values of $x$ over a mini-batch: $\mathcal{B}=\left\{x_{1 \ldots m}\right\}$;
Parameters to learn: $\gamma, \beta$;
Output: $\left\{y_i=\mathrm{BN}_{\gamma, \beta}\left(x_i\right)\right\}$.

$$
\begin{aligned}
\mu_{\mathcal{B}} &\leftarrow \frac{1}{m} \sum_{i=1}^m x_i && \text{// mini-batch mean} \\
\sigma_{\mathcal{B}}^2 &\leftarrow \frac{1}{m} \sum_{i=1}^m\left(x_i-\mu_{\mathcal{B}}\right)^2 && \text{// mini-batch variance} \\
\hat{x}_i &\leftarrow \frac{x_i-\mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2+\epsilon}} && \text{// normalize} \\
y_i &\leftarrow \gamma \hat{x}_i+\beta \equiv \mathrm{BN}_{\gamma, \beta}\left(x_i\right) && \text{// scale and shift}
\end{aligned}
$$

Algorithm steps:

  1. Compute the mean $\mu_{\mathcal{B}}$ of each mini-batch along the channel dimension;
  2. Compute the variance $\sigma_{\mathcal{B}}^2$ of each mini-batch along the channel dimension;
  3. Normalize $x$: $x^{\prime}=(x-\mu_{\mathcal{B}}) / \sqrt{\sigma_{\mathcal{B}}^2+\epsilon}$;
  4. Apply the learnable scale and shift parameters $\gamma$ and $\beta$ to the normalized value: $y=\gamma x^{\prime}+\beta$. The scale and shift are introduced because a standard normal distribution is not necessarily what each layer needs; the data may have to be shifted or stretched. They ensure that the features learned so far are preserved after normalization, while the normalization itself still speeds up training. Both $\gamma$ and $\beta$ are learned during training; the four steps are sketched in code below.
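The four steps map directly onto a few tensor operations. Below is a minimal training-time sketch (at inference time, running statistics are used instead, and nn.BatchNorm2d handles all of this in practice):

import torch

def batch_norm_2d(x, gamma, beta, eps=1e-5):
    # x: (N, C, H, W); statistics are computed per channel over N, H, W
    mu = x.mean(dim=(0, 2, 3), keepdim=True)                  # step 1: mini-batch mean
    var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)  # step 2: mini-batch variance
    x_hat = (x - mu) / torch.sqrt(var + eps)                  # step 3: normalize
    return gamma * x_hat + beta                               # step 4: scale and shift

x = torch.randn(8, 64, 28, 28)
gamma = torch.ones(1, 64, 1, 1)    # learnable nn.Parameter in a real layer
beta = torch.zeros(1, 64, 1, 1)    # learnable nn.Parameter in a real layer
y = batch_norm_2d(x, gamma, beta)
print(y.mean().item(), y.std().item())   # ~0 and ~1 with these gamma/beta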

2. Architectural Details

The detailed configuration of the GoogLeNetV2 network is shown below. Apart from adding BN, there are basically no major changes.

GoogLeNetV2 configuration

Summary

The GoogLeNetV2 model introduces Batch Normalization, which makes the network more stable and converge faster. During training, Batch Normalization standardizes the data of each mini-batch, giving every layer a more stable input distribution.

GoogLeNetV2 achieved excellent results in the ImageNet image classification competition, with an accuracy of 78.8%; its performance and efficiency are both better than GoogLeNetV1's.

3. GoogLeNetV3

GoogLeNet Inception V3 was proposed in "Rethinking the Inception Architecture for Computer Vision". The highlights of this paper are:

  • Proposes four general design principles for network structure
  • Introduces convolution factorization to improve efficiency (spatially separable convolution)
  • Introduces an efficient method for reducing feature map size
  • Smooths the sample labels (label smoothing)

1. Motivation

The V1 paper did not clearly describe the considerations behind building the Inception structure. Here, therefore, the authors first give some general guidelines and optimization methods that have proven effective for scaling up networks. These guidelines and methods apply to, but are not limited to, the Inception structure.

2. General Design Principles

  • Principle 1: Avoid representational bottlenecks in the first few layers of the network.

Feature extraction in a neural network consists of many layers of convolutions. An intuitive, common-sense understanding is that if the feature extraction at the front of the network is too coarse, detail information is lost, and no matter how fine the subsequent structure is, the features cannot be effectively represented and combined.

For example, if the input is immediately downsampled from 35×35×320 to 17×17×320, a large amount of feature detail is lost, and even an Inception structure afterwards cannot recover it. Therefore, while the spatial size of the feature map is reduced, the channel dimension is generally increased.

Therefore, as the network deepens, the spatial size of the feature maps should gradually decrease while the number of channels gradually increases, so that features can still be effectively represented and combined. Put simply: convolution extracts features along the spatial dimensions of the image and transfers the extracted features into the channel dimension.

  • Principle 2: Increasing the number of convolutions in the model can disentangle more features and help the network converge.

When output features are independent of one another, the input information is decomposed more thoroughly, and sub-features that are strongly correlated converge more easily when gathered together. Put simply: the more features extracted, the more helpful to downstream tasks. For example, it can be difficult to identify a person knowing only the eyes; but if all the facial features are available, the problem becomes much easier, improving recognition accuracy.

For a certain layer of the neural network, through more output branches, feature representations that are decoupled from each other can be generated, thereby generating more high-order sparse features and accelerating convergence. The specific method is as follows.

First, a small piece of background: a 5×5 convolution can be replaced by two stacked 3×3 convolutions, because the effective receptive field of two stacked 3×3 kernels is also 5×5 (the first layer sees a 3×3 patch of the input; each unit of the second layer sees 3×3 of the first layer's output, covering 5×5 of the input). Similarly, a 3×3 convolution can be replaced by a 3×1 and a 1×3 convolution. A concrete example is shown in the figure, and both equivalences are verified in the code sketch after it.

small conv kernel replace large conv kernel
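Both equivalences are easy to verify in PyTorch; a minimal sketch (the 64-channel count is illustrative):

import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# One 5x5 convolution vs. two stacked 3x3 convolutions:
# same output size, same 5x5 effective receptive field, fewer parameters.
conv5 = nn.Conv2d(64, 64, kernel_size=5, padding=2, bias=False)
conv3x2 = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
)
# One 3x3 convolution vs. a 1x3 followed by a 3x1 (spatially separable).
conv_sep = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=(1, 3), padding=(0, 1), bias=False),
    nn.Conv2d(64, 64, kernel_size=(3, 1), padding=(1, 0), bias=False),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(conv5(x).shape, conv3x2(x).shape, conv_sep(x).shape)   # all torch.Size([1, 64, 32, 32])
print(count(conv5), count(conv3x2), count(conv_sep))         # 102400, 73728, 24576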

Therefore, in order to add more convolutions to the network, GoogLeNetV3 improved Inception as follows: first, the 5×5 convolution in Inception is replaced by two 3×3 convolutions, and then 3×3 convolutions are further replaced by 1×3 and 3×1 convolutions.

GoogLeNetV3 inceptions

It is worth mentioning that an n×n convolution can be decomposed into a 1×n and an n×1 convolution applied in sequence, an operation also called spatially separable convolution (somewhat like matrix factorization). For n=3, computation is reduced by 1−(3+3)/9 ≈ 33%. Its drawback is also obvious: not every convolution kernel can be split into the product of a 1×n and an n×1 kernel. In fact, the authors found that using this factorization in the early layers of the network works poorly; it works well only on moderately sized feature maps (for m×m feature maps, m between 12 and 20 is recommended).

As for the parallel combination of 1×n and n×1 convolutions in the rightmost variant above, it is parallel because the authors want the module to become wider rather than deeper at that point, to avoid a representational bottleneck: making the module deeper instead would reduce dimensionality too aggressively and lose information. This echoes Principles 1 and 2 of the General Design Principles.

  • Principle 3: Reasonably compressing the feature dimensions of the model reduces computation.

Using a 1×1 convolution to reduce the feature dimensionality before feature extraction, as proposed in GoogLeNetV1, applies exactly this principle. Because adjacent units are strongly correlated, much less information is lost when dimensionality is reduced before spatial aggregation. Since these signals are easy to compress, dimensionality reduction can even promote faster learning.

  • Principle 4: Balance the depth and width (number of feature dimensions) of the network.

Both depth and width are important parameters of neural networks. Depth is generally related to the abstraction ability of the network, while width is related to the capacity of the network. When designing a network, an appropriate balance between depth and width needs to be found so that the network can effectively learn complex patterns.

3. Optimize the auxiliary classifier

The auxiliary classifiers in GoogLeNetV1 help gradients flow back during training and act as regularizers to some extent. However, the GoogLeNetV3 authors found a problem with them: the auxiliary classifiers do not accelerate convergence early in training, and only slightly improve accuracy near the end. Therefore, GoogLeNetV3 removes the first auxiliary classifier.

4. Optimize the downsampling operation

In general, there are two ways to shrink a feature map:

Pool methods

Pool first and then apply the Inception convolution, or apply the Inception convolution first and then pool.
Method 1 (left figure) pools first, which creates a representational bottleneck (features are lost);
method 2 (right figure) is the normal reduction, but it is computationally very expensive.

To preserve the feature representation and reduce computation at the same time, the structure is changed to the figure below: two parallel modules reduce the cost (convolution and pooling are executed in parallel with stride 2 and then concatenated). A code sketch follows the figure:

improved Pool method
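A minimal sketch of the idea, assuming illustrative channel counts (the paper's actual grid-reduction blocks use several convolution branches, each with its own 1x1 reduction):

import torch
import torch.nn as nn

class ReductionBlock(nn.Module):
    # A stride-2 convolution branch and a stride-2 pooling branch run in
    # parallel; their outputs are concatenated on the channel dimension.
    def __init__(self, in_channels, conv_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, conv_channels, kernel_size=3, stride=2, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return torch.cat([self.conv(x), self.pool(x)], dim=1)

block = ReductionBlock(320, 320)
x = torch.randn(1, 320, 35, 35)
print(block(x).shape)   # torch.Size([1, 640, 18, 18]): spatial size halved, channels doubled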

5. Optimize the labels

In deep learning, a one-hot vector is usually used as the classification label, indicating the classifier's single correct answer. Such a label resembles the impulse function from signals and systems, also known as the "Dirac delta": it is 1 at one position and 0 everywhere else. This encourages the model to output scores that differ enormously between categories, in other words to trust its own judgment too much. However, in a dataset annotated by multiple people, different annotators may follow different rules, and every annotator makes some mistakes. A model that places too much trust in the labels can overfit.
Label-Smoothing Regularization (LSR) is one effective way to deal with this problem. The idea is to reduce our trust in the hard label, for instance lowering the target slightly from 1 to 0.9 and raising it slightly from 0 to 0.1. In Python code:

New_labels = (1.0 - label_smoothing) * one_hot_labels + label_smoothing / num_classes

In the network implementation, label_smoothing = 0.1 and num_classes = 1000. Label smoothing improved network accuracy by 0.2%: it slightly smooths the abrupt one_hot_labels, preventing the network from over-learning the labels.
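A minimal sketch of both routes, assuming label_smoothing = 0.1 and num_classes = 1000 as above:

import torch
import torch.nn.functional as F

num_classes, label_smoothing = 1000, 0.1
labels = torch.tensor([3, 42])                        # hard class indices
one_hot = F.one_hot(labels, num_classes).float()
smoothed = (1.0 - label_smoothing) * one_hot + label_smoothing / num_classes
print(smoothed.max().item(), smoothed.min().item())   # 0.9001 and 0.0001

# Recent PyTorch versions (1.10+) also build this into the loss directly:
loss_fn = torch.nn.CrossEntropyLoss(label_smoothing=0.1)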

6. Architectural Details

GoogLeNetV3 Architectural Details

7. Summary

The GoogLeNetV3 model improves the performance and efficiency of deep convolutional networks by redesigning the Inception module. It summarizes four general network design principles and optimizes convolution factorization, downsampling, the auxiliary classifiers, and label smoothing. With these techniques, GoogLeNetV3 achieves strong performance on computer vision tasks such as image classification, object detection, and semantic segmentation.

4. GoogLeNetV4

The highlights of "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning" are: it proposes the improved GoogLeNet Inception V4 network structure, and by fusing Inception with residual networks it obtains the GoogLeNet Inception-ResNet structure, whose accuracy is on par with V4 but which trains faster.

1. GoogLeNet Inception V4 network structure

GoogleNet Inception V4

2. GoogLeNet Inception Residual V4 network structure

GoogLeNet Inception Residual V4

All in all, the GoogLeNetV4 model has become more complex. In the author's view, such intricate model design bakes in a great deal of human prior knowledge, unlike a simple model such as ResNet, which lets the model learn and summarize knowledge on its own from large amounts of data; what a model learns by itself often exceeds the upper bound of the inductive bias humans impose. For this reason, these structures are not introduced in detail here.

3. Summary

GoogLeNet V4, also known as Inception V4, is a Convolutional Neural Network (CNN) model, which is an important extension of the Inception series of models. The Inception series of models were originally proposed by Google, aiming to optimize the structure of the network through complex Inception modules to improve efficiency and performance.

The Inception V4 model introduces many improvements, including:

  • Deeper and wider network: Inception V4 has more layers and wider layers, which allows it to learn more complex patterns. However, this also increases the computational complexity.
  • Introducing residual connections: alongside Inception V4, the paper introduces Inception-ResNet, which uses residual connections, a type of skip connection that helps gradients flow through the network. This structure is inspired by ResNet (the residual network), which had already demonstrated its power in deep learning.
  • Further optimized Inception module: Inception V4 further optimized the Inception module, including more branches and more complex structures. This allows the model to better balance width and depth, which improves performance.

Overall, Inception V4 improves the performance and efficiency of the network by introducing a series of optimizations and improvements. However, this also increases the complexity and computational burden of the model. Still, the complexity is worth it considering its excellent performance on a variety of tasks.

5. GoogLeNetV5

Depthwise separable convolution was originally proposed by Laurent Sifre in his doctoral thesis, Rigid-Motion Scattering for Image Classification.
The Xception paper mainly discusses the relationship between Inception and depthwise separable convolution from the perspective of the Inception module, explaining depthwise separable convolution from a new angle. Combined with the classic residual network (see the ResNet article for details), a new architecture, Xception, was born. The name Xception comes from Extreme Inception: Xception is an extreme form of the Inception model.

1. Inception review

The core idea of Inception is to split the channels into several groups processed with different receptive field sizes. Besides capturing different receptive fields, Inception also greatly reduces the number of parameters. Consider the simplified Inception model in Figure 1 below:

simple inception Figure 1

2. Inception improvement

For an input feature map, first producing three sets of feature maps through three separate 1×1 convolutions is completely equivalent to applying a single 1×1 convolution and then splitting the resulting feature maps into three groups. Suppose the figure's 1×1 convolution has $k_1$ kernels, the 3×3 convolutions have $k_2$ kernels in total, and the input feature map has $m$ channels. The parameter count of this simplified version is then:

$m \times k_1+3 \times 3 \times \frac{k_1}{3} \times \frac{k_2}{3} \times 3=m \times k_1+3 \times k_1 \times k_2$

improved simple inception Figure 2

Compared with an ordinary convolution with the same channel counts but no grouping, the ordinary convolution's parameter count is:

$m \times k_1+3 \times 3 \times k_1 \times k_2$

That is, the 3×3 part of the ordinary convolution has about three times as many parameters as the grouped Inception version; see the check below.
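The 3x ratio can be verified with PyTorch's grouped convolution; a minimal sketch with illustrative channel counts:

import torch.nn as nn

k1, k2 = 96, 96   # illustrative channel counts

count = lambda m: sum(p.numel() for p in m.parameters())
normal = nn.Conv2d(k1, k2, kernel_size=3, padding=1, bias=False)
grouped = nn.Conv2d(k1, k2, kernel_size=3, padding=1, groups=3, bias=False)
print(count(normal), count(grouped))   # 82944 vs 27648: exactly 3x fewer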

3. Xception

Inception above splits the 3×3 convolutions into 3 groups. Now consider the extreme case: what if we completely separate the $k_1$ channels produced by Inception's 1×1 convolution, i.e., use $k_1$ different 3×3 kernels, each convolving a single channel? The parameter count becomes:

$m \times k_1+k_1 \times 3 \times 3$

The 3×3 part now has only $1/k_2$ of the parameters of the corresponding ordinary convolution. This form of Inception is called Extreme Inception, as shown in the figure.

improved simple inception Figure 3

4. Comparison with depthwise separable convolution

For a detailed introduction to depthwise separable convolution, see the MobileNet articles in this column. The figure below is a simplified diagram of the depthwise separable convolution operation:

Depthwise Separable Convolution

As can be seen, the two are very similar; the only difference is whether the 1x1 convolution is executed before or after the depthwise convolution. The two were proposed at about the same time, and there is no question of one plagiarizing the other; they revealed the power of depthwise separable convolution from different angles. MobileNet's idea is to reduce parameters by splitting ordinary convolution, while Xception arrives at it by fully decoupling Inception.
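A sketch contrasting the two orderings (channel counts illustrative); both produce the same output shape, differing only in where the 1x1 pointwise convolution sits:

import torch
import torch.nn as nn

m, k1 = 192, 96
x = torch.randn(1, m, 28, 28)

# Extreme Inception (Xception style): pointwise 1x1 first, then per-channel 3x3.
xception_style = nn.Sequential(
    nn.Conv2d(m, k1, kernel_size=1, bias=False),
    nn.Conv2d(k1, k1, kernel_size=3, padding=1, groups=k1, bias=False),
)
# Depthwise separable convolution (MobileNet style): per-channel 3x3 first, then 1x1.
mobilenet_style = nn.Sequential(
    nn.Conv2d(m, m, kernel_size=3, padding=1, groups=m, bias=False),
    nn.Conv2d(m, k1, kernel_size=1, bias=False),
)
print(xception_style(x).shape, mobilenet_style(x).shape)   # both torch.Size([1, 96, 28, 28])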

5. Architectural Details

Xception Architectural Details

6. Summary of GoogLeNetV5

As the final chapter of the GoogLeNet series, the GoogLeNetV5 model is guided by experimental results and abandons the parallel 1×1, 3×3, and 5×5 structure of GoogLeNetV1-V4. Compared with the complex structures of GoogLeNetV4, the simple structure of Xception achieves better performance, which is why GoogLeNetV4 was not introduced in detail here. In the author's view, the design of GoogLeNetV4 puts too much emphasis on human ways of thinking, trying to hand the model a human understanding of images directly. But models and people think differently, so designing a simple structure and letting the model learn autonomously from data is more effective.

6. Code implementation

Here is the Python code for building the V1 model (based on PyTorch). The complete code for the image classification task (including training and inference scripts, custom layers, etc.) can be found on my GitHub: full code link

import torch.nn as nn
import torch
import torch.nn.functional as F


class BasicConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, **kwargs):
        super(BasicConv2d, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, **kwargs)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.conv(x)
        x = self.relu(x)
        return x

class Inception(nn.Module):
    def __init__(self, in_channels, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj):
        super(Inception, self).__init__()

        self.branch1 = BasicConv2d(in_channels, ch1x1, kernel_size=1)

        self.branch2 = nn.Sequential(
            BasicConv2d(in_channels, ch3x3red, kernel_size=1),
            BasicConv2d(ch3x3red, ch3x3, kernel_size=3, padding=1)   # keep output size equal to input size
        )

        self.branch3 = nn.Sequential(
            BasicConv2d(in_channels, ch5x5red, kernel_size=1), 
            BasicConv2d(ch5x5red, ch5x5, kernel_size=5, padding=2)   # keep output size equal to input size
        )

        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            BasicConv2d(in_channels, pool_proj, kernel_size=1)
        )

    def forward(self, x):
        branch1 = self.branch1(x)
        branch2 = self.branch2(x)
        branch3 = self.branch3(x)
        branch4 = self.branch4(x)

        outputs = [branch1, branch2, branch3, branch4]
        return torch.cat(outputs, 1)


class InceptionAux(nn.Module):
    def __init__(self, in_channels, num_classes):
        super(InceptionAux, self).__init__()
        self.averagePool = nn.AvgPool2d(kernel_size=5, stride=3)
        self.conv = BasicConv2d(in_channels, 128, kernel_size=1)  # output[batch, 128, 4, 4]

        self.fc1 = nn.Linear(2048, 1024)
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        # aux1: N x 512 x 14 x 14, aux2: N x 528 x 14 x 14
        x = self.averagePool(x)
        # aux1: N x 512 x 4 x 4, aux2: N x 528 x 4 x 4
        x = self.conv(x)
        # N x 128 x 4 x 4
        x = torch.flatten(x, 1)
        x = F.dropout(x, 0.5, training=self.training)
        # N x 2048
        x = F.relu(self.fc1(x), inplace=True)
        x = F.dropout(x, 0.5, training=self.training)
        # N x 1024
        x = self.fc2(x)
        # N x num_classes
        return x

class GoogLeNet(nn.Module):
    def __init__(self, num_classes=1000, aux_logits=False, init_weights=False):
        super(GoogLeNet, self).__init__()
        self.aux_logits = aux_logits

        self.conv1 = BasicConv2d(3, 64, kernel_size=7, stride=2, padding=3)
        self.maxpool1 = nn.MaxPool2d(3, stride=2, ceil_mode=True)

        self.conv2 = BasicConv2d(64, 64, kernel_size=1)
        self.conv3 = BasicConv2d(64, 192, kernel_size=3, padding=1)
        self.maxpool2 = nn.MaxPool2d(3, stride=2, ceil_mode=True)

        self.inception3a = Inception(192, 64, 96, 128, 16, 32, 32)
        self.inception3b = Inception(256, 128, 128, 192, 32, 96, 64)
        self.maxpool3 = nn.MaxPool2d(3, stride=2, ceil_mode=True)

        self.inception4a = Inception(480, 192, 96, 208, 16, 48, 64)
        self.inception4b = Inception(512, 160, 112, 224, 24, 64, 64)
        self.inception4c = Inception(512, 128, 128, 256, 24, 64, 64)
        self.inception4d = Inception(512, 112, 144, 288, 32, 64, 64)
        self.inception4e = Inception(528, 256, 160, 320, 32, 128, 128)
        self.maxpool4 = nn.MaxPool2d(3, stride=2, ceil_mode=True)

        self.inception5a = Inception(832, 256, 160, 320, 32, 128, 128)
        self.inception5b = Inception(832, 384, 192, 384, 48, 128, 128)

        if self.aux_logits:
            self.aux1 = InceptionAux(512, num_classes)
            self.aux2 = InceptionAux(528, num_classes)

        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.dropout = nn.Dropout(0.4)
        self.fc = nn.Linear(1024, num_classes)
        if init_weights:
            self._initialize_weights()

    def forward(self, x):
        # N x 3 x 224 x 224
        x = self.conv1(x)
        # N x 64 x 112 x 112
        x = self.maxpool1(x)
        # N x 64 x 56 x 56
        x = self.conv2(x)
        # N x 64 x 56 x 56
        x = self.conv3(x)
        # N x 192 x 56 x 56
        x = self.maxpool2(x)

        # N x 192 x 28 x 28
        x = self.inception3a(x)
        # N x 256 x 28 x 28
        x = self.inception3b(x)
        # N x 480 x 28 x 28
        x = self.maxpool3(x)
        # N x 480 x 14 x 14
        x = self.inception4a(x)
        # N x 512 x 14 x 14
        if self.training and self.aux_logits:    # skipped in eval mode
            aux1 = self.aux1(x)

        x = self.inception4b(x)
        # N x 512 x 14 x 14
        x = self.inception4c(x)
        # N x 512 x 14 x 14
        x = self.inception4d(x)
        # N x 528 x 14 x 14
        if self.training and self.aux_logits:    # skipped in eval mode
            aux2 = self.aux2(x)

        x = self.inception4e(x)
        # N x 832 x 14 x 14
        x = self.maxpool4(x)
        # N x 832 x 7 x 7
        x = self.inception5a(x)
        # N x 832 x 7 x 7
        x = self.inception5b(x)
        # N x 1024 x 7 x 7

        x = self.avgpool(x)
        # N x 1024 x 1 x 1
        x = torch.flatten(x, 1)
        # N x 1024
        x = self.dropout(x)
        x = self.fc(x)
        # N x 1000 (num_classes)
        if self.training and self.aux_logits:   # skipped in eval mode
            return x, aux2, aux1
        return x

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)


def googlenet(num_classes):
    model = GoogLeNet(num_classes=num_classes)
    return model
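A quick usage sketch of the factory function above:

model = googlenet(num_classes=5)
model.eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)   # torch.Size([1, 5])

# With auxiliary classifiers enabled, training mode returns three outputs:
model_aux = GoogLeNet(num_classes=5, aux_logits=True, init_weights=True)
model_aux.train()
main_out, aux2, aux1 = model_aux(torch.randn(2, 3, 224, 224))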

Summary

In this article, the GoogLeNet series of models (V1, V2, V3, V4, V5) has been analyzed and explained in detail, from the motivation behind each network's design to its structure.
