Deep Learning: The GoogLeNet Convolutional Neural Network (CNN) Model, Explained in Detail (Theory)

1. GoogLeNet background
2. GoogLeNet improvement history
3. GoogLeNet model structure
4. Features (innovations, advantages, disadvantages, and new concepts)

1. GoogLeNet background

GoogLeNet won first place in the classification task of the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2014), a new convolutional neural network structure that achieved a top-5 error rate of 6.67%, beating models such as VGGNet (which took second place in classification). The accompanying paper, "Going Deeper with Convolutions", was published at CVPR 2015. GoogLeNet is a deep network structure developed by Google; why is it spelled "GoogLeNet" rather than "GoogleNet"? The capital "L" is a pun paying tribute to LeNet, hence the name GoogLeNet.

2. GoogLeNet improvement history

  The architecture continued to be improved over the following two years, producing versions such as Inception V2, Inception V3, and Inception V4. The four papers and their links are listed below.

1.inception-v1: Going deeper with convolutions
https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf

2.inception-v2: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
http://proceedings.mlr.press/v37/ioffe15.pdf

3.inception-v3: Rethinking the Inception Architecture for Computer Vision
https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Szegedy_Rethinking_the_Inception_CVPR_2016_paper.pdf

4.Inception-v4: Inception-ResNet and the Impact of Residual Connections on Learning
https://arxiv.org/abs/1602.07261

3. GoogLeNet model structure (the table is given in the paper)

Take the Inception(3a) module (the first Inception block) as an example:

Input: 28×28×192

First branch:
convolution kernel size 1×1, filters = 64, strides = 1, padding = 'same', output size = 28×28×64

Second branch:
convolution kernel size 1×1, filters = 96, strides = 1, padding = 'same', output size = 28×28×96
convolution kernel size 3×3, filters = 128, strides = 1, padding = 'same', output size = 28×28×128

Third branch:
convolution kernel size 1×1, filters = 16, strides = 1, padding = 'same', output size = 28×28×16
convolution kernel size 5×5, filters = 32, strides = 1, padding = 'same', output size = 28×28×32

Fourth branch:
max pooling, pool_size = 3×3, strides = 1, padding = 'same', output size = 28×28×192
convolution kernel size 1×1, filters = 32, strides = 1, padding = 'same', output size = 28×28×32

Concatenated output size:
28×28×(64+128+32+32) = 28×28×256
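The shape bookkeeping above can be checked with a small plain-Python sketch (no ML framework needed): with `padding='same'` and stride 1 the spatial size is unchanged, and the branch depths simply add up when concatenated.

```python
def conv2d_same_out(h, w, filters, stride=1):
    """Output shape of a 'same'-padded convolution: spatial size is
    ceil(input / stride); depth equals the number of filters."""
    return (-(-h // stride), -(-w // stride), filters)

def inception_3a_output(h=28, w=28):
    # Only the last layer of each branch contributes to the concatenation.
    b1 = conv2d_same_out(h, w, 64)    # branch 1: 1x1 conv, 64 filters
    b2 = conv2d_same_out(h, w, 128)   # branch 2: 1x1 -> 3x3 conv, 128 filters
    b3 = conv2d_same_out(h, w, 32)    # branch 3: 1x1 -> 5x5 conv, 32 filters
    b4 = conv2d_same_out(h, w, 32)    # branch 4: 3x3 max pool -> 1x1 conv, 32 filters
    # Depth-wise concatenation: spatial dims unchanged, depths add up.
    return (h, w, b1[2] + b2[2] + b3[2] + b4[2])

print(inception_3a_output())  # (28, 28, 256)
```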

  The GoogLeNet network is 22 layers deep (27 layers if pooling layers are counted). Before the classifier, it adopts the idea from Network in Network of replacing the fully connected layers with average pooling. After the average pooling, however, one fully connected layer is still kept so that users can fine-tune the model.

4. Features (focusing on V1: innovations, advantages, disadvantages, and new concepts)

1. The Inception structure
  Inception assembles multiple convolution and pooling operations into a single network module; when designing the neural network, the whole structure is then built by stacking such modules. The main idea of the original Inception is to use convolution kernels of different sizes to perceive features at different scales.
  As noted in earlier model write-ups, the usual way to improve network performance is to increase depth and width, but doing so brings several problems:
1) Too many parameters: if the training data set is limited, the network easily overfits;
2) The larger the network and the more parameters it has, the greater the computational cost, making it hard to deploy;
3) The deeper the network, the more prone it is to gradient dispersion (gradients tend to vanish as they propagate backwards), making the model hard to optimize.
We want to increase depth and width while reducing the number of parameters. To reduce parameters, a natural idea is to replace dense connections with sparse ones. The following is the Inception structure diagram in V1.
  The structure has four parallel components: 1×1 convolution, 3×3 convolution, 5×5 convolution, and 3×3 max pooling.
  In earlier networks, layers were executed serially, one after another, and each convolution used a single fixed kernel size. Here, operations such as convolution and pooling run in parallel instead. In practice, images at different scales call for convolution kernels of different sizes to achieve the best performance; even for the same image, kernels of different sizes perform differently, because their receptive fields differ.

example

  Assume the input in the figure above is a 32×32×256 feature map. It is copied into 4 parts and fed to the 4 branches, each using a sliding-window stride of 1. The 1×1 convolutional layer has padding 0, a 1×1×256 sliding window, and a required output depth of 128; the 3×3 convolutional layer has padding 1, a 3×3×256 sliding window, and a required output depth of 192; the 5×5 convolutional layer has padding 2, a 5×5×256 sliding window, and a required output depth of 96; the 3×3 max-pooling layer has padding 1 and a 3×3×256 sliding window. The four branches output feature maps of 32×32×128, 32×32×192, 32×32×96, and 32×32×256, which are merged in the merge layer by concatenating them along the depth axis (not by element-wise addition), giving 32×32×(128+192+96+256) = 32×32×672. The feature map output by this Naive Inception unit therefore has dimension 32×32×672, and the total parameter count is 1×1×256×128 + 3×3×256×192 + 5×5×256×96 = 1,089,536.
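The arithmetic in this example can be verified with a few lines of Python (weights only, biases ignored, as in the text):

```python
def conv_params(k, in_ch, out_ch):
    """Weight count of a k x k convolution from in_ch to out_ch channels."""
    return k * k * in_ch * out_ch

# Naive Inception on a 32x32x256 input, as in the example above.
p1 = conv_params(1, 256, 128)   # 1x1 branch, depth 128 ->  32768
p3 = conv_params(3, 256, 192)   # 3x3 branch, depth 192 -> 442368
p5 = conv_params(5, 256, 96)    # 5x5 branch, depth  96 -> 614400
# The max-pooling branch has no learnable parameters.
total = p1 + p3 + p5
print(total)                    # 1089536

# The merge is a depth-wise concatenation, so the depths add up:
depth = 128 + 192 + 96 + 256
print(depth)                    # 672
```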

2. 1×1 convolution for dimensionality reduction

  Reducing the number of parameters reduces the amount of computation. Inspired by the "Network in Network" model, the Inception unit used in GoogLeNet (Inception V1) inserts an extra 1×1 convolutional layer, followed by a ReLU layer, before the expensive convolutions.
  The main purpose of the 1×1 convolution kernel is to compress the depth dimension and reduce the number of parameters, so that the network can be made deeper and wider and extract features better. This idea is also called Pointwise Convolution, or PW for short.
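A quick sketch of the savings, reusing the 32×32×256 example from before: compare a direct 5×5 convolution to 96 output channels against first reducing to a bottleneck of 64 channels with a 1×1 convolution (the bottleneck width of 64 is an illustrative assumption, not a value from the paper).

```python
def conv_params(k, in_ch, out_ch):
    """Weight count of a k x k convolution from in_ch to out_ch channels."""
    return k * k * in_ch * out_ch

in_ch, out_ch = 256, 96
direct = conv_params(5, in_ch, out_ch)              # 5x5 directly: 614400

reduce_ch = 64  # assumed 1x1 bottleneck width (illustrative)
bottleneck = (conv_params(1, in_ch, reduce_ch)      # 1x1 reduce:   16384
              + conv_params(5, reduce_ch, out_ch))  # 5x5 on 64ch: 153600

print(direct, bottleneck)  # 614400 169984
```

With these (assumed) channel counts, the bottlenecked path uses less than a third of the weights of the direct 5×5 convolution.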

3. Two auxiliary classifiers to help training
  They help in two ways. First, they counteract vanishing gradients by injecting gradient into earlier layers: during backpropagation, if the derivative at one layer is 0, the chain rule makes the whole product 0, so an extra loss at an intermediate layer keeps the gradient signal alive. Second, using the output of an intermediate layer as a classifier acts as a form of model fusion.
  From the information-flow perspective, gradients vanish because their energy decays during backpropagation and cannot reach the shallow layers; opening an outlet in the middle and adding an auxiliary loss function gives the shallow layers a direct training signal.
  The GoogLeNet network contains two such classifiers, one shallower and one deeper. To combat gradient vanishing, the two auxiliary classifiers have exactly the same structure, shown in the figure below; their inputs come from Inception(4a) and Inception(4d).


  The first layer of the auxiliary classifier is an average-pooling down-sampling layer with a 5×5 pooling kernel and stride = 3; the second is a convolutional layer with a 1×1 kernel, stride = 1, and 128 filters; the third is a fully connected layer with 1024 nodes; the fourth is a fully connected layer with 1000 nodes (the number of classes in the classification task).
  The auxiliary classifiers are used only during training and are removed at inference time. They promote more stable learning and better convergence; near the end of training, the network with auxiliary branches starts to surpass the accuracy of the same network without any branches, reaching a higher level.
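The shapes through the auxiliary classifier can be traced in plain Python. Assuming the standard GoogLeNet feature-map size of 14×14×512 at Inception(4a) (the spatial size is the same at 4d, with more channels), the 5×5/stride-3 average pooling is valid-padded:

```python
def pool_valid_out(size, k, stride):
    """Spatial output size of a valid-padded pooling layer."""
    return (size - k) // stride + 1

h = pool_valid_out(14, 5, 3)   # 14x14 input -> 4x4 after 5x5/stride-3 pooling
flat = h * h * 128             # after the 1x1 conv with 128 filters: 4*4*128
fc1 = flat * 1024              # weights of the 1024-node FC layer
fc2 = 1024 * 1000              # weights of the 1000-node FC layer

print(h, flat)  # 4 2048
```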

4. Replacing the fully connected layer with average pooling (greatly reducing model parameters)
  Finally, the network uses average pooling in place of the fully connected layer, an idea that comes from NIN (Network in Network). The paper reports that this improves accuracy by about 0.6%.
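The size of the saving is easy to estimate. GoogLeNet's final Inception output is 7×7×1024; flattening it into a 1000-way fully connected layer versus globally average-pooling first gives (weights only, biases ignored):

```python
h, w, c, classes = 7, 7, 1024, 1000   # final Inception output, 1000 classes

fc_direct = h * w * c * classes        # flatten -> FC: ~50.2M weights
avg_then_fc = c * classes              # global avg pool -> FC: ~1.0M weights

print(fc_direct, avg_then_fc)  # 50176000 1024000
```

Average pooling removes roughly 49 million parameters from the classifier head while keeping the same 1000-way output.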


Origin blog.csdn.net/qq_55433305/article/details/129803089