Interpretation of the GoogLeNet paper: a classic CNN network model

Table of contents

1. GoogLeNet

1.1 Inception module

1.1.1 1x1 convolution

1.2 Auxiliary classifier structure

1.3 GoogLeNet network structure diagram


1. GoogLeNet

GoogLeNet, also known as Inception-v1, is a deep convolutional neural network architecture proposed by a team at Google in 2014 for image classification and feature extraction. It achieved excellent results in the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) competition and introduced the "Inception" module, a parallel structure of convolution kernels at multiple scales that enhances the network's ability to perceive features at different scales.

1.1 Inception module

GoogLeNet introduced the "Inception" module, which uses convolution kernels of different sizes to capture features at different scales simultaneously. This helps the network adapt to objects and structures of different sizes. Each Inception module contains multiple parallel convolutional and pooling layers, whose outputs are then concatenated along the channel dimension.

[Figure: the original Inception structure (left) and the Inception structure with dimensionality reduction (right)]

The left figure is the original Inception structure proposed in the paper; the right figure is the Inception structure with the dimensionality-reduction function added.

Looking first at the left figure, the Inception structure has four branches: the input feature matrix passes through these four branches in parallel to produce four outputs, which are then concatenated along the depth (channel) dimension to form the final output. Note that for the four outputs to be concatenated along the depth dimension, the feature matrices produced by the four branches must have the same height and width.

  • Branch 1 is a convolutional layer with a 1x1 kernel and stride=1.
  • Branch 2 is a convolutional layer with a 3x3 kernel, stride=1, and padding=1 (so the output feature matrix has the same height and width as the input).
  • Branch 3 is a convolutional layer with a 5x5 kernel, stride=1, and padding=2 (so the output feature matrix has the same height and width as the input).
  • Branch 4 is max-pooling downsampling with a 3x3 pooling kernel, stride=1, and padding=1 (so the output feature matrix has the same height and width as the input).

Now look at the right figure. Compared with the left one, a convolutional layer with a 1x1 kernel is added to branches 2 and 3 (before the larger convolution) and to branch 4 (after the max pooling). Its purpose is dimensionality reduction: it cuts the number of model parameters and the amount of computation.

Note: to keep the spatial size of the input unchanged when stride=1, set padding = (kernel size − 1) / 2.
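To make the four-branch structure concrete, here is a minimal PyTorch sketch of the dimensionality-reduction version of the module. The class and argument names are illustrative assumptions, not the paper's code; the ReLU after each convolution follows the paper's description.

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    """Inception module with 1x1 dimensionality reduction (right-hand figure)."""
    def __init__(self, in_ch, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj):
        super().__init__()
        # Branch 1: 1x1 convolution
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, ch1x1, kernel_size=1), nn.ReLU(inplace=True))
        # Branch 2: 1x1 reduction, then 3x3 convolution (padding=1 keeps H and W)
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_ch, ch3x3red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch3x3red, ch3x3, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        # Branch 3: 1x1 reduction, then 5x5 convolution (padding=2 keeps H and W)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, ch5x5red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch5x5red, ch5x5, kernel_size=5, padding=2), nn.ReLU(inplace=True))
        # Branch 4: 3x3 max pooling (stride=1, padding=1), then 1x1 projection
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Every branch preserves height and width, so the four outputs can be
        # concatenated along the channel (depth) dimension.
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)
```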

1.1.1  1x1 convolution

1x1 convolution is used widely in the Inception module to reduce the number of channels and thereby the amount of computation. A 1x1 convolution acts as a linear combination of the features from the different input channels, producing a composite feature representation.

For example, suppose a feature matrix with depth 512 is convolved with 64 kernels of size 5x5. Without a 1x1 convolution for dimensionality reduction, 5x5x512x64 = 819,200 parameters are required. If the depth is first reduced with 24 1x1 kernels, only 1x1x512x24 + 5x5x24x64 = 50,688 parameters are required, which is obviously far fewer.

[Figure: parameter comparison of a 5x5 convolution with and without 1x1 dimensionality reduction]
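A quick script reproduces these numbers (bias terms ignored; the 24-channel reduction width is the value implied by the 50,688 total):

```python
# Parameters of 64 5x5 kernels applied directly to a depth-512 feature matrix.
no_reduce = 5 * 5 * 512 * 64                       # 819200
# With dimensionality reduction: 24 1x1 kernels first, then 64 5x5 kernels.
with_reduce = 1 * 1 * 512 * 24 + 5 * 5 * 24 * 64   # 12288 + 38400 = 50688
print(no_reduce, with_reduce)                      # 819200 50688
```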

 

1.2 Auxiliary classifier structure

To mitigate the vanishing-gradient problem, GoogLeNet adds auxiliary classifiers at some intermediate layers. These auxiliary classifiers help gradients propagate during training and also provide supervisory signals to the network's intermediate layers, helping the network train faster.

There are two auxiliary classifiers, whose structure is as follows:

[Figure: structure of the auxiliary classifier]

The inputs of these two auxiliary classifiers come from Inception(4a) and Inception(4d), respectively.

  • The first layer of the auxiliary classifier is average-pooling downsampling with a 5x5 pooling kernel and stride=3.
  • The second layer is a convolutional layer with a 1x1 kernel, stride=1, and 128 kernels.
  • The third layer is a fully connected layer with 1024 nodes.
  • The fourth layer is a fully connected layer with 1000 nodes (the number of classification categories), as in the sketch after this list.
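Continuing the PyTorch sketch from above, the auxiliary classifier described by these four layers might look as follows. The 4x4 spatial size feeding the first fully connected layer assumes the 14x14 feature maps that Inception(4a)/(4d) produce in the original network, and the 70% dropout follows the paper; both are stated assumptions, not code from this post's source.

```python
class AuxClassifier(nn.Module):
    """Auxiliary classifier sketch; in_ch is 512 for Inception(4a)
    and 528 for Inception(4d) in the original network."""
    def __init__(self, in_ch, num_classes=1000):
        super().__init__()
        self.avgpool = nn.AvgPool2d(kernel_size=5, stride=3)  # layer 1: 14x14 -> 4x4
        self.conv = nn.Conv2d(in_ch, 128, kernel_size=1)      # layer 2: 128 1x1 kernels
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)               # layer 3: 1024 nodes
        self.fc2 = nn.Linear(1024, num_classes)               # layer 4: 1000 nodes
        self.relu = nn.ReLU(inplace=True)
        self.dropout = nn.Dropout(0.7)                        # 70% dropout, per the paper

    def forward(self, x):
        x = self.relu(self.conv(self.avgpool(x)))
        x = torch.flatten(x, 1)
        x = self.dropout(self.relu(self.fc1(x)))
        return self.fc2(x)
```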

1.3 GoogLeNet network structure diagram

How is the number of convolution kernels for each convolutional layer determined? Below is the parameter table given in the original paper. For the Inception module built above, the required parameters are #1x1, #3x3reduce, #3x3, #5x5reduce, #5x5, and poolproj; these six values give the numbers of convolution kernels used.

[Table: GoogLeNet architecture parameters from the original paper]

Here, #1x1 is the number of 1x1 kernels on branch 1, #3x3reduce is the number of 1x1 kernels on branch 2, #3x3 is the number of 3x3 kernels on branch 2, #5x5reduce is the number of 1x1 kernels on branch 3, #5x5 is the number of 5x5 kernels on branch 3, and poolproj is the number of 1x1 kernels on branch 4.

As shown below:

[Figure: correspondence between the table parameters and the Inception branches]
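Using the Inception class sketched in section 1.1, the first module, Inception(3a), can be instantiated directly from its table row (#1x1=64, #3x3reduce=96, #3x3=128, #5x5reduce=16, #5x5=32, poolproj=32; its input has 192 channels):

```python
inception3a = Inception(192, 64, 96, 128, 16, 32, 32)
x = torch.randn(1, 192, 28, 28)   # feature map entering Inception(3a)
print(inception3a(x).shape)       # torch.Size([1, 256, 28, 28]); 64+128+32+32 = 256 channels
```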

The overall network structure of GoogLeNet is shown below:

[Figure: overall GoogLeNet network structure]

 
