(4) Convolutional neural network model - GoogLeNet

Model introduction

In 2014, GoogLeNet and VGG were the two stars of the ImageNet Challenge (ILSVRC14): GoogLeNet took first place and VGG took second. A common feature of the two architectures is that they are deeper than their predecessors. GoogLeNet has 22 layers and about 5 million parameters; AlexNet has roughly 12 times as many parameters as GoogLeNet, and VGGNet has about 3 times as many as AlexNet. When memory or computing resources are limited, GoogLeNet is therefore the better choice, and in terms of results its performance is also stronger.

GoogLeNet is a convolutional neural network architecture designed by Christian Szegedy and others for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2014). It won the ILSVRC-2014 classification task with a top-5 error rate of 6.67%, beating models such as VGGNet, and the accompanying paper "Going Deeper with Convolutions" was published at CVPR in 2015.

Prior to this, architectures such as AlexNet and VGG obtained better results by increasing the depth (number of layers) of the network, but simply adding layers brings negative effects such as overfitting, vanishing gradients, and exploding gradients. GoogLeNet instead made a bolder change to the network structure itself.

The Inception module improves training from another angle: it uses computing resources more efficiently and extracts more features for the same amount of computation, thereby improving results. GoogLeNet, built from Inception modules, is only 22 layers deep, with roughly 1/12 the parameters of AlexNet and 1/3 of the contemporaneous VGGNet.

GoogLeNet is a deep network proposed by Google. It is spelled "GoogLeNet" rather than "GoogleNet" as a tribute to the classic model "LeNet".

Introduction to ILSVRC

ILSVRC (ImageNet Large Scale Visual Recognition Challenge) is one of the most popular and authoritative academic competitions in the field of machine vision in recent years, representing the highest level in the image field.

The ImageNet dataset, led by Professor Fei-Fei Li of Stanford University, is the dataset used by the ILSVRC competition and contains more than 14 million labeled images. Each year ILSVRC draws a subset of ImageNet for the competition. Taking 2012 as an example, the training set contained 1,281,167 images, the validation set 50,000 images, and the test set 100,000 images.

The ILSVRC competition mainly includes the following tasks:

(1) Image classification and target location (CLS-LOC)

(2) Target detection (DET)

(3) Video object detection (VID)

(4) Scene classification (Scene)

CVPR

CVPR is the abbreviation of the IEEE Conference on Computer Vision and Pattern Recognition. It is the top conference in the field of computer vision and pattern recognition, organized by the IEEE.

How GoogLeNet Further Improves Performance

Generally speaking, the most direct way to improve network performance is to increase its depth and width: depth is the number of layers, and width is the number of neurons (channels) per layer. But this approach has the following problems:

(1) Too many parameters: if the training data set is limited, overfitting is likely;

(2) The larger the network, the more parameters and the higher the computational cost, which makes it hard to apply in practice;

(3) The deeper the network, the more prone it is to gradient dispersion (gradients tend to vanish as they propagate backwards), making the model difficult to optimize.

Hence the joke that "deep learning" is really just "deep parameter tuning".

The way to solve these problems is, of course, to reduce the number of parameters while increasing the depth and width of the network. To reduce parameters, the natural idea is to turn full connections into sparse connections. In practice, however, sparse connections do not bring a qualitative reduction in actual computation, because most hardware is optimized for dense matrix computation: a sparse matrix holds less data, but its computation time is hard to reduce.

So, is there a way to keep the sparsity of the network structure while exploiting the high computational performance of dense matrices? A large body of literature shows that sparse matrices can be clustered into denser sub-matrices to improve computing performance, much as the human brain can be viewed as a repeated stacking of neurons. The GoogLeNet team therefore proposed the Inception structure: a "basic neuron" used as a building block to construct a network that is sparse in structure yet dense and efficient to compute.

What is Inception?

Inception basic structure

By designing a structure that is sparse yet produces dense data, one can improve the performance of the neural network while keeping computing resources efficient. Google proposed the original, basic Inception structure:
[Figure: the original (naive) Inception module]
This structure stacks the convolutions (1x1, 3x3, 5x5) and pooling (3x3) commonly used in CNNs (the spatial size after each convolution and pooling is kept the same, and the outputs are concatenated along the channel dimension). On the one hand this increases the width of the network; on the other it increases the network's adaptability to scale.

The 1x1 "network in network" convolution can extract fine details of the input, while the 5x5 filter covers most of the input's receptive field. A pooling operation is also included to reduce the spatial size and mitigate overfitting. On top of these layers, a ReLU is applied after every convolution to add nonlinearity to the network.

Inception v1 network structure

However, in this original version of Inception, every convolution kernel is applied to all outputs of the previous layer, and the 5x5 convolutions in particular require far too much computation, producing very thick feature maps. To avoid this, a 1x1 convolution is added before the 3x3 convolution, before the 5x5 convolution, and after max pooling to reduce the feature-map thickness. This gives the Inception v1 module, shown in the following figure:
[Figure: the Inception v1 module with 1x1 dimensionality-reduction convolutions]
The main purpose of the 1x1 convolution is dimensionality reduction, and it is also followed by a rectified linear activation (ReLU). For example, if the output of the previous layer is 100x100x128 and it passes through a 5x5 convolutional layer with 256 channels (stride=1, pad=2), the output is 100x100x256 and the convolution has 128x5x5x256 = 819,200 parameters. If the previous layer's output instead first passes through a 1x1 convolutional layer with 32 channels and then a 5x5 convolutional layer with 256 outputs, the output is still 100x100x256, but the number of convolution parameters drops to 128x1x1x32 + 32x5x5x256 = 208,896, roughly a 4x reduction.
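
As a quick sanity check, the parameter counts in this example can be reproduced with a few lines of Python (a minimal sketch; bias terms are ignored, as in the text):

# Parameter count of a conv layer (ignoring biases): in_ch * k * k * out_ch
def conv_params(in_ch, k, out_ch):
    return in_ch * k * k * out_ch

# Direct 5x5 convolution: 128 -> 256 channels
direct = conv_params(128, 5, 256)                            # 819,200

# 1x1 reduction to 32 channels, then 5x5 convolution to 256 channels
reduced = conv_params(128, 1, 32) + conv_params(32, 5, 256)  # 4,096 + 204,800 = 208,896

print(direct, reduced, direct / reduced)                     # roughly a 4x reduction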

Detailed explanation of GoogLeNet model

A thumbnail of the GoogLeNet architecture is shown below, with a more detailed, labeled figure at the end of the article. Compared with earlier convolutional network structures, the network is extended in width as well as in depth. The whole network is a stack of many block sub-networks, and this sub-network is the Inception module.
[Figure: thumbnail of the overall GoogLeNet architecture]
Highlights of the model:

1. A modular design (stem, stacked Inception modules, auxiliary classifiers, and classifier) is adopted, which makes it easy to add and modify layers.

(1) Stem: the paper points out that the Inception module works best in the middle of the network, so the first part of the network still uses traditional convolutional layers.

(2) Auxiliary classifiers: from the perspective of information flow, gradients vanish because their energy decays during backpropagation and cannot reach the shallow layers, so an opening is made in the middle of the network and an auxiliary loss is attached directly to those intermediate layers.

(3) Classifier: the VGGNet and NIN papers show that fully connected layers contain a huge number of parameters, so average pooling is used in place of the fc layers to cut the parameter count and prevent overfitting. Dropout is still applied before the final fc/softmax layer to further guard against overfitting (the paper uses a 40% dropout ratio here and 70% inside the auxiliary classifiers).

2. 1x1 convolutions are used for dimensionality reduction and feature mapping (this also appears in the VGG network, but this paper describes it in more detail).

3. Introduced the Inception structure (integrating feature information of different scales).

4. The fully connected layers are discarded in favour of an average pooling layer, which greatly reduces the model parameters.

5. To avoid vanishing gradients, the network adds two auxiliary classifiers (extra softmax heads) that inject gradients during backpropagation. Each auxiliary classifier takes the output of an intermediate layer, classifies it, and adds its loss to the final loss with a small weight (0.3). This is equivalent to a form of model fusion, adds extra gradient signal to the backward pass, and provides additional regularization, all of which helps training of the whole network. At test time, these two extra softmax heads are removed.

Inception module

The Inception module used in GoogLeNet is called Inception v1. Between 2014 and 2016 the Google team kept improving GoogLeNet, producing the Inception v1-v4 and Xception structures.
[Figure: left, the naive Inception module; right, Inception v1 with 1x1 reduction convolutions]
The left figure is the initial Inception structure (the naive version) designed by the GoogLeNet authors. The idea is to stack several different kinds of kernels (1x1, 3x3, 5x5 convolutions and a 3x3 pooling; the spatial size after each is the same, and the channels are concatenated) instead of a single small 3x3 convolution. The benefit is that the extracted features are more diverse and less correlated with each other. All the feature maps are then concatenated, which makes the network very wide, and stacking Inception modules makes it deep. Doing only this, however, would make the computation of a single layer explode.

In the naive Inception module, every convolution kernel is applied to all outputs of the previous layer, and the 5x5 convolutions require too much computation on such thick feature maps. To avoid this, a 1x1 convolution is added before the 3x3 convolution, before the 5x5 convolution, and after max pooling to reduce the feature-map thickness; this gives the Inception v1 module (right figure).
Assuming the input feature map is 28x28x256 and the output feature map is 28x28x480, the naive Inception module requires about 854M operations. The calculation is illustrated below.

[Figure: operation count of the naive Inception module (about 854M operations)]
As the figure shows, the computation is dominated by convolutions applied to high-dimensional inputs. So, before each large convolution, a 1x1 convolution is used to reduce the channel dimension of the input feature map, compressing the information before the 3x3 or 5x5 feature-extraction convolution. Under the same conditions, the Inception v1 module requires only about 358M operations.
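
To make the effect concrete, here is a small sketch of the multiply-accumulate count for the dominant 5x5 branch on a 28x28x256 input. The 96 output channels and the 64-channel reduction are illustrative assumptions, not values taken from the figures above:

# Multiply-accumulates of a conv layer: H_out * W_out * out_ch * k * k * in_ch
def conv_macs(h, w, in_ch, k, out_ch):
    return h * w * out_ch * k * k * in_ch

# 5x5 branch applied directly to the 28x28x256 input (assumed 96 output channels)
naive_5x5 = conv_macs(28, 28, 256, 5, 96)                                    # ~482M MACs

# Same branch with an assumed 1x1 reduction to 64 channels first
reduced_5x5 = conv_macs(28, 28, 256, 1, 64) + conv_macs(28, 28, 64, 5, 96)  # ~133M MACs

print(f"{naive_5x5 / 1e6:.0f}M vs {reduced_5x5 / 1e6:.0f}M")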
[Figure: operation count of the Inception v1 module (about 358M operations)]
The Inception structure has 4 branches in total. The input feature map passes through the four branches in parallel to produce four outputs, which are then concatenated along the depth (channel) dimension to form the final output. Note that for the four outputs to be concatenated along the depth direction, the height and width of the feature maps produced by the four branches must be the same. The parameters of the four branches are therefore (a quick shape check follows the list):

branch1:Conv1×1, stride=1
branch2:Conv3×3, stride=1, padding=1
branch3:Conv5×5, stride=1, padding=2
branch4:MaxPool3×3, stride=1, padding=1
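
A minimal check that the branch configurations above preserve the spatial size, here with the channel counts of inception(3a) described later; only the final convolution of each branch is shown:

import torch
import torch.nn as nn

x = torch.randn(1, 192, 28, 28)  # e.g. the input to inception(3a)

branch1 = nn.Conv2d(192, 64,  kernel_size=1, stride=1)
branch2 = nn.Conv2d(192, 128, kernel_size=3, stride=1, padding=1)
branch3 = nn.Conv2d(192, 32,  kernel_size=5, stride=1, padding=2)
branch4 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

# All four branches keep the 28x28 spatial size, so their outputs
# can be concatenated along the channel dimension.
for b in (branch1, branch2, branch3, branch4):
    print(b(x).shape)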

Nine Inception v1 modules are used in GoogLeNet, named inception(3a), inception(3b), inception(4a), inception(4b), inception(4c), inception(4d), inception(4e), inception(5a), and inception(5b).

auxiliary classifier

The GoogLeNet network contains two auxiliary classifiers at different depths, and their structure is identical; it is shown in the figure below. Their inputs come from inception(4a) and inception(4d) respectively.
[Figure: structure of the auxiliary classifier]
The first layer of the auxiliary classifier is an average-pooling down-sampling layer with a 5x5 kernel and stride=3 (on a 14x14 input this gives a 4x4 output); the second layer is a convolutional layer with a 1x1 kernel, stride=1, and 128 kernels; the third layer is a fully connected layer with 1024 nodes; the fourth layer is a fully connected layer with 1000 nodes (the number of classes).

The loss used during training is:

L = L0 + 0.3 × L1 + 0.3 × L2

Here L0 is the final classification loss and L1, L2 are the losses of the two auxiliary classifiers. In the test phase the auxiliary classifiers are removed and only the final classification output is used.
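
In PyTorch terms, training with the two auxiliary heads might look like the following sketch, using the (main, aux2, aux1) outputs returned in training mode by the model defined in the code section below; model, images, and labels are assumed to be defined elsewhere:

import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# In training mode the model returns (logits, aux2, aux1); see the code below.
logits, aux2, aux1 = model(images)
loss = criterion(logits, labels) \
       + 0.3 * criterion(aux1, labels) \
       + 0.3 * criterion(aux2, labels)
loss.backward()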

GoogLeNet network structure

Based on Inception, the network structure of GoogLeNet is constructed as follows (a total of 22 layers):
[Figure: the full GoogLeNet network structure (22 layers)]
The description of the above figure is as follows:

(1) GoogLeNet adopts a modular structure (Inception structure), which is convenient for addition and modification;

(2) The network replaces the fully connected layers at the end with average pooling, an idea from NIN (Network in Network), which was shown to improve top-1 accuracy by about 0.6%. A fully connected layer is still added after the pooling, mainly to make it easy to adjust the output flexibly;

(3) Although the fully connected layers are removed, dropout is still used in the network;

(4) To avoid vanishing gradients, the network adds two auxiliary classifiers (extra softmax heads) whose losses are added to the final loss with a weight of 0.3, as described in the model highlights above; they are removed at test time.

GoogLeNet network structure diagram details

How is the number of convolution kernels in each convolutional layer determined? The parameter table from the original paper is given below. For the Inception modules we build, the required parameters are #1x1, #3x3reduce, #3x3, #5x5reduce, #5x5, and poolproj; these six values give the number of convolution kernels used in each branch.

Filter counts for the nine Inception modules (#1x1, #3x3reduce, #3x3, #5x5reduce, #5x5, pool proj); the same values appear in the Inception(...) calls in the code section below:

inception(3a):  64,  96, 128, 16,  32,  32
inception(3b): 128, 128, 192, 32,  96,  64
inception(4a): 192,  96, 208, 16,  48,  64
inception(4b): 160, 112, 224, 24,  64,  64
inception(4c): 128, 128, 256, 24,  64,  64
inception(4d): 112, 144, 288, 32,  64,  64
inception(4e): 256, 160, 320, 32, 128, 128
inception(5a): 256, 160, 320, 32, 128, 128
inception(5b): 384, 192, 384, 48, 128, 128

Note: "#3x3 reduce" and "#5x5 reduce" in the above table indicate the number of 1x1 convolutions used before the 3x3, 5x5 convolution operation.

Analysis of GoogLeNet network structure list

0, input

The original input image is 224x224x3 and is zero-mean pre-processed (the mean is subtracted from each pixel of the image).
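
A minimal sketch of this preprocessing with torchvision; the per-channel mean values are placeholders that would normally be computed from the training set, not values from the paper:

from torchvision import transforms

# Placeholder per-channel means; in practice they are computed on the training set.
mean = (0.485, 0.456, 0.406)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=mean, std=(1.0, 1.0, 1.0)),  # subtract the mean only
])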

1. The first layer (convolutional layer)

Using a 7x7 convolution kernel (stride 2, padding 3) with 64 channels, the output is 112x112x64, and ReLU is applied after the convolution. After a 3x3 max pooling (stride 2, ceil mode), the output size is ceil((112 - 3) / 2) + 1 = 56, i.e. 56x56x64.
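
The spatial sizes above follow the usual output-size formulas; a small sketch, matching the ceil_mode=True pooling used in the code further down:

import math

def conv_out(size, k, stride, pad):
    # Output size of a convolution: floor((size + 2*pad - k) / stride) + 1
    return (size + 2 * pad - k) // stride + 1

def pool_out_ceil(size, k, stride):
    # Output size of max pooling with ceil_mode=True
    return math.ceil((size - k) / stride) + 1

print(conv_out(224, k=7, stride=2, pad=3))   # 112  (conv1)
print(pool_out_ceil(112, k=3, stride=2))     # 56   (3x3 max pool, stride 2)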

2. The second layer (convolutional layer)

A 3x3 convolution kernel (stride 1, padding 1) with 192 channels gives an output of 56x56x192 (in the implementation below this is preceded by a 1x1 convolution with 64 channels), and ReLU is applied after the convolution. After a 3x3 max pooling (stride 2, ceil mode), the output size is ceil((56 - 3) / 2) + 1 = 28, i.e. 28x28x192.

3a, the third layer (Inception 3a layer)

Divided into four branches, using convolution kernels of different scales for processing

(1) 64 1x1 convolution kernels, followed by ReLU, output 28x28x64

(2) 96 1x1 convolution kernels as dimensionality reduction before the 3x3 convolution, giving 28x28x96; after ReLU, 128 3x3 convolutions (padding 1) output 28x28x128

(3) 16 1x1 convolution kernels as dimensionality reduction before the 5x5 convolution, giving 28x28x16; after ReLU, 32 5x5 convolutions (padding 2) output 28x28x32

(4) A 3x3 max-pooling layer (padding 1) outputs 28x28x192, followed by 32 1x1 convolutions that output 28x28x32.
The four results are then concatenated along the channel (third) dimension: 64 + 128 + 32 + 32 = 256, for a final output of 28x28x256.

3b, the third layer (Inception 3b layer)

(1) 128 1x1 convolution kernels, followed by ReLU, output 28x28x128

(2) 128 1x1 convolution kernels as dimensionality reduction before the 3x3 convolution, giving 28x28x128; after ReLU, 192 3x3 convolutions (padding 1) output 28x28x192

(3) 32 1x1 convolution kernels as dimensionality reduction before the 5x5 convolution, giving 28x28x32; after ReLU, 96 5x5 convolutions (padding 2) output 28x28x96

(4) A 3x3 max-pooling layer (padding 1) outputs 28x28x256, followed by 64 1x1 convolutions that output 28x28x64.
The four results are then concatenated along the channel (third) dimension: 128 + 192 + 96 + 64 = 480, for a final output of 28x28x480.
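
Using the Inception module defined in the PyTorch code section below, the 3a/3b channel arithmetic can be checked directly (a minimal sketch):

import torch

# Inception as defined in the code further down
inc3a = Inception(192, 64, 96, 128, 16, 32, 32)
inc3b = Inception(256, 128, 128, 192, 32, 96, 64)

x = torch.randn(1, 192, 28, 28)
y = inc3a(x)
print(y.shape)          # [1, 256, 28, 28]  -> 64 + 128 + 32 + 32
print(inc3b(y).shape)   # [1, 480, 28, 28]  -> 128 + 192 + 96 + 64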

The fourth stage (4a, 4b, 4c, 4d, 4e) and the fifth stage (5a, 5b) follow the same pattern as 3a and 3b and are not repeated here.

Comparison of GoogLeNet experimental results

GoogLeNet's experimental results show a clear improvement, with a lower error rate than MSRA, VGG, and other models. The comparison is shown in the table below:
[Table: ILSVRC-2014 classification error rates of GoogLeNet compared with VGG, MSRA, and other models]

Code built by pytorch

import torch
import torch.nn as nn
import torch.nn.functional as F


class GoogLeNet(nn.Module):
    def __init__(self, num_classes=1000, aux_logits=True, init_weights=False):
        super().__init__()
        self.aux_logits = aux_logits

        self.conv1 = BasicConv2d(3, 64, kernel_size=7, stride=2, padding=3)
        self.pool1 = nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True)
        self.conv2 = BasicConv2d(64, 64, kernel_size=1)
        self.conv3 = BasicConv2d(64, 192, kernel_size=3, padding=1)
        self.pool2 = nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True)

        self.inception3a = Inception(192, 64, 96, 128, 16, 32, 32) 
        self.inception3b = Inception(256, 128, 128, 192, 32, 96, 64)
        self.pool3 = nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True)

        self.inception4a = Inception(480, 192, 96, 208, 16, 48, 64) 
        self.inception4b = Inception(512, 160, 112, 224, 24, 64, 64)
        self.inception4c = Inception(512, 128, 128, 256, 24, 64, 64)
        self.inception4d = Inception(512, 112, 144, 288, 32, 64, 64) 
        self.inception4e = Inception(528, 256, 160, 320, 32, 128, 128)
        self.pool4 = nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True) 

        self.inception5a = Inception(832, 256, 160, 320, 32, 128, 128)
        self.inception5b = Inception(832, 384, 192, 384, 48, 128, 128)

        if aux_logits:
            self.aux1 = InceptionAux(512, num_classes)
            self.aux2 = InceptionAux(528, num_classes)

        self.avgpool = nn.AdaptiveAvgPool2d(output_size=(1, 1))
        self.dropout = nn.Dropout(p=0.2)
        self.fc = nn.Linear(in_features=1024, out_features=num_classes)

        if init_weights:
            self._init_weights()

    def forward(self, x):
        x = self.conv1(x)  # [None, 3, 224, 224] -> [None, 64, 112, 112]
        x = self.pool1(x)  # [None, 64, 112, 112] -> [None, 64, 56, 56]
        x = self.conv2(x)  # [None, 64, 56, 56] -> [None, 64, 56, 56]
        x = self.conv3(x)  # [None, 64, 56, 56] -> [None, 192, 56, 56]
        x = self.pool2(x)  # [None, 192, 56, 56] -> [None, 192, 28, 28]

        x = self.inception3a(x) # [None, 192, 28, 28] -> [None, 256, 28, 28]
        x = self.inception3b(x)  # [None, 256, 28, 28] -> [None, 480, 28, 28]
        x = self.pool3(x)  # [None, 480, 28, 28] -> [None, 480, 14, 14]
        x = self.inception4a(x) # [None, 480, 14, 14] -> [None, 512, 14, 14]
        if self.training and self.aux_logits:  # eval mode discards this layer
            aux1 = self.aux1(x)

        x = self.inception4b(x)
        x = self.inception4c(x)
        x = self.inception4d(x) # [None, 512, 14, 14] -> [None, 528, 14, 14]
        if self.training and self.aux_logits:
            aux2 = self.aux2(x)

        x = self.inception4e(x)  # [None, 528, 14, 14] -> [None, 832, 14, 14]
        x = self.pool4(x) # [None, 832, 14, 14] -> [None, 832, 7, 7]
        x = self.inception5a(x)
        x = self.inception5b(x)  # [None, 832, 7, 7] -> [None, 1024, 7, 7]

        x = self.avgpool(x)
        x = torch.flatten(x, start_dim=1)
        x = self.dropout(x)
        x = self.fc(x)
        if self.training and self.aux_logits:
            return x, aux2, aux1
        return x

    def _init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_uniform_(m.weight, mode='fan_out', nonlinearity='leaky_relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)


class BasicConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.bn = nn.BatchNorm2d(num_features=out_channels, eps=0.001)

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        return F.relu(x, inplace=True)


class Inception(nn.Module):
    def __init__(self, in_channels, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj):
        super().__init__()

        self.branch1 = BasicConv2d(in_channels, ch1x1, kernel_size=1)

        self.branch2 = nn.Sequential(
            BasicConv2d(in_channels, ch3x3red, kernel_size=1),
            BasicConv2d(ch3x3red, ch3x3, kernel_size=3, padding=1)
        )

        self.branch3 = nn.Sequential(
            BasicConv2d(in_channels, ch5x5red, kernel_size=1),
            BasicConv2d(ch5x5red, ch5x5, kernel_size=5, padding=2)
        )

        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            BasicConv2d(in_channels, pool_proj, kernel_size=1)
        )

    def forward(self, x):
        branch1 = self.branch1(x)
        branch2 = self.branch2(x)
        branch3 = self.branch3(x)
        branch4 = self.branch4(x)

        outputs = [branch1, branch2, branch3, branch4]
        return torch.cat(outputs, dim=1)


class InceptionAux(nn.Module):
    def __init__(self, in_channels, num_classes):
        super().__init__()
        # self.avgpool = nn.AvgPool2d(kernel_size=5, stride=3)
        self.avgpool = nn.AdaptiveAvgPool2d(output_size=(4, 4))
        self.conv = BasicConv2d(in_channels, 128, kernel_size=1)  # output size [batch, 128, 4, 4]

        self.fc1 = nn.Linear(2048, 1024)
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        # aux1: N x 512 x 14 x 14, aux2: N x 528 x 14 x 14
        x = self.avgpool(x)
        # aux1: N x 512 x 4 x 4, aux2: N x 528 x 4 x 4
        x = self.conv(x)
        # N x 128 x 4 x 4
        x = torch.flatten(x, start_dim=1)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.fc1(x)
        x = F.relu(x, inplace=True)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.fc2(x)
        return x
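
A short usage sketch of the model defined above (the input size and class count follow the text; the random tensor is just a stand-in for a preprocessed image batch):

if __name__ == '__main__':
    model = GoogLeNet(num_classes=1000, aux_logits=True, init_weights=True)

    x = torch.randn(2, 3, 224, 224)   # dummy batch of two 224x224 RGB images

    model.train()
    logits, aux2, aux1 = model(x)     # three outputs while training
    print(logits.shape, aux2.shape, aux1.shape)  # each [2, 1000]

    model.eval()
    with torch.no_grad():
        logits = model(x)             # auxiliary heads are skipped at test time
    print(logits.shape)               # [2, 1000]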

