Image Classification: A Summary of LeNet, AlexNet, VGG, GoogLeNet, and DarkNet

1. LeNet (1998)

LeNet is one of the earliest convolutional neural networks [1], proposed to recognize handwritten digits and machine-printed characters. In 1998, Yann LeCun applied the LeNet convolutional neural network to image classification for the first time and achieved great success on the task of handwritten digit recognition. The algorithm showed that correlations between pixel features in an image can be extracted by convolution with shared parameters, and its combined structure of convolution, downsampling (pooling), and nonlinear mapping laid the foundation for today's deep image recognition networks.

1.1 LeNet model structure

LeNet extracts image features by repeatedly applying combinations of convolution and pooling layers. Its architecture is shown in Figure 1. Below is the LeNet-5 model used in the MNIST handwritten digit recognition task:

  • The first module: contains a 5×5, 6-channel convolution and 2×2 pooling. The convolution extracts the feature patterns contained in the image (the activation function is Sigmoid), reducing the image size from 28 to 24. The pooling layer reduces the output feature map's sensitivity to spatial position and reduces the image size to 12.

  • The second module: has the same structure as the first, but the number of channels is increased from 6 to 16. The convolution reduces the image size to 8, which becomes 4 after pooling.

  • The third module: contains a 4×4, 120-channel convolution. After this convolution the image size is reduced to 1 while the number of channels increases to 120. The feature map extracted by this third convolution is fed into the fully connected layers: the first fully connected layer has 64 output neurons, and the second has as many output neurons as there are classification labels (10 categories for handwritten digit recognition). Softmax is then applied to compute the predicted probability of each category.


Hint:

How can the output feature map of a convolutional layer be used as the input of a fully connected layer?

The output of a convolutional layer has the format [N, C, H, W]. When it is passed to a fully connected layer, the data is flattened automatically: each sample is converted into a vector of length K, where K = C×H×W, so a mini-batch becomes a two-dimensional tensor of shape N×K.
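A minimal sketch of this flattening with a dummy tensor (the shapes here are illustrative):

```python
import paddle

# A dummy mini-batch: N=8 samples, C=120 channels, H=W=1
feature_map = paddle.randn([8, 120, 1, 1])
# Flatten everything after the batch axis: [N, C, H, W] -> [N, C*H*W]
flat = paddle.flatten(feature_map, start_axis=1, stop_axis=-1)
print(flat.shape)  # [8, 120], i.e. [N, K] with K = C*H*W
```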


1.2 Implementation of LeNet model

The implementation code of the LeNet network is as follows:

```python
# Import the required packages
import paddle
import numpy as np
from paddle.nn import Conv2D, MaxPool2D, Linear
import paddle.nn.functional as F

# Define the LeNet network structure
class LeNet(paddle.nn.Layer):
    def __init__(self, num_classes=1):
        super(LeNet, self).__init__()
        # Create the convolution and pooling layers
        # First convolutional layer
        self.conv1 = Conv2D(in_channels=1, out_channels=6, kernel_size=5)
        self.max_pool1 = MaxPool2D(kernel_size=2, stride=2)
        # Size logic: pooling does not change the channel count, which stays at 6
        # Second convolutional layer
        self.conv2 = Conv2D(in_channels=6, out_channels=16, kernel_size=5)
        self.max_pool2 = MaxPool2D(kernel_size=2, stride=2)
        # Third convolutional layer
        self.conv3 = Conv2D(in_channels=16, out_channels=120, kernel_size=4)
        # Size logic: the input is flattened [B, C, H, W] -> [B, C*H*W]
        # For a 28x28 input, after three convolutions and two poolings, C*H*W equals 120
        self.fc1 = Linear(in_features=120, out_features=64)
        # The first fully connected layer has 64 output neurons; the second has
        # as many output neurons as there are classification labels
        self.fc2 = Linear(in_features=64, out_features=num_classes)

    # Forward pass of the network
    def forward(self, x):
        # Each convolutional layer uses a Sigmoid activation; the first two are
        # each followed by 2x2 max pooling
        x = self.conv1(x)
        x = F.sigmoid(x)
        x = self.max_pool1(x)
        x = self.conv2(x)
        x = F.sigmoid(x)
        x = self.max_pool2(x)
        x = self.conv3(x)
        x = F.sigmoid(x)
        # Size logic: flatten the data [B, C, H, W] -> [B, C*H*W]
        x = paddle.reshape(x, [x.shape[0], -1])
        x = self.fc1(x)
        x = F.sigmoid(x)
        x = self.fc2(x)
        return x
```
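A quick way to sanity-check the layer sizes described above is to run a dummy MNIST-shaped batch through the network (a usage sketch, assuming the class defined above):

```python
# Instantiate LeNet for 10-class digit recognition and run a dummy batch
model = LeNet(num_classes=10)
x = paddle.randn([4, 1, 28, 28])  # [N, C, H, W] for MNIST-sized inputs
logits = model(x)
print(logits.shape)  # [4, 10]
```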

1.3 LeNet model features

  • The convolutional network uses a three-stage sequence: convolution, downsampling (pooling), and nonlinear mapping (the most important characteristic of LeNet-5, which laid the foundation for today's deep convolutional networks);
  • Convolution is used to extract spatial features;
  • Downsampling uses the spatial average of the mapped features;
  • tanh or sigmoid is used for the nonlinear mapping;
  • A multi-layer perceptron (MLP) serves as the final classifier;
  • A sparse connection matrix between layers avoids huge computational overhead.

1.4 LeNet model indicators

LeNet-5 was trained and tested on the MNIST handwritten digit recognition task. The metrics reported in the paper are shown in Figure 2. With the distortions-based data augmentation, the error rate reaches 0.8%.

  • References

[1] Gradient-based learning applied to document recognition.

2. AlexNet (2012)

AlexNet [1] is the champion model of the 2012 ImageNet competition. Its authors are Alex Krizhevsky and his advisor Geoffrey Hinton, one of the three giants in the field of neural networks.

AlexNet beat the second-place entry in the 2012 ImageNet competition by a large margin, which had a great impact on academia and industry at the time. Since then, many deeper neural networks have been proposed, such as the excellent VGG, GoogLeNet, and ResNet.

2.1 AlexNet model structure

Compared with the earlier LeNet, AlexNet has a deeper network structure: 5 convolutional layers and 3 fully connected layers. The specific structure is shown in Figure 1.

1) The first module: a 224×224 color image is first convolved with 96 kernels of size 11×11×3 (stride 4, padding 2) to extract the feature patterns it contains, giving 96 feature maps of size 54×54; pooling with a 2×2 window then gives 96 feature maps of size 27×27;

2) The second module: contains 256 5×5 convolution kernels and 2×2 pooling. The image size is unchanged by the convolution and becomes 13×13 after pooling;

3) The third module: contains 384 3×3 convolution kernels; the image size is unchanged after the convolution;

4) The fourth module: contains 384 3×3 convolution kernels; the image size is unchanged after the convolution;

5) The fifth module: contains 256 3×3 convolution kernels and 2×2 pooling. The image size is unchanged by the convolution, and pooling yields 256 feature maps of size 6×6.

The feature maps extracted by the fifth module are fed into the fully connected layers to obtain a vector representation of the original image. The first two fully connected layers each have 4096 output neurons, and the third has as many output neurons as there are classification labels (1000 categories in the ImageNet competition); Softmax is then used to compute the predicted probability of each category.

2.2 AlexNet model implementation

Based on the Paddle framework, the specific implementation code of AlexNet is as follows:

```python
# -*- coding:utf-8 -*-
# Import the required packages
import paddle
import numpy as np
from paddle.nn import Conv2D, MaxPool2D, Linear, Dropout
import paddle.nn.functional as F

# Define the AlexNet network structure
class AlexNet(paddle.nn.Layer):
    def __init__(self, num_classes=1):
        super(AlexNet, self).__init__()
        # Like LeNet, AlexNet extracts image features with convolution and pooling layers;
        # unlike LeNet, the activation function is changed to ReLU
        self.conv1 = Conv2D(in_channels=3, out_channels=96, kernel_size=11, stride=4, padding=5)
        self.max_pool1 = MaxPool2D(kernel_size=2, stride=2)
        self.conv2 = Conv2D(in_channels=96, out_channels=256, kernel_size=5, stride=1, padding=2)
        self.max_pool2 = MaxPool2D(kernel_size=2, stride=2)
        self.conv3 = Conv2D(in_channels=256, out_channels=384, kernel_size=3, stride=1, padding=1)
        self.conv4 = Conv2D(in_channels=384, out_channels=384, kernel_size=3, stride=1, padding=1)
        self.conv5 = Conv2D(in_channels=384, out_channels=256, kernel_size=3, stride=1, padding=1)
        self.max_pool5 = MaxPool2D(kernel_size=2, stride=2)
        self.fc1 = Linear(in_features=12544, out_features=4096)
        self.drop_ratio1 = 0.5
        self.drop1 = Dropout(self.drop_ratio1)
        self.fc2 = Linear(in_features=4096, out_features=4096)
        self.drop_ratio2 = 0.5
        self.drop2 = Dropout(self.drop_ratio2)
        self.fc3 = Linear(in_features=4096, out_features=num_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.max_pool1(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = self.max_pool2(x)
        x = self.conv3(x)
        x = F.relu(x)
        x = self.conv4(x)
        x = F.relu(x)
        x = self.conv5(x)
        x = F.relu(x)
        x = self.max_pool5(x)
        x = paddle.reshape(x, [x.shape[0], -1])
        x = self.fc1(x)
        x = F.relu(x)
        # Apply dropout after the fully connected layer to suppress overfitting
        x = self.drop1(x)
        x = self.fc2(x)
        x = F.relu(x)
        # Apply dropout after the fully connected layer to suppress overfitting
        x = self.drop2(x)
        x = self.fc3(x)
        return x
```
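The in_features=12544 of fc1 corresponds to 256×7×7, which is the feature map size this implementation produces for 224×224 inputs (note that it uses padding=5 on the first convolution). A sanity-check sketch, assuming the class above:

```python
# 224 -> conv1(k=11, s=4, p=5) -> 56 -> pool -> 28 -> conv2 -> 28 -> pool -> 14
# -> conv3/4/5 (k=3, p=1) -> 14 -> pool -> 7, so the flattened size is 256*7*7 = 12544
model = AlexNet(num_classes=1000)
x = paddle.randn([2, 3, 224, 224])
print(model(x).shape)  # [2, 1000]
```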

2.3 Features of AlexNet model

AlexNet introduced several relatively new techniques, successfully applying tricks such as ReLU, Dropout, and LRN in a CNN for the first time. AlexNet also used GPUs to accelerate computation.

AlexNet carried forward the ideas of LeNet and applied the basic principles of CNNs to a deep and wide network. The main new techniques used by AlexNet are as follows:

  • Successfully used ReLU as the CNN activation function and verified that it outperforms Sigmoid in deeper networks, solving the vanishing gradient problem that Sigmoid suffers from as networks get deeper. Although ReLU had been proposed long before, it did not take off until AlexNet.
  • During training, Dropout randomly ignores some neurons to avoid overfitting. Although Dropout has a separate paper, AlexNet put it to practical use and proved its effect. In AlexNet, Dropout is mainly applied to the last few fully connected layers.
  • Used overlapping max pooling in the CNN. Previously, average pooling was common in CNNs; AlexNet used max pooling throughout, avoiding the blurring effect of average pooling. AlexNet also made the stride smaller than the pooling kernel size, so that adjacent pooling outputs overlap, which improves the richness of the features.
  • Proposed the LRN (local response normalization) layer, which creates a competition mechanism among the activities of local neurons: relatively large responses become relatively larger while neurons with smaller responses are suppressed, enhancing the generalization ability of the model.
  • Used CUDA to accelerate the training of the deep convolutional network, exploiting the parallel computing power of GPUs for the large number of matrix operations in training. AlexNet was trained on two GTX 580 GPUs; a single GTX 580 has only 3 GB of memory, which limits the maximum size of the trainable network, so the authors split AlexNet across the two GPUs, storing half of the neuron parameters in each GPU's memory. Since the GPUs can access each other's memory directly without going through host memory, using multiple GPUs at once is also very efficient. The design additionally restricts GPU-to-GPU communication to certain layers of the network, controlling the performance cost of communication.
  • Used data augmentation: a 224×224 region (and its horizontally flipped mirror) is randomly cropped from the original 256×256 image, which increases the amount of data by a factor of 2×(256−224)² = 2048. Without data augmentation, a CNN with this many parameters would fall into overfitting when relying on the original data alone; with it, overfitting is greatly reduced and generalization improves. At prediction time, the four corners plus the center of the image are cropped and flipped left-right, giving 10 images in total; the 10 predictions are averaged. The paper also mentions applying PCA to the RGB values of the images and adding Gaussian perturbations with standard deviation 0.1 to the principal components to inject some noise; this trick lowers the error rate by another 1%. A sketch of the cropping and flipping pipeline follows below.
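A sketch of the random-crop-plus-flip part of this pipeline using paddle.vision.transforms (these transform classes are standard Paddle APIs; the PCA color perturbation from the paper is omitted here):

```python
from paddle.vision import transforms

# Training-time augmentation: a random 224x224 crop from a 256x256 image,
# plus a random horizontal flip
train_transform = transforms.Compose([
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(0.5),
    transforms.ToTensor(),
])
```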

2.4 AlexNet model indicators

As the winning algorithm of the ImageNet 2012 competition, AlexNet achieved a top-5 error rate of 15.3% on the ImageNet test set, far better than the second place (SIFT+FVs) at 26.2%, as shown in Figure 2.

  • References

[1] ImageNet classification with deep convolutional neural networks.

3. VGG (2014)

After AlexNet shone in the 2012 ImageNet competition, convolutional neural networks entered a stage of rapid development. In 2014, the VGG network [1] proposed by Simonyan and Zisserman took second place on ImageNet. The name VGG comes from the Visual Geometry Group, the authors' laboratory. VGG improved on previous convolutional networks by exploring the relationship between network depth and performance, achieving better results with smaller convolution kernels and a deeper network structure, and it became an important network in the history of CNN development. VGG builds a deep convolutional neural network from a series of 3×3 small convolution kernels and pooling layers. It is widely welcomed by researchers because of its simple structure and strong applicability, and its network design method provided a direction for building deep neural networks.

3.1 VGG model structure

Figure 1 is a schematic diagram of the VGG-16 network, which has 13 convolutional layers and 3 fully connected layers. VGG strictly uses 3×3 convolutional layers and pooling layers to extract features, places three fully connected layers at the end of the network, and uses the output of the last fully connected layer as the classification prediction.

VGG also has a notable feature: after each pooling layer (max pooling), the size of the feature map is halved while the number of channels is doubled (except after the last pooling layer).

In VGG, every convolutional layer uses ReLU as the activation function, and dropout is added after the fully connected layers to suppress overfitting. Using small convolution kernels effectively reduces the number of parameters, making training and inference more efficient. For example, two stacked 3×3 convolutional layers provide a 5×5 receptive field while requiring fewer parameters than a single 5×5 convolutional layer, as the sketch below quantifies. Because the kernels are small, more convolutional layers can be stacked to deepen the network, which benefits image classification tasks. The success of VGG demonstrated that increasing network depth can better learn the feature patterns in images.
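The parameter saving is easy to verify with a back-of-the-envelope count, ignoring biases and assuming C input and C output channels (the channel count below is illustrative):

```python
C = 64  # an illustrative channel count
params_two_3x3 = 2 * (3 * 3 * C * C)  # two stacked 3x3 layers: 73,728
params_one_5x5 = 5 * 5 * C * C        # one 5x5 layer: 102,400
print(params_two_3x3 / params_one_5x5)  # 0.72, i.e. ~28% fewer parameters
```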

3.2 Implementation of VGG model

Based on the Paddle framework, the specific implementation of VGG is shown in the following code:

```python
# -*- coding:utf-8 -*-
# VGG model code
import numpy as np
import paddle
from paddle.nn import Conv2D, MaxPool2D, BatchNorm2D, Linear

# Define the VGG network
class VGG(paddle.nn.Layer):
    def __init__(self):
        super(VGG, self).__init__()
        in_channels = [3, 64, 128, 256, 512, 512]
        # First convolutional block: two convolutions
        self.conv1_1 = Conv2D(in_channels=in_channels[0], out_channels=in_channels[1], kernel_size=3, padding=1, stride=1)
        self.conv1_2 = Conv2D(in_channels=in_channels[1], out_channels=in_channels[1], kernel_size=3, padding=1, stride=1)
        # Second convolutional block: two convolutions
        self.conv2_1 = Conv2D(in_channels=in_channels[1], out_channels=in_channels[2], kernel_size=3, padding=1, stride=1)
        self.conv2_2 = Conv2D(in_channels=in_channels[2], out_channels=in_channels[2], kernel_size=3, padding=1, stride=1)
        # Third convolutional block: three convolutions
        self.conv3_1 = Conv2D(in_channels=in_channels[2], out_channels=in_channels[3], kernel_size=3, padding=1, stride=1)
        self.conv3_2 = Conv2D(in_channels=in_channels[3], out_channels=in_channels[3], kernel_size=3, padding=1, stride=1)
        self.conv3_3 = Conv2D(in_channels=in_channels[3], out_channels=in_channels[3], kernel_size=3, padding=1, stride=1)
        # Fourth convolutional block: three convolutions
        self.conv4_1 = Conv2D(in_channels=in_channels[3], out_channels=in_channels[4], kernel_size=3, padding=1, stride=1)
        self.conv4_2 = Conv2D(in_channels=in_channels[4], out_channels=in_channels[4], kernel_size=3, padding=1, stride=1)
        self.conv4_3 = Conv2D(in_channels=in_channels[4], out_channels=in_channels[4], kernel_size=3, padding=1, stride=1)
        # Fifth convolutional block: three convolutions
        self.conv5_1 = Conv2D(in_channels=in_channels[4], out_channels=in_channels[5], kernel_size=3, padding=1, stride=1)
        self.conv5_2 = Conv2D(in_channels=in_channels[5], out_channels=in_channels[5], kernel_size=3, padding=1, stride=1)
        self.conv5_3 = Conv2D(in_channels=in_channels[5], out_channels=in_channels[5], kernel_size=3, padding=1, stride=1)
        # Use Sequential to combine a fully connected layer and ReLU (fc + relu)
        # For a 224x224 input, after five conv blocks and poolings the features are [512x7x7]
        self.fc1 = paddle.nn.Sequential(paddle.nn.Linear(512 * 7 * 7, 4096), paddle.nn.ReLU())
        self.drop1_ratio = 0.5
        self.dropout1 = paddle.nn.Dropout(self.drop1_ratio, mode='upscale_in_train')
        # Use Sequential to combine a fully connected layer and ReLU (fc + relu)
        self.fc2 = paddle.nn.Sequential(paddle.nn.Linear(4096, 4096), paddle.nn.ReLU())
        self.drop2_ratio = 0.5
        self.dropout2 = paddle.nn.Dropout(self.drop2_ratio, mode='upscale_in_train')
        self.fc3 = paddle.nn.Linear(4096, 1)
        self.relu = paddle.nn.ReLU()
        self.pool = MaxPool2D(stride=2, kernel_size=2)

    def forward(self, x):
        x = self.relu(self.conv1_1(x))
        x = self.relu(self.conv1_2(x))
        x = self.pool(x)
        x = self.relu(self.conv2_1(x))
        x = self.relu(self.conv2_2(x))
        x = self.pool(x)
        x = self.relu(self.conv3_1(x))
        x = self.relu(self.conv3_2(x))
        x = self.relu(self.conv3_3(x))
        x = self.pool(x)
        x = self.relu(self.conv4_1(x))
        x = self.relu(self.conv4_2(x))
        x = self.relu(self.conv4_3(x))
        x = self.pool(x)
        x = self.relu(self.conv5_1(x))
        x = self.relu(self.conv5_2(x))
        x = self.relu(self.conv5_3(x))
        x = self.pool(x)
        x = paddle.flatten(x, 1, -1)
        x = self.dropout1(self.relu(self.fc1(x)))
        x = self.dropout2(self.relu(self.fc2(x)))
        x = self.fc3(x)
        return x
```
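As with LeNet, a dummy forward pass confirms the 512×7×7 flattened size (a usage sketch, assuming the class above):

```python
model = VGG()
x = paddle.randn([2, 3, 224, 224])
print(model(x).shape)  # [2, 1]; fc3 outputs a single value in this implementation
```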

3.3 Features of VGG model

  • The entire network uses a single convolution kernel size (3×3) and max pooling size (2×2).
  • The significance of the 1×1 convolution is mainly a linear transformation: the numbers of input and output channels stay the same, and no dimensionality reduction occurs.
  • Two stacked 3×3 convolutional layers have an effective receptive field of 5×5, equivalent to one 5×5 convolutional layer. Similarly, three stacked 3×3 layers are equivalent to one 7×7 layer. This arrangement uses fewer parameters, and the extra activation functions between layers give the network a stronger ability to learn features.
  • VGGNet uses a small trick during training: first train the shallow, simple VGG11, then reuse its weights to initialize the deeper configurations (VGG13, VGG16, VGG19 in turn), which makes training converge faster.
  • During training, multi-scale transformations are used to augment the original data, making the model less prone to overfitting.

3.4 VGG model indicators

VGG finished second in the 2014 ImageNet competition; the specific metrics are shown in Figure 2. The first row of Figure 2 is the competition result: the test set error rate reached 7.3%. In the paper, the authors further optimized the algorithm and finally reached an error rate of 6.8%.

  • References

[1] Very deep convolutional networks for large-scale image recognition.

4. GoogLeNet (2014)

GoogLeNet [1] is the champion of the 2014 ImageNet competition. Its main feature is that the network not only has depth but also has "width" in the horizontal direction. As the name suggests, GoogLeNet was designed by Google engineers, and the name pays tribute to LeNet. The core of GoogLeNet is its internal sub-network structure, the Inception module, which was inspired by NIN (Network in Network).

4.1 GoogLeNet model structure

Because image information varies hugely in spatial extent, choosing a suitable convolution kernel size for feature extraction is difficult. Image information spread over a wide area is better extracted with a larger kernel, while information distributed over a small area suits a smaller kernel. To solve this problem, GoogLeNet proposed the Inception module, shown in Figure 1:


Explanation:

  • In order to pay tribute to LeNet, Google researchers specially named the model GoogLeNet.

  • The word Inception comes from the movie "Inception".

Figure 1(a) shows the design idea of the Inception module: three convolution kernels of different sizes operate on the input image, together with an additional max pooling branch, and the outputs of these four operations are concatenated along the channel dimension. The resulting feature map therefore contains features extracted by kernels of different sizes, capturing information at different scales. The Inception module adopts a multi-branch design in which each branch uses a kernel of a different size, and the number of output channels is the sum of the branch output channels. This can make the output channel count very large, especially when several Inception modules are stacked in series, causing the number of model parameters to grow rapidly.

To reduce the number of parameters, the Inception module uses the design in Figure 1(b): a 1×1 convolutional layer is added before each 3×3 and 5×5 convolution to control the number of output channels, and another 1×1 convolutional layer is added after the max pooling to reduce the number of output channels. This design idea yields the structure shown in (b). The following program implements this Inception block and can be read alongside Figure 1(b).


Hint:

Some readers may ask: won't the image size be reduced after 3×3 max pooling? Why can it be concatenated with the feature maps output by the other three convolution branches? This is because the pooling operation uses a window of kh = kw = 3 with stride = 1 and padding = 1, so the output feature map size stays unchanged.
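A small helper makes this concrete (the formula is the standard one for convolution and pooling output sizes; the function name is illustrative):

```python
def out_size(h, kernel, stride, padding):
    # Output size of a convolution or pooling window along one dimension
    return (h + 2 * padding - kernel) // stride + 1

print(out_size(28, kernel=3, stride=1, padding=1))  # 28 -- size unchanged
```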


The specific implementation of the Inception module is shown in the following code:

```python
# GoogLeNet model code
import numpy as np
import paddle
from paddle.nn import Conv2D, MaxPool2D, AdaptiveAvgPool2D, Linear
import paddle.nn.functional as F

# Define the Inception block
class Inception(paddle.nn.Layer):
    def __init__(self, c0, c1, c2, c3, c4, **kwargs):
        '''
        Implementation of the Inception block.
        c0: number of input channels
        c1: output channels of the 1x1 convolution on the first branch in Figure (b) (int)
        c2: output channels on the second branch (tuple or list), where c2[0] is for
            the 1x1 convolution and c2[1] for the 3x3 convolution
        c3: output channels on the third branch (tuple or list), where c3[0] is for
            the 1x1 convolution and c3[1] for the 5x5 convolution
        c4: output channels of the 1x1 convolution on the fourth branch (int)
        '''
        super(Inception, self).__init__()
        # Create the operations used on each branch of the Inception block
        self.p1_1 = Conv2D(in_channels=c0, out_channels=c1, kernel_size=1)
        self.p2_1 = Conv2D(in_channels=c0, out_channels=c2[0], kernel_size=1)
        self.p2_2 = Conv2D(in_channels=c2[0], out_channels=c2[1], kernel_size=3, padding=1)
        self.p3_1 = Conv2D(in_channels=c0, out_channels=c3[0], kernel_size=1)
        self.p3_2 = Conv2D(in_channels=c3[0], out_channels=c3[1], kernel_size=5, padding=2)
        self.p4_1 = MaxPool2D(kernel_size=3, stride=1, padding=1)
        self.p4_2 = Conv2D(in_channels=c0, out_channels=c4, kernel_size=1)

    def forward(self, x):
        # Branch 1: a single 1x1 convolution
        p1 = F.relu(self.p1_1(x))
        # Branch 2: 1x1 convolution + 3x3 convolution
        p2 = F.relu(self.p2_2(F.relu(self.p2_1(x))))
        # Branch 3: 1x1 convolution + 5x5 convolution
        p3 = F.relu(self.p3_2(F.relu(self.p3_1(x))))
        # Branch 4: max pooling + 1x1 convolution
        p4 = F.relu(self.p4_2(self.p4_1(x)))
        # Concatenate the branch outputs along the channel axis as the final result
        return paddle.concat([p1, p2, p3, p4], axis=1)
```
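The number of output channels is the sum over the four branches, which a dummy forward pass confirms (a usage sketch; the channel numbers below match GoogLeNet's first Inception block):

```python
# c0=192 input channels; output channels = 64 + 128 + 32 + 32 = 256
block = Inception(192, 64, (96, 128), (16, 32), 32)
x = paddle.randn([1, 192, 28, 28])
print(block(x).shape)  # [1, 256, 28, 28]
```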

The architecture of GoogLeNet is shown in Figure 2. The main convolutional part uses five modules (blocks), with a 3 × 3 max pooling layer of stride 2 between modules to reduce the output height and width.

  • The first module uses a 64-channel 7 × 7 convolutional layer.
  • The second module uses 2 convolutional layers: first a 1 × 1 convolutional layer with 64 channels, followed by a 3 × 3 convolutional layer with three times as many channels (192).
  • The third module connects 2 complete Inception blocks in series.
  • The fourth module connects 5 Inception blocks in series.
  • The fifth module connects 2 Inception blocks in series.
  • The fifth module is followed by the output layer, using the global average pooling layer to change the height and width of each channel to 1, and finally connecting a fully connected layer whose output number is the number of label categories.

Explanation:  The two auxiliary classifiers shown in the figure, softmax1 and softmax2, come from the original paper. As shown in the figure below, the losses of the three classifiers are summed with weights during training to alleviate the vanishing gradient problem; a sketch of this weighted loss follows below.
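A sketch of such a weighted loss, assuming a model that returns [out, out1, out2] as the implementation in the next section does; the 0.3 weight for the auxiliary losses follows the paper:

```python
import paddle.nn.functional as F

def googlenet_loss(outputs, labels):
    # outputs = [main_out, aux_out1, aux_out2]
    out, out1, out2 = outputs
    main_loss = F.cross_entropy(out, labels)
    aux_loss1 = F.cross_entropy(out1, labels)
    aux_loss2 = F.cross_entropy(out2, labels)
    # Auxiliary losses are weighted by 0.3 and used during training only
    return main_loss + 0.3 * (aux_loss1 + aux_loss2)
```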

4.2 Implementation of GoogLeNet model

The specific implementation of GoogLeNet is shown in the following code:

```python
# GoogLeNet model code
import math

import paddle
from paddle import ParamAttr
import paddle.nn as nn
import paddle.nn.functional as F
from paddle.nn import Conv2D, BatchNorm, Linear, Dropout
from paddle.nn import AdaptiveAvgPool2D, MaxPool2D, AvgPool2D
from paddle.nn.initializer import Uniform

# Parameter initialization for the fully connected layers
def xavier(channels, filter_size, name):
    stdv = (3.0 / (filter_size**2 * channels))**0.5
    param_attr = ParamAttr(initializer=Uniform(-stdv, stdv), name=name + "_weights")
    return param_attr

class ConvLayer(nn.Layer):
    def __init__(self, num_channels, num_filters, filter_size, stride=1, groups=1, act=None, name=None):
        super(ConvLayer, self).__init__()
        self._conv = Conv2D(
            in_channels=num_channels,
            out_channels=num_filters,
            kernel_size=filter_size,
            stride=stride,
            padding=(filter_size - 1) // 2,
            groups=groups,
            weight_attr=ParamAttr(name=name + "_weights"),
            bias_attr=False)

    def forward(self, inputs):
        y = self._conv(inputs)
        return y

# Define the Inception block
class Inception(nn.Layer):
    def __init__(self, input_channels, output_channels, filter1, filter3R, filter3,
                 filter5R, filter5, proj, name=None):
        '''
        Implementation of the Inception block.
        filter1: output channels of the 1x1 convolution on branch 1
        filter3R / filter3: output channels of the 1x1 reduction and the 3x3 convolution on branch 2
        filter5R / filter5: output channels of the 1x1 reduction and the 5x5 convolution on branch 3
        proj: output channels of the 1x1 projection after max pooling on branch 4
        '''
        super(Inception, self).__init__()
        # Create the operations used on each branch of the Inception block
        self._conv1 = ConvLayer(input_channels, filter1, 1, name="inception_" + name + "_1x1")
        self._conv3r = ConvLayer(input_channels, filter3R, 1, name="inception_" + name + "_3x3_reduce")
        self._conv3 = ConvLayer(filter3R, filter3, 3, name="inception_" + name + "_3x3")
        self._conv5r = ConvLayer(input_channels, filter5R, 1, name="inception_" + name + "_5x5_reduce")
        self._conv5 = ConvLayer(filter5R, filter5, 5, name="inception_" + name + "_5x5")
        self._pool = MaxPool2D(kernel_size=3, stride=1, padding=1)
        self._convprj = ConvLayer(input_channels, proj, 1, name="inception_" + name + "_3x3_proj")

    def forward(self, inputs):
        # Branch 1: a single 1x1 convolution
        conv1 = self._conv1(inputs)
        # Branch 2: 1x1 convolution + 3x3 convolution
        conv3r = self._conv3r(inputs)
        conv3 = self._conv3(conv3r)
        # Branch 3: 1x1 convolution + 5x5 convolution
        conv5r = self._conv5r(inputs)
        conv5 = self._conv5(conv5r)
        # Branch 4: max pooling + 1x1 convolution
        pool = self._pool(inputs)
        convprj = self._convprj(pool)
        # Concatenate the branch outputs along the channel axis as the final result
        cat = paddle.concat([conv1, conv3, conv5, convprj], axis=1)
        cat = F.relu(cat)
        return cat

class GoogLeNet(nn.Layer):
    def __init__(self, class_dim=1000):
        super(GoogLeNet, self).__init__()
        # GoogLeNet contains five modules, each followed by a pooling layer
        # The first module contains one convolutional layer
        self._conv = ConvLayer(3, 64, 7, 2, name="conv1")
        # 3x3 max pooling
        self._pool = MaxPool2D(kernel_size=3, stride=2)
        # The second module contains two convolutional layers
        self._conv_1 = ConvLayer(64, 64, 1, name="conv2_1x1")
        self._conv_2 = ConvLayer(64, 192, 3, name="conv2_3x3")
        # The third module contains two Inception blocks
        self._ince3a = Inception(192, 192, 64, 96, 128, 16, 32, 32, name="ince3a")
        self._ince3b = Inception(256, 256, 128, 128, 192, 32, 96, 64, name="ince3b")
        # The fourth module contains five Inception blocks
        self._ince4a = Inception(480, 480, 192, 96, 208, 16, 48, 64, name="ince4a")
        self._ince4b = Inception(512, 512, 160, 112, 224, 24, 64, 64, name="ince4b")
        self._ince4c = Inception(512, 512, 128, 128, 256, 24, 64, 64, name="ince4c")
        self._ince4d = Inception(512, 512, 112, 144, 288, 32, 64, 64, name="ince4d")
        self._ince4e = Inception(528, 528, 256, 160, 320, 32, 128, 128, name="ince4e")
        # The fifth module contains two Inception blocks
        self._ince5a = Inception(832, 832, 256, 160, 320, 32, 128, 128, name="ince5a")
        self._ince5b = Inception(832, 832, 384, 192, 384, 48, 128, 128, name="ince5b")
        # Global pooling
        self._pool_5 = AvgPool2D(kernel_size=7, stride=7)
        self._drop = Dropout(p=0.4, mode="downscale_in_infer")
        self._fc_out = Linear(
            1024, class_dim,
            weight_attr=xavier(1024, 1, "out"),
            bias_attr=ParamAttr(name="out_offset"))
        # First auxiliary classifier (softmax1), fed from ince4a
        self._pool_o1 = AvgPool2D(kernel_size=5, stride=3)
        self._conv_o1 = ConvLayer(512, 128, 1, name="conv_o1")
        self._fc_o1 = Linear(
            1152, 1024,
            weight_attr=xavier(2048, 1, "fc_o1"),
            bias_attr=ParamAttr(name="fc_o1_offset"))
        self._drop_o1 = Dropout(p=0.7, mode="downscale_in_infer")
        self._out1 = Linear(
            1024, class_dim,
            weight_attr=xavier(1024, 1, "out1"),
            bias_attr=ParamAttr(name="out1_offset"))
        # Second auxiliary classifier (softmax2), fed from ince4d
        self._pool_o2 = AvgPool2D(kernel_size=5, stride=3)
        self._conv_o2 = ConvLayer(528, 128, 1, name="conv_o2")
        self._fc_o2 = Linear(
            1152, 1024,
            weight_attr=xavier(2048, 1, "fc_o2"),
            bias_attr=ParamAttr(name="fc_o2_offset"))
        self._drop_o2 = Dropout(p=0.7, mode="downscale_in_infer")
        self._out2 = Linear(
            1024, class_dim,
            weight_attr=xavier(1024, 1, "out2"),
            bias_attr=ParamAttr(name="out2_offset"))

    def forward(self, inputs):
        x = self._conv(inputs)
        x = self._pool(x)
        x = self._conv_1(x)
        x = self._conv_2(x)
        x = self._pool(x)
        x = self._ince3a(x)
        x = self._ince3b(x)
        x = self._pool(x)
        ince4a = self._ince4a(x)
        x = self._ince4b(ince4a)
        x = self._ince4c(x)
        ince4d = self._ince4d(x)
        x = self._ince4e(ince4d)
        x = self._pool(x)
        x = self._ince5a(x)
        ince5b = self._ince5b(x)
        x = self._pool_5(ince5b)
        x = self._drop(x)
        x = paddle.squeeze(x, axis=[2, 3])
        out = self._fc_out(x)
        # First auxiliary classifier, attached to the output of ince4a
        x = self._pool_o1(ince4a)
        x = self._conv_o1(x)
        x = paddle.flatten(x, start_axis=1, stop_axis=-1)
        x = self._fc_o1(x)
        x = F.relu(x)
        x = self._drop_o1(x)
        out1 = self._out1(x)
        # Second auxiliary classifier, attached to the output of ince4d
        x = self._pool_o2(ince4d)
        x = self._conv_o2(x)
        x = paddle.flatten(x, start_axis=1, stop_axis=-1)
        x = self._fc_o2(x)
        x = self._drop_o2(x)
        out2 = self._out2(x)
        return [out, out1, out2]
```

4.3 Features of GoogLeNet model

  • Using convolution kernels of different sizes means receptive fields of different sizes, and the final concatenation fuses features at different scales;
  • The kernel sizes 1, 3, and 5 are chosen mainly for alignment convenience: with stride = 1 and padding = 0, 1, 2 respectively, the convolutions produce features of the same spatial size, which can then be concatenated directly;
  • The deeper the network, the more abstract the features and the larger the receptive field each feature involves, so the proportion of 3x3 and 5x5 convolutions increases with depth. However, 5x5 kernels still bring a huge amount of computation, so the paper uses 1x1 kernels for dimensionality reduction, as quantified in the sketch below.
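The saving from the 1x1 reduction is easy to quantify, ignoring biases (the channel numbers below are those of the 5x5 branch in GoogLeNet's first Inception block):

```python
c_in, c_reduce, c_out = 192, 16, 32
direct = 5 * 5 * c_in * c_out                                  # 153,600 parameters
reduced = 1 * 1 * c_in * c_reduce + 5 * 5 * c_reduce * c_out   # 3,072 + 12,800 = 15,872
print(reduced / direct)  # ~0.10, roughly a 10x reduction
```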

4.4 GoogLeNet Model Indicators

GoogLeNet won the 2014 ImageNet competition; the specific metrics are shown in Figure 3. Its error rate on the test set reached 6.67%.

  • References

[1] Going deeper with convolutions.

5. DarkNet (YOLOv2, YOLOv3)

In the YOLO series of object detection algorithms, the authors designed and trained the DarkNet networks as backbones to achieve better classification results. YOLOv2 [1] first proposed the DarkNet network; since it has 19 convolutional layers, it is also called DarkNet19. Later, in YOLOv3 [2], the author continued to absorb ideas from other strong algorithms, such as residual networks and feature fusion, and proposed a backbone with 53 convolutional layers, DarkNet53. Experiments on ImageNet showed that, with classification accuracy similar to ResNet-101 and ResNet-152, DarkNet53 achieves a clearly faster computation speed.

5.1 DarkNet model structure

5.1.1 DarkNet19

DarkNet19 borrows from many excellent algorithms. Following VGG, it uses mostly 3×3 convolutions and doubles the number of channels after each pooling operation. Following Network in Network, it uses global average pooling for prediction and places 1×1 convolution kernels between the 3×3 kernels to compress features. At the same time, batch normalization layers are used to stabilize training, speed up convergence, and act as regularization. The network structure of DarkNet19 is shown in Figure 1.

The accuracy of DarkNet19 is comparable to that of the VGG network, but it needs only about 1/5 of the floating-point operations (5.58 billion versus 30.69 billion for VGG-16, per the YOLOv2 paper), so it runs much faster.

5.1.2 DarkNet53

Building on this, DarkNet53 borrows the idea of ResNet and uses a large number of residual connections, so the network can be made very deep while alleviating the vanishing gradient problem during training, making the model easier to converge. At the same time, convolutional layers with stride 2 replace pooling layers for downsampling. The network structure of DarkNet53 is shown in Figure 2.

Since the Darknet19 network is now rarely used, the following focuses mainly on the implementation and explanation of the Darknet53 network.

5.2 DarkNet model implementation

Based on the Paddle framework, the specific implementation code of DarkNet53 is as follows:

```python
import math

import paddle
from paddle import ParamAttr
import paddle.nn as nn
import paddle.nn.functional as F
from paddle.nn import Conv2D, BatchNorm, Linear, Dropout
from paddle.nn import AdaptiveAvgPool2D, MaxPool2D, AvgPool2D
from paddle.nn.initializer import Uniform

# Wrap convolution + batch normalization as ConvBNLayer for easy reuse
class ConvBNLayer(nn.Layer):
    def __init__(self, input_channels, output_channels, filter_size, stride, padding, name=None):
        super(ConvBNLayer, self).__init__()
        # Create the convolutional layer
        self._conv = Conv2D(
            in_channels=input_channels,
            out_channels=output_channels,
            kernel_size=filter_size,
            stride=stride,
            padding=padding,
            weight_attr=ParamAttr(name=name + ".conv.weights"),
            bias_attr=False)
        # Create the batch normalization layer
        bn_name = name + ".bn"
        self._bn = BatchNorm(
            num_channels=output_channels,
            act="relu",
            param_attr=ParamAttr(name=bn_name + ".scale"),
            bias_attr=ParamAttr(name=bn_name + ".offset"),
            moving_mean_name=bn_name + ".mean",
            moving_variance_name=bn_name + ".var")

    def forward(self, inputs):
        x = self._conv(inputs)
        x = self._bn(x)
        return x

# Define the residual block
class BasicBlock(nn.Layer):
    def __init__(self, input_channels, output_channels, name=None):
        super(BasicBlock, self).__init__()
        # Two convolutional layers
        self._conv1 = ConvBNLayer(input_channels, output_channels, 1, 1, 0, name=name + ".0")
        self._conv2 = ConvBNLayer(output_channels, output_channels * 2, 3, 1, 1, name=name + ".1")

    def forward(self, inputs):
        x = self._conv1(inputs)
        x = self._conv2(x)
        # Add the output of the second convolution to the original input
        return paddle.add(x=inputs, y=x)

class DarkNet53(nn.Layer):
    def __init__(self, class_dim=1000):
        super(DarkNet53, self).__init__()
        # Number of residual blocks in each stage, from the DarkNet structure diagram
        self.stages = [1, 2, 8, 8, 4]
        # First convolutional layer
        self._conv1 = ConvBNLayer(3, 32, 3, 1, 1, name="yolo_input")
        # Downsampling, implemented with a stride-2 convolution
        self._conv2 = ConvBNLayer(32, 64, 3, 2, 1, name="yolo_input.downsample")
        # Build each stage of residual blocks
        self._basic_block_01 = BasicBlock(64, 32, name="stage.0.0")
        # Downsampling, implemented with a stride-2 convolution
        self._downsample_0 = ConvBNLayer(64, 128, 3, 2, 1, name="stage.0.downsample")
        self._basic_block_11 = BasicBlock(128, 64, name="stage.1.0")
        self._basic_block_12 = BasicBlock(128, 64, name="stage.1.1")
        # Downsampling, implemented with a stride-2 convolution
        self._downsample_1 = ConvBNLayer(128, 256, 3, 2, 1, name="stage.1.downsample")
        self._basic_block_21 = BasicBlock(256, 128, name="stage.2.0")
        self._basic_block_22 = BasicBlock(256, 128, name="stage.2.1")
        self._basic_block_23 = BasicBlock(256, 128, name="stage.2.2")
        self._basic_block_24 = BasicBlock(256, 128, name="stage.2.3")
        self._basic_block_25 = BasicBlock(256, 128, name="stage.2.4")
        self._basic_block_26 = BasicBlock(256, 128, name="stage.2.5")
        self._basic_block_27 = BasicBlock(256, 128, name="stage.2.6")
        self._basic_block_28 = BasicBlock(256, 128, name="stage.2.7")
        # Downsampling, implemented with a stride-2 convolution
        self._downsample_2 = ConvBNLayer(256, 512, 3, 2, 1, name="stage.2.downsample")
        self._basic_block_31 = BasicBlock(512, 256, name="stage.3.0")
        self._basic_block_32 = BasicBlock(512, 256, name="stage.3.1")
        self._basic_block_33 = BasicBlock(512, 256, name="stage.3.2")
        self._basic_block_34 = BasicBlock(512, 256, name="stage.3.3")
        self._basic_block_35 = BasicBlock(512, 256, name="stage.3.4")
        self._basic_block_36 = BasicBlock(512, 256, name="stage.3.5")
        self._basic_block_37 = BasicBlock(512, 256, name="stage.3.6")
        self._basic_block_38 = BasicBlock(512, 256, name="stage.3.7")
        # Downsampling, implemented with a stride-2 convolution
        self._downsample_3 = ConvBNLayer(512, 1024, 3, 2, 1, name="stage.3.downsample")
        self._basic_block_41 = BasicBlock(1024, 512, name="stage.4.0")
        self._basic_block_42 = BasicBlock(1024, 512, name="stage.4.1")
        self._basic_block_43 = BasicBlock(1024, 512, name="stage.4.2")
        self._basic_block_44 = BasicBlock(1024, 512, name="stage.4.3")
        # Adaptive average pooling
        self._pool = AdaptiveAvgPool2D(1)
        stdv = 1.0 / math.sqrt(1024.0)
        # Classification layer
        self._out = Linear(
            1024, class_dim,
            weight_attr=ParamAttr(name="fc_weights", initializer=Uniform(-stdv, stdv)),
            bias_attr=ParamAttr(name="fc_offset"))

    def forward(self, inputs):
        x = self._conv1(inputs)
        x = self._conv2(x)
        x = self._basic_block_01(x)
        x = self._downsample_0(x)
        x = self._basic_block_11(x)
        x = self._basic_block_12(x)
        x = self._downsample_1(x)
        x = self._basic_block_21(x)
        x = self._basic_block_22(x)
        x = self._basic_block_23(x)
        x = self._basic_block_24(x)
        x = self._basic_block_25(x)
        x = self._basic_block_26(x)
        x = self._basic_block_27(x)
        x = self._basic_block_28(x)
        x = self._downsample_2(x)
        x = self._basic_block_31(x)
        x = self._basic_block_32(x)
        x = self._basic_block_33(x)
        x = self._basic_block_34(x)
        x = self._basic_block_35(x)
        x = self._basic_block_36(x)
        x = self._basic_block_37(x)
        x = self._basic_block_38(x)
        x = self._downsample_3(x)
        x = self._basic_block_41(x)
        x = self._basic_block_42(x)
        x = self._basic_block_43(x)
        x = self._basic_block_44(x)
        x = self._pool(x)
        x = paddle.squeeze(x, axis=[2, 3])
        x = self._out(x)
        return x
```
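A usage sketch for the class above:

```python
model = DarkNet53(class_dim=1000)
x = paddle.randn([1, 3, 256, 256])  # e.g. a 256x256 RGB image
print(model(x).shape)  # [1, 1000]
```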

5.3 DarkNet Model Features

  • The DarkNet53 model uses a large number of residual connections, which alleviates the problem of gradient disappearance during training and makes the model easier to converge.
  • The DarkNet53 model uses a convolutional layer with a stride of 2 instead of a pooling layer to achieve downsampling.

5.4 DarkNet Model Indicators

In the YOLOv3 paper, the author compared the accuracy and speed of the DarkNet and ResNet networks on the ImageNet dataset, as shown in Figure 3. The top-5 accuracy of DarkNet53 reaches 93.8%, and its speed clearly exceeds that of ResNet-101 and ResNet-152.


  • References

[1] YOLO9000: Better, Faster, Stronger

[2] YOLOv3: An Incremental Improvement
