[Pytorch Neural Network Theory] 32 PNASNet model: depthwise separable convolution + group convolution + atrous convolution

1 Introduction to the PNASNet model

    The PNASNet model was generated by Google's AutoML automatic architecture search. It uses progressive neural architecture search with iterative self-learning to find an optimal network structure; that is, it uses machines to design machine-learning models so that they better fit the data users provide. The model reaches a Top-1 accuracy of 82.9% and a Top-5 accuracy of 96.2% on the ImageNet dataset, making it one of the best image-classification models currently available.
    The main building blocks of the PNASNet model are the Normal Cell and the Reduction Cell (see paper 1712.00559 on the arXiv website). Like the NASNet model, its structure uses residual connections and multi-branch convolution, and it also adds depthwise separable convolution (a special form of group convolution) and atrous convolution.

2 Group convolution

Group convolution first splits the input data into groups along the channel dimension and then convolves each group separately. It gives the correlations between convolution kernels a block-diagonal structure and reduces the number of trainable parameters, making overfitting less likely, similar to a regularization effect. The AlexNet model used the group-convolution technique.

2.1 Operation rules of group convolution

2.1.1 The difference between ordinary convolution and group convolution

The biggest difference between ordinary convolution and group convolution lies in how the convolution kernels operate across channels.

In ordinary convolution, each kernel convolves every input channel and sums the results, so every output feature map contains feature information from all input channels.

In group convolution, the fusion is performed per group: ordinary convolution is applied within each group, so each output feature map contains feature information only from the channels of its own group.

2.2 Code to implement group convolution

2.2.1 Code Implementation

import torch

input1 = torch.ones([1, 12, 5, 5])

# Define a group convolution; in_channels and out_channels must both be
# integer multiples of groups.
groupsconv = torch.nn.Conv2d(in_channels=12, out_channels=6, kernel_size=3, groups=3)
Group_convolution = groupsconv(input1)
print("Shape of the group-convolution kernels:", groupsconv.weight.size())       # torch.Size([6, 4, 3, 3])
print("Shape of the group-convolution result:", Group_convolution.size())        # torch.Size([1, 6, 3, 3])

conv = torch.nn.Conv2d(in_channels=12, out_channels=6, kernel_size=3, groups=1)  # ordinary convolution
Ordinary_convolution = conv(input1)
print("Shape of the ordinary-convolution kernels:", conv.weight.size())          # torch.Size([6, 12, 3, 3])
print("Shape of the ordinary-convolution result:", Ordinary_convolution.size())  # torch.Size([1, 6, 3, 3])

2.2.2 Code Explanation

The group convolution uses six 4-channel convolution kernels. It proceeds as follows:
1. Split the 12 channels of the input data into 3 groups of 4 channels each.
2. Convolve the 4 channels of the first group with the first 4-channel kernel, channel by channel, and add the 4 per-channel results to obtain the feature map of the first output channel.
3. Convolve the 4 channels of the first group with the second 4-channel kernel in the same way as step (2) to obtain the feature map of the second output channel.
4. Apply steps (2)~(3) to the 4 channels of the second group with the 3rd and 4th 4-channel kernels to obtain the feature maps of the 3rd and 4th output channels.
5. Apply steps (2)~(3) to the 4 channels of the third group with the 5th and 6th 4-channel kernels to obtain the feature maps of the 5th and 6th output channels.
6. The result is a group-convolution output with 6 channels.

Ordinary convolution convolves each 12-channel kernel with the 12-channel input directly and adds the per-channel results to obtain the feature map of the first output channel. Repeating this with the remaining 5 kernels completes the whole convolution.
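As a sanity check on the rules above, the grouped result can be reproduced manually by slicing the input and the kernel tensor per group (a small sketch reusing the shapes from the code above):

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 12, 5, 5)

# A group convolution with groups=3: 12 input channels split into 3 groups of 4.
gconv = torch.nn.Conv2d(12, 6, kernel_size=3, groups=3, bias=False)

# Equivalent computation: run each group through its own ordinary convolution.
outs = []
for g in range(3):
    x_g = x[:, g * 4:(g + 1) * 4]            # channels of group g: [1, 4, 5, 5]
    w_g = gconv.weight[g * 2:(g + 1) * 2]    # the 2 kernels of group g: [2, 4, 3, 3]
    outs.append(torch.nn.functional.conv2d(x_g, w_g))
manual = torch.cat(outs, dim=1)              # shape [1, 6, 3, 3]

print(torch.allclose(gconv(x), manual))      # True
```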

2.3 Advantages and disadvantages of group convolution

2.3.1 Advantages of group convolution

The advantage of group convolution is that it reduces both the number of parameters and the amount of computation, and the group size can be tuned to improve the classification accuracy of a DNN.
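The parameter saving is easy to verify with the shapes from the earlier example: with groups=3, each kernel spans only 12/3 = 4 input channels, so the weight tensor shrinks by a factor of 3:

```python
import torch

# Parameter counts for 12 -> 6 channels with a 3x3 kernel (no bias):
ordinary = torch.nn.Conv2d(12, 6, 3, bias=False)
grouped = torch.nn.Conv2d(12, 6, 3, groups=3, bias=False)

print(ordinary.weight.numel())  # 6 * 12 * 3 * 3 = 648
print(grouped.weight.numel())   # 6 *  4 * 3 * 3 = 216, one third of the ordinary count
```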

2.3.2 Disadvantages of group convolution

In group convolution, an arbitrarily chosen group size can create an imbalance between computational intensity and the degree of data reuse, hurting computational efficiency.

3 Depthwise Separable Convolutions

Depthwise separable convolution applies a separate convolution kernel (or set of kernels) to each input channel individually.

3.1 Depth Separable Convolution Article Source

The Xception model ("Extreme Inception") is an evolution of the Inception series of models. Its main purpose in using depthwise separable convolution is to decouple channel correlation from spatial correlation, so that the convolutions over channel relationships and over spatial relationships are independent of each other, achieving better results (see paper 1610.02357 on the arXiv website).

3.2 Code Implementation: Depthwise Separable Convolution

3.2.1 Code Brief

In depthwise separable convolution, the parameter k (the depth multiplier) defines the number of convolution kernels applied to each input channel, so the number of output channels is k × the number of input channels.

3.2.2 Code Implementation: Depthwise Separable Convolution

# Example: a depthwise convolution with k = 2. For input data with 4 channels,
# each channel is matched with 2 single-channel kernels.

# To implement it, simply set the groups parameter of a group convolution equal
# to the number of input channels (in_channels).
import torch
input1 = torch.ones([1, 4, 5, 5])
conv = torch.nn.Conv2d(in_channels=4, out_channels=8, kernel_size=3)  # ordinary convolution
# Depthwise convolution with k = 2 (k = out_channels / in_channels)
depthwise_conv = torch.nn.Conv2d(in_channels=4, out_channels=8, kernel_size=3, groups=4)

Ordinary_convolution = conv(input1)  # ordinary convolution
print("Shape of the ordinary-convolution kernels:", conv.weight.size())           # torch.Size([8, 4, 3, 3])
print("Shape of the ordinary-convolution result:", Ordinary_convolution.size())   # torch.Size([1, 8, 3, 3])

Depthwise_convolution = depthwise_conv(input1)  # depthwise convolution
print("Shape of the depthwise-convolution kernels:", depthwise_conv.weight.size())  # torch.Size([8, 1, 3, 3])
print("Shape of the depthwise-convolution result:", Depthwise_convolution.size())   # torch.Size([1, 8, 3, 3])
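The snippet above stops at the depthwise stage. A full depthwise separable convolution, as used in Xception and MobileNet, follows the per-channel (depthwise) convolution with a 1×1 pointwise convolution that mixes the channels back together. A minimal sketch (the layer sizes here are illustrative, not from the original):

```python
import torch

x = torch.ones(1, 4, 5, 5)

# Depthwise stage: groups == in_channels, each input channel convolved independently.
depthwise = torch.nn.Conv2d(4, 8, kernel_size=3, groups=4)
# Pointwise stage: a 1x1 convolution mixes the channels back together.
pointwise = torch.nn.Conv2d(8, 16, kernel_size=1)

y = pointwise(depthwise(x))
print(y.size())  # torch.Size([1, 16, 3, 3])
```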

4 Atrous convolution

4.1 The meaning of atrous convolution

Atrous (dilated) convolution was proposed for image semantic segmentation, where downsampling reduces image resolution and loses information. By inserting holes into the kernel to enlarge the receptive field, a 3×3 convolution kernel can cover a 5×5 or larger receptive field with the same number of parameters and the same amount of computation, so downsampling becomes unnecessary.
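For a kernel of size k and dilation rate d, the effective kernel size is k + (k − 1)(d − 1); a one-line helper makes this concrete:

```python
def effective_kernel_size(k, dilation):
    """Receptive field of a single dilated convolution layer."""
    return k + (k - 1) * (dilation - 1)

print(effective_kernel_size(3, 1))  # 3  (ordinary convolution)
print(effective_kernel_size(3, 2))  # 5
print(effective_kernel_size(3, 4))  # 9
```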

4.2 Diagram of atrous convolution

4.3 Code Implementation of Atrous Convolution

4.3.1 Overview of Atrous Convolutional Codes

Atrous convolution can also be implemented directly through the dilation parameter of the convolution class. The dilation parameter specifies the spacing between elements of the convolution kernel and defaults to 1, which gives ordinary convolution.

4.3.2 Code Implementation of Atrous Convolution

import torch
# 1.0 Prepare the data
arr = torch.tensor(range(1, 26), dtype=torch.float32)  # generate 5x5 sample data
arr = arr.reshape([1, 1, 5, 5])                        # reshape the sample data
print("Sample data:", arr)
# Sample data: tensor([[[[ 1.,  2.,  3.,  4.,  5.],
#                [ 6.,  7.,  8.,  9., 10.],
#                [11., 12., 13., 14., 15.],
#                [16., 17., 18., 19., 20.],
#                [21., 22., 23., 24., 25.]]]])

# 1.1 Ordinary convolution
Ordinary_Convolution = torch.nn.Conv2d(1, 1, 3, stride=1, bias=False, dilation=1)  # ordinary convolution
torch.nn.init.constant_(Ordinary_Convolution.weight, 1)  # initialize the kernel of Ordinary_Convolution
print("Ordinary_Convolution kernel:", Ordinary_Convolution.weight.size())
# Output: Ordinary_Convolution kernel: torch.Size([1, 1, 3, 3])
ret_Ordinary = Ordinary_Convolution(arr)
print("Ordinary convolution result:", ret_Ordinary)
# Output: tensor([[[[ 63.,  72.,  81.], [108., 117., 126.], [153., 162., 171.]]]], grad_fn=<ThnnConv2DBackward0>)

# 1.2 Atrous convolution
Atrous_Convolution = torch.nn.Conv2d(1, 1, 3, stride=1, bias=False, dilation=2)  # atrous convolution
torch.nn.init.constant_(Atrous_Convolution.weight, 1)  # initialize the kernel of Atrous_Convolution
print("Atrous_Convolution kernel:", Atrous_Convolution.weight.size())
# Output: Atrous_Convolution kernel: torch.Size([1, 1, 3, 3])
ret_Atrous = Atrous_Convolution(arr)
print("Atrous convolution result:", ret_Atrous)
# Output: tensor([[[[117.]]]], grad_fn=<SlowConvDilated2DBackward0>)

4.4 Theoretical Implementation of Atrous Convolution

Dilated (atrous) convolutions introduce a new parameter to the convolutional layer called the "dilation rate", which defines the spacing between the kernel values as the kernel processes the data.

In other words, compared with standard convolution, dilated convolution has one extra parameter, the dilation rate, which specifies the number of intervals between adjacent points of the kernel.

4.5 Comparison of Atrous Convolution and Ordinary Convolution

4.5.1 Ordinary 3×3 Convolution

4.5.2 Atrous convolution (3×3 convolution with dilation rate=2)

A 3×3 convolution kernel with a dilation rate of 2 has the same receptive field as a 5×5 kernel while requiring only 9 parameters. You can think of it as a 5×5 kernel from which every other row and column has been removed.
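This equivalence can be checked directly: a 3×3 all-ones kernel with dilation=2 gives the same result as a 5×5 kernel that is zero everywhere except the 9 dilated positions (reusing the 5×5 sample data from section 4.3.2):

```python
import torch

arr = torch.arange(1., 26.).reshape(1, 1, 5, 5)

# 3x3 kernel of ones applied with dilation=2 ...
dilated = torch.nn.Conv2d(1, 1, 3, bias=False, dilation=2)
torch.nn.init.constant_(dilated.weight, 1)

# ... equals a 5x5 kernel of ones with every other row/column zeroed out.
sparse = torch.nn.Conv2d(1, 1, 5, bias=False)
torch.nn.init.zeros_(sparse.weight)
with torch.no_grad():
    sparse.weight[0, 0, ::2, ::2] = 1.0  # 9 non-zero weights at the dilated positions

print(dilated(arr))  # tensor([[[[117.]]]], ...)
print(sparse(arr))   # same value: the two operations sample identical positions
```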

Under the same computational budget, atrous convolution provides a larger receptive field. It is often used in real-time image segmentation: when a network layer needs a large receptive field but limited computing resources prevent increasing the number or size of the convolution kernels, atrous convolution is worth considering.

4.6 Two major advantages of atrous convolution:

4.6.1 Expand the receptive field

In deep networks, downsampling is commonly used to enlarge the receptive field while reducing computation, but it sacrifices spatial resolution. Atrous convolution enlarges the receptive field without losing resolution, which makes it useful in detection and segmentation tasks: a larger receptive field helps detect and segment large objects, while high resolution allows precise localization of the target.

4.6.2 Capturing Multiscale Contextual Information

Atrous convolution has a parameter for the dilation rate, which corresponds to inserting (dilation rate − 1) zeros between the elements of the convolution kernel. Different dilation rates therefore produce different receptive fields, i.e., capture multi-scale information. Multi-scale information is very important in visual tasks.
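One common way to exploit this, sketched here in the style of DeepLab's ASPP module (the class name and layer sizes are illustrative, not from the original), is to run parallel branches with different dilation rates and concatenate their outputs:

```python
import torch

class MultiScaleBlock(torch.nn.Module):
    """Parallel 3x3 convolutions with different dilation rates (ASPP-style sketch)."""
    def __init__(self, in_ch, out_ch, rates=(1, 2, 4)):
        super().__init__()
        # padding=rate keeps the spatial size unchanged for a 3x3 kernel.
        self.branches = torch.nn.ModuleList(
            torch.nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        )

    def forward(self, x):
        # Each branch sees a different receptive field; concatenate along channels.
        return torch.cat([b(x) for b in self.branches], dim=1)

block = MultiScaleBlock(16, 8)
x = torch.randn(1, 16, 32, 32)
print(block(x).size())  # torch.Size([1, 24, 32, 32])
```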

Using atrous convolution instead of downsampling/upsampling preserves the spatial structure of the image well and avoids losing image information.

4.7 Problems with atrous convolution:

4.7.1 Grid effects

When 3×3 kernels with a dilation rate of 2 are stacked multiple times, a gridding problem appears: the stacked kernels keep sampling the input at the same regularly spaced positions, so some pixels never participate in the computation, local information is lost, and the continuity of the information is broken.
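The gridding effect can be made visible by pushing a single bright pixel through two stacked dilation=2 convolutions and inspecting which output positions it reaches (a small sketch; the 9×9 image size is arbitrary):

```python
import torch

# A single bright pixel in the middle of a blank image.
delta = torch.zeros(1, 1, 9, 9)
delta[0, 0, 4, 4] = 1.0

# Two stacked 3x3 convolutions with dilation=2 and all-ones kernels.
conv = torch.nn.Conv2d(1, 1, 3, padding=2, dilation=2, bias=False)
torch.nn.init.constant_(conv.weight, 1)

footprint = conv(conv(delta))
print(footprint[0, 0])
# Non-zero values appear only at even offsets from the centre: the pixels in
# between are never touched, which is the gridding (checkerboard) effect.
```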
4.7.2 Long-range information may be irrelevant

Using only large dilation rates may be effective only for segmenting large objects. The key to designing atrous convolutional layers is how to handle objects of different sizes at the same time.


Origin blog.csdn.net/qq_39237205/article/details/124056610