The difference between Depthwise+Pointwise convolution and conventional convolution

Reposting this Zhihu article in case the author deletes it one day and I can't find it again, haha. The explanation is very clear.

Link: https://zhuanlan.zhihu.com/p/80041030

Depthwise (DW) convolution and Pointwise (PW) convolution are together called Depthwise Separable Convolution (see Google's Xception). Like conventional convolution, this structure can be used to extract features, but its parameter count and computational cost are lower. It therefore appears in lightweight networks such as MobileNet.

Conventional convolution operation

Take a 5×5 pixel, three-channel color input picture (shape 5×5×3). It passes through a convolution layer with 3×3 convolution kernels; assuming the number of output channels is 4, the shape of the convolution kernel is 3×3×3×4. The layer outputs 4 Feature Maps. With same padding, their size is the same as the input layer (5×5); without padding, the size becomes 3×3.
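The two output sizes above can be checked with a small helper. This is a minimal sketch, assuming stride 1 as in the example; the function name `conv_output_size` and its parameters are made up for illustration and do not come from the original article.

```python
def conv_output_size(in_size, kernel_size, padding="valid", stride=1):
    """Spatial output size of a convolution along one dimension."""
    if padding == "same":
        # "same" padding adds enough zeros that output = ceil(input / stride)
        return -(-in_size // stride)
    if padding == "valid":
        # no padding: the kernel must fit entirely inside the input
        return (in_size - kernel_size) // stride + 1
    raise ValueError(f"unknown padding mode: {padding!r}")


print(conv_output_size(5, 3, "same"))   # 5
print(conv_output_size(5, 3, "valid"))  # 3
```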

At this point, the convolutional layer has 4 Filters in total, each Filter contains 3 Kernels, and each Kernel has size 3×3. Therefore, the number of parameters of the convolutional layer (ignoring biases) can be calculated with the following formula:
N_std = 4 × 3 × 3 × 3 = 108
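To make the Filter/Kernel structure concrete, here is a rough pure-Python sketch of the conventional convolution on the 5×5×3 example. It assumes stride 1, zero padding, no bias, and the cross-correlation convention used by deep learning frameworks; `standard_conv` and the random weights are illustrative only.

```python
import random


def standard_conv(image, filters, pad=1):
    """image: [C_in][H][W]; filters: [C_out][C_in][K][K]; stride 1, zero padding."""
    c_in, h, w = len(image), len(image[0]), len(image[0][0])
    k = len(filters[0][0])
    out_h, out_w = h + 2 * pad - k + 1, w + 2 * pad - k + 1
    output = []
    for filt in filters:                  # one output Feature map per Filter
        fmap = [[0.0] * out_w for _ in range(out_h)]
        for y in range(out_h):
            for x in range(out_w):
                acc = 0.0
                for c in range(c_in):     # every Filter sums over all input channels
                    for dy in range(k):
                        for dx in range(k):
                            iy, ix = y + dy - pad, x + dx - pad
                            if 0 <= iy < h and 0 <= ix < w:
                                acc += image[c][iy][ix] * filt[c][dy][dx]
                fmap[y][x] = acc
        output.append(fmap)
    return output


random.seed(0)
image = [[[random.random() for _ in range(5)] for _ in range(5)] for _ in range(3)]
filters = [[[[random.random() for _ in range(3)] for _ in range(3)]
            for _ in range(3)] for _ in range(4)]

out = standard_conv(image, filters, pad=1)
n_params = sum(1 for f in filters for kern in f for row in kern for _ in row)
print(len(out), len(out[0]), len(out[0][0]))  # 4 5 5
print(n_params)                               # 108
```

Calling `standard_conv(image, filters, pad=0)` instead gives 3×3 Feature maps, matching the no-padding case.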

Depthwise Separable Convolution

Depthwise Separable Convolution decomposes a complete convolution operation into two steps: Depthwise Convolution and Pointwise Convolution.

Depthwise Convolution

Unlike conventional convolution, in Depthwise Convolution each convolution kernel is responsible for exactly one channel, and each channel is convolved by exactly one convolution kernel. In the conventional convolution above, every convolution kernel operates on all channels of the input picture at the same time.

Again take a 5×5 pixel, three-channel color input image (shape 5×5×3). Depthwise Convolution is the first convolution operation applied to it. Unlike the conventional convolution above, DW is carried out entirely within a two-dimensional plane. The number of convolution kernels equals the number of channels in the previous layer (channels and convolution kernels correspond one-to-one). A three-channel image therefore generates 3 Feature maps (with same padding, their size is the same as the input layer, 5×5), as shown in the figure below.

Each Filter contains only one Kernel of size 3×3, so the number of parameters in the convolution part is calculated as follows:
N_depthwise = 3 × 3 × 3 = 27
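The one-kernel-per-channel idea can be sketched in pure Python as follows. This is only an illustration, assuming stride 1, zero padding, and no bias; `depthwise_conv` and the identity kernels are hypothetical choices made so the result is easy to check by hand.

```python
def depthwise_conv(image, kernels, pad=1):
    """image: [C][H][W]; kernels: [C][K][K]; one kernel per channel, stride 1."""
    assert len(image) == len(kernels), "need exactly one kernel per input channel"
    h, w = len(image[0]), len(image[0][0])
    k = len(kernels[0])
    out_h, out_w = h + 2 * pad - k + 1, w + 2 * pad - k + 1
    output = []
    for channel, kernel in zip(image, kernels):   # channels are never mixed
        fmap = [[0.0] * out_w for _ in range(out_h)]
        for y in range(out_h):
            for x in range(out_w):
                acc = 0.0
                for dy in range(k):
                    for dx in range(k):
                        iy, ix = y + dy - pad, x + dx - pad
                        if 0 <= iy < h and 0 <= ix < w:
                            acc += channel[iy][ix] * kernel[dy][dx]
                fmap[y][x] = acc
        output.append(fmap)
    return output


# 5x5x3 input where channel c is filled with the constant c + 1
image = [[[float(c + 1)] * 5 for _ in range(5)] for c in range(3)]
# identity kernel: 1 at the centre, 0 elsewhere, so each channel is copied unchanged
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
kernels = [identity for _ in range(3)]

out = depthwise_conv(image, kernels)
n_params = sum(len(row) for kern in kernels for row in kern)
print(len(out), len(out[0]), len(out[0][0]))      # 3 5 5
print(out[0][2][2], out[1][2][2], out[2][2][2])   # 1.0 2.0 3.0
print(n_params)                                   # 27
```

Each output map depends only on its own input channel, which is why the channel values 1, 2, and 3 come through untouched.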

After Depthwise Convolution, the number of Feature maps equals the number of channels in the input layer, so this step cannot increase the number of Feature maps. Moreover, it convolves each channel of the input layer independently and makes no use of the feature information of different channels at the same spatial position. Pointwise Convolution is therefore needed to combine these Feature maps into new Feature maps.

Pointwise Convolution

Pointwise Convolution is very similar to the conventional convolution operation. Its convolution kernel has size 1×1×M, where M is the number of channels in the previous layer. The convolution here therefore forms a weighted combination of the maps from the previous step along the depth direction to generate a new Feature map. There are as many output Feature maps as there are convolution kernels, as shown below.

Since the method of 1×1 convolution is adopted, the number of parameters involved in the convolution in this step can be calculated as: 
N_pointwise = 1 × 1 × 3 × 4 = 12
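A 1×1 convolution is just a per-pixel weighted sum across channels, which a few lines of pure Python can show. This sketch assumes no bias; `pointwise_conv` and the hand-picked weights are illustrative only, chosen so the outputs are easy to verify.

```python
def pointwise_conv(feature_maps, weights):
    """feature_maps: [M][H][W]; weights: [C_out][M], one 1x1xM kernel per output map."""
    m, h, w = len(feature_maps), len(feature_maps[0]), len(feature_maps[0][0])
    output = []
    for kernel in weights:
        assert len(kernel) == m, "each kernel needs one weight per input map"
        # weighted combination of the M input maps at every spatial position
        fmap = [[sum(kernel[c] * feature_maps[c][y][x] for c in range(m))
                 for x in range(w)] for y in range(h)]
        output.append(fmap)
    return output


# three 5x5 maps filled with the constants 1, 2 and 3
maps = [[[float(c + 1)] * 5 for _ in range(5)] for c in range(3)]
weights = [
    [1.0, 0.0, 0.0],   # copies map 0
    [0.0, 1.0, 0.0],   # copies map 1
    [0.0, 0.0, 1.0],   # copies map 2
    [1.0, 1.0, 1.0],   # sums all three maps
]

out = pointwise_conv(maps, weights)
n_params = sum(len(kernel) for kernel in weights)
print(len(out), len(out[0]), len(out[0][0]))  # 4 5 5
print(out[3][0][0])                           # 6.0
print(n_params)                               # 12
```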

After Pointwise Convolution, 4 Feature maps are output, the same output dimension as conventional convolution.

Parameter comparison

To recap, the number of parameters for conventional convolution is: 
N_std = 4 × 3 × 3 × 3 = 108

The parameters of Separable Convolution are obtained by adding two parts: 
N_depthwise = 3 × 3 × 3 = 27 
N_pointwise = 1 × 1 × 3 × 4 = 12 
N_separable = N_depthwise + N_pointwise = 39
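The same counting generalizes to any kernel size and channel counts. The helpers below are a small sketch that ignores bias terms; the function names and the 64-to-128-channel layer are hypothetical and only show how the gap widens on a larger layer.

```python
def standard_params(k, c_in, c_out):
    """Weights in a conventional k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k step plus a pointwise 1 x 1 step (bias ignored)."""
    depthwise = k * k * c_in          # one k x k kernel per input channel
    pointwise = 1 * 1 * c_in * c_out  # one 1 x 1 x c_in kernel per output channel
    return depthwise + pointwise


print(standard_params(3, 3, 4))      # 108
print(separable_params(3, 3, 4))     # 39
# a hypothetical larger layer: 64 input channels, 128 output channels
print(standard_params(3, 64, 128))   # 73728
print(separable_params(3, 64, 128))  # 8768
```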

With the same input, 4 Feature maps are again obtained, and the number of parameters of the Separable Convolution is about 1/3 of that of the conventional convolution (39/108 ≈ 0.36). Therefore, for the same parameter budget, a neural network using Separable Convolution can have more layers.
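The ratio can also be written in general form. The symbols here are notation introduced for this note, not from the original article: K is the kernel size, C_in the number of input channels, and C_out the number of output channels, with biases ignored.

```latex
\frac{N_{\text{separable}}}{N_{\text{std}}}
  = \frac{K^2 C_{\text{in}} + C_{\text{in}} C_{\text{out}}}{K^2 C_{\text{in}} C_{\text{out}}}
  = \frac{1}{C_{\text{out}}} + \frac{1}{K^2},
\qquad
K = 3,\ C_{\text{out}} = 4:\quad
\frac{1}{4} + \frac{1}{9} = \frac{13}{36} = \frac{39}{108} \approx 0.36
```

As C_out grows, the ratio approaches 1/K², which is 1/9 for a 3×3 kernel, so the saving is larger on wide layers than in this small example.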

Origin blog.csdn.net/gbz3300255/article/details/108749709