Convolution (2)

Transpose Convolution

Background of transposed convolution

Normally, when convolution operations are applied to an image, the output becomes smaller and smaller after multiple convolutional layers; that is, the image is reduced. For certain tasks (such as image segmentation or GANs), we need to restore the image to its original size before further computation. This operation of restoring the image size, mapping it from a low resolution to a high resolution, is called upsampling, as shown in the figure below.

[Figure: upsampling maps a low-resolution image to a high-resolution one]

There are many upsampling methods; common ones include nearest-neighbor interpolation and bilinear interpolation. However, these methods are all hand-designed from prior experience and are not very effective in many scenes. We would therefore like the neural network to learn how to interpolate by itself, which leads to the transposed convolution introduced next.
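As a concrete illustration of the simplest such hand-designed method, here is a minimal sketch of nearest-neighbor upsampling (the function name and the 2× scale are illustrative choices, not from the original):

```python
import numpy as np

def nearest_upsample(x, scale):
    # Nearest-neighbor interpolation: every pixel is repeated `scale` times
    # along both axes, so an H x W image becomes (H*scale) x (W*scale).
    return np.repeat(np.repeat(x, scale, axis=0), scale, axis=1)

x = np.array([[1, 2],
              [3, 4]])
print(nearest_upsample(x, 2))
# Each input pixel becomes a 2x2 constant block in the 4x4 output.
```

Note that the interpolation rule is fixed in advance; there is nothing to learn, which is exactly the limitation transposed convolution addresses.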

Transposed convolution and its applications

Transposed convolution (Transpose Convolution), also called deconvolution (Deconvolution) in some literature, uses no preset interpolation method. It has learnable parameters, allowing the network to learn the optimal upsampling by itself. Transposed convolution is widely used in certain fields, for example:

  • In DCGAN, the generator converts random values into a full-size image; this upsampling is performed by transposed convolution.

  • In semantic segmentation, convolutional layers extract features in the encoder, and the decoder then restores the original size so that every pixel of the original image can be classified. This restoration also uses transposed convolution. Classic methods include FCN and U-Net.

  • Visualization of CNNs: feature maps obtained inside a CNN are mapped back to pixel space through transposed convolution, to observe which image patterns a particular feature map is sensitive to.

The difference between transposed convolution and standard convolution

A standard convolution multiplies the elements of the convolution kernel with the elements at the corresponding positions of the input matrix and sums the products, then slides the kernel over the input matrix with a given stride until every position of the input has been traversed.

Here is a simple example of the process. Assume the input is a 4×4 matrix, convolved with a 3×3 standard kernel, no padding, stride 1. The output is then a 2×2 matrix, as shown in the figure below.

[Figure: 4×4 input convolved with a 3×3 kernel, stride 1, no padding, giving a 2×2 output]

In the example above, the 3×3 block of values in the upper-right corner of the input matrix determines the value in the upper-right corner of the output matrix. This corresponds to the concept of receptive field in standard convolution: the 3×3 kernel establishes a correspondence between 9 values in the input matrix and 1 value in the output matrix.

To sum up, the standard convolution operation establishes a many-to-one relationship.
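The sliding-window computation described above can be sketched directly (a minimal NumPy implementation; the function name is an illustrative choice):

```python
import numpy as np

def conv2d(x, k, stride=1):
    # Standard convolution: slide the kernel over the input; each output value
    # is the element-wise product-and-sum of one k x k window (many-to-one).
    kh, kw = k.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i*stride:i*stride+kh, j*stride:j*stride+kw] * k)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)  # 4x4 input
k = np.ones((3, 3))                           # 3x3 kernel, stride 1, no padding
print(conv2d(x, k).shape)                     # (2, 2): 9 inputs -> 1 output
```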

For transposed convolution, we want the reverse operation: a one-to-many relationship. For the example above, what we want to establish is a relationship between 1 value of the small matrix and 9 values of the large matrix, as shown in the figure below.

[Figure: transposed convolution relates 1 input value to 9 output values]

Of course, from the perspective of information theory, the convolution operation is irreversible, so transposed convolution does not recover the original input matrix from the output matrix and the kernel; it computes a matrix that preserves the relative positional relationships.

Mathematical derivation of transposed convolution

Define a $4\times4$ input matrix $input$:

$$input=\begin{bmatrix} x_1 & x_2 & x_3 & x_4 \\ x_5 & x_6 & x_7 & x_8 \\ x_9 & x_{10} & x_{11} & x_{12} \\ x_{13} & x_{14} & x_{15} & x_{16} \end{bmatrix}$$

and a $3\times3$ convolution kernel:

$$kernel=\begin{bmatrix} w_{0,0} & w_{0,1} & w_{0,2} \\ w_{1,0} & w_{1,1} & w_{1,2} \\ w_{2,0} & w_{2,1} & w_{2,2} \end{bmatrix}$$

Let stride = 1 and padding = 0. By the output-size formula

$$o=\frac{i+2p-k}{s}+1$$

the output matrix $output$ has size $2\times2$:

$$output=\begin{bmatrix} y_0 & y_1 \\ y_2 & y_3 \end{bmatrix}$$

Now change the representation: expand the input matrix and the output matrix into a column vector $X$ of size $16\times1$ and a column vector $Y$ of size $4\times1$:

$$X=\begin{bmatrix} x_1 & x_2 & \cdots & x_{16} \end{bmatrix}^T,\qquad Y=\begin{bmatrix} y_0 & y_1 & y_2 & y_3 \end{bmatrix}^T$$

The standard convolution can then be written as a matrix multiplication $Y = CX$, where $C$ is a sparse matrix of size $4\times16$:

$$\scriptsize C=\begin{bmatrix}
w_{0,0} & w_{0,1} & w_{0,2} & 0 & w_{1,0} & w_{1,1} & w_{1,2} & 0 & w_{2,0} & w_{2,1} & w_{2,2} & 0 & 0 & 0 & 0 & 0 \\
0 & w_{0,0} & w_{0,1} & w_{0,2} & 0 & w_{1,0} & w_{1,1} & w_{1,2} & 0 & w_{2,0} & w_{2,1} & w_{2,2} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & w_{0,0} & w_{0,1} & w_{0,2} & 0 & w_{1,0} & w_{1,1} & w_{1,2} & 0 & w_{2,0} & w_{2,1} & w_{2,2} & 0 \\
0 & 0 & 0 & 0 & 0 & w_{0,0} & w_{0,1} & w_{0,2} & 0 & w_{1,0} & w_{1,1} & w_{1,2} & 0 & w_{2,0} & w_{2,1} & w_{2,2}
\end{bmatrix}$$

The figure below demonstrates this matrix operation intuitively.
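The construction of the sparse matrix $C$ can be verified numerically (a sketch; `conv_matrix` is an illustrative helper name, not from the original):

```python
import numpy as np

def conv_matrix(kernel, in_size=4):
    # Build the sparse matrix C so that Y = C @ X reproduces a stride-1,
    # no-padding convolution of an (in_size x in_size) input with `kernel`.
    k = kernel.shape[0]
    out_size = in_size - k + 1
    C = np.zeros((out_size * out_size, in_size * in_size))
    for oi in range(out_size):
        for oj in range(out_size):
            row = oi * out_size + oj
            for ki in range(k):
                for kj in range(k):
                    C[row, (oi + ki) * in_size + (oj + kj)] = kernel[ki, kj]
    return C

kernel = np.arange(1, 10, dtype=float).reshape(3, 3)
x = np.arange(16, dtype=float).reshape(4, 4)
C = conv_matrix(kernel)
Y = C @ x.flatten()        # the whole convolution as one matrix-vector product
print(C.shape)             # (4, 16)
```

Each entry of `Y` should match the corresponding sliding-window product-and-sum.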

[Figure: standard convolution written as the matrix product Y = CX]

Transposed convolution is the reverse of this process: $X$ is obtained from $C$ and $Y$ as $X = C^TY$. The sparse matrix $C^T$ now has size $16\times4$; the figure below shows an example of this matrix operation. Note that the weight matrix used in a transposed convolution does not have to come from the original convolution matrix; only its shape matches the transpose of the convolution matrix.

[Figure: transposed convolution written as the matrix product X = C^T Y]

Reordering the $16\times1$ result back into a matrix yields a $4\times4$ output from the $2\times2$ input.
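This one-to-many mapping can be checked with a short sketch: multiplying by $C^T$ turns a 2×2 input into a 4×4 output (the helper rebuilds the same sparse matrix $C$ as in the standard-convolution case; names are illustrative):

```python
import numpy as np

def conv_matrix(kernel, in_size=4):
    # Sparse matrix C of the stride-1, no-padding convolution described above.
    k = kernel.shape[0]
    out_size = in_size - k + 1
    C = np.zeros((out_size * out_size, in_size * in_size))
    for oi in range(out_size):
        for oj in range(out_size):
            for ki in range(k):
                for kj in range(k):
                    C[oi * out_size + oj, (oi + ki) * in_size + (oj + kj)] = kernel[ki, kj]
    return C

C = conv_matrix(np.ones((3, 3)))    # shape (4, 16)
y = np.array([1.0, 2.0, 3.0, 4.0])  # flattened 2x2 input of the transposed conv
x_up = (C.T @ y).reshape(4, 4)      # reorder the 16x1 result into a 4x4 matrix
print(x_up.shape)                   # (4, 4): each y value spreads over 9 outputs
```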

Transposed convolution output feature map size

Transposed convolution with stride=1

We again use the $4\times16$ convolution kernel matrix $C$ defined above. The corresponding output matrix $output$ is:

$$output=\begin{bmatrix} y_0 & y_1 \\ y_2 & y_3 \end{bmatrix}$$

Expanding the output matrix into the column vector $Y$:

$$Y=\begin{bmatrix} y_0 & y_1 & y_2 & y_3 \end{bmatrix}^T$$

Substituting into the transposed-convolution formula $X = C^TY$ and reshaping the $16\times1$ result into a $4\times4$ matrix gives:

$$\scriptsize C^TY=\begin{bmatrix}
w_{0,0}y_0 & w_{0,1}y_0+w_{0,0}y_1 & w_{0,2}y_0+w_{0,1}y_1 & w_{0,2}y_1 \\
w_{1,0}y_0+w_{0,0}y_2 & w_{1,1}y_0+w_{1,0}y_1+w_{0,1}y_2+w_{0,0}y_3 & w_{1,2}y_0+w_{1,1}y_1+w_{0,2}y_2+w_{0,1}y_3 & w_{1,2}y_1+w_{0,2}y_3 \\
w_{2,0}y_0+w_{1,0}y_2 & w_{2,1}y_0+w_{2,0}y_1+w_{1,1}y_2+w_{1,0}y_3 & w_{2,2}y_0+w_{2,1}y_1+w_{1,2}y_2+w_{1,1}y_3 & w_{2,2}y_1+w_{1,2}y_3 \\
w_{2,0}y_2 & w_{2,1}y_2+w_{2,0}y_3 & w_{2,2}y_2+w_{2,1}y_3 & w_{2,2}y_3
\end{bmatrix}$$

The same result can be obtained with a standard convolution. Pad the $2\times2$ input with padding = 2:

$$input'=\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & y_0 & y_1 & 0 & 0 \\
0 & 0 & y_2 & y_3 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}$$

and rotate the standard convolution kernel by 180°:

$$kernel'=\begin{bmatrix} w_{2,2} & w_{2,1} & w_{2,0} \\ w_{1,2} & w_{1,1} & w_{1,0} \\ w_{0,2} & w_{0,1} & w_{0,0} \end{bmatrix}$$

Performing a standard convolution of $kernel'$ over $input'$ reproduces the transposed-convolution result; the operation process is shown in the figure below.

[Figure: stride-1 transposed convolution as a padded standard convolution with the rotated kernel]

For a standard convolution with kernel size $k$, stride = 1, and padding = 0, the equivalent transposed convolution operates on an input matrix of size $i'$ and outputs a feature map of size

$$o' = i' + (k-1)$$

To do so, the input matrix of the transposed convolution is first padded with $padding' = k-1$.
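The equivalence can be sketched in a few lines: pad the input with $k-1$ zeros, rotate the kernel 180°, and run a standard convolution (a minimal sketch; the helper names are illustrative):

```python
import numpy as np

def conv2d(x, k):
    # Standard stride-1, no-padding convolution.
    n = k.shape[0]
    oh, ow = x.shape[0] - n + 1, x.shape[1] - n + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i+n, j:j+n] * k)
    return out

def transpose_conv2d(y, kernel):
    # Stride-1 transposed convolution: pad with padding' = k-1, then convolve
    # with the kernel rotated by 180 degrees. Output size: o' = i' + (k-1).
    k = kernel.shape[0]
    return conv2d(np.pad(y, k - 1), kernel[::-1, ::-1])

y = np.array([[1.0, 2.0], [3.0, 4.0]])   # i' = 2
out = transpose_conv2d(y, np.ones((3, 3)))
print(out.shape)                          # (4, 4): o' = 2 + (3 - 1)
```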

Transposed convolution with stride>1

In practice, we mostly use transposed convolution with stride > 1 to obtain a larger upsampling ratio. Here, set the input size to 5×5, keep the standard convolution kernel as above, with stride = 2 and padding = 0. After the standard convolution, the output has size 2×2:

$$Y=\begin{bmatrix} y_0 & y_1 & y_2 & y_3 \end{bmatrix}^T$$

With stride = 2, the convolution matrix $C$ has size $4\times25$ (so $C^T$ is $25\times4$); since the matrix is too large, it is not written out here. The result of the transposed convolution, reshaped to $5\times5$, is:

$$\scriptsize C^TY=\begin{bmatrix}
w_{0,0}y_0 & w_{0,1}y_0 & w_{0,2}y_0+w_{0,0}y_1 & w_{0,1}y_1 & w_{0,2}y_1 \\
w_{1,0}y_0 & w_{1,1}y_0 & w_{1,2}y_0+w_{1,0}y_1 & w_{1,1}y_1 & w_{1,2}y_1 \\
w_{2,0}y_0+w_{0,0}y_2 & w_{2,1}y_0+w_{0,1}y_2 & w_{2,2}y_0+w_{2,0}y_1+w_{0,2}y_2+w_{0,0}y_3 & w_{2,1}y_1+w_{0,1}y_3 & w_{2,2}y_1+w_{0,2}y_3 \\
w_{1,0}y_2 & w_{1,1}y_2 & w_{1,2}y_2+w_{1,0}y_3 & w_{1,1}y_3 & w_{1,2}y_3 \\
w_{2,0}y_2 & w_{2,1}y_2 & w_{2,2}y_2+w_{2,0}y_3 & w_{2,1}y_3 & w_{2,2}y_3
\end{bmatrix}$$

This is equivalent to inserting holes (zeros) between the elements of the input matrix, padding it, and then convolving with the 180°-rotated standard kernel. The operation process is shown in the figure below.

[Figure: stride-2 transposed convolution as a hole-inserted, padded standard convolution]

For a standard convolution with kernel size $k$, stride $s > 1$, and padding = 0, the equivalent transposed convolution operates on an input matrix of size $i'$ and outputs a feature map of size

$$o' = s(i'-1) + k$$

The input matrix of the transposed convolution is padded with $padding' = k-1$, and holes of size $s-1$ are inserted between adjacent elements. The upsampling ratio can therefore be controlled through the stride.
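The stride-$s$ case can be sketched the same way: insert $s-1$ zeros between the input elements, pad with $k-1$, and convolve with the rotated kernel (a minimal sketch; helper names are illustrative):

```python
import numpy as np

def conv2d(x, k):
    # Standard stride-1, no-padding convolution.
    n = k.shape[0]
    oh, ow = x.shape[0] - n + 1, x.shape[1] - n + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i+n, j:j+n] * k)
    return out

def transpose_conv2d(y, kernel, stride=2):
    # Insert stride-1 zeros ("holes") between elements, pad with k-1 zeros,
    # then convolve with the 180-degree-rotated kernel.
    # Output size: o' = s*(i' - 1) + k.
    k = kernel.shape[0]
    n = stride * (y.shape[0] - 1) + 1
    dilated = np.zeros((n, n))
    dilated[::stride, ::stride] = y
    return conv2d(np.pad(dilated, k - 1), kernel[::-1, ::-1])

y = np.array([[1.0, 2.0], [3.0, 4.0]])          # i' = 2
out = transpose_conv2d(y, np.ones((3, 3)), 2)
print(out.shape)                                 # (5, 5): o' = 2*(2-1) + 3
```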

Dilated Convolution

Background of atrous convolution

In pixel-level prediction problems (such as semantic segmentation; here we take FCN as an example), the image is fed into the network, and FCN, like a traditional CNN, first performs convolution and pooling, reducing the size of the feature map while increasing the receptive field. However, since image segmentation is a pixel-level prediction problem, transposed convolution is then used for upsampling so that the output image matches the size of the original input. In summary, this kind of pixel-level prediction involves two key steps: first, convolution or pooling reduces the image size and increases the receptive field; second, upsampling restores the image size. But downsampling by convolution or pooling causes a serious problem: image detail is lost, and small-object information cannot be reconstructed (with 4 pooling layers of stride 2, any object smaller than $2^4$ pixels is theoretically impossible to reconstruct).

Atrous convolution and its applications

Dilated convolution (Dilated Convolution), also known as atrous convolution in some literature, is a new convolution approach proposed to address the resolution reduction and information loss caused by downsampling in image semantic segmentation. By introducing a dilation rate parameter, dilated convolution lets a kernel of the same size obtain a larger receptive field. Conversely, for the same receptive field size, a dilated convolution can use fewer parameters than an ordinary convolution.

Atrous convolution has very wide applications in certain specific fields, such as:

  • Semantic segmentation: the DeepLab series and DUC. In DeepLab v3, the last few blocks of ResNet are replaced with atrous convolutions, keeping the output feature map much larger. Without increasing the amount of computation, the resolution is maintained and a denser feature response is obtained, giving better detail when restoring the original image size.

  • Object detection: RFBNet. RFBNet uses dilated convolution to simulate the effect of the eccentricity of pRFs in the human visual cortex, and designs the RFB module to strengthen lightweight CNN networks. A detector based on the RFB network is proposed: replacing the top convolutional layers of SSD with RFB brings significant performance gains while keeping the computational cost under control.

  • Speech synthesis field: WaveNet and other algorithms.

The difference between atrous convolution and standard convolution

For a standard 3×3 convolution, the kernel contains 9 parameters in total; during the computation, the kernel elements are multiplied element-wise with the input values at the corresponding positions and summed. Compared with standard convolution, dilated convolution has an additional parameter, the dilation rate, which controls the distance between adjacent kernel elements when placed on the input; changing the dilation rate changes the size of the kernel's receptive field. 3×3 dilated convolutions with dilation rates 1, 2, and 4 are shown in the figures below.

[Figure: 3×3 atrous convolution with dilation rate 1]

When the dilation rate is 1, atrous convolution is computed exactly like standard convolution.

[Figure: 3×3 atrous convolution with dilation rate 2]

[Figure: 3×3 atrous convolution with dilation rate 4]

When the dilation rate is greater than 1, holes are injected into the standard convolution pattern, and all values in the holes are filled with 0.
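A minimal sketch of the dilated-convolution computation (illustrative names; no padding and stride 1 assumed):

```python
import numpy as np

def dilated_conv2d(x, kernel, rate=1):
    # Dilated (atrous) convolution: adjacent kernel elements are spaced `rate`
    # apart on the input, so a k x k kernel covers a window of effective size
    # k + (k-1)*(rate-1) while still using only k*k weights.
    k = kernel.shape[0]
    eff = k + (k - 1) * (rate - 1)
    oh = x.shape[0] - eff + 1
    ow = x.shape[1] - eff + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[i:i+eff:rate, j:j+eff:rate]  # sample every rate-th pixel
            out[i, j] = np.sum(window * kernel)
    return out

x = np.arange(49, dtype=float).reshape(7, 7)
print(dilated_conv2d(x, np.ones((3, 3)), rate=2).shape)  # (3, 3): effective size 5
print(dilated_conv2d(x, np.ones((3, 3)), rate=1).shape)  # (5, 5): same as standard
```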

Receptive field of dilated convolution

For standard convolution with a 3×3 kernel, performing two successive standard convolutions on the input matrix yields two feature maps, and we can observe the receptive field sizes of the kernels at different layers, as shown in the figure below:

[Figure: receptive fields of stacked 3×3 standard convolutions]

A single $3\times3$ convolution has a receptive field of size $3\times3$; after two layers of $3\times3$ convolutions, the receptive field grows to $5\times5$.

The receptive field of a dilated convolution is computed similarly to that of a standard convolution. Since atrous convolution can be seen as a standard kernel with zeros filled in, we can treat it as an enlarged standard kernel and compute the receptive field accordingly. For a dilated convolution with kernel size $k$ and dilation rate $r$, the receptive field $F$ is

$$F = k + (k-1)(r-1)$$

For kernel size $k=3$ and dilation rate $r=2$, the calculation is illustrated in the figure below.

[Figure: receptive field of stacked 3×3 dilated convolutions with r = 2]

After one layer of this atrous convolution, the receptive field is 5×5; after two layers, it grows to 9×9.
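The receptive-field formula and the stacking behaviour above can be checked with a tiny sketch (function names are illustrative; stride 1 is assumed for every layer):

```python
def effective_size(k, r):
    # Effective kernel size of a dilated convolution: F = k + (k-1)*(r-1).
    return k + (k - 1) * (r - 1)

def stacked_receptive_field(layers):
    # layers: list of (kernel size, dilation rate) pairs, stride 1 throughout.
    # Each layer grows the receptive field by its effective size minus 1.
    rf = 1
    for k, r in layers:
        rf += effective_size(k, r) - 1
    return rf

print(stacked_receptive_field([(3, 1), (3, 1)]))  # 5: two standard 3x3 convs
print(stacked_receptive_field([(3, 2)]))          # 5: one dilated conv, r = 2
print(stacked_receptive_field([(3, 2), (3, 2)]))  # 9: two dilated convs, r = 2
```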

Group Convolution

Background of grouped convolution

Group convolution (Group Convolution) first appeared in AlexNet. Limited by the hardware resources of the time, the whole network could not be trained on a single GPU, so the authors split the convolution operations across multiple GPUs and then fused the GPUs' results. This is how the concept of grouped convolution came into being.

The difference between grouped convolution and standard convolution

For an input matrix of size $H_1\times W_1\times C_1$, with standard convolution kernels of size $h_1\times w_1\times C_1$ and $C_2$ kernels in total, standard convolution operates on the complete input data, and the final output matrix has size $H_2\times W_2\times C_2$. Assuming here that the spatial size of the feature map is unchanged by the convolution, the process is shown in the figure below.

[Figure: standard convolution operating on all input channels at once]

Considering that the above process is completely run on the same device, this also places higher requirements on the performance of the device.

Grouped convolution improves this process. A group number $g$ is specified, and the input data is divided into $g$ groups. Note that the grouping is along the depth (channel) dimension; the input width and height are untouched, i.e., every $\frac{C_1}{g}$ channels form one group. Because the input changes, the kernels change accordingly: the number of input channels of each kernel becomes $\frac{C_1}{g}$, while the spatial size of the kernel is unchanged; and the number of kernels in each group drops from the original $C_2$ to $\frac{C_2}{g}$. Within each group, the standard convolution computation is applied, producing $g$ output matrices of size $H_2\times W_2\times\frac{C_2}{g}$. Concatenating these $g$ outputs along the channel dimension gives the final result, whose size is unchanged: still $H_2\times W_2\times C_2$. The operation process of grouped convolution is shown in the figure below.

[Figure: grouped convolution with the channels split into g groups]

Since the standard convolution is split into $g$ smaller sub-operations that can run in parallel, the demands on the device are reduced. Group convolution also reduces the number of parameters. The standard convolution above has

$$h_1 \times w_1 \times C_1 \times C_2$$

parameters; with grouped convolution this becomes

$$h_1 \times w_1 \times \frac{C_1}{g} \times \frac{C_2}{g} \times g = h_1 \times w_1 \times C_1 \times C_2 \times \frac{1}{g}$$

Application examples

For example, for an input matrix of size $H\times W\times 64$, with standard convolution kernels of size $3\times3\times64$ and 64 kernels in total, the figure below shows the grouped convolution computation with $g=2$ groups.

[Figure: grouped convolution example with g = 2]

Each group now has 32 input channels, and each kernel also has 32 channels. The standard convolution therefore has 3×3×64×64 = 36864 parameters, while the grouped convolution has 3×3×32×32×2 = 18432, half as many.
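The parameter comparison above can be reproduced with a short sketch (bias terms ignored; the function name is an illustrative choice):

```python
def conv_params(kh, kw, c_in, c_out, groups=1):
    # Parameter count of a (possibly grouped) convolution layer:
    # each group convolves c_in/groups channels with c_out/groups kernels.
    assert c_in % groups == 0 and c_out % groups == 0
    return kh * kw * (c_in // groups) * (c_out // groups) * groups

print(conv_params(3, 3, 64, 64))            # 36864: standard convolution
print(conv_params(3, 3, 64, 64, groups=2))  # 18432: grouped, half the parameters
```

Setting `groups` equal to the channel count gives the depthwise-convolution extreme, where each channel is convolved independently.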


Origin blog.csdn.net/weixin_49346755/article/details/127484811