Several scenarios and methods for converting convolution into matrix multiplication

The default convolution input and output data format in this article is NHWC.

Why convert convolution to matrix multiplication

There are several reasons:

1. Matrix multiplication optimization has been studied for decades, so there is a rich body of research results, and well-tuned BLAS acceleration libraries are readily available.

2. Matrix multiplication is simpler to optimize than convolution, mainly because it has fewer parameters: M, N, and K, plus a batch dimension for batched matmul, four in total. Convolution has an input of [N, Hi, Wi, Ci], a filter of [Hf, Wf, Ci, Co], and further parameters such as stride, so the variety of convolution configurations far exceeds that of matrix multiplication, and optimization is correspondingly harder.

Of course, it is not mandatory to convert convolution into matrix multiplication; the conversion is only one of several convolution optimization techniques, and some cases, such as depthwise convolution, do not need it at all.

1x1 convolution

The input shape is [N, H, W, Ci], and the filter shape is [Hf, Wf, Ci, Co].

Since Hf and Wf are both 1, simply reshape the input to [N, H*W, Ci] and the filter to [Hf*Wf*Ci, Co] = [Ci, Co], perform the matrix multiplication to get [N, H*W, Co], and then reshape the result to the convolution output shape.
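As a minimal NumPy sketch of this (all shape values below are arbitrary examples), the whole conversion is two reshapes and one batched matrix multiplication:

```python
import numpy as np

# Minimal sketch: 1x1 convolution as matrix multiplication (NHWC).
N, H, W, Ci, Co = 2, 8, 8, 16, 32
x = np.random.randn(N, H, W, Ci).astype(np.float32)
w = np.random.randn(1, 1, Ci, Co).astype(np.float32)  # [Hf, Wf, Ci, Co] with Hf = Wf = 1

# Input [N, H, W, Ci] -> [N, H*W, Ci]; filter [1, 1, Ci, Co] -> [Ci, Co].
y = x.reshape(N, H * W, Ci) @ w.reshape(Ci, Co)  # [N, H*W, Co]
y = y.reshape(N, H, W, Co)                       # back to the convolution output shape
```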

Convolution with kernel_shape equal to strides

Similar to 1x1 convolution, this kind of convolution has the property that the input data blocks read by different output positions do not overlap, so it can be handled as a matrix multiplication with a simple reshape and transpose:

Assume the input format is [N, H, W, C]. It can be reinterpreted as [N, H1*H0, W1*W0, C], where H0 and W0 are the kernel_shape sizes and H1 and W1 are the output height and width of the convolution. The filter format is [Hf, Wf, Ci, Co].

Reshape and transpose the convolution input from [N, H1*H0, W1*W0, C] to [N, H1*W1, H0*W0*C], then multiply it with the filter reshaped to [Hf*Wf*Ci, Co]. The output is [N, Ho*Wo, Co], which a final reshape turns into the convolution output shape.
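A hedged NumPy sketch of this reshape-and-transpose trick (no padding, shapes illustrative):

```python
import numpy as np

# Sketch: convolution with kernel_shape == strides as pure reshape/transpose + matmul.
# H0, W0 are the kernel sizes; H1, W1 the output sizes. Values are illustrative.
N, H1, W1, H0, W0, Ci, Co = 2, 4, 4, 2, 2, 8, 16
x = np.random.randn(N, H1 * H0, W1 * W0, Ci).astype(np.float32)
w = np.random.randn(H0, W0, Ci, Co).astype(np.float32)  # [Hf, Wf, Ci, Co], Hf=H0, Wf=W0

# [N, H1*H0, W1*W0, C] -> [N, H1, H0, W1, W0, C] -> [N, H1, W1, H0, W0, C]
# -> [N, H1*W1, H0*W0*C]; each row is exactly one non-overlapping input block.
patches = (x.reshape(N, H1, H0, W1, W0, Ci)
            .transpose(0, 1, 3, 2, 4, 5)
            .reshape(N, H1 * W1, H0 * W0 * Ci))

y = patches @ w.reshape(H0 * W0 * Ci, Co)  # [N, Ho*Wo, Co]
y = y.reshape(N, H1, W1, Co)               # convolution output shape
```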

Explicit matrix multiplication convolution (explicit GEMM convolution)

Also called Im2Col (or im2row, depending on whether patches are laid out as columns or rows). This approach splits the convolution into two operators: Im2Col and matrix multiplication.

The idea of Im2Col is very simple: the FH*FW*Ci block of data covered by the filter at each output position is expanded into one row, which forms the K dimension of the matrix multiplication, while the filter's Co forms the N dimension.

Each of the Ho*Wo output positions of the convolution needs one such row, so Ho*Wo forms the M dimension. After Im2Col the input data therefore becomes an [N, Ho*Wo, FH*FW*Ci] tensor, and the convolution filter is reshaped into an [FH*FW*Ci, Co] tensor. Multiplying the two gives [N, Ho*Wo, Co], and a final reshape produces the convolution output.
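A compact NumPy sketch of explicit GEMM convolution (stride 1, no padding; names and shapes are illustrative):

```python
import numpy as np

def im2col(x, FH, FW):
    """Expand NHWC input into [N, Ho*Wo, FH*FW*Ci] patch rows (stride 1, no padding)."""
    N, Hi, Wi, Ci = x.shape
    Ho, Wo = Hi - FH + 1, Wi - FW + 1
    cols = np.empty((N, Ho, Wo, FH, FW, Ci), dtype=x.dtype)
    for i in range(FH):        # one strided slice copy per kernel offset
        for j in range(FW):
            cols[:, :, :, i, j, :] = x[:, i:i + Ho, j:j + Wo, :]
    return cols.reshape(N, Ho * Wo, FH * FW * Ci)

N, Hi, Wi, Ci, Co, FH, FW = 1, 6, 6, 4, 8, 3, 3
x = np.random.randn(N, Hi, Wi, Ci).astype(np.float32)
w = np.random.randn(FH, FW, Ci, Co).astype(np.float32)

y = im2col(x, FH, FW) @ w.reshape(FH * FW * Ci, Co)  # [N, Ho*Wo, Co]
y = y.reshape(N, Hi - FH + 1, Wi - FW + 1, Co)       # convolution output
```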

A major disadvantage of this method is that the temporary tensor produced by Im2Col is much larger than the convolution input, which costs a lot of memory, especially when stride=1. For example, with a 3x3 kernel_shape and stride=1, the tensor after Im2Col is roughly 9 times the size of the convolution input, since each input element is duplicated into up to 9 patches.

Implicit GEMM convolution

The idea is the same as Im2Col, but the convolution is not split into two independent Im2Col and matrix multiplication operators; instead, im2col is realized implicitly through the specific data reading pattern used while loading inputs for the matrix multiplication.

The implicit GEMM method prefers the NC1HWC0 input data format, where the input channel count C = C1*C0 and C0 is usually 4, 8, 16, etc.
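As an illustration only (not any particular framework's layout routine), packing NHWC into NC1HWC0 is itself just a reshape plus transpose, assuming C is divisible by C0:

```python
import numpy as np

# Sketch: NHWC -> NC1HWC0 with C = C1 * C0. C0 = 8 is an arbitrary example;
# real implementations would pad C when it is not a multiple of C0.
N, H, W, C, C0 = 1, 4, 4, 32, 8
C1 = C // C0

x = np.random.randn(N, H, W, C).astype(np.float32)
x_nc1hwc0 = x.reshape(N, H, W, C1, C0).transpose(0, 3, 1, 2, 4)  # [N, C1, H, W, C0]
```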

For the specific calculation method, refer to:

Convolution: Understanding from the perspective of inference engine optimization and hardware optimization - Programmer Sought

The Indirect Convolution Algorithm

In the figures of the references above, the upper part shows the principle of im2col, and the lower part the actual matrix multiplication: the A matrix reads tiles covering several rows of the same columns at a time, the kernel matrix reads tiles covering several columns of the same rows, and both loop along the K direction.
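The tiling loop described above can be sketched in NumPy as follows (tile sizes and shapes are arbitrary; a real kernel would keep the tiles in registers or shared memory):

```python
import numpy as np

# Toy tiled GEMM mirroring the description: each output tile accumulates
# products of an A tile (several rows, same K columns) and a B/kernel tile
# (several columns, same K rows), looping along the K direction.
M, N, K, TM, TN, TK = 8, 8, 16, 4, 4, 4
A = np.random.randn(M, K).astype(np.float32)
B = np.random.randn(K, N).astype(np.float32)
C = np.zeros((M, N), dtype=np.float32)

for m in range(0, M, TM):
    for n in range(0, N, TN):
        acc = np.zeros((TM, TN), dtype=np.float32)
        for k in range(0, K, TK):  # loop along the K direction
            acc += A[m:m + TM, k:k + TK] @ B[k:k + TK, n:n + TN]
        C[m:m + TM, n:n + TN] = acc

assert np.allclose(C, A @ B, atol=1e-4)
```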

Winograd convolution

Although this method only handles a few specific kernel shapes and strides, its performance is usually better than the Im2Col method. Winograd converts the convolution into matrix multiplication through input and weight transforms, and finally obtains the convolution result through an output transform. For the detailed principle, refer to:

Winograd algorithm convolution principle - Luchang-Li's Blog - CSDN Blog

Winograd convolution practice - Luchang-Li's Blog - CSDN Blog
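To give a concrete flavor, below is a minimal single-tile, single-channel F(2x2, 3x3) Winograd sketch using the standard transform matrices; in a real multi-channel implementation, the elementwise products across channels are what become batched matrix multiplications.

```python
import numpy as np

# Minimal Winograd F(2x2, 3x3): one 4x4 input tile d, one 3x3 kernel g, 2x2 output.
# Bt, G, At are the standard F(2x2, 3x3) transform matrices.
Bt = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]], dtype=np.float32)
G = np.array([[1, 0, 0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0, 0, 1]], dtype=np.float32)
At = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=np.float32)

d = np.random.randn(4, 4).astype(np.float32)  # input tile
g = np.random.randn(3, 3).astype(np.float32)  # kernel

U = G @ g @ G.T           # weight transform
V = Bt @ d @ Bt.T         # input transform
Y = At @ (U * V) @ At.T   # elementwise product, then output transform -> 2x2

# Check against direct 2x2 valid cross-correlation with the same kernel.
ref = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                for i in range(2)])
assert np.allclose(Y, ref, atol=1e-4)
```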

References

Convolution: Understanding from the perspective of inference engine optimization and hardware optimization - Programmer Sought

Convolution optimization techniques in OpenPPL - Programmer Sought

The Indirect Convolution Algorithm

Original post: blog.csdn.net/u013701860/article/details/130192231