Winograd convolution practice

Reference for the basic principles of Winograd convolution:

How the Winograd algorithm implements convolution

Winograd convolution process diagram:

Note that the input and output channels are hidden in this diagram. In fact, each spatial element also carries batch and input/output channel dimensions.

The input data format is [n, h, w, c]. After the input transform, the original paper [Fast Algorithms for Convolutional Neural Networks] uses the format [ho/2*wo/2, 16, n, ci], while in practice the commonly used format is [n, ho/2*wo/2, 16, 1, ci]. ho and wo are the output height and width; they are divided by 2 because each 2x2 block of adjacent output elements shares one 4x4 input tile.

The input transform multiplies the spatial dimensions by the transformation matrix. Since each spatial element actually corresponds to an [n, ci] slice, the transform amounts to various multiply-add operations between the slices corresponding to the spatial elements, as shown below.
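As a concrete illustration, below is a minimal NumPy sketch of this step. The function name input_transform and the explicit tiling loop are illustrative only; it assumes an already-padded NHWC input whose output height and width are even, and B^T is the standard F(2x2, 3x3) data-transform matrix.

import numpy as np

BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)

def input_transform(x):
    # x: [n, h, w, ci], already padded so ho = h - 2 and wo = w - 2 are even
    n, h, w, ci = x.shape
    th, tw = (h - 2) // 2, (w - 2) // 2        # number of 4x4 tiles, stride 2
    tiles = np.empty((n, th * tw, 4, 4, ci), dtype=x.dtype)
    for i in range(th):
        for j in range(tw):
            tiles[:, i * tw + j] = x[:, 2*i:2*i+4, 2*j:2*j+4, :]
    # B^T * d * B applied per [n, ci] slice: pure multiply-adds between
    # the 16 spatial elements of each tile
    v = np.einsum('ab,ntbcq,cd->ntadq', BT, tiles, BT.T)
    return v.reshape(n, th * tw, 16, 1, ci)    # [n, ho/2*wo/2, 16, 1, ci]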

The original weight format is [co, ci, h, w], where h = w = 3. It is first rearranged to [h, w, ci, co], and after the weight transform it becomes [4*4, ci, co].
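A matching sketch of the weight transform under the same assumptions; G is the standard F(2x2, 3x3) filter-transform matrix from the Lavin & Gray paper, and weight_transform is an illustrative name:

import numpy as np

G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)

def weight_transform(w):
    # w: [co, ci, 3, 3] -> G * g * G^T per (co, ci) pair, giving [co, ci, 4, 4]
    u = np.einsum('ab,oibc,cd->oiad', G, w, G.T)
    co, ci = u.shape[:2]
    return u.reshape(co, ci, 16).transpose(2, 1, 0)   # [16, ci, co]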

The transformed input and weight then go through a batched matrix multiplication, yielding data of format [n, ho/2*wo/2, 4*4, 1, co].

Finally, the output transform produces the format [n, ho/2*wo/2, 2*2, co], that is, [n, ho, wo, co].
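And a sketch of the last two steps, the batched matmul and the output transform. The layouts follow the shapes above rather than any particular engine's API; A^T is the standard F(2x2, 3x3) output-transform matrix:

import numpy as np

AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

def matmul_and_output_transform(v, u, ho, wo):
    # v: [n, T, 16, 1, ci] from the input transform, u: [16, ci, co]
    m = np.einsum('ntkxi,kio->ntkxo', v, u)       # [n, T, 16, 1, co]
    n, T = m.shape[:2]
    co = m.shape[-1]
    m = m.reshape(n, T, 4, 4, co)                 # back to 4x4 tiles
    y = np.einsum('ab,ntbco,cd->ntado', AT, m, AT.T)   # A^T * m * A -> 2x2 tiles
    # scatter the 2x2 output tiles back to [n, ho, wo, co]
    out = np.empty((n, ho, wo, co), dtype=y.dtype)
    th, tw = ho // 2, wo // 2
    for i in range(th):
        for j in range(tw):
            out[:, 2*i:2*i+2, 2*j:2*j+2, :] = y[:, i * tw + j]
    return out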

Input transform calculation process, assuming the 3x3 weight entries and the 4x4 spatial tile of input x are each labeled alphabetically starting from a (a..i and a..p respectively; you can use MATLAB symbolic computation to obtain the relationship between the transform result matrices and x):

Input-to-output transform formulas:

filter_trans =
[               a,                                     a/2 + b/2 + c/2,                                     a/2 - b/2 + c/2,               c]
[ a/2 + d/2 + g/2, a/4 + b/4 + c/4 + d/4 + e/4 + f/4 + g/4 + h/4 + i/4, a/4 - b/4 + c/4 + d/4 - e/4 + f/4 + g/4 - h/4 + i/4, c/2 + f/2 + i/2]
[ a/2 - d/2 + g/2, a/4 + b/4 + c/4 - d/4 - e/4 - f/4 + g/4 + h/4 + i/4, a/4 - b/4 + c/4 - d/4 + e/4 - f/4 + g/4 - h/4 + i/4, c/2 - f/2 + i/2]
[               g,                                     g/2 + h/2 + i/2,                                     g/2 - h/2 + i/2,               i]
=
[             a,                                (a + b + c)/2,                              (a - b + c)/2,              c]
[ (a + d + g)/2, ((a + b + c) + (d + e + f) + (g + h + i))/4, ((a + d + g) + (c + f + i) - (b + e + h))/4, (c + f + i)/2]
[ (a - d + g)/2, ((a + b + c) - (d + e + f) + (g + h + i))/4, ((a - d + g) + (c - f + i) - (b - e + h))/4, (c - f + i)/2]
[             g,                                (g + h + i)/2,                              (g - h + i)/2,              i]


data_trans =
[ a - c - i + k, b + c - j - k, c - b + j - k, b - d - j + l]
[ e - g + i - k, f + g + j + k, g - f - j + k, f - h + j - l]
[ g - e + i - k, j - g - f + k, f - g - j + k, h - f + j - l]
[ e - g - m + o, f + g - n - o, g - f + n - o, f - h - n + p]

out_trans =
[ a + b + c + e + f + g + i + j + k, b - c - d + f - g - h + j - k - l]
[ e + f + g - i - j - k - m - n - o, f - g - h - j + k + l - n + o + p]
=
[ a + b + c + (e + f + g) + (i + j + k), b - c - d + (f - g - h) + (j - k - l)]
[ (e + f + g) - (i + j + k) - m - n - o, (f - g - h) - (j - k - l) - n + o + p]
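The symbolic results above can also be reproduced without MATLAB; here is a sympy equivalent sketch, using the standard F(2x2, 3x3) transform matrices B^T, G, A^T:

import sympy as sp

g = sp.Matrix(3, 3, sp.symbols('a:j'))   # 3x3 filter labeled a..i
d = sp.Matrix(4, 4, sp.symbols('a:q'))   # 4x4 tile labeled a..p

G  = sp.Matrix([[1, 0, 0],
                [sp.Rational(1, 2),  sp.Rational(1, 2), sp.Rational(1, 2)],
                [sp.Rational(1, 2), -sp.Rational(1, 2), sp.Rational(1, 2)],
                [0, 0, 1]])
BT = sp.Matrix([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]])
AT = sp.Matrix([[1, 1,  1,  0],
                [0, 1, -1, -1]])

print((G * g * G.T).expand())     # filter_trans above
print((BT * d * BT.T).expand())   # data_trans above
print((AT * d * AT.T).expand())   # out_trans above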

On-device inference engines such as MNN usually run on hardware with no or very little shared memory, and the data exchange capability between threads is relatively weak. So the whole computation is usually split into three steps following this process instead of being written as one complete kernel: input transform, matmul, and output transform. The weights are constant during inference, so the weight transform can be done in advance through constant folding.

For NHWC data, each thread usually reads a 4x4 spatial tile with a channel depth of 4 and performs the input transform on it.

The shape of the matrix multiplication part is [n, ho/2*wo/2, 16, 1, ci] * [16, ci, co]. The innermost [1, ci] * [ci, co] can be handled by having each thread compute a 1x1*1x4 tile of the matrix multiplication (the principle is the same as the commonly used 4x1*1x4 or 8x1*1x8 tiles; refer to [under construction] CUDA GEMM theoretical performance analysis and kernel optimization - Zhihu). But this matmul shape is too small: each thread performs very little computation, and far too many threads have to be created.

Another method is for the input transform to output the shape [n, 16, ho/2*wo/2, ci], which is then multiplied with the [16, ci, co] produced by the weight transform. The innermost matrix multiplication grows significantly, to [ho/2*wo/2, ci] * [ci, co], which is much more favorable for performance optimization. The matmul output format is [n, 16, ho/2*wo/2, co], and the output transform turns it into [n, ho/2*wo/2, 2*2, co]. This is equivalent to folding a transpose into the input and output transforms.
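A sketch of this layout in NumPy (batched_gemm is a hypothetical name); each of the 16 transformed positions now contributes one reasonably sized GEMM:

import numpy as np

def batched_gemm(v, u):
    # v: [n, 16, T, ci] written by the input transform, T = ho/2 * wo/2
    # u: [16, ci, co] from the weight transform
    # one [T, ci] x [ci, co] GEMM per (batch, Winograd position) pair;
    # np.matmul(v, u) is equivalent, broadcasting (n, 16) against (16,)
    return np.einsum('nkti,kio->nkto', v, u)   # [n, 16, T, co]

In a real kernel the transpose does not appear as a separate pass; it is absorbed into the write-back addressing of the input transform and the read addressing of the output transform.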

Compared with implementing convolution via im2row+matmul: since adjacent convolution windows contain overlapping data, for a 3x3 stride=1 convolution im2row reads each input element once and writes it 9 times, so the data written back by im2row and read by the matmul is 9 times the input.

The Winograd input transform is equivalent to a 4x4 kernel with stride = 2, so the data written back by the input transform and read by the matmul is 4 times the input.
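A back-of-the-envelope count of the two ratios (ignoring padding and boundary effects, so ho ≈ h and wo ≈ w):

im2row write-back:   ho * wo * (3*3) * ci ≈ 9 * (h * w * ci)                  -> 9x the input
Winograd write-back: (ho/2) * (wo/2) * (4*4) * ci = 4 * ho * wo * ci ≈ 4 * (h * w * ci)  -> 4x the input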

Thinking: compared with the NHWC or NCHW input formats, would an NCHW4 input format help Winograd convolution?

The im2row in this article is named relative to im2col. im2col expands the input data covered by each convolution window into one column of a matrix, which serves as input 1 of the matrix multiplication, with the expanded weight as input 0. im2row instead expands the input data covered by each window into one row of a matrix, which serves as input 0, with the expanded weight as input 1. The im2row method writes back to contiguous addresses, which performs better, and inference engines usually treat input 1 of the matrix multiplication, not input 0, as the constant. Of course, the matrix produced by im2col does not have to serve as input 1 after expansion: it can also be used directly as input 0 with trans_a=1, because matrix multiplication optimization usually transposes input 0 anyway, which is more conducive to performance (contrary to the good old "common sense" that transposing B is better).
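A toy NumPy version of the expansion may make the input 0 / input 1 roles concrete (im2row and conv_im2row are hypothetical names; single image, valid padding, 3x3 stride-1 window):

import numpy as np

def im2row(x):                        # x: [h, w, ci]
    h, w, ci = x.shape
    rows = [x[i:i+3, j:j+3, :].reshape(-1)        # one window -> one matrix row
            for i in range(h - 2) for j in range(w - 2)]
    return np.stack(rows)             # [(ho*wo), 3*3*ci]: matmul input 0

def conv_im2row(x, wgt):              # wgt: [3, 3, ci, co], same flattening order
    return im2row(x) @ wgt.reshape(-1, wgt.shape[-1])   # [(ho*wo), co]

# im2col is the transposed view: each window becomes a *column*, that matrix is
# matmul input 1, and the flattened weight [co, 3*3*ci] is input 0. The same
# buffer can instead be passed as input 0 with trans_a=1.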

Other article references:

Im2col and winograd optimization of MegEngine Inference convolution optimization - Programmer Sought

Detailed explanation of NCNN winograd (1) bzdww
