Winograd and GEMM: An Overview of Efficient Convolution Implementations in CNNs (Part 2)

 

 

CNN Efficient Convolution Implementation Algorithms and Applications: An Overview (Part 2)

Paper analysis

1. Fast Algorithms for Convolutional Neural Networks

The first paper analyzed is from CVPR 2016. As the work that first introduced the Winograd algorithm at CVPR, it presents the principle of the algorithm and its matrix operations and transformations without further optimization, computing directly from the matrix formulas, and compares it against FFT-based convolution. Viewed from today's pursuit of speed and high FLOPS, the fewer MACC operations needed to complete the same task, the more efficient the implementation; this is an important metric in CNN network pruning, alongside FLOPS and parameter size, and on it FFT holds no advantage. The author compares the TFLOPS of cuDNN and of the Winograd algorithm. The theory behind Winograd F(2x2, 3x3) was analyzed in detail in the previous review; if anything is unclear, please see the previous article (Overview of efficient convolution implementation algorithms and applications (Part 1)): https://blog.csdn.net/qq_32998593/article/details/86177151
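As a refresher (a minimal NumPy sketch, not code from the paper), a single F(2x2, 3x3) tile can be computed with the standard transform matrices from Lavin & Gray; the 16 multiplications in the element-wise product replace the 36 multiplications a direct 2x2-output computation would need, a 2.25x reduction per tile.

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) transform matrices (Lavin & Gray, CVPR 2016).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float32)
G   = np.array([[1.0,  0.0, 0.0],
                [0.5,  0.5, 0.5],
                [0.5, -0.5, 0.5],
                [0.0,  0.0, 1.0]], dtype=np.float32)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float32)

def winograd_f2x2_3x3(d, g):
    """One 2x2 output tile from a 4x4 input tile d and a 3x3 filter g."""
    U = G @ g @ G.T          # filter transform (can be precomputed offline)
    V = B_T @ d @ B_T.T      # input transform
    M = U * V                # element-wise matrix multiplication: 16 multiplies
    return A_T @ M @ A_T.T   # output (inverse) transform

# Sanity check against direct 2D cross-correlation (the convolution used in CNN layers).
d = np.random.rand(4, 4).astype(np.float32)
g = np.random.rand(3, 3).astype(np.float32)
direct = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)] for i in range(2)])
print(np.allclose(winograd_f2x2_3x3(d, g), direct, atol=1e-5))  # True
```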

(Table/figure: comparison of processing speed, processing time, and TFLOPS for cuDNN vs. Winograd)

As the table shows, the Winograd algorithm achieves higher TFLOPS than cuDNN, with lower processing time on the same network.

As the figure shows, the Winograd algorithm delivers higher computational throughput with fewer operations than cuDNN's GEMM-based convolution.

 

2. Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs

The next paper, published at FCCM 2017 by SenseTime, implements the Winograd algorithm on an FPGA.

The key idea is to trade multiplications for additions. A conventional convolution implementation requires six nested loops, whereas the Winograd algorithm needs only four.

With the original convolution, six nested for-loops are required. With the Winograd algorithm, the two innermost loops are replaced by a single call to the computation module Winograd(X, F, Y), which produces the convolution output directly and cuts the number of multiplications by more than 60%. In FPGA design the rule of thumb is to avoid direct multiplication wherever possible and to use shifts instead; on devices where multiplier resources are this expensive, an algorithm that reduces multiplications is naturally welcome in industry, and it has drawn many researchers to efficient FPGA implementations of the Winograd algorithm.
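To make the loop-count comparison concrete, here is a hedged sketch in NumPy (not the paper's HLS code) contrasting the six-loop direct convolution with a four-loop Winograd tiling. It reuses winograd_f2x2_3x3 from the sketch above and assumes, for simplicity, that the input height and width are even so the output splits exactly into 2x2 tiles.

```python
def conv_direct(x, w):
    """Direct convolution: six nested loops.
       x: input (C, H, W); w: filters (K, C, 3, 3); returns (K, H-2, W-2)."""
    C, H, W = x.shape
    K = w.shape[0]
    y = np.zeros((K, H - 2, W - 2), dtype=x.dtype)
    for k in range(K):                          # 1: output channels
        for c in range(C):                      # 2: input channels
            for i in range(H - 2):              # 3: output rows
                for j in range(W - 2):          # 4: output cols
                    for r in range(3):          # 5: filter rows
                        for s in range(3):      # 6: filter cols
                            y[k, i, j] += x[c, i + r, j + s] * w[k, c, r, s]
    return y

def conv_winograd(x, w):
    """Tiled Winograd: four loops; the innermost work is one call to the
       F(2x2, 3x3) tile kernel instead of the two 3x3 multiply-accumulate loops."""
    C, H, W = x.shape
    K = w.shape[0]
    y = np.zeros((K, H - 2, W - 2), dtype=x.dtype)
    for k in range(K):                          # 1: output channels
        for c in range(C):                      # 2: input channels
            for i in range(0, H - 3, 2):        # 3: tile rows (output stride 2)
                for j in range(0, W - 3, 2):    # 4: tile cols (output stride 2)
                    y[k, i:i + 2, j:j + 2] += winograd_f2x2_3x3(x[c, i:i + 4, j:j + 4], w[k, c])
    return y
```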

The authors use a line-buffer architecture to cache the input data, while each group of kernels is fetched from on-chip memory and fed into the custom PE module for the Winograd matrix multiplication.

Concretely, the design is partitioned by function into modules: an on-chip line buffer built from BlockRAM caches the input, a Winograd convolution is taken every six input lines with a stride of four, and the PE is a dedicated Winograd convolution module. The following figure shows the operation in detail:

The transform matrices (A, B, G) are fixed and can be generated offline. The computation is split into four steps to save on-chip BRAM: first, transform the input and the filter, which is done with LUTs because the structure of B and G means these transforms need only shift and add (bit) operations; second, perform the element-wise matrix multiplication (EWMM); third, apply the output transform to obtain Y; and finally, accumulate across input channels to produce the corresponding output.
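Expressed as code, the four steps for one output tile might look like the sketch below (a Python stand-in for the hardware pipeline, reusing the F(2x2, 3x3) matrices from the first example; on the FPGA the filter transform is done offline and the B/G transforms reduce to shifts and additions):

```python
def winograd_tile_four_steps(d_tiles, g_filters):
    """d_tiles:   per-input-channel 4x4 input tiles, shape (C, 4, 4)
       g_filters: per-input-channel 3x3 filters,     shape (C, 3, 3)
       Returns one 2x2 output tile accumulated over the C input channels."""
    C = d_tiles.shape[0]
    U = np.stack([G @ g_filters[c] @ G.T for c in range(C)])  # step 1a: filter transform (offline)
    y = np.zeros((2, 2), dtype=np.float32)
    for c in range(C):
        V = B_T @ d_tiles[c] @ B_T.T    # step 1b: input transform
        M = U[c] * V                    # step 2:  EWMM
        y += A_T @ M @ A_T.T            # step 3:  output transform; step 4: accumulate over channels
    return y
```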

The comparison metrics are throughput (GOP/s) and energy efficiency (GOP/s/W), both of which can be measured on the FPGA development platform.

On AlexNet, the throughput on the ZYNQ platform improves by more than 10x compared with other platforms.

For VGG, the Winograd-based implementation improves on the other baseline implementations by nearly 8.7x.

Compared with GPUs, there is also some gain in efficiency. (Note: in peak compute and throughput (TOPS) the FPGA does not beat the GPU, but once power consumption is considered, the GPU's wattage is far higher, so the two are not in the same order of magnitude in power efficiency.)

3. Efficient Sparse-Winograd Convolutional Neural Networks

This paper, from Song Han's group at MIT, was published at ICLR 2018. It continues his usual approach: pruning and sparsifying the network to reduce the number of multiplications in the computation, here by introducing sparsity into the Winograd transform domain.

The sparsity obtained from pruning is exploited inside the Winograd convolution, cutting the number of multiplications by roughly 10.4x, 6.8x, and 10.8x on the reported networks. If ReLU and pruning are applied in the conventional way, the activations and kernels are sparse before the B^T(input)B and G(filter)G^T transforms; after the transforms, the element-wise matrix multiplication (EWMM) is performed on dense matrices, so no multiplications are saved. The authors instead move ReLU and pruning after the transforms, so that the sparsified matrices are exactly the ones consumed by the Winograd algorithm, which directly reduces the number of multiplications.
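As an illustration only (not the paper's actual training procedure, which learns and prunes the weights directly in the Winograd domain), the sketch below applies a hypothetical magnitude-threshold prune to the transformed weights and a ReLU to the transformed activations, so the element-wise product is taken between two sparse matrices. It reuses the F(2x2, 3x3) matrices from the first example, and prune_threshold is an invented parameter.

```python
def sparse_winograd_tile(d, g, prune_threshold=0.1):
    """Winograd-domain ReLU + pruning for one 4x4 input tile d and one 3x3 filter g."""
    U = G @ g @ G.T                                    # move the filter into the Winograd domain
    U = np.where(np.abs(U) < prune_threshold, 0.0, U)  # prune small Winograd-domain weights
    V = np.maximum(B_T @ d @ B_T.T, 0.0)               # ReLU applied after the input transform
    M = U * V                                          # EWMM: a sparse kernel can skip the zero entries
    return A_T @ M @ A_T.T                             # output transform
```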

The details can be found in the paper. The figure there explains it very clearly: applying the same sparsification at different points in the pipeline has very different effects on the result.

The formula for sparsification is:
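The original post shows the formula as an image. Based on the description above, the per-channel computation takes roughly the following form (a paraphrase rather than a verbatim copy from the paper, with Prune denoting the sparsifying mask applied to the Winograd-domain weights):

```latex
Y \;=\; A^{\top}\Big[\,\mathrm{Prune}\!\left(G\,g\,G^{\top}\right)\;\odot\;\mathrm{ReLU}\!\left(B^{\top} d\,B\right)\Big]\,A
```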

In the final accuracy comparison of VGG on CIFAR-10, even with aggressive pruning the classification accuracy drops very little, while the computational workload is reduced by 13.3x: fewer weights and fewer multiplications.

For comparisons on other datasets and networks (ImageNet, CIFAR-100, ResNet, etc.), please refer to the paper.

 

4. Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA

This paper was published at FPGA 2018, a top conference in the field. The authors extend the Winograd algorithm from 2D to 3D convolution and implement it on an FPGA, establishing a unified computation template. The schematic is shown in the figure.

The matrix transforms and the convolution itself are treated as a single unit, which reduces the number of multiplications and saves multiplier resources on the FPGA.

Comparing this design with the second paper (SenseTime's FPGA implementation), the number of loop levels is the same. With some extra processing, the original 2D Winograd matrix computation is extended to 3D, at the cost of a more involved implementation.
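To show how the 2D formulation extends to 3D, here is a minimal sketch under the nested-transform view (reusing B_T, G, and A_T from the first example; it is not the paper's template code): the same 1D transforms are simply applied along each of the three axes, so F(2x2x2, 3x3x3) needs 4^3 = 64 multiplications per tile instead of the 2^3 x 3^3 = 216 of direct 3D convolution.

```python
def winograd_f2x2x2_3x3x3(d, g):
    """F(2x2x2, 3x3x3): 4x4x4 input tile d, 3x3x3 filter g -> 2x2x2 output tile."""
    U = np.einsum('ia,jb,kc,abc->ijk', G, G, G, g)        # filter transform along all three axes
    V = np.einsum('ia,jb,kc,abc->ijk', B_T, B_T, B_T, d)  # input transform along all three axes
    M = U * V                                             # element-wise multiplication: 64 multiplies
    return np.einsum('ia,jb,kc,abc->ijk', A_T, A_T, A_T, M)  # output transform

# Sanity check against direct 3D cross-correlation on one random tile.
d3 = np.random.rand(4, 4, 4).astype(np.float32)
g3 = np.random.rand(3, 3, 3).astype(np.float32)
direct3 = np.array([[[np.sum(d3[i:i + 3, j:j + 3, k:k + 3] * g3) for k in range(2)]
                     for j in range(2)] for i in range(2)])
print(np.allclose(winograd_f2x2x2_3x3x3(d3, g3), direct3, atol=1e-4))  # True
```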

The implementation flow is shown in the figure above: the matrix transforms are applied first inside the template, and the final Winograd element-wise matrix multiplication is performed in the PE. This module is in fact very similar to the SenseTime design; the overall dataflow and processing structure closely match.

The final evaluation shows a large performance improvement, exceeding 1.0 TOPS. More details can be found in the paper.

 

For the analysis of the Winograd algorithm, GEMM, and other efficient convolution implementations, please refer to the previous review article: https://blog.csdn.net/qq_32998593/article/details/86177151

Origin: blog.csdn.net/qq_32998593/article/details/86181663