MobileNet V1 on FPGA: an FPGA deep learning accelerator design (CNN accelerators based on FPGAs)

Automatic Generation of Multi-precision Multi-arithmetic CNN Accelerators for FPGAs

Recently, a paper appeared on arXiv that implements MobileNet V1 on an FPGA without using any off-chip resources at all: it relies on on-chip memory instead of off-chip RAM, fitting the entire model into the FPGA's limited internal resources. It reaches a frame rate of 3000 FPS, which is the fastest implementation I have seen recently. The network uses a multi-precision implementation and a combined hardware/software design. The overall implementation flow is shown in the figure.

Figure 1: Design flow of the FPGA-based MobileNet V1 accelerator

 

The article also reviews the conventional approach to implementing such an algorithm on an FPGA, namely unrolling the computation (loop rolling/unrolling, flattening). This work uses flattened streaming cores for the computation: the entire convolution is implemented with registers, which allows it to operate efficiently, although the engineering effort involved is considerable. The figure below shows the implementation of standard convolution: in each cycle, the value of each point across the c' output channels of a layer is computed, realized with shifts or multipliers. The second sub-figure shows the computation of depthwise convolution, where a separate filter is applied to each input channel. Together they illustrate the principles of standard convolution versus depthwise convolution.

Figure 2: The implementation of the network's convolutions
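To make the distinction concrete, here is a minimal software sketch (my own illustration, not the paper's code; array layouts and names are assumed) contrasting standard and depthwise 3x3 convolution as plain C++ loop nests. On the FPGA these loops are flattened into streaming register logic rather than executed sequentially.

```cpp
#include <vector>

// Illustrative only: plain loop nests for 3x3, stride-1, no-padding convolution.
using Tensor3D = std::vector<std::vector<std::vector<float>>>; // [C][H][W]

// Standard convolution: every output channel co sums over ALL input channels.
void conv_standard(const Tensor3D& in, const std::vector<Tensor3D>& w, Tensor3D& out) {
    int C = in.size(), H = in[0].size(), W = in[0][0].size();
    int Cout = w.size();
    for (int co = 0; co < Cout; ++co)
        for (int y = 0; y + 2 < H; ++y)
            for (int x = 0; x + 2 < W; ++x) {
                float acc = 0.f;
                for (int ci = 0; ci < C; ++ci)            // sum over input channels
                    for (int ky = 0; ky < 3; ++ky)
                        for (int kx = 0; kx < 3; ++kx)
                            acc += in[ci][y + ky][x + kx] * w[co][ci][ky][kx];
                out[co][y][x] = acc;
            }
}

// Depthwise convolution: each input channel has its own single filter and the
// channels are never mixed (MobileNet mixes them afterwards with 1x1 convolutions).
void conv_depthwise(const Tensor3D& in, const Tensor3D& w, Tensor3D& out) {
    int C = in.size(), H = in[0].size(), W = in[0][0].size();
    for (int c = 0; c < C; ++c)
        for (int y = 0; y + 2 < H; ++y)
            for (int x = 0; x + 2 < W; ++x) {
                float acc = 0.f;
                for (int ky = 0; ky < 3; ++ky)
                    for (int kx = 0; kx < 3; ++kx)
                        acc += in[c][y + ky][x + kx] * w[c][ky][kx];
                out[c][y][x] = acc;
            }
}
```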

 

For the multi-precision, multi-bit-width implementation, activations are set to 8-bit fixed-point numbers; only the split between integer and fractional bits varies from one convolution layer to the next. The per-layer format is found with a greedy search algorithm.

Figure 3: N-bit fixed-point representation, where p is the number of fractional bits
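As a rough illustration of this format, the sketch below (my own, with assumed names, not the paper's code) quantizes a float to an N-bit signed fixed-point value with p fractional bits; in the paper, p is chosen per layer by the greedy search.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Quantize x to an N-bit signed fixed-point value with p fractional bits,
// i.e. value ~= q * 2^-p, saturating on overflow.
int32_t to_fixed(float x, int n_bits, int p) {
    int32_t q_max = (1 << (n_bits - 1)) - 1;   // e.g. +127 for 8 bits
    int32_t q_min = -(1 << (n_bits - 1));      // e.g. -128 for 8 bits
    int32_t q = static_cast<int32_t>(std::lround(x * std::pow(2.0f, p)));
    return std::clamp(q, q_min, q_max);
}

// Recover the real value represented by q.
float from_fixed(int32_t q, int p) {
    return static_cast<float>(q) * std::pow(2.0f, -p);
}
```

For example, with n_bits = 8 and p = 5, the value 0.8125 maps to q = 26, since 0.8125 × 2^5 = 26.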

The author also quantizes the BN (batch normalization) layer separately, here using 16-bit fixed-point numbers.

Figure 4: The quantization process of the BN layer
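One common way to realize this, sketched below under my own assumptions (the paper's exact layout may differ), is to fold BN into a per-channel scale and bias and store both as 16-bit fixed-point values:

```cpp
#include <cmath>
#include <cstdint>

// Fold batch-norm y = gamma*(x - mean)/sqrt(var + eps) + beta into a
// per-channel scale and bias, stored as 16-bit fixed point with p fractional
// bits (saturation omitted for brevity).
struct BNFixed {
    int16_t scale;  // ~= gamma / sqrt(var + eps), in Q(p)
    int16_t bias;   // ~= beta - mean * gamma / sqrt(var + eps), in Q(p)
};

BNFixed quantize_bn(float gamma, float beta, float mean, float var,
                    float eps, int p) {
    float s = gamma / std::sqrt(var + eps);
    float b = beta - mean * s;
    float scale_f = std::ldexp(1.0f, p);       // 2^p
    return { static_cast<int16_t>(std::lround(s * scale_f)),
             static_cast<int16_t>(std::lround(b * scale_f)) };
}

// Apply BN to a fixed-point activation x (Q(p)): result = ((scale * x) >> p) + bias.
int32_t apply_bn(int32_t x, const BNFixed& bn, int p) {
    return ((static_cast<int64_t>(bn.scale) * x) >> p) + bn.bias;
}
```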

The overall implementation is quite involved, and the details deserve careful study; in practice there are many issues to handle, starting with the various configuration files. C++ is used at the upper level for the algorithms and code that are difficult to optimize and implement by hand, while Verilog is used for the convolution template files and the depthwise-separable convolutions. These modules are highly regular, with large numbers of for-loop-style compute blocks, so Verilog is very efficient for them. The mixed-precision implementation needs registers to realize its advantages, whereas floating-point operations consume far more resources, which is why an HLS-like upper layer is used for them. The pipeline design requires careful clock accounting at every layer, and the output of one layer must be fed directly into the computation of the next. Once the entire pipeline is full, the stated 3000 FPS can be reached; processing a single image on its own would not achieve that rate.

Figure 5: The pipelined implementation process
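The sketch below (an HLS-style illustration of the general idea, not the paper's code; the layer bodies are trivial pass-throughs and the sizes are placeholders) shows how one layer's output stream can feed the next layer directly so that all layers run concurrently as a pipeline. Once the pipeline is full, a new frame completes every initiation interval, which is how rates like 3000 FPS arise even though any single frame's latency is much longer.

```cpp
#include <hls_stream.h>
#include <ap_fixed.h>

// Example 8-bit activation type; the integer/fraction split here is assumed.
typedef ap_fixed<8, 3> act_t;

// Stand-in for one layer: just forwards N values (a real layer would do MACs
// on registers here), pipelined so it accepts one value per cycle.
template <int N>
void layer(hls::stream<act_t>& in, hls::stream<act_t>& out) {
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        out.write(in.read());
    }
}

void mobilenet_top(hls::stream<act_t>& img_in, hls::stream<act_t>& result_out) {
#pragma HLS DATAFLOW                       // all layers run concurrently, streaming
    hls::stream<act_t> s1, s2;
    layer<224 * 224>(img_in, s1);          // layer 1 output feeds layer 2 directly
    layer<224 * 224>(s1, s2);
    layer<224 * 224>(s2, result_out);
}
```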

There are many details in the concrete implementation: the author's search algorithm for the optimal weight precisions, and how the design adapts to different input sizes and model sizes while remaining expandable and configurable; these are a real advantage for an FPGA implementation. Other FPGA implementations are typically written once and are inconvenient to resize afterwards, because changing the model affects the timing and alignment of the pipeline. Here the model is defined in software, the circuit is then synthesized in hardware and converted into a bitstream, which speeds up prototyping of all kinds of models. Overall it is good work; if you want a closer look at the details, read the original paper.

 

Paper link: Automatic Generation of Multi-precision Multi-arithmetic CNN Accelerators for FPGAs https://arxiv.org/pdf/1910.10075v1.pdf
