Deep learning FPGA development method

 

https://blog.csdn.net/weixin_35729512/article/details/79763952

Overview of FPGA deep learning directions

      Traditional CNN frameworks (TensorFlow, Caffe) run on GPUs and CPUs, but those platforms suffer from high power consumption, poor heat dissipation, and high cost. At the same time, deep learning algorithms are difficult to develop on pure FPGA chips: the development cycle is long, which makes a pure FPGA poorly suited to implementing CNN algorithms.

The CNN workflow is therefore split into two parts: training (on a PC) + deployment on a heterogeneous platform (SoC).

      Therefore, heterogeneous SoC platforms such as the FPGA+ARM architecture have appeared, and both Intel and Xilinx offer their own heterogeneous chips. Intel mainly promotes OpenCL and Xilinx mainly promotes HLS; both approaches develop in a higher-level language (C/C++, with some differences between the two flows), and each company provides its own development environment that maps the code onto a specific hardware structure. At present, most companies use Xilinx FPGAs for deep-learning-related development work: first build a deep learning demo, then optimize the specific algorithm, which is tied closely to the FPGA's hardware structure.
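As a flavour of this higher-level development model, below is a minimal HLS-style C++ sketch of a multiply-accumulate loop. The pipeline pragma follows Vivado/Vitis HLS conventions, while the function name and vector size are illustrative assumptions rather than anything from the original article; the tool would synthesize the loop into a pipelined datapath that performs one multiply-accumulate per clock once the pipeline fills.

// Minimal HLS-style C++ sketch (assumed names and sizes; pragma per Vivado/Vitis HLS conventions).
const int N = 1024;

int mac_kernel(const short a[N], const short b[N]) {
    int acc = 0;
mac_loop:
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        acc += a[i] * b[i];   // one multiply-accumulate per loop iteration
    }
    return acc;
}

The same source also compiles as ordinary C++ for functional simulation on the PC, which is the main appeal of the HLS/OpenCL flows for software engineers.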

Xilinx's flagship reVISION development stack aims to let more software engineers master FPGA development methods and define hardware with software.

It is summarized as follows:

Today's deep learning hardware platforms:

(1) GPU/CPU

(2) SoC

(3) Shenjian's (DeePhi's) public DPU architecture (Zynq)

(4) Google's TPU architecture

The main push for deep learning FPGA development: software and hardware co-design

Experience: FPGAs as a whole are developing in the direction of SoCs, as can be seen from the latest hardware platform architectures launched by Intel and Xilinx.

Xilinx's heterogeneous FPGA SoC platforms are mainly used in two areas.

1. Data center acceleration (Huawei, Microsoft, Tencent, Amazon, Baidu, Alibaba) (Zynq UltraScale+ MPSoC), cloud computing.

Included in the stack are the following libraries:

    DNN — Xilinx's Deep Neural Network (DNN) library is a highly optimized library for building deep learning inference applications. It is carefully designed to achieve maximum computational efficiency with 16-bit and 8-bit integer data types.

    GEMM — Xilinx's General Matrix Multiplication (GEMM) library is based on the level-3 Basic Linear Algebra Subprograms (BLAS). It delivers optimized performance for 16-bit and 8-bit integer data types and supports matrices of arbitrary size (a minimal int8 GEMM sketch follows this list).

    HEVC decoder and encoder — HEVC/H.265 is the latest video compression standard from the MPEG and ITU standards bodies and the successor to H.264, reducing bandwidth by up to 50%. Xilinx offers two encoders: a high-quality, highly flexible real-time encoder that supports most video data center workloads, and an alternative solution for non-camera-generated content. The decoder supports the applications of both encoders.

    Data Mover (SQL) — The SQL Data Mover library makes it easier to use Xilinx FPGAs to accelerate data analytics workloads. It moves blocks of data from database tables into the on-chip memory of the FPGA accelerator card over PCIe, coordinating a standard connection to the SQL database. The library is optimized to maximize use of the PCIe bandwidth between the host CPU and the accelerator functions on the FPGA device.

    Compute Kernels (SQL) — A library that accelerates a large number of core SQL functions (such as decimal and date types, scan, compare, and filter operations) in FPGA hardware. The compute functions are optimized to exploit the massive hardware parallelism of the FPGA.
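To make the 8-bit integer data path mentioned in the DNN and GEMM items concrete, here is the minimal C++ sketch referenced in the GEMM item above. It is not the Xilinx library's API; it simply shows the level-3 BLAS operation C = A x B with int8 inputs and int32 accumulation, which is the usual arrangement for low-precision inference.

// Minimal sketch (not the Xilinx GEMM library API): C = A * B with int8 inputs
// and a wide int32 accumulator to avoid overflow.
#include <cstdint>
#include <vector>

void gemm_int8(const std::vector<int8_t>& A,   // M x K, row-major
               const std::vector<int8_t>& B,   // K x N, row-major
               std::vector<int32_t>& C,        // M x N, row-major
               int M, int K, int N) {
    C.assign(static_cast<size_t>(M) * N, 0);
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            int32_t acc = 0;
            for (int k = 0; k < K; ++k)
                acc += static_cast<int32_t>(A[m * K + k]) *
                       static_cast<int32_t>(B[k * N + n]);
            C[m * N + n] = acc;
        }
}

An FPGA implementation would tile these loops, stream the tiles through on-chip memory, and unroll/pipeline the inner product, but the arithmetic stays the same.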

2. Machine learning acceleration, deep learning acceleration, embedded vision, computer vision (Zynq), face recognition, and industrial control systems (motor control, variable-frequency control).

1. Autonomous driving system: real-time object detection based on Zynq

Machine vision technology can be applied to autonomous driving systems to collect and detect road condition information. Within machine vision, the CNN (Convolutional Neural Network) has an overwhelming advantage in image processing. A CNN can be divided into many "network layers" according to its complexity; a typical convolutional neural network consists of convolutional layers, pooling layers, and fully connected layers. These layers progressively reduce the dimensionality of the large image recognition problem, and after training the network performs efficient image recognition.
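As a rough illustration of the layer types just described, the sketch below applies one 3x3 convolution (with a ReLU) followed by 2x2 max pooling to a single-channel image. The function names and the single-channel simplification are assumptions for illustration; a real CNN stacks many such layers and ends with fully connected layers.

// Single-channel sketch: 3x3 valid convolution + ReLU, then 2x2 max pooling.
// Names and sizes are illustrative; a real CNN stacks many such layers.
#include <algorithm>
#include <vector>

// Valid 3x3 convolution: output is (H-2) x (W-2), with ReLU applied.
std::vector<float> conv3x3_relu(const std::vector<float>& in, int H, int W,
                                const float k[3][3]) {
    std::vector<float> out((H - 2) * (W - 2), 0.f);
    for (int y = 0; y < H - 2; ++y)
        for (int x = 0; x < W - 2; ++x) {
            float acc = 0.f;
            for (int ky = 0; ky < 3; ++ky)
                for (int kx = 0; kx < 3; ++kx)
                    acc += in[(y + ky) * W + (x + kx)] * k[ky][kx];
            out[y * (W - 2) + x] = std::max(acc, 0.f);   // ReLU
        }
    return out;
}

// 2x2 max pooling halves each spatial dimension (H and W assumed even).
std::vector<float> maxpool2x2(const std::vector<float>& in, int H, int W) {
    std::vector<float> out((H / 2) * (W / 2));
    for (int y = 0; y < H / 2; ++y)
        for (int x = 0; x < W / 2; ++x)
            out[y * (W / 2) + x] = std::max(
                std::max(in[(2 * y) * W + 2 * x],     in[(2 * y) * W + 2 * x + 1]),
                std::max(in[(2 * y + 1) * W + 2 * x], in[(2 * y + 1) * W + 2 * x + 1]));
    return out;
}

Each convolution/pooling pair shrinks the spatial dimensions while extracting higher-level features, which is the dimensionality reduction described above.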

 

A Zynq-based CNN (Convolutional Neural Network) system was ported to a car for road testing (shown in the video below). The video shows that it can detect and identify pedestrians, cars, animals, road signs, and other objects on the road in real time.

ATUS's solution uses the Xilinx Zynq-7020 SoC and ports the YOLO image processing algorithm. YOLO (You Only Look Once) is an object detection deep network based on GoogLeNet. Real-time performance and effectiveness are YOLO's biggest features and advantages, and both are essential for autonomous driving and ADAS systems. The system's video stream reaches 46.7 fps (@416x234).

The integrated Zynq Z7045 SoC device uses its programmable logic resources to implement the CNN image recognition algorithm (as shown in the figure above). The camera handles image capture, and from the captured video stream the system can identify ten different kinds of objects, such as pedestrians, cars, road signs, and railings. The programmable logic portion of the Zynq Z7045 SoC runs at a 200 MHz clock, and the power consumption of the entire system is only 10.432 W, about 10% of that of a CNN solution running on a CPU or GPU.

 

Xilinx Zynq-7000 series devices combine a dual-core ARM Cortex-A9 processor with 28nm programmable logic. Their excellent performance-per-watt and design flexibility have made them popular with engineers since their introduction. The Zynq Z7045 sits at the high end of the series, integrating logic cells up to 6.25M. As the computing requirements and performance demands of various applications keep growing, the high performance brought by the FPGA's parallel computing makes it increasingly widely used in application scenarios such as data centers, deep learning, and image compression and decoding.

2. Multi-sensor fusion

Mario Bergeron, a senior FPGA/DSP design engineer at Xilinx partner Avnet, showed a demo that uses two cameras to capture images and fuse them. The hardware platform is the Avnet PicoZed SOM (System-on-Module), whose core is a Xilinx Zynq Z-7030 SoC, together with modules such as the FMC expansion board from the PicoZed Embedded Development Kit. One camera is the Python-1300-C color image sensor with an FMC interface: 1280x1024 resolution, flexible configuration, high sensitivity, and high performance, mainly targeted at industrial image acquisition, and also designed and produced by Avnet. The other camera is a FLIR infrared thermal imaging sensor that outputs a 60x80 infrared video stream and communicates with the PicoZed SOM over a Pmod interface.
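A hedged sketch of the kind of fusion such a demo performs: nearest-neighbour upscaling of the low-resolution thermal frame to the colour frame's size, followed by a simple alpha blend. The resolutions come from the text above; the blending method, function name, and single-channel simplification are assumptions for illustration, not Avnet's actual pipeline.

// Illustrative fusion of a low-resolution thermal frame (e.g. 60x80) with a
// high-resolution frame (e.g. 1280x1024), treated here as single-channel:
// nearest-neighbour upscale of the thermal image, then alpha blend.
#include <cstdint>
#include <vector>

std::vector<uint8_t> fuse_frames(const std::vector<uint8_t>& color, int cw, int ch,
                                 const std::vector<uint8_t>& thermal, int tw, int th,
                                 float alpha) {
    std::vector<uint8_t> out(static_cast<size_t>(cw) * ch);
    for (int y = 0; y < ch; ++y)
        for (int x = 0; x < cw; ++x) {
            int ty = y * th / ch;                      // nearest-neighbour lookup
            int tx = x * tw / cw;
            float blended = alpha * thermal[ty * tw + tx] +
                            (1.f - alpha) * color[y * cw + x];
            out[static_cast<size_t>(y) * cw + x] = static_cast<uint8_t>(blended);
        }
    return out;
}

On a Zynq-class device the per-pixel loop is exactly the kind of operation that would be moved into programmable logic, while the ARM cores handle configuration and the video interfaces.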

 

                                            Figure 1: Image fusion

 

 

                          Figure 2: Target detection

          On the ARM Cortex-A9 processor of the Zynq Z-7030 SoC, the demo reaches 30 frames per second.

 The Zynq series of chips powers a wide variety of smart industrial solutions: industrial smart cameras, smart logistics, EtherCAT bus solutions, multi-sensor fusion, pedestrian detection, multi-axis motor drives, and face recognition (Queens Technology).

Deep learning (algorithm) + SoC (heterogeneous hardware platform) == FPGA-based heterogeneous computing acceleration (call it an embedded application of deep learning for the time being).

High-level development and algorithm optimization require either the SDK provided by Intel or the BSP (board support package) provided by Xilinx, together with in-depth knowledge of FPGAs, algorithms, and ARM.

FPGA heterogeneous computing

        Heterogeneous computing is a hot term at the moment: chip vendors, system makers, ordinary engineers, and students have all shown great interest in it. Heterogeneous computing mainly refers to building a system from computing units with different instruction sets and architectures so that they cooperate on the computation. The term still sounds new to most people, but the concept is not: heterogeneous computing technology was born as early as the mid-1980s. Common computing units used to build heterogeneous computing systems include the CPU (central processing unit), GPU (graphics processing unit), coprocessors, DSP (digital signal processor), ASIC (application-specific integrated circuit), FPGA (field-programmable gate array), and so on.

        Why do we need heterogeneous computing systems? The reason is simple: we need more powerful and more efficient computing systems. In the past, thanks to advances in semiconductor technology and rising clock frequencies, most computer applications could keep improving performance without structural changes or dedicated hardware accelerators. As Moore's Law approaches its limits, constraints on memory, power consumption, and heat dissipation make it impossible for computing systems to gain performance through simple scaling. Introducing dedicated units and, through sensible task allocation and management, letting each type of computing unit do what it does best has become the inevitable choice for further increasing a system's computing power.
————————————————
Copyright statement: This article is an original article by the CSDN blogger "Ning Weiyang" and follows the CC 4.0 BY-SA license. Please include the original source link and this statement when reposting.
Original link: https://blog.csdn.net/weixin_35729512/article/details/79763952
