An in-depth look at heterogeneous computing chips: CPU vs. GPU / FPGA / ASIC

With the rapid growth of Internet users and the explosive expansion of data volume, the computing demands on data centers are rising sharply. Applications such as online inference for deep learning, live video transcoding, image compression and decompression, and HTTPS encryption require computing capability far beyond what traditional CPUs can deliver. Now that Moore's Law is faltering, this article looks at the "new" members of the system architecture (GPU / FPGA / ASIC) that are giving data center businesses a powerful new engine.

1 Heterogeneous computing: WHY

Since the CPU has served us perfectly well, why should we consider heterogeneous computing chips at all?

Because of the rapid growth of Internet users and the explosive expansion of data volume, data center computing demands are rising sharply, and applications such as online inference for deep learning, live video transcoding, image compression and decompression, and HTTPS encryption need computing capability far beyond what the traditional CPU can provide.

Historically, thanks to the continuous evolution of semiconductor process technology, the throughput and performance of computer systems kept improving, with processor performance doubling roughly every 18 months (the popular formulation of "Moore's Law"), so processors could keep pace with the demands of application software. In recent years, however, semiconductor process improvements have approached physical limits: circuits have grown ever more complex, a single design now costs millions of dollars to develop, and bringing a new product to volume production can cost billions. On March 24, 2016, Intel announced that it was officially retiring its "Tick-Tock" processor development model, stretching the development cycle from two years to three. To this extent, Moore's Law has nearly failed at Intel.

On one hand, processor performance can no longer grow according to Moore's Law; on the other, data growth demands computing performance that outpaces anything "Moore's Law" growth could deliver. Processors by themselves cannot satisfy the performance requirements of high-performance computing (HPC) applications, so a gap has opened between demand and performance (see Figure 1).

One solution is heterogeneous computing: use dedicated coprocessors as hardware accelerators to improve processing performance.

Figure 1: The gap between the growth of computing demand and the growth of computing capability

2 Heterogeneous computing: STANDARDS

When we build solutions for a business, there are usually four candidate deployment platforms: CPU, GPU, FPGA, and ASIC. So what criteria should we use to judge the merits of these computing platforms?


Figure: "I am the referee; I have the final say on the standards."

An ideal coprocessor today should be designed around three basic hardware capabilities. First, it should provide specialized hardware that accelerates the critical processing functions needed by a variety of applications. Second, it should be architecturally flexible, using parallel and pipelined structures that can be updated to keep pace with evolving algorithms. Finally, it should offer a high-bandwidth, low-latency interface to the main processor and to the memory system.

Beyond these hardware requirements, an ideal coprocessor should also satisfy the HPC market's "4P" requirements: performance, productivity, power, and price.

The HPC market's baseline performance requirement is that the algorithm be accelerated in full, not just one step of it: the coprocessor must be able to accelerate the entire application.

The productivity requirement comes from end users. In a conventional computer system, the coprocessor must be easy to install, offer a simple way to configure the system, and accelerate existing applications.

The power requirement stems from the constraints the HPC market places on installing and running computing systems. Most users can provide only limited space for a computer. The lower a computing system's power consumption, the fewer cooling measures are needed to keep it from overheating. A low-power coprocessor therefore not only lowers the system's operating cost but also improves its space utilization.

Price has become an increasingly important factor in the HPC market. A decade ago, some applications' performance demands exceeded what a single processor could deliver, which drove the adoption of special architectures such as massively parallel processing (MPP) and symmetric multiprocessing (SMP). Such systems, however, required very expensive custom processors, dedicated data paths, and specialized program development.

The HPC market has since abandoned such expensive approaches in favor of more cost-effective cluster computing. Cluster computing uses standard commodity architectures, such as Intel and AMD processors; industry-standard interconnects, such as Gigabit Ethernet and InfiniBand; and standard programming languages, such as C running on low-cost Linux operating systems. Today's coprocessors must integrate smoothly into commodity cluster environments, at a cost roughly equal to adding another node to the cluster.

With these basic criteria in hand, let us take today's hottest field, deep learning, as an example and analyze and compare the chips in terms of chip architecture, computing performance, power consumption, and development difficulty.

3.2 Chip computing performance

Deep learning, formally known as DNN (Deep Neural Networks), developed out of the artificial neural network (Artificial Neural Networks) model. We take deep learning as the starting point for analyzing the computing performance of each chip. Figure 3 shows the basic structure of a neural network; the bulk of the model's computation is the multiplication of two matrices: each layer's output and its corresponding weight matrix.

Figure 3: Basic structure of a neural network
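To make this workload concrete, here is a minimal sketch in Python/NumPy (my illustration, not code from the original article) of one fully connected layer; the multiply-add operations of the matrix product dominate its cost:

```python
import numpy as np

def dense_layer(x, W, b):
    """One fully connected layer: the matrix product x @ W is the
    multiply-add hot spot that the chips below are compared on."""
    z = x @ W + b            # the bulk of the layer's FLOPs happen here
    return np.maximum(z, 0)  # ReLU activation

# Illustrative sizes: batch of 64, 1024 inputs, 4096 outputs
x = np.random.rand(64, 1024).astype(np.float32)
W = np.random.rand(1024, 4096).astype(np.float32)
b = np.zeros(4096, dtype=np.float32)
y = dense_layer(x, W, b)
# Multiply-add count for this one layer: 64 * 1024 * 4096 = ~268M MACs
```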

To compare the computing power of the CPU, GPU, FPGA, and ASIC side by side, we examine three practical questions:

1. What multiply-add computing power does the chip's hardware provide?

2. Why does the chip have that much multiply-add computing power?

3. Can the chip's hardware multiply-add computing power be fully exploited?

With these three questions in mind, let us compare the computing power of each chip.

3.2.1 CPU computing power analysis

We analyze CPU computing power here using Intel's Haswell architecture. A Haswell core has two FMA (fused multiply-add) units; each FMA can perform a multiply and an add on 256-bit data in one clock cycle. For 32-bit single precision this corresponds to: (256 bit / 32 bit) x 2 (FMA units) x 2 (multiply and add) = 32 SP FLOPs/cycle, i.e., 32 single-precision floating-point operations per clock cycle.

Peak CPU floating-point performance = floating-point operations per cycle x CPU frequency x number of CPU cores. For Intel's E5-2620 v3 CPU, the peak computing capability works out to 6 (cores) x 2.4 GHz (frequency) x 32 SP FLOPs/cycle = 460.8 GFLOPs/s, i.e., a peak of roughly 460 GFLOPs of single-precision computation per second.
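The peak-performance arithmetic is simple enough to write down directly; a small sketch (Python, plugging in the article's E5-2620 v3 numbers):

```python
def peak_flops(units, freq_hz, flops_per_cycle_per_unit):
    """Theoretical peak = computing units x frequency x FLOPs/cycle/unit."""
    return units * freq_hz * flops_per_cycle_per_unit

# Haswell core: 2 FMA units x (256/32) SP lanes x 2 ops (mul+add) = 32 FLOPs/cycle
cpu_peak = peak_flops(units=6, freq_hz=2.4e9, flops_per_cycle_per_unit=32)
print(f"E5-2620 v3 peak: {cpu_peak / 1e9:.1f} GFLOPs/s")  # 460.8 GFLOPs/s
```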

Can the CPU's chip architecture fully exploit this floating-point computing power? CPU instruction execution proceeds as: instruction fetch -> instruction decode -> instruction execute. Only in the execute stage do the computing units do useful work; during the fetch and decode stages, the computing units sit idle, as shown in Figure 4.

Figure 4: CPU instruction execution flow

To improve instruction execution efficiency, the CPU reads the next several instructions in advance while the current one is executing, so that instructions are processed in a pipeline, as shown in Figure 5. Pipelining and prefetching rest on the premise that instructions are independent of one another: how one instruction executes must not depend on waiting for the result of a preceding one.

Figure 5: CPU instruction pipeline
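A toy timing model (my own sketch, not from the article) shows the payoff: with a three-stage fetch/decode/execute pipeline and mutually independent instructions, one instruction completes per cycle once the pipeline is full, instead of one per three cycles:

```python
def cycles_unpipelined(n_instructions, stages=3):
    """Each instruction passes through all stages before the next begins."""
    return n_instructions * stages

def cycles_pipelined(n_instructions, stages=3):
    """Independent instructions overlap: after the pipeline fills,
    one instruction retires per cycle."""
    return stages + (n_instructions - 1)

n = 1000
print(cycles_unpipelined(n))  # 3000 cycles
print(cycles_pipelined(n))    # 1002 cycles, roughly 3x the throughput
```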

Because the CPU is a general-purpose processor that must handle both computation and control, roughly 70% of its transistors are spent building caches and complex control logic to improve instruction efficiency, as shown in Figure 6. The result is great versatility: the CPU can handle computations of high complexity, but its raw computing performance is merely average.

Figure 6: CPU structure

From this analysis of CPU computing performance, there are three direct ways to improve it: increase the number of CPU cores, raise the CPU frequency, or modify the CPU architecture to add more FMA computing units. Of the three, increasing the core count lifts computing power the most directly, but it also raises power consumption and chip price, and only about 30% of each physical core's transistors are computing units. Raising the frequency has limited headroom, and too high a frequency causes excessive power consumption and overheating, which is why Intel and other chip makers have taken the multi-core route instead: cap the frequency of a single core and improve processing performance by integrating multiple cores. As for modifying the architecture to add FMA units, Intel adjusted its CPU architecture on the two-year "Tick-Tock" cadence, which since 2016 has slowed to a three-year iteration cycle.

3.2.2 GPU computing capability analysis

The GPU excels mainly at image-processing-like parallel computation, so-called coarse-grained parallelism (coarse-grain parallelism). The defining feature of image processing is that pixels are largely independent of one another, so the GPU provides a large number of computing units (up to several thousand) and large amounts of high-speed memory, allowing many pixels to be processed in parallel simultaneously for compute-dense workloads.

Figure 7 shows the GPU's design structure. The GPU's design starting point is that it is suited to compute-intensive, massively parallel computation. Accordingly, the GPU devotes more of its transistors to computing units, rather than to data caches and flow control as the CPU does. This design works because in data-parallel computation every data element runs the same program: there is no need for complicated control flow, and high computing capability matters more than large cache capacity.

Figure 7: GPU structure

In the GPU, one logic control unit drives many computing units. For those computing units to run fully in parallel, the control logic cannot be too complex: complex control flow prevents the computing units from operating in parallel. For example, excessive if...else if...else branching cannot be spread across the computing units, so the GPU's control unit has no need to process complicated logic quickly.
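A small NumPy illustration (mine, not GPU code from the article) of the contrast: per-element control flow serializes the work, while the data-parallel style evaluates both sides of the condition across all elements and then selects, which is what hardware with simple control logic favors. (On real GPUs, divergent branches within a warp execute one path at a time with inactive lanes masked off.)

```python
import numpy as np

x = np.random.randn(1_000_000).astype(np.float32)

def branchy(x):
    """Per-element if/else: the kind of control flow that cannot be
    spread across many simple parallel computing units."""
    out = np.empty_like(x)
    for i, v in enumerate(x):
        out[i] = v * 2.0 if v > 0 else v * 0.5
    return out

def data_parallel(x):
    """Evaluate both branches on every element, then select: no
    per-element control flow, so all lanes stay busy."""
    return np.where(x > 0, x * 2.0, x * 0.5)

assert np.allclose(branchy(x[:1000]), data_parallel(x[:1000]))
```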

We analyze GPU computing capability here using Nvidia's Tesla K40. The K40 contains 2880 stream processors (Stream Processors); the stream processor is the GPU's computing unit. Each stream processor contains a 32-bit single-precision floating-point multiply-add unit, i.e., it can perform two single-precision floating-point operations per clock cycle. Peak GPU floating-point performance = number of stream processors x GPU frequency x floating-point operations per cycle. For the K40: peak floating-point performance = 2880 (stream processors) x 745 MHz x 2 (multiply and add) = 4.29 TFLOPs/s, i.e., a peak of roughly 4.29 TFLOPs per second.

Can the GPU's chip architecture fully exploit its floating-point computing power? The GPU executes instructions much as the CPU does: fetch -> decode -> execute, and only in the execute stage do the computing units do useful work. Compared with the CPU, the GPU's control logic is simple, so keeping the instruction pipeline full requires that the processing itself be of low complexity and that the data items be independent of one another. Algorithms that are inherently serial therefore cause a significant drop in the GPU's usable floating-point computing power.

3.2.3 FPGA computing power analysis

The FPGA, a high-performance, low-power programmable chip, can be custom-designed for a specific algorithm. As a result, when handling massive amounts of data, the FPGA holds two advantages over the CPU and GPU: it computes more efficiently, and it sits closer to the I/O.

The FPGA uses neither software nor instructions; it is purely a hardware device. FPGA logic is programmed in a hardware description language, which is compiled directly into a configuration of transistor circuits. The user's algorithm is therefore implemented directly in transistor circuits, without translation through an instruction system.

The FPGA's full name reveals its function: field programmable gate array, a collection of logic gates whose interconnection can be programmed, and reprogrammed. Figure 8 shows a schematic of the FPGA's programmable internals.

Figure 8: FPGA internal structure

We analyze FPGA computing capability here using Xilinx's V7-690T, which contains 3600 DSP (Digital Signal Processing) slices; the DSP is the FPGA's computing unit. Each DSP can perform two single-precision floating-point operations (a multiply and an add) per clock cycle. Peak FPGA floating-point performance = number of DSPs x FPGA frequency x floating-point operations per cycle. Taking a V7-690T operating frequency of 250 MHz, peak floating-point performance = 3600 (DSPs) x 250 MHz x 2 (multiply and add) = 1.8 TFLOPs/s, i.e., a peak of roughly 1.8 TFLOPs per second.
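Reusing the same peak-performance formula as in the CPU section, a sketch comparing the three peak figures quoted in this article:

```python
def peak_flops(units, freq_hz, flops_per_cycle_per_unit):
    """Theoretical peak = computing units x frequency x FLOPs/cycle/unit."""
    return units * freq_hz * flops_per_cycle_per_unit

chips = {
    "E5-2620 v3 (CPU)": peak_flops(6,    2.4e9, 32),  # cores x 32 SP FLOPs/cycle
    "Tesla K40 (GPU)":  peak_flops(2880, 745e6, 2),   # stream processors x (mul+add)
    "V7-690T (FPGA)":   peak_flops(3600, 250e6, 2),   # DSP slices x (mul+add)
}
for name, flops in chips.items():
    print(f"{name}: {flops / 1e12:.2f} TFLOPs/s")
# CPU 0.46, GPU 4.29, FPGA 1.80 TFLOPs/s: raw peaks, before utilization
```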

Can the FPGA's chip architecture fully exploit its floating-point computing power? Because the FPGA is customized to the algorithm, there is no instruction fetch or decode as on the CPU and GPU: data streams directly through the circuit customized for the fixed algorithm, and every computing unit can do useful work on every clock cycle. The FPGA can therefore realize its full floating-point computing power, achieving higher computational efficiency than the CPU and GPU.

3.2.4 ASIC computing power analysis

The ASIC is a dedicated chip, somewhat different from conventional general-purpose chips: it is custom-built for a specific need. Both the computing power and the computational efficiency of an ASIC can be tailored to the algorithm, so compared with general-purpose chips the ASIC is superior in several respects: smaller size, lower power consumption, higher computing performance, higher computational efficiency, and the larger the production volume, the lower the cost. But the disadvantages are just as obvious: the algorithm is fixed, and once the algorithm changes the chip may become unusable. The present moment is a great eruption of artificial intelligence, with new algorithms pouring out and nowhere near stabilizing. How an ASIC can adapt to a variety of algorithms is its biggest problem: if it adopts CPU- or GPU-like architectures to accommodate many algorithms, it becomes as general-purpose as the CPU and GPU and loses its performance and power advantages.

Now let us look at the difference between FPGA and ASIC. The FPGA's basic principle is to integrate a large quantity of digital gates and memory into the chip; the user defines the connections between those gates and memories by burning a configuration file into the FPGA. This burn-in is not one-time: the user can configure the FPGA as a microcontroller (MCU) today, then edit the configuration file tomorrow and turn the same FPGA into an audio codec. An ASIC, by contrast, is an application-specific integrated circuit: once its circuit design and manufacturing are complete, it is fixed and cannot be changed.

Comparing an FPGA with an ASIC is like comparing Lego bricks with a factory-molded model. Suppose you recently watched Star Wars, noticed how popular Master Yoda is, and decided to make a Master Yoda toy to sell. How would you do it?

There are two ways: build it from Lego, or find a factory to make a custom mold. With Lego, once you have designed the toy's shape, you just buy a set of bricks. With a factory mold, beyond designing the toy's shape you must settle many other matters, such as whether the toy's material gives off odors, whether the toy melts at high temperature, and so on. So building the toy from Lego takes far less preparatory work than having a factory mold it, and the time from design to market is much shorter with Lego too.

FPGA and ASIC work the same way. With an FPGA, once the Verilog code is written, the hardware accelerator can be realized with the FPGA vendor's tools. Designing an ASIC additionally requires a great deal of verification and physical design (ESD, packaging, and so on), which takes far more time. And if the accelerator is intended for a special setting, such as military or industrial applications with high reliability requirements, an ASIC needs still more time for a design that meets those demands, whereas with an FPGA you can simply buy a military-grade, high-stability FPGA without affecting development time at all. However, although the design time is comparatively short, a Lego toy is rougher than a factory-customized one (i.e., its performance is worse), since the factory's mold is tailor-made (see the figure below).

Furthermore, if shipments are large, mass-producing the toy at the factory is much cheaper than building it from Lego. The same holds for FPGA and ASIC: an accelerator implemented as an ASIC in the best process will run 5-10x faster than the same accelerator on an FPGA, and once mass-produced, the ASIC's cost will be far lower than the FPGA solution's.
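The cost trade-off can be made concrete with a toy break-even model. The numbers below are purely illustrative assumptions of mine (real NRE and unit costs vary enormously by process node and device); the point is the structure: the ASIC pays a large one-time NRE cost but enjoys a low unit cost, while the FPGA has little NRE and a high unit cost.

```python
# All four figures are hypothetical, for illustration only.
ASIC_NRE = 2_000_000   # one-time masks + verification + physical design (assumed)
ASIC_UNIT = 10         # per-chip cost at volume (assumed)
FPGA_NRE = 50_000      # board bring-up and tooling (assumed)
FPGA_UNIT = 300        # per-device cost (assumed)

def total_cost(nre, unit_cost, volume):
    return nre + unit_cost * volume

# Break-even volume where ASIC_NRE + a*V equals FPGA_NRE + f*V
break_even = (ASIC_NRE - FPGA_NRE) / (FPGA_UNIT - ASIC_UNIT)
print(f"ASIC pays off beyond ~{break_even:,.0f} units")  # ~6,724 units
```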

The FPGA is fast to market; the ASIC is slow to market, needs long development, and carries a one-time cost (the lithography mask set for production) far higher than the FPGA's, though its performance is higher and its average cost at mass production is lower than the FPGA's. In terms of target markets, the FPGA's higher unit cost makes it unsuitable for price-sensitive segments, but a good fit for enterprise applications, military, and industrial electronics (fields where reconfigurability is genuinely needed), while the ASIC suits low-cost consumer electronics, though whether configurability in consumer electronics is a real or a false need is open to question.

The market situation bears this out: those using FPGAs to accelerate deep learning are mostly enterprise users. Baidu, Microsoft, IBM, and others all have dedicated teams building FPGA acceleration for servers, and the FPGA startup Teradeep also targets the server market. ASICs mainly target consumer electronics, as Movidius does. Since mobile devices belong to consumer electronics, future solutions there should be primarily ASIC-based.

3.3 Platform performance and power consumption comparison

Because chips are manufactured in different process technologies, which affect their power and performance, the comparison here uses chips at the same or a similar process node. Since no commercial ASIC chip for this space had yet appeared (Google's TPU was for internal use only, with no public information), we take the ASIC described in the published academic paper "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning" as the representative.

From the comparison above, in energy efficiency the ordering is: ASIC > FPGA > GPU > CPU. The root cause of this result: for compute-intensive algorithms, the more efficient the data movement and the computation itself, the higher the energy efficiency. The ASIC and FPGA both sit closer to the underlying I/O, so their computation and data movement are both efficient, but the FPGA has redundant transistors and interconnect and runs at a lower frequency, so its energy efficiency falls short of the ASIC's. The GPU and CPU are both general-purpose processors, which must go through instruction fetch, decode, and execute; this pipeline shields the underlying I/O and decouples hardware from software, but it makes data movement and computation less efficient, so their energy efficiency trails the ASIC's and FPGA's. The gap between GPU and CPU comes mainly from the CPU spending the majority of its transistors on caches and control logic; for compute-intensive algorithms of low complexity those transistors do no useful work, so the CPU's energy efficiency is lower than the GPU's.
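To put rough numbers on this ordering, here is an illustrative energy-efficiency calculation using the peak figures quoted above. The CPU and GPU wattages are the published TDP/board-power figures; the FPGA wattage is an assumption of mine for illustration, and real sustained efficiency depends heavily on utilization:

```python
platforms = {
    #                   (peak FLOPs/s, power in watts)
    "E5-2620 v3 (CPU)": (460.8e9, 85),   # 85 W published TDP
    "Tesla K40 (GPU)":  (4.29e12, 235),  # 235 W board power
    "V7-690T (FPGA)":   (1.8e12,  30),   # assumed nominal board power
}
for name, (flops, watts) in platforms.items():
    print(f"{name}: {flops / watts / 1e9:.1f} GFLOPs/W")
# CPU ~5.4, GPU ~18.3, FPGA ~60 GFLOPs/W: consistent with FPGA > GPU > CPU
```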

4 Summary and Outlook

Over their long development, these processor chips have each acquired distinctive market positions and usage characteristics. The CPU and GPU fields enjoy vast amounts of open-source and application software; any new technology's algorithms appear first on the CPU, whose programming resources are rich and easy to obtain, with low development cost and short development cycles. FPGA development uses low-level hardware description languages such as Verilog/VHDL, so developers need a fairly deep understanding of the FPGA's chip characteristics, but its highly parallel nature often yields order-of-magnitude gains in service performance. Moreover, the FPGA is dynamically reconfigurable: once deployed in a data center, it can be configured into different hardware-acceleration logic as the business requires. For example, an FPGA board currently deployed in a server with image-compression logic, serving the QQ business, can, at a moment when real-time ad click-through-rate estimation needs more computing resources, be turned through a simple reconfiguration process into a "new" piece of hardware serving real-time CTR estimation, which makes it very suitable for mass deployment. The ASIC achieves the best performance of all, namely high area efficiency, high speed, and low power; but ASIC development risk is great, it needs a market large enough to guarantee that the price covers the cost, and the long period from development to market makes it ill-suited to fields like deep learning, where algorithms such as CNNs iterate quickly.

Having said all this, when your business hits a bottleneck and needs a heterogeneous computing chip, can you choose the right chip based on your business characteristics and the chips' properties?

Having analyzed the characteristics of all these chips, we now come to the key point!

Today's FPGA has great performance potential: it supports variable-depth pipeline structures, provides massive parallel computing resources, and can complete very complex functions within a single clock cycle. The FPGA's programmability ensures that such a device can satisfy application software's special needs without the cost or delay of designing a custom coprocessor. Because the FPGA is reprogrammable, a single chip can provide very flexible custom coprocessing functions for multiple applications. With the FPGA, a business has unlimited potential. The same semiconductor technology that pushed processors to their performance limits has also grown the FPGA from a simple glue-logic controller into a high-performance programmable architecture. The FPGA can fully satisfy the HPC market's "4P" demands.

The FPGA's built-in memory is also a great performance advantage. For example, if the coprocessor's logic can access on-chip memory, its memory bandwidth is no longer limited by the number of device I/O pins. Moreover, the memory sits tightly coupled to the arithmetic logic, so external high-speed memory buffers are no longer needed, which also avoids the heavy power draw and coherency problems of accessing external buffers. Using internal memory also means the coprocessor needs no additional I/O pins to increase its accessible memory capacity, simplifying the design.

Many people remain skeptical of the FPGA because of its development difficulty and long development cycles. The good news is that HLS and OpenCL are becoming ever more mature; many applications can achieve great performance using these high-level languages directly.

Industry Success Stories

To better meet the demands on computing performance, many of the world's major IT companies have laid out and put into practice FPGA hardware acceleration.

Intel:

Intel spent $16.7 billion to acquire the FPGA maker Altera. Intel expects that by 2020, more than 30 percent of server CPUs will ship with an FPGA coprocessor.

IBM:

IBM and Xilinx announced a multi-year strategic collaboration to use Xilinx FPGAs on IBM POWER systems to accelerate workload processing, creating higher-performance, more energy-efficient data center applications.

Microsoft:

As early as 2014, Microsoft put Altera FPGAs to work in its Bing search business, doubling Bing's search throughput while cutting latency by 29%. By 2015, Microsoft had extended FPGAs to deep learning. In 2016, the paper Microsoft published at Micro, the top architecture conference, "A Cloud-Scale Acceleration Architecture", revealed its ambitions for data center architecture. Today, every server entering a Microsoft data center carries an FPGA board, following the basic architecture laid out in that paper.

The application scenarios covered in the paper include:

1. Network acceleration (e.g., network packet encryption)

2. Local application acceleration (Bing acceleration, acceleration of latency-sensitive DNN traffic)

3. Support for FPGA-to-FPGA communication and pooling of FPGA computing resources, decoupling the FPGA from the server and offering the concept of Hardware-as-a-Service

Facebook:

In 2016, Facebook also announced plans to work with Intel on building data centers using the Xeon-FPGA platform.

Baidu:

In China, Baidu has likewise launched an FPGA version of the Baidu Brain and put it into online service. The FPGA Baidu Brain has been applied to businesses including speech recognition, advertising click-through-rate prediction models, DNA sequence detection, and autonomous vehicles. It is reported that the FPGA Baidu Brain has improved the computing performance of online speech services, ad click-through-rate prediction models, and other workloads by 3 to 4 times.

This article is reprinted from Tengyun Court, with the author's permission.

Source: blog.csdn.net/kebu12345678/article/details/103888233