Embedded algorithm porting and optimization study notes 5: What are CPU, GPU, TPU, and NPU?

With the widespread adoption of AI, deep learning has become the mainstream approach in AI research and applications. Faced with parallel computation over massive data, AI places ever-growing demands on computing power, raising the bar for both the speed and the power consumption of the underlying hardware.

At present, besides the general-purpose CPU, hardware accelerators such as GPUs, NPUs, and FPGAs each have their own advantages in different deep learning applications. But which one is better?

Taking face recognition as an example, the basic processing pipeline and the computing-power distribution across its functional modules are as follows:
[Figure: face recognition pipeline and per-module computing-power distribution]
Why does this division of labor exist, and what does it mean?

To find out, we need to understand the principles, architectures, and performance characteristics of the CPU, GPU, NPU, and FPGA.

First, let's take a look at the architecture of a general-purpose CPU.

1. What is a CPU?

The central processing unit (CPU) is the core component of a computer. Its job is to interpret computer instructions and process the data in software: it reads instructions, decodes them, and executes them. The CPU consists mainly of two parts, a controller and an arithmetic unit, plus the caches and the buses that connect them. Together with main memory and input/output devices, it is one of the three core components of an electronic computer. In the overall architecture, the CPU is the hardware unit that directs all of the machine's hardware resources (such as memory and I/O units) and performs general-purpose computation; it is the computing and control core of the computer, and every operation in the software layers of the system is ultimately mapped, through the instruction set, onto CPU operations.
Its main logical blocks are the control unit (Control), the arithmetic logic unit (ALU), and the high-speed cache (Cache), connected by the data, control, and status buses (Bus). Put simply: a compute unit, a control unit, and a storage unit.
The architecture diagram is as follows:
[Figure: CPU architecture diagram]
The CPU follows the von Neumann architecture, whose core ideas are the stored program and sequential execution. This architecture devotes a great deal of chip area to the storage unit (Cache) and the control unit (Control), while the compute unit (ALU) occupies only a small fraction. The CPU is therefore severely limited in large-scale parallel computing, and is correspondingly better at logic control.
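To make "stored program, sequential execution" concrete, here is a toy sketch in Python of the fetch-decode-execute cycle that the control unit drives. The three-instruction machine is invented purely for illustration and matches no real instruction set:

```python
# A toy von Neumann machine: program and data share one memory, and the
# control loop executes exactly one instruction per step.
memory = [
    ("LOAD", 100),    # ACC <- memory[100]
    ("ADD", 101),     # ACC <- ACC + memory[101]
    ("STORE", 102),   # memory[102] <- ACC
    ("HALT", None),
] + [0] * 96 + [3, 4, 0]     # data region: memory[100]=3, memory[101]=4

pc, acc = 0, 0               # program counter and accumulator
while True:
    op, addr = memory[pc]    # fetch
    pc += 1                  # sequential execution: advance by default
    if op == "LOAD":         # decode and execute
        acc = memory[addr]
    elif op == "ADD":
        acc += memory[addr]
    elif op == "STORE":
        memory[addr] = acc
    elif op == "HALT":
        break

print(memory[102])           # 7
```

Every piece of work funnels through this single sequential loop, which is exactly why the architecture favors control logic over bulk parallel arithmetic.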

The CPU cannot provide parallel computation over large volumes of matrix data, but the GPU can.

2. What is a GPU?

The graphics processing unit (GPU), also known as the display core, visual processor, or display chip, is a microprocessor that specializes in image and graphics operations on personal computers, workstations, game consoles, and some mobile devices (such as tablets and smartphones).

The GPU makes the graphics card less dependent on the CPU and takes over part of what used to be the CPU's work, especially 3D graphics processing. Core GPU technologies include hardware T&L (geometry transform and lighting), cubic environment mapping and vertex blending, texture compression and bump mapping, and a dual-texture four-pixel 256-bit rendering engine; hardware T&L in particular can be regarded as the defining mark of the GPU. The main GPU vendors are NVIDIA and ATI.

The GPU's structure is comparatively simple: a large number of compute units and a very deep pipeline, which makes it particularly suited to processing large volumes of data of a uniform type. But a GPU cannot work alone; it must be controlled by the CPU. The CPU can act on its own to handle complex logic and varied data types, and when a large amount of uniformly typed data needs processing, it calls on the GPU for parallel computation.

[Figure: CPU vs. GPU chip layout]

In a CPU, less than 20% of the chip area is ALUs; in a GPU, more than 80% is. That is, the GPU has far more ALUs available for data-parallel processing.
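As a rough illustration of what that ratio means, here is the same matrix product written twice: first as a sequential triple loop, one multiply-accumulate at a time in the CPU style, then dispatched in one call to the GPU's many ALUs. The GPU path is a sketch that assumes a CUDA-capable GPU and the CuPy package; any comparable array library would do:

```python
import numpy as np

# CPU style: strictly sequential -- one multiply-accumulate per iteration,
# the control unit stepping through the loops.
def matmul_sequential(A, B):
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]   # one MAC at a time
    return C

A = np.random.rand(64, 64)
B = np.random.rand(64, 64)
C_cpu = matmul_sequential(A, B)

# GPU style: the whole product becomes one data-parallel kernel spread
# across thousands of ALUs (assumes a CUDA GPU and `pip install cupy`).
import cupy as cp
C_gpu = cp.asnumpy(cp.asarray(A) @ cp.asarray(B))

print(np.allclose(C_cpu, C_gpu))   # True, up to floating-point rounding
```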

Running ImageNet classification inference with neural network models built in Darknet (AlexNet, VGG-16, ResNet-152) on a Titan X GPU versus an Intel i7-4790K (4 GHz) CPU gives the following results:
[Figure: ImageNet inference times, Titan X GPU vs. i7-4790K CPU]

Note: The above data is from https://pjreddie.com/darknet/imagenet/#reference

As the numbers show, the GPU processes neural network data far more efficiently than the CPU.

In summary, the GPU has the following characteristics:

  • Multi-threaded, with a fundamentally multi-core parallel structure and a very large core count, supporting parallel computation over large amounts of data.

  • Higher memory-access speed.

  • Higher floating-point compute capability.

This makes the GPU better suited than the CPU to deep learning's large training sets, large matrices, and convolution operations.

Although the GPU dominates in parallel computing, it cannot work alone: it needs the CPU's cooperation, and building the neural network model and moving the data streams still happen on the CPU. It also suffers from high power consumption and large physical size.

The higher a GPU's performance, the larger its size, the higher its power consumption, and the higher its price, which rules it out for small and mobile devices.

Hence the birth of the NPU, a dedicated chip that is small, low-power, and high in both computing performance and computing efficiency.

3. What is an NPU?

The embedded neural-network processor (NPU) adopts a "data-driven parallel computing" architecture and is particularly good at processing massive multimedia data such as video and images.
The NPU is designed specifically for IoT artificial intelligence: it accelerates neural network computation and solves the inefficiency of traditional chips on this workload. In the GX8010, the CPU and the MCU each have an NPU; the one in the MCU is smaller and is customarily called the SNPU.

The NPU contains modules for multiply-accumulate, activation functions, two-dimensional data operations, and decompression:

  • The multiply-accumulate module computes matrix multiply-add, convolution, dot product, and similar functions. The NPU has 64 MACs, the SNPU 32.

  • The activation-function module implements the network's activation functions by parameter fitting of up to 12th order; the NPU has 6 of these units, the SNPU 3 (see the fitting sketch after this list).

  • The two-dimensional data module implements operations on a plane, such as down-sampling and plane data copies; the NPU and SNPU each have 1.

  • The decompression module decompresses the weight data. To cope with the small memory bandwidth of IoT devices, the NPU compiler compresses the network's weights, achieving 6-10x compression without affecting accuracy (a toy version also appears after this list).
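As a toy illustration of the activation-function module's approach (not the NPU's actual coefficients or number format), a 12th-order polynomial fit of an activation function takes only a few lines of Python:

```python
import numpy as np

# Fit tanh on a fixed input range with a 12th-order polynomial. A
# fitting-based activation unit stores only the coefficients and then
# evaluates them with multiply-adds -- no exp() hardware required.
x = np.linspace(-4, 4, 2001)
target = np.tanh(x)

poly = np.polynomial.Polynomial.fit(x, target, deg=12)  # 13 coefficients
approx = poly(x)

print("max abs error on [-4, 4]:", float(np.max(np.abs(approx - target))))
```

Likewise, the effect of the weight compression/decompression pair can be imitated with int8 quantization plus entropy coding. This is only a sketch; a real NPU compiler also exploits sparsity and weight clustering to reach the 6-10x ratios quoted above:

```python
import zlib
import numpy as np

# Toy weight compression: quantize float32 weights to int8 (4x smaller),
# then entropy-code the bytes; the NPU's decompression module would
# reverse this on the fly while streaming weights from memory.
w = np.random.randn(256, 256).astype(np.float32)

scale = float(np.abs(w).max()) / 127.0
w_q = np.round(w / scale).astype(np.int8)          # 1 byte per weight
packed = zlib.compress(w_q.tobytes(), level=9)     # entropy coding on top

print("compressed %d bytes to %d (%.1fx)"
      % (w.nbytes, len(packed), w.nbytes / len(packed)))
```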

The NPU works by modeling neurons and synapses at the circuit level and processing them directly with a deep learning instruction set, so that one instruction completes the processing of a whole group of neurons. Compared with the CPU and GPU, the NPU fuses storage and computation through its synaptic weights, which raises operating efficiency.
[Figure: NPU working principle]
Because the NPU is built in the image of a biological neural network, a workload that costs a CPU or GPU thousands of instructions per group of neurons takes the NPU only one or a few, giving it a clear advantage in deep learning processing efficiency.

Experimental results show that, at the same power consumption, the NPU delivers 118 times the performance of the GPU.

Like the GPU, the NPU requires the CPU's cooperation to complete specific tasks. Below, let's look at how the GPU and NPU work together with the CPU.

GPU acceleration

At present the GPU performs only the straightforward parallel matrix multiply-add operations; building the neural network model and moving the data streams are still done on the CPU.
[Figure: CPU/GPU cooperative workflow]

The CPU loads the weight data, builds the neural network model according to the code, and passes each layer's matrix operations to the GPU through the CUDA or OpenCL library interfaces for parallel computation; it then schedules the matrix computation of the next group of neurons, layer after layer, until the network's output layer has been computed and the final result obtained.
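A minimal sketch of that division of labor, written with PyTorch as one common way to drive CUDA from Python (the layer sizes here are arbitrary): the CPU builds and holds the model, the GPU runs the per-layer matrix kernels, and the result returns to the CPU.

```python
import torch

# CPU side: build the model and prepare the input.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)
x = torch.randn(64, 1024)

# Ship weights and input to the GPU; each Linear layer's matrix multiply
# now runs as a parallel CUDA kernel (falls back to CPU if none exists).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
y = model(x.to(device))

# Results come back to the CPU for whatever logic follows.
print(y.cpu().shape)   # torch.Size([64, 10])
```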

4. What is a TPU?

The TPU (Tensor Processing Unit) is a chip custom-built for machine learning. It is tailored to deep learning workloads and delivers higher performance per watt.

Google built it to accelerate its second-generation AI system, TensorFlow, far more efficiently than a GPU can (Google's deep neural networks are driven by the TensorFlow engine). Because the TPU is tailor-made for machine learning, each operation needs fewer transistors, so it is naturally more efficient.

Compared with CPUs and GPUs of the same period, the TPU offers a 15-30x improvement in performance and a 30-80x improvement in efficiency (performance per watt).

Per watt, the TPU delivers more machine learning capability than any commercial GPU or FPGA of its time, roughly the level of technology seven years ahead. Because it is developed specifically for machine learning, the chip tolerates reduced computation precision, which means each operation requires fewer transistors; that lets users run more sophisticated and powerful machine learning models, apply them quickly, and still obtain accurate results.
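The reduced-precision trade-off can be imitated in a few lines: quantize the operands to int8, multiply with cheap integer MACs, and rescale afterwards. This is a generic sketch of 8-bit inference arithmetic, not the TPU's actual quantization scheme:

```python
import numpy as np

# Quantize to int8, do the matrix product with integer multiply-adds,
# then rescale -- trading a small accuracy loss for far fewer
# transistors (and far less energy) per operation.
def quantize(x):
    scale = float(np.abs(x).max()) / 127.0
    return np.round(x / scale).astype(np.int8), scale

A = np.random.randn(128, 64).astype(np.float32)
B = np.random.randn(64, 32).astype(np.float32)

A_q, sa = quantize(A)
B_q, sb = quantize(B)
C_int = A_q.astype(np.int32) @ B_q.astype(np.int32)  # integer MACs
C_approx = C_int * (sa * sb)                          # back to float scale

print("max abs error vs float32:", float(np.abs(C_approx - A @ B).max()))
```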

Appendix:

The respective characteristics of CPU/GPU/NPU/FPGA
[Figure: comparison of CPU/GPU/NPU/FPGA characteristics]

  • APU - Accelerated Processing Unit, AMD's accelerated image-processing chip products.

  • BPU - Brain Processing Unit, the embedded processor architecture led by Horizon Robotics.

  • CPU - Central Processing Unit, the core of today's mainstream PCs.

  • DPU - Deep-learning Processing Unit, first proposed by China's DeePhi Tech; the abbreviation is also used for Wave Computing's Dataflow Processing Unit AI architecture, and for the Data-storage Processing Unit in the intelligent SSD processors of Shenzhen's DapuStor.

  • FPU - Floating-point Processing Unit, the floating-point arithmetic module inside a general-purpose processor.

  • GPU - Graphics Processing Unit, a graphics processor with a multi-threaded SIMD architecture, born for graphics processing.

  • HPU - Holographic Processing Unit, Microsoft's holographic computing chip and device.

  • IPU - Intelligence Processing Unit, an AI processor produced by Graphcore, a company with investment from DeepMind.

  • MPU/MCU - Microprocessor Unit / Microcontroller Unit, RISC-architecture parts generally used for low-compute applications, such as the ARM Cortex-M series.

  • NPU - Neural-network Processing Unit, a general term for processors built around neural network algorithms and their acceleration, such as the DianNao series from the Institute of Computing Technology, Chinese Academy of Sciences / Cambricon.

  • RPU - Radio Processing Unit, Imagination Technologies' single-chip processor integrating Wi-Fi/Bluetooth/FM and a processor.

  • TPU - Tensor Processing Unit, Google's dedicated processor for accelerating AI algorithms. The first generation targets inference; the second generation also handles training.

  • VPU - Vector Processing Unit, the accelerated computing core of Movidius' (acquired by Intel) dedicated chips for image processing and AI.

  • WPU - Wearable Processing Unit, a wearable system-on-chip from Ineda Systems, integrating a GPU, a MIPS CPU, and other IP.

  • XPU - an FPGA-based intelligent cloud accelerator with 256 cores, announced by Baidu and Xilinx at Hot Chips 2017.

  • ZPU - Zylin Processing Unit, a 32-bit open-source processor from the Norwegian company Zylin.

Source: blog.csdn.net/mao_hui_fei/article/details/113811783