Why is the TPU 15-30 times faster than an ordinary CPU/GPU combination?

We gathered some background material to answer why the TPU is 15-30 times faster than an ordinary CPU/GPU combination. We also believe that Google's innovation in developing the TPU is likely to become a benchmark that Intel and AMD follow in their own hardware development, and ultimately a trend.

First, custom-built for deep learning

The TPU is a chip Google developed specifically to accelerate DNN computation; it is, in fact, an ASIC.

An ASIC (Application-Specific Integrated Circuit) is an IC customized to special specifications according to particular needs, designed and manufactured to the requirements of a specific user and a specific electronic system. In general, an ASIC strengthens one particular function; its design may need to be complex, but in return it offers higher processing speed and lower power consumption. Correspondingly, the production cost of an ASIC is very high.

It is generally difficult for a company to bear the cost and risk of developing a dedicated ASIC processor for deep learning. First, the best-performing semiconductor manufacturing process must be used, and a single chip run on the latest process now costs several million dollars, which is very expensive. Even with the money, assembling a team to design from scratch often takes more than a year, so the time to market is long and the risk is high. And if sufficient application scale cannot be reached, even a successfully developed chip lacks practical value. Therefore, companies generally prefer general-purpose chips (e.g., CPUs, GPUs) or semi-custom chips (FPGAs).

That Google dared to do its own custom development is, on the one hand, simply because it is rich enough to be willful; on the other hand, many of the services Google provides, including Google Image Search, Google Photos, the Google Cloud Vision API, and Google Translate, need deep neural networks. Given Google's enormous scale, developing a dedicated chip that launches directly into large-scale applications (amortizing the heavy R&D cost) is feasible.

Consider a scenario in which people use Google Voice Search for three minutes a day: if we ran the deep neural networks behind speech recognition on the processors we were then using, we would have had to double the number of Google data centers.

Our workloads are written in the high-level TensorFlow framework and are production-grade neural network applications (multilayer perceptrons, convolutional neural networks, and LSTMs); these applications account for 95% of the neural network inference demand in our data centers.
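A minimal sketch (with hypothetical layer sizes, not Google's production code) of why neural network inference is dominated by matrix multiplication: a multilayer perceptron is just matmuls plus elementwise nonlinearities, and convolutions and LSTM cells likewise reduce to (batched) matrix multiplies, which is exactly the operation the TPU's hardware accelerates.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, w1, b1, w2, b2):
    """Forward pass of a 2-layer MLP: relu(x @ w1 + b1) @ w2 + b2."""
    h = np.maximum(x @ w1 + b1, 0.0)   # fully connected layer + ReLU
    return h @ w2 + b2                  # fully connected output layer

batch, d_in, d_hidden, d_out = 8, 784, 256, 10
x  = rng.standard_normal((batch, d_in))
w1 = rng.standard_normal((d_in, d_hidden)); b1 = np.zeros(d_hidden)
w2 = rng.standard_normal((d_hidden, d_out)); b2 = np.zeros(d_out)

y = mlp_forward(x, w1, b1, w2, b2)
# Virtually all the arithmetic is multiply-accumulates inside the two matmuls:
macs = batch * d_in * d_hidden + batch * d_hidden * d_out
print(y.shape, macs)
```

Counting the multiply-accumulate (MAC) operations this way makes it clear why a chip built around a dense MAC array pays off on these workloads.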

Table 1: The six neural networks (two of each of three types) that account for 95% of the TPU's load. The table's columns are, in order: the neural network's name and type, the number of lines of code, the network's layers (FC is a fully connected layer, Conv a convolutional layer, Vector a vector layer, Pool a pooling layer), and each application's share of TPU deployment as of July 2016.

Compared with the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...), the TPU's deterministic execution model is a better match for the 99th-percentile response-time requirement of our neural network applications, because CPUs and GPUs mostly help average throughput rather than guarantee latency. The absence of those features helps explain why, even though the TPU has an enormous number of MACs and a large memory, it is relatively small and low-power.
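The throughput-versus-tail-latency distinction can be made concrete with synthetic numbers (illustrative only, not Google's measurements): two services with similar mean latency can have wildly different 99th percentiles, and latency-bound inference serving cares about the latter.

```python
import numpy as np

rng = np.random.default_rng(1)

# A "deterministic" service: tightly clustered latencies around 10 ms.
steady = rng.normal(10.0, 0.5, 100_000)
# A "bursty" service: slightly faster on average, but 2% of requests
# hit a slow path (cache miss, preemption, ...) and take ~80 ms.
bursty = np.concatenate([rng.normal(9.0, 0.5, 98_000),
                         rng.normal(80.0, 5.0, 2_000)])

for name, lat in [("steady", steady), ("bursty", bursty)]:
    print(name, round(lat.mean(), 1), round(np.percentile(lat, 99), 1))
```

The bursty service has a comparable mean but a far worse 99th percentile, which is why averaging-oriented hardware features do little for a p99 latency target.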

Block diagram of the TPU's modules. The main computation part is the yellow Matrix Multiply Unit in the upper right. Its inputs are the blue Weight FIFO and the blue Unified Buffer (UB); its output is the blue Accumulators (Acc). The yellow Activation unit applies nonlinear functions to the Acc results, which then flow back into the UB.
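The data path in the diagram can be sketched functionally (the shapes below are illustrative assumptions, not the real 256x256 hardware): weights arrive via the Weight FIFO, activations are read from the Unified Buffer, the Matrix Multiply Unit produces sums into the Accumulators, and the Activation unit's output flows back into the Unified Buffer for the next layer.

```python
import numpy as np

rng = np.random.default_rng(0)

unified_buffer = rng.standard_normal((8, 16))   # activations (batch x features)
# one weight tile per layer, queued as in the Weight FIFO:
weight_fifo = [rng.standard_normal((16, 16)) for _ in range(2)]

for w in weight_fifo:
    acc = unified_buffer @ w                 # Matrix Multiply Unit -> Accumulators
    unified_buffer = np.maximum(acc, 0.0)    # Activation unit (ReLU) -> back to UB

print(unified_buffer.shape)
```

The point of the loop is that activations never leave the chip between layers; only weights stream in.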

Second, large on-chip memory

The TPU places as much as 24MB of local memory on the chip, plus 6MB of accumulator memory and the memory used to interface with the host processor, together occupying 37% of the chip area (blue in the figure).

This means Google fully understood that off-chip memory access is the culprit behind the GPU's low energy efficiency, and so it paid the price in area to put a huge memory on the chip. By contrast, Nvidia's contemporaneous K80 has only 8MB of on-chip memory and must continually access off-chip DRAM.
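A back-of-the-envelope calculation shows why off-chip access dominates the energy budget. The per-access energies below are rough, illustrative figures of the kind quoted in the computer-architecture literature; they are assumptions for the sketch, not TPU measurements.

```python
# Rough, assumed per-operation energies (picojoules):
PJ_PER_MAC       = 0.2     # one 8-bit multiply-accumulate
PJ_PER_SRAM_READ = 5.0     # on-chip buffer access
PJ_PER_DRAM_READ = 640.0   # off-chip DRAM access

n_values = 1_000_000
compute_pj = n_values * PJ_PER_MAC
onchip_pj  = n_values * PJ_PER_SRAM_READ
offchip_pj = n_values * PJ_PER_DRAM_READ

ratio = offchip_pj / onchip_pj
print(f"off-chip / on-chip energy ratio: {ratio:.0f}x")
```

Under these assumptions each operand fetched from DRAM costs two orders of magnitude more energy than keeping it in an on-chip buffer, which is why spending 37% of the die on memory is a sensible trade.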

TPU chip floorplan. Blue data buffers take 37% of the chip; yellow compute takes 30%; green I/O takes 10%; and red control takes only 2%. The control section of a CPU or GPU is much larger (and very difficult to design).

Third, low-precision (8-bit) computation

The TPU's performance also comes from its tolerance for low operation precision.

Research results show that the accuracy loss caused by low-precision arithmetic is very small, while it brings great convenience to the hardware implementation: lower power consumption, faster operation units that occupy less chip area, and lower memory-bandwidth demand.

According to the released information, the TPU uses 8-bit low-precision arithmetic. That means each operation step requires fewer transistors; with the same total number of transistors, more operations can run per unit time, so more complex and powerful machine learning algorithms can be used to get smarter results, faster.
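Here is a minimal sketch of symmetric 8-bit quantization (a generic scheme for illustration, not necessarily the exact one the TPU uses): floats are mapped to int8 with a per-tensor scale, the multiply-accumulates are done in integers with 32-bit accumulation, as low-precision hardware typically does, and one rescale is applied at the end.

```python
import numpy as np

def quantize(x):
    """Symmetric per-tensor quantization of a float array to int8."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
a = rng.standard_normal((32, 64)).astype(np.float32)
w = rng.standard_normal((64, 16)).astype(np.float32)

qa, sa = quantize(a)
qw, sw = quantize(w)

# integer matmul with int32 accumulation, then a single rescale at the end
y_quant = (qa.astype(np.int32) @ qw.astype(np.int32)) * (sa * sw)
y_fp32 = a @ w

rel_err = np.abs(y_quant - y_fp32).max() / np.abs(y_fp32).max()
print(f"max relative error: {rel_err:.3%}")
```

The relative error stays small, which is the "low accuracy loss" the section describes, while every multiply now needs only 8-bit operands.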

In Google's tests, a Haswell Xeon E5-2699 v3 processor, with 18 cores running at 2.3 GHz, can process 1.3 TOPS using 64-bit floating-point math and provides 51 GB/s of memory bandwidth; the Haswell chip consumes 145 watts, and the full system (with 256 GB of memory) consumes 455 watts under load. By contrast, the TPU uses 8-bit integer math; with 256 GB of host memory and 32 GB of its own memory, it achieves 34 GB/s of memory bandwidth and a processing speed of up to 92 TOPS, 71 times that of the Haswell. Moreover, the TPU server's thermal power is only 384 watts.
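A quick sanity check on these figures, using only the numbers quoted in the paragraph above:

```python
# Figures quoted in the text above:
haswell_tops, haswell_system_watts = 1.3, 455.0
tpu_tops,     tpu_server_watts     = 92.0, 384.0

speedup = tpu_tops / haswell_tops
print(round(speedup))   # matches the "71 times" claim

# Performance per watt at the server level is an even larger gap:
perf_per_watt_ratio = (tpu_tops / tpu_server_watts) / (haswell_tops / haswell_system_watts)
print(round(perf_per_watt_ratio))
```

So the raw-throughput ratio checks out at about 71x, and in TOPS per watt the gap is even wider, roughly 84x under these numbers.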

Fourth, systolic data flow

For a GPU, fetching instructions and data from memory is time-consuming. The TPU does not even perform instruction fetch: the host processor supplies instructions to it, and the TPU carries out the corresponding operations based on the current instruction, which allows it to achieve higher computational efficiency.

In matrix multiplication and convolution, much of the data can be reused: the same data must be multiplied by a number of different weights and summed to obtain the final result. So, at any given moment, often only one or two new inputs need to be fetched from outside, while the remaining data is simply the previous moment's data shifted over.

In this situation, flushing all the data in on-chip memory to fetch new data would clearly be very inefficient. Targeting this computational pattern, the TPU adds support for systolic data flow: on each clock cycle, data shifts one step and one new datum is fetched. This maximizes data reuse and reduces the number of memory accesses, which relieves memory-bandwidth pressure and lowers the energy cost of memory access at the same time.
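The shift-and-reuse pattern can be made concrete with a cycle-accurate toy simulation of a weight-stationary systolic array (an illustration of the idea, not the TPU's actual 256x256 design). Weights stay fixed in a grid of processing elements; activations enter from the left edge on a skewed schedule and shift one PE to the right per cycle; partial sums flow downward and exit at the bottom. Each input element is fetched from "memory" exactly once.

```python
import numpy as np

def systolic_matmul(A, W):
    """Simulate A @ W on a K x N weight-stationary systolic array.
    A: (M, K) activations; W: (K, N) weights held fixed in the PEs."""
    M, K = A.shape
    K2, N = W.shape
    assert K == K2
    a_reg = np.zeros((K, N))   # activation held by each PE this cycle
    p_reg = np.zeros((K, N))   # partial sum produced by each PE this cycle
    out = np.zeros((M, N))
    for t in range(M + K + N):              # enough cycles to drain the array
        new_a = np.zeros_like(a_reg)
        new_p = np.zeros_like(p_reg)
        for k in range(K):
            for n in range(N):
                if n == 0:
                    m = t - k               # skewed feed: A[m,k] enters row k at cycle m+k
                    a_in = A[m, k] if 0 <= m < M else 0.0
                else:
                    a_in = a_reg[k, n - 1]  # shifted in from the left neighbour
                p_in = p_reg[k - 1, n] if k > 0 else 0.0  # partial sum from above
                new_a[k, n] = a_in
                new_p[k, n] = p_in + a_in * W[k, n]       # one multiply-accumulate
        a_reg, p_reg = new_a, new_p
        for n in range(N):                  # finished sums exit the bottom row
            m = t - (K - 1) - n
            if 0 <= m < M:
                out[m, n] = p_reg[K - 1, n]
    return out

rng = np.random.default_rng(0)
A, W = rng.standard_normal((5, 4)), rng.standard_normal((4, 3))
assert np.allclose(systolic_matmul(A, W), A @ W)
print("systolic result matches A @ W")
```

Note the ratio: M*K*N multiply-accumulates are completed in only about M+K+N cycles, because all K*N PEs work in parallel on data that is shifted between neighbors rather than re-fetched from memory.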

Fifth, enhanced heat dissipation

In terms of performance, the two factors that limit a processor's maximum speed are logic-gate delay and heat, of which heat is the more important limit. Most current processors use CMOS technology, which dissipates energy on every clock cycle, so the faster the clock, the greater the heat. The figure below shows the relationship between CPU clock frequency and power consumption; as it shows, a chip's power consumption grows dramatically faster than its operating speed.
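The standard textbook model for CMOS dynamic power (an approximation brought in here for illustration, not a figure from the article) is P = alpha * C * V^2 * f. Because the supply voltage V must itself be raised to sustain a higher frequency f, power grows roughly with the cube of the clock speed when V scales linearly with f.

```python
def dynamic_power(f_ghz, v_volts, c_eff=1.0, alpha=1.0):
    """Textbook CMOS dynamic-power model: P = alpha * C * V^2 * f."""
    return alpha * c_eff * v_volts**2 * f_ghz

base = dynamic_power(f_ghz=2.0, v_volts=1.0)
# Double the frequency, scaling voltage proportionally (an idealized assumption):
fast = dynamic_power(f_ghz=4.0, v_volts=2.0)
print(fast / base)  # 8.0 -> 2x the speed for ~8x the power
```

This is why simply raising the clock is such a poor trade, and why architectural speedups like the TPU's are attractive.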

While reducing power consumption, the TPU also further optimizes its heat dissipation. As can be seen from the TPU's exterior, a large metal plate protrudes from it, which helps carry away the large amount of heat generated when the TPU runs at high speed.

Sixth, continuous hardware and software optimization

Google believes there is still plenty of room to optimize the TPU's hardware and software; for example, if the TPU adopted the GDDR5 memory used in the NVIDIA K80 GPU, it could deliver even better performance.

In addition, Google's engineers have developed software called CNN1 for the TPU that makes it 70 times faster than an ordinary CPU!

Origin blog.csdn.net/kebu12345678/article/details/104081108