With the rapid development of deep learning, it became clear that existing general-purpose processors could not keep up with the enormous amount of computation involved, and many ASIC designs targeting this workload appeared. Google's Tensor Processing Unit (hereinafter TPU) was an early and representative design: its acceleration unit is built around a systolic array for matrix computation and accelerates neural-network inference very effectively. This series of articles takes the publicly available information on TPU V1, applies necessary simplifications and some speculative modifications, and actually implements a simple version of the Google TPU, in order to understand the strengths and limitations of the TPU more precisely.
Writing a Simple Version of the Google TPU: Series Directory
- Google TPU Overview and Simplification
- Systolic Arrays and Their Implementation in the TPU
- Hardware Implementation of Normalization and Pooling in Neural Networks
- Instruction Parallelism and Data Parallelism in the TPU
- SimpleTPU Design and Performance Evaluation
- SimpleTPU Example: Image Classification

Extensions

- The Boundaries of the TPU (planned)
- Rethinking Parallelism in Deep Neural Networks (planned)
1. Completing the SimpleTPU Design
The block diagram of SimpleTPU was already given in Google TPU Overview and Simplification, as shown in the figure.
Systolic Arrays and Their Implementation in the TPU described the multiply array (Fig. 4), the main unit for matrix multiplication and convolution, and completed the code and a simple verification of that hardware block. Hardware Implementation of Normalization and Pooling in Neural Networks described how normalization and pooling in convolutional neural networks are realized (Fig. 6), discussed the process of converting floating-point networks to fixed point, and gave the treatment of weights in SimpleTPU; the code for that hardware block was likewise completed and verified.
Instruction Parallelism and Data Parallelism in the TPU analyzed and discussed the overall architecture of the processing unit, covering both instruction-level and data-level parallelism. Applying the design ideas from that article to make full use of the computation units described in Systolic Arrays and Their Implementation in the TPU and Hardware Implementation of Normalization and Pooling in Neural Networks yields the final, complete SimpleTPU design. From the block diagram, the functions SimpleTPU needs to implement clearly include
- instruction fetch and decode (Fig. 4)
- weight reading (Fig. 2)
- control and scheduling of the execution units (Fig. 1)
- reading the image in and writing the results back (Fig. 5)
In the SimpleTPU design, instruction fetch/decode and weight reading are functionally simple; refer directly to the code.
The control and scheduling of the execution units must ensure that the units can work together in parallel whenever there are no data dependences between their computations.
In addition, reading the image in and writing the results back need to be implemented separately. SimpleTPU focuses only on the core computation, so this part is not optimized, and its running time is excluded from the subsequent performance analysis.
At this point, the basic design of SimpleTPU is complete; the code can be found at https://github.com/cea-wind/SimpleTPU .
2. Characteristics of SimpleTPU
The key features of SimpleTPU include
- INT8 multiplication with INT32 accumulation
- instruction-level parallelism via VLIW
- data-level parallelism via a vector architecture
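As a minimal sketch of the first feature (this is illustrative only, not the SimpleTPU source; the `PE` name is invented here), each cell of the multiply array performs an INT8 × INT8 multiply and accumulates the product into an INT32 register, so long dot products do not overflow prematurely:

```cpp
#include <cstdint>

// One hypothetical processing element (PE) of a 32x32 multiply array:
// INT8 operands, INT32 accumulator.
struct PE {
    int32_t acc = 0;

    // Multiply-accumulate one weight/activation pair.
    // Widening to int32 BEFORE the multiply avoids int8 overflow;
    // the int32 accumulator can absorb thousands of worst-case
    // products (127 * 127 = 16129) before overflowing.
    void mac(int8_t weight, int8_t activation) {
        acc += static_cast<int32_t>(weight) * static_cast<int32_t>(activation);
    }
};
```

In an HLS flow, an array of such elements would be unrolled into parallel hardware multipliers, which is why the narrow INT8 datapath matters for area and frequency.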
SimpleTPU follows the design ideas of Google TPU V1 and can perform most of the computation of the neural-network inference process. According to the design, the supported operations include (in theory)
| Operation | Explanation |
| --- | --- |
| Conv3d | in_channels: limited by resources; out_channels: limited by resources; kernel_size: almost unlimited; stride: almost unlimited; padding: almost unlimited; dilation: almost unlimited; groups: very limited support (architectural limitation); bias: supported |
| ConvTranspose3d | Same as above |
| Maxpool2d | kernel_size: almost unlimited; stride: almost unlimited; padding: almost unlimited |
| Avgpool2d | Same as above |
| Relu | The only supported nonlinearity |
| BatchNorm2d | Fused into Conv or Pool at inference time |
| Linear | Limited by resources |
| UpscalingNearest2D | Completed by multiple calls to the pooling unit |
| UpscalingBilinear2D | Completed by multiple calls to avgpool |
Here, "limited by resources" means the parameter range is constrained mainly by SimpleTPU's storage design, while "almost unlimited" means the value is constrained only by register bit width. Due to architectural limitations, SimpleTPU has very limited support for grouped convolution; with unsuitable parameters its efficiency may be far lower than that of ordinary convolution. Similarly, the Google TPU does not support grouped convolution well either, and explicitly states that it does not support depthwise convolution (the extreme case of grouped convolution).
At inference time, BatchNorm2d is in fact a pointwise multiply and add: the add can be fused into the convolution computation of the adjacent layer, and the multiply can be fused into the pooling computation. In the SimpleTPU design, the pooling module is therefore always active; even if the network has no pooling layer, SimpleTPU inserts an equivalent 1×1, stride-1 pooling layer.
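The fusion works because, with the statistics frozen at inference time, BatchNorm2d collapses to a per-channel affine transform y = scale·x + bias. The following sketch (illustrative names, not SimpleTPU code) computes the folded constants that would be absorbed into the neighboring conv/pool:

```cpp
#include <cmath>

// Inference-time BatchNorm2d:
//   y = gamma * (x - mean) / sqrt(var + eps) + beta
// collapses to y = scale * x + bias per channel, where
//   scale = gamma / sqrt(var + eps)
//   bias  = beta - scale * mean
// The multiply can then be fused into an adjacent multiply stage
// (e.g. pooling) and the add into an adjacent bias add.
struct FoldedBN {
    float scale;
    float bias;
};

FoldedBN fold_batchnorm(float gamma, float beta,
                        float mean, float var,
                        float eps = 1e-5f) {
    float scale = gamma / std::sqrt(var + eps);
    return {scale, beta - scale * mean};
}
```

Applying `scale * x + bias` is numerically identical to running the original normalization, which is why a dedicated BatchNorm unit is unnecessary in hardware.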
The upscaling operations are computed via pooling. This is because the supported reshape operations are free in SimpleTPU, and the pooling unit can perform the arithmetic of bilinear interpolation, so all the values needed for upscaling can be computed. Upscaling can thus be understood as being completed by pooling + reshape.
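To make the pooling + reshape idea concrete, here is a 1-D illustration (my own sketch, not the SimpleTPU code): 2× bilinear upscaling interleaves the original samples with the midpoints of neighboring samples, and those midpoints are exactly what an average pool with kernel 2 and stride 1 produces; the interleaving itself is the free reshape.

```cpp
#include <vector>

// 1-D bilinear 2x upscaling (align-corners style) expressed with an
// average pool. For input [a, b, c] the result is
// [a, (a+b)/2, b, (b+c)/2, c]: originals interleaved with the outputs
// of avgpool(kernel=2, stride=1).
std::vector<float> upscale2x_bilinear_1d(const std::vector<float>& x) {
    std::vector<float> out;
    for (std::size_t i = 0; i + 1 < x.size(); ++i) {
        out.push_back(x[i]);                        // original sample (reshape/interleave)
        out.push_back(0.5f * (x[i] + x[i + 1]));    // avgpool window of 2
    }
    out.push_back(x.back());
    return out;
}
```

Nearest-neighbor upscaling is even simpler under this view: it only needs the originals replicated, i.e. pure reshape plus trivial pooling.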
3. SimpleTPU Performance
SimpleTPU has a 32×32 INT8 multiply array for matrix multiplication and convolution, and a 1×32 INT32 multiply array for pooling and normalization. According to the synthesis results of the Xilinx HLS tools on UltraScale+ FPGA devices, the operating frequency can reach 500 MHz. SimpleTPU's compute capability is therefore roughly
32 × 32 × 500 MHz × 2 ≈ 1 TOPS
By comparison, Google TPU V1 delivers about 91 TOPS (INT8). The main differences are that SimpleTPU's array is 1/64 the size, and the operating frequency achievable on an FPGA is lower than that of an ASIC.
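The peak figure above can be checked by arithmetic alone (counting each multiply-accumulate as two operations, one multiply and one add, as is conventional):

```cpp
// Peak throughput of a 32x32 MAC array at 500 MHz.
// ops/s = (number of MAC units) * (clock rate) * (2 ops per MAC)
constexpr double kClockHz  = 500e6;
constexpr double kMacUnits = 32.0 * 32.0;          // 1024 multipliers
constexpr double kPeakOps  = kMacUnits * kClockHz * 2.0;
// kPeakOps = 1.024e12, i.e. just over 1 TOPS
```

The 91 TOPS of TPU V1 follows the same arithmetic with a 256×256 array (65,536 MACs, 64× more) at around 700 MHz, which accounts for the gap.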
According to the design, SimpleTPU can run at very high efficiency on suitable tasks; Instruction Parallelism and Data Parallelism in the TPU gives a more detailed discussion. Macroscopically, SimpleTPU's various execution units can operate in a pipelined, parallel fashion.
For networks dominated by fully connected and convolutional layers, the targeted design allows the multiply array and the vector computation units to complete a multiply-add effectively every clock cycle; this means that, compared with a CPU, SimpleTPU can achieve very high efficiency.