Simple TPU design and performance evaluation

With the rapid development of deep learning, it became clear that existing general-purpose processors could not keep up with the enormous amount of computation required by neural networks, and many ASIC designs targeting this workload began to appear. Google's Tensor Processing Unit (hereinafter TPU) is an early and representative design of this class: its matrix computation unit, based on a systolic array, accelerates neural network computation very effectively. This series of articles uses the publicly available information on TPU V1, with necessary simplifications and some speculative modifications, to actually implement a simple version of the Google TPU, in order to understand the strengths and limitations of the TPU more precisely.

Series directory: writing a simple version of the Google TPU

    Google TPU overview and simplification

    The systolic array in the TPU and its implementation

    Hardware implementation of normalization and pooling in neural networks

    Instruction parallelism and data parallelism in the TPU

    SimpleTPU design and performance evaluation

    An example with SimpleTPU: image classification

    Extensions

    The boundaries of the TPU (planned)

    Revisiting parallelism in deep neural networks (planned)

1. The complete SimpleTPU design

    The block diagram of SimpleTPU was given in "Google TPU overview and simplification", as shown in the figure below.

[Figure: SimpleTPU block diagram]

    "The systolic array in the TPU and its implementation" described the main unit for matrix multiplication and convolution, the multiply-add array, and completed the hardware code for this part along with a simple verification. "Hardware implementation of normalization and pooling in neural networks" described how normalization and pooling in convolutional neural networks are implemented, discussed how a floating-point network is converted to fixed point, and gave the implementation used in SimpleTPU; the hardware code for this part was likewise completed and verified.

    "Instruction parallelism and data parallelism in the TPU" analyzed and discussed the overall architecture of the processing unit, covering both instruction-level parallelism and data-level parallelism. Taking the design ideas presented there and combining them with the computation units described in the systolic array and normalization/pooling articles is what completes the final SimpleTPU design. From the block diagram of SimpleTPU, the functions that still need to be implemented include

  • Instruction fetch and decode
  • Weight reading
  • Control and scheduling of each execution unit
  • Reading the input image and writing back the results

    In the SimpleTPU design, instruction fetch/decode and weight reading are functionally straightforward; refer directly to the code for these parts.
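    As a rough illustration of the fetch-and-decode step, the sketch below splits a hypothetical 64-bit VLIW instruction word into per-unit fields. The field layout is purely an assumption for illustration and is not SimpleTPU's actual encoding; see the repository for the real format.

```cpp
#include <cstdint>

// Hypothetical 64-bit VLIW word: fields for the multiply array, the
// pooling/normalization unit, and the load/store addresses. This layout
// is illustrative only, not SimpleTPU's actual instruction encoding.
struct DecodedInst {
    uint8_t  mxu_op;     // opcode for the multiply-add array
    uint8_t  pool_op;    // opcode for the pooling/normalization unit
    uint16_t load_addr;  // feature-map read address
    uint16_t store_addr; // result write-back address
    uint16_t length;     // number of vectors to process
};

DecodedInst decode(uint64_t word) {
    DecodedInst d;
    d.mxu_op     = (word >> 56) & 0xFF;
    d.pool_op    = (word >> 48) & 0xFF;
    d.load_addr  = (word >> 32) & 0xFFFF;
    d.store_addr = (word >> 16) & 0xFFFF;
    d.length     =  word        & 0xFFFF;
    return d;
}
```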

    The control and scheduling of the execution units must ensure that the units can work together, running in parallel whenever there is no data dependence between them.
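    One common way to express this kind of scheduling in Xilinx HLS is a DATAFLOW region in which the units communicate only through streams, so that they overlap automatically. The sketch below is a minimal example of that pattern; the unit names, interfaces, and placeholder bodies are illustrative, not SimpleTPU's actual code.

```cpp
#include <hls_stream.h>
#include <cstdint>

void matrix_unit(hls::stream<int32_t>& out, int len) {
    for (int i = 0; i < len; ++i)
        out.write(i);            // placeholder for systolic-array results
}

void pool_unit(hls::stream<int32_t>& in, hls::stream<int32_t>& out, int len) {
    for (int i = 0; i < len; ++i)
        out.write(in.read());    // placeholder for pooling/normalization
}

void write_back(hls::stream<int32_t>& in, int32_t* mem, int len) {
    for (int i = 0; i < len; ++i)
        mem[i] = in.read();      // drain results to memory
}

void compute(int32_t* mem, int len) {
#pragma HLS DATAFLOW
    hls::stream<int32_t> s1, s2;
    // The three units share no state other than the streams, so HLS can
    // run them concurrently, with vectors flowing through the pipeline.
    matrix_unit(s1, len);
    pool_unit(s1, s2, len);
    write_back(s2, mem, len);
}
```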

    In addition, reading the input image and writing back the results need to be implemented separately. SimpleTPU focuses only on the core computation, so this part is not optimized, and its running time is excluded from the performance analysis that follows.
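    A minimal, deliberately unoptimized sketch of this path, assuming plain burst copies between external memory and on-chip buffers (the buffer size and interfaces are illustrative):

```cpp
#include <cstdint>

constexpr int BUF_WORDS = 4096; // illustrative on-chip buffer size

// Load the input image into an on-chip buffer (a burst read in HLS).
void load_image(const int8_t* ddr, int8_t buf[BUF_WORDS], int n) {
    for (int i = 0; i < n; ++i)
        buf[i] = ddr[i];
}

// Write the computed results back to external memory (a burst write).
void store_result(const int32_t buf[BUF_WORDS], int32_t* ddr, int n) {
    for (int i = 0; i < n; ++i)
        ddr[i] = buf[i];
}
```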

    At this point the basic design of SimpleTPU is complete; the code can be found at https://github.com/cea-wind/SimpleTPU .

2. Characteristics of SimpleTPU

    The key features of SimpleTPU include

  •     Supports INT8 multiplication and INT32 accumulation
  •     Uses VLIW for instruction-level parallelism
  •     Uses a vector architecture for data-level parallelism
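    The arithmetic these features imply is sketched below: INT8 multiplications accumulated into INT32, applied across a 32-wide vector. The loop is only a scalar model of one row of the array, not the actual systolic-array code.

```cpp
#include <cstdint>

constexpr int LANES = 32; // vector width, matching the 32-wide array

void mac_vector(const int8_t a[LANES], const int8_t b[LANES],
                int32_t acc[LANES]) {
    for (int i = 0; i < LANES; ++i) {
#pragma HLS UNROLL
        // int8 x int8 fits in int16; accumulation is widened to int32
        // so that long dot products do not overflow.
        acc[i] += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
    }
}
```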

    SimpleTPU follows the design ideas of Google TPU V1 and can perform most of the computation in the inference process of a neural network. By design, the supported operations include (in theory):

| Operation | Explanation |
|---|---|
| Conv3d | in_channels: resource-constrained; out_channels: resource-constrained; kernel_size: almost unlimited; stride: almost unlimited; padding: almost unlimited; dilation: almost unlimited; groups: very limited support (architectural limitation); bias: supported |
| ConvTranspose3d | Same as Conv3d |
| MaxPool2d | kernel_size: almost unlimited; stride: almost unlimited; padding: almost unlimited |
| AvgPool2d | Same as MaxPool2d |
| ReLU | The only nonlinear function supported |
| BatchNorm2d | At inference time, BatchNorm2d is fused into Conv or Pool |
| Linear | Resource-constrained |
| UpscalingNearest2D | Completed by multiple pooling calls |
| UpscalingBilinear2D | Completed by multiple avgpool calls |

    Here "resource-constrained" means the parameter's range is limited, mainly by the storage provisioned in the SimpleTPU design, while "almost unlimited" means the value is limited only by register bit widths. Due to architectural constraints, SimpleTPU's support for grouped convolution is very limited, and with unsuitable parameters its efficiency can be far lower than that of ordinary convolution; similarly, the Google TPU does not support group convolution well either, and explicitly states that it does not support depthwise convolution (the extreme case of group convolution).

    At inference time BatchNorm2d is in fact just a pointwise multiplication and addition; the addition can be fused into the convolution computation of the adjacent layer, while the multiplication can be fused with the pooling computation. In the SimpleTPU design the pooling module is in fact always active: even if the network has no pooling layer, SimpleTPU inserts an equivalent 1×1, stride-1 pooling layer.
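    As a sketch of this fusion: at inference, BN(x) = γ(x − μ)/√(σ² + ε) + β = a·x + b, with a = γ/√(σ² + ε) and b = β − a·μ. If the convolution unit performs the addition (through its bias) and the pooling unit applies a per-channel multiply, then adding b/a in the convolution stage and multiplying by a in the pooling stage reproduces a·x + b exactly. The helper below is a minimal model of that folding (names are illustrative, and it assumes γ ≠ 0):

```cpp
#include <cmath>

struct FoldedBN {
    float bias_delta; // b / a, folded into the convolution bias
    float scale;      // a, applied by the pooling/normalization unit
};

// Fold inference-time BatchNorm parameters into a bias term and a scale:
// a * (x + b/a) == a*x + b, so conv adds bias_delta and pooling multiplies
// by scale. Assumes gamma != 0.
FoldedBN fold_batchnorm(float gamma, float beta, float mean,
                        float var, float eps = 1e-5f) {
    float a = gamma / std::sqrt(var + eps);
    float b = beta - a * mean;
    return { b / a, a };
}
```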

    Upscaling operations are completed through the pooling computation. This is because reshape operations come at no cost in SimpleTPU, and the pooling unit can compute the interpolated values needed by bilinear interpolation, so all the values of an upscaling can be produced; in other words, upscaling can be understood as pooling + reshape, as the 1-D sketch below illustrates.
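    The following sketch shows the idea for 2× linear upscaling of a 1-D signal: the even outputs just copy the input (a reshape, which costs nothing in SimpleTPU), and the odd outputs are 2-tap averages, i.e. an average pooling with stride 1. Nearest-neighbor upscaling is the copy part alone.

```cpp
#include <vector>

// 2x linear upscaling of a 1-D signal, decomposed into reshape + avgpool.
std::vector<float> upscale2x_linear(const std::vector<float>& in) {
    if (in.empty()) return {};
    const int n = static_cast<int>(in.size());
    std::vector<float> out(2 * n - 1);
    for (int i = 0; i < n; ++i)
        out[2 * i] = in[i];                          // reshape/copy: free
    for (int i = 0; i + 1 < n; ++i)
        out[2 * i + 1] = 0.5f * (in[i] + in[i + 1]); // 2-tap avgpool, stride 1
    return out;
}
```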

   

3. SimpleTPU performance

    SimpleTPU was designed with a 32×32 int8 multiply-add array for matrix multiplication and convolution, and a 1×32 int32 multiply array for pooling and normalization. According to the synthesis results of the Xilinx HLS tools, on an UltraScale+ family FPGA device the operating frequency can reach 500 MHz. SimpleTPU's compute capability is therefore roughly

32 × 32 × 500 MHz × 2 ≈ 1 TOPS

    By comparison, Google TPU V1 provides about 92 TOPS (int8). The main differences are that SimpleTPU's multiply array is 1/64 the size of the TPU's, and that the operating frequency achievable on an FPGA is lower than that of an ASIC.

    By design, SimpleTPU can reach very high utilization on suitable tasks; "Instruction parallelism and data parallelism in the TPU" gives a more detailed description. Macroscopically, SimpleTPU's operating units can run in parallel in a pipelined fashion, i.e.

[Figure: pipelined execution of SimpleTPU's operating units]

    For the convolution layers and fully connected layers that account for the largest share of a network's computation, the targeted design of the multiply-add array and the vector-style computation allow a multiply-add to be completed effectively on every clock cycle; this means that, compared with a CPU, SimpleTPU can achieve very high efficiency.

Original post: https://www.cnblogs.com/sea-wind/p/11241018.html