Writing a Small Version of the Google TPU

The Google TPU is a well-designed matrix-computation accelerator that can speed up neural network computation considerably. This series of articles takes the public information on the TPU V1 (hereafter simply "TPU"), simplifies it, fills the gaps with educated guesses, and actually implements a simple version of the Google TPU. The plan is to go only as far as behavioral simulation, purely to understand the strengths and limitations of the TPU more precisely; there is no plan to take the implementation further onto FPGA hardware.

Series catalog

    Google TPU overview and simplification

    Basic unit - matrix multiplication array

    Basic unit - normalization and pooling (to be released)

    TPU instructions (pending)

    SimpleTPU example (planned)

    Extensions

    The boundaries of the TPU (planned)

    Re-examining parallelism in deep neural networks (planned)

 

1. TPU Design Analysis

    Most of the computation in artificial neural networks, including heavy multiply-add workloads such as three-dimensional convolution, can be reduced to matrix computation. Earlier processors of various kinds are built on one (or more) scalar/vector compute units at the bottom, and such processors cannot fully exploit the data reuse inherent in matrix computation; the Google TPU V1, by contrast, is a processing unit designed specifically for powerful matrix computation. According to Google's published paper, In-Datacenter Performance Analysis of a Tensor Processing Unit, the block diagram of the TPU V1 is shown below.

[Figure: TPU V1 block diagram]

    The part of the block diagram that draws the most attention is the enormous Matrix Multiply Unit, whose 64K MACs deliver 92T int8 ops/s at an operating frequency of 700MHz. How such a matrix array carries out its computation will be elaborated in "Basic unit - matrix multiplication array". The key to the TPU design is making full use of this multiply-add array, keeping its utilization as high as possible.
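    As a quick sanity check on those headline figures (plain arithmetic, counting each multiply-accumulate as two operations):

$$256 \times 256\ \mathrm{MACs} \times 2\ \mathrm{ops/MAC} \times 700\,\mathrm{MHz} \approx 91.8 \times 10^{12} \approx 92\,\mathrm{T\ int8\ ops/s}$$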

    The other structural parts in the figure essentially exist to keep the matrix multiply array running and fed as much as possible, which leads to the following design decisions (the arithmetic behind items 1 and 4 is worked out after the list):

  1. The Local Unified Buffer provides 256 × 8b @ 700MHz of bandwidth (i.e. 167GiB/s: 0.25KiB × 700M / 1024 / 1024 ≈ 167GiB/s), ensuring the compute unit is never left idle for lack of data;
  2. The Local Unified Buffer has 24MiB of space, which means intermediate results almost never have to leave the chip, so compute throughput is not limited by data bandwidth;
  3. Each MAC in the Matrix Multiply Unit has two built-in weight registers; while one set of weights is being used for computation, a new set can be loaded into the other, masking the weight-load time;
  4. At 30GiB/s of weight bandwidth, loading a complete set of 256 × 256 weights takes about 1430 cycles, which means each set of weights must be used for at least 1430 cycles of computation, so the Accumulators need a depth of 2K (1430 rounded up to a power of two; the paper gives the value 1350, and the reason for the difference is unclear);
  5. Since the MACs and the Activation module need to run at the same time, the Accumulators need double (ping-pong) storage, so their depth is designed as 4K.
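    Working through the numbers in items 1 and 4 (straightforward arithmetic on the quoted figures):

$$\frac{256\,\mathrm{B/cycle} \times 700\,\mathrm{MHz}}{1024^{3}} \approx 167\,\mathrm{GiB/s}, \qquad \frac{256 \times 256\,\mathrm{B}}{30\,\mathrm{GiB/s} \div 700\,\mathrm{MHz}} \approx \frac{65536\,\mathrm{B}}{46\,\mathrm{B/cycle}} \approx 1430\ \mathrm{cycles}$$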

    Therefore, from the hardware design point of view, as long as a workload performs on the order of 1400 operations per weight byte, the TPU can run at close to 100% of its theoretical compute efficiency. In actual operation, however, dependencies between memory accesses and computation (for example a Read After Write, where a read must wait for the preceding write to finish) and idle slots in the instruction pipeline will affect the actual performance to some degree.
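    The ~1400 figure is the same ratio seen from the workload side: the array retires 65536 MACs per cycle while weights arrive at about 46 bytes per cycle, so each weight byte must be reused for roughly

$$\frac{65536\ \mathrm{MACs/cycle}}{46\ \mathrm{B/cycle}} \approx 1430\ \mathrm{MACs\ per\ weight\ byte}$$

before its replacement has finished loading.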

    To this end, the TPU is designed with a set of instructions that control memory access and computation; the main instructions are

  • Read_Host_Memory
  • Read_Weights
  • MatrixMultiply/Convolve
  • Activation
  • Write_Host_Memory

    All of this is designed so that the matrix unit never sits idle: the intent is for every other instruction to be masked by the MatrixMultiply instruction. To that end the TPU uses a decoupled-access/execute design, which means that after a Read_Weights instruction is issued, MatrixMultiply can begin without waiting for Read_Weights to complete; if the weights or activations are not ready, the matrix unit stalls.
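    A minimal sketch of the decoupled-access/execute idea in HLS-style C++ (my own illustration at SimpleTPU's 32-wide scale, not the TPU's actual implementation): the fetch side runs ahead through a FIFO, and a blocking read models the stall.

#include "hls_stream.h"
#include "ap_int.h"

typedef ap_int<8>  int8;
typedef ap_int<32> int32;

// Fetch side: pushes weights into a FIFO; issue does not wait for completion.
void read_weights(const int8 weight_mem[32], hls::stream<int8> &w_fifo) {
    for (int i = 0; i < 32; ++i) {
#pragma HLS PIPELINE II=1
        w_fifo.write(weight_mem[i]);
    }
}

// Execute side: hls::stream::read() blocks while the FIFO is empty,
// which is exactly "the matrix unit stalls if weights are not ready".
void matrix_multiply(hls::stream<int8> &w_fifo, const int8 act[32], int32 &acc) {
    int32 sum = 0;
    for (int i = 0; i < 32; ++i) {
#pragma HLS PIPELINE II=1
        sum += w_fifo.read() * act[i];
    }
    acc = sum;
}

// Top level: DATAFLOW lets fetch and execute overlap in time.
void decoupled_top(const int8 weight_mem[32], const int8 act[32], int32 &acc) {
#pragma HLS DATAFLOW
    hls::stream<int8> w_fifo("w_fifo");
    read_weights(weight_mem, w_fifo);
    matrix_multiply(w_fifo, act, acc);
}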

    Note that a single instruction can execute for thousands of cycles, so the TPU design makes no attempt to hide idle slots in the pipeline: a pipeline bubble of a few dozen cycles, amortized over thousands of cycles, costs well under 1% of the final performance.

    The details of the instruction set are still not particularly clear; more details will be discussed and supplemented later.

2. Simplifying the TPU

    Implementing a complete TPU is rather too complicated. To reduce the workload and make the project feasible, the TPU needs a series of simplifications; to keep the distinction clear, the simplified TPU is hereafter called SimpleTPU. None of the simplifications should lose what is essential to the TPU's own design.

    The TPU has various hardware interfaces for exchanging data, including a PCIe interface and a DDR interface; the design here does not deal with these standard hardware interfaces, and all data exchange is completed through AXI interfaces. Only the TPU's internal computation is of interest, i.e. the part inside the red box in the figure below.

[Figure: TPU block diagram with the implemented portion marked in red]

    The TPU is also too large: its multiplier array is 256 × 256, which would make both synthesis and debugging very difficult. The matrix multiply unit is therefore reduced to 32 × 32, and the widths of the remaining data paths are modified accordingly. The modifications include

Resource               TPU              SimpleTPU
Matrix Multiply Unit   256 × 256        32 × 32
Accumulators RAM       4K × 256 × 32b   4K × 32 × 32b
Unified Buffer         96K × 256 × 8b   16K × 32 × 8b
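    In code, these scaled-down sizes can be pinned down as compile-time constants; the identifiers below are my own shorthand, not names from the original project:

// SimpleTPU dimensions (TPU V1 values for comparison); names are illustrative.
const int MXU_SIZE  = 32;        // matrix multiply unit is 32 x 32   (TPU: 256 x 256)
const int ACC_DEPTH = 4 * 1024;  // accumulator RAM depth             (TPU: 4K)
const int ACC_LANES = 32;        // lanes of 32b accumulators         (TPU: 256)
const int UB_DEPTH  = 16 * 1024; // unified buffer depth              (TPU: 96K)
const int UB_LANES  = 32;        // lanes of 8b entries               (TPU: 256)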

    Because the Weight FIFO is difficult to implement (hard to describe in C), weights are instead stored in a 1K × 32 × 8b BRAM, used in ping-pong fashion, as sketched below;
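    A minimal sketch of that ping-pong arrangement (layout and naming are assumptions, not SimpleTPU's actual code): while the matrix unit computes from one bank, new weights fill the other.

#include "ap_int.h"
typedef ap_int<8> int8;

// Two 1K x 32 x 8b banks; `sel` is the bank the matrix unit is reading.
void load_idle_bank(const int8 new_w[1024][32],
                    int8 bank[2][1024][32], bool sel) {
    for (int i = 0; i < 1024; ++i) {
        for (int j = 0; j < 32; ++j) {
#pragma HLS PIPELINE II=1
            bank[!sel][i][j] = new_w[i][j];  // fill the bank not in use
        }
    }
}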

    Because the Matrix Multiply Unit and the Accumulators are so tightly coupled, SimpleTPU merges them into a single module;

    Because Activation and Normalize/Pool are likewise tightly coupled, SimpleTPU merges them into a single module (the TPU itself may well do the same), and only the ReLU activation function is supported, as sketched below;
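    A minimal sketch of ReLU fused into the accumulator read-out path; the shift standing in for re-quantization back to int8 is my assumption, since the paper does not specify that step.

#include "ap_int.h"
typedef ap_int<32> int32;
typedef ap_int<8>  int8;

// ReLU on the 32b accumulator, then scale and saturate back to int8.
int8 relu_requant(int32 acc, int shift) {
    int32 v = (acc > 0) ? int32(acc >> shift) : int32(0);
    if (v > 127) v = 127;   // saturate to the int8 maximum
    return int8(v);
}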

    Since it is unclear exactly what the Systolic Data Setup module does, SimpleTPU drops it; instead, SimpleTPU uses a different approach that is both flexible and simple, completing convolution purely through the design of the read addresses, as illustrated below;
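    A minimal illustration of the address-based idea (my own, not SimpleTPU's actual code): a K × K convolution becomes K × K shifted reads of a flat buffer, so no data-rearrangement hardware is needed, only address generation.

// Convolution at output pixel (r, c) over a row-major feature map.
int conv_at(const signed char *ub, const signed char *w,
            int r, int c, int width, int K) {
    int acc = 0;
    for (int kr = 0; kr < K; ++kr)
        for (int kc = 0; kc < K; ++kc)
            // the shifted read address replaces any systolic data setup
            acc += ub[(r + kr) * width + (c + kc)] * w[kr * K + kc];
    return acc;
}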

    Since exchanging intermediate results with off-chip memory would complicate instruction generation, the computation is assumed never to access off-chip memory during a run (this also matches the TPU's own design philosophy; but since the Unified Buffer has shrunk to 1/24 of the original size, only smaller models can run under this constraint);

    Since TPU V1 gives no concrete implementation for the addition operation in ResNet, SimpleTPU does not support ResNet-related operations either, though it can support the channel concat operation (there are several ways to implement a residual connection, but all of them require extra logic and seem to break the existing structure);

    The simplified block diagram is shown below; the modules remain essentially the same.

[Figure: SimpleTPU block diagram]

3. Implementation Based on Xilinx HLS

    Chip development normally uses a hardware description language (HDL) such as VHDL or Verilog HDL for development and verification. To improve coding efficiency and keep the code understandable, however, SimpleTPU attempts to describe the underlying hardware in C, and translates the C code into HDL through high-level synthesis (HLS). Having used Xilinx's HLS tools before, I continue to use Xilinx HLS here; for background on Xilinx HLS, see the introduction to high-level synthesis (HLS) and the development example of implementing an LDPC decoder with Xilinx HLS.

    Although Xilinx HLS is the tool chosen here, as far as I know HLS may not be well suited to a relatively complex IP design like this one. Even though SimpleTPU is already simple enough, it still cannot be implemented entirely within a single function, and HLS is weak at describing the relationships between functions: two modules can usually only be connected through a call relationship or through a FIFO channel. Nevertheless, because HLS code is easy to write, easy to read, and easy to verify, HLS is still used here, with some of these problems worked around by various means. In real applications, HDL, or HDL combined with HLS, is the more appropriate choice. The fragment below gives a taste of the style.
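    A small taste of the HLS style used throughout this series (an illustrative fragment, not SimpleTPU's actual source): plain C loops plus pragmas, here asking for 32 parallel MACs the way one row of the scaled-down array would compute.

#include "ap_int.h"
typedef ap_int<8>  int8;
typedef ap_int<32> int32;

// One 32-wide multiply-accumulate step; UNROLL instantiates 32 MACs.
void mac32(const int8 w[32], const int8 a[32], int32 acc[32]) {
#pragma HLS ARRAY_PARTITION variable=acc complete
    for (int i = 0; i < 32; ++i) {
#pragma HLS UNROLL
        acc[i] += w[i] * a[i];
    }
}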

    The follow-up posts will present the two key computation units, then the control logic and instruction design; finally, an actual neural network will be run in simulation and analyzed.
