Writing a Simple Version of the Google TPU: The Instruction Set

Series catalog

    Overview of the Google TPU and its simplification

    Basic unit: the matrix multiplication array

    Basic unit: normalization and pooling (to be released)

    TPU instruction set

    SimpleTPU examples (planned)

    Extensions

    The boundaries of the TPU (planned)

    Re-examining parallelism in deep neural networks (planned)

  

     TPU V1 defines its own instruction set. Although introductions to a processor usually start from its instruction set architecture, I have left it until near the end here, mainly for two reasons: first, I do not fully understand this part of the design myself, and this is the main reason; second, the public information does not describe the details of the TPU's instruction set or microarchitecture. Starting the analysis of the TPU from its data flow and computation units is certainly much easier, but to understand the TPU's design ideas, one still has to come back to an analysis of its architecture. This part is somewhat beyond my current abilities, so please correct me where I get things wrong.

    This article looks, from the viewpoint of architecture design, at how the TPU achieves high performance and high energy efficiency. High performance comes from parallelism, so this article discusses design methods for instruction-level parallelism and data-level parallelism. Since the paper does not describe the TPU's instruction set design in detail, unless otherwise stated the discussion of the TPU instruction set in this article is speculative; in addition, SimpleTPU's instructions are not designed to be systematic or complete, and serve only to clarify a few basic design ideas.

1. TPU instruction set

     The TPU uses a CISC-style instruction set with about ten instructions in total, of which the following five are the main ones:

  1. Read_Host_Memory: reads data from CPU memory into the TPU's Unified Buffer
  2. Read_Weights: reads weights from weight memory into the TPU's Weight FIFO
  3. MatrixMultiply / Convolve: performs matrix multiplication or convolution
  4. Activate: performs the neural network's nonlinear operations and pooling (if any)
  5. Write_Host_Memory: writes results from the Unified Buffer back to CPU memory
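
    To make this concrete, here is a minimal sketch of what such a command stream might look like from the host side. The encoding below is entirely my own assumption for illustration; the paper does not disclose the actual instruction format.

#include <cstdint>

// Hypothetical encoding, for illustration only; the real TPU instruction
// format is not public.
enum class Opcode : uint8_t {
    ReadHostMemory,   // CPU memory -> Unified Buffer
    ReadWeights,      // weight memory -> Weight FIFO
    MatrixMultiply,   // matrix multiplication / convolution
    Activate,         // nonlinear activation and pooling (if any)
    WriteHostMemory,  // Unified Buffer -> CPU memory
};

// Operands are buffer/memory addresses and lengths; note the absence of
// general-purpose registers.
struct TpuInstruction {
    Opcode   op;
    uint32_t src_addr;  // source address (host memory or Unified Buffer)
    uint32_t dst_addr;  // destination address
    uint32_t length;    // amount of data to move or process
};

    Running one layer would then amount to issuing Read_Host_Memory, Read_Weights, MatrixMultiply, Activate, and Write_Host_Memory in sequence.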

    As these five instructions show, the TPU instruction set design is very different from that of a general-purpose processor. The instructions must explicitly specify how data moves between memory and the on-chip buffers, the execute instruction (MatrixMultiply) directly specifies Buffer addresses, and no general-purpose registers appear in the instructions. This is because the TPU is by nature a special-purpose processing chip; its high performance and high efficiency are built on the premise of giving up some flexibility. To achieve higher performance, a series of conventional methods can be applied, including

  • Instruction-level parallelism, i.e., processing multiple instructions at a time, keeping all execution units busy
  • Data-level parallelism, i.e., processing multiple sets of data at a time, to improve performance

     The following sections elaborate on these two points and briefly discuss other optimization methods and further directions in the TPU design.

2. Instruction parallelism

2.1 The pipeline in SimpleTPU

    To increase throughput and clock frequency, processors are usually designed with pipelines; the classic five-stage pipeline is generally arranged as follows

 

                clk0    clk1    clk2    clk3    clk4    clk5    clk6    clk7
instruction 0   IF      ID      EX      MEM     WB
instruction 1           IF      ID      EX      MEM     WB
instruction 2                   IF      ID      EX      MEM     WB
instruction 3                           IF      ID      EX      MEM     WB

    Here IF means instruction fetch, ID means instruction decode, EX means execute, MEM means memory access, and WB means write back (to registers). A pipelined design improves performance. Without pipelining, instruction 1 could not begin its IF stage until clk5, which would seriously hurt performance; and if IF/ID/EX/MEM/WB were all completed in the same cycle, the logic would be extremely complex and would severely limit the operating frequency.

    The TPU paper describes a four-stage pipeline design; SimpleTPU uses a two-stage pipeline to complete the control process.

 

                clk0    clk1    clk2    clk3    clk4    clk5    clk6    clk7
instruction 0   IF&ID   EX
instruction 1           IF&ID   EX
instruction 2                   IF&ID   EX
instruction 3                           IF&ID   EX

    One could also consider SimpleTPU to have a four-stage pipeline internally, because in the actual implementation the execution stage consists of three parts (register read, execute, and write back), and these three parts are themselves pipelined.
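
    The behavior of this two-stage pipeline can be modeled in a few lines of C++. This is a behavioral sketch only, not SimpleTPU's actual implementation; fetch_and_decode and execute are stand-ins for the real stages.

#include <cstdio>

struct DecodedInstr { int id; };

DecodedInstr fetch_and_decode(int pc) {     // stage 1: IF&ID (stub)
    std::printf("IF&ID instruction %d\n", pc);
    return DecodedInstr{pc};
}

void execute(const DecodedInstr& instr) {   // stage 2: EX (stub)
    std::printf("EX    instruction %d\n", instr.id);
}

// Each loop iteration is one clock cycle: EX works on the instruction
// latched in the previous cycle while IF&ID prepares the next one, so one
// instruction completes per cycle once the pipeline is full.
void run(int n_instr) {
    DecodedInstr pipeline_reg{};   // pipeline register between IF&ID and EX
    bool valid = false;
    for (int clk = 0; clk <= n_instr; ++clk) {
        if (valid) execute(pipeline_reg);
        valid = (clk < n_instr);
        if (valid) pipeline_reg = fetch_and_decode(clk);
    }
}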

2.2 Very Long Instruction Word (VLIW)

    As described earlier, SimpleTPU has two basic computation units: the matrix multiplication array and the pooling unit. There are also execution units that have not been explicitly described, such as load and store. Under these conditions, no matter how well the TPU's instructions are pipelined, the cycles per instruction cannot drop below 1. If the other execution units finish in only a few cycles, some execution unit will always be idle, and instruction issue becomes the processor's bottleneck. (Another approach is to make each instruction occupy its execution unit for longer; because SimpleTPU adopts a vector-architecture-style design, it also has this property.) To solve this problem, issuing multiple instructions per cycle is a straightforward idea.

    Because of the TPU's specialization, and because the computation process has no jumps or control flow, a multiple-issue VLIW design seems a very suitable approach. In this design the instruction issue structure is fixed, and all hazards can be detected in advance and handled by the compiler, which can greatly reduce the complexity of the hardware implementation. SimpleTPU borrows from the VLIW design idea, as shown (schematically) below

[Figure: SimpleTPU VLIW instruction fields (schematic)]

    Each field is described as follows

  • module mask: specifies which execution modules the current instruction drives
  • load weight: specifies an instruction that reads weights from memory into the SRAM
  • load act. & mac & store result: specifies the process of reading activations (act.) into registers, performing the multiply-add computation in the array, and writing the result back to memory
  • set weight: specifies the process of reading operands (weights) into the compute array's registers
  • load act. & pooling & store result: specifies the process of reading activations (act.) into registers, completing the pooling and normalization computation, and writing the result back to memory

    A VLIW design gives up a great deal of flexibility and compatibility and pushes a lot of work onto software, but it still suits a special-purpose processor like the TPU. SimpleTPU has no handling of any kind for data conflicts or dependencies; software must analyze and avoid them in advance. With this design, a single instruction can schedule up to four modules at a time, which improves efficiency.
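
    For illustration, the instruction word above could be encoded as the following struct. The field names follow the schematic, but the widths and layout are assumptions, not SimpleTPU's actual format.

#include <cstdint>

// Hypothetical VLIW instruction word: each field independently drives one
// functional module, so a single word can start up to four modules per cycle.
struct VliwWord {
    uint8_t  module_mask;          // which modules this word activates
    uint32_t load_weight;          // memory -> weight SRAM transfer
    uint32_t load_act_mac_store;   // read act., multiply-add, write back
    uint32_t set_weight;           // weight SRAM -> compute-array registers
    uint32_t load_act_pool_store;  // read act., pooling/normalization, write back
};

    Because all hazards have been resolved offline, the hardware decoder can simply dispatch each field to its module with no interlock logic.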

3. Data parallelism in the convolution calculation

3.1 Single Instruction Multiple Data (SIMD)

    Single instruction, multiple data means, as the name suggests, that one instruction controls the computation on multiple sets of data. Clearly, the TPU's core design adopts this data-parallel approach: one instruction (MatrixMultiply/Convolve) controls the 256*256 multiply-add computation units. By the correspondence between instruction streams and data streams, processors can be divided into the following categories

  • SISD: single instruction stream, single data stream; instructions execute sequentially, processing one piece of data at a time; instruction-level parallel methods can be applied
  • SIMD: single instruction stream, multiple data streams; one instruction starts an operation on multiple sets of data; can be used to exploit data-level parallelism
  • MISD: multiple instruction streams, single data stream; has no commercial implementation
  • MIMD: multiple instruction streams, multiple data streams; each processor operates on its own data with its own instructions; can be used for task-level parallelism as well as data-level parallelism; more flexible than SIMD

    Since the TPU performs regular matrix/convolution computations, SIMD is the best data-parallel choice within a single processor. There are many ways to implement SIMD; based on the description in the paper (a MatrixMultiply/Convolve instruction accepts a B*256 input and produces a B*256 output), the TPU should use a design method similar to a vector architecture, as sketched below.
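
    As a schematic illustration of the SIMD idea (not the TPU's actual implementation), one control step can drive all 256*256 multiply-add units at once; only the array size follows the paper.

#include <cstdint>

// One "MatrixMultiply" control step drives all 256x256 multiply-add units
// on different data; in hardware both loops are fully parallel MAC units
// (schematic only; INT8 inputs, INT32 accumulators).
void matrix_multiply_step(const int8_t act[256],
                          const int8_t weight[256][256],
                          int32_t acc[256]) {
    for (int col = 0; col < 256; ++col)
        for (int row = 0; row < 256; ++row)
            acc[col] += act[row] * weight[row][col];
}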

3.2 Vector Architecture

    Basic unit: the matrix multiplication array described how the computation unit completes the matrix multiplication calculation, which is essentially vector computation. Taking the example given in Computer Architecture: A Quantitative Approach, suppose we want to compute

for(int i=0;i<N;i++)
    y[i] += a*x[i];

    Taking MIPS as an example, the instruction execution sequences for a general scalar processor and for a vector processor are shown below

[Figure: instruction sequences for the loop above on a scalar MIPS processor and on a vector processor]

    The biggest difference is that the vector processor executes far fewer instructions, reducing the required instruction bandwidth. At the same time, simple MIPS instructions may interlock with each other, reducing performance, a phenomenon that does not occur on the vector processor.
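
    The contrast can be sketched in C++ instead of assembly. In the scalar version every element pays its own loads, multiply, add, store, and loop bookkeeping; in the vector version each marked line stands for one vector instruction acting on a whole 64-element vector register (the strip length and the VMIPS-style mnemonics in the comments follow the textbook example; the instruction counts are approximate).

// Scalar version: on the order of ten instructions per element.
void daxpy_scalar(int n, double a, const double* x, double* y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Vector version: each commented line models ONE vector instruction acting
// on a 64-element vector register, so the whole strip costs ~6 instructions.
void daxpy_vector64(double a, const double x[64], double y[64]) {
    double v1[64], v2[64], v3[64], v4[64];
    for (int j = 0; j < 64; ++j) v1[j] = x[j];           // LV      V1, Rx
    for (int j = 0; j < 64; ++j) v2[j] = v1[j] * a;      // MULVS.D V2, V1, F0
    for (int j = 0; j < 64; ++j) v3[j] = y[j];           // LV      V3, Ry
    for (int j = 0; j < 64; ++j) v4[j] = v2[j] + v3[j];  // ADDVV.D V4, V2, V3
    for (int j = 0; j < 64; ++j) y[j] = v4[j];           // SV      V4, Ry
}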

    For the convolution operation in a convolutional neural network, the computation can be expressed as follows (ignoring bias)

for(int i=0;i<M;i++){
    for(int j=0;j<N;j++){
        for(int k=0;k<K;k++){
            for(int c=0;c<C;c++){
                for(int kw=0;kw<KW;kw++){
                    for(int kh=0;kh<KH;kh++){
                        result(i,j,k) += feature(i+kw,j+kh,c)*w(k,kw,kh,c);
                    }
                }
            }
        }
    }
}

    Since KW and KH (the width and height of the convolution kernel) may be 1, and since the weights are held fixed inside the compute array during the calculation, the loop order can be adjusted, giving

for(int kw=0;kw<KW;kw++){
    for(int kh=0;kh<KH;kh++){
        for(int k=0;k<K;k++){
            for(int i=0;i<M;i++){
                for(int j=0;j<N;j++){
                    for(int c=0;c<C;c++){
                        result(i,j,k) += feature(i+kw,j+kh,c)*w(k,kw,kh,c);
                    }
                }
            }
        }
    }
}

    Here the outer loops (over kw and kh) are controlled by instructions; the third loop (over k) is computed with a parallelism of 256 inside the compute array and is scheduled by the instruction; loop levels 4-6 (over i, j, and c) follow the vector processor design approach, with a single instruction completing the computation of all three loop levels. To complete these loops, three vector length registers need to be set; in addition, since the vector addresses in SRAM are not contiguous, three stride registers also need to be set. Referring to the code in Basic unit: the matrix multiplication array, these registers are specifically

    short ubuf_raddr_step1;   // address stride for the first vector loop level
    short ubuf_raddr_step2;   // address stride for the second vector loop level
    short ubuf_raddr_step3;   // address stride for the third vector loop level
    short ubuf_raddr_end1;    // vector length (loop bound) of the first level
    short ubuf_raddr_end2;    // vector length (loop bound) of the second level
    short ubuf_raddr_end3;    // vector length (loop bound) of the third level

    With such a design, a single SimpleTPU instruction can complete the computation of a large amount of data, improving data parallelism. The data are fed into the compute array in parallel to complete the calculation (the array can be viewed as multiple lanes). Since data read latency in SimpleTPU is fixed (the reads go to SRAM), the design is actually simpler than that of a general vector processor.
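
    As a sketch of how these registers might be used (the loop structure and the point at which each stride is applied are my assumptions, not SimpleTPU's exact logic), the read addresses for the three vector loop levels could be generated as follows.

void read_unified_buffer(short addr);  // placeholder for the SRAM read port

// Illustrative address generation: the end registers bound the three loop
// levels and the step registers supply non-contiguous address increments.
void generate_read_addresses(short base,
                             short step1, short end1,
                             short step2, short end2,
                             short step3, short end3) {
    for (short i = 0; i < end1; ++i)            // loop level 4 (i)
        for (short j = 0; j < end2; ++j)        // loop level 5 (j)
            for (short c = 0; c < end3; ++c)    // loop level 6 (c)
                read_unified_buffer(base + i * step1 + j * step2 + c * step3);
}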

    According to Google's paper, the TPU has a repeat field, but limited by the length of the MatrixMultiply/Convolve instruction, it may have only one or two sets of vector length and stride registers; still, the design idea should be similar.

4. Other

    Judging from the parameters in Google's paper, the TPU achieves very high performance per watt. As the previous articles described, this comes from its core design choices:

  • It computes with the INT8 data type
  • It optimizes the computation with a systolic array
  • It has no cache, no branch prediction, and no data hazard handling (all handled by the compiler)

    From this article we can see that the TPU also uses a simple instruction set design plus SIMD, a vector architecture, and VLIW to further optimize performance per watt; in addition, Google's TPU V2/V3 go further still, using multi-core and multi-processor designs to improve performance.

 

References

Jouppi, Norman P., et al. "In-Datacenter Performance Analysis of a Tensor Processing Unit." Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), 2017.

Hennessy, John L., and David A. Patterson. Computer Architecture: A Quantitative Approach. 5th ed.
