Architecture 13_Tomasulo algorithm

Background

1. The IBM 360/91 was launched about three years after the CDC 6600

  -This was before commercial computers used cache technology

2. The entire 360 series shares a single instruction set and compiler (IBM had invested heavily in them and did not want to keep changing the instruction set)

  -High floating-point performance was required, but it could not rely on a compiler built specifically for the high-end machines

  -With only four double-precision floating-point registers, compiler scheduling is of very limited use

  -Both memory accesses and floating-point operations take a long time

  -The hardware supports overlapped execution of multiple loop iterations

Tomasulo algorithm and scoreboard

1. It adopts many ideas from the scoreboard

2. Two major differences

  -In the Tomasulo algorithm, hazard detection and execution control are distributed and implemented with reservation stations

  -The Tomasulo algorithm does not check for WAR and WAW hazards; the algorithm itself eliminates them through register renaming

Tomasulo algorithm implemented on MIPS (modified MIPS)

1. 360/91 floating-point functional units

  -3 addition units

  -2 multiplication units

  -6 load units

  -3 store units

2. Differences between the MIPS and 360/91 floating-point units

  360/91: supports register-memory instructions. MIPS: supports only register-register instructions and uses Load/Store to access memory.

  360/91: uses pipelined functional units rather than multiple functional units. MIPS: to simplify the discussion, multiple functional units are used instead of pipelining every unit.

3. Each functional unit has reservation stations, which act as buffers

Instructions issue from the instruction queue and enter the functional units.

We say there are 3 addition units, but in fact there is only 1 adder with 3 buffers. The same holds for the multiplication side: there is really only one multiplier with two buffers. These 3 buffers and 2 buffers are the reservation stations, called the 3 add reservation stations and the 2 multiply reservation stations. Once an instruction leaves the instruction queue and enters an add reservation station, the add instruction counts as issued. From the instruction queue's point of view, the three add reservation stations behave like three adders, the two multiply reservation stations like 2 multipliers, the 6 Load buffers like 6 Load units, and the 3 Store buffers like 3 Store units. The reservation stations (buffers) therefore appear to give us more functional units. Because these units are not real hardware, the buffers are called virtual functional units.

Under the Tomasulo algorithm, the MIPS floating-point unit works as follows. An instruction leaving the queue may be an addition, a multiplication, a Load, or a Store, and it is assigned to the reservation station of the corresponding functional unit. At that point the instruction is effectively in its execution phase, and the reservation station records its operation type, for example whether it is an addition or a subtraction.

The two source operands are read from the floating-point registers. An operand that is ready is copied into the reservation station; for one that is not ready, the station records which unit will produce it. Once a value has been copied, the register no longer matters to the add unit and is free to be reused. When the operation completes, the result is placed on a bus. This bus is the hub of the Tomasulo algorithm and is called the common data bus (CDB). Every unit that needs a value on the CDB captures it at the same time. Hazard detection for each instruction is carried out in its own reservation station. Replacing register names with values and with pointers to producers amounts to register renaming: what used to be a read from a register becomes a read from the unit that produces the data.
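The operand capture described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the 360/91 hardware: the names `regs`, `reg_status`, `read_operand`, and `issue` are hypothetical, and station tags are plain strings such as "Add1".

```python
# Hypothetical model: register values plus a status table that names the
# reservation station (if any) that will produce each register's next value.
regs = {"F0": 1.5, "F2": 2.0, "F6": 4.0}
reg_status = {"F0": None, "F2": None, "F6": None}


def read_operand(reg):
    """Return (value, tag); exactly one of the two is not None."""
    producer = reg_status[reg]
    if producer is None:
        return regs[reg], None   # value is ready: copy it now
    return None, producer        # value pending: remember who produces it


def issue(station, dest, src1, src2):
    """Capture both sources, then rename dest to this station's tag."""
    vj, qj = read_operand(src1)
    vk, qk = read_operand(src2)
    reg_status[dest] = station
    return {"station": station, "Vj": vj, "Qj": qj, "Vk": vk, "Qk": qk}


# ADD F0 <- F2 + F6 issues first, then MUL F6 <- F0 * F2.
add = issue("Add1", "F0", "F2", "F6")
mul = issue("Mult1", "F6", "F0", "F2")
```

In this sketch the ADD has already copied the old value of F6 when the MUL claims F6 as its destination, so the later write cannot disturb the earlier read. The MUL holds the tag "Add1" in place of F0, which is the renaming step.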

Modifying the MIPS five-stage pipeline

The ID and EX stages are replaced by the following three stages

1. Issue

2. Execute

3. Write result

Issue

   a. Take an instruction from the floating-point instruction queue

   b. If a reservation station is free, issue the instruction

   c. If an operand is in a register, send it to the reservation station assigned to the instruction

   d. Load/store instructions can issue as long as a buffer is free

   e. If no reservation station or buffer is free, there is a structural hazard, and the instruction stalls until one becomes free
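Steps a, b, d, and e can be sketched as follows. This is a simplified model under stated assumptions: instructions are reduced to a kind string, operand capture (step c) is left out, and `STATION_COUNTS`, `make_stations`, and `try_issue` are hypothetical names. The slot counts follow the unit counts given earlier in the text.

```python
from collections import deque

# Assumed sizes: 3 add and 2 multiply stations, 6 load and 3 store buffers.
STATION_COUNTS = {"add": 3, "mul": 2, "load": 6, "store": 3}


def make_stations():
    """One busy flag per reservation station or buffer, all initially free."""
    return {kind: [False] * n for kind, n in STATION_COUNTS.items()}


def try_issue(queue, stations):
    """Issue the instruction at the head of the queue if a slot is free.

    Returns (kind, slot_index) on success, or None when the queue is empty
    or a structural hazard forces a stall. Only the head of the queue is
    ever considered, so instructions leave the queue in program order.
    """
    if not queue:
        return None
    kind = queue[0]
    slots = stations[kind]
    for i, busy in enumerate(slots):
        if not busy:
            slots[i] = True
            queue.popleft()
            return kind, i
    return None  # no free slot: stall, the instruction stays at the head


queue = deque(["mul", "mul", "mul", "add"])
stations = make_stations()
results = [try_issue(queue, stations) for _ in range(4)]
```

With only two multiply stations, the third multiply stalls, and the add behind it waits as well even though add stations are free, because the queue is drained strictly from the head.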

Execute and write result

    Execute

      1. If one or more operands are missing, monitor the CDB (Common Data Bus)

          This stage detects RAW (read after write) hazards and resolves them automatically

      2. Once both operands are ready, the instruction can execute

   Write result

       1. When the result has been produced, put it on the CDB

       2. Through the CDB, the result is written to the destination register and to the reservation stations of all functional units waiting for it
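The write-result step amounts to a broadcast keyed by the producer's tag. Below is a minimal Python sketch, assuming stations are dictionaries with `Vj`, `Qj`, `Vk`, `Qk` entries and that `None` marks "no pending producer"; the function names `broadcast` and `ready` are hypothetical.

```python
def broadcast(tag, value, stations, reg_status, regs):
    """Put (tag, value) on the CDB: every waiting consumer captures it."""
    for rs in stations:
        if rs["Qj"] == tag:
            rs["Vj"], rs["Qj"] = value, None
        if rs["Qk"] == tag:
            rs["Vk"], rs["Qk"] = value, None
    for reg, producer in reg_status.items():
        if producer == tag:
            regs[reg] = value
            reg_status[reg] = None


def ready(rs):
    """An instruction may start executing once both operands are present."""
    return rs["Qj"] is None and rs["Qk"] is None


# Two add stations and register F4 are all waiting for the result of Mult1.
stations = [
    {"name": "Add2", "Vj": None, "Qj": "Mult1", "Vk": 3.0, "Qk": None},
    {"name": "Add3", "Vj": 1.0, "Qj": None, "Vk": None, "Qk": "Mult1"},
]
regs = {"F4": 0.0}
reg_status = {"F4": "Mult1"}

before = [ready(rs) for rs in stations]
broadcast("Mult1", 8.0, stations, reg_status, regs)
after = [ready(rs) for rs in stations]
```

A single broadcast wakes both waiting stations and updates the register, which is why RAW dependences resolve without any central controller.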

Six fields of a reservation station

Op: the operation to perform on source operands S1 and S2

Qj, Qk: the reservation stations that will produce the source operands this instruction needs

          A value of 0 means the source operand is already available

Vj, Vk: the values of the source operands

           For each operand, the V field and the Q field are never valid at the same time

Busy: this reservation station is occupied
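The six fields map naturally onto a small record type. The following is a sketch only: the class name `ReservationStation` and the helper methods are hypothetical, station numbers are assumed to be positive integers, and 0 in a Q field means the operand is ready, as stated above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    busy: bool = False
    op: Optional[str] = None     # operation to apply to the source operands
    vj: Optional[float] = None   # value of the first source operand
    vk: Optional[float] = None   # value of the second source operand
    qj: int = 0                  # station producing operand j; 0 means ready
    qk: int = 0                  # station producing operand k; 0 means ready

    def check(self):
        """Raise if an operand has both a value and a pending producer."""
        for v, q in ((self.vj, self.qj), (self.vk, self.qk)):
            if v is not None and q != 0:
                raise ValueError("V and Q are valid at the same time")

    def operands_ready(self):
        """True when the station is in use and waits on no producer."""
        return self.busy and self.qj == 0 and self.qk == 0


# First operand already captured, second still to come from station 4.
rs = ReservationStation(busy=True, op="ADD", vj=2.0, qk=4)
```

Here `rs` cannot execute yet because `qk` still names station 4; once that result arrives, `vk` is filled in and `qk` is reset to 0.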

The register file and the store buffers each have a Qi field

Qi: reservation station number

    The number of the reservation station that will produce the result to be stored here

    If Qi is empty, no in-flight instruction needs to write its result to this register or buffer
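The Qi field is also what removes WAW hazards: a register accepts a result only from the station it currently names. A minimal Python sketch follows, assuming string tags and `None` for an empty Qi; the names `qi`, `issue_write`, and `write_result` are hypothetical.

```python
regs = {"F0": 0.0}
qi = {"F0": None}   # None: no pending instruction will write this register


def issue_write(reg, station):
    """A newly issued instruction claims reg; any older claim is replaced."""
    qi[reg] = station


def write_result(station, value):
    """Update only the registers still waiting on this station."""
    written = []
    for reg, tag in qi.items():
        if tag == station:
            regs[reg] = value
            qi[reg] = None
            written.append(reg)
    return written


issue_write("F0", "Mult1")   # older, slow instruction targets F0
issue_write("F0", "Add1")    # younger instruction also targets F0
first = write_result("Add1", 7.0)     # the younger one finishes first
second = write_result("Mult1", 99.0)  # the older one finishes later
```

The late result from "Mult1" leaves F0 untouched because F0 no longer names that station. In a full model that result would still be broadcast to any reservation stations waiting for it.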

Each Load buffer and Store buffer has

    a busy bit and an address field; a Store buffer also has a V field holding the value to be stored

Example

Dependences must be found either by the compiler ahead of time or by the hardware. In the Tomasulo algorithm, the hardware does this automatically.

In-order issue, out-of-order execution, out-of-order write back

Advantages of Tomasulo algorithm

1. Hazard detection is done by distributed hardware

2. Register renaming completely eliminates WAW and WAR hazards

3. If several reservation stations are waiting for the same operand, they all capture it at the same time when it is broadcast on the CDB

Dynamic scheduling method evaluation

1. Dynamic methods can achieve high performance

2. Main defects

    High complexity: requires a lot of hardware

    Bottleneck: a single common data bus (CDB) becomes a point of contention (with out-of-order completion, several instructions may finish in the same cycle)

          Adding more CDBs: every reservation station needs duplicated interface hardware for each CDB
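The contention can be made concrete with a toy arbiter. This sketch assumes an oldest-first priority rule, which the text does not specify, and the names `arbitrate` and `cycles_to_drain` are hypothetical.

```python
def arbitrate(finished, num_cdbs=1):
    """Split finished stations into those that broadcast this cycle and
    those that must wait. Oldest-first priority is an assumption."""
    ordered = sorted(finished, key=lambda item: item[0])
    return ordered[:num_cdbs], ordered[num_cdbs:]


def cycles_to_drain(finished, num_cdbs=1):
    """Count the cycles needed to broadcast every pending result."""
    cycles = 0
    pending = list(finished)
    while pending:
        _, pending = arbitrate(pending, num_cdbs)
        cycles += 1
    return cycles


# (issue order, station) pairs that all complete in the same cycle.
done = [(5, "Mult1"), (3, "Add2"), (4, "Load1")]
```

With one bus, the three results in `done` take three cycles to broadcast; with two buses they take two, which is the performance argument for adding CDBs despite the extra interface hardware.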

Current machines generally have three CDBs, two local and one global. One local bus serves floating point (as in the Tomasulo implementation on MIPS described above) and one serves integer; the global bus runs between the floating-point and integer sides.

 


Origin blog.csdn.net/weixin_42596333/article/details/104226343