Compiler overview - the concept of instruction scheduling

Instruction pipeline

Assume each instruction needs the following number of cycles to produce its result:

load: 3 cycles
mult: 2 cycles
div: 5 cycles
store: 2 cycles
add: 1 cycle

Note: a cycle here can be understood as one unit of CPU compute time occupied by the instruction.

start   end     instruction
1       3       load a      → r1
2       4       load b      → r2
5       6       mult r1,r2  → r3
6       8       load c      → r4
7       9       load d      → r5
10      14      div r4,r5   → r6
15      15      add r3,r6   → r7
16      17      store r7    → x

Looking at the trace: the first load occupies cycles 1 through 3. The second load does not depend on the first, so it can issue at cycle 2 without waiting for the first to complete. mult depends on the data in r1 and r2, so it must wait for both loads to finish and cannot start until cycle 5.
The rest follows by the same reasoning; the key is to distinguish dependent instructions from independent ones.

Executed in this order, the program takes 17 cycles in total. These 17 cycles can be understood as the compute time the CPU spends under the current ordering.

Next we change the order of the instructions:

start   end     instruction
1       3       load a      → r1
2       4       load b      → r2
3       5       load c      → r4
4       6       load d      → r5

5       6       mult r1,r2  → r3
7       11      div r4,r5   → r6
12      12      add r3,r6   → r7
13      14      store r7    → x

We can see that after reordering, with the independent loads moved to the front, the 17 cycles have been reduced to 14. This is called instruction-level parallelism (ILP): roughly, letting as many independent operations as possible proceed at the same time — here, loads into multiple registers running concurrently — to save time.
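To make the arithmetic concrete, here is a small Python sketch (the names and the timing model are my own, not from the original post) that replays both orders under the stated latencies, assuming one instruction issues per cycle and an instruction cannot issue until the cycle after all of its operands are ready:

```python
# Toy timing model for the examples above (a sketch, not a real pipeline).
# Assumptions: issue width 1, and an instruction may issue the cycle
# after all of its source registers become available.

LATENCY = {"load": 3, "mult": 2, "div": 5, "store": 2, "add": 1}

def schedule_length(instrs):
    """instrs: (op, srcs, dst) triples in program order.
    Returns the cycle in which the last result is produced."""
    ready = {}        # register -> cycle its value becomes available
    last_issue = 0    # cycle the previous instruction issued
    finish = 0
    for op, srcs, dst in instrs:
        start = max([last_issue + 1] + [ready[s] + 1 for s in srcs])
        end = start + LATENCY[op] - 1
        ready[dst] = end
        last_issue = start
        finish = max(finish, end)
    return finish

original = [
    ("load",  [],           "r1"),   # load a
    ("load",  [],           "r2"),   # load b
    ("mult",  ["r1", "r2"], "r3"),
    ("load",  [],           "r4"),   # load c
    ("load",  [],           "r5"),   # load d
    ("div",   ["r4", "r5"], "r6"),
    ("add",   ["r3", "r6"], "r7"),
    ("store", ["r7"],       "x"),
]
# The same instructions with the independent loads hoisted to the front.
reordered = [original[0], original[1], original[3], original[4],
             original[2], original[5], original[6], original[7]]

print(schedule_length(original))   # 17
print(schedule_length(reordered))  # 14
```

The start/end cycles it computes match the two traces above line by line.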

Before reordering:

17 cycles, 3 registers

// The live registers are listed on the right of each instruction: the registers whose values are still needed at that point.

load a      → r1      live: r1
load b      → r2      live: r1,r2

mult r1,r2  → r3      live: r3
load c      → r4      live: r3,r4
load d      → r5      live: r3,r4,r5    // the peak register pressure of this schedule is 3
div r4,r5   → r6      live: r3,r6
add r3,r6   → r7      live: r7
store r7    → x       live: none

After reordering:

14 cycles, 4 registers

load a      → r1      live: r1
load b      → r2      live: r1,r2
load c      → r4      live: r1,r2,r4
load d      → r5      live: r1,r2,r4,r5   // the peak register pressure of this schedule is 4

mult r1,r2  → r3      live: r4,r5,r3
div r4,r5   → r6      live: r3,r6
add r3,r6   → r7      live: r7
store r7    → x       live: none

So although reordering the instructions shortens the schedule, it increases the number of registers required.
Balancing the two optimally is an NP-complete problem: no known polynomial-time algorithm always finds the optimal solution.
In other words, in general the search space is exponential (or even factorial) in size.
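The peak pressures above (3 before reordering, 4 after) can be checked mechanically with a backward liveness scan. This is a sketch with my own names, assuming a register is live from its definition to its last use:

```python
# Sketch: peak register pressure via a backward liveness scan.
# Scanning backwards, a destination register dies (leaves the live set)
# at its definition, and source registers become live at each use.

def peak_pressure(instrs):
    """instrs: (srcs, dst) pairs in program order; dst=None for store."""
    live = set()
    peak = 0
    for srcs, dst in reversed(instrs):
        live.discard(dst)      # defined here, so not live before this point
        live |= set(srcs)      # used here, so live coming into this point
        peak = max(peak, len(live))
    return peak

before_sched = [([], "r1"), ([], "r2"), (["r1", "r2"], "r3"),
                ([], "r4"), ([], "r5"), (["r4", "r5"], "r6"),
                (["r3", "r6"], "r7"), (["r7"], None)]
after_sched = [([], "r1"), ([], "r2"), ([], "r4"), ([], "r5"),
               (["r1", "r2"], "r3"), (["r4", "r5"], "r6"),
               (["r3", "r6"], "r7"), (["r7"], None)]

print(peak_pressure(before_sched))  # 3
print(peak_pressure(after_sched))   # 4
```

The intermediate live sets it walks through are exactly those in the two tables above.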

So how does the compiler choose? Should it prioritize the cycle count or the number of registers?
That depends on the configuration of the target processor.

Register allocation tries to use as few registers as possible, while instruction scheduling uses more registers in order to reduce cycles; the two phases constrain each other.

Out-of-order execution

Out-of-order execution (abbreviated OoOE or OOE) is a scheduling technique used in high-performance microprocessors to make full use of instruction cycles and improve CPU instruction throughput. Under this scheme, the CPU decides the execution order of instructions according to the availability of their operands, rather than the original code order of the program.
 

The figure above compares the instruction timing of two approaches, in-order execution and out-of-order execution, under the same pipeline design. With in-order execution, the processor executes machine instructions in the order written in the program. A load that reads from memory — LOAD R2, 8(R1) in the figure — needs many cycles to access memory, so the instructions immediately after it stall for a long time waiting for it to commit. This may look unavoidable, but often the next instruction does not depend on the long-latency one at all: as soon as its operands are available, it could execute. Is there a better way? Yes — out-of-order execution. A CPU that supports it may scramble the execution order of machine instructions rather than follow program order, which raises instruction throughput, measured as IPC (instructions per cycle).
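As a toy illustration (my own sketch, not a model of any real core), a single-issue out-of-order engine using the latencies from the scheduling example can run the original, un-reordered instruction sequence in the same 14 cycles the compiler-scheduled version achieved, because each cycle it issues the oldest instruction whose operands are ready:

```python
# Sketch of dynamic (out-of-order) issue: each cycle, issue the oldest
# pending instruction whose source operands are already available,
# regardless of program order. Assumptions: issue width 1, perfect
# dependence tracking, same latencies as the scheduling example.

LATENCY = {"load": 3, "mult": 2, "div": 5, "store": 2, "add": 1}

def ooo_cycles(instrs):
    ready = {}                 # register -> cycle its value is ready
    pending = list(instrs)
    finish = 0
    cycle = 0
    while pending:
        cycle += 1
        for i, (op, srcs, dst) in enumerate(pending):
            if all(s in ready and ready[s] < cycle for s in srcs):
                end = cycle + LATENCY[op] - 1
                ready[dst] = end
                finish = max(finish, end)
                pending.pop(i)     # issue at most one instruction per cycle
                break
    return finish

# The original, dependence-ordered program from the scheduling example.
program = [
    ("load",  [],           "r1"),
    ("load",  [],           "r2"),
    ("mult",  ["r1", "r2"], "r3"),
    ("load",  [],           "r4"),
    ("load",  [],           "r5"),
    ("div",   ["r4", "r5"], "r6"),
    ("add",   ["r3", "r6"], "r7"),
    ("store", ["r7"],       "x"),
]
print(ooo_cycles(program))  # 14
```

In other words, the hardware dynamically discovers the same independence that the compiler exploited statically by reordering.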

Modern x86 processors are out-of-order CPUs.


Source: blog.csdn.net/weixin_43754049/article/details/126181221