Instruction pipeline
Assume one instruction can be issued per cycle, and each instruction needs the following number of cycles to produce its result:
Load is 3 cycles
mult is 2 cycles
div is 5 cycles
store is 2 cycles
add is 1 cycle
Note: a cycle here can be understood as a unit of CPU computing-resource time. In the listings below, each line gives the start cycle, the end cycle, and the instruction.
1 3 load a r1
2 4 load b r2
5 6 mult r1,r2 r3
6 8 load c r4
7 9 load d r5
10 14 div r4,r5 r6
15 15 add r3,r6 r7
16 17 store r7 x
The first load occupies three cycles: 1, 2, and 3. The second load does not depend on the first, so it can start in cycle 2 without waiting for the first load to complete. mult depends on the data in r1 and r2, so it must wait until both loads have finished and cannot start before cycle 5.
The remaining instructions follow the same pattern; the key is to distinguish instructions that depend on earlier results from those that do not.
In this order the program takes 17 cycles in total. These 17 cycles can be understood as the computing-resource time the CPU consumes for this instruction order.
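The timing rules described above can be sketched in a few lines of Python. This is a minimal model, not a real pipeline simulator: it assumes one instruction is issued per cycle in program order, and that an instruction may start only in the cycle after all of its source registers have been produced. The names `LATENCY`, `PROGRAM`, and `in_order_schedule` are made up for this illustration.

```python
# Latencies taken from the list above.
LATENCY = {"load": 3, "mult": 2, "div": 5, "store": 2, "add": 1}

# (opcode, source operands, destination).
# Assumption: a, b, c, d, x are memory locations and are always available.
PROGRAM = [
    ("load", ["a"], "r1"),
    ("load", ["b"], "r2"),
    ("mult", ["r1", "r2"], "r3"),
    ("load", ["c"], "r4"),
    ("load", ["d"], "r5"),
    ("div", ["r4", "r5"], "r6"),
    ("add", ["r3", "r6"], "r7"),
    ("store", ["r7"], "x"),
]

def in_order_schedule(program):
    ready = {}       # operand -> last cycle of the instruction that produces it
    prev_start = 0   # start cycle of the previously issued instruction
    rows = []
    for op, srcs, dst in program:
        # Start one cycle after the previous issue, and after every source is ready.
        start = max([prev_start + 1] + [ready.get(s, 0) + 1 for s in srcs])
        end = start + LATENCY[op] - 1
        ready[dst] = end
        prev_start = start
        rows.append((start, end, op))
    return rows

rows = in_order_schedule(PROGRAM)
for start, end, op in rows:
    print(start, end, op)
total = max(end for _, end, _ in rows)
print("total cycles:", total)
```

Under these assumptions the sketch reproduces the start and end cycles of the listing above and a total of 17 cycles.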
Next we change the order of the instructions:
1 3 load a r1
2 4 load b r2
3 5 load c r4
4 6 load d r5
5 6 mult r1,r2 r3
7 11 div r4,r5 r6
12 12 add r3,r6 r7
13 14 store r7 x
After moving the loads that have no dependencies to the front, the cycle count drops from 17 to 14. This exploits what is called instruction-level parallelism (ILP): independent instructions, here the loads into several registers, overlap in time instead of running one after another, which saves cycles.
Before reordering:
17 cycles, 3 registers
// Each line shows the instruction followed by the live registers after it executes, that is, the registers that still hold a value needed later
load a r1 r1
load b r2 r1,r2
mult r1,r2 r3 r3
load c r4 r3,r4
load d r5 r3,r4,r5 //It is not difficult to find that the peak register pressure of this schedule is 3
div r4,r5 r6 r3,r6
add r3,r6 r7 r7
store r7 x none
After reordering:
14 cycles, 4 registers
load a r1 r1
load b r2 r1,r2
load c r4 r1,r2,r4
load d r5 r1,r2,r4,r5 //You can see that the peak register pressure of this schedule is 4
mult r1,r2 r3 r4,r5,r3
div r4,r5 r6 r3,r6
add r3,r6 r7 r7
store r7 x none
Although reordering the instructions reduces the number of cycles, it increases the number of registers required.
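The live-register columns in the two listings above can be recomputed mechanically. The sketch below assumes that a register becomes live when it is written and stops being live after the last instruction that reads it, and that operand names starting with "r" are registers while the others are memory locations. The names `BEFORE`, `AFTER`, `live_after_each`, and `peak_pressure` are made up for this illustration.

```python
# (opcode, sources, destination); names starting with "r" are registers,
# everything else (a, b, c, d, x) is assumed to be a memory location.
BEFORE = [
    ("load", ("a",), "r1"),
    ("load", ("b",), "r2"),
    ("mult", ("r1", "r2"), "r3"),
    ("load", ("c",), "r4"),
    ("load", ("d",), "r5"),
    ("div", ("r4", "r5"), "r6"),
    ("add", ("r3", "r6"), "r7"),
    ("store", ("r7",), "x"),
]
# Reordered version: all four loads moved to the front.
AFTER = [BEFORE[i] for i in (0, 1, 3, 4, 2, 5, 6, 7)]

def live_after_each(program):
    # Index of the last instruction that reads each register.
    last_use = {}
    for i, (_, srcs, _) in enumerate(program):
        for s in srcs:
            if s.startswith("r"):
                last_use[s] = i
    live, result = set(), []
    for i, (_, _, dst) in enumerate(program):
        if dst.startswith("r"):
            live.add(dst)
        # A register stays live only while a later instruction still reads it.
        live = {r for r in live if last_use.get(r, -1) > i}
        result.append(sorted(live))
    return result

def peak_pressure(program):
    return max(len(regs) for regs in live_after_each(program))

for name, prog in (("before", BEFORE), ("after", AFTER)):
    print(name, live_after_each(prog), "peak =", peak_pressure(prog))
```

With these definitions the peak is 3 for the original order and 4 for the reordered one, matching the listings.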
Finding the optimal schedule is, in general, an NP-complete problem: no polynomial-time algorithm is known that always finds the optimal solution.
In other words, the cost of an exact search grows exponentially or factorially with the number of instructions.
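To get a feeling for the size of the search space, the sketch below tries every permutation of the eight example instructions, keeps the ones that respect the data dependencies, and evaluates each with the same simplified rules used in these notes (one issue per cycle in the chosen order, live registers counted after each instruction). It is an illustration only: all function names are made up here, and real compilers do not schedule by exhaustive search.

```python
from itertools import permutations

LATENCY = {"load": 3, "mult": 2, "div": 5, "store": 2, "add": 1}
PROGRAM = [
    ("load", ("a",), "r1"),
    ("load", ("b",), "r2"),
    ("mult", ("r1", "r2"), "r3"),
    ("load", ("c",), "r4"),
    ("load", ("d",), "r5"),
    ("div", ("r4", "r5"), "r6"),
    ("add", ("r3", "r6"), "r7"),
    ("store", ("r7",), "x"),
]

def is_reg(name):
    # Assumption: register names start with "r"; other names are memory.
    return name.startswith("r")

def respects_dependencies(order):
    # Every register must be written before it is read.
    defined = set()
    for _, srcs, dst in order:
        if any(is_reg(s) and s not in defined for s in srcs):
            return False
        defined.add(dst)
    return True

def total_cycles(order):
    ready, prev_start, last = {}, 0, 0
    for op, srcs, dst in order:
        start = max([prev_start + 1] + [ready.get(s, 0) + 1 for s in srcs])
        ready[dst] = start + LATENCY[op] - 1
        prev_start, last = start, max(last, ready[dst])
    return last

def peak_pressure(order):
    last_use = {s: i for i, (_, srcs, _) in enumerate(order) for s in srcs if is_reg(s)}
    live, peak = set(), 0
    for i, (_, _, dst) in enumerate(order):
        if is_reg(dst):
            live.add(dst)
        live = {r for r in live if last_use.get(r, -1) > i}
        peak = max(peak, len(live))
    return peak

valid = [o for o in permutations(PROGRAM) if respects_dependencies(o)]
results = [(total_cycles(o), peak_pressure(o)) for o in valid]
best_cycles = min(c for c, _ in results)
fewest_regs = min(p for _, p in results)
print("valid orders:", len(valid))
print("fewest cycles:", best_cycles,
      "needs registers:", min(p for c, p in results if c == best_cycles))
print("fewest registers:", fewest_regs,
      "best cycles then:", min(c for c, p in results if p == fewest_regs))
```

For this tiny program only 80 of the 8! = 40320 permutations respect the dependencies, so trying them all is cheap. The number of candidate orders grows factorially with program size, which is why exhaustive search does not scale.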
So how does the compiler choose? Does it give priority to the cycle count or to the number of registers?
That depends on the configuration of the target processor.
Register allocation tries to use as few registers as possible, while instruction scheduling tends to use more registers in order to reduce cycles. The two constrain each other.
Out-of-order execution
Out-of-order execution (OoOE or OOE for short) is a scheduling technique used in high-performance microprocessors to make full use of instruction cycles and improve instruction throughput. With this technique, the CPU decides the execution order of instructions based on the availability of their operands rather than on the original order of the program code.
The figure above compares the instruction timing of two processing methods, in-order execution and out-of-order execution, on the same pipeline design. With in-order execution, the processor executes machine instructions in the order in which they are written in the program. A load that reads data from memory, such as LOAD R2, 8(R1) in the figure, takes many cycles, so the instructions immediately after it are stuck in a long wait (waiting for the previous instruction to commit). This seems reasonable, but sometimes the next instruction does not depend on the slow instruction before it and could execute as soon as its own operands are available. Is there a better way to handle this? Yes: out-of-order execution. A CPU that supports out-of-order execution may execute machine instructions in a different order from the one in which the program was written, which improves instruction throughput, measured as IPC (instructions per cycle).
Modern high-performance x86 CPUs are out-of-order CPUs.
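The difference between the two issue policies can be sketched with a toy simulation of the example program from the first part of these notes, left in its original order. The model is deliberately simplified and is an assumption of this sketch, not a description of any real CPU: at most one instruction is issued per cycle, the out-of-order variant may pick the oldest pending instruction whose operands are ready, and each register is written only once, so renaming, a reorder buffer, and in-order commit are not modeled. The names `simulate` and `operands_ready` are made up for this illustration.

```python
LATENCY = {"load": 3, "mult": 2, "div": 5, "store": 2, "add": 1}

# Original program order; a, b, c, d, x are assumed to be memory locations.
PROGRAM = [
    ("load", ("a",), "r1"),
    ("load", ("b",), "r2"),
    ("mult", ("r1", "r2"), "r3"),
    ("load", ("c",), "r4"),
    ("load", ("d",), "r5"),
    ("div", ("r4", "r5"), "r6"),
    ("add", ("r3", "r6"), "r7"),
    ("store", ("r7",), "x"),
]

def operands_ready(instr, done_at, cycle):
    # A register operand is ready once its producer finished in an earlier cycle.
    _, srcs, _ = instr
    return all(done_at.get(s, float("inf")) < cycle
               for s in srcs if s.startswith("r"))

def simulate(program, out_of_order):
    done_at = {}                        # register -> last cycle of its producer
    pending = list(range(len(program)))  # not yet issued, in program order
    issued = []
    cycle = 0
    while pending:
        cycle += 1
        # In-order: only the oldest pending instruction may issue.
        # Out-of-order: any pending instruction may issue; oldest ready one wins.
        window = pending if out_of_order else pending[:1]
        pick = next((i for i in window
                     if operands_ready(program[i], done_at, cycle)), None)
        if pick is None:
            continue                    # nothing is ready: stall this cycle
        op, _, dst = program[pick]
        end = cycle + LATENCY[op] - 1
        done_at[dst] = end
        issued.append((cycle, end, op, dst))
        pending.remove(pick)
    return max(end for _, end, _, _ in issued), issued

in_order_cycles, _ = simulate(PROGRAM, out_of_order=False)
ooo_cycles, ooo_trace = simulate(PROGRAM, out_of_order=True)
print("in-order:", in_order_cycles, "cycles")
print("out-of-order:", ooo_cycles, "cycles")
for row in ooo_trace:
    print(row)
```

In this model the in-order run takes 17 cycles and the out-of-order run takes 14: while mult waits for r2, the hardware issues the two remaining loads, which is the same reordering that was done by hand earlier.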