ARM assembly study notes (7) - ARM9 five-stage pipeline and pipeline interlock

This article analyzes the principle of the five-stage pipeline and of pipeline interlocks, with the goal of writing more efficient assembly code.


1. ARM9 five-stage pipeline

ARM7 uses a typical three-stage pipeline consisting of fetch, decode, and execute. The execute stage does a great deal of work, including reading and writing the registers and memory involved in the operands, performing ALU operations, and transferring data between the related units. Each of the three stages normally takes one clock cycle, and since three instructions occupy the three stages at the same time, a throughput of one instruction per cycle can still be reached. However, the execute stage often needs several clock cycles, and so it becomes the bottleneck of system performance.
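
As a rough illustration, three independent instructions (chosen arbitrarily here) overlapping in the three-stage pipeline look like this, with F = fetch, D = decode, E = execute:

        ; Cycle:           1    2    3    4    5
        MOV r0, #0         ; F    D    E
        ADD r1, r1, #1     ;      F    D    E
        SUB r2, r2, #4     ;           F    D    E
        ; Once the pipeline is full, one instruction completes every cycle,
        ; provided no instruction keeps the execute stage busy for extra cycles.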

ARM9 uses a more efficient five-stage pipeline. After fetch, decode, and execute (ALU), two more stages are added: LS1 and LS2. LS1 loads or stores the data specified by the instruction, and LS2 extracts and zero- or sign-extends the data loaded by a byte or halfword load instruction. LS1 and LS2 only do work for load and store instructions; other instructions pass through them with no effect. The documentation defines the stages as follows:

  • Fetch: Fetch from memory the instruction at address pc. The instruction is loaded into the core and then processes down the core pipeline.

  • Decode: Decode the instruction that was fetched in the previous cycle. The processor also reads the input operands from the register bank if they are not available via one of the forwarding paths.

  • ALU: Executes the instruction that was decoded in the previous cycle. Note this instruction was originally fetched from address pc - 8 (ARM state) or pc - 4 (Thumb state). Normally this involves calculating the answer for a data processing operation, or the address for a load, store, or branch operation. Some instructions may spend several cycles in this stage. For example, multiply and register-controlled shift operations take several ALU cycles.

  • LS1: Load or store the data specified by a load or store instruction. If the instruction is not a load or store, then this stage has no effect.

  • LS2: Extract and zero- or sign-extend the data loaded by a byte or halfword load instruction. If the instruction is not a load of an 8-bit byte or 16-bit halfword item, then this stage has no effect.

In the ARM9 five-stage pipeline, the register read is moved into the decode stage, and the execute stage of the three-stage pipeline is split into finer stages. This reduces the amount of work that must be completed in each clock cycle, so the pipeline stages are more evenly balanced; it also avoids bus conflicts between data access and instruction fetch, and the average number of cycles per instruction drops noticeably.
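
As a sketch based on the stage definitions quoted above, a single word load flows through the five ARM9 stages like this:

        ; Cycle:             1    2    3     4     5
        LDR r1, [r2, #4]     ; F    D    ALU   LS1   LS2
        ; F   - fetch the instruction from memory
        ; D   - decode it and read r2 from the register bank
        ; ALU - compute the address r2 + 4
        ; LS1 - read the word from memory
        ; LS2 - extract and extend the loaded data (only byte and halfword loads need this)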

2. The problem of pipeline interlock

Although it has been said that the three-stage and five-stage pipelines can generally reach one instruction per cycle, not every instruction finishes in one cycle: different instructions need different numbers of clock cycles. For details, refer to Appendix D: Instruction Cycle Timings in the ARM System Developer's Guide, which is not repeated here. The document can also be found in my resources.
In addition, different instruction sequences also cost different numbers of clock cycles. For example, if an instruction needs the result of the previous instruction and that result is not ready yet, it has to wait. This is a pipeline interlock.
Take the simplest example:
LDR r1, [r2, #4] 
ADD r0, r0, r1
The above code requires three clock cycles. While the LDR instruction computes r2 + 4 in its ALU stage, the ADD instruction is still in the decode stage; at that point the data has not yet been read from [r2, #4] and written back to r1. Because the ADD needs r1 in its own ALU stage, the pipeline has to stall for one cycle until the LDR's LS1 stage completes, and only then can the ADD move into the ALU stage. The pipeline interlock in this example looks like this:
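
(The cycle-by-cycle sketch below is an approximation added for illustration, assuming ARM9TDMI-style forwarding where the result of a word load becomes available at the end of LS1.)

        ; Cycle:            1    2    3     4       5
        LDR r1, [r2, #4]    ; F    D    ALU   LS1     LS2
        ADD r0, r0, r1      ;      F    D     stall   ALU
        ; The ADD cannot enter the ALU stage in cycle 4, because r1 is only
        ; produced at the end of the LDR's LS1 stage, so one cycle is lost.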



Look at the following example again:
LDRB r1, [r2, #1] 
ADD r0, r0, r2 
EOR r0, r0, r1
The above code takes four clock cycles. Because LDRB is a byte load, the value is only written back to r1 after the LS2 stage completes, so the EOR instruction has to wait one extra clock cycle. The pipeline operation is as follows:
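
(The sketch below is again an approximation added for illustration, assuming the result of a byte load only becomes available at the end of LS2.)

        ; Cycle:             1    2    3     4     5       6
        LDRB r1, [r2, #1]    ; F    D    ALU   LS1   LS2
        ADD  r0, r0, r2      ;      F    D     ALU
        EOR  r0, r0, r1      ;           F     D     stall   ALU
        ; The ADD does not use r1, so it proceeds without stalling; the EOR
        ; has to wait one cycle for the LDRB's LS2 stage to deliver r1.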

Look at the following example again:

        MOV r1, #1
        B case1
        AND r0, r0, r1
        EOR r2, r2, r3
        ...
case1:
        SUB r0, r0, r1
The above code takes five clock cycles in total, of which the B instruction takes three: when a branch is taken, the instructions already fetched behind it are flushed from the pipeline and fetching restarts from the new address. The pipeline operation is as follows:
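
(The sketch below is an approximation added for illustration; the exact refill behaviour depends on the core, but a taken B costs three cycles here.)

        ; Cycle:           1    2    3     4     5    6    7
        MOV r1, #1         ; F    D    ALU
        B   case1          ;      F    D     ALU
        AND r0, r0, r1     ;           F     D     (flushed)
        EOR r2, r2, r3     ;                 F     (flushed)
        SUB r0, r0, r1     ;                       F    D    ALU
        ; The AND and EOR that followed the branch are discarded, and two
        ; cycles are spent refilling the pipeline from case1.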


3. Avoiding pipeline interlocks to improve efficiency

Load instructions appear very frequently in code; according to the figures given in the documentation, roughly one instruction in three is a load. Optimizing load instructions and the instructions around them can therefore prevent pipeline interlocks and improve performance.
Consider the following example: the C code converts the uppercase letters in the input string to lowercase. The experiments below are based on the ARM9TDMI.
void str_tolower(char *out, char *in)
{
  unsigned int c;
  do {
    c = *(in++);
    if (c >= 'A' && c <= 'Z')
    {
      c = c + ('a' - 'A');
    }
    *(out++) = (char)c;
  } while (c);
}
The compiler generates the following assembly code:
str_tolower
        LDRB    r2, [r1], #1      ; c = *(in++)
        SUB     r3, r2, #0x41     ; r3 = c - 'A'
        CMP     r3, #0x19         ; if (c <= 'Z' - 'A')
        ADDLS   r2, r2, #0x20     ;   c += 'a' - 'A'
        STRB    r2, [r0], #1      ; *(out++) = (char)c
        CMP     r2, #0            ; if (c != 0)
        BNE     str_tolower       ;   goto str_tolower
        MOV     pc, r14           ; return
Note that after compilation, the condition (c >= 'A' && c <= 'Z') is transformed into the equivalent unsigned range check 0 <= c - 'A' <= 'Z' - 'A', which is why the assembly uses a SUB followed by a single CMP.
It can be seen that after the LDRB loads a character into c, the SUB instruction that follows has to wait 2 extra clock cycles for the result. There are two ways to optimize this: preloading and loop unrolling.

3.1 Load Scheduling by Preloading

The basic idea of this method is to load the data for the next iteration at the end of the current loop body, rather than at the beginning of the loop. The following is the optimized assembly code:
out     RN 0    ; pointer to output string
in      RN 1    ; pointer to input string
c       RN 2    ; character loaded
t       RN 3    ; scratch register

        ; void str_tolower_preload(char *out, char *in)
str_tolower_preload
        LDRB    c, [in], #1       ; c = *(in++)
loop
        SUB     t, c, #'A'        ; t = c - 'A'
        CMP     t, #'Z'-'A'       ; if (t <= 'Z'-'A')
        ADDLS   c, c, #'a'-'A'    ;   c += 'a' - 'A';
        STRB    c, [out], #1      ; *(out++) = (char)c;
        TEQ     c, #0             ; test if c == 0
        LDRNEB  c, [in], #1       ; if (c != 0) { c = *in++;
        BNE     loop              ;               goto loop; }
        MOV     pc, lr            ; return
This version of the assembly has one more instruction than the code generated by the C compiler, but it saves 2 clock cycles, reducing the inner loop from 11 to 9 clock cycles per character, making it about 1.22 times as fast as the C compiled version.
In addition, RN is an assembler directive used to give a register an alias; for example, c RN 2 makes c stand for the r2 register.

3.2 Load Scheduling by Unrolling

The basic idea of this method is to unroll the loop and then interleave the code. For example, we can process three characters i, i+1, and i+2 in each iteration; while the instructions for character i are still waiting for their loaded data, we can already start processing i+1, so we never sit idle waiting for the result of i.
The optimized assembly code is as follows:
out     RN 0    ; pointer to output string
in      RN 1    ; pointer to input string
ca0     RN 2    ; character 0
t       RN 3    ; scratch register
ca1     RN 12   ; character 1
ca2     RN 14   ; character 2

        ; void str_tolower_unrolled(char *out, char *in)
str_tolower_unrolled
        STMFD   sp!, {lr}         ; function entry
loop_next3
        LDRB    ca0, [in], #1     ; ca0 = *in++;
        LDRB    ca1, [in], #1     ; ca1 = *in++;
        LDRB    ca2, [in], #1     ; ca2 = *in++;
        SUB     t, ca0, #'A'      ; convert ca0 to lower case
        CMP     t, #'Z'-'A'
        ADDLS   ca0, ca0, #'a'-'A'
        SUB     t, ca1, #'A'      ; convert ca1 to lower case
        CMP     t, #'Z'-'A'
        ADDLS   ca1, ca1, #'a'-'A'
        SUB     t, ca2, #'A'      ; convert ca2 to lower case
        CMP     t, #'Z'-'A'
        ADDLS   ca2, ca2, #'a'-'A'
        STRB    ca0, [out], #1    ; *out++ = ca0;
        TEQ     ca0, #0           ; if (ca0 != 0)
        STRNEB  ca1, [out], #1    ;   *out++ = ca1;
        TEQNE   ca1, #0           ; if (ca0 != 0 && ca1 != 0)
        STRNEB  ca2, [out], #1    ;   *out++ = ca2;
        TEQNE   ca2, #0           ; if (ca0 != 0 && ca1 != 0 && ca2 != 0)
        BNE     loop_next3        ;   goto loop_next3;
        LDMFD   sp!, {pc}         ; return;
The above code is the most efficient implementation experimented with so far: it needs only about 7 clock cycles per character, making it about 1.57 times as fast as the C compiled version.
However, this speedup comes at a cost: the code is more than twice the size of the C compiled version, and the above code may read past the end of the string when loading characters. It is presented here only as an optimization method and idea; it is worth using in applications where timing requirements are strict and the amount of data to be processed is large.


