ARM assembly study notes (9): efficient switch code and unaligned data access

Branching with switch (x) is very common in everyday code, and it is a relatively time-consuming operation. Optimizing it can noticeably improve the efficiency of the code.

1. Switches on the range 0 <= x < N

This method assumes that N is not too large. Consider the following C code:
int ref_switch(int x)
{
     switch (x) {
          case 0: return method_0();
          case 1: return method_1();
          case 2: return method_2();
          case 3: return method_3();
          case 4: return method_4();
          case 5: return method_5();
          case 6: return method_6();
          case 7: return method_7();
          default: return method_d();
     }
}
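We can use the pc register as the base of a table of branch instructions and the value of x as the index into it. In ARM state, pc reads as the address of the current instruction plus 8, so x = 0 lands on the B method_0 entry and each further value of x moves one instruction (4 bytes) down the table. The optimized assembly code is as follows: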
           ; int switch_relative(int x)
switch_relative
         CMP     x, #8
         ADDLO   pc, pc, x, LSL#2    ; LO is an unsigned test, so a negative x also reaches method_d
         B       method_d
         B       method_0
         B       method_1
         B       method_2
         B       method_3
         B       method_4
         B       method_5
         B       method_6
         B       method_7
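
For comparison, the same idea can be sketched in C as a table of function pointers indexed by x. This is only an illustration and not the pc-relative trick itself: method_table, switch_table, and switch_table_demo are hypothetical names, and the method bodies are stand-ins that return their own index (or -1 for the default) so that the dispatch is easy to check.

```c
#include <assert.h>

/* Stand-in bodies (assumption): each method returns its own index,
   and the default method returns -1. */
static int method_0(void) { return 0; }
static int method_1(void) { return 1; }
static int method_2(void) { return 2; }
static int method_3(void) { return 3; }
static int method_4(void) { return 4; }
static int method_5(void) { return 5; }
static int method_6(void) { return 6; }
static int method_7(void) { return 7; }
static int method_d(void) { return -1; }

/* Hypothetical table holding method_0..method_7 in index order. */
static int (*const method_table[8])(void) = {
    method_0, method_1, method_2, method_3,
    method_4, method_5, method_6, method_7
};

int switch_table(int x)
{
    /* A single unsigned comparison rejects both x < 0 and x >= 8. */
    if ((unsigned)x < 8u)
        return method_table[x]();
    return method_d();
}

/* Usage: in-range values reach their method, everything else the default. */
void switch_table_demo(void)
{
    assert(switch_table(3) == 3);
    assert(switch_table(8) == -1);
    assert(switch_table(-1) == -1);
}
```

The cast to unsigned plays the same role as an unsigned condition code after CMP: one test covers both ends of the range.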

2. Switches on a general value x

If x does not have the form 0 <= x < N, or if N is very large, the method above does not apply. In that case we can use a hash function y = f(x) that maps the values of x into the range 0 <= y < N, and then branch on y instead of x using the same technique.
For example, suppose that method_k is called when x = 2^k, that is, when x is 1, 2, 4, 8, 16, 32, 64, or 128, and that every other value calls the default function method_d. We look for a hash function built from multiplications by powers of two minus one, because ARM can perform each such multiplication in a single instruction using the barrel shifter. Experiment shows that for these eight values, bits 9 to 11 of x * 15 * 31 are all different, so those three bits can be extracted with bit operations and used as the index for the branch.
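
The experiment is easy to reproduce. The following is a minimal C sketch, assuming 32-bit unsigned arithmetic; hash_pow2 and check_hash_is_collision_free are hypothetical names introduced here for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 9 to 11 of x * 15 * 31, computed with shifts and subtractions. */
static unsigned hash_pow2(uint32_t x)
{
    uint32_t h = (x << 4) - x;   /* x * 15      */
    h = (h << 5) - h;            /* x * 15 * 31 */
    return (h >> 9) & 7u;
}

/* Aborts via assert if two of the eight inputs 1, 2, ..., 128 share a slot. */
void check_hash_is_collision_free(void)
{
    unsigned seen = 0;
    for (int k = 0; k < 8; k++) {
        unsigned slot = hash_pow2((uint32_t)1 << k);
        assert((seen & (1u << slot)) == 0);
        seen |= 1u << slot;
    }
    assert(seen == 0xFFu);       /* all eight slots are used exactly once */
}
```

The resulting slots for x = 1, 2, 4, 8, 16, 32, 64, 128 are 0, 1, 3, 7, 6, 5, 2, 4, which is why the table entries are not in order of k.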
The following is the optimized assembly code:
x RN 0
hash RN 1
                     ; int switch_hash(int x)
switch_hash
                     RSB     hash, x, x, LSL#4         ; hash = x*15
                     RSB     hash, hash, hash, LSL#5   ; hash = x*15*31
                     AND     hash, hash, #7 << 9       ; keep bits 9 to 11 of the hash
                     ADD     pc, pc, hash, LSR#6       ; each table entry is 8 bytes (TEQ + BEQ)
                     NOP                               ; padding: pc reads as the ADD address + 8
                     TEQ     x, #0x01
                     BEQ     method_0
                     TEQ     x, #0x02
                     BEQ     method_1
                     TEQ     x, #0x40
                     BEQ     method_6
                     TEQ     x, #0x04
                     BEQ     method_2
                     TEQ     x, #0x80
                     BEQ     method_7
                     TEQ     x, #0x20
                     BEQ     method_5
                     TEQ     x, #0x10
                     BEQ     method_4
                     TEQ     x, #0x08
                     BEQ     method_3
                     B       method_d
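
The behaviour of switch_hash can be modelled in C to see why the comparison after the jump is still needed. This is a sketch under assumptions, not a translation of the routine: switch_hash_model, slot_key, and slot_method are hypothetical names, and the function returns k where the assembly would call method_k, or -1 where it would call method_d. The model compares only against the key of the selected slot, whereas the assembly falls through the remaining tests; the result is the same because no later key can match.

```c
#include <assert.h>
#include <stdint.h>

/* Key expected in each slot, in the order of the TEQ instructions above. */
static const uint32_t slot_key[8] = {
    0x01, 0x02, 0x40, 0x04, 0x80, 0x20, 0x10, 0x08
};

/* k of the method_k reached from each slot. */
static const int slot_method[8] = { 0, 1, 6, 2, 7, 5, 4, 3 };

/* Returns k when method_k would be called, or -1 for method_d. */
int switch_hash_model(uint32_t x)
{
    uint32_t hash = x * 15u * 31u;       /* same product as the two RSBs */
    unsigned slot = (hash >> 9) & 7u;    /* bits 9 to 11 select the slot */

    /* Many other x values hash to the same slot, so the key is compared. */
    return (x == slot_key[slot]) ? slot_method[slot] : -1;
}

/* Usage: powers of two up to 128 dispatch, everything else is the default. */
void switch_hash_demo(void)
{
    assert(switch_hash_model(0x40) == 6);
    assert(switch_hash_model(3) == -1);
    assert(switch_hash_model(0x100) == -1);
}
```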

The method above is only a special case. When the values of x are not powers of 2, a similar approach still works with a different hash function; only the general idea is presented here.

3. Unaligned data access

Unaligned data access should be avoided whenever possible, because it hurts both portability and efficiency.
  • The simplest method is to read and write in units of one byte or one halfword. This is the recommended method, although it is relatively slow.
The following routines load and store an unaligned 32-bit value one byte at a time. The loads use three scratch registers, t0, t1, and t2, to avoid pipeline interlocks. On ARM9TDMI, each unaligned 32-bit load done this way takes 7 clock cycles. Both little-endian and big-endian versions are listed.
p RN 0
x RN 1
t0 RN 2
t1 RN 3
t2 RN 12
                     ; int load_32_little(char *p)
load_32_little
                     LDRB    x,  [p]
                     LDRB    t0, [p, #1]
                     LDRB    t1, [p, #2]
                     LDRB    t2, [p, #3]
                     ORR     x, x, t0, LSL#8
                     ORR     x, x, t1, LSL#16
                     ORR     r0, x, t2, LSL#24
                     MOV     pc, lr
                     ; int load_32_big(char *p)
load_32_big
                     LDRB    x,  [p]
                     LDRB    t0, [p, #1]
                     LDRB    t1, [p, #2]
                     LDRB    t2, [p, #3]
                     ORR     x, t0, x, LSL#8
                     ORR     x, t1, x, LSL#8
                     ORR     r0, t2, x, LSL#8
                     MOV     pc, lr
                   ; void store_32_little(char *p, int x)
store_32_little
                   STRB    x,  [p]
                   MOV     t0, x, LSR#8
                   STRB    t0, [p, #1]
                   MOV     t0, x, LSR#16
                   STRB    t0, [p, #2]
                   MOV     t0, x, LSR#24
                   STRB    t0, [p, #3]
                   MOV     pc, lr
                   ; void store_32_big(char *p, int x)
store_32_big
                   MOV     t0, x, LSR#24
                   STRB    t0, [p]
                   MOV     t0, x, LSR#16
                   STRB    t0, [p, #1]
                   MOV     t0, x, LSR#8
                   STRB    t0, [p, #2]
                   STRB    x,  [p, #3]
                   MOV     pc, lr
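
The same byte-by-byte technique can be written portably in C. The sketch below is an illustration under assumptions rather than a translation of the routines above: the names load32_le, load32_be, store32_le, store32_be, and unaligned_demo are hypothetical, and unsigned types are used in place of char and int so that the shifts are well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Assemble a 32-bit value from four bytes, lowest address = lowest byte. */
uint32_t load32_le(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Assemble a 32-bit value from four bytes, lowest address = highest byte. */
uint32_t load32_be(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24)
         | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)
         | (uint32_t)p[3];
}

void store32_le(unsigned char *p, uint32_t x)
{
    p[0] = (unsigned char)x;
    p[1] = (unsigned char)(x >> 8);
    p[2] = (unsigned char)(x >> 16);
    p[3] = (unsigned char)(x >> 24);
}

void store32_be(unsigned char *p, uint32_t x)
{
    p[0] = (unsigned char)(x >> 24);
    p[1] = (unsigned char)(x >> 16);
    p[2] = (unsigned char)(x >> 8);
    p[3] = (unsigned char)x;
}

/* Usage: write and read back a value one byte past the start of a buffer. */
void unaligned_demo(void)
{
    unsigned char buf[5] = { 0 };
    store32_le(buf + 1, 0x12345678u);
    assert(buf[1] == 0x78 && buf[4] == 0x12);
    assert(load32_le(buf + 1) == 0x12345678u);
    assert(load32_be(buf + 1) == 0x78563412u);
}
```

Because only single bytes are ever accessed, the code does not depend on the alignment of p or on the byte order of the machine it runs on.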

Origin blog.csdn.net/beyond702/article/details/52251084