Stack byte alignment at run time on x86_64 Linux

Foreword

The procedure call mechanism of C (that is, calls between functions) is a key feature of the language, as it is in most programming languages. It relies on the last-in, first-out memory management discipline that the stack data structure provides. The stack space that belongs to each function is called a stack frame; it holds the saved registers, the space allocated for local variables, and the arguments to be passed to the functions it calls. The basic structure of the stack is shown below:

One thing deserves attention, however. Arguments of a procedure call may be passed on the stack, and local variables are also allocated on the stack, so how are arguments and variables of different byte lengths laid out in stack space? This is the byte alignment question that we want to explore here.

The examples in this article use the following environment:

  • Ubuntu x86_64 GNU/Linux
  • gcc 7.4.0

Data Alignment

Many computer systems restrict the addresses that are legal for basic data types, requiring that the address of an object of a given type be a multiple of some value K, where K is given in the table below. Such alignment restrictions simplify the design of the hardware that forms the interface between the processor and the memory system. As a concrete example, suppose we read an 8-byte variable from memory and the memory is accessed in 8-byte blocks. If the variable's address is a multiple of 8, the variable can be read with a single memory operation. If the address is not a multiple of 8, two memory reads may be needed, because the variable straddles two 8-byte blocks.

K   Types
1   char
2   short
4   int, float
8   long, double, char *

x86_64 hardware works correctly whether or not data is aligned, but misaligned data reduces the performance of the system, so the compiler generally aligns data for us at compile time.

Stack byte alignment

Stack byte alignment means that the stack pointer must be an integer multiple of 16 bytes. Stack alignment helps data be read in as few memory access cycles as possible, and a misaligned stack pointer can cause serious performance degradation.

As noted above, a program still runs when its data is not aligned, only less efficiently. On certain models of Intel and AMD processors, however, some SSE multimedia instructions do not execute correctly on misaligned data. These instructions operate on 16 bytes of data at a time, and the instructions that transfer data between the SSE unit and memory require the memory address to be a multiple of 16.

Accordingly, any compiler and run-time system for an x86_64 processor must ensure that every data structure that may be read into or written from SSE registers is allocated with 16-byte alignment. This takes the form of two conventions:

  • Any memory allocation function (alloca, malloc, calloc, or realloc) must return a block whose starting address is a multiple of 16.
  • The stack frames of most functions must be aligned on a 16-byte boundary.

In short, on the run-time stack not only must local variables and passed arguments satisfy byte alignment, the stack pointer (%rsp) itself must also be a multiple of 16.

Three examples

Let us look at three practical examples of how stack space is actually allocated in order to satisfy both data alignment and stack byte alignment.

The following sample program is from CSAPP.

void proc(long  a1, long  *a1p,
          int   a2, int   *a2p,
          short a3, short *a3p,
          char  a4, char  *a4p) {
    *a1p += a1;
    *a2p += a2;
    *a3p += a3;
    *a4p += a4;
}

long call_proc()
{
    long  x1 = 1; int  x2 = 2;
    short x3 = 3; char x4 = 4;
    proc(x1, &x1, x2, &x2, x3, &x3, x4, &x4);
    return (x1+x2)*(x3+x4);
}

Compile and disassemble it with the following commands:

$ gcc -Og -fno-stack-protector -c call_proc.c
$ objdump -d call_proc.o

Here the -fno-stack-protector option instructs the compiler not to add the stack protector mechanism.

In the generated assembly code, we look only at the stack space allocated by call_proc():

0000000000000015 <call_proc>:
  15:   48 83 ec 10             sub    $0x10,%rsp
  19:   48 c7 44 24 08 01 00    movq   $0x1,0x8(%rsp)
  20:   00 00 
  22:   c7 44 24 04 02 00 00    movl   $0x2,0x4(%rsp)
  29:   00 
  2a:   66 c7 44 24 02 03 00    movw   $0x3,0x2(%rsp)
  31:   c6 44 24 01 04          movb   $0x4,0x1(%rsp)
  36:   48 8d 4c 24 04          lea    0x4(%rsp),%rcx
  3b:   48 8d 74 24 08          lea    0x8(%rsp),%rsi
  40:   48 8d 44 24 01          lea    0x1(%rsp),%rax
  45:   50                      push   %rax
  46:   6a 04                   pushq  $0x4
  48:   4c 8d 4c 24 12          lea    0x12(%rsp),%r9
  4d:   41 b8 03 00 00 00       mov    $0x3,%r8d
  53:   ba 02 00 00 00          mov    $0x2,%edx
  58:   bf 01 00 00 00          mov    $0x1,%edi
  5d:   e8 00 00 00 00          callq  62 <call_proc+0x4d>
  ...

At line 15 (the numbers at the left of the listing are really the starting offsets of the instructions, but we will call them line numbers for convenience), the code first subtracts 0x10 from %rsp, allocating 16 bytes in total for the four local variables. At lines 45 and 46, the program pushes %rax and $0x4 onto the stack. Relating the C function to its assembly code, it is easy to work out how the space on the stack is laid out, as shown below:

In the figure, to keep the stack byte-aligned, the pushed value 4 occupies an 8-byte slot by itself, and every variable on the stack meets the alignment requirement of its type.

If argument 8 takes fewer bytes, will less stack space be used? We modify the C program above slightly, as follows:

void proc(long  a1, long  *a1p,
          int   a2, int   *a2p,
          short a3, short *a3p,
          char  a4, char a5) {  // char *a4p changed to char a5
    *a1p += a1;
    *a2p += a2;
    *a3p += a3;
    a5 += a4;
}

long call_proc()
{
    long  x1 = 1; int  x2 = 2;
    short x3 = 3; char x4 = 4;
    proc(x1, &x1, x2, &x2, x3, &x3, x4, x4);  // last argument changed accordingly
    return (x1+x2)*(x3+x4);
}

The assembly code of call_proc() is as follows:

000000000000000a <call_proc>:
   a:   48 83 ec 10             sub    $0x10,%rsp
   e:   48 c7 44 24 08 01 00    movq   $0x1,0x8(%rsp)
  15:   00 00 
  17:   c7 44 24 04 02 00 00    movl   $0x2,0x4(%rsp)
  1e:   00 
  1f:   66 c7 44 24 02 03 00    movw   $0x3,0x2(%rsp)
  26:   48 8d 4c 24 04          lea    0x4(%rsp),%rcx
  2b:   48 8d 74 24 08          lea    0x8(%rsp),%rsi
  30:   6a 04                   pushq  $0x4
  32:   6a 04                   pushq  $0x4
  34:   4c 8d 4c 24 12          lea    0x12(%rsp),%r9
  39:   41 b8 03 00 00 00       mov    $0x3,%r8d
  3f:   ba 02 00 00 00          mov    $0x2,%edx
  44:   bf 01 00 00 00          mov    $0x1,%edi
  49:   e8 00 00 00 00          callq  4e <call_proc+0x44>
  ...

Relating this to the program, the structure of its stack space is shown below:

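We find that the stack space used has not decreased. To keep the stack byte-aligned, argument 8 and argument 7 each occupy an 8-byte slot of their own. In this listing x4 is no longer stored in the local-variable area, so 2 of its 16 bytes go unused, and the procedure call leaves 2 + 7 + 7 = 16 bytes unused. For the sake of compatibility and efficiency, however, this is worth it.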

Let us look at one more program: what happens when we allocate strings on the stack?

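#include <string.h>

void function(int a, int b, int c) {
       char buffer1[5];
       char buffer2[10];
       /* Only the frame layout matters here; buffer1 is never initialized. */
       strcpy(buffer2, buffer1);
}

int main(void) {
        function(1,2,3);
        return 0;
}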

After compiling with gcc -fno-stack-protector -o foo foo.c and disassembling with objdump -d foo, the assembly code of function() is as follows:

000000000000064a <function>:
 64a:   55                      push   %rbp
 64b:   48 89 e5                mov    %rsp,%rbp
 64e:   48 83 ec 20             sub    $0x20,%rsp
 652:   89 7d ec                mov    %edi,-0x14(%rbp)
 655:   89 75 e8                mov    %esi,-0x18(%rbp)
 658:   89 55 e4                mov    %edx,-0x1c(%rbp)
 65b:   48 8d 55 fb             lea    -0x5(%rbp),%rdx
 65f:   48 8d 45 f1             lea    -0xf(%rbp),%rax
 663:   48 89 d6                mov    %rdx,%rsi
 666:   48 89 c7                mov    %rax,%rdi
 669:   e8 b2 fe ff ff          callq  520 <strcpy@plt>
 66e:   90                      nop
 66f:   c9                      leaveq 
 670:   c3                      retq

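The procedure allocates 32 bytes on the stack in total, which covers the space for the two strings and for the three function arguments. Note that although the first six arguments of a function are passed directly in registers on x64, a program sometimes needs the address of an argument. In that case it has to allocate stack memory for the argument and copy the argument there, so that operations on the argument's address can work. In this build, which uses no optimization flag, the listing shows all three register arguments being copied into the frame.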

Relating this to the program, the stack structure of the procedure is as follows:

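In the figure, because a char can start at any address (its address only has to be a multiple of 1), buffer1 and buffer2 are allocated contiguously, while the three int variables are placed in two separate 8-byte regions.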

Summary

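As we have seen above, in order to meet the requirements (or conventions) of data alignment and stack byte alignment, the compiler is willing to sacrifice some memory, which improves both the compatibility and the performance of the program.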



Origin www.cnblogs.com/tcctw/p/11333743.html