Foreword
The procedure call mechanism in C (that is, calls between functions) is a key feature of the language, as it is in most programming languages. It relies on the last-in, first-out memory management discipline that the stack data structure provides. The stack space used by each function is called a stack frame; it holds saved registers, the space allocated to local variables, and the arguments to be passed to called functions. The basic structure of the stack is shown below:
One point deserves attention, however. Arguments to a procedure call may be passed on the stack, and local variables are also allocated on the stack, so how are arguments and variables of different byte lengths laid out in stack space? This is the question of byte alignment that this article explores.
The examples in this article use the following environment:
- Ubuntu x86_64 GNU/Linux
- gcc 7.4.0
Data Alignment
Many computer systems restrict the legal addresses of basic data types, requiring that the address of an object of a given type be a multiple of some value K, where K is given in the table below. Such alignment restrictions simplify the design of the hardware that forms the interface between the processor and the memory system. As a concrete example, suppose we read an 8-byte variable from memory. If the variable's address is a multiple of 8, the read completes in a single memory operation. If the address is not a multiple of 8, two memory reads may be needed, because the variable straddles two 8-byte blocks of memory.
K | Types |
---|---|
1 | char |
2 | short |
4 | int, float |
8 | long, double, char * |
x86_64 hardware works correctly whether or not data is aligned, but unaligned data reduces performance, so the compiler generally aligns data for us at compile time.
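As a rough way to check these K values on a given machine, the sketch below asks the compiler for the alignment of a few types through C11's `alignof`, and shows the padding it inserts into a struct. The struct `demo` and all helper names are hypothetical, introduced only for illustration, and the values mentioned afterwards assume an x86_64 Linux system with GCC like the one used in this article.

```c
#include <assert.h>   /* static_assert */
#include <stdalign.h> /* alignof (C11) */
#include <stddef.h>   /* size_t, offsetof */

/* char always has alignment 1; the other values depend on the platform. */
static_assert(alignof(char) == 1, "char may start at any address");

/* Hypothetical struct, used only to make compiler-inserted padding visible. */
struct demo {
    char c;  /* offset 0, followed by padding so that l is aligned */
    long l;  /* must start at a multiple of alignof(long) */
    short s; /* must start at a multiple of alignof(short) */
};

/* Alignment K that the compiler reports for some types from the table. */
size_t k_short(void)   { return alignof(short); }
size_t k_int(void)     { return alignof(int); }
size_t k_long(void)    { return alignof(long); }
size_t k_pointer(void) { return alignof(char *); }

/* Offset of the long member: a multiple of alignof(long), not 1. */
size_t demo_long_offset(void) { return offsetof(struct demo, l); }

/* Total size, including trailing padding that keeps arrays of demo aligned. */
size_t demo_size(void) { return sizeof(struct demo); }
```

On the environment described above, `k_long()` and `demo_long_offset()` should both be 8, which means 7 bytes of padding sit between `c` and `l`; other platforms may report different values.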
Stack Byte Alignment
Stack byte alignment means that the address held in the stack pointer must be an integer multiple of 16 bytes. Alignment lets data be read in as few memory access cycles as possible, and a misaligned stack pointer can cause serious performance degradation.
As noted above, a program can still run when its data is unaligned, only less efficiently. On some models of Intel and AMD processors, however, certain SSE multimedia instructions will not execute correctly on unaligned data. These instructions operate on 16-byte blocks of data, and the instructions that transfer data between the SSE unit and memory require the memory address to be a multiple of 16.
Accordingly, any compiler and run-time system for an x86_64 processor must ensure that memory allocated to hold a data structure that may be read from or written to an SSE register is 16-byte aligned. This gives rise to two conventions:
- The starting address of any block generated by a memory allocation function (alloca, malloc, calloc, or realloc) must be a multiple of 16.
- The boundaries of the stack frames of most functions must be multiples of 16 bytes.
Thus, on the run-time stack, not only must local variables and passed arguments satisfy byte alignment, but the stack pointer (%rsp) must also be a multiple of 16.
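The first convention can be checked with a few lines of C. The sketch below is only an illustration under the assumption of glibc on x86_64 Linux; the C standard itself promises only alignment suitable for any fundamental type, and the helper names `is_multiple_of` and `malloc_block_is_16_aligned` are hypothetical.

```c
#include <assert.h> /* assert */
#include <stdint.h> /* uintptr_t */
#include <stdlib.h> /* malloc, free, size_t */

/* Returns 1 if the address p is a multiple of k, 0 otherwise. */
int is_multiple_of(const void *p, uintptr_t k) {
    assert(k != 0);
    return (uintptr_t)p % k == 0;
}

/* Allocates n bytes with malloc and reports whether the block starts on a
 * 16-byte boundary. Returns -1 if the allocation itself fails. */
int malloc_block_is_16_aligned(size_t n) {
    void *block = malloc(n);
    if (block == NULL) {
        return -1;
    }
    int aligned = is_multiple_of(block, 16);
    free(block);
    return aligned;
}
```

Calling `malloc_block_is_16_aligned(1)` is the interesting case: even a one-byte request should come back on a 16-byte boundary under the assumptions above.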
Three Examples
Let us look at three practical examples to see how stack space is actually allocated in order to achieve data alignment and stack alignment.
The following is a sample program from CSAPP.
void proc(long a1, long *a1p,
          int a2, int *a2p,
          short a3, short *a3p,
          char a4, char *a4p) {
    *a1p += a1;
    *a2p += a2;
    *a3p += a3;
    *a4p += a4;
}
long call_proc()
{
    long x1 = 1; int x2 = 2;
    short x3 = 3; char x4 = 4;
    proc(x1, &x1, x2, &x2, x3, &x3, x4, &x4); /* the last argument is the address of x4 */
    return (x1+x2)*(x3+x4);
}
Use the following commands to compile and disassemble:
$ gcc -Og -fno-stack-protector -c call_proc.c
$ objdump -d call_proc.o
The -fno-stack-protector option tells the compiler not to add its stack protection mechanism.
In the generated assembly code, we look only at the stack space allocated by call_proc():
0000000000000015 <call_proc>:
15: 48 83 ec 10 sub $0x10,%rsp
19: 48 c7 44 24 08 01 00 movq $0x1,0x8(%rsp)
20: 00 00
22: c7 44 24 04 02 00 00 movl $0x2,0x4(%rsp)
29: 00
2a: 66 c7 44 24 02 03 00 movw $0x3,0x2(%rsp)
31: c6 44 24 01 04 movb $0x4,0x1(%rsp)
36: 48 8d 4c 24 04 lea 0x4(%rsp),%rcx
3b: 48 8d 74 24 08 lea 0x8(%rsp),%rsi
40: 48 8d 44 24 01 lea 0x1(%rsp),%rax
45: 50 push %rax
46: 6a 04 pushq $0x4
48: 4c 8d 4c 24 12 lea 0x12(%rsp),%r9
4d: 41 b8 03 00 00 00 mov $0x3,%r8d
53: ba 02 00 00 00 mov $0x2,%edx
58: bf 01 00 00 00 mov $0x1,%edi
5d: e8 00 00 00 00 callq 62 <call_proc+0x4d>
...
Line 15 (we call the numbers on the left line numbers for convenience; strictly speaking, they are the starting offsets of the instructions) first subtracts 0x10 from %rsp, allocating 16 bytes in total for the four local variables. At lines 45 and 46, the program pushes %rax and $0x4 onto the stack. Relating the C code of the function to its assembly, it is easy to work out the space allocated on the stack, as shown below:
In the figure, to keep the stack aligned, the single-byte x4 occupies an 8-byte slot, and every variable on the stack satisfies the alignment requirement of its type.
If we reduce the number of bytes occupied by argument 8, will less stack space be used? Let us modify the C program above slightly, as follows:
void proc(long a1, long *a1p,
          int a2, int *a2p,
          short a3, short *a3p,
          char a4, char a5) { // char *a4p changed to char a5
    *a1p += a1;
    *a2p += a2;
    *a3p += a3;
    a5 += a4;
}
long call_proc()
{
    long x1 = 1; int x2 = 2;
    short x3 = 3; char x4 = 4;
    proc(x1, &x1, x2, &x2, x3, &x3, x4, x4); // last argument changed accordingly
    return (x1+x2)*(x3+x4);
}
The disassembly of call_proc() is as follows:
000000000000000a <call_proc>:
a: 48 83 ec 10 sub $0x10,%rsp
e: 48 c7 44 24 08 01 00 movq $0x1,0x8(%rsp)
15: 00 00
17: c7 44 24 04 02 00 00 movl $0x2,0x4(%rsp)
1e: 00
1f: 66 c7 44 24 02 03 00 movw $0x3,0x2(%rsp)
26: 48 8d 4c 24 04 lea 0x4(%rsp),%rcx
2b: 48 8d 74 24 08 lea 0x8(%rsp),%rsi
30: 6a 04 pushq $0x4
32: 6a 04 pushq $0x4
34: 4c 8d 4c 24 12 lea 0x12(%rsp),%r9
39: 41 b8 03 00 00 00 mov $0x3,%r8d
3f: ba 02 00 00 00 mov $0x2,%edx
44: bf 01 00 00 00 mov $0x1,%edi
49: e8 00 00 00 00 callq 4e <call_proc+0x44>
...
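Relating this to the program, the layout of the stack is shown below:
We find that stack usage has not decreased. To keep the stack aligned, argument 8 and argument 7 each occupy an 8-byte slot, and this procedure call wastes 1 + 7 + 7 = 15 bytes of space. For the sake of compatibility and efficiency, however, this is worth it.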
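The 7 bytes lost for each char argument follow from rounding every stack-passed argument up to a whole 8-byte slot. The sketch below works through that arithmetic; `STACK_SLOT`, `round_up`, `stack_slot_bytes`, and `stack_slot_padding` are hypothetical names introduced for illustration, and the 8-byte slot width is the x86_64 convention seen in the `pushq` instructions above.

```c
#include <assert.h> /* assert */
#include <stddef.h> /* size_t */

/* Width of one stack argument slot on x86_64: arguments passed on the stack
 * are rounded up to a multiple of 8 bytes. */
#define STACK_SLOT 8

/* Rounds size up to the next multiple of align (align must be a power of two). */
size_t round_up(size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/* Bytes of stack taken by one stack-passed argument of the given size. */
size_t stack_slot_bytes(size_t arg_size) {
    return round_up(arg_size, STACK_SLOT);
}

/* Bytes in that slot that carry no argument data. */
size_t stack_slot_padding(size_t arg_size) {
    return stack_slot_bytes(arg_size) - arg_size;
}
```

For a one-byte char, `stack_slot_padding(sizeof(char))` gives 7, so two char arguments passed on the stack account for 14 bytes of padding, while a pointer fills its slot exactly.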
Let us look at one more program. What happens when we allocate strings on the stack?
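#include <string.h>

void function(int a, int b, int c) {
    char buffer1[5];
    char buffer2[10];
    /* buffer1 is left uninitialized, as in the original example;
       only the stack layout matters here. */
    strcpy(buffer2, buffer1);
}

int main(void) {
    function(1, 2, 3);
    return 0;
}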
After compiling with gcc -fno-stack-protector -o foo foo.c and disassembling with objdump -d foo, the assembly code of function() is as follows:
000000000000064a <function>:
64a: 55 push %rbp
64b: 48 89 e5 mov %rsp,%rbp
64e: 48 83 ec 20 sub $0x20,%rsp
652: 89 7d ec mov %edi,-0x14(%rbp)
655: 89 75 e8 mov %esi,-0x18(%rbp)
658: 89 55 e4 mov %edx,-0x1c(%rbp)
65b: 48 8d 55 fb lea -0x5(%rbp),%rdx
65f: 48 8d 45 f1 lea -0xf(%rbp),%rax
663: 48 89 d6 mov %rdx,%rsi
666: 48 89 c7 mov %rax,%rdi
669: e8 b2 fe ff ff callq 520 <strcpy@plt>
66e: 90 nop
66f: c9 leaveq
670: c3 retq
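This procedure allocates 32 bytes on the stack in total, which includes the space for the two strings and for the three function arguments. Note that although the first six arguments of a function are passed directly in registers on x64, a program sometimes needs the address of an argument; in that case it has no choice but to allocate memory for the argument on the stack and copy the argument there, so that operations on its address can work. In this example no optimization flag was passed to gcc, and unoptimized GCC code also stores every register argument into the stack frame, which is why a, b, and c appear there.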
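The case where an argument's address is needed can be illustrated with a small sketch. The functions `add_in_place` and `bump` are hypothetical and exist only for this illustration; whether the argument really ends up in the stack frame depends on the compiler and optimization level, since an optimizer may inline the call and keep everything in registers.

```c
#include <assert.h> /* assert */
#include <stddef.h> /* NULL */

/* Adds delta to the long that target points to. */
void add_in_place(long *target, long delta) {
    assert(target != NULL);
    *target += delta;
}

/* value arrives in a register, but a register has no address. Because
 * &value is handed to another function, the compiler generally has to give
 * value a home in memory, typically in this function's stack frame. */
long bump(long value) {
    add_in_place(&value, 1);
    return value;
}
```

Compiling this with `gcc -Og -c` and inspecting it with `objdump -d`, as done for the earlier examples, is one way to see whether `value` is copied from its register to the stack.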
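Relating this to the program, the stack layout of the procedure is as follows:
In the figure, because a char can start at any address (its address only has to be a multiple of 1), buffer1 and buffer2 are allocated contiguously, while the three int variables are placed in two separate 8-byte slots.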
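The layout can be cross-checked against the offsets in the disassembly of function(). The sketch below copies those offsets from the `lea` and `mov` instructions and redoes the arithmetic; the enum and function names are hypothetical, and the alignment comment assumes that %rbp itself is 16-byte aligned after `push %rbp`.

```c
#include <assert.h> /* static_assert */

/* Offsets from %rbp, read off the disassembly of function(). */
enum {
    BUFFER1_OFF = -0x05, /* lea -0x5(%rbp),%rdx  : buffer1, 5 bytes  */
    BUFFER2_OFF = -0x0f, /* lea -0xf(%rbp),%rax  : buffer2, 10 bytes */
    ARG_A_OFF   = -0x14, /* mov %edi,-0x14(%rbp) : a, 4 bytes */
    ARG_B_OFF   = -0x18, /* mov %esi,-0x18(%rbp) : b, 4 bytes */
    ARG_C_OFF   = -0x1c, /* mov %edx,-0x1c(%rbp) : c, 4 bytes */
    FRAME_BYTES = 0x20   /* sub $0x20,%rsp */
};

/* buffer1 ends exactly at %rbp, with no padding after it. */
static_assert(BUFFER1_OFF + 5 == 0, "buffer1 should end at %rbp");

/* buffer2 ends where buffer1 begins, so the two arrays are contiguous. */
int buffers_are_contiguous(void) {
    return BUFFER2_OFF + 10 == BUFFER1_OFF;
}

/* With %rbp 16-byte aligned, offsets that are multiples of 4 give
 * 4-byte-aligned addresses for the three int arguments. */
int int_offsets_are_4_aligned(void) {
    return ARG_A_OFF % 4 == 0 && ARG_B_OFF % 4 == 0 && ARG_C_OFF % 4 == 0;
}

/* Bytes of the 32-byte frame not occupied by the buffers or the arguments. */
int unused_frame_bytes(void) {
    return FRAME_BYTES - (5 + 10 + 3 * 4);
}
```

Here the two buffers take 15 bytes and the three arguments take 12, so `unused_frame_bytes()` comes to 5 of the 32 bytes reserved by `sub $0x20,%rsp`.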
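Summary
As we have seen, to satisfy the requirements (or conventions) of data alignment and stack byte alignment, the compiler is willing to sacrifice some memory, which improves both the compatibility and the performance of the program.
End
References:
- Computer Systems: A Programmer's Perspective (CSAPP)
- Analysis of the C Function Call Process (x86-64)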