We write a C source file, but what happens between that source file and the executable program? What do concepts such as compiling and linking actually mean? Curious about these questions, I did some reading. My main reference is the book "Programmer's Self-Cultivation", along with a few online blogs.
On Windows we usually just click Run or Debug to run a C program. This convenience hides the complexity of the mechanisms underneath, and I wanted to know what actually happens behind the scenes.
This article uses Ubuntu, but the concepts also apply on Windows.
1. The four stages of compiling a source file
Suppose we write a very simple program, helloworld.c:
#include <stdio.h>
int main(int argc, char *argv[])
{
    printf("Hello,World!\n");
    return 0;
}
We all know that running the command
gcc helloworld.c -o helloworld
compiles this file and produces an executable named helloworld. Running it then prints the greeting:
./helloworld
Hello,World!
But what actually happened behind the scenes?
Note: this article is not a rigorous treatment of compilation. It is my own attempt to sort out the process.
1.1 Preprocessing
The preprocessing stage can be understood simply as handling the directives that start with "#", for example:
#define, #include, #if, #elif, #else, #endif
The preprocessor acts on each directive according to its meaning: macros defined with #define are expanded in place, and each #include is replaced by the full contents of the named file.
You can run the command
gcc -E helloworld.c -o helloworld.i
to obtain the preprocessed file. Inspecting it shows that the preprocessor has indeed pasted in the contents of the files named by #include. The file also contains line-number markers, so that a later error message can point back to the original source location.
1.2 Compilation
This step compiles the *.i file from the previous step into assembly code. You can run the command
gcc -S helloworld.i -o helloworld.s
to obtain the assembly file. Part of it looks like this:
main:
...
leaq .LC0(%rip), %rcx
call puts
...
In main we only call printf, and the corresponding call instruction appears here (emitted as a call to puts). So this stage produces an assembly file.
1.3 Assembly
This step assembles the assembly code from the previous step into machine code for the target machine. You can run the command
gcc -c helloworld.s -o helloworld.o
The generated helloworld.o is called an object file. Let's examine it, since that helps in understanding the linking process.
1.3.1 The structure of an object file
The previous step produced an object file, but it has not been linked yet, so a few of its symbols are still unresolved. Take printf above: we cannot tell where the function is actually defined. The header file stdio.h only gives us its declaration, which tells us how to call it, but at run time the actual code is needed, so where does it come from? Finding printf and writing its address into our program is the job of linking.
The file types we commonly deal with are:
- Executable files (Executable File), such as .exe files on Windows or /bin/bash on Linux
- Shared object files (Shared Object File), such as .dll files on Windows or .so files on Linux
- Relocatable files (Relocatable File), such as the helloworld.o we generated above. "Relocatable" means that the addresses of some symbols in the program (function and variable names) have not been determined yet and must be fixed up in the linking stage
On Linux you can use the file command to check a file's format. Let's run
$ file helloworld.o
helloworld.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
So what exactly does an object file contain? First there is code, then data (the variables we define). In addition, the file contains a symbol table, which is the most important ingredient for the linking step that follows.
Run the command
$ readelf -S helloworld.o
to see the object file's section header table. For details about this table, see "Programmer's Self-Cultivation".
There are 13 section headers, starting at offset 0x2d8:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .text PROGBITS 0000000000000000 00000040
0000000000000022 0000000000000000 AX 0 0 1
[ 2] .rela.text RELA 0000000000000000 00000228
0000000000000030 0000000000000018 I 10 1 8
[ 3] .data PROGBITS 0000000000000000 00000062
0000000000000000 0000000000000000 WA 0 0 1
[ 4] .bss NOBITS 0000000000000000 00000062
0000000000000000 0000000000000000 WA 0 0 1
[ 5] .rodata PROGBITS 0000000000000000 00000062
000000000000000d 0000000000000000 A 0 0 1
[ 6] .comment PROGBITS 0000000000000000 0000006f
000000000000002c 0000000000000001 MS 0 0 1
[ 7] .note.GNU-stack PROGBITS 0000000000000000 0000009b
0000000000000000 0000000000000000 0 0 1
[ 8] .eh_frame PROGBITS 0000000000000000 000000a0
0000000000000038 0000000000000000 A 0 0 8
[ 9] .rela.eh_frame RELA 0000000000000000 00000258
0000000000000018 0000000000000018 I 10 8 8
[10] .symtab SYMTAB 0000000000000000 000000d8
0000000000000120 0000000000000018 11 9 8
[11] .strtab STRTAB 0000000000000000 000001f8
000000000000002e 0000000000000000 0 0 1
[12] .shstrtab STRTAB 0000000000000000 00000270
0000000000000061 0000000000000000 0 0 1
In the section table above we care about section number 2, .rela.text, the relocation table. As mentioned earlier, the linking stage has to relocate some of the symbols in a relocatable file, so the linker must know which locations need fixing up, and .rela.text records exactly that.
The symbol table contains several kinds of symbols:
- Symbols defined in this file, which other object files can reference
- Symbols referenced in this file but not defined in it
- ...
Let's run the command
$ nm helloworld.o
U _GLOBAL_OFFSET_TABLE_
0000000000000000 T main
U puts
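This shows the object file's symbol table, where we can see the two symbols main and puts. printf does not appear, most likely because the compiler replaced the call: GCC can turn a printf of a constant string that ends in a newline into a call to puts.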
Let's run another command to view the detailed symbol table:
$ readelf -s helloworld.o
Symbol table '.symtab' contains 12 entries:
Num: Value Size Type Bind Vis Ndx Name
......
9: 0000000000000000 34 FUNC GLOBAL DEFAULT 1 main
10: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND _GLOBAL_OFFSET_TABLE_
11: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND puts
Here are the same two familiar symbols. main is defined in this file, so its type is FUNC (function), and Ndx=1 says it lives in section 1, the .text code section. puts is not defined here, so its Ndx is UND (undefined). The symbol table therefore tells us which symbols are defined in this file and which ones must be resolved at link time.
1.4 Linking
Now that we know the symbol table exists, let's look at the linking process in detail.
Suppose we have two files, a.c and b.c. The example comes from "Programmer's Self-Cultivation".
/* a.c */
extern int shared;
int main(){
    int a=100;
    swap(&a, &shared);
    return 0;
}
/* b.c */
int shared = 1; // a global with external linkage by default, so other files can access it
void swap(int *a, int *b){
    *a ^= *b ^= *a ^= *b; // swap values
}
First compile the two files with gcc
$ gcc -c a.c b.c
This gives us two files, a.o and b.o. Let's look at the symbol table of each
$ readelf -s a.o
Symbol table '.symtab' contains 13 entries:
Num: Value Size Type Bind Vis Ndx Name
......
8: 0000000000000000 81 FUNC GLOBAL DEFAULT 1 main
9: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND shared
11: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND swap
$ readelf -s b.o
Symbol table '.symtab' contains 10 entries:
Num: Value Size Type Bind Vis Ndx Name
......
8: 0000000000000000 4 OBJECT GLOBAL DEFAULT 2 shared
9: 0000000000000000 75 FUNC GLOBAL DEFAULT 1 swap
So a.o defines only one global symbol, main, while shared and swap are undefined there. In b.o, on the other hand, shared and swap are both defined.
The link command we would use is
$ ld a.o b.o -e main -o ab
- -e main sets main as the entry point
- -o sets the output file name
Next, let's compare the section addresses before and after linking
$ objdump -h a.o
a.o: file format elf64-x86-64
Sections:
Idx Name Size VMA LMA File off Algn
0 .text 00000051 0000000000000000 0000000000000000 00000040 2**0
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
1 .data 00000000 0000000000000000 0000000000000000 00000091 2**0
CONTENTS, ALLOC, LOAD, DATA
......
$ objdump -h b.o
b.o: file format elf64-x86-64
Sections:
Idx Name Size VMA LMA File off Algn
0 .text 0000004b 0000000000000000 0000000000000000 00000040 2**0
CONTENTS, ALLOC, LOAD, READONLY, CODE
1 .data 00000004 0000000000000000 0000000000000000 0000008c 2**2
CONTENTS, ALLOC, LOAD, DATA
......
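I tried several times to run the command
$ ld a.o b.o -e main -o ab
but it fails with an error
a.o: In function `main':
a.c:(.text+0x4b): undefined reference to `__stack_chk_fail'
The likely cause is that gcc on Ubuntu enables the stack protector by default, so main calls __stack_chk_fail, which is defined in the C library that a bare ld command does not link in. Compiling with -fno-stack-protector should avoid the reference. Here I simply let gcc drive the link instead
$ gcc a.o b.o -o ab
The resulting file is not the same as the one in the book. Its sections are as follows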
Sections:
Idx Name Size VMA LMA File off Algn
......
13 .text 00000222 0000000000000560 0000000000000560 00000560 2**4
......
22 .data 00000014 0000000000201000 0000000000201000 00001000 2**3
CONTENTS, ALLOC, LOAD, DATA
23 .bss 00000004 0000000000201014 0000000000201014 00001014 2**0
ALLOC
24 .comment 0000002b 0000000000000000 0000000000000000 00001014 2**0
CONTENTS, READONLY
Still, we can see that VMAs (virtual memory addresses) have now been assigned, whereas in a.o and b.o they were all zero.
The point of the link step is to combine the two object files into a single file. Each function then has a definite position in the combined file, so every symbol can be given an address.
Run the command
$ readelf -s ab
to see the symbol table (only the relevant entries are listed)
Symbol table '.symtab' contains 66 entries:
Num: Value Size Type Bind Vis Ndx Name
59: 000000000000066a 81 FUNC GLOBAL DEFAULT 14 main
62: 00000000000006bb 75 FUNC GLOBAL DEFAULT 14 swap
65: 0000000000201010 4 OBJECT GLOBAL DEFAULT 23 shared
The relevant symbols now have concrete addresses, which means the linking process is complete.
Next, let's look at the disassembly of the linked file
$ objdump -d ab
000000000000066a <main>:
66a: 55 push %rbp
66b: 48 89 e5 mov %rsp,%rbp
66e: 48 83 ec 10 sub $0x10,%rsp
672: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
679: 00 00
67b: 48 89 45 f8 mov %rax,-0x8(%rbp)
67f: 31 c0 xor %eax,%eax
681: c7 45 f4 64 00 00 00 movl $0x64,-0xc(%rbp)
688: 48 8d 45 f4 lea -0xc(%rbp),%rax
68c: 48 8d 35 7d 09 20 00 lea 0x20097d(%rip),%rsi # 201010 <shared>
693: 48 89 c7 mov %rax,%rdi
696: b8 00 00 00 00 mov $0x0,%eax
69b: e8 1b 00 00 00 callq 6bb <swap> # <swap> 6bb
6a0: b8 00 00 00 00 mov $0x0,%eax
6a5: 48 8b 55 f8 mov -0x8(%rbp),%rdx
6a9: 64 48 33 14 25 28 00 xor %fs:0x28,%rdx
6b0: 00 00
6b2: 74 05 je 6b9 <main+0x4f>
6b4: e8 87 fe ff ff callq 540 <__stack_chk_fail@plt>
6b9: c9 leaveq
6ba: c3 retq
Notice that the addresses of the function swap and the variable shared have been filled in correctly. For comparison, here is the same code before linking
$ objdump -d a.o
a.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <main>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 83 ec 10 sub $0x10,%rsp
8: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
f: 00 00
11: 48 89 45 f8 mov %rax,-0x8(%rbp)
15: 31 c0 xor %eax,%eax
17: c7 45 f4 64 00 00 00 movl $0x64,-0xc(%rbp)
1e: 48 8d 45 f4 lea -0xc(%rbp),%rax
22: 48 8d 35 00 00 00 00 lea 0x0(%rip),%rsi # 29 <main+0x29>
29: 48 89 c7 mov %rax,%rdi
2c: b8 00 00 00 00 mov $0x0,%eax
31: e8 00 00 00 00 callq 36 <main+0x36>
36: b8 00 00 00 00 mov $0x0,%eax
3b: 48 8b 55 f8 mov -0x8(%rbp),%rdx
3f: 64 48 33 14 25 28 00 xor %fs:0x28,%rdx
46: 00 00
48: 74 05 je 4f <main+0x4f>
4a: e8 00 00 00 00 callq 4f <main+0x4f>
4f: c9 leaveq
50: c3 retq
Note the instructions at offsets 0x22 and 0x31, which refer to shared and swap respectively. The second column is the hexadecimal encoding of each instruction, and the address field in each of these instructions takes four bytes. In a.o those fields are all 0: at this point the real addresses cannot be determined, so the compiler fills in the placeholder 0x0, and the correct values are written in during the final link stage.
We can also run the command
$ objdump -r a.o
a.o: file format elf64-x86-64
RELOCATION RECORDS FOR [.text]:
OFFSET TYPE VALUE
0000000000000025 R_X86_64_PC32 shared-0x0000000000000004
0000000000000032 R_X86_64_PLT32 swap-0x0000000000000004
000000000000004b R_X86_64_PLT32 __stack_chk_fail-0x0000000000000004
Here OFFSET gives the position inside .text that has to be relocated.
2. Summary
"Programmer's Self-Cultivation" explores these details in great depth, and fully understanding all of it is hard.
Here I only summarize the main points about linking. Roughly, the process is:
- The linker receives the input files
- It collects the section table of each input file and builds a global symbol table containing all defined symbols
- For static linking, it merges the input files and allocates address space; after this step the address of every symbol is fixed
- It then patches every location in the input files that needs relocation with the correct address