Detailed explanation of the C compilation and linking process
Source File
main.c
#include <stdio.h>
extern int data;
extern int add(int a, int b);
int a1;
int a2 = 0;
int a3 = 10;
static int b1;
static int b2 = 0;
static int b3 = 20;
int main()
{
    int c1;
    int c2 = 0;
    int c3 = 30;
    static int d1;
    static int d2 = 0;
    static int d3 = 40;
    c1 = data;
    c2 = add(a1, a2);
    while (1);
    return 0;
}
add.c
int data = 3;
int add(int a, int b)
{
    return a + b;
}
Two major processes: compilation and linking
1. Compilation process:
- Preprocessing (.i)
  - Process preprocessing directives that start with #: #include, #define, #ifndef, #if, #else, etc.
  - Remove comments, add line-number and file-name markers, etc.
  Command: gcc -E main.c -o main.i generates the .i file
- Compilation (.s)
  Compile the .i file into a .s assembly file
  Command: gcc -S main.i generates the .s file
- Assembly (.o)
  Translate the assembly file into a binary relocatable file, i.e. a .o file
  Command: gcc -c main.s generates the .o file
PS: The gcc command is just a driver for several background programs. It invokes different programs depending on its options:
- Preprocessing and compilation are combined into one step by the program cc1; you can also generate a .s file with the following command
  cc1 hello.c
  which is equivalent to gcc -S hello.c -o hello.s
- The assembler, as
- The linker, ld
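To make the preprocessing stage concrete, here is a minimal sketch of a file that uses the directives listed above. The names ELEM_COUNT, SCALE, SQUARE and sum_scaled_squares are hypothetical and chosen only for illustration. Running gcc -E on such a file should show the macros already expanded and the comments removed.

```c
#include <assert.h>
#include <stddef.h>

#define ELEM_COUNT 4              /* object-like macro */
#define SQUARE(x) ((x) * (x))     /* function-like macro */

#ifndef SCALE                     /* conditional compilation: default unless -DSCALE=... is given */
#define SCALE 2
#endif

/* Sum SCALE * v[i]^2 over ELEM_COUNT elements.
 * With the default SCALE, the preprocessed loop body reads:
 *     total += 2 * ((v[i]) * (v[i]));                        */
int sum_scaled_squares(const int *v)
{
    int total = 0;
    assert(v != NULL);
    for (int i = 0; i < ELEM_COUNT; i++)
        total += SCALE * SQUARE(v[i]);
    return total;
}
```

For the array {1, 2, 3, 4} and the default SCALE, the function returns 2 * (1 + 4 + 9 + 16) = 60.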
Analyze binary relocatable files
main.c file
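#include <stdio.h>
int a1;                 /* uninitialized global: *COM* */
int a2 = 0;             /* global initialized to 0: .bss */
int a3 = 10;            /* initialized global: .data */
static int b1;          /* uninitialized static global: .bss */
static int b2 = 0;      /* static global initialized to 0: .bss */
static int b3 = 20;     /* initialized static global: .data */
int main(void)
{
    int c1;             /* c1, c2, c3 live on the stack: no symbol table entries */
    int c2 = 0;
    int c3 = 30;
    static int d1;      /* uninitialized static local: .bss */
    static int d2 = 0;  /* static local initialized to 0: .bss */
    static int d3 = 40; /* initialized static local: .data */
    return 0;
}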
Compile command: build 32-bit .o files on a 64-bit machine
gcc -m32 -fno-PIC -c *.c
-m32 generates 32-bit output; -fno-PIC disables position-independent code generation, so the object file does not contain the extra sections that PIC brings in (leaving only .text, .data, .bss, .comment, etc.)
1. Read the ELF file header
$ readelf -h main.o
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: REL (Relocatable file)
Machine: ARM
Version: 0x1
Entry point address: 0x0
Start of program headers: 0 (bytes into file)
Start of section headers: 268 (bytes into file)
Flags: 0x5000000, Version5 EABI
Size of this header: 52 (bytes)
Size of program headers: 0 (bytes)
Number of program headers: 0
Size of section headers: 40 (bytes)
Number of section headers: 10
Section header string table index: 7
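(1) Magic number: the first four bytes are 0x7f followed by the ASCII characters "ELF"
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
(2) Type: REL (relocatable file)
(3) Entry point address: 0x0 (a relocatable file has no entry point yet)
(4) Start of section headers: 268 (bytes into file)
(5) Size of this header: 52 (bytes)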
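The fields highlighted above can also be read directly from the first 52 bytes of the file. The following is a minimal sketch, assuming a 32-bit little-endian ELF file; the struct and function names (elf32_summary, parse_elf32_header) are hypothetical, and the byte offsets follow the ELF32 header layout (e_type at 16, e_entry at 24, e_shoff at 32, e_ehsize at 40, e_shentsize at 46, e_shnum at 48, e_shstrndx at 50).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Subset of the ELF32 header fields discussed above (illustrative names). */
struct elf32_summary {
    uint16_t type;       /* 1 = REL, 2 = EXEC */
    uint32_t entry;      /* entry point address */
    uint32_t shoff;      /* start of section headers */
    uint16_t ehsize;     /* size of this header */
    uint16_t shentsize;  /* size of one section header */
    uint16_t shnum;      /* number of section headers */
    uint16_t shstrndx;   /* section header string table index */
};

static uint16_t le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 on success, -1 if buf is not a little-endian ELF32 header. */
int parse_elf32_header(const unsigned char *buf, size_t len,
                       struct elf32_summary *out)
{
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };

    assert(buf != NULL && out != NULL);
    if (len < 52 || memcmp(buf, magic, 4) != 0)
        return -1;
    if (buf[4] != 1 || buf[5] != 1)   /* class must be ELF32, data little endian */
        return -1;
    out->type      = le16(buf + 16);
    out->entry     = le32(buf + 24);
    out->shoff     = le32(buf + 32);
    out->ehsize    = le16(buf + 40);
    out->shentsize = le16(buf + 46);
    out->shnum     = le16(buf + 48);
    out->shstrndx  = le16(buf + 50);
    return 0;
}
```

In practice the buffer would come from reading the start of main.o with fread; a real tool would more likely use the Elf32_Ehdr type from <elf.h> instead of manual offsets.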
2. Read the section headers of the ELF file (used during linking)
$ readelf -S main.o
There are 12 section headers, starting at offset 0x2ec:
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 00000000 000034 000044 00 AX 0 0 1
[ 2] .rel.text REL 00000000 00026c 000020 08 I 9 1 4
[ 3] .data PROGBITS 00000000 000078 00000c 00 WA 0 0 4
[ 4] .bss NOBITS 00000000 000084 000014 00 WA 0 0 4
[ 5] .comment PROGBITS 00000000 000084 00002a 01 MS 0 0 1
[ 6] .note.GNU-stack PROGBITS 00000000 0000ae 000000 00 0 0 1
[ 7] .eh_frame PROGBITS 00000000 0000b0 00003c 00 A 0 0 4
[ 8] .rel.eh_frame REL 00000000 00028c 000008 08 I 9 7 4
[ 9] .symtab SYMTAB 00000000 0000ec 000140 10 10 14 4
[10] .strtab STRTAB 00000000 00022c 000040 00 0 0 1
[11] .shstrtab STRTAB 00000000 000294 000057 00 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
p (processor specific)
There are 12 section headers, and the section header table starts at file offset 0x2ec
The Off and Size columns give the file offset and size of each section
3. Print the contents of the sections
~ $ objdump -s main.o
main.o: file format elf32-i386
Contents of section .text:
0000 8d4c2404 83e4f0ff 71fc5589 e55183ec .L$.....q.U..Q..
0010 14c745ec 00000000 c745f01e 000000a1 ..E......E......
0020 00000000 8945f48b 15000000 00a10000 .....E..........
0030 000083ec 085250e8 fcffffff 83c41089 .....RP.........
0040 45ecebfe E...
Contents of section .data:
0000 0a000000 14000000 28000000 ........(...
Contents of section .comment:
0000 00474343 3a202855 62756e74 7520372e .GCC: (Ubuntu 7.
0010 352e302d 33756275 6e747531 7e31382e 5.0-3ubuntu1~18.
0020 30342920 372e352e 3000 04) 7.5.0.
Contents of section .eh_frame:
0000 14000000 00000000 017a5200 017c0801 .........zR..|..
0010 1b0c0404 88010000 20000000 1c000000 ........ .......
0020 00000000 44000000 00440c01 00471005 ....D....D...G..
0030 02750043 0f03757c 06000000 .u.C..u|....
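The .data dump above holds the bytes 0a000000 14000000 28000000. Decoded as 32-bit little-endian integers, they are 10, 20 and 40, which matches the initial values of a3, b3 and d3 in main.c. A minimal sketch of that decoding follows; the function name section_word_le is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode the idx-th 32-bit little-endian word from a raw section dump. */
uint32_t section_word_le(const unsigned char *bytes, size_t len, size_t idx)
{
    assert(bytes != NULL && (idx + 1) * 4 <= len);
    const unsigned char *p = bytes + idx * 4;
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

Feeding it the 12 bytes of .data yields 10 for word 0, 20 for word 1 and 40 for word 2.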
4. Read the .o file symbol table
~ $ objdump -t main.o
main.o: file format elf32-little
SYMBOL TABLE:
00000000 l df *ABS* 00000000 main.c
00000000 l d .text 00000000 .text
00000000 l d .data 00000000 .data
00000000 l d .bss 00000000 .bss
00000004 l O .bss 00000004 b1
00000008 l O .bss 00000004 b2
00000004 l O .data 00000004 b3
00000008 l O .data 00000004 d3.1881
0000000c l O .bss 00000004 d2.1880
00000010 l O .bss 00000004 d1.1879
00000000 l d .note.GNU-stack 00000000 .note.GNU-stack
00000000 l d .eh_frame 00000000 .eh_frame
00000000 l d .comment 00000000 .comment
00000004 O *COM* 00000004 a1
00000000 g O .bss 00000004 a2
00000000 g O .data 00000004 a3
00000000 g F .text 00000044 main
00000000 *UND* 00000000 data
00000000 *UND* 00000000 add
The table records which section each symbol belongs to and how many bytes it occupies. a1 is marked *COM* (a common symbol), which makes it a weak symbol: an uninitialized non-static global variable that may also be defined under the same name in another file. (This output comes from GCC 7.5; GCC 10 and later default to -fno-common and place such variables in .bss instead.)
The two symbols data and add are marked *UND*, meaning undefined: their definitions are not in this file and must be found in other object files at link time.
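The *UND*, *ABS* and *COM* tags correspond to reserved values of the st_shndx field in each 16-byte symbol table entry. Below is a minimal sketch of that mapping, assuming raw little-endian Elf32_Sym entries; the function names are hypothetical. Note that the "weak symbol" in the text above is a global symbol with the common index, which is a different thing from the ELF binding value 2 (weak).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SYM_ENTRY_SIZE = 16 };  /* matches ES = 0x10 of .symtab */

/* Classify one raw little-endian Elf32_Sym entry by its st_shndx
 * field (bytes 14..15 of the entry). */
const char *symbol_home(const unsigned char *entry)
{
    assert(entry != NULL);
    uint16_t shndx = (uint16_t)(entry[14] | (entry[15] << 8));
    switch (shndx) {
    case 0x0000: return "*UND*";   /* undefined here, defined in another file */
    case 0xfff1: return "*ABS*";   /* absolute value, e.g. the source file name */
    case 0xfff2: return "*COM*";   /* common: uninitialized non-static global */
    default:     return "section"; /* index of a regular section such as .data */
    }
}

/* Binding is the high 4 bits of st_info (byte 12): 0 local, 1 global, 2 weak. */
int symbol_binding(const unsigned char *entry)
{
    assert(entry != NULL);
    return entry[12] >> 4;
}
```

For example, an entry for a1 would carry binding 1 (global) and index 0xfff2, while b3 would carry binding 0 (local) and the index of .data.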
5. Based on the section headers, draw the layout of the binary relocatable file (.o file)
The .bss and .comment sections start at the same file offset (0x84). Adding up the offsets and sizes shows that .bss occupies no space in the .o file; only its size is recorded in the section header, and its symbols are recorded in the symbol table.
The .bss section holds global variables and static local variables that are uninitialized or initialized to 0. Their initial value is always 0, so storing them in the .o file would only waste space; it is enough to record them in the symbol table. When the executable is finally loaded and run, memory for the .bss symbols is allocated in the process's virtual address space.
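The observation about .bss can be checked with simple arithmetic on the Off, Size and Al columns of the section header table. The sketch below is one way to express it; the helper names align_up and next_section_offset are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round off up to the next multiple of align (align must be a power of two). */
uint32_t align_up(uint32_t off, uint32_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (off + align - 1) & ~(align - 1);
}

/* File offset of the section that follows one starting at off.
 * A NOBITS section such as .bss contributes 0 bytes to the file. */
uint32_t next_section_offset(uint32_t off, uint32_t size, int is_nobits,
                             uint32_t next_align)
{
    uint32_t end = off + (is_nobits ? 0 : size);
    return align_up(end, next_align);
}
```

With the values from readelf -S main.o: .text (0x34 + 0x44) leads to .data at 0x78, .data (0x78 + 0x0c) leads to .bss at 0x84, and .bss adds nothing, so .comment also starts at 0x84. The last section, .shstrtab, ends at 0x294 + 0x57 = 0x2eb, which rounds up to 0x2ec, the start of the section header table.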
2. Link process:
Commands for building 32-bit object files and a 32-bit executable on a 64-bit x86 machine
Compile:
gcc -m32 -fno-PIC -c *.c
Link manually:
ld -e main -melf_i386 *.o -o run
The following files are generated:
$ ls
add.c add.o main.c main.o run
PS:
-m32 generates 32-bit output;
-fno-PIC disables position-independent code generation, so the object files do not contain the extra PIC-related sections (leaving only .text, .data, .bss, .comment, etc.)
-e specifies the program entry point; follow -e with a symbol name. You could also use the add function as the entry point, i.e. -e add
-melf_i386 tells the linker to produce a 32-bit x86 executable
The essence of linking is to "glue" multiple object files together. What is actually stitched together are the address references between object files, that is, references to functions and global variables.
The symbol table is a section of the .o file named .symtab. Commands for viewing the symbol table:
readelf -s main.o
objdump -t main.o
nm main.o
What the symbol table contains (focus mainly on 1 and 2):
(1) Global symbols defined in this object file, such as variable names and function names.
(2) Symbols that are referenced in this object file but defined in other object files, generally called external symbols.
(3) Section names, such as ".text" and ".data".
(4) Local symbols, visible only inside the compilation unit. A debugger can use them to analyze the program or a core dump after a crash. The linker often ignores them during linking.
$ objdump -t main.o
main.o: file format elf32-i386
SYMBOL TABLE:
00000000 l df *ABS* 00000000 main.c
00000000 l d .text 00000000 .text
00000000 l d .data 00000000 .data
00000000 l d .bss 00000000 .bss
00000004 l O .bss 00000004 b1
00000008 l O .bss 00000004 b2
00000004 l O .data 00000004 b3
00000008 l O .data 00000004 d3.1877
0000000c l O .bss 00000004 d2.1876
00000010 l O .bss 00000004 d1.1875
00000000 l d .note.GNU-stack 00000000 .note.GNU-stack
00000000 l d .eh_frame 00000000 .eh_frame
00000000 l d .comment 00000000 .comment
00000004 O *COM* 00000004 a1
00000000 g O .bss 00000004 a2
00000000 g O .data 00000004 a3
00000000 g F .text 00000016 main
1. Merge the sections of all .o files
As shown in the figure above, the .text sections are merged, the .data sections are merged, and the .bss sections are merged. During this step each weak symbol is either turned into a strong symbol or replaced by a strong definition from another file, so the .bss section grows: main.o's .bss is 0x14 bytes, while the .bss of run is 0x18 bytes, which is consistent with a1 (*COM*, 4 bytes) being allocated there.
After linking, each section of the generated executable is assigned a virtual memory address.
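The merged sizes can be reproduced with a small calculation. This is only a sketch of back-to-back placement with alignment padding, not the linker's actual algorithm, and the function name merged_size is hypothetical. The section sizes of add.o are not shown in this article; the values used in the note below (0x0d bytes of .text, 0x04 bytes of .data) are assumptions inferred from the totals reported for run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size of an output section built by placing input sections back to back,
 * each one padded to start at a multiple of align (a power of two). */
uint32_t merged_size(const uint32_t *sizes, size_t count, uint32_t align)
{
    assert(sizes != NULL && align != 0 && (align & (align - 1)) == 0);
    uint32_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total = (total + align - 1) & ~(align - 1);  /* pad to alignment */
        total += sizes[i];
    }
    return total;
}
```

Under those assumptions, .text is 0x0d + 0x44 = 0x51, .data is 0x04 + 0x0c = 0x10, and .bss is 0x14 + 0x04 (for a1) = 0x18, matching the sizes that readelf -S run reports.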
2. Merge symbol tables, resolve symbols, and relocate
- Merge symbol tables
The symbol table of the executable is essentially the combination of the symbol tables of the individual .o files.
- Symbol resolution
Convert weak symbols (*COM*) into strong symbols
Match each symbol that is undefined in one file (*UND*) with its definition in another file
- Relocation
Assign a virtual memory address to each symbol: the address is the section's address plus the symbol's offset within that section. Every reference to the symbol is then patched with the final address.
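The relocation step can be sketched with the standard i386 formulas S + A (absolute) and S + A - P (PC-relative), where S is the symbol address, A the addend stored in the field, and P the address of the field being patched. The relocation entries themselves are not shown in this article, so the example values in the note below are assumptions: the call to add is taken to be the e8 fcffffff sequence at offset 0x37 of main.o's .text (addend -4), and add is taken to sit at the start of the merged .text, which is inferred from main being at 0x80480a1. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Final address of a symbol: base of its output section plus its offset there. */
uint32_t symbol_address(uint32_t section_base, uint32_t offset)
{
    assert(section_base + offset >= section_base);  /* no 32-bit wrap-around */
    return section_base + offset;
}

/* Absolute relocation (R_386_32 style): S + A. */
uint32_t reloc_abs32(uint32_t sym, int32_t addend)
{
    return sym + (uint32_t)addend;
}

/* PC-relative relocation (R_386_PC32 style): S + A - P,
 * where P is the address of the 4-byte field being patched. */
uint32_t reloc_pc32(uint32_t sym, int32_t addend, uint32_t place)
{
    return sym + (uint32_t)addend - place;
}
```

Under the stated assumptions, main is at 0x08048094 + 0x0d = 0x080480a1, the call operand is at 0x080480a1 + 0x38 = 0x080480d9, and the patched field becomes 0x08048094 - 4 - 0x080480d9 = 0xffffffb7. The CPU adds that displacement to the address of the next instruction (0x080480dd) and lands on 0x08048094.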
Executable file analysis
1. View the file header
$ readelf -h run
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: Intel 80386
Version: 0x1
Entry point address: 0x80480a1
Start of program headers: 52 (bytes into file)
Start of section headers: 4676 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 9
Section header string table index: 8
The entry point address is 0x80480a1, the address of main (selected with -e main).
2. View section information
$ readelf -S run
There are 9 section headers, starting at offset 0x1244:
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[ 0] NULL 00000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 08048094 000094 000051 00 AX 0 0 1
[ 2] .eh_frame PROGBITS 080480e8 0000e8 00005c 00 A 0 0 4
[ 3] .data PROGBITS 0804a000 001000 000010 00 WA 0 0 4
[ 4] .bss NOBITS 0804a010 001010 000018 00 WA 0 0 4
[ 5] .comment PROGBITS 00000000 001010 000029 01 MS 0 0 1
[ 6] .symtab SYMTAB 00000000 00103c 000170 10 7 14 4
[ 7] .strtab STRTAB 00000000 0011ac 000059 00 0 0 1
[ 8] .shstrtab STRTAB 00000000 001205 00003f 00 0 0 1
Each section that is loaded into memory (flag A) has been assigned a virtual address; sections such as .comment and .symtab keep address 0.
3. View program headers
$ readelf -l run
Elf file type is EXEC (Executable file)
Entry point 0x80480a1
There are 3 program headers, starting at offset 52
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x000000 0x08048000 0x08048000 0x00144 0x00144 R E 0x1000
LOAD 0x001000 0x0804a000 0x0804a000 0x00010 0x00028 RW 0x1000
GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RW 0x10
Section to Segment mapping:
Segment Sections...
00 .text .eh_frame
01 .data .bss
02
Binary relocatable files have only section headers; program headers appear in executable files (and shared objects). The program headers show the virtual address and alignment of each segment (one page is 4 KB, hence the 0x1000 alignment).
Sections are merged into segments according to their permissions: read-only and executable (.text + .rodata; here .text + .eh_frame), readable and writable (.data + .bss), etc.
Use readelf -l to view the segments of an ELF file (they are used for loading).
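Two properties of the LOAD entries above can be checked with a few lines of code. This is a minimal sketch with hypothetical function names: the first helper computes how many bytes of a segment exist only in memory (the zero-filled part that holds .bss), and the second checks the ELF rule that a loadable segment's file offset and virtual address must be congruent modulo its alignment.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes the loader must zero-fill after the file-backed part of a segment
 * (this is where .bss ends up). */
uint32_t zero_fill_bytes(uint32_t filesz, uint32_t memsz)
{
    assert(memsz >= filesz);
    return memsz - filesz;
}

/* A LOAD segment can be mapped page by page only if its file offset and
 * virtual address are congruent modulo the alignment. */
int load_is_congruent(uint32_t offset, uint32_t vaddr, uint32_t align)
{
    assert(align != 0);
    return (offset % align) == (vaddr % align);
}
```

For the second LOAD entry, MemSiz 0x28 minus FileSiz 0x10 gives 0x18 bytes, exactly the size of .bss in run; both LOAD entries satisfy the congruence check with Align 0x1000.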
PS: Because we linked manually and did not link the C library, the executable contains relatively few sections.
* If you run gcc main.c -o main directly, the C library is linked by default, and the executable contains many more sections.
* Executable files are loaded into a process by execve.
* An executable can run because it specifies an entry address (here, main) and program headers (which specify the virtual addresses to load at).
* The structure that describes a segment is called a program header; it tells the operating system how to map the ELF file into the virtual address space of the process.