Program Compilation Process
Computer programming languages are usually divided into three categories: machine language, assembly language, and high-level languages. A high-level language must be translated into machine language before it can be executed, and there are two ways to do this: compilation and interpretation. High-level languages are therefore loosely grouped into compiled languages, such as C and C++ (Java is also compiled, but to bytecode that runs on a virtual machine), and interpreted languages, such as Python, Ruby, MATLAB, and JavaScript.
This article describes how a program written in C/C++ is converted into binary code that the processor can execute. The process has four steps:
- Preprocessing: preprocessor cpp, role: .c -> .i
- Compilation: compiler cc1, role: .i -> .s
- Assembly: assembler as, role: .s -> .o
- Linking: linker ld, role: .o -> ELF executable file
Disclaimer
Some of the content comes from the Internet; it will be removed on request in case of infringement.
GCC Common Toolchain
GCC is short for the GNU Compiler Collection, a commonly used compilation tool on Linux systems. The GCC toolchain includes GCC itself, Binutils, the C runtime library, and so on.
GCC
GCC (originally the GNU C Compiler) is the compiler proper. It carries out the conversion of a program written in C/C++ into binary code that the processor can execute.
Binutils
A set of tools for processing binary programs, including addr2line, ar, objcopy, objdump, as, ld, readelf, size, and others. The ldd command is covered here as well, although it is distributed with the C library (glibc) rather than with Binutils. These tools are indispensable for development and debugging; brief introductions follow:
- addr2line: Converts a program address into the corresponding source file and line number, and can also report the enclosing function. It helps locate the source code for an address during debugging.
- as: The assembler. Assembly is described in detail later.
- ld: The linker. Linking is described in detail later.
- ar: Mainly used to create static libraries. To help beginners, the concepts of dynamic and static libraries are introduced here:
- Multiple .o object files can be bundled into a library file. There are two types of library: static and dynamic (shared).
- On Windows, a static library has the suffix .lib and a shared library has the suffix .dll. On Linux, a static library has the suffix .a and a shared library has the suffix .so.
- The two differ in when their code is loaded. The code of a static library is copied into the executable at link time, so the executable is relatively large. The code of a shared library is loaded into memory when the executable runs and is only referenced at link time, so the executable is smaller.
- If several programs that run at the same time use the same libraries, dynamic libraries save memory, because one copy of the library code can be shared among them.
- ldd: can be used to view the shared libraries that an executable program depends on.
- objcopy: Converts an object file from one format to another, for example .bin to .elf or .elf to .bin.
- objdump: The main function is to disassemble. For a detailed introduction to disassembly, see the following text.
- readelf: Display information about ELF files, see below for more information.
- size: Lists the size of each part of an executable file (code segment, data segment, and so on) together with the total. A usage example appears later.
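To make the static versus shared distinction from the ar entry concrete, here is a minimal sketch. The file name mylib.c, the function mylib_add, and the library name libmylib are all made up for illustration; the commands in the comment are the usual GCC and ar invocations on a typical Linux setup, not output taken from the author's machine.

```c
/* mylib.c: a hypothetical one-function library (all names are illustrative).
 *
 * Static library (the code is copied into the executable at link time):
 *   gcc -c mylib.c -o mylib.o
 *   ar rcs libmylib.a mylib.o
 *
 * Shared library (the code is loaded when the program starts):
 *   gcc -fPIC -shared mylib.c -o libmylib.so
 *
 * Linking a program against either one:
 *   gcc main.c -L. -lmylib -o main
 */
#include <assert.h>
#include <limits.h>

/* Add two non-negative ints; the assert documents the no-overflow precondition. */
int mylib_add(int a, int b)
{
    assert(a >= 0 && b >= 0 && a <= INT_MAX - b);
    return a + b;
}
```

A caller only needs the declaration int mylib_add(int a, int b); which of the two library files supplies the definition is decided when the program is linked.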
C runtime library
The C language standard consists of two main parts: one describes the syntax of C, and the other describes the C standard library. The C standard library defines a set of standard header files, each containing related function declarations, variables, type declarations, and macro definitions. For example, the familiar printf function is a C standard library function whose prototype is declared in the stdio.h header file.
The C language standard only specifies the prototypes and behavior of the standard library functions; it does not provide implementations. A C compiler therefore usually needs the support of a C runtime library (C Run Time Library, CRT), often simply called the C runtime. Like C, C++ defines its own standard and provides a supporting library, called the C++ runtime library.
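The split between prototype and implementation can be seen in a small sketch. The function say below is a made-up name; the code deliberately leaves out stdio.h and writes the standard prototype of puts by hand, on the assumption that the C runtime library supplies the definition at link time, as it does on an ordinary Linux system.

```c
#include <assert.h>   /* used only for the precondition check below */

/* Prototype of puts written by hand, exactly as the standard specifies it.
 * There is no definition in this file: the linker finds it in the
 * C runtime library. */
int puts(const char *s);

/* Print one line; returns 1 if the library call succeeded, 0 otherwise. */
int say(const char *msg)
{
    assert(msg != 0);
    /* puts returns a non-negative value on success and EOF on error. */
    return puts(msg) >= 0;
}
```

In real code the header is still the right way to obtain the prototype; the hand-written declaration only illustrates that the header carries no implementation.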
ENV
Since the GCC tool chain is mainly used in the Linux environment, this article will also use the Linux system as the working environment. In order to demonstrate the whole process of compilation, this section first prepares a simple Hello program written in C language as an example, and its source code is as follows:
#include<stdio.h>
int main(int argc, char *argv[])
{
    printf("hello world\r\n");
    return 0;
}
compilation process
preprocessing
Preprocessing mainly involves the following steps:
- Delete all #define directives and expand all macros, and process all conditional compilation directives, such as #if, #ifdef, #elif, #else, and #endif.
- Process #include directives by inserting the included file at the position of the directive.
- Remove all comments, both "//" and "/* */".
- Add line numbers and file identifiers so that later stages can produce line numbers for debugging information and for compilation errors and warnings.
- Keep all #pragma directives, because the compiler needs them in subsequent passes.
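The steps above can be tried on a small input. The following is a sketch only: the file name pp_demo.c, the macros BUF_SIZE, SQUARE, SCALE, and USE_BIG_BUFFER, and the function scaled_area are all invented for illustration.

```c
/* pp_demo.c: hypothetical input for `gcc -E pp_demo.c -o pp_demo.i`. */
#include <assert.h>              /* replaced by the contents of the header */

#define BUF_SIZE 64              /* object-like macro: replaced by 64 */
#define SQUARE(x) ((x) * (x))    /* function-like macro: expanded in place */

#ifdef USE_BIG_BUFFER            /* not defined here, so the #else branch is kept */
#define SCALE 4
#else
#define SCALE 1
#endif

// This comment and the block comments above are removed by the preprocessor.
int scaled_area(int side)
{
    assert(side >= 0 && side <= BUF_SIZE);
    /* After -E the next line reads: return ((side) * (side)) * 1; */
    return SQUARE(side) * SCALE;
}
```

Adding -DUSE_BIG_BUFFER to the gcc command line would select the other branch, and the expanded return statement would end in * 4 instead.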
The command for preprocessing with gcc is as follows:
gcc -E test.c -o test.i
The GCC option -E makes GCC stop after preprocessing.
The above command preprocesses the source file test.c and generates test.i. The end of the test.i file looks like this:
extern void funlockfile (FILE *__stream) __attribute__ ((__nothrow__ , __leaf__));
# 885 "/usr/include/stdio.h" 3 4
extern int __uflow (FILE *);
extern int __overflow (FILE *, int);
# 902 "/usr/include/stdio.h" 3 4
# 2 "test.c" 2
# 3 "test.c"
int main(int argc, char *argv[])
{
printf("hello world\r\n");
return 0;
}
The test.i file can be opened and viewed as a normal text file
compile
Compilation performs lexical analysis, syntax analysis, semantic analysis, and optimization on the preprocessed file and generates the corresponding assembly code.
The command to compile with gcc is as follows
gcc -S test.i -o test.s
The GCC option -S makes GCC stop after compilation and emit assembly code.
The above command compiles test.i, the file produced by preprocessing, into the assembly file test.s.
assembly
The assembly step processes the assembly code, generates instructions that the processor can recognize, and saves them in an object file with the suffix .o. Since nearly every assembly statement corresponds to one processor instruction, assembly is simpler than compilation: the assembler as from Binutils translates the statements one by one according to the table that maps assembly instructions to processor instructions.
When a program consists of multiple source files, each file must first be assembled into a .o object file before the linking step can begin. Note: object files already contain parts of the final program, but they cannot be executed until they are linked.
The command to assemble using gcc is as follows
gcc -c test.s -o test.o
The GCC option -c causes GCC to stop after executing the assembly and generate the object file
Or call as directly:
as test.s -o test.o
This uses as from Binutils to assemble the test.s file into an object file.
Note: The test.o object file is a relocatable file in ELF (Executable and Linkable Format) format.
Link
Linking is divided into static linking and dynamic linking. The main points are as follows:
- Static linking adds the static library code directly to the executable file at link time, so the executable is relatively large. The linker copies the function's code from where it lives (another object file or a static library) into the final executable. To create an executable file, the linker must perform two main tasks: symbol resolution (associating each symbol reference in the object files with its definition) and relocation (assigning memory addresses to symbol definitions and then patching all references to those symbols).
- Dynamic linking adds only some descriptive information at link time; the corresponding dynamic library is loaded from the system into memory when the program is executed.
- On Linux, the library search order when gcc links is usually: first the paths given by the -L option of the gcc command; then the paths in the environment variable LIBRARY_PATH; then the default paths /lib, /usr/lib, and /usr/local/lib.
- On Linux, the dynamic library search order when a binary is executed is usually: first the search path recorded in the binary when it was built; then the paths in the environment variable LD_LIBRARY_PATH; then the paths listed in the configuration file /etc/ld.so.conf; then the default paths /lib and /usr/lib.
- On Linux, you can use the ldd command to view the shared libraries that an executable depends on.
- The search paths for dynamic and static libraries may overlap. If a path contains a static library and a dynamic library with the same name, such as libtest.a and libtest.so, gcc prefers the dynamic library by default and links libtest.so. To make gcc link libtest.a instead, specify the -static option, which forces static linking. Take Hello World as an example:
- With the command "gcc test.c -o test", dynamic libraries are used for linking. The size of the generated ELF executable (viewed with the size command from Binutils) and the dynamic libraries it links against (viewed with the ldd command) are shown below:
gcc test.c -o test
size test
   text    data     bss     dec     hex filename
   1386     600       8    1994     7ca test
ldd test
        linux-vdso.so.1 (0x00007fffb99f4000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fc2a3600000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fc2a39a9000)
- With the command "gcc -static test.c -o test", static libraries are used for linking. The size of the generated ELF executable (viewed with size) and the result of running ldd on it are shown below:
gcc -static test.c -o test
size test
   text    data     bss     dec     hex filename
 781877   23240   23016  828133   ca2e5 test
ldd test
        not a dynamic executable
- From the above results, it can be seen that the dynamically linked files are smaller and the statically linked files are larger.
The final file generated by the linker is an executable file in ELF format. An ELF executable is usually organized into different sections, such as .text, .data, .rodata, and .bss.
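The size output shown earlier can be checked by hand: the dec and hex columns are simply the sum of the text, data, and bss columns. The sketch below redoes that arithmetic; the struct size_row and the function size_total are hypothetical names introduced only for this example.

```c
#include <assert.h>

/* One row of `size` output, one field per column. */
struct size_row {
    unsigned long text;
    unsigned long data;
    unsigned long bss;
};

/* The dec (and hex) column is the sum of the three columns before it. */
unsigned long size_total(const struct size_row *row)
{
    assert(row != 0);
    return row->text + row->data + row->bss;
}
```

For the dynamically linked build, 1386 + 600 + 8 = 1994 (0x7ca); for the static build, 781877 + 23240 + 23016 = 828133 (0xca2e5), which is roughly 415 times larger.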
Analyze ELF files
Sections of ELF files
The layout of an ELF file is shown below; the parts between the ELF header and the section header table are the sections.
ELF header
Program header table
.text
.rodata
...
.data
Section header table
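The ELF header at the top of this layout begins with a fixed identification: the four magic bytes 0x7f 'E' 'L' 'F', followed by a class byte (1 for 32-bit, 2 for 64-bit). As a rough sketch, a program can recognize an ELF file from its first five bytes; the function name elf_class is an assumption made for this example and is not part of any library.

```c
#include <assert.h>
#include <stddef.h>

/* Classify a buffer holding the first bytes of a file.
 * Returns 64 or 32 for an ELF file of that class, 0 otherwise. */
int elf_class(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (len < 5)
        return 0;
    /* e_ident[0..3]: the magic number 0x7f 'E' 'L' 'F' */
    if (buf[0] != 0x7f || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
        return 0;
    /* e_ident[4] (EI_CLASS): 1 = 32-bit, 2 = 64-bit */
    if (buf[4] == 2)
        return 64;
    if (buf[4] == 1)
        return 32;
    return 0;
}
```

Fed the first bytes of the test executable built in this article, the function would be expected to return 64, matching the elf64-x86-64 format that objdump reports.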
A typical ELF file contains the following sections
- .text: The machine instructions of the compiled program.
- .rodata: ro stands for read only, that is, read-only data (such as const constants and string literals).
- .data: Initialized global variables and static local variables of the C program.
- .bss: Uninitialized global variables and static local variables of the C program.
- .debug: Debugging information that the debugger uses to help debug (the -g option must be added when compiling).
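A short sketch shows where typical C objects end up. All names below are invented for the example, and the section noted in each comment is what a default GCC build on Linux normally chooses; exact placement can vary with compiler version and options.

```c
#include <assert.h>

int initialized_counter = 42;          /* .data: initialized global */
int uninitialized_counter;             /* .bss: no initializer, starts at 0 */
const char banner[] = "hello world";   /* .rodata: read-only data */

/* The machine code of this function goes into .text. */
int next_ticket(void)
{
    static int issued;                 /* .bss: uninitialized static local */
    static int step = 1;               /* .data: initialized static local */
    assert(step > 0);
    issued += step;
    return issued;
}
```

Compiling this file with gcc -c and running readelf -S or size on the object file is one way to confirm the placement on a given system.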
You can use readelf -S to view the information of each section
readelf -S test
The detailed information displayed by the above command is as follows:
There are 31 section headers, starting at offset 0x3698:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[1] .interp PROGBITS 0000000000000318 00000318
000000000000001c 0000000000000000 A 0 0 1
[2] .note.gnu.pr[...] NOTE 0000000000000338 00000338
0000000000000030 0000000000000000 A 0 0 8
[3] .note.gnu.bu[...] NOTE 0000000000000368 00000368
0000000000000024 0000000000000000 A 0 0 4
[4] .note.ABI-tag NOTE 000000000000038c 0000038c
0000000000000020 0000000000000000 A 0 0 4
[5] .gnu.hash GNU_HASH 00000000000003b0 000003b0
0000000000000024 0000000000000000 A 6 0 8
[6] .dynsym DYNSYM 00000000000003d8 000003d8
00000000000000a8 0000000000000018 A 7 1 8
[7] .dynstr STRTAB 0000000000000480 00000480
000000000000008d 0000000000000000 A 0 0 1
[8] .gnu.version VERSYM 000000000000050e 0000050e
000000000000000e 0000000000000002 A 6 0 2
[9] .gnu.version_r VERNEED 0000000000000520 00000520
0000000000000030 0000000000000000 A 7 1 8
[10] .rela.dyn RELA 0000000000000550 00000550
00000000000000c0 0000000000000018 A 6 0 8
[11] .rela.plt RELA 0000000000000610 00000610
0000000000000018 0000000000000018 AI 6 24 8
[12] .init PROGBITS 0000000000001000 00001000
000000000000001b 0000000000000000 AX 0 0 4
[13] .plt PROGBITS 0000000000001020 00001020
0000000000000020 0000000000000010 AX 0 0 16
[14] .plt.got PROGBITS 0000000000001040 00001040
0000000000000010 0000000000000010 AX 0 0 16
[15] .plt.sec PROGBITS 0000000000001050 00001050
0000000000000010 0000000000000010 AX 0 0 16
[16] .text PROGBITS 0000000000001060 00001060
0000000000000112 0000000000000000 AX 0 0 16
[17] .fini PROGBITS 0000000000001174 00001174
000000000000000d 0000000000000000 AX 0 0 4
[18] .rodata PROGBITS 0000000000002000 00002000
0000000000000011 0000000000000000 A 0 0 4
[19] .eh_frame_hdr PROGBITS 0000000000002014 00002014
0000000000000034 0000000000000000 A 0 0 4
[20] .eh_frame PROGBITS 0000000000002048 00002048
00000000000000ac 0000000000000000 A 0 0 8
[21] .init_array INIT_ARRAY 0000000000003db8 00002db8
0000000000000008 0000000000000008 WA 0 0 8
[22] .fini_array FINI_ARRAY 0000000000003dc0 00002dc0
0000000000000008 0000000000000008 WA 0 0 8
[23] .dynamic DYNAMIC 0000000000003dc8 00002dc8
00000000000001f0 0000000000000010 WA 7 0 8
[24] .got PROGBITS 0000000000003fb8 00002fb8
0000000000000048 0000000000000008 WA 0 0 8
[25] .data PROGBITS 0000000000004000 00003000
0000000000000010 0000000000000000 WA 0 0 8
[26] .bss NOBITS 0000000000004010 00003010
0000000000000008 0000000000000000 WA 0 0 1
[27] .comment PROGBITS 0000000000000000 00003010
000000000000002b 0000000000000001 MS 0 0 1
[28] .symtab SYMTAB 0000000000000000 00003040
0000000000000360 0000000000000018 29 18 8
[29] .strtab STRTAB 0000000000000000 000033a0
00000000000001da 0000000000000000 0 0 1
[30] .shstrtab STRTAB 0000000000000000 0000357a
000000000000011a 0000000000000000 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
D (mbind), l (large), p (processor specific)
Disassemble ELF
Since ELF files cannot be opened as ordinary text files, if you want to directly view the instructions and data contained in an ELF file, you need to use the method of disassembly.
Disassemble it with objdump -D
objdump -D test
Some of the results returned by the above command are as follows
0000000000001149 <main>:
1149: f3 0f 1e fa endbr64
114d: 55 push %rbp
114e: 48 89 e5 mov %rsp,%rbp
1151: 48 83 ec 10 sub $0x10,%rsp
1155: 89 7d fc mov %edi,-0x4(%rbp)
1158: 48 89 75 f0 mov %rsi,-0x10(%rbp)
115c: 48 8d 05 a1 0e 00 00 lea 0xea1(%rip),%rax # 2004 <_IO_stdin_used+0x4>
1163: 48 89 c7 mov %rax,%rdi
1166: e8 e5 fe ff ff call 1050 <puts@plt>
116b: b8 00 00 00 00 mov $0x0,%eax
1170: c9 leave
1171: c3 ret
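The target addresses that objdump prints after lea and call can be reproduced from the raw bytes. On x86-64 both instructions above carry a 32-bit little-endian displacement that is added to the address of the next instruction. The helper names read_disp32 and rip_target are hypothetical; this is a sketch of the arithmetic, not a general decoder.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a 32-bit little-endian signed displacement from instruction bytes. */
int32_t read_disp32(const unsigned char *p)
{
    assert(p != 0);
    uint32_t u = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                 ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    /* Conversion to int32_t wraps on GCC and other common compilers. */
    return (int32_t)u;
}

/* Target = address of the next instruction + signed displacement. */
uint64_t rip_target(uint64_t insn_addr, unsigned insn_len, int32_t disp)
{
    return insn_addr + insn_len + (uint64_t)(int64_t)disp;
}
```

For the call at 0x1166 (5 bytes, displacement bytes e5 fe ff ff, that is -0x11b), the target is 0x116b - 0x11b = 0x1050, the puts@plt entry. For the lea at 0x115c (7 bytes, displacement 0xea1), the target is 0x1163 + 0xea1 = 0x2004, which lies in .rodata.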
Use objdump -S to disassemble it and display its C language source code: (you need to add the -g option when compiling)
objdump -S test
The full result of the above command is as follows
test: file format elf64-x86-64
Disassembly of section .init:
0000000000001000 <_init>:
1000: f3 0f 1e fa endbr64
1004: 48 83 ec 08 sub $0x8,%rsp
1008: 48 8b 05 d9 2f 00 00 mov 0x2fd9(%rip),%rax # 3fe8 <__gmon_start__@Base>
100f: 48 85 c0 test %rax,%rax
1012: 74 02 je 1016 <_init+0x16>
1014: ff d0 call *%rax
1016: 48 83 c4 08 add $0x8,%rsp
101a: c3 ret
Disassembly of section .plt:
0000000000001020 <.plt>:
1020: ff 35 9a 2f 00 00 push 0x2f9a(%rip) # 3fc0 <_GLOBAL_OFFSET_TABLE_+0x8>
1026: f2 ff 25 9b 2f 00 00 bnd jmp *0x2f9b(%rip) # 3fc8 <_GLOBAL_OFFSET_TABLE_+0x10>
102d: 0f 1f 00 nopl (%rax)
1030: f3 0f 1e fa endbr64
1034: 68 00 00 00 00 push $0x0
1039: f2 e9 e1 ff ff ff bnd jmp 1020 <_init+0x20>
103f: 90 nop
Disassembly of section .plt.got:
0000000000001040 <__cxa_finalize@plt>:
1040: f3 0f 1e fa endbr64
1044: f2 ff 25 ad 2f 00 00 bnd jmp *0x2fad(%rip) # 3ff8 <__cxa_finalize@GLIBC_2.2.5>
104b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
Disassembly of section .plt.sec:
0000000000001050 <puts@plt>:
1050: f3 0f 1e fa endbr64
1054: f2 ff 25 75 2f 00 00 bnd jmp *0x2f75(%rip) # 3fd0 <puts@GLIBC_2.2.5>
105b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
Disassembly of section .text:
0000000000001060 <_start>:
1060: f3 0f 1e fa endbr64
1064: 31 ed xor %ebp,%ebp
1066: 49 89 d1 mov %rdx,%r9
1069: 5e pop %rsi
106a: 48 89 e2 mov %rsp,%rdx
106d: 48 83 e4 f0 and $0xfffffffffffffff0,%rsp
1071: 50 push %rax
1072: 54 push %rsp
1073: 45 31 c0 xor %r8d,%r8d
1076: 31 c9 xor %ecx,%ecx
1078: 48 8d 3d ca 00 00 00 lea 0xca(%rip),%rdi # 1149 <main>
107f: ff 15 53 2f 00 00 call *0x2f53(%rip) # 3fd8 <__libc_start_main@GLIBC_2.34>
1085: f4 hlt
1086: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
108d: 00 00 00
0000000000001090 <deregister_tm_clones>:
1090: 48 8d 3d 79 2f 00 00 lea 0x2f79(%rip),%rdi # 4010 <__TMC_END__>
1097: 48 8d 05 72 2f 00 00 lea 0x2f72(%rip),%rax # 4010 <__TMC_END__>
109e: 48 39 f8 cmp %rdi,%rax
10a1: 74 15 je 10b8 <deregister_tm_clones+0x28>
10a3: 48 8b 05 36 2f 00 00 mov 0x2f36(%rip),%rax # 3fe0 <_ITM_deregisterTMCloneTable@Base>
10aa: 48 85 c0 test %rax,%rax
10ad: 74 09 je 10b8 <deregister_tm_clones+0x28>
10af: ff e0 jmp *%rax
10b1: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
10b8: c3 ret
10b9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
00000000000010c0 <register_tm_clones>:
10c0: 48 8d 3d 49 2f 00 00 lea 0x2f49(%rip),%rdi # 4010 <__TMC_END__>
10c7: 48 8d 35 42 2f 00 00 lea 0x2f42(%rip),%rsi # 4010 <__TMC_END__>
10ce: 48 29 fe sub %rdi,%rsi
10d1: 48 89 f0 mov %rsi,%rax
10d4: 48 c1 ee 3f shr $0x3f,%rsi
10d8: 48 c1 f8 03 sar $0x3,%rax
10dc: 48 01 c6 add %rax,%rsi
10df: 48 d1 fe sar %rsi
10e2: 74 14 je 10f8 <register_tm_clones+0x38>
10e4: 48 8b 05 05 2f 00 00 mov 0x2f05(%rip),%rax # 3ff0 <_ITM_registerTMCloneTable@Base>
10eb: 48 85 c0 test %rax,%rax
10ee: 74 08 je 10f8 <register_tm_clones+0x38>
10f0: ff e0 jmp *%rax
10f2: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
10f8: c3 ret
10f9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
0000000000001100 <__do_global_dtors_aux>:
1100: f3 0f 1e fa endbr64
1104: 80 3d 05 2f 00 00 00 cmpb $0x0,0x2f05(%rip) # 4010 <__TMC_END__>
110b: 75 2b jne 1138 <__do_global_dtors_aux+0x38>
110d: 55 push %rbp
110e: 48 83 3d e2 2e 00 00 cmpq $0x0,0x2ee2(%rip) # 3ff8 <__cxa_finalize@GLIBC_2.2.5>
1115: 00
1116: 48 89 e5 mov %rsp,%rbp
1119: 74 0c je 1127 <__do_global_dtors_aux+0x27>
111b: 48 8b 3d e6 2e 00 00 mov 0x2ee6(%rip),%rdi # 4008 <__dso_handle>
1122: e8 19 ff ff ff call 1040 <__cxa_finalize@plt>
1127: e8 64 ff ff ff call 1090 <deregister_tm_clones>
112c: c6 05 dd 2e 00 00 01 movb $0x1,0x2edd(%rip) # 4010 <__TMC_END__>
1133: 5d pop %rbp
1134: c3 ret
1135: 0f 1f 00 nopl (%rax)
1138: c3 ret
1139: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
0000000000001140 <frame_dummy>:
1140: f3 0f 1e fa endbr64
1144: e9 77 ff ff ff jmp 10c0 <register_tm_clones>
0000000000001149 <main>:
#include<stdio.h>
int main(int argc, char *argv[])
{
1149: f3 0f 1e fa endbr64
114d: 55 push %rbp
114e: 48 89 e5 mov %rsp,%rbp
1151: 48 83 ec 10 sub $0x10,%rsp
1155: 89 7d fc mov %edi,-0x4(%rbp)
1158: 48 89 75 f0 mov %rsi,-0x10(%rbp)
printf("hello world\r\n");
115c: 48 8d 05 a1 0e 00 00 lea 0xea1(%rip),%rax # 2004 <_IO_stdin_used+0x4>
1163: 48 89 c7 mov %rax,%rdi
1166: e8 e5 fe ff ff call 1050 <puts@plt>
return 0;
116b: b8 00 00 00 00 mov $0x0,%eax
1170: c9 leave
1171: c3 ret
Disassembly of section .fini:
0000000000001174 <_fini>:
1174: f3 0f 1e fa endbr64
1178: 48 83 ec 08 sub $0x8,%rsp
117c: 48 83 c4 08 add $0x8,%rsp
1180: c3 ret