qemu-basics - program compilation process (4)

Program Compilation Process

Computer programming languages are usually divided into three categories: machine language, assembly language, and high-level languages. A high-level language must be translated into machine language before it can be executed, and there are two ways to do that: compilation and interpretation. High-level languages are therefore commonly split into compiled languages, such as C and C++ (Java is often grouped here as well, although it is compiled to bytecode rather than directly to machine code), and interpreted languages, such as Python, Ruby, MATLAB, and JavaScript.

This article describes how a program written in C/C++ is converted into binary code that the processor can execute. The process has four steps:

  • Preprocessing: the preprocessor cpp turns .c into .i
  • Compilation: the compiler cc1 turns .i into .s
  • Assembly: the assembler as turns .s into .o
  • Linking: the linker ld turns .o into an ELF executable file

Disclaimer

Part of this content comes from the Internet; it will be removed on request if it infringes any rights.

GCC Common Toolchain

What is commonly called GCC is short for GNU Compiler Collection, a widely used compiler suite on Linux systems. The GCC toolchain includes GCC itself, Binutils, the C runtime library, and so on.

GCC

GCC (originally the GNU C Compiler) is the compiler. It carries out the conversion of a program written in C/C++ into binary code that the processor can execute.

Binutils

A set of tools for working with binary programs, including addr2line, ar, objcopy, objdump, as, ld, readelf, size, and others. ldd is listed here too because it is used alongside them, although it is normally shipped with the C library rather than with Binutils. These tools are indispensable for development and debugging. Brief introductions follow:

  • addr2line: converts a program address into the corresponding source file and line number, and can also report the corresponding function. It helps locate the source code position during debugging.
  • as: mainly used for assembly; assembly is described in detail later in this article.
  • ld: mainly used for linking; linking is described in detail later in this article.
  • ar: mainly used to create static libraries. To help beginners, the concepts of dynamic and static libraries are introduced here:
    • When multiple .o object files are bundled into a library file, there are two kinds of library: static libraries and dynamic (shared) libraries.
      On Windows, a static library is a file with the suffix .lib and a shared library is a file with the suffix .dll. On Linux, a static library is a file with the suffix .a and a shared library is a file with the suffix .so.
    • Static and dynamic libraries differ in when their code is brought in. The code of a static library is copied into the executable at link time, so the executable is relatively large. The code of a shared library is loaded into memory when the executable runs and is only referenced at link time, so the executable is smaller. On Linux, the ldd command shows the shared libraries that an executable depends on.
      If several programs run at the same time on a system and use the same shared libraries, dynamic libraries also save memory.
  • ldd: shows the shared libraries that an executable depends on.
  • objcopy: converts an object file from one format into another, for example .bin to .elf or .elf to .bin.
  • objdump: mainly used for disassembly; disassembly is described in detail later in this article.
  • readelf: displays information about ELF files; see later in this article for more.
  • size: lists the size of each part of an executable, such as the code segment and data segment, as well as the total size. A usage example appears later in this article.

C runtime library

The C language standard consists of two main parts: one describes the syntax of C, and the other describes the C standard library. The C standard library defines a set of standard header files, each containing related function, variable, and type declarations and macro definitions. For example, the familiar printf function is a C standard library function whose prototype is declared in the stdio.h header file.

The C language standard only specifies the prototypes of the C standard library functions; it does not provide implementations. A C compiler therefore usually needs the support of a C runtime library (C Run Time Library, CRT), often referred to simply as the C runtime. Like C, C++ defines its own standard and provides a supporting library, called the C++ runtime library.

Environment

Since the GCC tool chain is mainly used in the Linux environment, this article will also use the Linux system as the working environment. In order to demonstrate the whole process of compilation, this section first prepares a simple Hello program written in C language as an example, and its source code is as follows:

#include<stdio.h>

int main(int argc, char *argv[])
{
    
    
    printf("hello world\r\n");

    return 0;
}

compilation process

preprocessing

Preprocessing mainly consists of the following steps:

  • Remove all #define directives and expand all macros; process all conditional compilation directives, such as #if, #ifdef, #elif, #else, and #endif.
  • Process #include directives by inserting the contents of the included file at the location of the directive.
  • Remove all comments, both "//" and "/* */".
  • Add line numbers and file name markers so that the compiler can produce line numbers for debugging information and for error and warning messages.
  • Keep all #pragma directives, because the compiler needs them in later passes.

The command for preprocessing with gcc is as follows:

gcc -E test.c -o test.i

The -E option makes GCC stop after preprocessing.

The command above preprocesses the source file test.c and generates test.i. Part of the content of test.i is shown below:

extern void funlockfile (FILE *__stream) __attribute__ ((__nothrow__ , __leaf__));
# 885 "/usr/include/stdio.h" 3 4
extern int __uflow (FILE *);
extern int __overflow (FILE *, int);
# 902 "/usr/include/stdio.h" 3 4

# 2 "test.c" 2


# 3 "test.c"
int main(int argc, char *argv[])
{
    
    
    printf("hello world\r\n");

    return 0;
}

The test.i file can be opened and viewed as a normal text file

Compilation

The compilation stage performs lexical analysis, syntax analysis, semantic analysis, and optimization on the preprocessed file and generates the corresponding assembly code.

The command to compile with gcc is as follows

gcc -S test.i -o test.s

The -S option makes GCC stop after compilation and emit assembly code.

The command above compiles the test.i file produced by preprocessing and generates the assembly file test.s.

Assembly

The assembly stage processes the assembly code, generates instructions that the processor can recognize, and saves them in an object file with the suffix .o. Because almost every assembly statement corresponds to one processor instruction, assembly is simpler than compilation: the assembler as from Binutils translates the statements one by one according to a table that maps assembly instructions to processor instructions.

When a program consists of multiple source files, each file must be assembled into a .o object file before the linking step can begin. Note: object files already contain parts of the final program, but they cannot be executed until they are linked.

The command to assemble using gcc is as follows

gcc -c test.s -o test.o

The -c option makes GCC stop after assembly and generate the object file.

Alternatively, call as directly:

as test.s -o test.o

This uses as from Binutils to assemble test.s into an object file.

Note: the test.o object file is a relocatable file in ELF (Executable and Linkable Format) format.

Link

Linking is divided into static linking and dynamic linking. The main points are as follows:

  • Static linking adds the code from static libraries directly to the executable file at link time, so the executable is relatively large. The linker copies each function's code from where it is defined (another object file or a static library) into the final executable. To create an executable, the linker performs two main tasks: symbol resolution (associating each symbol reference in the object files with its definition) and relocation (assigning memory addresses to symbol definitions and then patching every reference to those symbols).
  • Dynamic linking adds only some descriptive information at link time; the corresponding dynamic libraries are loaded into memory when the program is executed.
    • On Linux, when gcc compiles and links, it usually searches for libraries in this order: first the paths given with the -L option on the gcc command line, then the paths in the LIBRARY_PATH environment variable, then the default paths /lib, /usr/lib, and /usr/local/lib.
    • On Linux, when a binary is executed, dynamic libraries are usually searched for in this order: first the search path recorded in the binary at build time, then the paths in the LD_LIBRARY_PATH environment variable, then the paths specified in the configuration file /etc/ld.so.conf, then the default paths /lib and /usr/lib.
    • On Linux, the ldd command shows the shared libraries that an executable depends on.
  • Because the search paths for dynamic and static libraries may overlap, a directory can contain a static library and a dynamic library with the same name, such as libtest.a and libtest.so. By default gcc prefers the dynamic library and links libtest.so. To make gcc link libtest.a instead, pass the -static option, which forces static linking. Take the Hello program as an example:
    • With the command "gcc test.c -o test", dynamic libraries are used for linking. The size of the generated ELF executable (viewed with the Binutils size command) and the dynamic libraries it links against (viewed with ldd) are shown below:
      gcc test.c -o test
      
      size test
          text    data     bss     dec     hex filename
          1386     600       8    1994     7ca test
      
      ldd test
          linux-vdso.so.1 (0x00007fffb99f4000)
          libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fc2a3600000)
          /lib64/ld-linux-x86-64.so.2 (0x00007fc2a39a9000)
      
    • With the command "gcc -static test.c -o test", static libraries are used for linking. The size of the generated ELF executable (viewed with size) and the result of ldd are shown below:
      gcc -static test.c -o test
      
      size test
          text    data     bss     dec     hex filename
          781877   23240   23016  828133   ca2e5 test
      ldd test
          not a dynamic executable
      
    • From the above results, it can be seen that the dynamically linked files are smaller and the statically linked files are larger.

The final file produced by the linker is an executable in ELF format. An ELF executable is usually organized into different sections, such as .text, .data, .rodata, and .bss.

Analyze ELF files

Sections of ELF files

The layout of an ELF file is shown below. The parts between the ELF header and the section header table are the sections.

ELF header
Program header table
.text
.rodata
...
.data
Section header table

A typical ELF file contains the following sections

  • .text: the machine instructions of the compiled program.
  • .rodata: "ro" stands for read-only, that is, read-only data (such as const constants).
  • .data: initialized global variables and static local variables.
  • .bss: uninitialized global variables and static local variables.
  • .debug: debugging symbol information that the debugger uses (requires compiling with the -g option).

You can use readelf -S to view the information of each section

readelf -S test

The detailed information displayed by the above command is as follows:

There are 31 section headers, starting at offset 0x3698:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .interp           PROGBITS         0000000000000318  00000318
       000000000000001c  0000000000000000   A       0     0     1
  [ 2] .note.gnu.pr[...] NOTE             0000000000000338  00000338
       0000000000000030  0000000000000000   A       0     0     8
  [ 3] .note.gnu.bu[...] NOTE             0000000000000368  00000368
       0000000000000024  0000000000000000   A       0     0     4
  [ 4] .note.ABI-tag     NOTE             000000000000038c  0000038c
       0000000000000020  0000000000000000   A       0     0     4
  [ 5] .gnu.hash         GNU_HASH         00000000000003b0  000003b0
       0000000000000024  0000000000000000   A       6     0     8
  [ 6] .dynsym           DYNSYM           00000000000003d8  000003d8
       00000000000000a8  0000000000000018   A       7     1     8
  [ 7] .dynstr           STRTAB           0000000000000480  00000480
       000000000000008d  0000000000000000   A       0     0     1
  [ 8] .gnu.version      VERSYM           000000000000050e  0000050e
       000000000000000e  0000000000000002   A       6     0     2
  [ 9] .gnu.version_r    VERNEED          0000000000000520  00000520
       0000000000000030  0000000000000000   A       7     1     8
  [10] .rela.dyn         RELA             0000000000000550  00000550
       00000000000000c0  0000000000000018   A       6     0     8
  [11] .rela.plt         RELA             0000000000000610  00000610
       0000000000000018  0000000000000018  AI       6    24     8
  [12] .init             PROGBITS         0000000000001000  00001000
       000000000000001b  0000000000000000  AX       0     0     4
  [13] .plt              PROGBITS         0000000000001020  00001020
       0000000000000020  0000000000000010  AX       0     0     16
  [14] .plt.got          PROGBITS         0000000000001040  00001040
       0000000000000010  0000000000000010  AX       0     0     16
  [15] .plt.sec          PROGBITS         0000000000001050  00001050
       0000000000000010  0000000000000010  AX       0     0     16
  [16] .text             PROGBITS         0000000000001060  00001060
       0000000000000112  0000000000000000  AX       0     0     16
  [17] .fini             PROGBITS         0000000000001174  00001174
       000000000000000d  0000000000000000  AX       0     0     4
  [18] .rodata           PROGBITS         0000000000002000  00002000
       0000000000000011  0000000000000000   A       0     0     4
  [19] .eh_frame_hdr     PROGBITS         0000000000002014  00002014
       0000000000000034  0000000000000000   A       0     0     4
  [20] .eh_frame         PROGBITS         0000000000002048  00002048
       00000000000000ac  0000000000000000   A       0     0     8
  [21] .init_array       INIT_ARRAY       0000000000003db8  00002db8
       0000000000000008  0000000000000008  WA       0     0     8
  [22] .fini_array       FINI_ARRAY       0000000000003dc0  00002dc0
       0000000000000008  0000000000000008  WA       0     0     8
  [23] .dynamic          DYNAMIC          0000000000003dc8  00002dc8
       00000000000001f0  0000000000000010  WA       7     0     8
  [24] .got              PROGBITS         0000000000003fb8  00002fb8
       0000000000000048  0000000000000008  WA       0     0     8
  [25] .data             PROGBITS         0000000000004000  00003000
       0000000000000010  0000000000000000  WA       0     0     8
  [26] .bss              NOBITS           0000000000004010  00003010
       0000000000000008  0000000000000000  WA       0     0     1
  [27] .comment          PROGBITS         0000000000000000  00003010
       000000000000002b  0000000000000001  MS       0     0     1
  [28] .symtab           SYMTAB           0000000000000000  00003040
       0000000000000360  0000000000000018          29    18     8
  [29] .strtab           STRTAB           0000000000000000  000033a0
       00000000000001da  0000000000000000           0     0     1
  [30] .shstrtab         STRTAB           0000000000000000  0000357a
       000000000000011a  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  D (mbind), l (large), p (processor specific)

Disassemble ELF

Because an ELF file cannot be opened as an ordinary text file, viewing the instructions and data it contains requires disassembly.
Use objdump -D to disassemble it:

objdump -D test

Some of the results returned by the above command are as follows

0000000000001149 <main>:
    1149:       f3 0f 1e fa             endbr64
    114d:       55                      push   %rbp
    114e:       48 89 e5                mov    %rsp,%rbp
    1151:       48 83 ec 10             sub    $0x10,%rsp
    1155:       89 7d fc                mov    %edi,-0x4(%rbp)
    1158:       48 89 75 f0             mov    %rsi,-0x10(%rbp)
    115c:       48 8d 05 a1 0e 00 00    lea    0xea1(%rip),%rax        # 2004 <_IO_stdin_used+0x4>
    1163:       48 89 c7                mov    %rax,%rdi
    1166:       e8 e5 fe ff ff          call   1050 <puts@plt>
    116b:       b8 00 00 00 00          mov    $0x0,%eax
    1170:       c9                      leave
    1171:       c3                      ret

Use objdump -S to disassemble the file and interleave the C source code (this requires compiling with the -g option):

objdump -S test

The full result of the above command is as follows

test:     file format elf64-x86-64


Disassembly of section .init:

0000000000001000 <_init>:
    1000:       f3 0f 1e fa             endbr64 
    1004:       48 83 ec 08             sub    $0x8,%rsp
    1008:       48 8b 05 d9 2f 00 00    mov    0x2fd9(%rip),%rax        # 3fe8 <__gmon_start__@Base>
    100f:       48 85 c0                test   %rax,%rax
    1012:       74 02                   je     1016 <_init+0x16>
    1014:       ff d0                   call   *%rax
    1016:       48 83 c4 08             add    $0x8,%rsp
    101a:       c3                      ret    

Disassembly of section .plt:

0000000000001020 <.plt>:
    1020:       ff 35 9a 2f 00 00       push   0x2f9a(%rip)        # 3fc0 <_GLOBAL_OFFSET_TABLE_+0x8>
    1026:       f2 ff 25 9b 2f 00 00    bnd jmp *0x2f9b(%rip)        # 3fc8 <_GLOBAL_OFFSET_TABLE_+0x10>
    102d:       0f 1f 00                nopl   (%rax)
    1030:       f3 0f 1e fa             endbr64 
    1034:       68 00 00 00 00          push   $0x0
    1039:       f2 e9 e1 ff ff ff       bnd jmp 1020 <_init+0x20>
    103f:       90                      nop

Disassembly of section .plt.got:

0000000000001040 <__cxa_finalize@plt>:
    1040:       f3 0f 1e fa             endbr64 
    1044:       f2 ff 25 ad 2f 00 00    bnd jmp *0x2fad(%rip)        # 3ff8 <__cxa_finalize@GLIBC_2.2.5>
    104b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

Disassembly of section .plt.sec:

0000000000001050 <puts@plt>:
    1050:       f3 0f 1e fa             endbr64 
    1054:       f2 ff 25 75 2f 00 00    bnd jmp *0x2f75(%rip)        # 3fd0 <puts@GLIBC_2.2.5>
    105b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

Disassembly of section .text:

0000000000001060 <_start>:
    1060:       f3 0f 1e fa             endbr64 
    1064:       31 ed                   xor    %ebp,%ebp
    1066:       49 89 d1                mov    %rdx,%r9
    1069:       5e                      pop    %rsi
    106a:       48 89 e2                mov    %rsp,%rdx
    106d:       48 83 e4 f0             and    $0xfffffffffffffff0,%rsp
    1071:       50                      push   %rax
    1072:       54                      push   %rsp
    1073:       45 31 c0                xor    %r8d,%r8d
    1076:       31 c9                   xor    %ecx,%ecx
    1078:       48 8d 3d ca 00 00 00    lea    0xca(%rip),%rdi        # 1149 <main>
    107f:       ff 15 53 2f 00 00       call   *0x2f53(%rip)        # 3fd8 <__libc_start_main@GLIBC_2.34>
    1085:       f4                      hlt    
    1086:       66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
    108d:       00 00 00 

0000000000001090 <deregister_tm_clones>:
    1090:       48 8d 3d 79 2f 00 00    lea    0x2f79(%rip),%rdi        # 4010 <__TMC_END__>
    1097:       48 8d 05 72 2f 00 00    lea    0x2f72(%rip),%rax        # 4010 <__TMC_END__>
    109e:       48 39 f8                cmp    %rdi,%rax
    10a1:       74 15                   je     10b8 <deregister_tm_clones+0x28>
    10a3:       48 8b 05 36 2f 00 00    mov    0x2f36(%rip),%rax        # 3fe0 <_ITM_deregisterTMCloneTable@Base>
    10aa:       48 85 c0                test   %rax,%rax
    10ad:       74 09                   je     10b8 <deregister_tm_clones+0x28>
    10af:       ff e0                   jmp    *%rax
    10b1:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
    10b8:       c3                      ret    
    10b9:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

00000000000010c0 <register_tm_clones>:
    10c0:       48 8d 3d 49 2f 00 00    lea    0x2f49(%rip),%rdi        # 4010 <__TMC_END__>
    10c7:       48 8d 35 42 2f 00 00    lea    0x2f42(%rip),%rsi        # 4010 <__TMC_END__>
    10ce:       48 29 fe                sub    %rdi,%rsi
    10d1:       48 89 f0                mov    %rsi,%rax
    10d4:       48 c1 ee 3f             shr    $0x3f,%rsi
    10d8:       48 c1 f8 03             sar    $0x3,%rax
    10dc:       48 01 c6                add    %rax,%rsi
    10df:       48 d1 fe                sar    %rsi
    10e2:       74 14                   je     10f8 <register_tm_clones+0x38>
    10e4:       48 8b 05 05 2f 00 00    mov    0x2f05(%rip),%rax        # 3ff0 <_ITM_registerTMCloneTable@Base>
    10eb:       48 85 c0                test   %rax,%rax
    10ee:       74 08                   je     10f8 <register_tm_clones+0x38>
    10f0:       ff e0                   jmp    *%rax
    10f2:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
    10f8:       c3                      ret    
    10f9:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

0000000000001100 <__do_global_dtors_aux>:
    1100:       f3 0f 1e fa             endbr64 
    1104:       80 3d 05 2f 00 00 00    cmpb   $0x0,0x2f05(%rip)        # 4010 <__TMC_END__>
    110b:       75 2b                   jne    1138 <__do_global_dtors_aux+0x38>
    110d:       55                      push   %rbp
    110e:       48 83 3d e2 2e 00 00    cmpq   $0x0,0x2ee2(%rip)        # 3ff8 <__cxa_finalize@GLIBC_2.2.5>
    1115:       00 
    1116:       48 89 e5                mov    %rsp,%rbp
    1119:       74 0c                   je     1127 <__do_global_dtors_aux+0x27>
    111b:       48 8b 3d e6 2e 00 00    mov    0x2ee6(%rip),%rdi        # 4008 <__dso_handle>
    1122:       e8 19 ff ff ff          call   1040 <__cxa_finalize@plt>
    1127:       e8 64 ff ff ff          call   1090 <deregister_tm_clones>
    112c:       c6 05 dd 2e 00 00 01    movb   $0x1,0x2edd(%rip)        # 4010 <__TMC_END__>
    1133:       5d                      pop    %rbp
    1134:       c3                      ret    
    1135:       0f 1f 00                nopl   (%rax)
    1138:       c3                      ret    
    1139:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

0000000000001140 <frame_dummy>:
    1140:       f3 0f 1e fa             endbr64 
    1144:       e9 77 ff ff ff          jmp    10c0 <register_tm_clones>

0000000000001149 <main>:
#include<stdio.h>

int main(int argc, char *argv[])
{
    
    
    1149:       f3 0f 1e fa             endbr64 
    114d:       55                      push   %rbp
    114e:       48 89 e5                mov    %rsp,%rbp
    1151:       48 83 ec 10             sub    $0x10,%rsp
    1155:       89 7d fc                mov    %edi,-0x4(%rbp)
    1158:       48 89 75 f0             mov    %rsi,-0x10(%rbp)
    printf("hello world\r\n");
    115c:       48 8d 05 a1 0e 00 00    lea    0xea1(%rip),%rax        # 2004 <_IO_stdin_used+0x4>
    1163:       48 89 c7                mov    %rax,%rdi
    1166:       e8 e5 fe ff ff          call   1050 <puts@plt>

    return 0;
    116b:       b8 00 00 00 00          mov    $0x0,%eax
    1170:       c9                      leave  
    1171:       c3                      ret    

Disassembly of section .fini:

0000000000001174 <_fini>:
    1174:       f3 0f 1e fa             endbr64 
    1178:       48 83 ec 08             sub    $0x8,%rsp
    117c:       48 83 c4 08             add    $0x8,%rsp
    1180:       c3                      ret 

Origin blog.csdn.net/tyustli/article/details/130549434