In-depth analysis of how C code becomes machine code
Broadly speaking, the process can be divided into two stages:
- The first stage consists of three steps, Compile, Assemble and Link, and produces an executable program.
- The second stage loads the executable file into memory through the loader; the CPU then reads instructions and data from memory and actually executes the program.
Phase One: Compile, Assemble and Link
- Compile: A C compiler (such as GCC) translates the C source file (.c file) into an assembly code file (.s file). The compiler performs lexical analysis, syntax analysis, and semantic analysis on the C code, builds an intermediate representation of the program's logic, and then emits assembly code from it.
- Assemble: An assembler (such as the GNU assembler) converts the assembly code file (.s file) into an object file (.o file) containing machine code. The assembler translates each assembly instruction into the corresponding machine instruction.
- Link: A linker (such as the GNU linker) combines multiple object files (.o files) and the required library files into the final executable file. The linker resolves references to functions and global variables by associating each reference with its definition.
Phase Two: Load and Execute
- Load: The operating system's loader places the executable file at the appropriate location in memory. It allocates memory space and copies the instructions, data, and other resources of the executable file to the corresponding memory addresses.
- Execute: Once the executable file has been loaded into memory, the CPU reads instructions and data from memory and executes the instructions in program order. Following those instructions, the CPU performs arithmetic operations, logical comparisons, memory accesses and other operations, which together carry out the program's functions.
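The fetch-and-execute cycle described above can be sketched with a toy machine. This is only an illustrative model: the opcodes (`OP_LOAD_A`, `OP_LOAD_B`, `OP_ADD`, `OP_HALT`) and the function `toy_run` are hypothetical names invented for this sketch and have nothing to do with real x86 encodings.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for a toy machine; these are NOT x86 encodings. */
enum { OP_HALT = 0, OP_LOAD_A = 1, OP_LOAD_B = 2, OP_ADD = 3 };

/* One memory array holds both the instructions and their operands,
 * much like a loaded program's code and data share one address space.
 * Returns the value left in register a when the program halts. */
int toy_run(const uint8_t *memory, size_t size)
{
    int a = 0, b = 0;   /* two registers */
    size_t pc = 0;      /* program counter */

    while (pc < size) {
        uint8_t opcode = memory[pc++];   /* fetch */
        switch (opcode) {                /* decode and execute */
        case OP_LOAD_A:
            if (pc >= size) return a;    /* operand missing: stop */
            a = memory[pc++];
            break;
        case OP_LOAD_B:
            if (pc >= size) return a;
            b = memory[pc++];
            break;
        case OP_ADD:
            a += b;
            break;
        case OP_HALT:
        default:                         /* unknown opcode: stop */
            return a;
        }
    }
    return a;
}
```

Running `toy_run` on the bytes `{ OP_LOAD_A, 10, OP_LOAD_B, 5, OP_ADD, OP_HALT }` returns 15, which mirrors the addition example used later in this article.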
In-depth understanding of the ELF format: an important role in the Linux system
What is ELF?
- ELF stands for Executable and Linkable Format.
- Linux systems use ELF to store and organize the code and data of object files and executable files.
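As a small illustration of what "executable and linkable" means in practice, the sketch below reads the file type from the first bytes of an ELF header. The offsets and values follow the ELF specification; the function name `elf_type` and the `ELF_TYPE_*` constants are names made up for this sketch, and it only handles little-endian files.

```c
#include <stddef.h>

/* e_type values from the ELF specification, under hypothetical names. */
enum {
    ELF_TYPE_RELOCATABLE = 1,  /* ET_REL: an object file (.o) */
    ELF_TYPE_EXECUTABLE  = 2,  /* ET_EXEC: a fixed-address executable */
    ELF_TYPE_SHARED      = 3   /* ET_DYN: shared object or position-independent executable */
};

/* Returns the e_type field of an ELF header, or -1 if buf does not
 * look like a little-endian ELF file. */
int elf_type(const unsigned char *buf, size_t len)
{
    /* The header starts with the magic bytes 0x7f 'E' 'L' 'F'. */
    if (len < 18 || buf[0] != 0x7f || buf[1] != 'E' ||
        buf[2] != 'L' || buf[3] != 'F')
        return -1;

    /* Byte 5 (EI_DATA) is 1 for little-endian; this sketch stops otherwise. */
    if (buf[5] != 1)
        return -1;

    /* e_type is a 16-bit field right after the 16-byte e_ident array. */
    return buf[16] | (buf[17] << 8);
}
```

Fed the first bytes of an object file such as add_lib.o, this should report type 1 (relocatable); a linked program reports 2 or 3 depending on how the toolchain builds executables.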
ELF file structure
The main sections of an ELF file:
- .text Section: the code section, which holds the program's machine instructions.
- .data Section: the data section, which holds the initialized data defined in the program.
- .rel.text Section: the relocation table (named .rela.text on x86-64). It records which jump and call addresses in the current file are still unknown and must be filled in at link time.
- .symtab Section: the symbol table. It acts as an address book that maps the function and variable names defined in the current file to their addresses.
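To make the section list concrete, here is a small sketch of where a typical GCC toolchain on Linux tends to place each kind of definition. The names `counter`, `scratch`, `greeting` and `bump` are hypothetical, and the `.bss` and `.rodata` sections are not covered above; the exact placement can vary with compiler and flags.

```c
/* Initialized global variable: stored in .data. */
int counter = 10;

/* Zero-initialized global: usually placed in .bss, which takes no
 * space in the file and is zero-filled at load time. */
int scratch[4];

/* Read-only data: usually placed in .rodata. */
const char greeting[] = "hello";

/* Machine code for the function body goes into .text, and the name
 * "bump" is recorded in .symtab. The accesses to counter typically
 * produce entries in the relocation table, because the final address
 * of counter is not known until link time. */
int bump(int step)
{
    counter += step;
    return counter;
}
```

Compiling this file with `gcc -c` and inspecting the result with `objdump -h` or `readelf -S` is one way to check the actual section assignment on a given system.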
The key role of ELF format in the compilation process
- Compile phase (Compile): When the compiler driver (for example gcc -c) produces an object file, that file uses the ELF format to store the compiled code and data.
- Assembly phase (Assemble): The assembler writes the assembled machine instructions and data into an ELF object file.
- Link phase (Link): The link phase is the main application area of the ELF format. The linker reads multiple object files and library files, performs symbol resolution and relocation based on the symbol references, and finally generates an executable file. The ELF format provides structures such as section header tables, symbol tables, and relocation tables that describe how the parts of a file relate to its symbols, which lets the linker resolve references and apply relocations accurately.
- Load phase (Load): The ELF format tells the operating system how the executable file is laid out and which relocations it still requires.
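The relocation tables mentioned above consist of small fixed-size records. The sketch below mirrors the layout of `Elf64_Rela` from `<elf.h>` under a hypothetical name (`struct rela_entry`) so that it compiles on any platform; the helper functions are likewise illustrative names, while the bit layout of the info field and the two type numbers follow the x86-64 ELF ABI.

```c
#include <stdint.h>

/* Same field layout as Elf64_Rela, redefined under a hypothetical name. */
struct rela_entry {
    uint64_t offset;  /* r_offset: where in the section to patch */
    uint64_t info;    /* r_info: symbol index (high 32 bits), type (low 32 bits) */
    int64_t  addend;  /* r_addend: constant added to the symbol's address */
};

/* Two x86-64 relocation types commonly seen on call instructions. */
enum { RELOC_PC32 = 2, RELOC_PLT32 = 4 };  /* R_X86_64_PC32, R_X86_64_PLT32 */

/* Pack a symbol-table index and a relocation type into an info word. */
uint64_t rela_info(uint32_t symbol_index, uint32_t type)
{
    return ((uint64_t)symbol_index << 32) | type;
}

/* Index into .symtab of the symbol this relocation refers to. */
uint32_t rela_symbol(const struct rela_entry *r)
{
    return (uint32_t)(r->info >> 32);
}

/* Relocation type, which tells the linker how to compute the patch. */
uint32_t rela_type(const struct rela_entry *r)
{
    return (uint32_t)(r->info & 0xffffffffu);
}
```

A call whose 4-byte operand sits at offset 0x26 would be described by an entry with `offset = 0x26`, a PC-relative type, and an addend of -4; the symbol index depends on the particular object file, so any concrete value used with this sketch is an assumption.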
ELF running example
C code
The following two files, add_lib.c and link_example.c, work together to implement an addition function.
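// add_lib.c
int add(int a, int b)
{
    return a + b;
}
// link_example.c
#include <stdio.h>

// Defined in add_lib.c; the linker connects this declaration to that definition.
int add(int a, int b);

int main(void)
{
    int a = 10;
    int b = 5;
    int c = add(a, b);
    printf("c = %d\n", c);
    return 0;
}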
Compilation
Compiling add_lib.c and link_example.c produces the object files add_lib.o and link_example.o.
Compile with gcc:
$ gcc -g -c add_lib.c link_example.c
$ objdump -d -M intel -S add_lib.o
$ objdump -d -M intel -S link_example.o
objdump prints the following disassembly for the two object files:
# Assembly code of add_lib.o
add_lib.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <add>:
0: 55 push rbp
1: 48 89 e5 mov rbp,rsp
4: 89 7d fc mov DWORD PTR [rbp-0x4],edi
7: 89 75 f8 mov DWORD PTR [rbp-0x8],esi
a: 8b 55 fc mov edx,DWORD PTR [rbp-0x4]
d: 8b 45 f8 mov eax,DWORD PTR [rbp-0x8]
10: 01 d0 add eax,edx
12: 5d pop rbp
13: c3 ret
# Assembly code of link_example.o
link_example.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <main>:
0: 55 push rbp
1: 48 89 e5 mov rbp,rsp
4: 48 83 ec 10 sub rsp,0x10
8: c7 45 fc 0a 00 00 00 mov DWORD PTR [rbp-0x4],0xa
f: c7 45 f8 05 00 00 00 mov DWORD PTR [rbp-0x8],0x5
16: 8b 55 f8 mov edx,DWORD PTR [rbp-0x8]
19: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
1c: 89 d6 mov esi,edx
1e: 89 c7 mov edi,eax
20: b8 00 00 00 00 mov eax,0x0
25: e8 00 00 00 00 call 2a <main+0x2a>
2a: 89 45 f4 mov DWORD PTR [rbp-0xc],eax
2d: 8b 45 f4 mov eax,DWORD PTR [rbp-0xc]
30: 89 c6 mov esi,eax
32: 48 8d 3d 00 00 00 00 lea rdi,[rip+0x0] # 39 <main+0x39>
39: b8 00 00 00 00 mov eax,0x0
3e: e8 00 00 00 00 call 43 <main+0x43>
43: b8 00 00 00 00 mov eax,0x0
48: c9 leave
49: c3 ret
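Link
If assembly files (add_lib.s and link_example.s) were generated first, assemble them into object files:
$ gcc -c add_lib.s
$ gcc -c link_example.s
Executable code
Link the two object files into an executable and run it:
$ gcc -o link_example add_lib.o link_example.o
$ ./link_example
c = 15 # the result is 15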
- Note: In the linked executable, the call in main to add no longer jumps to the address of the next instruction; it jumps to the entry address of the add function.
link_example: file format elf64-x86-64
Disassembly of section .init:
...
Disassembly of section .plt:
...
Disassembly of section .plt.got:
...
Disassembly of section .text:
...
6b0: 55 push rbp
6b1: 48 89 e5 mov rbp,rsp
6b4: 89 7d fc mov DWORD PTR [rbp-0x4],edi
6b7: 89 75 f8 mov DWORD PTR [rbp-0x8],esi
6ba: 8b 55 fc mov edx,DWORD PTR [rbp-0x4]
6bd: 8b 45 f8 mov eax,DWORD PTR [rbp-0x8]
6c0: 01 d0 add eax,edx
6c2: 5d pop rbp
6c3: c3 ret
00000000000006c4 <main>:
6c4: 55 push rbp
6c5: 48 89 e5 mov rbp,rsp
6c8: 48 83 ec 10 sub rsp,0x10
6cc: c7 45 fc 0a 00 00 00 mov DWORD PTR [rbp-0x4],0xa
6d3: c7 45 f8 05 00 00 00 mov DWORD PTR [rbp-0x8],0x5
6da: 8b 55 f8 mov edx,DWORD PTR [rbp-0x8]
6dd: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
6e0: 89 d6 mov esi,edx
6e2: 89 c7 mov edi,eax
6e4: b8 00 00 00 00 mov eax,0x0
6e9: e8 c2 ff ff ff call 6b0 <add> # main now calls the entry address of add directly
6ee: 89 45 f4 mov DWORD PTR [rbp-0xc],eax
6f1: 8b 45 f4 mov eax,DWORD PTR [rbp-0xc]
6f4: 89 c6 mov esi,eax
6f6: 48 8d 3d 97 00 00 00 lea rdi,[rip+0x97]
6fd: b8 00 00 00 00 mov eax,0x0
702: e8 59 fe ff ff call 560 <printf@plt>
707: b8 00 00 00 00 mov eax,0x0
70c: c9 leave
70d: c3 ret
70e: 66 90 xchg ax,ax
...
Disassembly of section .fini:
...
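The call instructions in the listings can be decoded by hand. An x86-64 near call is the opcode 0xe8 followed by a 32-bit little-endian displacement, measured from the address of the next instruction. The sketch below performs that arithmetic; the function name `call_target` is a hypothetical helper, not part of any tool.

```c
#include <stdint.h>

/* Computes the destination of a 5-byte near call (0xe8 + rel32)
 * located at call_addr. */
uint64_t call_target(uint64_t call_addr, const uint8_t insn[5])
{
    /* Assemble the 4 displacement bytes, least significant first. */
    uint32_t raw = (uint32_t)insn[1]
                 | (uint32_t)insn[2] << 8
                 | (uint32_t)insn[3] << 16
                 | (uint32_t)insn[4] << 24;

    /* Interpret the 32-bit value as a signed displacement. */
    int64_t disp = (raw & 0x80000000u) ? (int64_t)raw - 0x100000000LL
                                       : (int64_t)raw;

    /* The displacement is relative to the instruction after the call. */
    return call_addr + 5 + (uint64_t)disp;
}
```

For the bytes e8 c2 ff ff ff at address 0x6e9 this yields 0x6b0, the entry of add, and for e8 59 fe ff ff at 0x702 it yields 0x560, the printf@plt stub. For the unlinked e8 00 00 00 00 at offset 0x25 it yields 0x2a, which is simply the next instruction, matching what objdump shows for link_example.o.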
The linker scans all input object files and collects the information in their symbol tables into a global symbol table. Then, guided by the relocation tables, it patches every instruction whose jump or call address was left undetermined, using the addresses stored in the global symbol table. Finally, the corresponding sections of all object files are merged into the final executable.
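The symbol lookup and patching steps can be sketched as a toy linker pass. This is a deliberately simplified model, not how the GNU linker is implemented: `struct toy_symbol`, `struct toy_reloc` and `toy_link` are hypothetical names, every relocation is assumed to be a 32-bit PC-relative call operand, and section merging is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One entry of the global symbol table: a name and its final address. */
struct toy_symbol { const char *name; uint64_t address; };

/* One relocation: the offset of a 4-byte call operand and the symbol it needs. */
struct toy_reloc { uint64_t offset; const char *symbol; };

/* Finds name in the symbol table; returns 0 and sets *out on success. */
static int toy_lookup(const struct toy_symbol *syms, size_t nsyms,
                      const char *name, uint64_t *out)
{
    for (size_t i = 0; i < nsyms; i++) {
        if (strcmp(syms[i].name, name) == 0) {
            *out = syms[i].address;
            return 0;
        }
    }
    return -1;
}

/* Patches every relocation in code, which will live at address base.
 * Returns 0 on success, -1 on an undefined symbol or a bad offset. */
int toy_link(uint8_t *code, size_t code_len, uint64_t base,
             const struct toy_reloc *relocs, size_t nrelocs,
             const struct toy_symbol *syms, size_t nsyms)
{
    for (size_t i = 0; i < nrelocs; i++) {
        uint64_t target;
        uint64_t off = relocs[i].offset;

        if (off > code_len || code_len - off < 4)
            return -1;                      /* operand outside the code */
        if (toy_lookup(syms, nsyms, relocs[i].symbol, &target) != 0)
            return -1;                      /* undefined reference */

        /* Displacement is measured from the end of the 4-byte operand. */
        uint32_t disp = (uint32_t)(target - (base + off + 4));

        code[off]     = (uint8_t)(disp);    /* store little-endian */
        code[off + 1] = (uint8_t)(disp >> 8);
        code[off + 2] = (uint8_t)(disp >> 16);
        code[off + 3] = (uint8_t)(disp >> 24);
    }
    return 0;
}
```

With main placed at 0x6c4, add at 0x6b0 and the printf stub at 0x560, patching the operands at offsets 0x26 and 0x3f produces the bytes c2 ff ff ff and 59 fe ff ff, the same values that appear in the disassembly of the linked executable.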
Windows OS: PE
- The executable file format on Windows is called PE (Portable Executable).
- The Linux loader can parse only the ELF format, not the PE format.
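A loader can tell the two formats apart from the first bytes of a file. The sketch below is a minimal illustration, assuming the whole file header is already in memory: ELF files begin with 0x7f 'E' 'L' 'F', while PE files begin with an "MZ" header whose field at offset 0x3c points to a "PE\0\0" signature. The enum `exe_format` and the function `detect_format` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum exe_format { FORMAT_UNKNOWN, FORMAT_ELF, FORMAT_PE };

/* Classifies a file image by its magic numbers only; it does not
 * validate the rest of the header. */
enum exe_format detect_format(const unsigned char *buf, size_t len)
{
    /* ELF: magic bytes at offset 0. */
    if (len >= 4 && memcmp(buf, "\x7f" "ELF", 4) == 0)
        return FORMAT_ELF;

    /* PE: "MZ" at offset 0, then a little-endian 32-bit offset at 0x3c
     * that locates the "PE\0\0" signature. */
    if (len >= 0x40 && buf[0] == 'M' && buf[1] == 'Z') {
        uint32_t pe_offset = (uint32_t)buf[0x3c]
                           | (uint32_t)buf[0x3d] << 8
                           | (uint32_t)buf[0x3e] << 16
                           | (uint32_t)buf[0x3f] << 24;
        if (pe_offset <= len - 4 &&
            memcmp(buf + pe_offset, "PE\0\0", 4) == 0)
            return FORMAT_PE;
    }
    return FORMAT_UNKNOWN;
}
```

Recognizing the format is only the entry check; a real loader would go on to parse the headers of whichever format it supports.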
How can the formats be made compatible between Windows and Linux?
- Wine, a well-known open source project on Linux, provides a loader that understands the PE format, which lets us run Windows programs directly on Linux.
- Windows in turn provides WSL (Windows Subsystem for Linux), which can parse and load files in ELF format.
- Although tools like these make the executable file formats compatible, a program also depends on the dynamic link libraries, system calls, and other services of the operating system it was built for, so it still has to be adapted and tested for each platform. In other words, format compatibility is only the first step.