After preprocessing, the source file has been stripped of its comments, its preprocessing directives have been processed, and what remains is plain C code. The next step is compilation, which proceeds in two stages. In the first stage, the compiler runs a series of analysis passes over the C code and translates the C source file into an assembly file. In the second stage, the assembler translates the assembly file into a relocatable object file.
From C file to assembly file
An assembly file organizes the program in units of segments: code segment, data segment, BSS segment, and so on. The segments are independent of one another, and this layout is very close to that of the binary object file, because assembly instructions are simply mnemonics for binary instructions. The structure of an assembly program, however, is expressed through various pseudo-operations (assembler directives). Once the assembler has processed these directives and translated the instructions, the result is a binary object file.
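As a rough illustration of how C constructs usually end up in these segments, consider the fragment below. The names are made up for this sketch (global_val and uninit_val echo symbols that appear in the readelf listings later in this article), and the exact placement can vary with the compiler, its version, and its options.

```c
#include <assert.h>

int global_val = 10;             /* initialized global     -> data segment (.data)            */
int uninit_val;                  /* uninitialized global   -> BSS segment, or a COMMON symbol */
const int limit = 100;           /* read-only constant     -> read-only data (.rodata)        */
const char *greeting = "hello";  /* the pointer lives in .data, the string literal in .rodata */

/* Function bodies become instructions in the code segment (.text). */
int next_id(void)
{
    static int counter;          /* zero-initialized static local -> BSS segment */

    assert(counter < limit);     /* sketch only: refuse to run past the limit */
    counter = counter + 1;
    return counter;
}
```

Compiling a file like this with -S and looking for the .data, .bss, .section .rodata and .text directives is a quick way to check where each object was actually placed by your toolchain.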
Converting a C source file into an assembly file means translating the code blocks and functions in the C file into the code segment of the assembly program, and translating the global variables, static variables, and constants into its data segment and read-only data segment. Generally speaking, the compilation process can be divided into the following 6 steps:
- Lexical analysis
- Syntax analysis
- Semantic analysis
- Intermediate code generation
- Assembly code generation
- Object code generation
Lexical analysis
Lexical analysis breaks the statements of a C program into their smallest pieces. A lexical scanner reads the source program character by character from left to right, recognizes the character stream with a finite state machine, and decomposes the source program into a sequence of units that cannot be decomposed any further: tokens.
A token is the smallest meaningful unit found while parsing the character stream. The common kinds of token are:
- Various keywords of C language: int, float, for, while, break, etc.
- Various user-defined identifiers: function names, variable names, labels, etc.
- Literals: numbers, strings, etc.
- Operators: More than 40 operators defined by the C language standard.
- Delimiters: the semicolon at the end of a statement, the comma that separates function arguments, etc.
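The scanning step can be sketched with a very small hand-written tokenizer. This is only an illustration under simplifying assumptions: it knows a handful of keywords, single-character operators and delimiters, and decimal integers, and every name in it (token_kind, next_token, count_tokens) is invented for the example. A real C lexer handles far more cases.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum token_kind {
    TOK_KEYWORD, TOK_IDENT, TOK_NUMBER, TOK_OPERATOR, TOK_DELIM, TOK_END, TOK_ERROR
};

struct token {
    enum token_kind kind;
    char text[32];               /* longer lexemes get split; acceptable for a sketch */
};

static int is_keyword(const char *s)
{
    static const char *const keywords[] = { "int", "float", "for", "while", "break", "return" };
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(s, keywords[i]) == 0)
            return 1;
    return 0;
}

/* Scan one token starting at *src and advance *src past it. */
struct token next_token(const char **src)
{
    struct token t = { TOK_END, "" };   /* text starts out all zero bytes */
    const char *p;
    size_t n = 0;

    assert(src != NULL && *src != NULL);
    p = *src;
    while (isspace((unsigned char)*p))
        p++;

    if (isalpha((unsigned char)*p) || *p == '_') {
        while ((isalnum((unsigned char)*p) || *p == '_') && n < sizeof t.text - 1)
            t.text[n++] = *p++;
        t.kind = is_keyword(t.text) ? TOK_KEYWORD : TOK_IDENT;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p) && n < sizeof t.text - 1)
            t.text[n++] = *p++;
        t.kind = TOK_NUMBER;
    } else if (*p != '\0') {
        t.text[n++] = *p;
        t.kind = strchr("+-*/=", *p) ? TOK_OPERATOR
               : strchr(";,(){}", *p) ? TOK_DELIM : TOK_ERROR;
        p++;
    }
    *src = p;
    return t;
}

/* Count the tokens in a source string (the end marker is not counted). */
int count_tokens(const char *src)
{
    int count = 0;
    while (next_token(&src).kind != TOK_END)
        count++;
    return count;
}
```

Calling count_tokens on the statement "sum = a + b / c;" walks through the identifiers, operators, and the closing semicolon one token at a time.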
Syntax analysis
Syntax analysis examines the token sequence produced by the previous stage to see whether it can be assembled into grammatically correct constructs (programs, statements, expressions, and so on). These constructs are represented by a syntax tree, a tree structure rather than a linear sequence. If, while analyzing the token sequence, the parser finds that no grammatically correct statement or expression can be built, it reports a syntax error.
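To show what building a syntax tree looks like, here is a minimal recursive-descent parser for arithmetic expressions. It is a sketch under strong assumptions: operands are single letters or digits, only + - * / and parentheses are supported, and the names (node, parse, and so on) are invented for the example. It is not how GCC's parser is implemented.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

struct node {
    char op;                    /* '+', '-', '*', '/', or 0 for a leaf */
    char name;                  /* operand character when op == 0      */
    struct node *lhs, *rhs;
};

static struct node pool[64];    /* fixed node pool keeps the sketch allocation-free */
static size_t used;
static const char *cur;         /* current position in the input */

static struct node *new_node(char op, char name, struct node *l, struct node *r)
{
    if (used == sizeof pool / sizeof pool[0])
        return NULL;            /* pool exhausted: treated as a failure */
    struct node *n = &pool[used++];
    n->op = op; n->name = name; n->lhs = l; n->rhs = r;
    return n;
}

static void skip_spaces(void) { while (*cur == ' ') cur++; }

static struct node *parse_expr(void);

/* factor := operand | '(' expr ')' */
static struct node *parse_factor(void)
{
    skip_spaces();
    if (isalnum((unsigned char)*cur))
        return new_node(0, *cur++, NULL, NULL);
    if (*cur == '(') {
        cur++;
        struct node *e = parse_expr();
        skip_spaces();
        if (e == NULL || *cur != ')')
            return NULL;
        cur++;
        return e;
    }
    return NULL;                /* syntax error */
}

/* term := factor (('*' | '/') factor)* */
static struct node *parse_term(void)
{
    struct node *left = parse_factor();
    for (;;) {
        skip_spaces();
        if (left == NULL || (*cur != '*' && *cur != '/'))
            return left;
        char op = *cur++;
        struct node *right = parse_factor();
        left = right ? new_node(op, 0, left, right) : NULL;
    }
}

/* expr := term (('+' | '-') term)* */
static struct node *parse_expr(void)
{
    struct node *left = parse_term();
    for (;;) {
        skip_spaces();
        if (left == NULL || (*cur != '+' && *cur != '-'))
            return left;
        char op = *cur++;
        struct node *right = parse_term();
        left = right ? new_node(op, 0, left, right) : NULL;
    }
}

/* Returns the root of the syntax tree, or NULL on a syntax error. */
struct node *parse(const char *src)
{
    assert(src != NULL);
    used = 0;
    cur = src;
    struct node *tree = parse_expr();
    skip_spaces();
    return (tree != NULL && *cur == '\0') ? tree : NULL;
}
```

For the input "a + b / c" the root node holds '+' and its right child holds '/', which is how the tree records that the division binds more tightly than the addition. An input such as "a + / c" yields NULL, the sketch's stand-in for a syntax error.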
Semantic analysis
Syntax analysis only checks the form of the program; it does not understand what the program and its statements actually mean. Semantic analysis checks the expressions and statements produced by syntax analysis for errors of meaning. If the arguments you pass to a function do not match its declared parameter types, if you use an undeclared variable, if you divide by a constant zero, if a break statement appears outside a loop or switch statement, or if a continue statement appears outside a loop, a semantic error or warning will generally be reported.
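The break and continue rules above can be checked with nothing more than two nesting counters. The sketch below assumes a heavily simplified input, a flat array of statement kinds with explicit begin and end markers, and all names in it are hypothetical. A real compiler performs this check while walking its syntax tree.

```c
#include <assert.h>
#include <stddef.h>

enum stmt_kind {
    LOOP_BEGIN, LOOP_END, SWITCH_BEGIN, SWITCH_END,
    BREAK_STMT, CONTINUE_STMT, OTHER_STMT
};

/* Count misplaced break and continue statements in a flattened statement list. */
int count_jump_errors(const enum stmt_kind *stmts, size_t n)
{
    int loops = 0;      /* how many loops enclose the current statement    */
    int switches = 0;   /* how many switches enclose the current statement */
    int errors = 0;

    for (size_t i = 0; i < n; i++) {
        switch (stmts[i]) {
        case LOOP_BEGIN:   loops++; break;
        case LOOP_END:     assert(loops > 0); loops--; break;
        case SWITCH_BEGIN: switches++; break;
        case SWITCH_END:   assert(switches > 0); switches--; break;
        case BREAK_STMT:   /* legal inside a loop or a switch */
            if (loops + switches == 0)
                errors++;
            break;
        case CONTINUE_STMT: /* legal only inside a loop */
            if (loops == 0)
                errors++;
            break;
        case OTHER_STMT:
            break;
        }
    }
    return errors;
}
```

A continue that sits inside a switch but outside any loop is counted as an error, while a break in the same position is accepted.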
Intermediate code generation
After syntax analysis, the expressions and statements of the program are still stored as syntax trees, and they need to be converted into intermediate code. Intermediate code is a temporary representation used during compilation; common forms are three-address code and P-code.
Compared with the syntax tree, intermediate code has several advantages: it is a one-dimensional linear structure that resembles pseudocode, and the compiler can easily translate it into target code.
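The step from a tree to linear three-address code can be sketched as a post-order walk that introduces a temporary for every inner operator. The code below is an illustration only: the expr and tac_buf types are invented for the example, names are limited to 15 characters, and the temporaries are simply numbered _1, _2, and so on. It does not reproduce GCC's actual algorithm.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct expr {
    char op;                       /* '+', '-', '*', '/', or 0 for a leaf */
    const char *name;              /* variable name when op == 0          */
    const struct expr *lhs, *rhs;
};

struct tac_buf {
    char text[256];                /* generated code, one instruction per line */
    int next_temp;                 /* number of the next temporary             */
};

/* Append "dst = lhs op rhs;" to the output buffer. */
static void emit(struct tac_buf *out, const char *dst, const char *lhs, char op, const char *rhs)
{
    size_t len = strlen(out->text);
    snprintf(out->text + len, sizeof out->text - len, "%s = %s %c %s;\n", dst, lhs, op, rhs);
}

/* Generate code for e and store the name that holds its value in result. */
static void gen_expr(const struct expr *e, struct tac_buf *out, char *result, size_t size)
{
    char left[16], right[16];

    if (e->op == 0) {              /* a leaf is already a usable operand */
        snprintf(result, size, "%.15s", e->name);
        return;
    }
    gen_expr(e->lhs, out, left, sizeof left);
    gen_expr(e->rhs, out, right, sizeof right);
    snprintf(result, size, "_%d", out->next_temp++);
    emit(out, result, left, e->op, right);
}

/* Generate code for "target = e", where e is a binary expression. */
void gen_assign(const char *target, const struct expr *e, struct tac_buf *out)
{
    char left[16], right[16];

    assert(e != NULL && e->op != 0);   /* sketch: plain copies are not handled */
    gen_expr(e->lhs, out, left, sizeof left);
    gen_expr(e->rhs, out, right, sizeof right);
    emit(out, target, left, e->op, right);
}
```

Each emitted line has at most one operator and three names, which is where the term three-address code comes from.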
jiaming@jiaming-pc:~/Documents/CSDN_Project$ cat main2.c
int main(void)
{
    int sum = 0;
    int a = 2;
    int b = 1;
    int c = 1;
    sum = a + b / c;
    return 0;
}
jiaming@jiaming-pc:~/Documents/CSDN_Project$ arm-linux-gnueabi-gcc -fdump-tree-gimple main2.c
jiaming@jiaming-pc:~/Documents/CSDN_Project$ cat main2.c.005t.gimple # a dump file named after the source file is generated automatically
main ()
{
  int D.4205;
  {
    int sum;
    int a;
    int b;
    int c;
    sum = 0;
    a = 2;
    b = 1;
    c = 1;
    _1 = b / c;
    sum = a + _1;
    D.4205 = 0;
    return D.4205;
  }
  D.4205 = 0;
  return D.4205;
}
After the C statement sum = a + b / c; is translated into three-address code, it becomes the pseudocode-like statements shown above. Intermediate code is generally platform independent. If you want to compile the C program into an executable for the x86 platform, the final step is to translate the intermediate code into x86 assembly according to the x86 instruction set. If you want an executable that runs on the ARM platform, the compiler must instead follow the ARM instruction set, allocate registers according to the ATPCS rules, and translate the intermediate code into ARM assembly.
jiaming@jiaming-pc:~/Documents/CSDN_Project$ arm-linux-gnueabi-gcc -S main2.c
jiaming@jiaming-pc:~/Documents/CSDN_Project$ cat main2.s # a file with the .s suffix is generated automatically
.arch armv5t
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 2
.eabi_attribute 30, 6
.eabi_attribute 34, 0
.eabi_attribute 18, 4
.file "main2.c"
.text
.global __aeabi_idiv
.align 2
.global main
.syntax unified
.arm
.fpu softvfp
.type main, %function
main:
@ args = 0, pretend = 0, frame = 16
@ frame_needed = 1, uses_anonymous_args = 0
push {fp, lr}
add fp, sp, #4
sub sp, sp, #16
mov r3, #0
str r3, [fp, #-20]
mov r3, #2
str r3, [fp, #-16]
mov r3, #1
str r3, [fp, #-12]
mov r3, #1
str r3, [fp, #-8]
ldr r1, [fp, #-8]
ldr r0, [fp, #-12]
bl __aeabi_idiv
mov r3, r0
mov r2, r3
ldr r3, [fp, #-16]
add r3, r3, r2
str r3, [fp, #-20]
mov r3, #0
mov r0, r3
sub sp, fp, #4
@ sp needed
pop {fp, pc}
.size main, .-main
.ident "GCC: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0"
.section .note.GNU-stack,"",%progbits
Assembly process
The assembly process uses an assembler to translate the assembly file generated in the previous stage into an object file. The assembler's main job is to translate the assembly code into the corresponding binary instructions according to the instruction set architecture (ISA). At the same time, it generates some additional information, stores it in the object file in the form of sections, and makes it available to the subsequent linking process. Like compilation, assembly mainly consists of lexical analysis, syntax analysis, and instruction generation.
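To give a feel for the instruction generation step, the sketch below encodes two of the ARM instructions from the listing above into their 32-bit machine words. It assumes ARM (not Thumb) state, the always condition, no flag update, and an immediate that fits in 8 bits without rotation. The function names are invented for the example, and a real assembler covers many more forms.

```c
#include <assert.h>
#include <stdint.h>

/* Encode "mov rd, #imm8": cond=AL, immediate form, opcode MOV. */
uint32_t encode_mov_imm(unsigned rd, unsigned imm8)
{
    assert(rd < 16 && imm8 < 256);
    return 0xE3A00000u | ((uint32_t)rd << 12) | (uint32_t)imm8;
}

/* Encode "add rd, rn, rm": cond=AL, register form, opcode ADD, no shift. */
uint32_t encode_add_reg(unsigned rd, unsigned rn, unsigned rm)
{
    assert(rd < 16 && rn < 16 && rm < 16);
    return 0xE0800000u | ((uint32_t)rn << 16) | ((uint32_t)rd << 12) | (uint32_t)rm;
}
```

With these helpers, mov r3, #0 becomes the word 0xE3A03000 and add r3, r3, r2 becomes 0xE0833002. Disassembling main2.o with objdump -d is one way to compare against what the real assembler produced.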
When the compiler builds a project, it works in units of C source files: each source file is compiled into a corresponding object file (main.c --> main.o). The main.o object file cannot be executed; it is a relocatable object file. It must be relocated and linked by the linker before it becomes the executable object file a.out.
In every relocatable object file produced by compilation, addresses start at zero. When translating a source file into a relocatable object file, the compiler turns each function into binary instructions and stores the instruction sequences one after another in the code segment, starting from address zero. The entry address of the first function is therefore zero, and the following functions are placed at increasing offsets.
To compile and assemble without linking, use the command arm-linux-gnueabi-gcc -c main.c. Then use the readelf command to view the section headers of the main.o object file:
jiaming@jiaming-pc:~/Documents/CSDN_Project$ readelf -S main.o
There are 12 section headers, starting at offset 0x354:
Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        00000000 000034 00005c 00  AX  0   0  4
  [ 2] .rel.text         REL             00000000 0002c0 000030 08   I  9   1  4
  [ 3] .data             PROGBITS        00000000 000090 000008 00  WA  0   0  4
  [ 4] .bss              NOBITS          00000000 000098 000004 00  WA  0   0  4
  [ 5] .rodata           PROGBITS        00000000 000098 000010 00   A  0   0  4
  [ 6] .comment          PROGBITS        00000000 0000a8 00002c 01  MS  0   0  1
  [ 7] .note.GNU-stack   PROGBITS        00000000 0000d4 000000 00      0   0  1
  [ 8] .ARM.attributes   ARM_ATTRIBUTES  00000000 0000d4 00002a 00      0   0  1
  [ 9] .symtab           SYMTAB          00000000 000100 000160 10     10  16  4
  [10] .strtab           STRTAB          00000000 000260 00005d 00      0   0  1
  [11] .shstrtab         STRTAB          00000000 0002f0 000061 00      0   0  1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
y (purecode), p (processor specific)
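The letters in the Flg column are a compact rendering of the flag bits stored in each section header. As a small sketch, the function below turns a flag word into the same letters for the six flags that occur in this listing. The bit values are the standard ELF ones (write 0x1, alloc 0x2, execute 0x4, merge 0x10, strings 0x20, info 0x40); the function name and its buffer handling are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Render section flag bits as readelf-style letters, e.g. 0x6 -> "AX". */
void format_section_flags(uint32_t flags, char *out, size_t size)
{
    static const struct { uint32_t bit; char letter; } table[] = {
        { 0x01, 'W' },   /* writable at run time            */
        { 0x02, 'A' },   /* occupies memory when loaded     */
        { 0x04, 'X' },   /* contains executable code        */
        { 0x10, 'M' },   /* may be merged                   */
        { 0x20, 'S' },   /* holds null-terminated strings   */
        { 0x40, 'I' },   /* sh_info holds a section index   */
    };
    size_t n = 0;

    assert(out != NULL && size > 0);
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if ((flags & table[i].bit) != 0 && n + 1 < size)
            out[n++] = table[i].letter;
    out[n] = '\0';
}
```

Read this way, .text is allocated and executable, .data and .bss are writable and allocated, and .rodata is allocated but neither writable nor executable.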
The code segment of main.o is assembled with address zero as its base address. In each relocatable object file, the address of a function or variable is really just its offset from zero within that file. In the subsequent linking process, when the linker combines the object files, the base address of each object file changes, so the addresses of the functions and variables in that file must be updated accordingly. Otherwise it would be impossible to reference a function by its function name or a variable by its variable name.
After the linker has combined the object files, it has to fix up the addresses of the variables and functions in each of them. This process is generally called relocation. The symbols (function names and variable names) that need to be relocated are collected into a relocation table, which is saved in each relocatable object file in the form of a section.
For example, the main function in main.o references the add and sub functions defined in sub.o. When the linker combines the two files, the addresses of add and sub change; afterwards, the new addresses of add and sub must be calculated and every reference to them updated. This process is relocation.
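The address arithmetic behind this can be sketched in a few lines. The example assumes the linker simply places the code sections of the object files one after another from some start address and ignores alignment, other sections, and linker scripts. The start address 0x10000, the size of sub.o, and the offset of sub within it are made-up values; only the 0x5c size of main.o's .text comes from the readelf output above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct object_text {
    uint32_t size;   /* size of this file's .text in bytes               */
    uint32_t base;   /* final start address, filled in by layout_text()  */
};

/* Place each file's .text directly after the previous one, starting at start. */
void layout_text(struct object_text *objs, size_t count, uint32_t start)
{
    uint32_t addr = start;

    for (size_t i = 0; i < count; i++) {
        objs[i].base = addr;
        addr += objs[i].size;
    }
}

/* A symbol's final address is its file's new base plus its offset in that file. */
uint32_t final_address(const struct object_text *obj, uint32_t offset)
{
    assert(obj != NULL && offset < obj->size);
    return obj->base + offset;
}
```

With main.o (0x5c bytes) placed first at 0x10000, a function at offset 0 of the second file ends up at 0x1005c. That new value is what the linker has to patch into every instruction in main.o that calls it.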
Symbol table and relocation table
The symbol table and the relocation table are two very important tables that provide various necessary information for the linking process.
Symbol table
During assembly, the assembler analyzes each section of the assembly file, collects the symbols it finds, generates a symbol table, and records in it the offset of each symbol within its section. The symbol table stores information about the symbols in the source program, including each symbol's address, type, and size. On the one hand, this information helps the compiler perform semantic checks and detect semantic errors in the source program; on the other hand, it supports code generation, including address and space allocation, symbol resolution, and relocation.
View the symbol table:
jiaming@jiaming-pc:~/Documents/CSDN_Project$ readelf -s main.o
Symbol table '.symtab' contains 22 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 00000000 0 NOTYPE LOCAL DEFAULT UND
1: 00000000 0 FILE LOCAL DEFAULT ABS main.c
2: 00000000 0 SECTION LOCAL DEFAULT 1
3: 00000000 0 SECTION LOCAL DEFAULT 3
4: 00000000 0 SECTION LOCAL DEFAULT 4
5: 00000000 0 NOTYPE LOCAL DEFAULT 3 $d
6: 00000000 0 SECTION LOCAL DEFAULT 5
7: 00000000 0 NOTYPE LOCAL DEFAULT 5 $d
8: 00000000 0 NOTYPE LOCAL DEFAULT 1 $a
9: 00000054 0 NOTYPE LOCAL DEFAULT 1 $d
10: 00000000 4 OBJECT LOCAL DEFAULT 4 uninit_local_val.4612
11: 00000000 0 NOTYPE LOCAL DEFAULT 4 $d
12: 00000004 4 OBJECT LOCAL DEFAULT 3 local_val.4611
13: 00000000 0 SECTION LOCAL DEFAULT 7
14: 00000000 0 SECTION LOCAL DEFAULT 6
15: 00000000 0 SECTION LOCAL DEFAULT 8
16: 00000000 4 OBJECT GLOBAL DEFAULT 3 global_val
17: 00000004 4 OBJECT GLOBAL DEFAULT COM uninit_val
18: 00000000 92 FUNC GLOBAL DEFAULT 1 main
19: 00000000 0 NOTYPE GLOBAL DEFAULT UND add
20: 00000000 0 NOTYPE GLOBAL DEFAULT UND sub
21: 00000000 0 NOTYPE GLOBAL DEFAULT UND printf
Each symbol in the symbol table has a symbol value and a type. The symbol value is essentially an address. It can be an absolute address, which is what generally appears in executable object files, or a relative address (an offset), which is what generally appears in relocatable object files. The main symbol types are:
- OBJECT: Object type, generally used to represent the variables we define in the program.
- FUNC: The symbol is associated with a function or other executable code that can be referenced.
- FILE: The symbol gives the name of the source file associated with the object file (main.c in the listing above).
- SECTION: Indicates that the symbol is associated with a section, which is mainly used for relocation.
- COMMON: Indicates that the symbol is a common block data object, typically an uninitialized global variable treated as a weak symbol; no space has been allocated for it in the current file yet.
- TLS: Indicates that the variable corresponding to the symbol is stored in thread local storage.
- NOTYPE: The type was not specified, or the symbol type is not currently known.
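In the ELF file itself, the Type and Bind columns shown by readelf are packed into a single byte of each symbol table entry: the upper four bits hold the binding and the lower four bits hold the type. The sketch below decodes that byte. The numeric values follow the ELF specification, while the enum and function names are invented for the example.

```c
#include <assert.h>

enum symbol_kind {
    SYM_NOTYPE = 0, SYM_OBJECT = 1, SYM_FUNC = 2, SYM_SECTION = 3,
    SYM_FILE = 4, SYM_COMMON = 5, SYM_TLS = 6
};

enum symbol_binding { BIND_LOCAL = 0, BIND_GLOBAL = 1, BIND_WEAK = 2 };

/* Upper four bits of the info byte: the binding. */
unsigned symbol_bind(unsigned char info) { return info >> 4; }

/* Lower four bits of the info byte: the type. */
unsigned symbol_type(unsigned char info) { return info & 0x0fu; }

/* Pack a binding and a type back into one info byte. */
unsigned char make_info(unsigned bind, unsigned type)
{
    assert(bind < 16 && type < 16);
    return (unsigned char)((bind << 4) | type);
}

/* Name of a symbol type as printed in the Type column. */
const char *symbol_type_name(unsigned type)
{
    switch (type) {
    case SYM_NOTYPE:  return "NOTYPE";
    case SYM_OBJECT:  return "OBJECT";
    case SYM_FUNC:    return "FUNC";
    case SYM_SECTION: return "SECTION";
    case SYM_FILE:    return "FILE";
    case SYM_COMMON:  return "COMMON";
    case SYM_TLS:     return "TLS";
    default:          return "UNKNOWN";
    }
}
```

A global function such as main is therefore stored with the info byte 0x12: binding 1 (GLOBAL) in the high half and type 2 (FUNC) in the low half.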
If a C source file references a function or global variable defined in another file, the compiler does not report an error as long as the symbol is declared before it is used. The compiler assumes that the function or global variable may be defined in another file or library, so nothing is reported during the compilation phase. In the subsequent linking process, the linker tries to find the definition of the referenced symbol in the other files and libraries, and reports an error if it cannot find one. An error reported at this point is a link error.
Relocation table
While generating the symbol table for an object file, if the compiler cannot find the definition of a symbol in the current file, it also records that symbol in a separate table so that its address can be filled in later. This table is the relocation table. The undefined symbols can be seen in the symbol table (.symtab) of main.o:
19: 00000000 0 NOTYPE GLOBAL DEFAULT UND add
20: 00000000 0 NOTYPE GLOBAL DEFAULT UND sub
21: 00000000 0 NOTYPE GLOBAL DEFAULT UND printf
The symbols add, sub, and printf are undefined in main.o (their Ndx column is UND and their type is NOTYPE), so their addresses must be filled in later. main.o uses a relocation table, .rel.text, to record the places that reference these symbols and need to be relocated:
jiaming@jiaming-pc:~/Documents/CSDN_Project$ readelf -r main.o
Relocation section '.rel.text' at offset 0x2c0 contains 6 entries:
Offset Info Type Sym.Value Sym. Name
00000014 0000131c R_ARM_CALL 00000000 add
00000024 0000141c R_ARM_CALL 00000000 sub
00000034 0000151c R_ARM_CALL 00000000 printf
00000040 0000151c R_ARM_CALL 00000000 printf
00000054 00000602 R_ARM_ABS32 00000000 .rodata
00000058 00000602 R_ARM_ABS32 00000000 .rodata
In .rel.text, we can see the symbols add and sub and the library function printf, all of which need to be relocated. During the subsequent linking process, the addresses associated with these symbols in the relocation table are updated to their new, actual addresses.
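The Info column ties each relocation entry back to the symbol table. In a 32-bit ELF file, the upper 24 bits of Info are the index of the symbol in .symtab and the lowest 8 bits are the relocation type. The sketch below decodes the values from the listing above; the function names are invented for the example, and the two type codes (2 for R_ARM_ABS32, 28 for R_ARM_CALL) are the values defined for ARM.

```c
#include <assert.h>
#include <stdint.h>

#define RELOC_ARM_ABS32 2u    /* R_ARM_ABS32: store a 32-bit absolute address */
#define RELOC_ARM_CALL  28u   /* R_ARM_CALL: patch the target of a BL call    */

/* Upper 24 bits of Info: index of the symbol in .symtab. */
uint32_t reloc_symbol_index(uint32_t info) { return info >> 8; }

/* Lowest 8 bits of Info: the relocation type. */
uint32_t reloc_type(uint32_t info) { return info & 0xffu; }

/* Check the decoding against the entries readelf printed for main.o. */
void check_main_o_entries(void)
{
    assert(reloc_symbol_index(0x0000131c) == 19);            /* add    */
    assert(reloc_symbol_index(0x0000141c) == 20);            /* sub    */
    assert(reloc_symbol_index(0x0000151c) == 21);            /* printf */
    assert(reloc_type(0x0000131c) == RELOC_ARM_CALL);
    assert(reloc_symbol_index(0x00000602) == 6);             /* section symbol for .rodata */
    assert(reloc_type(0x00000602) == RELOC_ARM_ABS32);
}
```

The decoded indexes 19, 20, and 21 are exactly the Num values of add, sub, and printf in the symbol table, and index 6 is the SECTION symbol whose Ndx is 5, the .rodata section.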