Program Compilation (3/13)

After preprocessing, the source file has been stripped down: comments are deleted, the preprocessing directives have been processed, and what remains is pure C code. The next step is the compilation phase, which has two main steps. First, the compiler runs a series of analysis tools over the C code and compiles the C source file into an assembly file. Second, the assembler assembles the assembly file into a relocatable object file.

From C file to assembly file

An assembly file organizes the program in units of segments: code segment, data segment, BSS segment, and so on. The segments are independent of each other, and this organization is very close to that of the binary object file, because assembly instructions are simply mnemonics for binary instructions. The program structure of an assembly file, however, is expressed with various pseudo-operations (assembler directives). Once the assembler has assembled the file and processed those pseudo-operations, the result is a binary object file.

Converting a C source file into an assembly file means turning the code blocks and functions of the C file into the code segment of the assembly program, and turning the global variables, static variables, and constants of the C program into its data segment and read-only data segment. Generally speaking, the compilation process can be divided into the following 6 steps:

  1. Lexical analysis
  2. Syntax analysis
  3. Semantic analysis
  4. Intermediate code generation
  5. Assembly code generation
  6. Object code generation

Lexical analysis

Lexical analysis parses the text of the C program. A lexical scanner reads the source program character by character from left to right, recognizes the character stream with a finite state machine, and decomposes the source program into a series of units that cannot be decomposed any further, called tokens.

A token is the smallest meaningful unit in the character stream. The common tokens are as follows:

  • Various keywords of C language: int, float, for, while, break, etc.
  • Various user-defined identifiers: function names, variable names, labels, etc.
  • Literals: numbers, strings, etc.
  • Operators: More than 40 operators defined by the C language standard.
  • Delimiters: the semicolon at the end of a statement, the comma in a for loop, etc.
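The token categories above can be sketched as a very small scanner. This is only an illustration, not how GCC's lexer is written: the names token_kind, token, and next_token are hypothetical, the keyword list is a tiny subset, and every operator is assumed to be a single character.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Token categories, mirroring the list above (illustrative names). */
enum token_kind { TOK_KEYWORD, TOK_IDENT, TOK_NUMBER, TOK_OPERATOR, TOK_DELIM, TOK_END };

struct token {
    enum token_kind kind;
    char text[32];
};

/* Only a handful of keywords, enough for the example. */
static int is_keyword(const char *s)
{
    static const char *const kw[] = { "int", "float", "for", "while", "break", "return" };
    for (size_t i = 0; i < sizeof kw / sizeof kw[0]; i++)
        if (strcmp(s, kw[i]) == 0)
            return 1;
    return 0;
}

/* Scan one token starting at *src and advance *src past it.
 * Simplification: any character that is not a letter, digit, ';' or ','
 * is treated as a one-character operator. */
struct token next_token(const char **src)
{
    struct token t = { TOK_END, "" };
    const char *p = *src;
    size_t n = 0;

    while (isspace((unsigned char)*p))
        p++;

    if (*p == '\0') {
        /* End of input: t already says TOK_END. */
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        while ((isalnum((unsigned char)*p) || *p == '_') && n < sizeof t.text - 1)
            t.text[n++] = *p++;
        t.text[n] = '\0';
        t.kind = is_keyword(t.text) ? TOK_KEYWORD : TOK_IDENT;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p) && n < sizeof t.text - 1)
            t.text[n++] = *p++;
        t.text[n] = '\0';
        t.kind = TOK_NUMBER;
    } else {
        t.text[0] = *p++;
        t.text[1] = '\0';
        t.kind = (t.text[0] == ';' || t.text[0] == ',') ? TOK_DELIM : TOK_OPERATOR;
    }

    *src = p;
    return t;
}
```

Calling next_token repeatedly on the string "int sum = a + b / c;" yields a keyword, an identifier, an operator, and so on, until a TOK_END token signals the end of the input.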

Syntax analysis

Syntax analysis takes the token sequence produced by the previous stage and checks whether it can be assembled into grammatically correct constructs (program, statement, expression, etc.). These constructs are represented by a syntax tree, which is a tree structure rather than a linear sequence. If the parser finds that the token sequence cannot form a grammatically correct statement or expression, it reports a syntax error.

Semantic analysis

Syntax analysis only checks the grammar of the program; it does not understand what the program and its statements mean. Semantic analysis checks the expressions and statements produced by syntax analysis for errors of meaning. A semantic error or warning is generally reported if the arguments you pass to a function do not match its declared parameter types, if you use an undeclared variable, if you divide by zero, if a break statement appears outside a loop or switch statement, or if a continue statement appears outside a loop.

Intermediate code generation

The expressions and statements produced by the analysis stages are still stored as syntax trees, and they need to be converted into intermediate code. Intermediate code is a temporary representation used during compilation; common forms are three-address code and P-code.

Compared with the syntax tree, intermediate code has several advantages: it is a one-dimensional linear structure that reads like pseudocode, and the compiler can easily translate it into target code.

jiaming@jiaming-pc:~/Documents/CSDN_Project$ cat main2.c
int main(void)
{
	int sum = 0;
	int a = 2;
	int b = 1;
	int c = 1;
	sum = a + b / c;
	return 0;
}
jiaming@jiaming-pc:~/Documents/CSDN_Project$ arm-linux-gnueabi-gcc -fdump-tree-gimple main2.c
jiaming@jiaming-pc:~/Documents/CSDN_Project$ cat main2.c.005t.gimple # a dump file with the same base name is generated automatically
main ()
{
  int D.4205;

  {
    int sum;
    int a;
    int b;
    int c;

    sum = 0;
    a = 2;
    b = 1;
    c = 1;
    _1 = b / c;
    sum = a + _1;
    D.4205 = 0;
    return D.4205;
  }
  D.4205 = 0;
  return D.4205;
}

After the C statement sum = a + b / c; is compiled into three-address code, it becomes a sequence of statements similar to the pseudocode shown above. Intermediate code is generally platform independent. To build an executable for the x86 platform, the last step is to translate the intermediate code into x86 assembly according to the x86 instruction set. To build an executable that runs on the ARM platform, the compiler instead follows the ARM instruction set, allocates registers according to the ATPCS rules, and translates the intermediate code into ARM assembly.
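To make the step from syntax tree to linear code concrete, here is a minimal sketch that walks an expression tree for a + b / c and emits three-address statements. It is not GCC's algorithm: the node type and the functions emit_tac and translate_example are hypothetical, and unlike the GIMPLE dump above this sketch also stores the final addition in a temporary before assigning it to sum.

```c
#include <stdio.h>
#include <string.h>

/* A minimal expression tree node: a leaf holds a variable name,
 * an inner node holds a binary operator and two children. */
struct node {
    char op;                      /* '+', '-', '*', '/', or 0 for a leaf */
    const char *name;             /* variable name when op == 0 */
    const struct node *lhs, *rhs;
};

/* Append three-address code for the tree to out and write the name that
 * holds the result into result. Temporaries are named _1, _2, ... to
 * resemble the GIMPLE dump. */
void emit_tac(const struct node *n, int *next_tmp, char *out, size_t out_size,
              char *result, size_t result_size)
{
    if (n->op == 0) {
        /* A leaf needs no code: its value is already in a named variable. */
        snprintf(result, result_size, "%s", n->name);
        return;
    }

    char l[16], r[16];
    emit_tac(n->lhs, next_tmp, out, out_size, l, sizeof l);
    emit_tac(n->rhs, next_tmp, out, out_size, r, sizeof r);

    /* Each operator becomes one statement with at most three addresses. */
    snprintf(result, result_size, "_%d", (*next_tmp)++);
    size_t used = strlen(out);
    snprintf(out + used, out_size - used, "%s = %s %c %s;\n", result, l, n->op, r);
}

/* Build the tree for "a + b / c" by hand and translate "sum = a + b / c". */
void translate_example(char *out, size_t out_size)
{
    const struct node a = { 0, "a", NULL, NULL };
    const struct node b = { 0, "b", NULL, NULL };
    const struct node c = { 0, "c", NULL, NULL };
    const struct node quot = { '/', NULL, &b, &c };   /* b / c binds tighter */
    const struct node sum_expr = { '+', NULL, &a, &quot };
    char result[16];
    int next_tmp = 1;

    out[0] = '\0';
    emit_tac(&sum_expr, &next_tmp, out, out_size, result, sizeof result);
    size_t used = strlen(out);
    snprintf(out + used, out_size - used, "sum = %s;\n", result);
}
```

Passing a 128-byte buffer to translate_example fills it with three lines: the division into _1, the addition into _2, and the final assignment to sum. The division is emitted first because the tree is walked bottom-up, which is how operator precedence in the tree turns into statement order in the linear code.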

jiaming@jiaming-pc:~/Documents/CSDN_Project$ arm-linux-gnueabi-gcc -S main2.c
jiaming@jiaming-pc:~/Documents/CSDN_Project$ cat main2.s # a file with the .s suffix is generated automatically
	.arch armv5t
	.eabi_attribute 20, 1
	.eabi_attribute 21, 1
	.eabi_attribute 23, 3
	.eabi_attribute 24, 1
	.eabi_attribute 25, 1
	.eabi_attribute 26, 2
	.eabi_attribute 30, 6
	.eabi_attribute 34, 0
	.eabi_attribute 18, 4
	.file	"main2.c"
	.text
	.global	__aeabi_idiv
	.align	2
	.global	main
	.syntax unified
	.arm
	.fpu softvfp
	.type	main, %function
main:
	@ args = 0, pretend = 0, frame = 16
	@ frame_needed = 1, uses_anonymous_args = 0
	push	{fp, lr}
	add	fp, sp, #4
	sub	sp, sp, #16
	mov	r3, #0
	str	r3, [fp, #-20]
	mov	r3, #2
	str	r3, [fp, #-16]
	mov	r3, #1
	str	r3, [fp, #-12]
	mov	r3, #1
	str	r3, [fp, #-8]
	ldr	r1, [fp, #-8]
	ldr	r0, [fp, #-12]
	bl	__aeabi_idiv
	mov	r3, r0
	mov	r2, r3
	ldr	r3, [fp, #-16]
	add	r3, r3, r2
	str	r3, [fp, #-20]
	mov	r3, #0
	mov	r0, r3
	sub	sp, fp, #4
	@ sp needed
	pop	{fp, pc}
	.size	main, .-main
	.ident	"GCC: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0"
	.section	.note.GNU-stack,"",%progbits

Assembly process

In the assembly process, the assembler translates the assembly file generated in the previous stage into an object file. The assembler's main job is to translate the assembly code into the corresponding binary instructions according to the instruction set architecture (ISA). At the same time it generates some necessary auxiliary information and stores it in the object file in the form of sections, for use in the subsequent linking process. Assembly mainly consists of lexical analysis, syntax analysis, and instruction generation.

When the compiler builds a project, it works in units of C source files: each source file is compiled into a corresponding object file (main.c --> main.o). The main.o object file is not executable; it is a relocatable object file. The linker must relocate and link it before it can become part of the executable file a.out.

Every relocatable object file produced by compilation uses address zero as its link start address. When translating a source file into a relocatable object file, the compiler turns each function into binary instructions and stores the instruction sequences in the code segment one after another, starting from address zero. Function entry addresses therefore start at zero and increase with each successive function.

To compile without linking: arm-linux-gnueabi-gcc -c main.c.

Use the readelf command to view the main.o object file:

jiaming@jiaming-pc:~/Documents/CSDN_Project$ readelf -S main.o 
There are 12 section headers, starting at offset 0x354:

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        00000000 000034 00005c 00  AX  0   0  4
  [ 2] .rel.text         REL             00000000 0002c0 000030 08   I  9   1  4
  [ 3] .data             PROGBITS        00000000 000090 000008 00  WA  0   0  4
  [ 4] .bss              NOBITS          00000000 000098 000004 00  WA  0   0  4
  [ 5] .rodata           PROGBITS        00000000 000098 000010 00   A  0   0  4
  [ 6] .comment          PROGBITS        00000000 0000a8 00002c 01  MS  0   0  1
  [ 7] .note.GNU-stack   PROGBITS        00000000 0000d4 000000 00      0   0  1
  [ 8] .ARM.attributes   ARM_ATTRIBUTES  00000000 0000d4 00002a 00      0   0  1
  [ 9] .symtab           SYMTAB          00000000 000100 000160 10     10  16  4
  [10] .strtab           STRTAB          00000000 000260 00005d 00      0   0  1
  [11] .shstrtab         STRTAB          00000000 0002f0 000061 00      0   0  1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  y (purecode), p (processor specific)

When main.o is generated, its code segment is assembled with address zero as the base address. In every relocatable object file, the address of a function or variable is its offset from zero within that file. During linking, when the linker combines the object files, the base address of each object file changes, so the addresses of the functions and variables in that file must be updated as well. Otherwise a function could no longer be reached through its function name, and a variable could no longer be reached through its variable name.

After the linker has combined the object files, it has to fix up the addresses of the variables and functions in each of them. This process is generally called relocation. The symbols (function names and variable names) that need to be relocated are collected into a relocation table, which is stored in each relocatable object file in the form of a section.

For example, the main function in main.o references the add and sub functions in sub.o. When the linker combines the files, the addresses of add and sub change; afterwards their new addresses have to be recalculated and the references to them updated. This process is relocation.
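The address arithmetic behind this can be sketched in a few lines. The sketch below is a simplification that only handles code segments, and everything in it is hypothetical apart from the 0x5c size of main.o's .text section taken from the readelf output above: the struct and function names, the size of sub.o, the offsets of add and sub, and the load address are all made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* One input object file as seen by this sketch: the size of its code
 * segment and the base address the linker assigns to it. */
struct object {
    const char *name;
    uint32_t text_size;
    uint32_t text_base;           /* filled in by layout() */
};

/* A symbol defined in some object, at an offset from that object's
 * own address zero. */
struct symbol {
    const char *name;
    const struct object *owner;
    uint32_t offset;
};

/* Place the code segments one after another, starting at load_addr. */
void layout(struct object *objs, size_t count, uint32_t load_addr)
{
    uint32_t next = load_addr;
    for (size_t i = 0; i < count; i++) {
        objs[i].text_base = next;
        next += objs[i].text_size;
    }
}

/* Final address of a symbol = new base of its object + offset within it. */
uint32_t resolve(const struct symbol *syms, size_t count, const char *name)
{
    for (size_t i = 0; i < count; i++)
        if (strcmp(syms[i].name, name) == 0)
            return syms[i].owner->text_base + syms[i].offset;
    return 0; /* not found: a real linker would report an undefined reference */
}
```

With main.o (0x5c bytes of code) placed first at an assumed load address of 0x10000, a sub.o that defines add at offset 0 would end up with add at 0x1005c. That value is what the linker would then write into every place in main.o that calls add.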

Symbol table and relocation table

The symbol table and the relocation table are two very important tables that provide various necessary information for the linking process.

Symbol table

In the assembly stage, the assembler analyzes each section of the assembly file, collects the symbols it finds, generates a symbol table, and records the offset of each symbol within its section in that table. The symbol table stores information about the symbols in the source program, including each symbol's address, type, and size. On the one hand, this information helps the compiler with semantic checking, that is, with finding semantic errors in the source program. On the other hand, it supports the rest of the build, including address and space allocation, symbol resolution, and relocation.

View the symbol table:

jiaming@jiaming-pc:~/Documents/CSDN_Project$ readelf -s main.o

Symbol table '.symtab' contains 22 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
     0: 00000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 00000000     0 FILE    LOCAL  DEFAULT  ABS main.c
     2: 00000000     0 SECTION LOCAL  DEFAULT    1 
     3: 00000000     0 SECTION LOCAL  DEFAULT    3 
     4: 00000000     0 SECTION LOCAL  DEFAULT    4 
     5: 00000000     0 NOTYPE  LOCAL  DEFAULT    3 $d
     6: 00000000     0 SECTION LOCAL  DEFAULT    5 
     7: 00000000     0 NOTYPE  LOCAL  DEFAULT    5 $d
     8: 00000000     0 NOTYPE  LOCAL  DEFAULT    1 $a
     9: 00000054     0 NOTYPE  LOCAL  DEFAULT    1 $d
    10: 00000000     4 OBJECT  LOCAL  DEFAULT    4 uninit_local_val.4612
    11: 00000000     0 NOTYPE  LOCAL  DEFAULT    4 $d
    12: 00000004     4 OBJECT  LOCAL  DEFAULT    3 local_val.4611
    13: 00000000     0 SECTION LOCAL  DEFAULT    7 
    14: 00000000     0 SECTION LOCAL  DEFAULT    6 
    15: 00000000     0 SECTION LOCAL  DEFAULT    8 
    16: 00000000     4 OBJECT  GLOBAL DEFAULT    3 global_val
    17: 00000004     4 OBJECT  GLOBAL DEFAULT  COM uninit_val
    18: 00000000    92 FUNC    GLOBAL DEFAULT    1 main
    19: 00000000     0 NOTYPE  GLOBAL DEFAULT  UND add
    20: 00000000     0 NOTYPE  GLOBAL DEFAULT  UND sub
    21: 00000000     0 NOTYPE  GLOBAL DEFAULT  UND printf

Each symbol in the symbol table has a symbol value and a type. The symbol value is essentially an address, which can be an absolute address, which generally appears in executable object files; it can also be a relative address, which generally appears in relocatable object files. The types of symbols mainly include the following:

  • OBJECT: A data object, generally a variable defined in the program.
  • FUNC: The symbol is associated with a function name or other executable code that can be referenced.
  • FILE: The symbol gives the name of the source file associated with the current object file.
  • SECTION: The symbol is associated with a section and is mainly used for relocation.
  • COMMON: The symbol is a common-block data object, a global weak symbol; no space is allocated for it in the current file.
  • TLS: The variable corresponding to the symbol is stored in thread-local storage.
  • NOTYPE: The type is not specified, or the symbol type is not yet known.
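The Type and Bind columns printed by readelf -s come from a single byte in each symbol table entry. In the ELF format this st_info byte holds the binding in its high 4 bits and the type in its low 4 bits. The sketch below decodes it; the function names are hypothetical, while the numeric codes follow the ELF specification.

```c
#include <stdint.h>

/* Split the one-byte st_info field of an ELF symbol table entry. */
unsigned sym_bind(uint8_t st_info) { return st_info >> 4; }
unsigned sym_type(uint8_t st_info) { return st_info & 0x0f; }

/* Type codes 0..6, in the order defined by the ELF specification. */
const char *sym_type_name(unsigned type)
{
    static const char *const names[] = {
        "NOTYPE", "OBJECT", "FUNC", "SECTION", "FILE", "COMMON", "TLS"
    };
    return type < sizeof names / sizeof names[0] ? names[type] : "UNKNOWN";
}

/* Binding codes 0..2, in the order defined by the ELF specification. */
const char *sym_bind_name(unsigned bind)
{
    static const char *const names[] = { "LOCAL", "GLOBAL", "WEAK" };
    return bind < sizeof names / sizeof names[0] ? names[bind] : "UNKNOWN";
}
```

For example, an st_info value of 0x12 decodes to binding GLOBAL and type FUNC, which is how main appears in the listing above, and 0x10 decodes to GLOBAL and NOTYPE, which is how the undefined symbols add, sub, and printf appear.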

If a C source file refers to a function or global variable that is defined in another file, the compiler does not report an error as long as the symbol is declared before it is used. The compiler assumes that the function or global variable may be defined in another file or library, so no error is reported during the compilation phase. In the subsequent linking process, the linker tries to find the definition of the referenced symbol in the other files and libraries, and reports an error if it cannot find one. That error is a link error.

Relocation table

While generating the symbol table for an object file, the toolchain may find that a symbol has no definition in the current file. It collects such symbols and records them in a separate table so that their addresses can be filled in later. This table is the relocation table. In the symbol table (.symtab) of main.o we saw:

    19: 00000000     0 NOTYPE  GLOBAL DEFAULT  UND add
    20: 00000000     0 NOTYPE  GLOBAL DEFAULT  UND sub
    21: 00000000     0 NOTYPE  GLOBAL DEFAULT  UND printf

The symbols add, sub, and printf are undefined in main.o (their Ndx is UND and their type is NOTYPE), so their information has to be filled in later. main.o uses the relocation table .rel.text to record the places that reference these symbols and need to be relocated:

jiaming@jiaming-pc:~/Documents/CSDN_Project$ readelf -r main.o

Relocation section '.rel.text' at offset 0x2c0 contains 6 entries:
 Offset     Info    Type            Sym.Value  Sym. Name
00000014  0000131c R_ARM_CALL        00000000   add
00000024  0000141c R_ARM_CALL        00000000   sub
00000034  0000151c R_ARM_CALL        00000000   printf
00000040  0000151c R_ARM_CALL        00000000   printf
00000054  00000602 R_ARM_ABS32       00000000   .rodata
00000058  00000602 R_ARM_ABS32       00000000   .rodata

In .rel.text, we can see the symbols add and sub and the library function printf, all of which need to be relocated. During the subsequent linking process, the addresses associated with these symbols in the relocation table are updated to their new, actual addresses.
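The Info column in the readelf -r listing ties each relocation entry back to the symbol table. In 32-bit ELF, the upper 24 bits of this r_info field hold the symbol table index and the low 8 bits hold the relocation type. The sketch below decodes it; the function and enum names are hypothetical, while the type numbers 2 (R_ARM_ABS32) and 28 (R_ARM_CALL) are the values defined by the ARM ELF ABI.

```c
#include <stdint.h>

/* Relocation type numbers from the ARM ELF ABI that appear in the listing. */
enum { RELOC_ARM_ABS32 = 2, RELOC_ARM_CALL = 28 };

/* Upper 24 bits of r_info: index of the symbol in .symtab. */
uint32_t rel_sym_index(uint32_t r_info) { return r_info >> 8; }

/* Low 8 bits of r_info: relocation type. */
uint32_t rel_type(uint32_t r_info) { return r_info & 0xff; }

/* Name of a relocation type, limited to the two types shown above. */
const char *rel_type_name(uint32_t type)
{
    switch (type) {
    case RELOC_ARM_ABS32: return "R_ARM_ABS32";
    case RELOC_ARM_CALL:  return "R_ARM_CALL";
    default:              return "UNKNOWN";
    }
}
```

Decoding the first entry, 0x0000131c, gives symbol index 19 and type 28: entry 19 of .symtab is add, and type 28 is R_ARM_CALL. Likewise 0x00000602 gives symbol index 6, the section symbol for .rodata, with type R_ARM_ABS32.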

Origin blog.csdn.net/weixin_39541632/article/details/131906778