Dig deep into the whole process from .c to .exe, and see the essence through the phenomenon



The translation environment and execution environment of the program

In any implementation of ANSI C, there are two distinct environments.

The first is the translation environment, where source code is converted into executable machine instructions.
The process from .c to .exe relies on the translation environment.

The second is the execution environment, which is used to actually execute the code.

translation environment

compile

The compilation process is actually subdivided into three stages: precompilation (preprocessing), compilation, and assembly.

Each source file (file with the suffix of .c) that makes up a program is converted into object code (that is, the file with the suffix of .obj) through the compilation process

Each object file is bundled together by a linker to form a single and complete executable program (.exe).

The linker also brings in any functions from the standard C library that the program uses, and it can search the programmer's own libraries and link the functions it needs into the program.

As shown in the picture:

insert image description here

To summarize: each source file is processed separately by the compiler to generate an object file, and the linker then combines all object files, together with the libraries they link against, into a single executable file.

First write a piece of code:

#include <stdio.h>

int main(void)
{
    printf("hehe\n");
    return 0;
}

After building and running it, look at the generated .exe file; the .exe is the executable program.


The two source files add.c and test.c are each processed by the compiler, which finally generates the object files add.obj and test.obj. As the first step, the test.c file is precompiled to generate test.i.

Precompilation

The precompilation process mainly handles the preprocessing directives that start with "#" in the source files, such as "#include" and "#define". The main processing rules are as follows:

  • Delete all "#define" and expand all macro definitions.

  • Processes all conditional precompiled directives, such as "#if", "#ifdef", "#elif", "#else", "#endif".

  • Processes the "#include" precompiled directive, inserting the included file at the position of the precompiled directive. Note that this process is recursive, which means that included files may also include other files.

  • Remove all comments "//" and "/**/".

  • Add line number and file name markers, such as # 2 "hello.c" 2, so that the compiler can generate line number information for debugging and can report the correct line number when it emits compilation errors or warnings.

  • Keep all #pragma compiler directives, because the compiler needs to use them.

The precompiled .i file does not contain any macro definitions, because all macros have been expanded, and the included files have been inserted into the .i file. So when we cannot tell whether a macro is defined correctly or whether a header file is included correctly, we can inspect the precompiled file to pin down the problem.
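The rules above can be sketched with a short file. This is only an illustration, not code from the original article: the names BUFFER_SIZE, ENABLE_LOG, buffer_bytes and log_level are made up for the example. With GCC, running gcc -E on the file stops after preprocessing, so the expanded text can be inspected directly.

```c
#include <stddef.h>

/* Hypothetical names, chosen only for this sketch. */
#define BUFFER_SIZE 64          /* object-like macro: every use becomes 64 */
#define ENABLE_LOG 0            /* controls the conditional block below */

/* After preprocessing, the return statement reads: return 64 * sizeof(int); */
size_t buffer_bytes(void)
{
    return BUFFER_SIZE * sizeof(int);   /* this comment is removed as well */
}

/* Only one of the two branches survives in the .i file. */
int log_level(void)
{
#if ENABLE_LOG
    return 1;
#else
    return 0;
#endif
}
```

Because ENABLE_LOG is 0, the preprocessed output keeps only the "return 0;" branch of log_level, and neither #define line appears in it.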

In the process of precompilation, we mainly analyze three parts: inclusion of header files, deletion of comments, replacement of symbols defined by #define

header file inclusion

In the precompilation phase, each #include directive is replaced with the contents of the header file it names. For example, the line #include <stdio.h> is replaced with the entire contents of the header file stdio.h.

The test.c file contains #include "test.h", and the effect of #include "test.h" is to copy the contents of test.h into test.i.


delete comment

In the precompilation phase, all the comments in the code, both "//" and "/* */", are deleted.

Replace symbols defined by #define

Whether #define defines a plain identifier or a macro with parameters, both work by textual replacement, and the replacement actually happens in the precompilation stage.
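Because the replacement is purely textual, a macro body is pasted into the surrounding expression exactly as written. The sketch below uses two made-up macros, DOUBLE_BAD and DOUBLE_GOOD, to show why macro bodies are usually wrapped in parentheses.

```c
/* Hypothetical macros for illustration. */
#define DOUBLE_BAD(x)  x + x          /* pasted into the expression as-is */
#define DOUBLE_GOOD(x) ((x) + (x))    /* parentheses keep it one unit */

int bad_result(void)
{
    /* Expands to 10 * 3 + 3, which is 33, not 60. */
    return 10 * DOUBLE_BAD(3);
}

int good_result(void)
{
    /* Expands to 10 * ((3) + (3)), which is 60. */
    return 10 * DOUBLE_GOOD(3);
}
```

The preprocessor never evaluates anything here; it only rewrites the text, and operator precedence is applied later by the compiler.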

The 3 things done in the precompilation stage are actually some text operations, and the code is not run

compile

The compilation process performs a series of steps on the preprocessed file (lexical analysis, syntax analysis, semantic analysis, and optimization) to produce the corresponding assembly code file. This is usually the core of what we call building a program, and it is also one of the most complicated parts. Let's briefly go through the specific steps of compilation, which touch on some compiler theory.


lexical analysis

For example, we have a line of source code in C language as follows:

array[index] = (index+4)*(2+6)

First, the source code is fed into the scanner. The scanner's task is simple: it performs lexical analysis. Using an algorithm similar to a finite state machine, it splits the character sequence of the source code into a series of tokens.
For example, the line above contains 28 non-whitespace characters. After scanning, 16 tokens are generated, as shown in the table.

insert image description here

The tokens generated by lexical analysis generally fall into the following categories: keywords, identifiers, literals (including numbers, strings, etc.), and special symbols (e.g. the plus sign and the equals sign). While recognizing tokens, the scanner does other work as well, such as storing identifiers in the symbol table and storing numeric and string constants in a literal table for use in later steps.

There is a program called lex that performs this lexical scan: it splits the input string into individual tokens according to lexical rules that the user has described beforehand. Because such a program exists, compiler developers do not need to write a separate lexical scanner for each compiler; they only change the lexical rules as needed.

In addition, for languages with a preprocessing step, such as C, macro replacement and file inclusion are generally not considered part of the compiler proper and are handed to an independent preprocessor.
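To make the scanning step concrete, here is a minimal hand-written scanner sketch. It is not how lex or a real C compiler works internally; it only recognizes identifiers, integer literals, and single-character symbols, and the function name count_tokens is made up for this example.

```c
#include <ctype.h>

/* Count tokens in a simple expression: identifiers, integer literals,
   and single-character symbols. Whitespace separates tokens but is
   not a token itself. */
int count_tokens(const char *src)
{
    int count = 0;
    while (*src != '\0') {
        if (isspace((unsigned char)*src)) {
            src++;                              /* skip whitespace */
        } else if (isalpha((unsigned char)*src) || *src == '_') {
            while (isalnum((unsigned char)*src) || *src == '_')
                src++;                          /* identifier or keyword */
            count++;
        } else if (isdigit((unsigned char)*src)) {
            while (isdigit((unsigned char)*src))
                src++;                          /* integer literal */
            count++;
        } else {
            src++;                              /* one-character symbol */
            count++;
        }
    }
    return count;
}
```

Calling count_tokens("array[index] = (index+4)*(2+6)") returns 16, matching the token count given above. Each branch of the loop plays the role of one state in the finite state machine.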

Syntax analysis

Next, the parser (syntax analyzer) analyzes the tokens produced by the scanner and generates a syntax tree.
The whole analysis uses context-free grammars. If you are familiar with context-free grammars and pushdown automata, this will be easy to follow. Otherwise, you can refer to material on the theory of computation, which generally covers them in detail, so I won't repeat it here.
Simply put, the syntax tree generated by the parser is a tree whose nodes are expressions.
An expression statement in C is built from an expression, and a complex statement is a combination of many expressions.
The statement in the example above is a complex statement composed of an assignment expression, an addition expression, a multiplication expression, an array expression, and parenthesized expressions. After passing through the parser, it forms the syntax tree shown in the figure.

insert image description here

We can see from the figure that the entire statement is treated as an assignment expression. The left side of the assignment expression is an array expression, and the right side is a multiplication expression; the array expression is in turn composed of two symbol expressions, and so on.
Symbols and numbers are the smallest expressions. They are not composed of other expressions, so they usually appear as the leaf nodes of the syntax tree.
During syntax analysis, the precedence and meaning of many operators are also determined. For example, multiplication has higher precedence than addition, and a parenthesized expression binds more tightly than multiplication.
In addition, some symbols have more than one meaning. For example, the asterisk * can denote multiplication in C, and it can also denote dereferencing a pointer, so the two uses must be told apart during syntax analysis.
If an expression is malformed, for example mismatched brackets or a missing operator, the compiler reports an error in the syntax analysis phase.

Just as there is lex for lexical analysis, there is also a ready-made tool for syntax analysis called yacc (Yet Another Compiler Compiler). Like lex, it parses the input token sequence according to grammar rules supplied by the user and builds a syntax tree.
For different programming languages, compiler developers only need to change the grammar rules rather than write a new parser for each compiler, which is why it is called a "compiler compiler".
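As a rough illustration of how grammar rules encode precedence, here is a small hand-written recursive descent parser. It is a sketch under simplifying assumptions, not what yacc generates: it handles only non-negative integers, +, * and parentheses, it evaluates the expression directly instead of building a tree, and it does no error reporting. All function names are made up for this example.

```c
#include <ctype.h>

/* Grammar (a simplified subset, chosen for this sketch):
     expr   := term   { '+' term }
     term   := factor { '*' factor }
     factor := number | '(' expr ')'                       */

static const char *cursor;      /* current position in the input */

static void skip_spaces(void)
{
    while (isspace((unsigned char)*cursor))
        cursor++;
}

static int parse_expr(void);

/* factor: the tightest-binding level (numbers and parentheses) */
static int parse_factor(void)
{
    int value = 0;
    skip_spaces();
    if (*cursor == '(') {
        cursor++;                   /* consume '(' */
        value = parse_expr();
        skip_spaces();
        if (*cursor == ')')
            cursor++;               /* consume ')' */
        return value;
    }
    while (isdigit((unsigned char)*cursor))
        value = value * 10 + (*cursor++ - '0');
    return value;
}

/* term: multiplication binds tighter than addition */
static int parse_term(void)
{
    int value = parse_factor();
    skip_spaces();
    while (*cursor == '*') {
        cursor++;
        value *= parse_factor();
        skip_spaces();
    }
    return value;
}

/* expr: addition is the loosest-binding level */
static int parse_expr(void)
{
    int value = parse_term();
    skip_spaces();
    while (*cursor == '+') {
        cursor++;
        value += parse_term();
        skip_spaces();
    }
    return value;
}

int evaluate(const char *text)
{
    cursor = text;
    return parse_expr();
}
```

With this grammar, evaluate("1+4*2") returns 9 and evaluate("(1+4)*(2+6)") returns 40. The precedence comes purely from which rule calls which: expr calls term, and term calls factor.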

Semantic Analysis

The next step is semantic analysis, which is done by the Semantic Analyzer.
Syntax analysis only completes the analysis of the grammatical level of the expression, but it does not know whether the statement is really meaningful.

For example, in C it is meaningless to multiply two pointers, even though such a statement is grammatically legal; another example is checking whether multiplying a pointer by a floating-point number is legal.

The semantics that the compiler can analyze are static semantics (Static Semantic). The so-called static semantics refer to the semantics that can be determined at compile time, and the corresponding dynamic semantics (Dynamic Semantic) are the semantics that can only be determined at runtime.

After the semantic analysis stage, the expressions of the entire syntax tree are marked with types.
If some types need to be converted implicitly, the semantic analysis program will insert corresponding conversion nodes in the syntax tree.
The syntax tree described above becomes the form shown in the figure after the semantic analysis stage

insert image description here
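The type marking and implicit conversion described above can be observed from ordinary C code. The following is a small sketch; the function names scale and product_is_double are made up for the example.

```c
/* n is an int and 1.5 is a double, so the compiler converts n to double
   before the multiplication; the result type is double. */
double scale(int n)
{
    return n * 1.5;
}

/* The type of an expression is known at compile time, so sizeof can
   report it without evaluating anything. */
int product_is_double(void)
{
    return sizeof(2 * 1.5) == sizeof(double);
}

/* Grammatically fine but semantically invalid, so a compiler rejects it:
       int *p, *q;
       p * q;          multiplying two pointers has no meaning in C   */
```

Here scale(2) returns 3.0, and product_is_double() returns 1 because the expression 2 * 1.5 has type double after the implicit conversion.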

Symbol summary

In this step, the global symbols of each source file (such as global variables and functions) are collected.


Assembly

The assembler converts assembly code into instructions the machine can execute, and almost every assembly statement corresponds to one machine instruction. The assembler's job is therefore relatively simple compared with the compiler's: there is no complicated syntax or semantics, and no instruction optimization. It simply translates the statements one by one according to a lookup table of assembly instructions and machine instructions.
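The table-lookup idea can be sketched in a few lines of C. The instruction set below is entirely made up for illustration (real assemblers also handle operands, labels, and encodings that vary in length), and assemble_one is a hypothetical name.

```c
#include <string.h>

/* A made-up instruction set, for illustration only. */
struct opcode_entry {
    const char *mnemonic;
    unsigned char opcode;
};

static const struct opcode_entry opcode_table[] = {
    { "NOP",  0x00 },
    { "LOAD", 0x01 },
    { "ADD",  0x02 },
    { "JMP",  0x03 },
};

/* Translate one mnemonic into its opcode; returns -1 if unknown. */
int assemble_one(const char *mnemonic)
{
    size_t count = sizeof(opcode_table) / sizeof(opcode_table[0]);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(opcode_table[i].mnemonic, mnemonic) == 0)
            return opcode_table[i].opcode;
    }
    return -1;
}
```

In this toy table, assemble_one("ADD") returns 2, and an unknown mnemonic such as "MUL" returns -1.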


The assembly code is translated into binary instructions, which are stored in the object file. At the same time, an address is assigned to each symbol collected from a source file, and a symbol table is generated for each file. The symbol summary in the compilation step exists to serve the symbol table produced by the assembler.


The symbol Add seen in test.c is only a declaration of the Add function, not a definition, so at this point it is impossible to tell whether Add really exists. The address assigned to the Add symbol when the symbol table for test.c is generated is therefore a meaningless placeholder address.
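The difference between the declaration in test.c and the definition in add.c might look like the sketch below. The two files are shown in one block for brevity, and the helper call_add is a made-up stand-in for whatever code in test.c calls Add; the article's actual source files are not shown, so treat the details as assumptions.

```c
/* ---- test.c ---- */
/* Declaration only: tells the compiler the name and type of Add.
   Its real address is unknown here and is filled in by the linker. */
extern int Add(int x, int y);

int call_add(void)
{
    return Add(2, 3);
}

/* ---- add.c ---- */
/* Definition: this is where Add actually gets code and a real address. */
int Add(int x, int y)
{
    return x + y;
}
```

When the two parts are compiled as separate files, test.obj records Add as an unresolved symbol and add.obj provides its address; once linked, call_add() returns 5.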

Link

Merge Segment Table

(The suffix of the object file generated by vs is .obj, and the suffix of the object file generated by gcc is .o)

After assembly, the generated .obj file is divided into several sections. During linking, the corresponding sections of each .obj file are merged according to certain rules, and finally an executable program (with the .exe suffix) is formed.
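As a rough guide to what ends up in which section, consider the sketch below. The section names in the comments are the ones commonly used by GCC on ELF systems; the exact names and placement vary by toolchain and platform, so they are assumptions rather than guarantees. All identifiers are made up for the example.

```c
int initialized_global = 42;        /* typically placed in .data */
int zeroed_global;                  /* typically placed in .bss */
const char greeting[] = "hehe";     /* typically placed in .rodata */

/* Machine code for functions typically goes into .text. */
int read_globals(void)
{
    return initialized_global + zeroed_global;
}
```

When several object files are linked, the linker gathers sections of the same kind from every file and lays them out together in the executable.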


Merging and relocation of symbol tables

A program is not fixed once it is written; it may be modified frequently.

For example, if we insert one or more instructions after the 1st instruction and before the 5th instruction, then the 5th instruction and everything after it move back accordingly, and the lower 4 bits of the original first instruction (which hold the address it refers to) need to be adjusted to match.

Without tool support, we would have to recalculate the target address of every subroutine call or jump by hand. Every time the program is modified these positions must be recalculated, which is tedious, time-consuming, and error-prone. This process of recalculating the address of each target is called relocation.
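The bookkeeping involved can be sketched with a toy machine. The instruction format is an assumption made for this example: each instruction is one byte, the high 4 bits are the opcode, and the low 4 bits are a target address. The opcode value OP_JUMP and the function relocate_jumps are hypothetical.

```c
#include <stddef.h>

#define OP_JUMP 0x1u    /* hypothetical opcode for "jump to address" */

/* Adjust jump targets after `shift` instructions were inserted at
   position `insert_at`. Each instruction is one byte: the high 4 bits
   are the opcode, the low 4 bits are the target address. */
void relocate_jumps(unsigned char *code, size_t count,
                    unsigned insert_at, unsigned shift)
{
    for (size_t i = 0; i < count; i++) {
        unsigned opcode = code[i] >> 4;
        unsigned target = code[i] & 0x0Fu;
        if (opcode == OP_JUMP && target >= insert_at) {
            target = (target + shift) & 0x0Fu;
            code[i] = (unsigned char)((opcode << 4) | target);
        }
    }
}
```

For instance, the byte 0x14 (jump to address 4) becomes 0x16 after two instructions are inserted at position 1. Doing this by hand for every jump in a program is exactly the tedious work that relocation automates.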


Symbol tables are not meaningless. When a function is called, the linker looks up its symbol in the symbol table: if the symbol is found, the call can be resolved; otherwise linking fails.
Symbol table entries come from global variables, functions, and so on. Not every name gets an entry: local variables have none, because a local variable can only be used within its local scope and cannot be used across files.
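The sketch below shows which names in a file are visible to other files. All identifiers are made up for the example. With GNU Binutils, the nm tool can list the symbols recorded in an object file if you want to check this on your own build.

```c
int shared_counter = 0;         /* global: external symbol, usable from other files */

static int file_private = 10;   /* static: cannot be referenced from other files */

int bump(void)                  /* function: external symbol */
{
    int step = 1;               /* local: lives on the stack, nothing for the linker */
    shared_counter += step;
    return shared_counter + file_private;
}
```

Another source file can refer to shared_counter and bump through declarations, but it has no way to name file_private or step.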

After a program is divided into multiple modules, how to combine these modules into a single program is a problem that has to be solved.
The question of how to combine modules comes down to how modules communicate with each other. For a static language such as C/C++, the most common forms of communication between modules are function calls between modules and variable access between modules.
A function call needs the address of the target function, and a variable access needs the address of the target variable, so both come down to one thing: references to symbols across modules.
Communicating through symbols is like a jigsaw puzzle. The module that defines a symbol has a piece sticking out, and the module that references the symbol has a gap of exactly that shape, so the two fit together perfectly. This process of fitting the modules together is linking.
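Variable access between modules follows the same pattern as function calls. In the sketch below, the file names config.c and worker.c and all identifiers are hypothetical, and the two files are shown in one block for brevity.

```c
/* ---- config.c (hypothetical file) ---- */
int g_retry_limit = 3;          /* definition: storage is allocated here */

/* ---- worker.c (hypothetical file) ---- */
extern int g_retry_limit;       /* reference: resolved by the linker */

int retries_left(int used)
{
    return g_retry_limit - used;
}
```

In jigsaw terms, config.c provides the piece g_retry_limit and worker.c has the matching gap; after linking, retries_left(1) returns 2.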

execution environment

The execution of an .exe program can be roughly divided into four steps:

1. The program must first be loaded into memory. In an environment with an operating system, this is generally done by the operating system. In a freestanding (stand-alone) environment, the program is loaded manually, or the executable code is placed in read-only memory.
2. Execution of the program begins, and the main function is called.
3. The program code starts executing. The program uses a runtime stack to store each function's local variables and return address. It can also use static memory; variables stored in static memory retain their values throughout the execution of the program.
4. The program terminates. It may terminate normally when main returns, or it may terminate unexpectedly.
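The difference between the runtime stack and static memory can be seen in a short sketch. The function names count_calls and fresh_each_time are made up for the example.

```c
/* `calls` lives in static memory: initialized once, kept across calls. */
int count_calls(void)
{
    static int calls = 0;
    calls++;
    return calls;
}

/* `temp` lives on the runtime stack: created anew for every call. */
int fresh_each_time(void)
{
    int temp = 0;
    temp++;
    return temp;
}
```

Calling count_calls() twice returns 1 and then 2, because the static variable keeps its value for the whole run of the program. Calling fresh_each_time() twice returns 1 both times, because the local variable is recreated on the stack at each call.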


Origin blog.csdn.net/qq_73478334/article/details/130508050