[Compile, Link, Load (3)] The Compiler: Lexical Analysis, Syntax Analysis, Semantic Analysis, and the Compiler Back End

Content Summary

  • Lexical analysis divides the program's source text into tokens.
  • Syntax analysis identifies the structure of the program and builds an abstract syntax tree that is easy for the computer to process.
  • Semantic analysis eliminates semantic ambiguity and attaches attribute information to this tree.
  • The compiler back end generates assembly code from this tree.

1. Lexical Analysis

Usually, the compiler's first job is lexical analysis. Just as an article is made up of individual words, so is a program, except that its units are not called words but "lexical tokens" (Token in English).

For example, look at the following piece of code. If we want to understand it, what should we do first?
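The original figure is not reproduced here; an illustrative C snippet containing the same kinds of tokens discussed below (the exact code in the figure may differ) would look something like this:

    #include <stdio.h>

    int main(void)
    {
        int age = 45;                          /* keyword, identifier, number literal */
        if (age >= 45)                         /* keyword, comparison operator        */
            printf("age = %d\n", age + 1);     /* identifier, string literal, +       */
        else
            printf("age = %d\n", age - 1);     /* identifier, string literal, -       */
        return 0;
    }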

  • We recognize keywords such as if, else and int, identifiers such as main, printf and age, operators such as +, - and =, symbols such as curly braces, parentheses and semicolons, as well as numeric literals and string literals. All of these are tokens.

  • So how do we write a program that recognizes tokens? In English text, words are usually separated by spaces and punctuation, which makes them easy for readers to understand. In a computer program, however, separating tokens with spaces and punctuation alone is not enough. For example, "age >= 45" should be divided into three tokens, "age", ">=" and "45", yet in the code they may be written right next to each other with no space in between.

  • This is a bit like Chinese, where there is no space between characters, yet we subconsciously split a sentence into the right words. For example, we split the sentence "I learn programming" into "I", "learn" and "programming"; this process is called "word segmentation".

  • In fact, we can distinguish the different tokens by formulating a few rules. Here are some examples.

    • Identifiers such as age: an identifier starts with a letter, may be followed by letters or digits, and ends at the first character that is neither a letter nor a digit.
    • Operators such as >=: when a > character is scanned, it may be a GT (Greater Than) operator, but since GE (Greater Equal) also starts with >, we look one character further ahead; if it is =, the token is GE, otherwise it is GT.
    • Numeric literals such as 45: when a digit is scanned, we keep reading characters as part of the number until a non-digit character is encountered.
  • Finite-State Automaton (FSA, also called a Finite Automaton).
    A finite automaton is a machine with a finite number of states. Take a flush toilet as an example: it has two states, "filling" and "full". Pressing the button to flush switches it to the "filling" state; when the float rises to a certain height the fill valve closes and it switches to the "full" state.

  • The lexical analyzer works in the same way: it scans the character string of the whole program and moves to a different state whenever it meets a different kind of character. For example, while scanning age the lexer is in the "identifier" state, and when it meets a > symbol it switches to the "comparison operator" state. Lexical analysis is exactly this kind of state-transition process; a small hand-written C sketch of the idea follows this list.
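As a minimal sketch (names and output format are purely illustrative, covering only identifiers, integers and the > / >= operators), a hand-written scanner built on this state-transition idea might look like this:

    #include <ctype.h>
    #include <stdio.h>

    int main(void)
    {
        const char *src = "age >= 45";   /* the string to be tokenized */
        const char *p = src;

        while (*p) {
            if (isspace((unsigned char)*p)) {                      /* separators: just skip them   */
                p++;
            } else if (isalpha((unsigned char)*p) || *p == '_') {  /* "identifier" state           */
                const char *start = p;
                while (isalnum((unsigned char)*p) || *p == '_') p++;
                printf("IDENT  %.*s\n", (int)(p - start), start);
            } else if (isdigit((unsigned char)*p)) {               /* "number" state               */
                const char *start = p;
                while (isdigit((unsigned char)*p)) p++;
                printf("NUMBER %.*s\n", (int)(p - start), start);
            } else if (*p == '>') {                                /* '>' may be GT or GE:         */
                if (p[1] == '=') { printf("GE     >=\n"); p += 2; }/* look one character ahead     */
                else             { printf("GT     >\n");  p++;    }
            } else {                                               /* anything else: one-char token */
                printf("OTHER  %c\n", *p);
                p++;
            }
        }
        return 0;
    }

Running it on "age >= 45" prints the three tokens IDENT age, GE >= and NUMBER 45.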

There is a program called lex that implements lexical scanning. It divides the input string into tokens according to lexical rules described by the user. Because such a program exists, compiler developers do not need to write a separate lexical scanner for every compiler; they only need to change the lexical rules as required.
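In lex (or its GNU counterpart flex), those user-described rules are regular expressions paired with C actions. A minimal, illustrative rule file covering the three rules above might look roughly like this (generate the scanner with flex and compile the result with a C compiler):

    %%
    [ \t\n]+                 { /* skip whitespace between tokens */ }
    [A-Za-z_][A-Za-z0-9_]*   { printf("IDENT  %s\n", yytext); }
    [0-9]+                   { printf("NUMBER %s\n", yytext); }
    ">="                     { printf("GE\n"); }
    ">"                      { printf("GT\n"); }
    .                        { printf("OTHER  %s\n", yytext); }
    %%
    int yywrap(void) { return 1; }
    int main(void)   { yylex(); return 0; }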

In addition, for languages with a preprocessing stage, such as C, macro replacement and file inclusion are generally not considered part of the compiler proper; they are handed over to an independent preprocessor.
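For instance (an illustrative snippet; the preprocessor can also be run on its own with a command such as gcc -E file.c):

    #include <stdio.h>      /* replaced by the full contents of stdio.h        */
    #define MAX_AGE 100     /* a macro: every later use of MAX_AGE becomes 100 */

    int limit = MAX_AGE;    /* after preprocessing: int limit = 100;           */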

2. Syntactic Analysis, or Parsing

  • Next, the parser (Grammar Parser) analyzes the tokens produced by the scanner and generates a syntax tree (Syntax Tree).

The whole analysis process uses context-free grammar (Context-free Grammar). If you are familiar with context-free grammars and push-down automata, this should be easy to understand; if not, you can consult materials on the theory of computation, which usually cover them in detail, so I will not repeat them here.

  • Simply put, the syntax tree generated by the parser is a tree whose nodes are expressions (Expression). We know that a statement in C is an expression, and a complex statement is a combination of many expressions. The statement below is a complex statement made up of an assignment expression, addition expressions, a multiplication expression, an array expression and bracket expressions.
array[index] = (index + 4) * (2 + 6)

After passing through the parser, the above code forms the syntax tree shown in Figure 2-3.
(Figure 2-3: the syntax tree of the statement above)
We can see from Figure 2-3 that the entire statement is regarded as an assignment expression; the left-hand side of the assignment expression is an array expression and its right-hand side is a multiplication expression; the array expression in turn consists of two symbol expressions, and so on.
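Based on that description, the tree can be sketched informally as follows (an approximation of the figure, not an exact reproduction):

    assignment expression (=)
        array expression                array[index]
            symbol                      array
            symbol                      index
        multiplication expression (*)   (index + 4) * (2 + 6)
            addition expression (+)     index + 4
                symbol                  index
                number                  4
            addition expression (+)     2 + 6
                number                  2
                number                  6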

  • A program has a well-defined grammatical structure, and syntax analysis is the process of constructing such a tree. A whole program becomes a tree, called the Abstract Syntax Tree (AST). Each node (subtree) of the tree is a grammatical unit, and the rules by which such units are composed are called the "grammar". Each node can itself have subordinate nodes. This nested tree structure matches our intuitive understanding of computer programs: languages always nest one structure inside another, large programs contain subroutines, and subroutines can in turn contain other subroutines.
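As a rough sketch (the field names are illustrative and not taken from any particular compiler), such a tree node could be represented in C like this:

    /* One node of an abstract syntax tree. */
    enum NodeKind { NODE_ASSIGN, NODE_ADD, NODE_MUL, NODE_ARRAY, NODE_SYMBOL, NODE_NUMBER };

    struct AstNode {
        enum NodeKind   kind;    /* which grammatical unit this node represents    */
        struct AstNode *left;    /* left subtree, e.g. the left operand            */
        struct AstNode *right;   /* right subtree, e.g. the right operand          */
        const char     *symbol;  /* identifier name, used when kind == NODE_SYMBOL */
        long            value;   /* literal value, used when kind == NODE_NUMBER   */
    };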

Just as lex exists for lexical analysis, there is also a ready-made tool for syntax analysis called yacc (Yet Another Compiler Compiler). Like lex, it parses the input token sequence according to grammar rules supplied by the user and builds a syntax tree. For a different programming language, compiler developers only need to change the grammar rules rather than write a new parser for every compiler, which is why yacc is also called a "Compiler Compiler".
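A minimal, illustrative yacc grammar fragment for statements of this kind might look as follows (not a complete parser: the token stream would come from a lex-generated scanner, and the semantic actions that actually build tree nodes are omitted):

    %token NUMBER IDENT
    %left '+'
    %left '*'
    %%
    statement : IDENT '[' expr ']' '=' expr   /* array[index] = ...        */
              ;
    expr      : expr '+' expr                 /* addition expression       */
              | expr '*' expr                 /* multiplication expression */
              | '(' expr ')'                  /* bracket expression        */
              | IDENT                         /* symbol                    */
              | NUMBER                        /* number literal            */
              ;
    %%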

3. Semantic Analysis

  • The next step is semantic analysis, which is performed by the Semantic Analyzer. Syntax analysis only verifies that an expression is grammatically well-formed; it does not know whether the statement is actually meaningful.

For example, in C it is meaningless to multiply two pointers, yet such a statement is grammatically legal; likewise, whether a pointer may be multiplied by a floating-point number is a question syntax analysis cannot answer.

  • The semantics a compiler can analyze are static semantics (Static Semantics), that is, semantics that can be determined at compile time; the corresponding dynamic semantics (Dynamic Semantics) are those that can only be determined at run time.

Static semantics usually cover the matching of declarations and types, as well as type conversions. For example, when a floating-point expression is assigned to an integer expression, an implicit floating-point-to-integer conversion is involved, and this step has to be completed during semantic analysis. Likewise, when a floating-point value is assigned to a pointer, the semantic analyzer finds that the types do not match and the compiler reports an error. Dynamic semantics generally refer to semantic problems that only show up at run time; for example, using 0 as a divisor is a run-time semantic error.
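As a small illustrative example (the exact diagnostics depend on the compiler):

    /* The kinds of things static semantic analysis deals with. */
    void example(void)
    {
        int    i;
        float  f = 3.7f;
        int   *p = 0, *q = 0;

        i = f;        /* legal, but an implicit float -> int conversion is inserted  */

        /* p = 1.5;      error: assigning a floating-point value to a pointer        */
        /* i = p * q;    error: syntactically this parses fine, but multiplying two
                         pointers is semantically meaningless and is rejected        */

        (void)i; (void)p; (void)q;
    }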

  • After the semantic analysis stage, the expressions of the entire syntax tree are marked with types. If some types need to be converted implicitly, the semantic analysis program will insert corresponding conversion nodes in the syntax tree.

The syntax tree described above becomes the form shown in Figure 2-4 after the semantic analysis stage.
(Figure 2-4: the syntax tree after semantic analysis, with a type marked on every expression)

As you can see, each expression (both symbols and numbers) is typed. Almost all expressions in our example are of integer type, so no conversion is required, and the whole analysis process goes smoothly. The semantic analyzer also updates the symbol types in the symbol table.

4. Implementation of the Compiler Back End

The task of the compiler back end is to generate the target code (assembly code); the assembler then turns that into machine code and produces a file called the object file; finally, the linker can be used to combine object files into an executable file or a library file.
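With the GNU toolchain, for instance, these steps can be driven separately (one common arrangement, shown only as an illustration):

    gcc -S hello.c -o hello.s    # compiler:  C source  -> assembly code
    as  hello.s   -o hello.o     # assembler: assembly  -> object file (machine code)
    gcc hello.o   -o hello       # linker (gcc invokes ld): object files -> executable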

Another approach is to generate bytecode so that the code stays portable. In some scenarios we cannot know in advance which machine the program will run on, so there is no way to compile it to native code ahead of time.

The source program is first compiled into bytecode, which is then compiled into target code on the fly on the specific platform where it runs.
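Java is a familiar example of this approach: javac compiles source files into platform-independent .class bytecode, and the Java virtual machine then interprets that bytecode or JIT-compiles it into native code for whatever machine it happens to be running on.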

Intermediate code allows the compiler to be divided into a front end and a back end. The compiler front end is responsible for producing machine-independent intermediate code, and the compiler back end converts the intermediate code into target machine code. In this way, a cross-platform compiler can use the same front end for every platform and a separate back end for each target machine.
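Clang/LLVM is organized along these lines, for example: the Clang front end lowers C and C++ into LLVM's intermediate representation, and separate LLVM back ends translate that IR into x86, ARM and other machine code.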
Object Code Generation and Optimization
The moment the source-level optimizer produces intermediate code marks the point after which everything belongs to the compiler back end. The compiler back end mainly consists of the code generator (Code Generator) and the target code optimizer (Target Code Optimizer).

References
1. Programmer's Self-Cultivation: Linking, Loading and Libraries
2. Other materials

Origin: blog.csdn.net/junxuezheng/article/details/130141905