Compiler Principles (II): Basic Concepts of Lexical Analysis, Syntax Analysis, Semantic Analysis, and Intermediate Code Generation

1. Lexical Analysis

In lexical analysis, the source code is fed to a component called the scanner, whose job is to perform the lexical analysis. It uses an algorithm similar to a finite state machine to split the source code into a sequence of tokens. For example, the line array[index] = (index + 4) * (2 + 3) is scanned into the following tokens:

Token    Type
array    Identifier
[        Left square bracket
index    Identifier
]        Right square bracket
=        Assignment
(        Left parenthesis
index    Identifier
+        Plus sign
4        Number
)        Right parenthesis
*        Multiplication sign
(        Left parenthesis
2        Number
+        Plus sign
3        Number
)        Right parenthesis
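
To make the finite-state-machine idea concrete, here is a minimal sketch of a hand-written scanner. The function name scan, the state names, and the SYMBOLS table are all made up for this illustration; a real scanner handles many more token kinds (keywords, strings, multi-character operators).

```python
# A sketch only: `scan`, the state names, and SYMBOLS are hypothetical.
SYMBOLS = {
    "[": "Left square bracket",
    "]": "Right square bracket",
    "(": "Left parenthesis",
    ")": "Right parenthesis",
    "=": "Assignment",
    "+": "Plus sign",
    "*": "Multiplication sign",
}


def scan(source):
    """Split source into (text, type) tokens with a tiny state machine."""
    tokens = []
    state = "start"   # one of: start, identifier, number
    lexeme = ""
    for ch in source + " ":   # the trailing blank flushes the last token
        if state == "identifier":
            if ch.isalnum() or ch == "_":
                lexeme += ch
                continue
            tokens.append((lexeme, "Identifier"))
            state, lexeme = "start", ""
        elif state == "number":
            if ch.isdigit():
                lexeme += ch
                continue
            tokens.append((lexeme, "Number"))
            state, lexeme = "start", ""
        # We are in the start state here: decide what this character begins.
        if ch.isspace():
            continue
        if ch.isalpha() or ch == "_":
            state, lexeme = "identifier", ch
        elif ch.isdigit():
            state, lexeme = "number", ch
        elif ch in SYMBOLS:
            tokens.append((ch, SYMBOLS[ch]))
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return tokens


for text, kind in scan("array[index] = (index + 4) * (2 + 3)"):
    print(text, kind)
```

Each character either extends the current lexeme or ends it and sends the machine back to the start state, which is why the scanner never needs to look more than one character ahead for this small token set.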

In general, tokens fall into the following categories: keywords, identifiers, literals (numbers, strings, etc.), and special symbols.

Token type    Examples                                                      Category code
Keyword       if, else, for, ...                                            One code per word
Identifier    Variable names, array names, ...                              One code shared by all words
Constant      Integers, floating-point numbers, characters, ...            One code per type
Operator      Arithmetic (+ - * / %), relational (> <=), logical (& | ~)    One code per word
Delimiter     ; ( ) [ ] { }                                                 One code per word

While recognizing these tokens, the scanner also stores identifiers in the symbol table and numeric and string constants in the literal table, for use in later steps. For C, preprocessing tasks such as macro substitution and file inclusion are not handled by the compiler itself; they are left to a separate preprocessor.
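
As a rough illustration of that bookkeeping, the sketch below collects identifiers and literals into two separate tables. The function name collect_tables and the use of plain lists are assumptions made for this example; it uses a regular expression instead of a state machine to stay short, and it does not tell keywords apart from identifiers.

```python
import re


def collect_tables(source):
    """Return (symbol_table, literal_table), each in order of first use."""
    symbol_table, literal_table = [], []
    # Identifiers, integer literals, or double-quoted strings (no escapes).
    for text in re.findall(r'[A-Za-z_]\w*|\d+|"[^"]*"', source):
        if text[0].isalpha() or text[0] == "_":
            table = symbol_table
        else:
            table = literal_table
        if text not in table:   # store each name or constant only once
            table.append(text)
    return symbol_table, literal_table


symbols, literals = collect_tables("array[index] = (index + 4) * (2 + 3)")
print(symbols)    # ['array', 'index']
print(literals)   # ['4', '2', '3']
```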

2. Syntax Analysis

In syntax analysis, the parser analyzes the tokens produced by the scanner and builds a syntax tree. The analysis is based on a context-free grammar, and the resulting syntax tree is a tree whose nodes are expressions, as follows:
[Figure: syntax tree of array[index] = (index + 4) * (2 + 3)]
The left branch of the tree is array[index] and the right branch is (index + 4) * (2 + 3). Each of these branches can be expanded further, and together they form the syntax tree.
The syntax tree also shows that the meaning and precedence of the operators are settled at this stage. A symbol with several meanings, such as *, which can be either multiplication or a pointer dereference, has to be disambiguated during parsing. If an expression is not legal, the parser reports a syntax error.
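
One common way to build such a tree by hand is a recursive-descent parser. The sketch below is a minimal version under a few assumptions: it only knows +, *, parentheses, subscripts, and a single assignment, it does not handle * as a dereference, and the function name parse_statement and the tuple-based node format are invented for this example.

```python
import re


def parse_statement(source):
    """Parse `name = expr` or `name[expr] = expr` into nested tuples."""
    # Characters outside this small token set are silently skipped.
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[\[\]()=+*]", source)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected or 'a token'}, got {tok!r}")
        pos += 1
        return tok

    def factor():
        tok = take()
        if tok == "(":
            node = expr()
            take(")")
            return node
        if tok.isdigit():
            return ("num", int(tok))
        if tok[0].isalpha() or tok[0] == "_":
            if peek() == "[":   # subscript: name[expr]
                take("[")
                index = expr()
                take("]")
                return ("[]", ("id", tok), index)
            return ("id", tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    def term():   # '*' binds tighter than '+', so it sits lower in the grammar
        node = factor()
        while peek() == "*":
            take("*")
            node = ("*", node, factor())
        return node

    def expr():
        node = term()
        while peek() == "+":
            take("+")
            node = ("+", node, term())
        return node

    target = factor()
    if target[0] not in ("id", "[]"):
        raise SyntaxError("left side of '=' must be a name or a subscript")
    take("=")
    tree = ("=", target, expr())
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse_statement("array[index] = (index + 4) * (2 + 3)"))
```

Operator precedence falls out of the grammar's shape: because term is called from expr, every multiplication ends up deeper in the tree than the addition around it, unless parentheses say otherwise.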

3. Semantic Analysis

Semantic analysis is carried out by the semantic analyzer. The parser only checks whether the code is grammatically correct; it does not care whether the code means anything. In C, for example, multiplying two pointers is meaningless, yet it is perfectly legal at the syntax level. What a compiler can analyze is static semantics, that is, semantics that can be determined at compile time. Dynamic semantics, by contrast, can only be determined at run time.

Static semantics usually covers matching declarations against types and type conversion. For example, when a floating-point value is assigned to an integer, an implicit floating-point-to-integer conversion takes place. When a floating-point value is assigned to a pointer, however, the statement passes at the syntax level, but the semantic analyzer finds that the types do not match and reports an error.
After semantic analysis, every expression in the syntax tree is annotated with a type, as follows:
[Figure: syntax tree with each expression annotated with its type]
Here all the basic expressions are integers, so no type conversion is needed. Where a conversion is required, the semantic analyzer inserts a conversion node into the syntax tree.
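
The sketch below shows one possible shape for this step: it walks a tuple-based tree, attaches a type to every node, inserts conversion nodes where int and float meet, and rejects pointer mismatches. The function name annotate, the node format, the type names, and the types mapping are all assumptions made for this example, not how any particular compiler does it.

```python
def annotate(node, types):
    """Return (typed_node, type) for a tuple-based expression tree.

    `types` maps variable names to "int", "float", or "pointer"; it is a
    hypothetical stand-in for the declarations a symbol table would hold.
    """
    kind = node[0]
    if kind == "num":
        t = "float" if isinstance(node[1], float) else "int"
        return (kind, node[1], t), t
    if kind == "id":
        t = types[node[1]]
        return (kind, node[1], t), t
    if kind in ("+", "*"):
        left, lt = annotate(node[1], types)
        right, rt = annotate(node[2], types)
        if lt not in ("int", "float") or rt not in ("int", "float"):
            raise TypeError(f"invalid operands to {kind}: {lt} and {rt}")
        t = "float" if "float" in (lt, rt) else "int"
        # Insert a conversion node above any int operand of a float operation.
        if lt != t:
            left = ("int_to_float", left, t)
        if rt != t:
            right = ("int_to_float", right, t)
        return (kind, left, right, t), t
    if kind == "=":
        target, tt = annotate(node[1], types)
        value, vt = annotate(node[2], types)
        if tt != vt:
            if "pointer" in (tt, vt):
                raise TypeError(f"type mismatch: cannot assign {vt} to {tt}")
            value = (f"{vt}_to_{tt}", value, tt)   # implicit conversion
        return (kind, target, value, tt), tt
    raise TypeError(f"unsupported node {kind!r}")


tree = ("=", ("id", "x"), ("+", ("id", "y"), ("num", 2)))
typed, _ = annotate(tree, {"x": "int", "y": "float"})
print(typed)
```

For x = y + 2 with an int x and a float y, the literal 2 gets wrapped in an int_to_float node and the sum gets wrapped in a float_to_int node before the assignment, which is the "hidden" conversion described above made explicit in the tree.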

4. Intermediate Language Generation

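Modern compilers perform many optimizations, and some of them happen at the source-code level. For example, the (2 + 3) in the expression above can be evaluated at compile time, so after optimization it simply becomes the number 5.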
[Figure: syntax tree after optimization, with (2 + 3) replaced by 5]
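In practice it is hard to optimize directly on the syntax tree, so the source-level optimizer usually converts the whole syntax tree into intermediate code. Intermediate code is independent of the target machine and the runtime environment: it contains no data sizes, variable addresses, or register names. Common forms of intermediate code include three-address code and P-code. Take x = y op z as an example: this three-address instruction applies the operation op to the variables y and z and assigns the result to x, as in x = y + z. The commonly used three-address forms are listed below.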

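Instruction type                  Form
Assignment                        x = y op z, x = op y
Copy                              x = y
Conditional jump                  if x op y goto z
Unconditional jump                goto z
Parameter passing                 param z
Procedure call                    call p, n
Procedure return                  return x
Array reference                   x = y[i]
Array assignment                  y[i] = x
Address and pointer operations    x = &y, x = *y, *x = y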

Translating the syntax tree of the example above into three-address code gives:

t1 = 2 + 3
t2 = index + 4
t3 = t1 * t2
array[index] = t3
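
A translation like this can be sketched as a post-order walk over the syntax tree that hands out a fresh temporary for every operator node. The function name to_three_address and the tuple-based tree format are assumptions for this illustration. Because it visits the left operand first, its temporaries are numbered differently from the listing above, but the instructions are equivalent.

```python
def to_three_address(tree):
    """Flatten a tuple-based assignment tree into three-address code lines."""
    code = []
    counter = 0

    def new_temp(rhs):
        """Emit `tN = rhs` and return the name of the new temporary."""
        nonlocal counter
        counter += 1
        temp = f"t{counter}"
        code.append(f"{temp} = {rhs}")
        return temp

    def subscript(node):
        return f"{operand(node[1])}[{operand(node[2])}]"

    def operand(node):
        """Emit code for node and return the name that holds its value."""
        kind = node[0]
        if kind in ("num", "id"):
            return str(node[1])
        if kind == "[]":
            return new_temp(subscript(node))   # array reference: x = y[i]
        # Binary operator: operands are emitted first, then the operation.
        return new_temp(f"{operand(node[1])} {kind} {operand(node[2])}")

    _, target, value = tree
    result = operand(value)
    dest = subscript(target) if target[0] == "[]" else operand(target)
    code.append(f"{dest} = {result}")
    return code


tree = ("=",
        ("[]", ("id", "array"), ("id", "index")),
        ("*", ("+", ("id", "index"), ("num", 4)),
              ("+", ("num", 2), ("num", 3))))
for line in to_three_address(tree):
    print(line)
# t1 = index + 4
# t2 = 2 + 3
# t3 = t1 * t2
# array[index] = t3
```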

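Optimizing on top of the three-address code, the compiler computes the result of 2 + 3, which gives t1 = 5, and then replaces every use of t1 with 5. The temporary t3 can also be dropped by reusing t2. The three-address code then becomes:

t2 = index + 4
t2 = t2 * 5
array[index] = t2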
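
This optimization is usually called constant folding combined with constant propagation. Below is a minimal sketch of it, assuming each instruction is a (dest, left, op, right) tuple and that every folded destination is a compiler temporary assigned exactly once; the function name fold_constants and this tuple layout are invented for the example. It does not reuse temporaries, so its result keeps t3 where the listing above reuses t2.

```python
def fold_constants(code):
    """Fold constant operations and propagate their results.

    `code` is a list of (dest, left, op, right) tuples; op and right are
    None for plain copies. Assumes folded dests are single-assignment temps.
    """
    known = {}   # temporaries proven to hold a constant value
    ops = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
    result = []
    for dest, left, op, right in code:
        # Replace operands that are known constants.
        left = known.get(left, left)
        right = known.get(right, right)
        if op in ops and isinstance(left, int) and isinstance(right, int):
            known[dest] = ops[op](left, right)   # computed now, so drop it
            continue
        result.append((dest, left, op, right))
    return result


code = [
    ("t1", 2, "+", 3),
    ("t2", "index", "+", 4),
    ("t3", "t1", "*", "t2"),
    ("array[index]", "t3", None, None),
]
for instruction in fold_constants(code):
    print(instruction)
# ('t2', 'index', '+', 4)
# ('t3', 5, '*', 't2')
# ('array[index]', 't3', None, None)
```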

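Intermediate code splits the compiler into a front end and a back end (not the same sense of front end and back end used in other contexts). The front end generates machine-independent intermediate code, and the back end turns the intermediate code into target code. For a cross-platform compiler, this means the same front end can be used on every platform, with a separate back end developed for each target machine.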

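References: 《编译原理》 (Compiler Principles) and 《程序员的自我修养:链接、装载与库》 (A Programmer's Self-Cultivation: Linking, Loading, and Libraries)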

Origin blog.csdn.net/weixin_44415928/article/details/104352582