Compiler theory illustrated by the design of a calculator

(Reposted from [http://www.cnblogs.com/naturemickey]; copyright belongs to the original author, all rights reserved)

 

First, let's look at what the calculator does:

CALC> set a = 1; b = 2
CALC> set c = 3
CALC> calc (10 + pow(b, c)) * sqrt(4) - 1
35.0
CALC> exit

As shown above, the calculator's functionality is very simple:

  1. The set command sets variables in the context.
  2. The calc command evaluates an expression.
  3. The exit command exits the calculator.
Our compilation effort focuses on analyzing and evaluating the expression that follows the calc command; the other parts can be handled very simply (for example, the set command needs no compilation machinery at all: split on semicolons to get the individual assignments, then split each assignment on the equals sign and set the variable in the context).
In the example shown above, the only part we process with compiler technology is (10 + pow(b, c)) * sqrt(4) - 1; the rest of the text gets only trivial handling, as sketched below.
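As an illustration of how little machinery the non-calc commands need, here is a minimal sketch of the command loop and of the simple set handling just described (my own code; names such as CalcShell and handleSet are not from the article). The calc branch is left as a stub for the compiler pipeline built up in the rest of the article.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Scanner;

    // Minimal command loop: only "set" and "exit" need handling; "calc" is where the compiler goes.
    public class CalcShell {
        private static final Map<String, Double> context = new HashMap<>();

        public static void main(String[] args) {
            Scanner in = new Scanner(System.in);
            while (true) {
                System.out.print("CALC> ");
                String line = in.nextLine().trim();
                if (line.equals("exit")) {
                    break;
                } else if (line.startsWith("set ")) {
                    handleSet(line.substring(4));
                } else if (line.startsWith("calc ")) {
                    // The expression after "calc" is what the compiler pipeline below will handle.
                    System.out.println("not implemented yet: " + line.substring(5));
                }
            }
        }

        // "a = 1; b = 2" -> split on ';' into assignments, then on '=' to fill the context.
        private static void handleSet(String assignments) {
            for (String assignment : assignments.split(";")) {
                String[] parts = assignment.split("=");
                context.put(parts[0].trim(), Double.parseDouble(parts[1].trim()));
            }
        }
    }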

Small though it is, this calculator contains the most essential parts of compiler technology. We are only implementing a calculator, but the techniques used are enough to build a simple interpreter for a scripting language.
The calculator is divided into the following parts:
  1. Lexical analysis: break the text of the expression into a list of lexical elements (tokenList).
  2. Parsing: parse the tokenList into a syntax tree (syntaxTree).
  3. Semantic analysis: convert the syntax tree into assembly code (ASM).
  4. Assembler: translate the assembly code into machine code (bytecode).
  5. Virtual machine: execute the bytecode.
A typical treatment of compilation does not include the "assembler" and "virtual machine" parts; I include them here for two reasons:
  1. A compiler usually generates machine code directly from intermediate code rather than producing assembly first. The reason I generate assembly code here is that it is very readable when debugging; directly generated target code would be very hard to read with the naked eye.
  2. The reason for writing my own virtual machine: existing machines (physical machines, virtual machines and emulators) have very rich instruction sets, yet none seems to offer a direct instruction for "power" or "square root". With a self-implemented virtual machine the instructions can be designed freely, which reduces the complexity of the whole pipeline.
Since the assembler and the virtual machine are not part of compiler theory, their implementation details are not described below. But because the target code emitted by the calculator's compiler is assembly code, the assembly instructions do need to be explained (below, this assembly language is referred to as ASM).

Introduction to the ASM instructions

Mnemonic   Operand   Description
store      number    Push the number onto the top of the stack
add        (none)    Pop two numbers from the stack, add them, and push the result back onto the stack
sub        (none)    Pop a number from the stack as the minuend, pop another as the subtrahend, and push the difference
mul        (none)    Pop two numbers from the stack, multiply them, and push the result
div        (none)    Pop a number from the stack as the numerator, pop another as the denominator, and push the quotient
pow        (none)    Pop a number from the stack as the base, pop another as the exponent, and push the power
sqrt       (none)    Pop a number from the stack and push its square root
print      (none)    Print the number at the top of the stack to the console
This virtual machine is stack-based: every arithmetic instruction takes its operands from the stack, and the store instruction pushes data onto the top of the stack.
The print instruction prints the data currently at the top of the stack. The assembly code we compile must therefore compute the correct result and leave it exactly on top of the stack when it finishes, so that one final print instruction shows the result on the console.
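To make the instruction set concrete, here is a minimal sketch of such a stack-based virtual machine (my own illustration, not the article's code). It skips the assembler step and interprets the ASM text directly, one instruction per string, which is enough to show the semantics; running it on the code for 1 - 2 * 3 (the first example below) prints -5.0.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;

    // A minimal stack machine for the ASM described above (sketch only).
    public class StackVm {
        public static void run(List<String> asm) {
            Deque<Double> stack = new ArrayDeque<>();
            for (String line : asm) {
                String[] parts = line.trim().split("\\s+");
                switch (parts[0]) {
                    case "store": stack.push(Double.parseDouble(parts[1])); break;
                    case "add":   stack.push(stack.pop() + stack.pop()); break;
                    case "sub":   stack.push(stack.pop() - stack.pop()); break; // first popped is the minuend
                    case "mul":   stack.push(stack.pop() * stack.pop()); break;
                    case "div":   stack.push(stack.pop() / stack.pop()); break; // first popped is the numerator
                    case "pow": { double base = stack.pop(); stack.push(Math.pow(base, stack.pop())); break; }
                    case "sqrt":  stack.push(Math.sqrt(stack.pop())); break;
                    case "print": System.out.println(stack.peek()); break;
                    default: throw new IllegalArgumentException("unknown instruction: " + line);
                }
            }
        }

        public static void main(String[] args) {
            run(List.of("store 3", "store 2", "mul", "store 1", "sub", "print")); // 1 - 2 * 3 => -5.0
        }
    }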

ASM examples:
Example 1: to compute 1 - 2 * 3, the assembly code we write is as follows (the line numbers are there to make the explanation below easier to follow; they are not part of the code):
 
  1. store 3
  2. store 2
  3. mul
  4. store 1
  5. sub
  6. print
This code is explained as follows:
  1. The first two lines push two numbers onto the stack: 3 first, then 2, so the number on top of the stack is 2 and the second number is 3.
  2. In line 3, mul pops the two numbers (2 and 3) from the stack, multiplies them, and pushes the result (6) back; the stack now holds only the number 6.
  3. Line 4 pushes the number 1; the top of the stack is now 1 and the second number is 6.
  4. In line 5, the sub instruction pops two numbers: the first, 1, is the minuend and the second, 6, is the subtrahend, i.e. it computes 1 - 6 and pushes the result; the stack now holds only -5.
  5. The print instruction on the last line does not write to the stack; it only reads the number on top and prints it.
We use two operations here, mul and sub, both binary. Because I designed the instructions so that the number popped first is the first operand, the second operand has to be pushed first — which is why the code pushes 3 first, then 2, and 1 last.

Example 2: to compute (10 + pow(2, 3)) * sqrt(4) - 1, the assembly code we write is as follows (again, the line numbers are only there for the explanation below; they are not part of the code):
 
  1. store 1
  2. store 4
  3. sqrt
  4. store 3
  5. store 2
  6. pow
  7. store 10
  8. add
  9. mul
  10. sub
  11. print
The code is explained as follows:
  1. This code is a little more complicated, but with the experience of the previous listing we can see that all the operands are stored in right-to-left order, so the sequence of store instructions is fixed; the key question is where the operation instructions should be placed.
  2. The operation instructions also follow a rule: as soon as the data on top of the stack is exactly the operands of some operation, that operation's instruction is emitted. For example:
    • After store 1, no operation has its operands on the stack yet.
    • After store 4, the operand of sqrt is on the stack, so the sqrt instruction comes next.
    • After store 3 and store 2, pow can be computed.
    • After store 10, the addition can be computed, so the add instruction is emitted.
    • Once add has finished, together with the sqrt computed earlier, all the operands of the multiplication are on the stack, so the mul instruction follows.
    • Finally the subtraction can be computed, so the sub instruction is emitted.
    • When all calculations are complete, print outputs the result.
What I have been calling a "rule" in this example is really the "postfix expression" (reverse Polish) form.
We usually write arithmetic expressions in "infix" form, with the operator between its operands: in 1 + 2 the + sits between the 1 and the 2; its postfix form is 1 2 +.
Because the operand order in my design is the reverse of the "normal" one, for this assembler the postfix form of 1 + 2 is actually 2 1 +.
Following this rule you could implement the calculator quite simply: run lexical analysis to get the tokenList, then generate assembly code directly from it according to the postfix pattern. But since the purpose of this calculator is to illustrate compilation, we will still go through the stages one by one.

Lexical analysis
The purpose of the lexical analysis stage is to break the text into a list of lexical elements. Take (10 + pow(2, 3)) * sqrt(4) - 1 as an example; after lexical analysis it is decomposed into the following lexical elements:
( 10 + pow ( 2 , 3 ) ) * sqrt ( 4 ) - 1



All we are doing here is text processing: before processing, what we hold is a string, a sequence of characters; after processing, it has been decomposed into several words according to certain "rules".
There are many possible algorithms, and creative programmers will come up with all sorts of ways to handle this word-splitting problem.
In compiler theory, the common approach is to implement it with a state transition diagram, shown below.

In the figure, each "oval" is a state. Between states there are directed edges; an edge represents a path from one state to another, and the label on the edge is the input that takes the earlier state to the later one (for brevity, 0-9 stands for the ten digits 0 through 9, a-z for the twenty-six letters a through z, and so on). For example, starting from the START state and inputting the digit 1, we reach the INT state.
The blue states are end (accepting) states: if, after several inputs starting from START, we reach an end state, the characters entered so far can be combined into a legal lexical element. One extra point is needed here: the matching is greedy, that is, we always try to consume as many characters as possible.
After a legal lexical element has been recognized, the state goes back to START to recognize the next element, until no new elements remain.

In compiler theory this state transition diagram has a proper name: a "deterministic finite automaton", abbreviated DFA. "Deterministic" means that from each state a given input character determines exactly one path to exactly one other state; "finite" means the number of states is finite. The state machine of a complex language has orders of magnitude more states, but for our calculator these few states are enough.
A DFA is usually generated by a tool from a regular-expression description rather than built directly by hand, but the DFA for our calculator is so small that building it by hand is perfectly convenient, so we do not need a tool. Besides, if I did use a tool in this article, it would not be an existing tool but one implemented from scratch.

Here is an example. Given the expression 12.3 + abc, let me describe how the DFA runs:
  1. Define a variable s to indicate the state the machine is currently in (initially s = START).
  2. Input the first character, 1. The START state accepts this input and reaches the INT state, so s is assigned INT. INT is blue, which means the input so far can already be recognized as a legal lexical element, but because our rule is greedy matching we keep checking whether more characters can be matched.
  3. Input the second character, 2. The INT state accepts this input and reaches INT again (following the loop back to itself), so s stays INT.
  4. Input the third character, the dot. The INT state accepts it and reaches the NUMBER state, so s is assigned NUMBER; input the fourth character, 3, and NUMBER loops back to NUMBER.
  5. Now we look ahead at the next character, +. The NUMBER state cannot accept it, and since NUMBER is blue, the current state can recognize a legal lexical element: everything input since START, namely 12.3, is our first lexical element.
  6. After successfully recognizing a lexical element, s goes back to START and we continue with the input +. From START this reaches the ADD state; ADD does not accept any further input, and it is blue, so we recognize the second lexical element, +.
  7. After recognizing the second lexical element, s returns to START again and we input a. From START there is an edge to the ID state, so s is assigned ID.
  8. The situation is now similar to recognizing NUMBER: the current state is an end state, but by the greedy principle we keep checking whether new input can still be matched.
  9. b and c keep looping back into the ID state — I won't spell it out — and so we recognize the last lexical element, abc.

Any method that correctly recognizes arithmetic expressions will do, but the procedure described above has a few points worth special attention:

  1. How to recognize errors: every language specification I have seen describes how to compile the language correctly, but none describes how to go wrong. So the best we can do is treat the rules as the only legal paths: as soon as we hit a dead end, something is wrong, an error should be reported, and the error message should be as detailed as possible.
  2. Recognizing blanks: I did not draw the part of the DFA that recognizes whitespace. For a calculator whitespace is completely useless, so I simply skip such input directly in the code; the states never see blanks, and after a lexical element is recognized any surrounding whitespace is stripped. For some languages whitespace is significant and must be recognized as a lexical element; it cannot be ignored.
  3. Representing lexical elements: usually we use a Token type to represent a lexical element. A Token has two properties, one for the Token's type and one for its content; only numbers and identifiers need the content property, since for every other type all Tokens of that type look the same and there is nothing to save (e.g. the content of ADD is always +). A minimal sketch of such a Token class follows this list.
  4. About identifiers: the ID state in the DFA recognizes identifiers, which here include both user-defined variable names and function names. When designing the DFA we can either give ordinary identifiers and reserved words different states or let them share one state. Our design uses a single ID state for all identifiers; after recognizing an ID we check whether it is a reserved word and set a different type on the returned Token object accordingly.
  5. About INT and NUMBER: this calculator does all its arithmetic with the double type. Although the lexer can recognize INT, our Token types define only one numeric type, NUMBER; words recognized by either the INT state or the NUMBER state are returned as Tokens of type NUMBER.
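Here is a minimal sketch of the Token representation described in point 3 (my own illustration). It uses a single IDENTIFIER type for all identifiers; the article later splits identifiers into FUN and VAR once reserved words have been checked.

    // A lexical element: a type plus, for numbers and identifiers, the text content.
    public class Token {
        public enum Type { NUMBER, IDENTIFIER, ADD, SUB, MUL, DIV, LBT, RBT, COMMA }

        public final Type type;
        public final String content; // only meaningful for NUMBER and IDENTIFIER

        public Token(Type type, String content) {
            this.type = type;
            this.content = content;
        }

        public Token(Type type) {
            this(type, null);
        }

        @Override
        public String toString() {
            return content == null ? type.name() : type + "(" + content + ")";
        }

        public static void main(String[] args) {
            System.out.println(new Token(Type.NUMBER, "12.3")); // NUMBER(12.3)
            System.out.println(new Token(Type.ADD));            // ADD
        }
    }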

The logic above rests throughout on the greedy principle, but in some languages there are lexical cases where it does not apply. For example, C++ has a ">>" operator (which, as in Java, means right shift), yet the C++11 standard lets you write code like std::vector<std::vector<int>>, and JDK 5 and above let you write List<List<Integer>>; matching ">>" greedily there would be wrong. Context is needed to decide whether ">" should be recognized on its own or as part of ">>", and judging context is the business of parsing. For lexical structure this complex there is no single uniform lexing algorithm; such cases can only be handled together with more advanced parsing techniques.

 

All that remains now is to write code.

Posting the code here would make the article far too long, so I will only describe the ideas for implementing the DFA:

  1. Idea one: describe it statically, directly in code, writing out each path of the state machine in the style IF s = START AND c = '1' THEN s = INT ... ELSIF ..., and then run the input through it.
  2. Idea two: table-driven. List a table like the one below from which we can look up which new state is reached from which state on which input — the row headings on the left are the current state, the column headings are the input, and the cell is the state reached along that edge.
  3. Idea three: combine the first two. First write a code-generation tool that reads the table of "idea two" and generates the static code of "idea one".
A sketch of the table-driven approach follows the table below.

 

          [0-9]   .       [_a-zA-Z]   +     -     *     /     (     )     ,
START     INT     POINT   ID          ADD   SUB   MUL   DIV   LBT   RBT   COMMA
INT       INT     NUMBER
POINT     NUMBER
NUMBER    NUMBER
ID        ID              ID
ADD
SUB
MUL
DIV
LBT
RBT
COMMA
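Below is a sketch of the table-driven approach ("idea two"), with the table above written out as data and the greedy matching loop around it. This is my own illustration: class and method names such as TableLexer and lex are not from the article, and it returns raw lexemes rather than Token objects to keep the sketch short.

    import java.util.ArrayList;
    import java.util.EnumMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of a table-driven DFA: the transition table above, encoded as data.
    public class TableLexer {
        enum State { START, INT, POINT, NUMBER, ID, ADD, SUB, MUL, DIV, LBT, RBT, COMMA }

        // Character classes, one per column heading of the table.
        private static int charClass(char c) {
            if (Character.isDigit(c)) return 0;                  // [0-9]
            if (c == '.') return 1;                              // .
            if (c == '_' || Character.isLetter(c)) return 2;     // [_a-zA-Z]
            return switch (c) {
                case '+' -> 3; case '-' -> 4; case '*' -> 5; case '/' -> 6;
                case '(' -> 7; case ')' -> 8; case ',' -> 9;
                default -> -1;
            };
        }

        // One row per state that has outgoing edges; null cells mean "no edge".
        private static final Map<State, State[]> TRANSITION = new EnumMap<>(State.class);
        static {
            TRANSITION.put(State.START,
                new State[]{State.INT, State.POINT, State.ID, State.ADD, State.SUB,
                            State.MUL, State.DIV, State.LBT, State.RBT, State.COMMA});
            TRANSITION.put(State.INT,    new State[]{State.INT, State.NUMBER, null, null, null, null, null, null, null, null});
            TRANSITION.put(State.POINT,  new State[]{State.NUMBER, null, null, null, null, null, null, null, null, null});
            TRANSITION.put(State.NUMBER, new State[]{State.NUMBER, null, null, null, null, null, null, null, null, null});
            TRANSITION.put(State.ID,     new State[]{State.ID, null, State.ID, null, null, null, null, null, null, null});
            // ADD through COMMA accept no further input, so they have no rows.
        }

        private static boolean isAccepting(State s) {
            return s != State.START && s != State.POINT;
        }

        // Greedy matching: keep following edges while one exists, then emit one lexeme.
        public static List<String> lex(String text) {
            List<String> lexemes = new ArrayList<>();
            int i = 0;
            while (i < text.length()) {
                if (Character.isWhitespace(text.charAt(i))) { i++; continue; } // blanks are simply skipped
                State state = State.START;
                int start = i;
                while (i < text.length()) {
                    int cc = charClass(text.charAt(i));
                    State[] row = TRANSITION.get(state);
                    State next = (cc >= 0 && row != null) ? row[cc] : null;
                    if (next == null) break;                     // dead end: stop and see what we have
                    state = next;
                    i++;
                }
                if (!isAccepting(state)) throw new IllegalArgumentException("lexical error at position " + start);
                lexemes.add(text.substring(start, i));
            }
            return lexemes;
        }

        public static void main(String[] args) {
            System.out.println(lex("(10 + pow(2, 3)) * sqrt(4) - 1")); // 17 lexemes
        }
    }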

Next, the types of Token object the DFA returns: NUM, FUN, VAR, ADD, SUB, MUL, DIV, LBT, RBT, COMMA.
The first three differ from the DFA states:
  • NUM represents the lexical elements recognized by the two states INT and NUMBER.
  • FUN and VAR are both elements recognized by the ID state: if the identifier is the name of a function, the Token's type is FUN, otherwise VAR.
The other types correspond one-to-one to DFA states.

Finally, a few words about the DFA's interface (a sketch follows this list):
  1. Suppose the DFA has a method called parse whose only parameter is the expression string to analyze. If the DFA were used to analyze long texts (rather than these short arithmetic expressions), an input-stream parameter would be worth considering.
  2. The return value of parse can be viewed as a cursor over Tokens for the parser to call, or as the full list of Tokens produced once analysis is finished. Since the fairly common parsing algorithms need to look ahead by only one Token, a cursor is sufficient.
  3. Because the parser may read a Token and then want to hand it back to be read again next time, the cursor could also offer a putBack method — or you can skip that method and let the parser cache the lexical element it has not consumed yet. If the DFA simply returns a list, the parser can move an offset back and forth as it pleases.
  4. Returning a list is the simplest implementation, but it is unsuitable when the data to parse is very large — especially data streamed in over the network, where we do not know when the stream will end; there, returning one element at a time is the safer practice. For the calculator we are building, either way is fine.
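The interface described above might look like the following sketch (again my own names, such as TokenCursor and lookAhead): parse would return one of these, and the parser reads it with one Token of lookahead.

    import java.util.List;

    // A cursor over the token list with one token of lookahead and an optional putBack.
    public class TokenCursor {
        public enum Type { NUM, FUN, VAR, ADD, SUB, MUL, DIV, LBT, RBT, COMMA }
        public record Token(Type type, String content) {}

        private final List<Token> tokens; // produced by the DFA; a list is fine for short expressions
        private int pos = 0;

        public TokenCursor(List<Token> tokens) { this.tokens = tokens; }

        public boolean hasNext()  { return pos < tokens.size(); }
        public Token lookAhead()  { return hasNext() ? tokens.get(pos) : null; } // peek without consuming
        public Token next()       { return tokens.get(pos++); }                  // consume one token
        public void putBack()     { pos--; }                                     // un-read the last token

        public static void main(String[] args) {
            TokenCursor cursor = new TokenCursor(List.of(new Token(Type.NUM, "1"), new Token(Type.ADD, null)));
            System.out.println(cursor.lookAhead()); // peek at the first token
            System.out.println(cursor.next());      // consume the same token
        }
    }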


Parsing
Parsing converts the tokenList returned by lexical analysis into a syntax tree.
The result of lexical analysis exists to serve parsing, and parsing in turn prepares for the semantic analysis that follows. In this section we only discuss how the syntax tree is constructed; how it is used is covered in the next section on semantic analysis.

A syntax tree is a tree-shaped data structure: each root node represents a syntactic structure, and its child nodes represent the smaller syntactic structures that make up the larger one. We keep dividing into smaller structures until we reach child nodes that cannot be divided any further — really just a whole-and-parts relationship.
A tree structure wants to be drawn, but our usual tool is text, so we normally represent a tree in a textual form like the following:
NODE1     -> NODE2 NODE3
NODE2     -> NODE4
This means: NODE1 is the root node; NODE1 has two children, NODE2 and NODE3; and NODE2 in turn has one child, NODE4.
In this way a tree structure can be written down as text.

This notation is called "Backus-Naur Form", abbreviated BNF — wherever the name is needed below I will simply write BNF.

Now let's look at the BNF of this calculator (the BNF description of a complex language runs to dozens of pages, but our calculator needs only these few lines):
 
  1. exp     -> term
  2. exp     -> term + exp
  3. exp     -> term - exp
  4. term    -> factor
  5. term    -> factor * term
  6. term    -> factor / term
  7. factor  -> varName
  8. factor  -> number
  9. factor  -> - number
  10. factor  -> funCall
  11. factor  -> ( exp )
  12. funCall -> funName ( params )
  13. params  -> exp
  14. params  -> exp , params
A description of this grammar:
  1. exp is an arithmetic expression, i.e. the whole expression we want to analyze.
  2. term is an item that can take part in addition or subtraction.
  3. We can see that exp has three lines describing its structure; the three lines are alternatives ("or"): an exp may be a single term, or a term plus an exp, or a term minus an exp. In other words, the exp root node may have only one child node, or three.
  4. factor is an item that can take part in multiplication or division.
  5. The way term is broken down further is exactly the same as for exp, so I will not repeat it.
  6. The reason addition/subtraction and multiplication/division are split into two syntactic structures is that multiplication and division have higher precedence than addition and subtraction. With this notation a term must be computed before an addition or subtraction can be computed, and because the term is computed first, the precedence is expressed.
  7. varName is a variable name, number is a number, and funCall is a function call.
  8. The make-up of factor can be read as: a variable, or a number, or a minus sign followed by a number (a negative number), or a function call, or an expression wrapped in parentheses.
  9. funName is a function name, and params are the arguments of a function call.
  10. The structure of funCall is: a function name, followed by an opening parenthesis, then one or more arguments, then a closing parenthesis.
  11. params has two productions: since every argument can itself be an arithmetic expression, a single exp node can represent one argument; if there are several arguments, the exp is followed by a comma and then by the remaining arguments.
Compared with drawing trees, this notation has one advantage: it is relatively easy to express "or". The factor node, for instance, may be a variable name, or a number, and so on — if we drew pictures, how would we show that only one of several possibilities is chosen?

The notation above is an abstract description of the syntactic structure; the abstraction still has to be made concrete as data structures in our code.
Our goal is to turn an arithmetic expression into the shape described by the grammar. As an example, (1 + 2) * 3 converted into a syntax tree looks like the figure below:

The blue parts of the figure are the leaf nodes of the syntax tree, the nodes that cannot be decomposed any further. These leaves are exactly the lexical elements from the lexical analysis stage.
Now let's look at the steps for constructing this tree:
  1. The figure above contains only three kinds of non-leaf node (exp, term, factor) and a few kinds of leaf node [number, *, +, (, )], so below I describe only these; the parsing steps for the other node types are similar. Because the symbols + * ( ) read awkwardly in prose, from here on I write them as add, mul, lbt, rbt.
  2. Node types:
    • We could create a separate class for each node type: an exp node would be an object of an Exp class, a term node an object of a Term class, and so on.
    • Or we could use a single class TreeNode and distinguish node types with a type attribute on the node.
    • I choose the second way to represent node types; the children can then also be represented simply by a list.
  3. A number node additionally needs an attribute for the numeric content, so besides the type field the TreeNode class also needs a content field — varName and funName nodes need the content field too, for the variable or function name.
  4. Write one function per non-leaf node type to parse that syntactic structure: parseExp, parseTerm, parseFactor. Since each syntactic structure (or sub-structure) is a tree rooted at a TreeNode object, these functions all return a TreeNode.
  5. The structure of the code inside each function corresponds exactly to the BNF description. For example:
    • exp has three productions, "exp -> term", "exp -> term + exp" and "exp -> term - exp", so parseExp can be written in the following steps:
    • Create a TreeNode object expNode and set its type to exp.
    • Call parseTerm to parse expNode's first child, termNode.
    • Look ahead at the next lexical element (if there is one):
      1. If it is an add or sub Token, create the second child opNode (of type add or sub), then recursively call parseExp to parse the third child.
      2. If it is not an add or sub Token, the exp consists of just the single term; the token is left for the enclosing structure to consume (at the top level, a leftover token that is neither add nor sub means a syntax error).
    • Return expNode.
  • term has three productions, "term -> factor", "term -> factor * term" and "term -> factor / term", which have the same shape as exp's three productions, so the code is almost identical.
  • factor has five productions, but the current example uses neither variables nor functions, so we only look at three of them: "factor -> number", "factor -> - number" and "factor -> ( exp )". parseFactor can then be written as follows:
  • Create a TreeNode object factorNode and set its type to factor.
  • Look ahead at one lexical element:
    1. If it is a num Token, create a number TreeNode as factorNode's child (its content is the value of this Token).
    2. If it is a sub Token, read one more num from the tokenList and create a number TreeNode as factorNode's child (its content is the value of that Token with the sign flipped).
    3. If it is an lbt Token, first create an lbt TreeNode as factorNode's first child, then call parseExp to obtain the second child, then look at one more Token: if it is rbt, create an rbt TreeNode as the third child (if it is not rbt, report a syntax error).
  • Return factorNode.
Implementing the logic described above takes only a small number of lines of code to complete the calculator's parser.
That leaves funCall and params; parsing these two types is much like exp, term and factor, so I will not describe them. A sketch of the parser follows.
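Here is a sketch of the recursive-descent parser just described (my own code, limited to the productions used in the (1 + 2) * 3 walkthrough: numbers, negative numbers, parentheses and the four operators; funCall, params and variables would be handled the same way). To keep it self-contained it works on the raw lexeme strings rather than Token objects.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the recursive-descent (LL(1)) parser described above.
    public class Parser {
        public static class TreeNode {
            public final String type;          // "exp", "term", "factor", "number", "add", ...
            public final String content;       // only used by leaf nodes such as "number"
            public final List<TreeNode> children = new ArrayList<>();
            public TreeNode(String type, String content) { this.type = type; this.content = content; }
            public TreeNode(String type) { this(type, null); }
        }

        private final List<String> tokens;  // lexemes from the lexer, e.g. ["(", "1", "+", "2", ")", "*", "3"]
        private int pos = 0;

        public Parser(List<String> tokens) { this.tokens = tokens; }

        private String lookAhead() { return pos < tokens.size() ? tokens.get(pos) : null; }
        private String next() { return tokens.get(pos++); }

        // exp -> term | term + exp | term - exp
        public TreeNode parseExp() {
            TreeNode exp = new TreeNode("exp");
            exp.children.add(parseTerm());
            String la = lookAhead();
            if ("+".equals(la) || "-".equals(la)) {
                exp.children.add(new TreeNode(la.equals("+") ? "add" : "sub"));
                next();
                exp.children.add(parseExp());
            }
            return exp;
        }

        // term -> factor | factor * term | factor / term
        private TreeNode parseTerm() {
            TreeNode term = new TreeNode("term");
            term.children.add(parseFactor());
            String la = lookAhead();
            if ("*".equals(la) || "/".equals(la)) {
                term.children.add(new TreeNode(la.equals("*") ? "mul" : "div"));
                next();
                term.children.add(parseTerm());
            }
            return term;
        }

        // factor -> number | - number | ( exp )
        private TreeNode parseFactor() {
            TreeNode factor = new TreeNode("factor");
            String t = next();
            if ("-".equals(t)) {
                factor.children.add(new TreeNode("number", "-" + next()));
            } else if ("(".equals(t)) {
                factor.children.add(new TreeNode("lbt"));
                factor.children.add(parseExp());
                if (!")".equals(next())) throw new IllegalStateException("expected ')'");
                factor.children.add(new TreeNode("rbt"));
            } else {
                factor.children.add(new TreeNode("number", t));
            }
            return factor;
        }

        public static void main(String[] args) {
            TreeNode tree = new Parser(List.of("(", "1", "+", "2", ")", "*", "3")).parseExp();
            System.out.println(tree.type + " with " + tree.children.size() + " child(ren)"); // exp with 1 child(ren)
        }
    }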

A few additional points worth noting:
  1. The style of code we are writing is called LL(1), also known as recursive-descent parsing. In this style we always create the root node first and then its children, and when creating children we always start from the leftmost one and work rightwards (in parseExp, for instance, we first create expNode, then termNode, and then, if there are more elements, opNode and the next expNode). This style of parsing is the easiest to implement in code, which is why I use it, but it has one limitation: the BNF description of the grammar must not contain left recursion. What is left recursion? For example, exp's productions "exp -> term + exp" and "exp -> term - exp" could just as correctly be written "exp -> exp + term" and "exp -> exp - term", but if we wrote code from those productions, the first thing to parse would not be a term but another exp, so parsing the first TreeNode would require a recursive call to parseExp — infinite recursion. So when designing a grammar for LL(1) we must avoid left recursion.
  2. LL(1) grammars run into descriptive limitations when parsing complex languages. For example, if a C statement begins with ABC, that ABC might be a variable name or it might be a label; we must look one more Token ahead to know which production to use, which makes the parser LL(2) — the number in parentheses after LL is how many Tokens we must look ahead when choosing a production — and the parsing code becomes much more complicated. Is there a parsing method that is both simple to implement and powerful in the grammars it can describe? The answer is yes, but this article only needs to parse the calculator's grammar, so there is no point in spending a lot of space on other parsing techniques; if I have time later I may write a separate post about parsing.
  3. Parsing sometimes also needs context. Suppose a piece of C++ code contains the statement f<A, B>(c); without context it cannot be classified: it might be a function call (f a function name, A and B template arguments, c a function argument), or it might be a comma expression (f, A, B and c are all variables or values, and the statement means f less than A, comma, B greater than c). This is really an oversight in C++'s grammar design, and it causes compiler writers a great deal of trouble. Java avoids C++'s awkwardness for calls to generic methods: in Java the generic arguments go not between the method name and the arguments but before the method name, as in <A, B>f(c), so there is no conflict with the rest of Java's grammar. This also shows that the design of a grammar has a very direct influence on writing the parser.
Finally, something a little off the beaten path:
The parsing process described above is the orthodox approach; using it to parse a calculator's arithmetic expressions feels a bit like using a sledgehammer to crack a nut.
The syntax tree of an arithmetic expression can use a much simpler structure. For example, the syntax tree of (10 + pow(2, 3)) * sqrt(4) - 1 can be drawn as the following figure:

This tree is obviously much smaller than the one produced by the orthodox parse — it contains only terminal symbols; the non-terminals exp, term, factor and so on do not appear.
Semantic analysis of such a syntax tree is also simpler; when we get to semantic analysis I will come back to how this pocket-sized syntax tree is processed.
The construction rule of this tree is: every leaf node is an operand and every internal node is an operator — and you will also notice that the parentheses have disappeared; indeed, parentheses are unnecessary in a syntax tree.
As for how to construct this syntax tree, I'll leave that as a teaser and not cover it here.



Semantic analysis
Semantic analysis turns the syntax tree into intermediate code, and the intermediate code is then turned into target code. For an analysis as simple as ours we skip the intermediate-code step and generate target code directly from the syntax tree (the target code being the ASM described earlier).
For a complex language semantic analysis is very complicated, but for a language whose grammar and design are as simple as ours it is so simple that it could be merged straight into the parsing pass. Since our goal is to explain compilation through the calculator example, though, this part still gets a brief treatment.
The reason I say this calculator's semantic analysis could be merged into parsing is that the calculator's structure needs no context-dependent checks, so if parsing reports no error, semantic analysis is guaranteed to find no problem either.

Let's reuse the (1 + 2) * 3 example from the parsing section; its syntax tree is still the one shown here:

Parsing builds the syntax tree and semantic analysis reads it, so the code of the two passes mirrors each other. To read the tree we again need the three functions parseExp, parseTerm and parseFactor, except that their parameter is no longer the tokenList but a TreeNode, and the return value is no longer a TreeNode but ASM code (returning plain text is fine — though returning a list of strings may be even better, since the assembler then receives the instructions one by one). The logic of the three functions is described below.
  1. parseExp: check how many children the exp node has. If there is only one, it must be a term, so pass it to parseTerm and return the result. If there are three, call parseExp on the third child to get an asmCode, pass the first child to parseTerm to get another run of instructions and append it to asmCode, then check whether the second child is add or sub and append an add or sub instruction to asmCode; finally return asmCode.
    • Why, with three children, do we start with the third, then the first, and only then the second? Because:
    • In our design the instructions take their operands from the stack, so all the operands must be in place before the operation instruction can be written; that is why the second child (the operator) is handled last.
    • In our design the operands are popped in the normal left-to-right order of the expression, so they must be pushed right to left; that is why the third child is handled before the first.
parseTerm: almost identical to parseExp, so I will not repeat it. parseFactor: in the parsing section we only covered the three productions "factor -> number", "factor -> - number" and "factor -> ( exp )", so we continue with those as the example.
  • If factor's first child is a number node, simply return the instruction "store " + number.
  • If factor's first child is a sub node, simply return the instruction "store -" + number.
  • If factor's first child is an lbt node, call parseExp on the second child to get asmCode and return it — the third child is necessarily rbt, so it can be ignored here; in fact the lbt and rbt nodes need not even have been built during parsing.
And with that we have the ASM code. To see the result on the console we append one print instruction at the end of the ASM code. A code-generation sketch follows.
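Here is a sketch of this code generator (my own code; I name the functions genExp, genTerm and genFactor to keep this pass apart from the parser functions of the same name, and the TreeNode class has the same shape as in the parser sketch above). Appending a final print instruction to the returned list gives exactly the ASM shown in the examples earlier.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the code generator: walk the syntax tree and emit ASM, one instruction per string.
    public class CodeGen {
        public static class TreeNode {
            public final String type, content;
            public final List<TreeNode> children = new ArrayList<>();
            public TreeNode(String type, String content, TreeNode... kids) {
                this.type = type; this.content = content; children.addAll(List.of(kids));
            }
        }

        // exp -> term | term (+|-) exp : third child first, then first child, then the operator.
        public static List<String> genExp(TreeNode exp) {
            List<String> asm = new ArrayList<>();
            if (exp.children.size() == 1) {
                asm.addAll(genTerm(exp.children.get(0)));
            } else {
                asm.addAll(genExp(exp.children.get(2)));   // right operand: pushed first
                asm.addAll(genTerm(exp.children.get(0)));  // left operand: pushed second, ends up on top
                asm.add(exp.children.get(1).type);         // "add" or "sub"
            }
            return asm;
        }

        // term has exactly the same structure as exp.
        public static List<String> genTerm(TreeNode term) {
            List<String> asm = new ArrayList<>();
            if (term.children.size() == 1) {
                asm.addAll(genFactor(term.children.get(0)));
            } else {
                asm.addAll(genTerm(term.children.get(2)));
                asm.addAll(genFactor(term.children.get(0)));
                asm.add(term.children.get(1).type);        // "mul" or "div"
            }
            return asm;
        }

        // factor -> number | ( exp ) ; the parentheses contribute no code.
        public static List<String> genFactor(TreeNode factor) {
            TreeNode first = factor.children.get(0);
            if (first.type.equals("number")) {
                return new ArrayList<>(List.of("store " + first.content));
            }
            return genExp(factor.children.get(1));         // "lbt exp rbt": generate code for the exp only
        }

        public static void main(String[] args) {
            // Hand-built tree for 1 - 2 * 3.
            TreeNode tree = new TreeNode("exp", null,
                new TreeNode("term", null, new TreeNode("factor", null, new TreeNode("number", "1"))),
                new TreeNode("sub", null),
                new TreeNode("exp", null, new TreeNode("term", null,
                    new TreeNode("factor", null, new TreeNode("number", "2")),
                    new TreeNode("mul", null),
                    new TreeNode("term", null, new TreeNode("factor", null, new TreeNode("number", "3"))))));
            List<String> asm = genExp(tree);
            asm.add("print");
            asm.forEach(System.out::println); // store 3, store 2, mul, store 1, sub, print
        }
    }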
The last thing to do is to call the existing assembler and virtual machine to run it.


The techniques covered so far are nowhere near enough for processing complex languages, but they are quite sufficient for designing an interpreter for a language that is simple in syntax yet strong in functionality, and they are adequate for the syntax handling of typical domain-specific languages.

If I get the chance I may write more articles on compilation techniques later.

 

P.S. One thing not yet covered: semantic analysis of the simplified syntax tree.
It is very simple: do a post-order traversal of the syntax tree (right child first, then left child, then the node itself); emit a store instruction at every leaf node and the corresponding operation instruction at every internal node — simple, isn't it? A sketch follows.
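A tiny sketch of that traversal (my own illustration; Node and emit are hypothetical names). Run on the simplified tree of (10 + pow(2, 3)) * sqrt(4) - 1, it reproduces the eleven-line ASM of example 2.

    import java.util.ArrayList;
    import java.util.List;

    // Code generation for the simplified syntax tree: leaves are operands, internal nodes are
    // operators; visit the children right to left, then emit the node's own instruction.
    public class MiniTreeGen {
        public static class Node {
            final String label;            // a number for leaves, an instruction name otherwise
            final List<Node> children = new ArrayList<>();
            Node(String label, Node... kids) { this.label = label; children.addAll(List.of(kids)); }
        }

        public static void emit(Node node, List<String> asm) {
            if (node.children.isEmpty()) {
                asm.add("store " + node.label);                       // leaf: push the operand
            } else {
                for (int i = node.children.size() - 1; i >= 0; i--) { // right child first
                    emit(node.children.get(i), asm);
                }
                asm.add(node.label);                                  // then the operator itself
            }
        }

        public static void main(String[] args) {
            // (10 + pow(2, 3)) * sqrt(4) - 1
            Node tree = new Node("sub",
                new Node("mul",
                    new Node("add", new Node("10"), new Node("pow", new Node("2"), new Node("3"))),
                    new Node("sqrt", new Node("4"))),
                new Node("1"));
            List<String> asm = new ArrayList<>();
            emit(tree, asm);
            asm.add("print");
            asm.forEach(System.out::println);   // matches the eleven-line ASM of example 2
        }
    }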

Reposted from: https://www.cnblogs.com/naturemickey/p/3667567.html
