Foreword
Although the title says "the syntax of programming languages", this post actually covers the parsing of both lexical and syntactic structure. Frankly, the "Writing a Compiler" series describes these topics more clearly; there, discussion of language syntax is interspersed throughout the design, which also depends on the language designer's mood
Unlike natural languages such as Chinese and English, a computer language must be precise: its syntax and semantics must be guaranteed free of ambiguity, which of course also makes parsing easier
So one very important task for a compiler is to enforce the structural rules of the programming language. Accomplishing this goal has two requirements:
- A complete description of the grammar rules
- A way to determine whether a given program is structured according to these rules, that is, whether it conforms to the grammar
The first requirement is mainly met by regular expressions and context-free grammars, which describe the grammar; the second is met by the compiler's parser
Describing syntax: regular expressions and context-free grammars
Lexical structure can be described with three rules:
- concatenation
- alternation
- Kleene closure (that is, repetition any number of times)
For example, an integer constant is a digit repeated any number of times; a language that can be described this way is called a regular language. If we add recursion to the definitions, we can describe the entire syntax, and the result is called a context-free grammar
Regular expressions for tokens
For a programming language, the token types are nothing more than keywords, identifiers, and constants of the various types
An integer constant can be represented with this regular expression:
integer -> digit digit*
digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
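As a minimal sketch, this rule can be checked directly with Python's re module (the pattern string is my own transcription of the rule above):

```python
import re

# integer -> digit digit* : one digit followed by any number of digits
integer = re.compile(r"[0-9][0-9]*")

print(bool(integer.fullmatch("12345")))   # a valid integer constant
print(bool(integer.fullmatch("12a5")))    # rejected
```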
Context-free grammar
Regular expressions are generally only suitable for describing tokens, because a regular expression cannot describe nested structures. A regular expression is typically implemented with a finite state automaton; the simple regular expression engine I implemented in Python earlier works this way too. But matching arbitrarily nested structures would require a state machine of arbitrarily large depth, which a finite automaton clearly cannot provide. Nested structures are very useful for defining syntax, and that is why we have context-free grammars
expr := id | number | - expr | ( expr ) | expr op expr
op := + | - | * | /
In a context-free grammar, each rule is called a production. The left-hand side of a production is a nonterminal, and the right-hand side is a sequence of terminals and nonterminals. Eventually all rules bottom out at terminals, and the terminals, i.e. the tokens, are defined by regular expressions
Derivations and parse trees
Given a proper context-free grammar, how do we generate a grammatical string of terminals?
The simplest way is to start from the start symbol, replace it with the right-hand side of one of its productions, then repeatedly pick a nonterminal in the resulting string and expand it, until no nonterminals remain. This recursive process traces out a tree, the parse tree
expr := expr op expr
:= expr op id
:= expr + id
:= expr op expr + id
:= expr op id + id
:= expr * id + id
:= id * id + id
But a given context-free grammar may derive more than one parse tree for the same string; such a grammar is said to be ambiguous. So for the context-free grammar above, is there a better, unambiguous grammar?
Scanning
Scanning is lexical analysis. It does not strictly require regular expressions or automata; it can be written entirely by hand, and in fact, in order to produce better error messages, much of industry hand-writes its lexical analyzers
A hand-written lexical analyzer does nothing more than keep reading characters until it can judge which token it has, then pass that token to the parser
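As an illustration, a hand-written lexer in this spirit might look like the following sketch (the token names and the tiny language are my own assumptions, not the article's):

```python
def tokenize(source):
    """Keep reading characters until a whole token can be judged, then emit it."""
    tokens = []
    i = 0
    while i < len(source):
        c = source[i]
        if c.isspace():                          # skip whitespace
            i += 1
        elif c.isdigit():                        # NUM: digit digit*
            j = i
            while j < len(source) and source[j].isdigit():
                j += 1
            tokens.append(("NUM", source[i:j]))
            i = j
        elif c.isalpha():                        # identifier or keyword
            j = i
            while j < len(source) and source[j].isalnum():
                j += 1
            word = source[i:j]
            kind = "KEYWORD" if word in ("read", "write") else "ID"
            tokens.append((kind, word))
            i = j
        elif source[i:i + 2] == ":=":            # assignment operator
            tokens.append(("ASSIGN", ":="))
            i += 2
        elif c in "+-*/()":
            tokens.append(("OP", c))
            i += 1
        else:
            raise SyntaxError("unexpected character: " + c)
    return tokens

print(tokenize("sum := A + B"))
```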
Finite state automata
The steps of automaton-based lexical analysis are generally:
Write regular expressions for the lexical rules
Convert the regular expressions into a nondeterministic finite automaton (NFA)
In fact, any regular expression can be expressed using only concatenation, alternation, and Kleene closure
Likewise, a finite automaton can represent each of these three operations. I won't draw the diagrams here; the earlier articles on writing a regular expression engine in Python already include them
Convert the NFA to a deterministic finite automaton (DFA)
The NFA can be converted to a DFA with the subset construction. The main idea: each DFA state reached after reading a given input represents the set of all states the original NFA could be in after reading that same input
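A sketch of the subset construction, under my own encoding of the NFA as a transition dictionary:

```python
from collections import deque

def epsilon_closure(states, eps):
    """All NFA states reachable from `states` through epsilon edges alone."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction(nfa_trans, eps, start, alphabet):
    """Each DFA state is the set of NFA states reachable on the same input."""
    start_state = epsilon_closure({start}, eps)
    dfa, queue = {}, deque([start_state])
    while queue:
        state = queue.popleft()
        if state in dfa:
            continue
        dfa[state] = {}
        for c in alphabet:
            move = set()
            for s in state:                      # all NFA moves on c
                move |= set(nfa_trans.get((s, c), ()))
            if move:
                target = epsilon_closure(move, eps)
                dfa[state][c] = target
                queue.append(target)
    return start_state, dfa

# Hypothetical NFA for (a|b)*ab, without epsilon edges; accept state is 2
nfa_trans = {(0, "a"): [0, 1], (0, "b"): [0], (1, "b"): [2]}
start, dfa = subset_construction(nfa_trans, {}, 0, "ab")
print(len(dfa))  # 3 DFA states
```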
Minimize DFA
The main idea of DFA minimization: first partition all the DFA states into two equivalence classes, the accepting states and the non-accepting states. Then search for an equivalence class X and an input character c such that on input c the members of X transition to states located in k > 1 different equivalence classes. Split X into k new classes so that all members of each new class transition on c into the same old class. Repeat until no class can be split this way, and the minimization is complete
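A rough sketch of this splitting process (the DFA encoding and the example are my own; production minimizers use Hopcroft's more efficient algorithm):

```python
def minimize(states, accepting, trans, alphabet):
    """Split classes until each class maps into a single class on every input."""
    partition = [set(accepting), set(states) - set(accepting)]
    partition = [p for p in partition if p]
    changed = True
    while changed:
        changed = False
        for X in partition:
            for c in alphabet:
                def class_of(s):
                    """Index of the class that s jumps into on input c."""
                    t = trans.get((s, c))
                    for i, p in enumerate(partition):
                        if t in p:
                            return i
                    return -1                    # no transition on c
                groups = {}
                for s in X:
                    groups.setdefault(class_of(s), set()).add(s)
                if len(groups) > 1:              # X splits into k > 1 classes
                    partition.remove(X)
                    partition.extend(groups.values())
                    changed = True
                    break
            if changed:
                break
    return partition

# Hypothetical DFA over {a}: states 3 and 4 both accept and behave identically
trans = {(1, "a"): 2, (2, "a"): 3, (3, "a"): 4, (4, "a"): 4}
classes = minimize({1, 2, 3, 4}, {3, 4}, trans, "a")
print(sorted(sorted(c) for c in classes))  # states 3 and 4 end up in one class
```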
All four of these steps were completed when I wrote the regular expression engine; those three articles go into a bit more detail
Grammar analysis
A parser generally takes a token stream as input and outputs a parse tree. Parsing methods broadly divide into top-down and bottom-up; the most important members of these two families are called LL and LR
LL stands for left-to-right scan, leftmost derivation; LR stands for left-to-right scan, rightmost derivation. Both families read the input from left to right, then try to figure out the derivation that produces the input
Top-down approach
A top-down parser follows the derivation described earlier: starting from the root, it repeatedly expands nonterminals toward the leaves until every current leaf node is a terminal
Recursive descent
Recursive descent is exactly what was just described, deriving downward from the root node; it is generally used for relatively simple languages
read A
read B
sum := A + B
write sum
write sum / 2
For example, when a recursive descent parser handles this program, it begins by calling the parse function for program; after reading the first token read, program calls stmt_list, which in turn calls stmt, which finally matches read A. Continuing this way, the parser's execution path traces a left-to-right, top-down traversal of the parse tree
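A recursive descent parser for this little language might be sketched as follows (the exact grammar and function names are my guesses at the intended language, with a trace list to show the call path):

```python
def parse_program(toks):
    """Recursive descent: one parse function per nonterminal (assumed grammar:
    program -> stmt_list; stmt -> read id | write expr | id := expr;
    expr -> operand { + | - | * | / operand })."""
    pos = 0
    trace = ["program"]                  # records each parse function entered

    def peek():
        return toks[pos] if pos < len(toks) else None

    def advance():
        nonlocal pos
        pos += 1

    def eat(tok):
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {peek()!r}")
        advance()

    def stmt_list():
        trace.append("stmt_list")
        while peek() is not None:
            stmt()

    def stmt():
        trace.append("stmt")
        if peek() == "read":
            eat("read"); advance()        # read id
        elif peek() == "write":
            eat("write"); expr()          # write expr
        else:
            advance(); eat(":="); expr()  # id := expr

    def expr():
        trace.append("expr")
        advance()                         # first operand (id or number)
        while peek() in ("+", "-", "*", "/"):
            advance(); advance()          # operator, then the next operand

    stmt_list()
    return trace

toks = ["read", "A", "read", "B", "sum", ":=", "A", "+", "B",
        "write", "sum", "write", "sum", "/", "2"]
print(parse_program(toks))
```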
Table-driven top-down LL
A table-driven LL parser is based on a parse table and a stack
The analysis process is:
- Initialize a stack
- Push the start symbol onto the stack
- Pop the stack, then look up the table entry for the popped symbol and the current input symbol
- If the popped symbol is a nonterminal, the table lookup determines the next production, whose right-hand side is pushed onto the stack
- If the popped symbol is a terminal, match it against the current input symbol and advance the input
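The steps above can be sketched as follows, using a classic LL(1) expression grammar of my own choosing (E -> T E', E' -> + T E' | ε, T -> id) and a hand-built parse table:

```python
def ll_parse(tokens):
    """Table-driven LL(1) parse of: E -> T E' ; E' -> + T E' | ε ; T -> id."""
    table = {
        ("E", "id"): ["T", "E'"],
        ("E'", "+"): ["+", "T", "E'"],
        ("E'", "$"): [],                 # ε: push nothing
        ("T", "id"): ["id"],
    }
    nonterminals = {"E", "E'", "T"}
    stack = ["$", "E"]                   # push the start symbol onto the stack
    tokens = tokens + ["$"]              # end-of-input marker
    i = 0
    while stack:
        top = stack.pop()
        if top in nonterminals:          # look up the production to expand
            rhs = table.get((top, tokens[i]))
            if rhs is None:
                raise SyntaxError(f"no rule for ({top}, {tokens[i]})")
            stack.extend(reversed(rhs))  # push its right-hand side
        elif top == tokens[i]:           # terminal: match and advance the input
            i += 1
        else:
            raise SyntaxError(f"expected {top!r}, got {tokens[i]!r}")
    return i == len(tokens)

print(ll_parse(["id", "+", "id"]))  # True
```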
Prediction sets
As can be seen above, the most important piece is the parse table: it predicts the next production from the current input symbol. Building it requires the concept of prediction sets, that is, the First and Follow sets. The "Writing a Compiler" series covered this in more detail, so I won't repeat it here
Of course, there are many grammars that LL cannot handle, which is why other parsing methods exist
Bottom-up approach
In practice, bottom-up parsing is table-driven. The parser keeps the roots of all partially completed subtrees on a stack. When it gets a new token from the scanner, it shifts the token onto the stack. When it finds that the symbols on top of the stack form the right-hand side of some production, it reduces them to that production's left-hand-side symbol.
A bottom-up parse thus corresponds to constructing the parse tree of the input string from the leaves upward, progressing toward the root through shift and reduce operations
A bottom-up parser needs a stack to store the symbols being resolved. For example, given the following grammar:
0. statement -> expr
1. expr -> expr + factor
2. | factor
3. factor -> ( expr )
4. | NUM
parsing 1 + 2:
stack | input | action
---|---|---
(empty) | 1 + 2 | 
NUM | + 2 | Read a character, resolve it into the corresponding token, and push it onto the stack; this is called a shift
factor | + 2 | Derive by the grammar: factor -> NUM, so pop NUM and push factor; this operation is called a reduce
expr | + 2 | Continue reducing here with expr -> factor. Because more than one production could apply next, the parser must look ahead one token to judge whether to shift or reduce; this is the LA (lookahead) of LR parsing
expr + | 2 | shift
expr + NUM | (empty) | shift
expr + factor | (empty) | reduce by the production factor -> NUM
expr | (empty) | reduce by expr -> expr + factor
statement | (empty) | reduce by statement -> expr
At this point the stack has been reduced to the start symbol and the input string is empty, which means the parse succeeded
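This trace can be reproduced by a naive shift-reduce loop. The sketch below sidesteps the lookahead decision by always preferring the longest matching right-hand side and only accepting statement at the very end; a real LR parser consults a table instead:

```python
# The grammar above, with the longest right-hand sides listed first
GRAMMAR = [
    ("expr", ["expr", "+", "factor"]),
    ("factor", ["(", "expr", ")"]),
    ("factor", ["NUM"]),
    ("expr", ["factor"]),
    ("statement", ["expr"]),
]

def shift_reduce(tokens):
    stack = []
    tokens = list(tokens)
    while True:
        for lhs, rhs in GRAMMAR:
            if len(stack) >= len(rhs) and stack[-len(rhs):] == rhs:
                if lhs == "statement" and (tokens or len(stack) != 1):
                    continue             # accept only when all input is consumed
                del stack[-len(rhs):]    # pop the right-hand side ...
                stack.append(lhs)        # ... and push the left-hand side: reduce
                break
        else:
            if not tokens:
                break                    # nothing to reduce, nothing to shift
            stack.append(tokens.pop(0))  # shift the next token onto the stack
    return stack

print(shift_reduce(["NUM", "+", "NUM"]))  # ['statement']
```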
Constructing the finite state automaton
0. s -> e
1. e -> e + t
2. e -> t
3. t -> t * f
4. t -> f
5. f -> ( e )
6. f -> NUM
- Perform the closure operation on the starting production
Write the initial production with a dot (.) at the start of its right-hand side
s -> . e
Then perform the closure operation on the symbol to the right of the dot: if that symbol is a nonterminal, there must be productions with it to the left of the ->, so add those productions in
s -> . e
e -> . e + t
e -> . t
Repeat this operation on the newly added productions until the productions of every nonterminal located to the right of a dot have been brought in
- Partition the productions
Group the productions that have the same symbol to the right of the dot into one partition, then move each partition's dot one symbol to the right to form a new state node; for example
e -> t .
t -> t . * f
end up in the same new state
- Build the jump relationships between all the partition nodes
The symbol each node's dot moved across, i.e. the symbol now to the left of the dot, labels the input on which the automaton jumps to that node
For example, the symbol to the left of the dot above is t: when the state machine is in state 0 and the input is t, it jumps to state 1.
- Repeat the construction for all newly generated nodes
Finally, repeat the construction for each newly generated node, building new nodes and jumps until all states are built
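The closure and partition (goto) steps above can be sketched as follows; the item encoding (left-hand side, right-hand side, dot position) is my own:

```python
# Grammar from above: s is the start symbol, NUM / + / * / ( / ) are terminals
GRAMMAR = {
    "s": [["e"]],
    "e": [["e", "+", "t"], ["t"]],
    "t": [["t", "*", "f"], ["f"]],
    "f": [["(", "e", ")"], ["NUM"]],
}

def closure(items):
    """For every nonterminal right of a dot, bring in its productions."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                for prod in GRAMMAR[rhs[dot]]:
                    item = (rhs[dot], tuple(prod), 0)
                    if item not in items:
                        items.add(item)
                        changed = True
    return frozenset(items)

def goto(items, symbol):
    """Take the partition with `symbol` right of the dot and advance the dot."""
    moved = {(lhs, rhs, dot + 1)
             for lhs, rhs, dot in items
             if dot < len(rhs) and rhs[dot] == symbol}
    return closure(moved)

def build_states():
    """Repeat closure and goto until no new state nodes appear."""
    start = closure({("s", ("e",), 0)})
    states, queue = {start}, [start]
    while queue:
        state = queue.pop()
        for sym in {rhs[dot] for _, rhs, dot in state if dot < len(rhs)}:
            target = goto(state, sym)
            if target and target not in states:
                states.add(target)
                queue.append(target)
    return states

print(len(build_states()))  # 12 item-set states for this grammar
```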
Summary
This post only briefly touched on the lexical and syntactic analysis of languages; to combine language design with practice, you should read the more detailed "Writing a Compiler" series