PL Really Interesting (2): Programming Language Syntax

Foreword

Although the title says "programming language syntax", this post is really about lexical and syntax analysis (parsing). The write-a-compiler series I wrote earlier actually covers this ground more clearly; discussion of language syntax itself is interspersed through the designs, and largely comes down to the language designer's taste

Unlike natural languages such as Chinese and English, computer languages must be precise: their syntax and semantics must be guaranteed free of ambiguity, which of course also makes parsing easier

So one very important task for a compiler is to enforce the structural rules of the programming language. Accomplishing this goal takes two things:

  • A complete description of the grammar rules
  • A way to determine whether a given program is structured according to those rules, i.e. whether it conforms to the grammar

The first requirement is met by describing the grammar with regular expressions and a context-free grammar; the second is the compiler's job, namely parsing

Describing syntax: regular expressions and context-free grammars

Lexical structure can be described with three rules:

  1. concatenation
  2. alternation (selection)
  3. Kleene closure (that is, repetition any number of times)

For example, an integer constant is a digit repeated any number of times; languages describable this way are called regular languages. If, on top of strings, we add recursive definitions, we can describe the entire syntax, and the result is called a context-free grammar

Regular expressions for tokens

For a programming language, the tokens are essentially keywords, identifiers, and constants of the various types

An integer constant can be represented by this regular expression:

integer -> digit digit*
digit   -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
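As a concrete illustration, here is a tiny hand-rolled recognizer (my own sketch, not the author's code) for exactly the language of `integer -> digit digit*`, without using a regex library:

```python
def match_integer(s):
    # integer -> digit digit*: at least one digit, then any number more
    if not s or s[0] not in "0123456789":
        return False
    return all(c in "0123456789" for c in s)
```

The empty string is rejected because the rule demands at least one leading digit.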

Context-free grammar

In general, regular expressions are only suitable for describing tokens, because a regular expression cannot describe nested structures. Regular expressions are usually implemented with finite state automata (the simple regular expression engine I implemented in Python earlier works this way too), but matching arbitrarily nested structures would require a state machine of arbitrarily large depth, which a finite automaton clearly cannot provide. Being able to define nested structures is very useful for describing syntax, hence context-free grammars

expr := id | number | - expr | ( expr ) | expr op expr

op := + | - | * | /

In a context-free grammar, each rule is called a production. The left-hand side of a production is a nonterminal, and the right-hand side is a sequence of terminals and/or nonterminals. Ultimately every rule bottoms out in terminals, and each terminal is a token defined by a regular expression
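To make the terminology concrete, here is one possible way (an illustrative encoding of my own, not a canonical one) to write the grammar above down in Python:

```python
# Each nonterminal maps to a list of alternatives; each alternative is a
# tuple of symbols (terminals or nonterminals) forming a production's
# right-hand side.
grammar = {
    "expr": [("id",), ("number",), ("-", "expr"), ("(", "expr", ")"),
             ("expr", "op", "expr")],
    "op":   [("+",), ("-",), ("*",), ("/",)],
}

nonterminals = set(grammar)
# Every symbol that never appears on a left-hand side is a terminal.
terminals = {sym for alts in grammar.values() for alt in alts
             for sym in alt} - nonterminals
```

This split recovers exactly the definition in the text: left-hand sides are the nonterminals, and everything else is a terminal defined at the lexical level.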

Derivations and parse trees

Given a context-free grammar, how do we generate a string of terminals that conforms to it?

The simplest way is to start from the start symbol, replace it with the right-hand side of one of its productions, then pick a nonterminal in the resulting string and continue deriving, until no nonterminals remain. This process naturally forms a recursive tree structure, the parse tree

expr := expr op expr
     := expr op id
     := expr + id
     := expr op expr + id
     := expr op id + id
     := expr * id + id
     := id * id + id

But a given context-free grammar may derive more than one parse tree for the same string; such a grammar is said to be ambiguous. So is there a better grammar for the expression language above? (One answer appears later in this post: stratifying expressions into expr/term/factor levels encodes operator precedence and removes the ambiguity)

Scanning

Scanning is lexical analysis. Strictly speaking, lexical analysis does not require regular expressions or automata at all; it can be written entirely by hand. In fact, to produce better error messages, much of industry still hand-writes its lexical analyzers

A hand-written lexical analyzer does little more than keep reading characters until it can decide what token it has, then hand that token to the parser
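A minimal sketch of such a hand-written scanner might look like this (the token names and character classes are my own choices, not from any particular compiler):

```python
def tokenize(src):
    """Hand-written scanner: read characters one at a time and group
    them into (kind, text) tokens."""
    tokens, i = [], 0
    while i < len(src):
        c = src[i]
        if c.isspace():
            i += 1                              # skip whitespace
        elif c.isdigit():                       # integer constant: digit digit*
            j = i
            while j < len(src) and src[j].isdigit():
                j += 1
            tokens.append(("NUM", src[i:j])); i = j
        elif c.isalpha():                       # identifier (or keyword)
            j = i
            while j < len(src) and src[j].isalnum():
                j += 1
            tokens.append(("ID", src[i:j])); i = j
        elif c in "+-*/()":
            tokens.append((c, c)); i += 1       # single-character operators
        else:
            raise SyntaxError(f"unexpected character {c!r}")
    return tokens
```

Each branch keeps reading characters just long enough to decide which token it is looking at, exactly as described above.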

Finite state automata

Lexical analysis with a finite state machine generally takes these steps:

  • Specify the lexical rules with regular expressions

  • Convert the regular expressions into a nondeterministic finite automaton (NFA)

In fact, any regular expression can be expressed using only concatenation, alternation, and Kleene closure

Likewise, finite automata can represent these three operations. I won't draw the diagrams here; the earlier articles on writing a regular expression engine in Python already include them

  • Convert the NFA into a deterministic finite automaton (DFA)

The NFA can be converted to a DFA using the subset construction. The main idea: the DFA state reached after reading a given input represents the set of all states the original NFA might be in after reading that same input

  • Minimize the DFA

The main idea of DFA minimization: first split the DFA states into two equivalence classes, the accepting states and the non-accepting states. Then repeatedly look for an equivalence class X and an input symbol c such that, on input c, the members of X transition into states belonging to k > 1 different equivalence classes. Split X into k classes so that all members of each new class transition, on c, into members of a single old class. When no such split can be found, the minimization is complete

All four of these steps were already carried out when writing the regular expression engine; the three articles there cover them in a bit more detail
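The NFA-to-DFA step can be sketched as follows. This is a rough version of the subset construction; the NFA encoding (a dict mapping (state, symbol) to a set of successor states, with epsilon moves keyed by the empty string) is an assumption of mine:

```python
from collections import deque

def nfa_to_dfa(nfa, start, accepts, alphabet):
    """Subset construction: each DFA state is the frozenset of NFA states
    the NFA could be in after reading the same input."""
    def eps_closure(states):
        # All states reachable via epsilon ("") moves.
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for t in nfa.get((s, ""), ()):
                if t not in seen:
                    seen.add(t); stack.append(t)
        return frozenset(seen)

    start_set = eps_closure({start})
    dfa, queue, seen = {}, deque([start_set]), {start_set}
    while queue:
        S = queue.popleft()
        for a in alphabet:
            # Union of moves on `a` from every NFA state in S.
            T = eps_closure({t for s in S for t in nfa.get((s, a), ())})
            if not T:
                continue
            dfa[(S, a)] = T
            if T not in seen:
                seen.add(T); queue.append(T)
    # A DFA state is accepting if it contains any accepting NFA state.
    dfa_accepts = {S for S in seen if S & accepts}
    return dfa, start_set, dfa_accepts
```

For example, feeding in a two-transition NFA for the string "ab" yields a DFA with the expected three states.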

Syntax analysis

A parser generally takes a token stream as input and produces a parse tree as output. Parsing methods fall into two broad families, top-down and bottom-up; the most important members of these families are called LL and LR

LL stands for Left-to-right scan, Leftmost derivation; LR stands for Left-to-right scan, Rightmost derivation. Both families read the input from left to right and then try to work out a derivation of the input

Top-down approach

A top-down parser follows the derivation process described earlier: starting from the root, it repeatedly and recursively derives would-be leaf nodes until every current leaf is a terminal

  • Recursive descent

Recursive descent works exactly as just described, deriving from the root node downward; it is generally used for relatively simple languages

read A
read B
sum := A + B
write sum
write sum / 2

For example, to parse this program by recursive descent, the parser begins by calling the function for program; after reading the first token read, program calls stmt_list, which in turn calls stmt, which finally matches read A. Continuing this way, the parser's execution path traces out the parse tree in a left-to-right, top-down traversal
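Here is a minimal recursive-descent recognizer for this toy language. It is a sketch under my own assumptions about the grammar (stmt is `read id`, `write expr`, or `id := expr`, and expr uses the usual term/factor levels); none of this code is from the original series:

```python
def parse(tokens):
    """Recursive-descent recognizer; `tokens` is a list of words."""
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok
    def stmt():
        if peek() == "read":
            take("read"); take()            # read id
        elif peek() == "write":
            take("write"); expr()           # write expr
        else:
            take(); take(":="); expr()      # id := expr
    def expr():                             # expr -> term { (+|-) term }
        term()
        while peek() in ("+", "-"):
            take(); term()
    def term():                             # term -> factor { (*|/) factor }
        factor()
        while peek() in ("*", "/"):
            take(); factor()
    def factor():                           # factor -> ( expr ) | id | number
        if peek() == "(":
            take("("); expr(); take(")")
        else:
            take()
    while pos < len(tokens):                # stmt_list -> stmt stmt_list | empty
        stmt()
    return True
```

Calling `parse` on the token list of the five-line program above runs exactly the call chain described: each grammar rule is one function, and the calls trace the parse tree top-down, left to right.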

  • Table-driven top-down LL

A table-driven LL parser is built around a parsing table and a stack

The analysis process is:

  1. Initialize a stack
  2. Push the start symbol onto the stack
  3. Pop the stack, then use the popped symbol and the current input symbol to consult the table
  4. If a nonterminal was popped, the table lookup determines which production's right-hand side to push onto the stack next
  5. If a terminal was popped, match it against the current input symbol and advance the input
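The steps above can be sketched as a short driver loop. The tiny grammar and table in the example are hypothetical, chosen only to exercise the loop:

```python
def ll_parse(table, start, tokens):
    """Table-driven LL(1) recognizer: table[(nonterminal, lookahead)]
    is the production body to push."""
    stack = ["$", start]                 # steps 1-2: init stack, push start symbol
    tokens = tokens + ["$"]              # $ marks end of input
    i = 0
    while stack:
        top = stack.pop()                # step 3: pop, consult the input
        look = tokens[i]
        if top == look:                  # step 5: terminal popped -- match, advance
            i += 1
        elif (top, look) in table:       # step 4: nonterminal popped -- look up
            stack.extend(reversed(table[(top, look)]))  # push production body
        else:
            raise SyntaxError(f"no rule for ({top!r}, {look!r})")
    return i == len(tokens)

# A tiny hypothetical grammar:  E -> id E'   E' -> + id E' | (empty)
table = {
    ("E", "id"): ["id", "E'"],
    ("E'", "+"): ["+", "id", "E'"],
    ("E'", "$"): [],
}
```

The table entries are exactly what the prediction sets discussed next would compute: given the nonterminal on top of the stack and the current input symbol, they predict which production to expand.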

Prediction sets

As seen above, the key ingredient is the parsing table. The parsing table is in effect a prediction of what comes next based on the current input character, which brings in a concept: prediction sets, i.e. the First and Follow sets. The write-a-compiler series covered these in more detail, so I won't repeat that here

Of course, there are many grammars that LL parsing cannot handle, hence the other family of parsing methods

Bottom-up approach

In practice, bottom-up parsing is table-driven. The parser keeps the roots of all partially completed subtrees on a stack. When it gets a new token from the scanner, it pushes (shifts) the token onto the stack. When the symbols on top of the stack form the right-hand side of some production, it reduces them to the production's left-hand-side symbol

A bottom-up parse corresponds to building the parse tree of the input string from the bottom up: it starts at the leaves and works progressively upward to the root through shift and reduce operations

Bottom-up parsing needs a stack to store parse symbols. For example, given the following grammar:

0.  statement -> expr
1.  expr -> expr + factor
2.           | factor
3.  factor ->  ( expr )
4.           | NUM

Parsing 1 + 2:

stack           input    action
(empty)         1 + 2
NUM             + 2      read a character and push the corresponding token onto the stack: a shift
factor          + 2      apply factor -> NUM: pop NUM, push factor: a reduce
expr            + 2      another reduce, by expr -> factor; because expr has two productions, a lookahead is needed to decide between shift and reduce (this is the LA in parsing)
expr +          2        shift
expr + NUM      (empty)  shift
expr + factor   (empty)  reduce by factor -> NUM
expr            (empty)  reduce by expr -> expr + factor
statement       (empty)  reduce by statement -> expr

At this point the input has been reduced to the start symbol and the input string is empty, which means the parse succeeded
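For this particular grammar the shift/reduce decisions are simple enough to hard-code, which gives a compact illustrative recognizer (my own sketch; a real bottom-up parser would drive these decisions from an LR table instead):

```python
def shift_reduce(tokens):
    """Shift-reduce recognizer for the statement/expr/factor grammar above."""
    stack, i = [], 0
    while True:
        # Reduce while a handle sits on top of the stack.
        if stack[-1:] == ["NUM"]:
            stack[-1] = "factor"              # factor -> NUM
        elif stack[-3:] == ["(", "expr", ")"]:
            stack[-3:] = ["factor"]           # factor -> ( expr )
        elif stack[-3:] == ["expr", "+", "factor"]:
            stack[-3:] = ["expr"]             # expr -> expr + factor
        elif stack[-1:] == ["factor"] and stack[-2:-1] != ["+"]:
            stack[-1] = "expr"                # expr -> factor, but not when the
                                              # factor completes expr + factor
        elif i < len(tokens):
            stack.append(tokens[i]); i += 1   # no handle: shift the next token
        else:
            break
    return stack == ["expr"]                  # statement -> expr: accept
```

The `stack[-2:-1] != ["+"]` check is the hard-coded stand-in for lookahead: it prevents reducing the factor in `expr + factor` prematurely, the same shift-vs-reduce decision called out in the trace above.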

Constructing the finite state automaton

0.  s -> e
1.  e -> e + t
2.  e -> t
3.  t -> t * f
4.  t -> f
5.  f -> ( e )
6.  f -> NUM
  • Perform the closure operation on the starting production

Place a dot (.) at the start of the right-hand side of the initial production:

s -> . e

Then take the closure over the symbol to the right of the dot: if that symbol is a nonterminal, there must be productions with it on the left-hand side, so add all of those productions (each with the dot at the start):

s -> . e
e -> . e + t
e -> . t

Repeat this operation on the newly added productions until every production whose left-hand nonterminal appears to the right of some dot has been brought in

  • Partition the introduced productions

Group into one partition all items whose dot is followed by the same symbol; for example, e -> . t and t -> . t * f both have t after the dot, so they belong to the same partition. Then, for each partition, move the dot one symbol to the right in every item, forming a new state node:

e -> t .
t -> t . * f

  • Build the jump relationships between all the partition nodes

For each node, the symbol the dot moved over determines which node the automaton jumps to on that input symbol

For example, if the symbol moved over is t, then when the state machine is in state 0 and reads t, it jumps to state 1

  • Repeat the construction for all newly generated nodes

Finally, repeat the construction for each newly generated node, building states and jumps until all states and all transitions are complete
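The closure step at the heart of this construction can be sketched like so (items are (head, body, dot-position) triples; the encoding is my own, but the grammar is the one listed above):

```python
def closure(items, grammar):
    """LR(0) closure: whenever the dot stands before a nonterminal, pull in
    all of that nonterminal's productions with the dot at position 0."""
    result = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(result):
            if dot < len(body) and body[dot] in grammar:  # dot before nonterminal
                for prod in grammar[body[dot]]:
                    item = (body[dot], prod, 0)
                    if item not in result:
                        result.add(item)
                        changed = True
    return result

# The grammar from the article, as head -> list of right-hand sides.
grammar = {
    "s": [("e",)],
    "e": [("e", "+", "t"), ("t",)],
    "t": [("t", "*", "f"), ("f",)],
    "f": [("(", "e", ")"), ("NUM",)],
}
```

Taking the closure of the start item s -> . e pulls in all seven dotted productions derived in the walkthrough above.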

Summary

This post has mainly been a quick pass over lexical and syntax analysis, since the goal is to combine language design with practice; for more detail, see the earlier write-a-compiler series


Origin www.cnblogs.com/secoding/p/11919712.html