[Compilation Principle] Syntax Analysis (1)


The lexical analyzer converts the source program into a sequence of lexemes: it tells us that the character sequence 'i', 'f' is the keyword "if", that the character sequence '1', '2', '3', '4' is the constant "1234", and so on. However, the work of the lexical analyzer ends there; it cannot explain the relationships between lexemes. For example, for the lexeme string "int", "x", "=", "1", ";", the lexical analyzer does not know that it forms a statement; for the lexeme string "int", "x", "==", "1", ";", it cannot detect the syntax error. For this reason, lexical analysis must be followed by syntax analysis.

The role of the parser

We all know that there is a specific set of rules to follow when writing programs in a given programming language. For example, in the C language, a program consists of multiple functions, a function consists of declarations and statements, a statement consists of expressions, and so on. This set of rules precisely describes the syntax of well-formed programs. A parser can determine the grammatical structure of a source program, detect syntax errors in the source program, and recover from common errors so that it can continue processing the rest of the program.

A parser takes a sequence of lexemes from a lexer and verifies that the sequence can be generated from the grammar of the source language. The parser will construct a parse tree and pass it to other parts of the compiler for further processing. In the process of building the parse tree, it verifies whether the sequence of lexemes conforms to the grammar of the source language. The location of the parser in the compiler is as follows:

[Figure: the parser's position within the compiler]

grammar

A grammar is used to systematically describe the constructs of programming languages. A properly designed grammar gives the structure of a language that helps translate source programs into correct object code and also helps detect errors.

context-free grammar

A context-free grammar (hereafter referred to as the grammar) consists of terminal symbols, nonterminal symbols, a start symbol, and a set of productions:

[Figure: the four components of a context-free grammar]

Two or more productions with the same head can be combined into a single production by joining their bodies with "|". For example, the productions E→E+T and E→T can be written as E→E+T|T.

An expression grammar used throughout the text

Before getting into the subject, let's give a simple expression grammar, call it grammar G, which will serve as an example throughout:

    E→E+T|T
    T→T*F|F
    F→(E)|id

Grammar G has three productions. The symbols "+", "*", "(", ")", and "id" are terminal symbols, and the symbols "E", "T", and "F" are nonterminal symbols; the symbol E is the start symbol, from which every sentence of the language generated by grammar G is derived.
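To make these pieces concrete, grammar G can be written down as a small data structure. The sketch below is only an illustration in Python (the name GRAMMAR_G and the field names are my own choices, not anything from the text):

    # A sketch of grammar G as plain Python data; names are illustrative only.
    GRAMMAR_G = {
        "terminals": {"+", "*", "(", ")", "id"},
        "nonterminals": {"E", "T", "F"},
        "start": "E",
        # Each nonterminal maps to its alternatives: E→E+T|T becomes two bodies.
        "productions": {
            "E": [["E", "+", "T"], ["T"]],
            "T": [["T", "*", "F"], ["F"]],
            "F": [["(", "E", ")"], ["id"]],
        },
    }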

Conventions about symbols

Because many symbols will be used in what follows, we adopt the following conventions to reduce redundancy and aid understanding:

  • Capital letters, such as A, B, C, represent nonterminal symbols;
  • Greek letters, such as α, β, γ, represent strings (possibly empty) made up of terminal and nonterminal symbols.

Derivation and Reduction

When a parser builds a parse tree, the commonly used methods fall into two classes: top-down and bottom-up. As the names suggest, a top-down method constructs the parse tree from the root node down to the leaf nodes, while a bottom-up method constructs it from the leaf nodes up to the root node. During top-down construction, a subtree is "derived" from a non-leaf node; during bottom-up construction, several sibling nodes are "reduced" into their parent node.

Derivation

Viewed in terms of productions, a derivation step replaces a nonterminal, the head of some production, with that production's body.

[Figure: formal definition of a derivation step]

For grammar G, a leftmost derivation and rightmost derivation from E to id*id+id are:

[Figure: a leftmost derivation and a rightmost derivation of id*id+id from E]

Here, id*id+id is a sentence of grammar G, and the intermediate results are sentential forms of grammar G. It can also be seen that the derivation leading to the same sentential form is not unique.
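To see a leftmost derivation mechanically, the following sketch (reusing the illustrative GRAMMAR_G structure from above) repeatedly replaces the leftmost nonterminal with a chosen alternative and prints each sentential form, ending in the sentence id*id+id:

    def leftmost_step(form, grammar, choice):
        """Replace the leftmost nonterminal in `form` with its chosen alternative."""
        for i, sym in enumerate(form):
            if sym in grammar["nonterminals"]:
                return form[:i] + grammar["productions"][sym][choice] + form[i + 1:]
        return form  # no nonterminal left: `form` is already a sentence

    # Alternative indices for E =>* id*id+id (0 = first alternative, 1 = second).
    choices = [0, 1, 0, 1, 1, 1, 1, 1]
    form = ["E"]
    print(" ".join(form))
    for c in choices:
        form = leftmost_step(form, GRAMMAR_G, c)
        print("=> " + " ".join(form))

The printed forms run E, E+T, T+T, T*F+T, F*F+T, id*F+T, id*id+T, id*id+F, id*id+id, which spells out the leftmost derivation of id*id+id.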

A derivation can be represented by a parse tree. The construction of the parse tree for the leftmost derivation of id*id+id from E is shown in the following figure:

[Figure: step-by-step construction of the parse tree for the leftmost derivation of id*id+id]

Each parse tree corresponds to a derivation. Reading the leaf nodes of a parse tree from left to right gives a sentential form of grammar G, also known as the yield or frontier of the parse tree. Reading the leaf nodes of the final parse tree from left to right gives a sentence of grammar G.

Reduction

Reduction is the inverse process of derivation. A reduction replaces a string matching a production body with the nonterminal of the production head.

The reduction procedure can be used to construct a parse tree bottom-up. For example, when reducing id*id+id to E, each reduction is the reverse of a step in a rightmost derivation, and the parse tree constructed is shown in the following figure:

[Figure: bottom-up construction of the parse tree while reducing id*id+id to E]
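Continuing the illustrative GRAMMAR_G sketch, the reductions that undo the rightmost derivation of id*id+id can be replayed like this (each triple gives the production head plus the position and length of the substring being reduced; this bookkeeping is my own, not notation from the text):

    def reduce_at(form, grammar, head, start, length):
        """Replace form[start:start+length], a body of `head`, by `head` itself."""
        body = form[start:start + length]
        assert body in grammar["productions"][head], "not a valid reduction"
        return form[:start] + [head] + form[start + length:]

    # Reductions that reverse the rightmost derivation of id*id+id.
    steps = [("F", 0, 1), ("T", 0, 1), ("F", 2, 1), ("T", 0, 3),
             ("E", 0, 1), ("F", 2, 1), ("T", 2, 1), ("E", 0, 3)]
    form = ["id", "*", "id", "+", "id"]
    print(" ".join(form))
    for head, start, length in steps:
        form = reduce_at(form, GRAMMAR_G, head, start, length)
        print("=> " + " ".join(form))

The printed forms run id*id+id, F*id+id, T*id+id, T*F+id, T+id, E+id, E+F, E+T, E, each step replacing a production body with its head.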

designing a grammar

This section describes how to transform a grammar to make it more suitable for parsing, including disambiguation, left-recursion elimination, and left-common factor extraction.

disambiguation

A grammar that can generate more than one parse tree for some sentence is ambiguous. Put simply, an ambiguous grammar is one that has more than one leftmost derivation or more than one rightmost derivation for the same sentence.

Consider the following grammar:

    E→E+E
    E→E*E
    E→(E)
    E→id

It has two leftmost derivations for the sentence id+id*id. The following figure shows the two derivation processes and the corresponding parse trees:

[Figure: two leftmost derivations of id+id*id and their parse trees]
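For reference, the two leftmost derivations of id+id*id can be written out as follows (the first corresponds to the left tree, the second to the right tree):

    E ⇒ E+E ⇒ id+E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
    E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id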

These two derivations reflect the question of operator precedence: in the derivation on the left, multiplication has higher precedence than addition, while in the derivation on the right, addition has higher precedence than multiplication. The derivation on the left is the one that matches the usual convention.

This ambiguity can be resolved by rewriting the grammar in the form of grammar G; the two grammars generate the same language.

eliminate left recursion

A grammar is left-recursive if some nonterminal A can, after one or more derivation steps, produce a string of the form Aα. A production of the form A→Aα is said to be immediately left-recursive.

Before introducing the elimination of left recursion in grammars, we first describe how to eliminate immediate left recursion in productions. For any immediately left recursive production, immediate left recursion can be eliminated as follows:

[Figure: eliminating immediate left recursion; A→Aα|β is rewritten as A→βA' and A'→αA'|ε]

The set of strings produced by the two productions on the right is the same as the set of strings produced by the production on the left, and the two productions on the right have no left recursion, thus eliminating the left recursion of the productions.
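As an illustration of this transformation, a small sketch that eliminates immediate left recursion for a single nonterminal might look like the following (function and variable names are my own; the empty list stands for ε):

    def eliminate_immediate_left_recursion(head, alternatives):
        """Rewrite A→Aα1|...|Aαm|β1|...|βn as A→β1A'|...|βnA' and A'→α1A'|...|αmA'|ε."""
        recursive = [alt[1:] for alt in alternatives if alt and alt[0] == head]
        others = [alt for alt in alternatives if not alt or alt[0] != head]
        if not recursive:
            return {head: alternatives}  # no immediate left recursion to remove
        new_head = head + "'"
        return {
            head: [beta + [new_head] for beta in others],
            new_head: [alpha + [new_head] for alpha in recursive] + [[]],  # [] is ε
        }

    # E→E+T|T becomes E→TE' and E'→+TE'|ε:
    print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))

Running it on E→E+T|T yields {'E': [['T', "E'"]], "E'": [['+', 'T', "E'"], []]}, which matches the worked example below.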

We now describe how to eliminate left recursion from a left-recursive grammar. The procedure is as follows:

[Figure: the general algorithm for eliminating left recursion from a grammar]

For grammar G, its left recursion is eliminated as follows:

  1. Order the nonterminals as E, T, F;
  2. When i=1, for the nonterminal E, the production E→E+T|T is immediately left-recursive; replace it with E→TE' and E'→+TE'|ε;
  3. When i=2, for the nonterminal T, the production T→T*F|F does not contain E, but it is immediately left-recursive; replace it with T→FT' and T'→*FT'|ε;
  4. When i=3, for the nonterminal F, the production F→(E)|id contains E; after substituting TE' for E it becomes F→(TE')|id, which is not immediately left-recursive, so the algorithm ends;
  5. The resulting non-left-recursive grammar is shown below (a recursive-descent sketch for it follows after this list):
    E→TE'
    E'→+TE'|ε
    T→FT'
    T'→*FT'|ε
    F→(TE')|id
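Why go to this trouble? A grammar without left recursion can be expanded top-down from the start symbol without looping forever. Purely as a preview, here is a minimal recursive-descent recognizer for the transformed grammar, written as an illustrative sketch (the parse function and the token-list input format, e.g. ["id", "*", "id", "+", "id"], are my own assumptions):

    def parse(tokens):
        """Recognizer for E→TE', E'→+TE'|ε, T→FT', T'→*FT'|ε, F→(TE')|id."""
        pos = 0

        def peek():
            return tokens[pos] if pos < len(tokens) else None

        def expect(tok):
            nonlocal pos
            if peek() != tok:
                raise SyntaxError("expected %r, got %r" % (tok, peek()))
            pos += 1

        def E():                   # E → T E'
            T(); E_prime()

        def E_prime():             # E' → + T E' | ε
            if peek() == "+":
                expect("+"); T(); E_prime()

        def T():                   # T → F T'
            F(); T_prime()

        def T_prime():             # T' → * F T' | ε
            if peek() == "*":
                expect("*"); F(); T_prime()

        def F():                   # F → ( T E' ) | id
            if peek() == "(":
                expect("("); E(); expect(")")
            else:
                expect("id")

        E()
        if pos != len(tokens):
            raise SyntaxError("unexpected trailing input")
        return True

    print(parse(["id", "*", "id", "+", "id"]))  # True: the sentence is in the language

Each nonterminal becomes one function, and the absence of left recursion is exactly what keeps these functions from calling themselves before consuming any input.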

Extract the left common factor

When constructing a parse tree top-down, if there is more than one production to choose from when "expanding" a nonterminal node, it is not clear which production should be used. To solve this problem, we can postpone the decision by rewriting the productions, and then make the correct choice once enough of the input has been seen.

For example, consider the productions A→+α|+β, where α and β begin with different symbols. On seeing the input "+", we cannot decide whether to replace A with +α or with +β; only by reading further input can we determine which production to use. To handle this, A→+α|+β can be rewritten as A→+A' and A'→α|β, so that on seeing the input "+" we first replace A with +A', and then decide how to expand A' according to the input that follows.

To left-factor a grammar: for each nonterminal A, find the longest prefix α common to two or more of its alternatives. If α is not the empty string, replace all productions whose head is A as follows:

[Figure: left factoring; A→αβ1|αβ2|…|αβn|γ is rewritten as A→αA'|γ and A'→β1|β2|…|βn]

Here γ stands for all the alternatives that do not begin with α, and A' is a new nonterminal. This transformation is applied repeatedly until no two alternatives of any nonterminal share a common prefix.
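A rough sketch of one round of this extraction for a single nonterminal (again with names of my own choosing, and the empty list standing for ε) could look like this:

    def left_factor_once(head, alternatives):
        """Pull out the longest prefix shared by two or more alternatives of `head`."""
        best = []
        for i in range(len(alternatives)):
            for j in range(i + 1, len(alternatives)):
                k = 0
                while (k < len(alternatives[i]) and k < len(alternatives[j])
                       and alternatives[i][k] == alternatives[j][k]):
                    k += 1
                if k > len(best):
                    best = alternatives[i][:k]
        if not best:
            return {head: alternatives}  # no common prefix: nothing to factor
        new_head = head + "'"
        factored = [alt[len(best):] for alt in alternatives if alt[:len(best)] == best]
        rest = [alt for alt in alternatives if alt[:len(best)] != best]
        return {head: rest + [best + [new_head]], new_head: factored}  # [] in factored is ε

    # A→+a|+b becomes A→+A' and A'→a|b:
    print(left_factor_once("A", [["+", "a"], ["+", "b"]]))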
