[Compilation Principle] Syntax Analysis (3)


Commonly used syntax analysis methods include top-down and bottom-up methods. The top-down syntax analysis method has been introduced in the previous article. This article will introduce the bottom-up syntax analysis method.

Grammar & Conventions

By convention, we give an expression grammar G throughout the text:

    E→E+T|T
    T→T*F|F
    F→(E)|id
  
  

and the notation conventions used:

  • Capital letters: represent non-terminal symbols, such as A, B, C, etc.;
  • Lowercase letters: represent terminal symbols, such as a, b, c, etc.;
  • Greek letters: represent strings or empty strings composed of terminal symbols and non-terminal symbols, such as α, β, γ, ω, etc.;
  • Start symbol: use S to represent the start symbol of the grammar;
  • End marker: use $ as the end marker, for example to mark the end of the input or the bottom of the stack;
  • Empty string: Use ε to represent a string of length 0, that is, an empty string.

Bottom-up parsing

As the name implies, bottom-up parsing corresponds to constructing a parse tree for an input string starting from the leaf nodes and gradually working up to the root node. In contrast to top-down parsing, bottom-up parsing reduces several leaf nodes or intermediate nodes into one intermediate node. Each reduction is the reverse of one step of a rightmost derivation. After a bottom-up parse is complete, a rightmost derivation in reverse has been obtained for the input string.

Handle

For grammar G and the input string id*id, we construct its parse tree using a bottom-up approach:

20171031_img1

If you start from the rightmost parse tree and end with the leftmost parse tree, and connect the root nodes of each parse tree with ⇒ symbols, you get a rightmost derivation of the string id*id:

E ⇒ T ⇒ T*F ⇒ T*id ⇒ F*id ⇒ id*id

That is to say, bottom-up parsing of the string id*id is the process of starting from id*id and arriving at E after repeated reductions, where each reduction is the inverse of one step of a rightmost derivation.

The key problem of bottom-up parsing is to determine which substring to reduce at each step; we call this substring a handle. Formally, if S ⇒ ... ⇒ αAγ ⇒ αβγ is a rightmost derivation, then the production A→β, at the position following α, is a handle of αβγ, and β can be reduced to A. For brevity, we simply say that β is a handle of αβγ.

For the rightmost derivation of the string id*id obtained above, because F*id ⇒ id*id, the first id is a handle of id*id; because T*id ⇒ F*id, F is a handle of F*id; and so on.

Thus, for an input string ω, assuming that one of its rightmost derivations is S ⇒ α1 ⇒ α2 ⇒ ... ⇒ αn ⇒ ω, if we know a handle for every αi (1<=i<=n) and for ω, we can construct the parse tree of ω bottom-up.

Note that a string may have more than one handle, for example when the grammar is ambiguous.
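To make the idea of reducing handles concrete, here is a minimal Python sketch. It assumes the handle positions are already known (they are written by hand in the hypothetical `STEPS` list, following the derivation above) and only checks that replacing each handle by its production head walks from id*id back to E.

```python
# Each step: (sentential form, index where the handle starts, production used).
# The handle positions are hand-written assumptions taken from the derivation
# E => T => T*F => T*id => F*id => id*id read in reverse.
STEPS = [
    (["id", "*", "id"], 0, ("F", ["id"])),
    (["F", "*", "id"], 0, ("T", ["F"])),
    (["T", "*", "id"], 2, ("F", ["id"])),
    (["T", "*", "F"], 0, ("T", ["T", "*", "F"])),
    (["T"], 0, ("E", ["T"])),
]


def reduce_handle(form, start, production):
    """Replace the handle (the production body found at `start`) by the head."""
    head, body = production
    assert form[start:start + len(body)] == body, "handle does not match body"
    return form[:start] + [head] + form[start + len(body):]


def run_reductions(steps):
    """Apply every reduction in order and return all sentential forms seen."""
    forms = [steps[0][0]]
    for form, start, production in steps:
        assert form == forms[-1], "steps must chain together"
        forms.append(reduce_handle(form, start, production))
    return forms


forms = run_reductions(STEPS)
print(" <= ".join("".join(f) for f in forms))
```

Reading the printed chain from right to left gives the rightmost derivation again. Finding the handle positions automatically is the hard part, and this sketch deliberately leaves it out.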

Shift-reduce parsing techniques

Shift-reduce parsing is a general bottom-up parsing technique. It uses a stack to hold grammar symbols and an input buffer to hold the remaining input symbols. With this method, the handle always appears at the top of the stack before it is recognized.

A shift-reduce parser can perform four actions:

  1. Shift: push the next input symbol onto the top of the stack;
  2. Reduce: the right end of the string being reduced must be at the top of the stack. The parser locates the left end of the string within the stack and decides which non-terminal symbol replaces the string;
  3. Accept: declare that parsing has completed successfully;
  4. Error: a syntax error is found and an error recovery routine is called.

For the input string id*id, a shift-reduce parsing process is as follows:

20171031_img2

In the diagram above, the top of the stack is on the right, and every time a handle appears, it is at the top of the stack. Here we do not yet explain when to shift and when to reduce, that is, how to identify a handle. The next section describes a method for finding handles.

Also, note that conflicts may arise during shift-reduce parsing, including shift/reduce conflicts and reduce/reduce conflicts. A shift/reduce conflict occurs when, at some step of shift-reduce parsing, both a shift action and a reduce action are possible. A reduce/reduce conflict occurs when, at some step of shift-reduce parsing, the handle at the top of the stack could be reduced to two or more different production heads.
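The four actions can be sketched in a few lines of Python. This is only an illustration under a strong assumption: the decision of when to shift and when to reduce is supplied by hand in the hypothetical `SCRIPT` list for the input id*id, since handle detection has not been introduced yet. The names `PRODUCTIONS`, `SCRIPT` and `shift_reduce` are made up for this example.

```python
# Productions of grammar G that are needed for the input id*id.
PRODUCTIONS = {
    "E->T": ("E", ["T"]),
    "T->T*F": ("T", ["T", "*", "F"]),
    "T->F": ("T", ["F"]),
    "F->id": ("F", ["id"]),
}

# Hand-written sequence of actions for id*id (an assumption, not computed).
SCRIPT = ["shift", "F->id", "T->F", "shift", "shift",
          "F->id", "T->T*F", "E->T", "accept"]


def shift_reduce(tokens, script):
    """Replay `script` on `tokens`; return (accepted, trace of configurations)."""
    stack, buffer, trace = ["$"], list(tokens) + ["$"], []
    for action in script:
        trace.append(("".join(stack), "".join(buffer), action))
        if action == "shift":
            # Shift: move the next input symbol onto the stack.
            stack.append(buffer.pop(0))
        elif action == "accept":
            # Accept only if the stack holds the start symbol and input is used up.
            return stack == ["$", "E"] and buffer == ["$"], trace
        else:
            # Reduce: the production body must sit on top of the stack.
            head, body = PRODUCTIONS[action]
            assert stack[-len(body):] == body, "handle is not on top of the stack"
            del stack[-len(body):]
            stack.append(head)
    return False, trace


accepted, trace = shift_reduce(["id", "*", "id"], SCRIPT)
for stack_text, input_text, action in trace:
    print(f"{stack_text:<8}{input_text:>8}  {action}")
```

The assertion inside the reduce branch checks the property mentioned above: whenever a reduction is performed, the handle is at the top of the stack.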

Simple LR Technique: SLR

The most popular bottom-up parsers are based on the concept of LR(k) parsing, where "L" means scanning the input from left to right, "R" means constructing a rightmost derivation in reverse, and "k" means looking ahead k input symbols when making parsing decisions. When (k) is omitted, k=1 is assumed.

This section introduces the Simple LR technique (SLR for short), one of the simplest ways to construct a shift-reduce parser. SLR relies on a parsing table made up of the ACTION and GOTO functions; both are derived from an LR(0) automaton, which consists of a set of states and a transition function.

Canonical LR(0) items and LR(0) automata

An LR parser makes shift-reduce decisions by maintaining states that indicate where it is in the parsing process.

An LR(0) item (or item for short) of a grammar is a production of that grammar with a dot at some position in its body. For example, the production A→αβγ has four items: A→·αβγ, A→α·βγ, A→αβ·γ, and A→αβγ·. The item A→·αβγ indicates that we expect to see next in the input a string derivable from αβγ; the item A→α·βγ indicates that we have just seen a string derivable from α and hope to see next a string derivable from βγ; the item A→αβγ· indicates that we have seen a string derivable from αβγ, and it is time to reduce it to A.
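As a small illustration, the following Python sketch enumerates the LR(0) items of one production. The tuple representation `(head, body, dot)` and the helper names `lr0_items` and `show` are assumptions chosen for this example only.

```python
def lr0_items(head, body):
    """Return every LR(0) item of head -> body as (head, body, dot position)."""
    return [(head, tuple(body), dot) for dot in range(len(body) + 1)]


def show(item):
    """Format an item with a dot marking how much of the body has been seen."""
    head, body, dot = item
    return f"{head}→{''.join(body[:dot])}·{''.join(body[dot:])}"


# The production E→E+T of grammar G has a body of length 3, hence 4 items.
items = lr0_items("E", ["E", "+", "T"])
for item in items:
    print(show(item))
```

A production whose body has n symbols always has n+1 items, one for each possible dot position.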

One or more items form an itemset, and a collection of itemsets provides the basis for building a DFA that can be used to make parsing decisions. Such a DFA is called an LR(0) automaton.

Each state of an LR(0) automaton represents an itemset. To determine which items belong to the itemset represented by each state, we need two functions, CLOSURE and GOTO, which are somewhat similar to the ε-closure and move functions used when constructing a DFA from an NFA.

For an itemset I of the grammar, the construction rules of CLOSURE(I) are as follows:

  1. Add all items in I to CLOSURE(I);
  2. If A→α·Bβ is in CLOSURE(I), B→γ is a production, and B→·γ is not in CLOSURE(I), add the item B→·γ to CLOSURE(I). Keep applying this rule until no new items can be added to CLOSURE(I).

For grammar G, the process of computing the CLOSURE set of the itemset { E→·E+T } is as follows:

  1. Add the item E→·E+T to the CLOSURE set;
  2. Since E→T is a production and E→·T is not in the CLOSURE set, it is added to the CLOSURE set;
  3. Since T→T*F|F are productions and neither T→·T*F nor T→·F is in the CLOSURE set, they are added to the CLOSURE set;
  4. Since F→(E)|id are productions and neither F→·(E) nor F→·id is in the CLOSURE set, they are added to the CLOSURE set. At this point, no new items can be added, and the final CLOSURE set is:
    E→·E+T
    E→·T
    T→·T*F
    T→·F
    F→·(E)
    F→·id
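The two CLOSURE rules translate almost directly into a worklist loop. The following is a minimal Python sketch, assuming items are represented as `(head, body, dot)` tuples and the grammar as a dictionary from non-terminal to its bodies; these representations are choices made for this example.

```python
# Grammar G: each non-terminal maps to the bodies of its productions.
GRAMMAR = {
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("id",)],
}


def closure(items):
    """Return CLOSURE(items), where an item is (head, body, dot position)."""
    result = set(items)          # rule 1: every item of I is in CLOSURE(I)
    worklist = list(items)
    while worklist:              # rule 2: repeat until nothing new is added
        head, body, dot = worklist.pop()
        if dot < len(body) and body[dot] in GRAMMAR:
            # The dot is in front of a non-terminal B: add B→·γ for each B→γ.
            for production_body in GRAMMAR[body[dot]]:
                new_item = (body[dot], production_body, 0)
                if new_item not in result:
                    result.add(new_item)
                    worklist.append(new_item)
    return frozenset(result)


closure_set = closure({("E", ("E", "+", "T"), 0)})
for head, body, dot in sorted(closure_set):
    print(f"{head}→{''.join(body[:dot])}·{''.join(body[dot:])}")
```

A `frozenset` is returned so that a whole itemset can later be compared with other itemsets or stored as a dictionary key.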
  
  

For an itemset I of a grammar and a grammar symbol X, the construction rules of GOTO(I, X) are as follows:

  1. For every item A→α·Xβ in I, add the item A→αX·β to GOTO(I, X);
  2. Pass the set obtained in step 1 to the CLOSURE function; the resulting closure is GOTO(I, X).

For the CLOSURE set of the itemset { E→·E+T } obtained above and the symbol "(", the process of computing its GOTO set is as follows:

  1. Add the item F→(·E) to the GOTO set;
  2. Since E→E+T|T are productions and neither E→·E+T nor E→·T is in the GOTO set, they are added to the GOTO set;
  3. Since T→T*F|F are productions and neither T→·T*F nor T→·F is in the GOTO set, they are added to the GOTO set;
  4. Since F→(E)|id are productions and neither F→·(E) nor F→·id is in the GOTO set, they are added to the GOTO set. At this point, no new items can be added, and the final GOTO set is:
    F→(·E)
    E→·E+T
    E→·T
    T→·T*F
    T→·F
    F→·(E)
    F→·id
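GOTO can be sketched on top of CLOSURE in the same style. The Python below is a minimal sketch under the same assumed representation (items as `(head, body, dot)` tuples, the grammar as a dictionary); the `closure` helper is repeated so that the example runs on its own.

```python
GRAMMAR = {
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("id",)],
}


def closure(items):
    """Return CLOSURE(items), where an item is (head, body, dot position)."""
    result, worklist = set(items), list(items)
    while worklist:
        head, body, dot = worklist.pop()
        if dot < len(body) and body[dot] in GRAMMAR:
            for production_body in GRAMMAR[body[dot]]:
                new_item = (body[dot], production_body, 0)
                if new_item not in result:
                    result.add(new_item)
                    worklist.append(new_item)
    return frozenset(result)


def goto(items, symbol):
    """Return GOTO(items, symbol): move the dot over `symbol`, then close."""
    moved = {(head, body, dot + 1)
             for head, body, dot in items
             if dot < len(body) and body[dot] == symbol}
    return closure(moved)


start_set = closure({("E", ("E", "+", "T"), 0)})
goto_set = goto(start_set, "(")
for head, body, dot in sorted(goto_set):
    print(f"{head}→{''.join(body[:dot])}·{''.join(body[dot:])}")
```

If no item of the itemset has the dot in front of the given symbol, the result is the empty set, which corresponds to a missing transition in the automaton.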
  
  

So far, we have seen how to determine each state (CLOSURE function) and transition function (GOTO function) of an LR(0) automaton. In addition, in order to standardize the LR(0) automaton of a grammar, we denote the grammar as an augmented grammar, that is, the grammar obtained by adding a new start symbol S' and the production S'→S to the grammar. The purpose of introducing this new start symbol and production is to tell the parser when it should stop parsing and declare to accept the input string. That is, the input string of symbols is accepted if and only if the parser reduces using S'→S.

The LR(0) automaton of the augmented grammar of grammar G is as follows:

20171031_img3
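The states of such an automaton can be computed by starting from the closure of the augmented start item and applying GOTO until no new itemset appears. The Python sketch below does this for grammar G augmented with E'→E. The item representation, the helper names, and the order in which symbols are tried (which determines the state numbers) are assumptions made for this example.

```python
# Augmented grammar: E' is the new start symbol with the production E'→E.
GRAMMAR = {
    "E'": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("id",)],
}
# Order in which grammar symbols are tried; it fixes the state numbering.
SYMBOLS = ["E", "T", "F", "+", "*", "(", ")", "id"]


def closure(items):
    result, worklist = set(items), list(items)
    while worklist:
        head, body, dot = worklist.pop()
        if dot < len(body) and body[dot] in GRAMMAR:
            for production_body in GRAMMAR[body[dot]]:
                new_item = (body[dot], production_body, 0)
                if new_item not in result:
                    result.add(new_item)
                    worklist.append(new_item)
    return frozenset(result)


def goto(items, symbol):
    return closure({(h, b, d + 1) for h, b, d in items
                    if d < len(b) and b[d] == symbol})


def canonical_collection():
    """Return (list of itemsets, transitions {(state, symbol): state})."""
    states = [closure({("E'", ("E",), 0)})]
    transitions = {}
    i = 0
    while i < len(states):          # states found later are processed too
        for symbol in SYMBOLS:
            target = goto(states[i], symbol)
            if not target:
                continue            # no transition on this symbol
            if target not in states:
                states.append(target)
            transitions[(i, symbol)] = states.index(target)
        i += 1
    return states, transitions


states, transitions = canonical_collection()
print(len(states), "states,", len(transitions), "transitions")
```

With this symbol order, state 0 moves to state 5 on id and to state 3 on F, which agrees with the state numbers used in the worked examples in this article.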

How do LR(0) automata help make shift-reduce decisions? Suppose the string γ takes the LR(0) automaton from the start state 0 to some state j. If the next input symbol is a and state j has a transition on a, then shift a; otherwise perform a reduction, and the items of state j tell us which production to use.

For grammar G and the input string id*id, using the LR(0) automaton of grammar G given above, the shift-reduce parsing process is as follows:

20171031_img4

In fact, in an LR parser, an LR(0) automaton is converted into an LR parsing table, and in the next subsection, we continue to describe how to construct an LR parsing table from an LR(0) automaton.

LR parsing table

The parsing table of an LR parser consists of a parsing action function ACTION and a transition function GOTO:

  • ACTION function: takes two parameters, a state i and a terminal symbol a (which may be the end marker $). ACTION[i, a] takes one of four kinds of values:
    1. Shift. If there is a transition on a from state i to state j, then ACTION[i, a] = shift j;
    2. Reduce. If state i has no transition on a, then ACTION[i, a] = reduce by the production given by the completed item in state i;
    3. Accept. The parser accepts the input and completes the parsing process;
    4. Error. The parser finds an error in its input and performs some corrective action.
  • GOTO function: It is essentially the same as the GOTO function of the itemset, except that the itemset is replaced by the state. That is, if the GOTO function of the itemset has GOTO[Ii, A]=Ij, then the GOTO function of the LR parsing table has GOTO[i, A]=j.

According to an LR(0) automaton, we can immediately obtain the GOTO function of the LR parsing table, but, for the ACTION function, the following rules are applied:

  1. In the LR(0) automaton, if the item A→α·aβ is in the itemset Ii and GOTO[Ii, a]=Ij, then set ACTION[i, a] to "shift j";
  2. In the LR(0) automaton, if the item A→α· is in the itemset Ii, then for every a in FOLLOW(A), set ACTION[i, a] to "reduce by A→α", where A is not S';
  3. In an LR(0) automaton, if the item S'→S· is in the itemset Ii, then set ACTION[i, $] to "accept".

In addition, set all blank ACTION and GOTO to "Error".
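The three rules can be sketched in Python as follows. This is a sketch under stated assumptions, not a complete generator: items are `(production number, dot)` pairs, the FOLLOW sets of grammar G are written in by hand rather than computed, and the state numbering depends on the order in which symbols are tried. All names are hypothetical.

```python
# Production 0 is the augmenting production; 1..6 are the productions of G.
PRODUCTIONS = [("E'", ("E",)), ("E", ("E", "+", "T")), ("E", ("T",)),
               ("T", ("T", "*", "F")), ("T", ("F",)),
               ("F", ("(", "E", ")")), ("F", ("id",))]
NONTERMINALS = {"E", "T", "F"}
SYMBOLS = ["E", "T", "F", "+", "*", "(", ")", "id"]
# FOLLOW sets of grammar G, written by hand (an assumption of this sketch).
FOLLOW = {"E": {"+", ")", "$"},
          "T": {"+", "*", ")", "$"},
          "F": {"+", "*", ")", "$"}}


def closure(items):
    result, worklist = set(items), list(items)
    while worklist:
        p, dot = worklist.pop()
        body = PRODUCTIONS[p][1]
        if dot < len(body):
            for q, (head, _) in enumerate(PRODUCTIONS):
                if head == body[dot] and (q, 0) not in result:
                    result.add((q, 0))
                    worklist.append((q, 0))
    return frozenset(result)


def goto(items, symbol):
    return closure({(p, d + 1) for p, d in items
                    if d < len(PRODUCTIONS[p][1]) and PRODUCTIONS[p][1][d] == symbol})


def build_slr_table():
    states, action, goto_table = [closure({(0, 0)})], {}, {}

    def set_action(state, terminal, value):
        # A second, different value for the same entry would be a conflict.
        assert action.get((state, terminal), value) == value, "conflict"
        action[(state, terminal)] = value

    i = 0
    while i < len(states):
        for symbol in SYMBOLS:
            target = goto(states[i], symbol)
            if not target:
                continue
            if target not in states:
                states.append(target)
            j = states.index(target)
            if symbol in NONTERMINALS:
                goto_table[(i, symbol)] = j          # GOTO entry
            else:
                set_action(i, symbol, f"s{j}")       # rule 1: shift
        for p, dot in states[i]:
            head, body = PRODUCTIONS[p]
            if dot == len(body):
                if p == 0:
                    set_action(i, "$", "acc")        # rule 3: accept
                else:
                    for terminal in FOLLOW[head]:
                        set_action(i, terminal, f"r{p}")  # rule 2: reduce
        i += 1
    return action, goto_table


ACTION, GOTO = build_slr_table()
print(ACTION[(0, "id")], ACTION[(2, "*")], ACTION[(2, "+")], ACTION[(1, "$")])
```

Entries that are absent from the two dictionaries play the role of "Error". The assertion in `set_action` would fail for a grammar that is not SLR, which is how a shift/reduce or reduce/reduce conflict would show up in this sketch.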

PS: If you don't know how to calculate the FOLLOW function, you can browse the previous article, [Compilation Principle] Syntax Analysis (2).

Now try to build an LR parsing table for grammar G. For convenience, we number each production in grammar G:

    (1) E→E+T    (2) E→T
    (3) T→T*F    (4) T→F
    (5) F→(E)    (6) F→id
  
  

And agree on the meaning of each symbol in the ACTION function:

  • si means shift and push state i onto the stack;
  • rj means to reduce according to the production whose sequence number is j;
  • acc means acceptance;
  • Blank indicates an error.

The resulting LR parsing table is as follows:

20171031_img5

PS: For a terminal symbol a, if ACTION[i, a] is a shift to state j, the entry is essentially the GOTO function of the itemsets applied to a terminal, that is, GOTO[Ii, a]=Ij.

To illustrate the use of the LR parsing table, here is an example. Maintain a state stack with state 0 initially on top. If the next input symbol is "id", push state 5 onto the stack. In state 5, if the next input symbol is "*", reduce by the production F→id: pop state 5 from the stack (id is replaced by F), leaving state 0 on top; since state 0 has a transition on F to state 3, push state 3 onto the stack; and so on. The complete procedure for using the LR parsing table is introduced in the next subsection.

LR parsing algorithm

An LR parser consists of an input buffer, a state stack, a parsing table, and a result output, as shown in the following figure:

20171031_img6

The parsing table was introduced in the previous section; here we focus on the state stack. The state stack holds a sequence of states s0s1...sn, where sn is at the top of the stack. Each state si corresponds to a state of the LR(0) automaton, and every state except the initial state has a unique associated grammar symbol. That is, in the LR(0) automaton, if the transition from Ii to Ij is on the symbol X, then the grammar symbol associated with state j is X.

The state of the parser at any moment is completely represented by the state stack and the remaining input, which together essentially describe a sentential form in the reverse rightmost derivation. We write (s0s1…sm, a1a2…an$) for the state of the parser and call it a configuration of the parser, where the first component is the sequence of states on the state stack (sm is the top of the stack) and the second component is the remaining input. If you replace each state in the first component with its associated grammar symbol and concatenate the result with the second component, you obtain a sentential form in the reverse rightmost derivation.

Assume the current configuration of the LR parser is (s0s1…sm, aiai+1…an$). To determine the next action, the parser reads the next input symbol ai and the state sm at the top of the stack, looks up the entry ACTION[sm, ai] in the LR parsing table, and performs the corresponding action:

  1. If ACTION[sm, ai]=shift s, push the state s onto the stack; the configuration becomes (s0s1…sms, ai+1ai+2…an$);
  2. If ACTION[sm, ai]=reduce by A→β, pop r states from the top of the stack (r is the length of β) and push the state s (s=GOTO[sm-r, A]) onto the stack; the configuration becomes (s0s1…sm-rs, aiai+1…an$). Note that the current input symbol does not change when a reduction is performed;
  3. If ACTION[sm, ai]=accept, the parsing process is complete;
  4. If ACTION[sm, ai]=Error, the parser has found a syntax error and calls an error recovery routine.
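The four cases map onto a short driver loop. The Python sketch below is only illustrative: the `ACTION` and `GOTO` tables are typed in by hand as the usual SLR table for grammar G (assumed to use the same state numbers as this article, for example state 5 after shifting id), and the function simply returns the list of production numbers it reduced by instead of building a tree.

```python
# Production number -> (head, length of body), for productions (1)..(6) of G.
PRODUCTIONS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3),
               4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}

# Hand-written SLR table for grammar G (an assumption of this sketch).
ACTION = {
    0: {"id": "s5", "(": "s4"},
    1: {"+": "s6", "$": "acc"},
    2: {"+": "r2", "*": "s7", ")": "r2", "$": "r2"},
    3: {"+": "r4", "*": "r4", ")": "r4", "$": "r4"},
    4: {"id": "s5", "(": "s4"},
    5: {"+": "r6", "*": "r6", ")": "r6", "$": "r6"},
    6: {"id": "s5", "(": "s4"},
    7: {"id": "s5", "(": "s4"},
    8: {"+": "s6", ")": "s11"},
    9: {"+": "r1", "*": "s7", ")": "r1", "$": "r1"},
    10: {"+": "r3", "*": "r3", ")": "r3", "$": "r3"},
    11: {"+": "r5", "*": "r5", ")": "r5", "$": "r5"},
}
GOTO = {0: {"E": 1, "T": 2, "F": 3}, 4: {"E": 8, "T": 2, "F": 3},
        6: {"T": 9, "F": 3}, 7: {"F": 10}}


def lr_parse(tokens):
    """Parse `tokens`; return the production numbers used, in reduction order."""
    stack, buffer, pos, reductions = [0], list(tokens) + ["$"], 0, []
    while True:
        entry = ACTION[stack[-1]].get(buffer[pos])
        if entry is None:                       # case 4: blank entry = error
            raise SyntaxError(f"unexpected {buffer[pos]!r} in state {stack[-1]}")
        if entry == "acc":                      # case 3: accept
            return reductions
        number = int(entry[1:])
        if entry[0] == "s":                     # case 1: shift
            stack.append(number)
            pos += 1
        else:                                   # case 2: reduce by A→β
            head, length = PRODUCTIONS[number]
            del stack[len(stack) - length:]     # pop |β| states
            stack.append(GOTO[stack[-1]][head]) # push GOTO[s(m-r), A]
            reductions.append(number)           # input position is unchanged


print(lr_parse(["id", "*", "id", "+", "id"]))
```

The driver itself never looks at the grammar beyond the head and body length of each production; everything else is encoded in the table, which is why it is called table-driven.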

To sum up, an LR parser, like an LL parser, is table-driven. The only difference between one LR parser and another is the parsing table.

Now, for grammar G and the input string id*id+id, using the LR parsing table obtained above, the LR parsing process is as follows:

20171031_img7

In the table above, the top of the state stack is on the right, and the symbol column lists the grammar symbol associated with each state on the stack (the initial state 0 has no associated grammar symbol). Reading from the last line to the first, concatenating each line's symbols with its remaining input gives a sentential form of grammar G; connecting these sentential forms with ⇒ symbols (and dropping repeated forms) yields a rightmost derivation.
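The link between the reductions of an LR parse and a rightmost derivation can also be checked mechanically. The Python sketch below assumes the sequence of reductions for id*id+id is 6, 4, 6, 3, 2, 6, 4, 1 (production numbers as numbered earlier, written here by hand) and replays it backwards, each time expanding the rightmost non-terminal.

```python
# Productions (1)..(6) of grammar G.
PRODUCTIONS = {
    1: ("E", ["E", "+", "T"]), 2: ("E", ["T"]),
    3: ("T", ["T", "*", "F"]), 4: ("T", ["F"]),
    5: ("F", ["(", "E", ")"]), 6: ("F", ["id"]),
}
NONTERMINALS = {"E", "T", "F"}

# Reductions performed while parsing id*id+id, in order (hand-written assumption).
REDUCTIONS = [6, 4, 6, 3, 2, 6, 4, 1]


def rightmost_derivation(reductions, start="E"):
    """Replay the reductions in reverse to rebuild the rightmost derivation."""
    form = [start]
    forms = [form]
    for number in reversed(reductions):
        head, body = PRODUCTIONS[number]
        # Position of the rightmost non-terminal in the current sentential form.
        index = max(i for i, symbol in enumerate(form) if symbol in NONTERMINALS)
        assert form[index] == head, "reduction does not match rightmost non-terminal"
        form = form[:index] + body + form[index + 1:]
        forms.append(form)
    return forms


derivation = rightmost_derivation(REDUCTIONS)
print(" ⇒ ".join("".join(form) for form in derivation))
```

The assertion holds at every step only because the reductions of an LR parse are exactly the steps of a rightmost derivation taken in reverse order.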
