Syntax analysis: bottom-up

  This chapter is the sister chapter of the previous one, and I expected the two to be closely related. Taken as a whole, though, they turn out to be fairly independent: they are two methods that use different ideas to achieve the same effect.

  I still have to complain a little~ This chapter feels like a lot of content. But once sorted out, it falls into separate pieces. Let's start!

  Before officially starting, let me lay out the main line of the content. It is normal not to understand it at first; hopefully it makes sense after reading the whole article.

  Main line: The main idea of bottom-up analysis is shift-reduce. The key to shift-reduce is choosing a suitable substring to reduce and deciding which non-terminal symbol to reduce it to. Operator precedence analysis tells you when to perform a reduction, and the LR analysis method tells you which production to use for the reduction. LR analysis comes in several kinds: LR(0), SLR(1), LALR(1) and so on. Each one exists to improve on some aspect of the previous method.

  If this main line seems obscure, please read on!

  1. Basic issues of bottom-up analysis

  Reduction

  Reduction is mentioned in many places. In syntax analysis it can be understood simply as going "from right to left" in a production (of course this is just my understanding). Let's look at the formal definition:

Insert picture description here
  From the definition, a reduction is the action of replacing the right part of a production with its left part. In terms of symbol types, the left part must be a non-terminal symbol, while the right part may contain both terminal and non-terminal symbols. Keep reducing (under the right circumstances) and you finally arrive at a single non-terminal symbol, the start symbol.

  Let me give an example to help understand the entire process of reduction.

Insert picture description here

  The actions in this example are "shift" and "reduce". "Reduce" is the reduction described above. As for "shift": syntax analysis uses a stack (last in, first out) as its data structure, and "shift" means pushing the next input symbol onto the stack.

  From this example, it is enough to understand two things:

    (1) A reduction replaces one or more symbols at the top of the stack with the left part of some production whose right part they match

    (2) Reduction ends at the start symbol.
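  The two points above can be sketched in Python by replaying a fixed sequence of shift and reduce actions on a stack. The grammar (S -> aAcBe, A -> b | Ab, B -> d), the input abbcde, and the action list are assumptions chosen for illustration; they are not the example from the figure.

```python
# Hypothetical grammar: S -> aAcBe, A -> b | Ab, B -> d
def replay(actions, tokens):
    """Replay shift/reduce actions and return the stack contents after each step."""
    stack, pos, trace = ["#"], 0, []
    for action in actions:
        if action == "shift":
            stack.append(tokens[pos])      # push the next input symbol
            pos += 1
        else:
            lhs, rhs = action              # reduce with production lhs -> rhs
            if stack[-len(rhs):] != list(rhs):
                raise ValueError(f"top of stack does not match {rhs}")
            del stack[-len(rhs):]          # pop the right part ...
            stack.append(lhs)              # ... and push the left part
        trace.append("".join(stack))
    return trace

ACTIONS = ["shift", "shift", ("A", "b"), "shift", ("A", "Ab"),
           "shift", "shift", ("B", "d"), "shift", ("S", "aAcBe")]
TRACE = replay(ACTIONS, "abbcde")
for step in TRACE:
    print(step)
```

  The action list is written by hand here; deciding those actions automatically is exactly what the rest of the chapter is about.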

  Careful readers will surely notice that this process leaves two problems open:

    (1) When should we shift, and when should we reduce?

    (2) When reducing, if several productions match, which one should be chosen?

  (If you spotted these, you are really great. I am the kind who just says "right" to the teacher~) I will answer both questions later, so please be patient.

  Phrase, direct phrase, handle, prime phrase, leftmost prime phrase

  Let's take a look at what the definition says:

Insert picture description here
  Phrases have two characteristics:

    (1) The phrase is derived from some non-terminal symbol A in one or more steps.

    (2) The non-terminal symbol A must itself be derivable from the start symbol (in zero or more steps, which is what the * means)

  But there is a simpler way to find phrases, direct phrases, and handles: use the parse tree.

  With the parse tree method, every non-leaf node (non-terminal symbol) yields a phrase: the leaf nodes of the subtree rooted at that node, written from left to right. A direct phrase comes from a non-leaf node whose children are all leaf nodes. The handle is the leftmost direct phrase. The relationship among the three is shown below:

Insert picture description here
  To find the phrases, go layer by layer: starting from the root node, find the phrase for each non-terminal symbol.

  Here is an example, the grammar is as follows:

Insert picture description here
  The sentential form is i1*i2+i3

  Draw the parse tree:
Insert picture description here
  Analysis: find the phrase (the combination of leaf nodes) for each non-leaf node. For E (row 1) it is i1*i2+i3; for E (row 2) it is i1*i2 and for T (row 2) it is i3. For T (row 3) it is i1*i2 and for F (row 3) it is i3; for T (row 4) it is i1 and for F (row 4) it is i2; for F (row 5) it is i1. After removing duplicates, the phrases are i1*i2+i3, i1*i2, i1, i2, and i3. The nodes directly above the leaves i1, i2, and i3 give the direct phrases i1, i2, and i3, and the handle, the leftmost direct phrase, is i1.

  Answer:
Insert picture description here
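  The parse tree method can also be sketched in code. The tree below is my reconstruction of the tree for i1*i2+i3 from the analysis above, assuming the grammar E -> E+T | T, T -> T*F | F, F -> i; the helper names `node` and `analyse` are made up for this sketch.

```python
def node(symbol, *children):
    """A parse tree node is (symbol, children); a leaf has no children."""
    return (symbol, list(children))

# Assumed parse tree of i1*i2+i3
TREE = node("E",
            node("E", node("T",
                           node("T", node("F", node("i1"))),
                           node("*"),
                           node("F", node("i2")))),
            node("+"),
            node("T", node("F", node("i3"))))

def analyse(tree):
    """Return (phrases, direct phrases, handle) of the sentential form."""
    phrases, direct = [], []

    def walk(current, start):
        symbol, kids = current
        if not kids:                       # leaf: contributes one symbol
            return [symbol]
        leaves = []
        for kid in kids:
            leaves += walk(kid, start + len(leaves))
        text = "".join(leaves)
        if text not in phrases:            # every non-leaf node yields a phrase
            phrases.append(text)
        if all(not kid[1] for kid in kids):
            direct.append((start, text))   # all children are leaves
        return leaves

    walk(tree, 0)
    handle = min(direct)[1]                # leftmost direct phrase
    return phrases, [text for _, text in direct], handle

PHRASES, DIRECT, HANDLE = analyse(TREE)
print(PHRASES, DIRECT, HANDLE)
```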

  A prime phrase is a phrase with two further restrictions: it must contain at least one terminal symbol, and it must be minimal (it contains no other prime phrase apart from itself):
Insert picture description here
  For example:

Insert picture description here
Insert picture description here
  Neither E+T*F nor E+T*F+i meets the minimality requirement.

  The relationship between the leftmost prime phrase and the handle: the handle is not necessarily the leftmost prime phrase (because the handle does not necessarily contain a terminal symbol).

  2. Operator precedence grammar

  Operator grammar

Insert picture description here
  Two features:

    (1) No right part of a production contains two adjacent non-terminal symbols

    (2) No right part of a production is the empty word (ε)
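  These two features are easy to check mechanically. Here is a minimal sketch, assuming a grammar is stored as a dict from each non-terminal to its list of right parts; the two sample grammars are illustrative only.

```python
def is_operator_grammar(grammar):
    """grammar maps each non-terminal to a list of right parts (lists of symbols)."""
    for alternatives in grammar.values():
        for rhs in alternatives:
            if not rhs:                    # empty right part (epsilon)
                return False
            for left, right in zip(rhs, rhs[1:]):
                if left in grammar and right in grammar:
                    return False           # two adjacent non-terminals
    return True

EXPR = {
    "E": [["E", "+", "T"], ["T"]],
    "T": [["T", "*", "F"], ["F"]],
    "F": [["(", "E", ")"], ["i"]],
}
NOT_OPERATOR = {"S": [["A", "B"]], "A": [["a"]], "B": [["b"]]}

print(is_operator_grammar(EXPR), is_operator_grammar(NOT_OPERATOR))
```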

  Let me tell this part in reverse order (the advantage: countless words can be omitted). First, the conclusion: an operator precedence grammar makes it possible to select the right substring to reduce. The formal statement is as follows:

Insert picture description here
  To put it simply, each step compares the precedence of the terminal symbol nearest the top of the stack with the current input symbol. If the stack symbol's precedence is lower than or equal to that of the input symbol, the input symbol is pushed onto the stack; otherwise a reduction can be made. So the question is: how many symbols on the stack should be reduced? To find the reducible string, start from the topmost terminal symbol (if the top of the stack is a non-terminal, take the terminal directly below it) and walk down the stack, comparing each pair of adjacent terminal symbols. While the lower terminal is equal in precedence to the one above it, keep walking; as soon as the lower terminal has lower precedence than the one above it, stop. Everything above that lower terminal, up to and including the top of the stack, is the string to reduce. Repeat this until only the start symbol remains.
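  The loop just described can be sketched as follows. The precedence table `PREC` is a hand-written assumption for the terminals +, *, i and #, and every reduced string is replaced by one generic non-terminal N. The sketch only decides when and what to reduce; it does not check the reduced string against the productions.

```python
# Assumed precedence table: PREC[(a, b)] is the relation between stack terminal a
# and input terminal b ("<" lower, "=" equal, ">" higher).
PREC = {
    ("+", "+"): ">", ("+", "*"): "<", ("+", "i"): "<", ("+", "#"): ">",
    ("*", "+"): ">", ("*", "*"): ">", ("*", "i"): "<", ("*", "#"): ">",
    ("i", "+"): ">", ("i", "*"): ">", ("i", "#"): ">",
    ("#", "+"): "<", ("#", "*"): "<", ("#", "i"): "<", ("#", "#"): "=",
}

def op_precedence_parse(tokens):
    """Return the list of reduced strings, in the order they are reduced."""
    stack, tokens, pos, reduced = ["#"], list(tokens) + ["#"], 0, []
    while True:
        j = len(stack) - 1
        if stack[j] == "N":                # skip a non-terminal on top
            j -= 1
        a, b = stack[j], tokens[pos]
        if a == "#" and b == "#":
            if stack != ["#", "N"]:
                raise SyntaxError("nothing was recognised")
            return reduced
        relation = PREC.get((a, b))
        if relation is None:
            raise SyntaxError(f"no relation between {a!r} and {b!r}")
        if relation in ("<", "="):         # shift
            stack.append(b)
            pos += 1
            continue
        while True:                        # reduce: walk down to the "<" pair
            upper = stack[j]
            j -= 1
            if stack[j] == "N":
                j -= 1
            if PREC.get((stack[j], upper)) == "<":
                break
        reduced.append("".join(stack[j + 1:]))
        del stack[j + 1:]
        stack.append("N")

print(op_precedence_parse(["i", "*", "i", "+", "i"]))
```

  For the input i*i+i the strings are reduced in the order i, i, N*N, i, N+N, so the multiplication is reduced before the addition.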

  With that said, the question becomes: how do we find the precedence relationship between symbols? Of course there is a dedicated method.

  First, the notation for precedence in an operator precedence grammar, which differs from the ordinary greater-than and less-than signs.

   Operator: the operator here refers to a terminal symbol of the grammar. There are three kinds of relationship between terminal symbols a and b:

   a <· b: a has lower precedence than b

   a ·> b: a has higher precedence than b

   a =· b: a has the same precedence as b

   There are two points to note:

    (1) Precedence is compared only between terminal symbols

    (2) Each relation sign contains a dot (don't forget it)

  The precedence of operators is defined from the productions. Let's look at the formal description in terms of productions:

Insert picture description here
  Although the definition above is enough to compare operator precedence, applying it directly is very piecemeal. The following introduces a systematic way to find the operator precedence relationships and to present them as an operator precedence analysis table, that is, a table whose entry at (row, column) is the relationship between the row symbol and the column symbol.

  FIRSTVT(P) and LASTVT(P)

  In fact, FIRSTVT and LASTVT are a refinement of the relationships described above. Both are defined for non-terminal symbols. To construct the operator precedence analysis table, FIRSTVT and LASTVT are needed for every non-terminal symbol. The formal description of how to find them is given below:

Insert picture description here
Insert picture description here
  To put it simply, FIRSTVT(B) collects every terminal symbol that is the first symbol of a right part of B, or the second symbol when the first one is a non-terminal symbol. In addition, if a right part of B begins with a non-terminal A, all the symbols in FIRSTVT(A) are added to FIRSTVT(B).

   Likewise, LASTVT(B) collects every terminal symbol that is the last symbol of a right part of B, or the second-to-last symbol when the last one is a non-terminal symbol. If a right part of B ends with a non-terminal A, all the symbols in LASTVT(A) are added to LASTVT(B).

   There is one more point that needs special attention. Because a stack is used, a '#' must be introduced as an end marker. It also takes part in the precedence comparison (although it does not appear in the grammar).

  The following example illustrates how the operator precedence analysis table is constructed:
Insert picture description here
  At first glance it is hard to know where to start! In fact, the first step in building the operator precedence analysis table is to find FIRSTVT and LASTVT of each non-terminal symbol.

  To save space, I only work out FIRSTVT(P) and FIRSTVT(F), LASTVT(P) and LASTVT(F); the rest are found the same way.

  For FIRSTVT(P), apply the rule that the first symbol is a terminal, or the second symbol is a terminal after a leading non-terminal. The only productions with P on the left are P->(E)|i. The ( in P->(E) qualifies, and since P->i produces a terminal symbol directly, i qualifies too. The final result is FIRSTVT(P)={(,i}. For F, both candidates begin with the non-terminal P, so everything in FIRSTVT(P) is also included in FIRSTVT(F); and F->P⬆F is the case where the first symbol is a non-terminal and the second is a terminal, so ⬆ is added as well. The result is FIRSTVT(F)={(,i,⬆}.

   For LASTVT(P), apply the mirrored rule: the last symbol is a terminal, or the second-to-last symbol is a terminal before a trailing non-terminal. In P->(E)|i, both ) and i qualify. F has two candidates: F->P derives P directly, so LASTVT(P) is included in LASTVT(F); in F->P⬆F the up arrow is second to last and the last symbol is a non-terminal, which meets the condition, so ⬆ is added to LASTVT(F). In the end, LASTVT(P)={),i} and LASTVT(F)={),i,⬆}.

Insert picture description here
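  The same computation can be sketched as a fixed-point iteration. The grammar below is my reading of the example (E -> E+T | T, T -> T*F | F, F -> P⬆F | P, P -> (E) | i), with `^` standing in for ⬆; treat it as an assumption, since the original grammar is only shown in a figure.

```python
# Assumed example grammar; "^" stands in for the up arrow.
GRAMMAR = {
    "E": [["E", "+", "T"], ["T"]],
    "T": [["T", "*", "F"], ["F"]],
    "F": [["P", "^", "F"], ["P"]],
    "P": [["(", "E", ")"], ["i"]],
}

def vt_sets(grammar, last=False):
    """Compute FIRSTVT (or LASTVT when last=True) for every non-terminal."""
    sets = {nonterminal: set() for nonterminal in grammar}
    changed = True
    while changed:                          # repeat until nothing new is added
        changed = False
        for lhs, alternatives in grammar.items():
            for rhs in alternatives:
                seq = rhs[::-1] if last else rhs   # LASTVT reads right to left
                new = set()
                if seq[0] in grammar:              # leading non-terminal A
                    new |= sets[seq[0]]            # inherit its set
                    if len(seq) > 1 and seq[1] not in grammar:
                        new.add(seq[1])            # terminal right after A
                else:
                    new.add(seq[0])                # leading terminal
                if not new <= sets[lhs]:
                    sets[lhs] |= new
                    changed = True
    return sets

FIRSTVT = vt_sets(GRAMMAR)
LASTVT = vt_sets(GRAMMAR, last=True)
print(FIRSTVT["F"], LASTVT["F"])
```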
  After obtaining FIRSTVT and LASTVT, it is time to turn them into the analysis table. The conversion is still a little troublesome!

  The rule is: take a terminal symbol that has a non-terminal symbol on its left or right in some production. The terminal has lower precedence than every symbol in the FIRSTVT of the non-terminal on its right, and every symbol in the LASTVT of the non-terminal on its left has higher precedence than the terminal. When filling in the table, the less-than signs go along the terminal's row and the greater-than signs go down the terminal's column.

  The rule alone is not that easy to follow, so let's look at the example:

Insert picture description here
  In E->E+T, + sits between E and T, so every symbol in LASTVT(E) has higher precedence than +, and + has lower precedence than every symbol in FIRSTVT(T). FIRSTVT(T) is {*,⬆,(,i}, so fill in <· at the row positions (+,xxx). LASTVT(E) is {+,*,⬆,i,)}, so fill in ·> at the column positions (xxx,+). Pay special attention to how # compares with each symbol. The final form of the reduction is #S# (where S is the start symbol), so the entries for # are filled from FIRSTVT(S) and LASTVT(S) in the same way: in the # row, # <· every symbol in FIRSTVT(S); in the # column, every symbol in LASTVT(S) ·> #.

  Note that (E) has the form of the equal relation, that is, ( =· ). The equal relations should be determined first when constructing the operator precedence analysis table. Cells left empty are incomparable pairs; meeting one during analysis can be treated as a syntax error.
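  The filling rules can be sketched as one pass over the productions. The grammar and the FIRSTVT/LASTVT sets are written out by hand for the assumed example grammar (with `^` standing in for ⬆), and the extra rule #E# handles the end marker.

```python
GRAMMAR = {
    "E": [["E", "+", "T"], ["T"]],
    "T": [["T", "*", "F"], ["F"]],
    "F": [["P", "^", "F"], ["P"]],
    "P": [["(", "E", ")"], ["i"]],
}
FIRSTVT = {"E": set("+*^(i"), "T": set("*^(i"), "F": set("^(i"), "P": set("(i")}
LASTVT = {"E": set("+*^)i"), "T": set("*^)i"), "F": set("^)i"), "P": set(")i")}

def precedence_table(grammar, firstvt, lastvt, start):
    table = {}

    def put(a, b, relation):
        if table.get((a, b), relation) != relation:
            raise ValueError(f"not an operator precedence grammar: {a} ? {b}")
        table[(a, b)] = relation

    rules = [rhs for alternatives in grammar.values() for rhs in alternatives]
    rules.append(["#", start, "#"])         # the end marker takes part too
    for rhs in rules:
        for k in range(len(rhs) - 1):
            x, y = rhs[k], rhs[k + 1]
            if x not in grammar and y not in grammar:
                put(x, y, "=")              # ...ab...
            if x not in grammar and y in grammar:
                for b in firstvt[y]:
                    put(x, b, "<")          # ...aB...: a <. FIRSTVT(B)
                if k + 2 < len(rhs) and rhs[k + 2] not in grammar:
                    put(x, rhs[k + 2], "=") # ...aBb...
            if x in grammar and y not in grammar:
                for a in lastvt[x]:
                    put(a, y, ">")          # ...Bb...: LASTVT(B) .> b
    return table

TABLE = precedence_table(GRAMMAR, FIRSTVT, LASTVT, "E")
print(TABLE[("+", "*")], TABLE[("*", "+")], TABLE[("(", ")")])
```

  Pairs that never receive a relation, such as (i, i), are simply absent from the dict, which matches the empty cells of the table.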

  Precedence functions

  Precedence functions are another way of expressing precedence (they can basically replace the operator precedence analysis table):
Insert picture description here
  So if you know the values of f(a) and g(b), you can infer the precedence relationship. To find the values, first draw the construction graph of the precedence functions: if a ·> b or a =· b, draw an arrow from fa to gb; if a <· b or a =· b, draw an arrow from gb to fa. The graph looks something like this:

Insert picture description here

  The value of f(a) is the number of nodes reachable from node fa (including fa itself); the same goes for g(b).
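  Here is a minimal sketch of this graph construction, using a small hand-written precedence table for the terminals +, *, i and # (an assumption for illustration). The final check matters because not every precedence table has matching precedence functions.

```python
# Assumed precedence table over +, *, i, #
PREC = {
    ("+", "+"): ">", ("+", "*"): "<", ("+", "i"): "<", ("+", "#"): ">",
    ("*", "+"): ">", ("*", "*"): ">", ("*", "i"): "<", ("*", "#"): ">",
    ("i", "+"): ">", ("i", "*"): ">", ("i", "#"): ">",
    ("#", "+"): "<", ("#", "*"): "<", ("#", "i"): "<", ("#", "#"): "=",
}

def precedence_functions(table):
    graph = {}
    for (a, b), relation in table.items():
        fa, gb = ("f", a), ("g", b)
        graph.setdefault(fa, set())
        graph.setdefault(gb, set())
        if relation in (">", "="):
            graph[fa].add(gb)               # arrow from fa to gb
        if relation in ("<", "="):
            graph[gb].add(fa)               # arrow from gb to fa

    def reachable(start):
        seen, todo = {start}, [start]
        while todo:
            for nxt in graph[todo.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
        return len(seen)                    # counts the node itself

    f = {s: reachable(n) for n in graph for kind, s in [n] if kind == "f"}
    g = {s: reachable(n) for n in graph for kind, s in [n] if kind == "g"}
    return f, g

def consistent(table, f, g):
    """Check that f and g reproduce every relation in the table."""
    compare = {"<": lambda x, y: x < y, "=": lambda x, y: x == y,
               ">": lambda x, y: x > y}
    return all(compare[r](f[a], g[b]) for (a, b), r in table.items())

F, G = precedence_functions(PREC)
print(F, G, consistent(PREC, F, G))
```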

  You may find it odd to compute precedence from precedence. In fact, the value of the precedence functions lies in saving space. The analysis table takes up (n+2)*(n+2) space, where n is the number of terminal symbols. With the precedence functions, the result fits in n*2 entries, like this:

![Insert picture description here](https://img-blog.csdnimg.cn/20201027001058636.png#pic_center)

  3. LR analysis method

  The LR analysis method is a family of methods. Let's start with a basic introduction and then use an example to see how the LR analysis method performs bottom-up syntax analysis by reduction!

  The model of LR analysis method is shown in the figure:

Insert picture description here
  The big change from the previous method is that there are now two stacks: one holds states and the other holds grammar symbols (terminal and non-terminal). The LR analysis method consists of a series of push and pop operations, carried out on the input string according to the rules defined in the analysis table. The analysis table is therefore the core of the LR analysis method.

  A few concepts related to LR analysis:

   The LR analysis table has two major parts: ACTION and GOTO (transition). ACTION is indexed by terminal symbols, and GOTO is indexed by non-terminal symbols. Let's take a look at a finished table:

Insert picture description here
  What do the s and r in the cells represent? They correspond to the four actions specified in ACTION:

Insert picture description here
  So in ACTION, the leading letter gives the action and the number after it gives its argument: s stands for shift and r stands for reduce. The number after s is a state number, and the number after r is the number of a production.
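  To make the role of the table concrete, here is a sketch of the driver loop. The tiny grammar (1: S -> (S), 2: S -> x) and its hand-built ACTION and GOTO tables are assumptions for illustration, not the table from the figure; the sketch also assumes no production has an empty right part.

```python
# Assumed grammar: (1) S -> (S)   (2) S -> x
PRODUCTIONS = {1: ("S", 3), 2: ("S", 1)}   # number -> (left part, length of right part)
ACTION = {
    (0, "("): "s2", (0, "x"): "s3",
    (1, "#"): "acc",
    (2, "("): "s2", (2, "x"): "s3",
    (4, ")"): "s5",
}
for terminal in "()x#":                    # LR(0): a reduce state fills its whole row
    ACTION[(3, terminal)] = "r2"
    ACTION[(5, terminal)] = "r1"
GOTO = {(0, "S"): 1, (2, "S"): 4}

def lr_parse(tokens):
    """Return the production numbers used, in order of reduction."""
    states, symbols = [0], ["#"]           # state stack and symbol stack
    tokens, pos, used = list(tokens) + ["#"], 0, []
    while True:
        action = ACTION.get((states[-1], tokens[pos]))
        if action is None:                 # empty cell: error
            raise SyntaxError(f"unexpected {tokens[pos]!r} at position {pos}")
        if action == "acc":
            return used
        number = int(action[1:])
        if action[0] == "s":               # shift: push symbol and new state
            symbols.append(tokens[pos])
            states.append(number)
            pos += 1
        else:                              # reduce: pop the right part, push the left part
            lhs, length = PRODUCTIONS[number]
            del states[-length:]
            del symbols[-length:]
            symbols.append(lhs)
            states.append(GOTO[(states[-1], lhs)])
            used.append(number)

print(lr_parse("((x))"))
```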

  Let's take an example to see how the LR analysis method realizes bottom-up grammatical analysis.

Insert picture description here
  Solution (the PPT is already very detailed, so I will paste the figure directly):
Insert picture description here
  After reading this, I hope you understand the whole LR analysis process. The main work in LR analysis, however, is constructing the analysis table, so let's introduce how to construct it! The course only covers the construction of the LR(0) and SLR(1) analysis tables. The following mainly discusses these two, with a brief mention of LALR(1).

  SLR(1) is constructed on the basis of LR(0), so the foundation of LR(0) should be firmly laid.

  LR(0) analysis table construction

  Before the formal introduction, first understand a few basic concepts.

  Item (put plainly, just add a dot~ at the front, in the middle, or at the end of a right part)

Insert picture description here

  Augmented grammar

Insert picture description here
  In fact, the purpose of the augmented grammar is to make the initial state unique, which is convenient for the later analysis. This is state 0.

  Closure of item set I (CLOSURE(I))

Insert picture description here

  The transition function of item set I:

Insert picture description here
  The following example shows how to use the concepts above to construct an analysis table:

  Example (this grammar has already been augmented, and its start symbol is now S'):

Insert picture description here
  1. List all productions and number them (note that alternatives joined by | are handled separately here: A->a|b is split into A->a and A->b, each with its own number)

  2. List all items:

    S'->·E, S'->E·

    E->·aA, E->a·A, E->aA·

    E->·bB, E->b·B, E->bB·

    A->·cA, A->c·A, A->cA·

    A->·d, A->d·

    B->·cB, B->c·B, B->cB·

    B->·d, B->d·

  3. Draw the DFA diagram according to the item set closure:
Insert picture description here
  A brief explanation of how the DFA diagram is drawn: take the start state as state 0 and first put in the item of the production containing the start symbol, S'->·E. If the dot is directly followed by a non-terminal symbol, also put into state 0 every item that has that non-terminal as its left part and the dot at the leftmost position. Then extend outward: for each symbol that appears directly after a dot (terminal or not), draw an edge labeled with that symbol to a new state, whose first items are the original items with the dot moved one position to the right. If the dot in the new state is again followed by a non-terminal symbol, handle it in the same way as above. Note that several non-terminal symbols may need to be expanded in a chain, such as A->·E, E->·B, B->·C …, so the items of E, B, and C must all be written.
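  The drawing procedure corresponds to the usual closure and transition functions. Below is a sketch for the example grammar as I read it from the item list (S' -> E, E -> aA | bB, A -> cA | d, B -> cB | d). States are numbered in the order they are discovered, which may differ from the numbering in the figure.

```python
# Assumed example grammar, numbered 0..6; an item is (production number, dot position).
PRODS = [("S'", "E"), ("E", "aA"), ("E", "bB"),
         ("A", "cA"), ("A", "d"), ("B", "cB"), ("B", "d")]
NONTERMINALS = {"S'", "E", "A", "B"}

def closure(items):
    result, todo = set(items), list(items)
    while todo:
        prod, dot = todo.pop()
        rhs = PRODS[prod][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for number, (lhs, _) in enumerate(PRODS):
                if lhs == rhs[dot] and (number, 0) not in result:
                    result.add((number, 0))        # add X -> .gamma
                    todo.append((number, 0))
    return frozenset(result)

def goto(items, symbol):
    """Move the dot over `symbol` in every item that allows it, then close."""
    moved = {(prod, dot + 1) for prod, dot in items
             if dot < len(PRODS[prod][1]) and PRODS[prod][1][dot] == symbol}
    return closure(moved)

def canonical_collection():
    states, edges, index = [closure({(0, 0)})], {}, 0
    while index < len(states):
        after_dot = sorted({PRODS[prod][1][dot] for prod, dot in states[index]
                            if dot < len(PRODS[prod][1])})
        for symbol in after_dot:
            target = goto(states[index], symbol)
            if target not in states:
                states.append(target)              # a new DFA state
            edges[(index, symbol)] = states.index(target)
        index += 1
    return states, edges

STATES, EDGES = canonical_collection()
print(len(STATES), EDGES[(0, "E")], EDGES[(0, "a")], EDGES[(0, "b")])
```

  For this grammar the sketch finds 12 states, with E, a and b leading from state 0 to states 1, 2 and 3.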

  Convert DFA diagram into analysis table:

Insert picture description here
  Converting to the analysis table looks troublesome, but in fact you just fill it in according to the DFA diagram. What is introduced here is LR(0) analysis, which has one characteristic: the table contains no conflicts. A state that holds a reduce item therefore fills its entire row of the table with that one reduction.

  First, number all the states of the DFA from 0 and list them along the vertical axis. The horizontal axis has two parts: the first is ACTION, whose columns are all the terminal symbols plus #; the second is GOTO, whose columns are all the non-terminal symbols.

  To fill in the content, pick any state (usually state 0, as here). The edges leaving state 0 are labeled a, b, and E. After a and b the DFA enters states 2 and 3 respectively. In states 2 and 3 every dot still has symbols after it and no dot is at the end, so moving into them is a shift. In the analysis table a shift is written as s plus the next state, such as s2. After the non-terminal symbol E the DFA enters state 1, so fill in 1 at the position of state 0 under the non-terminal symbol E. This is enough for a first pass over most states. Items with the dot at the end are handled differently. In state 1 the dot is at the end and the left part is the start symbol; this is the end of the reduction. Write "acc" at position (state, #), meaning the analysis has finished and the input is correct. In state 10 the dot is at the end and the left part is not the start symbol, so this is an ordinary reduce item. Write r plus a number, where the number is the number of the production. In LR(0) analysis, a reduce item writes its reduction under all the terminal symbols (including #), and the non-terminal columns are left alone. (The reason will be explained later!)
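  The filling rules can be sketched as a function from a DFA to the two table parts. To keep the block short, the DFA here is written out by hand for a smaller, hypothetical grammar (0: S' -> S, 1: S -> (S), 2: S -> x) instead of the example above.

```python
# Assumed grammar and its hand-written LR(0) DFA; an item is (production number, dot position).
PRODS = [("S'", "S"), ("S", "(S)"), ("S", "x")]
TERMINALS = ["(", ")", "x", "#"]
STATES = [
    {(0, 0), (1, 0), (2, 0)},   # state 0: S'->.S   S->.(S)  S->.x
    {(0, 1)},                   # state 1: S'->S.
    {(1, 1), (1, 0), (2, 0)},   # state 2: S->(.S)  S->.(S)  S->.x
    {(2, 1)},                   # state 3: S->x.
    {(1, 2)},                   # state 4: S->(S.)
    {(1, 3)},                   # state 5: S->(S).
]
EDGES = {(0, "S"): 1, (0, "("): 2, (0, "x"): 3,
         (2, "S"): 4, (2, "("): 2, (2, "x"): 3, (4, ")"): 5}

def fill_lr0_table(states, edges, prods, terminals):
    action, goto = {}, {}

    def put(key, value):
        if action.get(key, value) != value:
            raise ValueError(f"conflict at {key}: {action[key]} vs {value}")
        action[key] = value

    for (state, symbol), target in edges.items():
        if symbol in terminals:
            put((state, symbol), f"s{target}")      # edge on a terminal: shift
        else:
            goto[(state, symbol)] = target          # edge on a non-terminal: GOTO
    for number, items in enumerate(states):
        for prod, dot in items:
            if dot == len(prods[prod][1]):          # dot at the end
                if prod == 0:
                    put((number, "#"), "acc")       # S' -> S. accepts on #
                else:
                    for terminal in terminals:      # reduce under every terminal
                        put((number, terminal), f"r{prod}")
    return action, goto

ACTION, GOTO = fill_lr0_table(STATES, EDGES, PRODS, TERMINALS)
print(ACTION[(0, "(")], ACTION[(1, "#")], ACTION[(3, ")")], GOTO[(2, "S")])
```

  The `put` helper raises an error when two different actions land in one cell, which is how a grammar that is not LR(0) would show up.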

  SLR(1) analysis table

  The first question at this point should be: why do we need SLR(1) analysis? To answer it, we must first introduce two simple kinds of conflict, starting from an example!

  Suppose there is a state like this:

Insert picture description here
  Is our next step a reduction? Or a shift? (Both seem reasonable, supported by productions 1 and 2 respectively.) When the computer does not know what to do next, a conflict arises!

  The situation above is called a shift-reduce conflict: a state contains both a shift operation and a reduce operation.

  Don't worry after reading this one; here is the next:

Insert picture description here
  What should be done when this happens in one state? Reduce to A? Or reduce to B? (The computer goes off to cry in the toilet again.) This is the second type of conflict: the reduce-reduce conflict.

   There is a problem, but don't worry. Everything has its nemesis, and the way to overcome these two "monsters" is SLR(1) analysis.

  SLR(1) analysis: Simple LR

  Simple LR analysis, to put it bluntly, uses the FOLLOW set to decide the next operation more precisely. So why does the FOLLOW set work?

  Let's first look at how the FOLLOW set resolves a shift-reduce conflict. In fact, we compare against the current input symbol and choose the operation according to which condition the input satisfies. If the next symbol is in FOLLOW(A), where A is the left part of the reduce item, then that symbol can legally follow A, and the reduction can be performed. What about the other extreme? The symbol is in the FOLLOW set and also happens to be the symbol expected by the shift operation. That case is more complicated and is beyond the scope of SLR(1) analysis. So whether one state ever has two operations for the same input symbol also serves as the criterion for whether a grammar is SLR(1).

  Reduce-reduce conflicts are resolved just as easily: if the next input symbol is in the FOLLOW set of a production's left part, use the corresponding reduction. The special case: what if it is in both FOLLOW sets? The same answer applies: there are more advanced methods later!

  Compared with LR(0) analysis, SLR(1) analysis is more complicated, and the rule for filling the analysis table is stricter. In LR(0), a reduction is written under all the terminal symbols; in SLR(1), it is written only under the terminal symbols in the FOLLOW set of the production's left part.

  It looks roughly like this~:

Insert picture description here
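  The SLR(1) rule for a single state can be sketched like this. The state {E -> T·, T -> T·*F} and FOLLOW(E) = {+, ), #} come from the usual expression grammar E -> E+T | T, T -> T*F | F, F -> (E) | i, which is an assumption here and not the grammar in the figure; the second, failing example uses made-up FOLLOW sets.

```python
def slr_actions(shift_symbols, reduce_items, follow):
    """Decide the action for each lookahead in one state.

    shift_symbols: terminals that appear right after a dot in the state
    reduce_items:  (production label, left part) for items with the dot at the end
    follow:        FOLLOW set of each left part
    """
    actions = {}
    for terminal in shift_symbols:
        actions[terminal] = "shift"
    for label, lhs in reduce_items:
        for terminal in follow[lhs]:        # reduce only under FOLLOW(left part)
            if terminal in actions:
                raise ValueError(
                    f"not SLR(1): {actions[terminal]} vs reduce {label} on {terminal!r}")
            actions[terminal] = f"reduce {label}"
    return actions

# State {E -> T. , T -> T.*F}: an LR(0) shift-reduce conflict that SLR(1) resolves.
RESOLVED = slr_actions({"*"}, [("E->T", "E")], {"E": {"+", ")", "#"}})
print(RESOLVED)

# Hypothetical reduce-reduce conflict that FOLLOW cannot resolve: 'c' is in both sets.
try:
    slr_actions(set(), [("A->a", "A"), ("B->a", "B")],
                {"A": {"b", "c"}, "B": {"c", "d"}})
    UNRESOLVED = False
except ValueError:
    UNRESOLVED = True
```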
  I wanted to cover the more complicated LR analysis methods too, but the thought of writing that much is daunting. I'll be lazy for now and add them later if I think of it!

  Thanks to Mr. Xie for his patient guidance in the Compilation Principles course!

Due to the limited level of the author, if there is any error, please correct me in the comment section below, thank you!

Origin blog.csdn.net/gls_nuaa/article/details/109301983