Compiler Principles: Determining Whether a Grammar Is LL(1) (the simplest and most detailed introduction)

Basic Concepts

The status of syntax analysis: syntax analysis is one of the core functions of a compiler.

The role of syntax analysis: to determine whether the string of word symbols (tokens) produced by lexical analysis is a correct sentence of the given grammar.

Syntax analysis methods: these fall into two categories, top-down analysis and bottom-up analysis.

Top-down analysis: starting from the start symbol of the grammar, derive a sentence that exactly matches the input word symbols.

Deterministic and non-deterministic analysis:

Deterministic analysis: the production used in each derivation step is uniquely determined.
Non-deterministic analysis: more than one production may be applicable in a derivation step.

Start symbol set FIRST:

Applies to: the start symbol set is defined for a given string of grammar symbols.
Set elements: all terminals that can appear at the beginning of the strings derivable from the given string.
Special case: if the string can derive the empty string ε, then ε is also an element of the start symbol set.

Follow symbol set FOLLOW:

Applies to: the follow symbol set is defined for a given non-terminal.
Set elements: all terminals that can appear immediately after the non-terminal in some sentential form.
Special case: if the non-terminal can appear at the end of a sentential form, then the terminator # (the end-of-input marker) is also an element of the follow symbol set.

Selection symbol set SELECT:

Applies to: the selection symbol set is defined for a given production.
Set elements: the input symbols on which the given production may be chosen for the next derivation step.

LL(1) grammar:

Feature of LL(1) grammars: they can be parsed with a deterministic top-down technique.
Meaning of the name: the first L means the input string is scanned from left to right; the second L means the analysis builds a leftmost derivation; 1 means that looking ahead at one input symbol is enough to decide which production to use in the next derivation step.
Necessary and sufficient conditions:
- First check that the grammar is context-free: the left part of every production must be a single non-terminal.
- Group the productions of the grammar into classes so that productions with the same left part fall into the same class.
- Within each class, check whether the SELECT sets of the productions intersect. The grammar is LL(1) if and only if no two productions in the same class have intersecting SELECT sets.
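
The last condition can also be written as a formula. This is only a restatement of the bullet above; the symbol P for the set of productions of the grammar is introduced here for convenience and does not appear elsewhere in the article.

```latex
\forall\, (A \to \alpha),\ (A \to \beta) \in P \text{ with } \alpha \neq \beta:\qquad
\mathrm{SELECT}(A \to \alpha) \,\cap\, \mathrm{SELECT}(A \to \beta) = \varnothing
```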

Solving Algorithm

FIRST Set Solving

The FIRST set is for a string and can be solved in the following way. In the initial state, the FIRST set is empty.

If the string starts with a terminal, put that terminal into the FIRST set; if the string is the empty string, put ε into the FIRST set.
If the string starts with a non-terminal, replace that non-terminal with the right part of each production that has it as its left part, which yields several new productions to process.
If there are unprocessed productions, repeat the above steps until all of them have been processed.

The following is an example to illustrate how to solve the FIRST set.

Take the first production of the example grammar, S→AB (the full grammar is listed in the LL(1) Grammar Determination section below). We need the FIRST set of its right part AB.

First, the string AB begins with the non-terminal A, so A has to be replaced using the productions with A as the left part. In this grammar those are A→ε and A→b, so the substitution gives S→B and S→bB.

For S→bB, it starts with the terminal b, so put b into the FIRST set.

For S→B, the right part starts with the non-terminal B, so the productions with B as the left part are used for substitution. In this grammar those are B→ε and B→aD, so the substitution gives S→ε and S→aD.

For S→ε, it deduces the empty string, so put ε in the FIRST set.

For S→aD, it starts with the terminal a, so put a into the FIRST set.

At this point there are no unprocessed productions, so the calculation ends with FIRST(AB) = {b, ε, a}.
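
The substitution procedure traced above can be sketched in Python. This is a minimal sketch, not a definitive implementation: the names `GRAMMAR`, `EPS` and `first_of_string` are made up for this illustration, every grammar symbol is assumed to be a single character, an empty right part stands for ε, and the grammar is assumed to have no left recursion (otherwise the substitution would never finish).

```python
EPS = "ε"

# The example grammar of this article; an empty right part stands for ε.
GRAMMAR = {
    "S": ["AB", "bC"],
    "A": ["", "b"],
    "B": ["", "aD"],
    "C": ["AD", "b"],
    "D": ["aS", "c"],
}


def first_of_string(string, grammar):
    """Compute FIRST(string) by repeatedly substituting a leading non-terminal."""
    first = set()
    pending = [string]  # strings that still have to be examined
    seen = set()        # strings already examined, to avoid repeated work
    while pending:
        current = pending.pop()
        if current in seen:
            continue
        seen.add(current)
        if current == "":
            first.add(EPS)             # the string is empty
        elif current[0] not in grammar:
            first.add(current[0])      # the string starts with a terminal
        else:
            # The string starts with a non-terminal: substitute each of its
            # right parts and examine the resulting strings later.
            for rhs in grammar[current[0]]:
                pending.append(rhs + current[1:])
    return first


print(sorted(first_of_string("AB", GRAMMAR)))
```

Calling `first_of_string("AB", GRAMMAR)` visits the strings B, bB, aD and the empty string, which correspond to S→B, S→bB, S→aD and S→ε in the trace.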

FOLLOW Set Solving

The FOLLOW set is for a non-terminal symbol, and can be solved in the following way: In the initial state, the FOLLOW set is empty.

If the non-terminal is the start symbol, put the terminator # into its FOLLOW set.
Look at every production whose right part contains this non-terminal:
- If a string stands immediately to the right of the non-terminal in the production, add every element of that string's FIRST set except ε to the FOLLOW set. If that string can also derive the empty string, add every element of the FOLLOW set of the production's left part as well.
- If nothing stands to the right of the non-terminal in the production, add every element of the FOLLOW set of the production's left part to the FOLLOW set of the current non-terminal.

The following uses an example to illustrate how to find the FOLLOW set. The FOLLOW set in the initial state is empty.

For the non-terminal S, since it is the start symbol, put the terminator # into its FOLLOW set. Next, find all productions whose right part contains S; in this grammar there is only D→aS. Since nothing stands to the right of S there, every element of the FOLLOW set of the left part D is added to the FOLLOW set of S.

Next, it is necessary to recursively calculate the FOLLOW set of the non-terminal D.

For the non-terminal D, two productions contain D in their right part, namely B→aD and C→AD. In both of them nothing stands to the right of D, so the FOLLOW set of D is the union of the FOLLOW sets of B and C. We therefore continue recursively with the FOLLOW sets of B and C.

For the non-terminal B, only one production contains B in its right part, namely S→AB. Nothing stands to the right of B, so the FOLLOW set of B is the FOLLOW set of S.

For the non-terminal C, only one production contains C in its right part, namely S→bC. Nothing stands to the right of C, so the FOLLOW set of C is the FOLLOW set of S.

Therefore the FOLLOW set of D equals the FOLLOW set of S. Substituting this back gives FOLLOW(S)=FOLLOW(S)∪{#}, whose smallest solution is FOLLOW(S)={#}, so FOLLOW(S)=FOLLOW(B)=FOLLOW(C)=FOLLOW(D)={#}.

As this shows, computing the FOLLOW set of one non-terminal often requires the FOLLOW sets of other non-terminals.
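
Because the FOLLOW sets depend on one another (here FOLLOW(S) needs FOLLOW(D), which in turn needs FOLLOW(S)), a program cannot simply recurse. One common way around this, used in the sketch below in place of the recursive reading above, is to apply the rules repeatedly until no set grows any more. The names `follow_sets`, `first_of_string`, `GRAMMAR`, `EPS` and `END` are made up for this illustration, symbols are assumed to be single characters, and the grammar is assumed to have no left recursion.

```python
EPS, END = "ε", "#"

# The example grammar of this article; an empty right part stands for ε.
GRAMMAR = {
    "S": ["AB", "bC"],
    "A": ["", "b"],
    "B": ["", "aD"],
    "C": ["AD", "b"],
    "D": ["aS", "c"],
}


def first_of_string(string, grammar):
    """FIRST of a string by substitution (assumes no left recursion)."""
    if string == "":
        return {EPS}
    head, rest = string[0], string[1:]
    if head not in grammar:
        return {head}  # the string starts with a terminal
    result = set()
    for rhs in grammar[head]:
        result |= first_of_string(rhs + rest, grammar)
    return result


def follow_sets(grammar, start):
    """Apply the FOLLOW rules repeatedly until no set changes."""
    follow = {nonterminal: set() for nonterminal in grammar}
    follow[start].add(END)  # the start symbol can be followed by #
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            for rhs in alternatives:
                for i, symbol in enumerate(rhs):
                    if symbol not in grammar:
                        continue  # only non-terminals have FOLLOW sets
                    # FIRST of whatever stands to the right of the symbol;
                    # this is {ε} when nothing stands to its right.
                    first_rest = first_of_string(rhs[i + 1:], grammar)
                    new = first_rest - {EPS}
                    if EPS in first_rest:
                        new |= follow[lhs]
                    if not new <= follow[symbol]:
                        follow[symbol] |= new
                        changed = True
    return follow


for nonterminal, symbols in follow_sets(GRAMMAR, "S").items():
    print(f"FOLLOW({nonterminal}) = {sorted(symbols)}")
```

Treating "nothing to the right" as the empty string lets both bullets of the rule share one code path, since FIRST of the empty string is {ε}.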

SELECT Set Solving

The SELECT set is for a production, which can be solved in the following way:

First, it needs to be divided into two cases: the case where the right side of the production can derive an empty string and the case where an empty string cannot be derived.

1. If the right part of a production cannot derive the empty string, the SELECT set of the production is the FIRST set of its right part.

2. If the right part of a production can derive the empty string, the SELECT set of the production is built from the FIRST set of its right part and the FOLLOW set of its left part: take the union of the two and remove the empty string ε from it.

Solving the SELECT set of a production therefore reduces to finding the FIRST set of a string and the FOLLOW set of a non-terminal, so only the FIRST and FOLLOW sets need to be computed.
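
The two cases can be sketched as a single small function. It is only an illustration: the name `select_set` and its parameters are made up here, and it relies on the fact that the right part can derive the empty string exactly when ε is in its FIRST set. The inputs in the usage lines are the FIRST and FOLLOW sets worked out earlier in this article.

```python
EPS = "ε"


def select_set(first_of_rhs, follow_of_lhs):
    """SELECT of a production from FIRST(right part) and FOLLOW(left part)."""
    if EPS in first_of_rhs:
        # Case 2: the right part can derive ε, so FOLLOW(left part) is needed.
        return (first_of_rhs - {EPS}) | follow_of_lhs
    # Case 1: the right part cannot derive ε.
    return set(first_of_rhs)


# S→AB: FIRST(AB) = {b, ε, a} and FOLLOW(S) = {#}
print(sorted(select_set({"b", EPS, "a"}, {"#"})))
# D→aS: FIRST(aS) = {a}, so FOLLOW(D) plays no role
print(sorted(select_set({"a"}, {"#"})))
```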

LL(1) Grammar Determination

Now for the determination procedure itself. To decide whether a grammar is an LL(1) grammar, the tool to use is the necessary and sufficient condition for LL(1) grammars given above.

Step 1: determine whether the grammar is a context-free grammar. In the example below the left part of every production is a single non-terminal, so the grammar is context-free.

Step 2: group the productions by left part. For example, S→AB and S→bC form one class, A→ε and A→b form another, and so on.

Step 3: compute the SELECT set of every production of the grammar. This is the most involved step of the whole procedure.

Step 4: determine whether the SELECT sets of different productions with the same left part intersect. If no two of them intersect, the grammar is an LL(1) grammar; if any pair intersects, it is not.

Let's illustrate with an example:

Given the following grammar, please determine whether it is an LL(1) grammar.

S→AB
S→bC
A→ε
A→b
B→ε
B→aD
C→AD
C→b
D→aS
D→c
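
For the first two steps of the procedure (the context-free check and the grouping by left part), a short sketch may help. It reads the productions exactly as they are listed above. The function name `group_by_left_part` is made up for this illustration, and the sketch assumes that non-terminals are single upper-case letters, which holds for this grammar but is not a general rule.

```python
from collections import defaultdict

PRODUCTIONS = """
S→AB
S→bC
A→ε
A→b
B→ε
B→aD
C→AD
C→b
D→aS
D→c
"""


def group_by_left_part(text):
    """Parse 'X→rhs' lines and group the right parts by their left part."""
    groups = defaultdict(list)
    for line in text.split():
        lhs, rhs = line.split("→")
        # Context-free check. Assumption of this sketch: a non-terminal is
        # a single upper-case letter.
        if not (len(lhs) == 1 and lhs.isupper()):
            raise ValueError(f"left part {lhs!r} is not a single non-terminal")
        # Store ε as the empty string.
        groups[lhs].append("" if rhs == "ε" else rhs)
    return dict(groups)


GRAMMAR = group_by_left_part(PRODUCTIONS)
for lhs, alternatives in GRAMMAR.items():
    print(lhs, "→", " | ".join(alt or "ε" for alt in alternatives))
```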

Detailed analysis:

Determine whether it is a context-free grammar: on inspection, the left part of every production is a single non-terminal, so it is a context-free grammar.
Classify the productions: S→AB and S→bC form one class, A→ε and A→b form another, and so on.
Compute the SELECT set for each production:
- SELECT(S→AB)={b,a,#}
- SELECT(S→bC)={b}
- SELECT(A→ε)={a,c,#}
- SELECT(A→b)={b}
- SELECT(B→ε)={#}
- SELECT(B→aD)={a}
- SELECT(C→AD)={a,b,c}
- SELECT(C→b)={b}
- SELECT(D→aS)={a}
- SELECT(D→c)={c}
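Determine whether the SELECT sets of the same type of productions intersect: the classes for A, B and D have disjoint SELECT sets, but SELECT(S→AB)∩SELECT(S→bC)={b} and SELECT(C→AD)∩SELECT(C→b)={b}. Since the SELECT sets in the classes for S and C intersect, the grammar is not an LL(1) grammar.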
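
The pairwise comparison in this last step is easy to get wrong by eye, so here is a minimal sketch that performs it mechanically. The SELECT sets are copied from the list above; the names `SELECT` and `ll1_conflicts` are made up for this illustration. Run on these sets, it reports that the two productions for S share b and that the two productions for C share b.

```python
from itertools import combinations

# SELECT sets exactly as listed above, keyed by (left part, right part).
SELECT = {
    ("S", "AB"): {"b", "a", "#"},
    ("S", "bC"): {"b"},
    ("A", "ε"): {"a", "c", "#"},
    ("A", "b"): {"b"},
    ("B", "ε"): {"#"},
    ("B", "aD"): {"a"},
    ("C", "AD"): {"a", "b", "c"},
    ("C", "b"): {"b"},
    ("D", "aS"): {"a"},
    ("D", "c"): {"c"},
}


def ll1_conflicts(select):
    """Return (left part, rhs1, rhs2, shared symbols) for every overlapping pair."""
    conflicts = []
    for (lhs1, rhs1), (lhs2, rhs2) in combinations(select, 2):
        if lhs1 != lhs2:
            continue  # only productions with the same left part are compared
        shared = select[(lhs1, rhs1)] & select[(lhs2, rhs2)]
        if shared:
            conflicts.append((lhs1, rhs1, rhs2, sorted(shared)))
    return conflicts


for lhs, rhs1, rhs2, shared in ll1_conflicts(SELECT):
    print(f"{lhs}→{rhs1} and {lhs}→{rhs2} share {shared}")
```

An empty result from `ll1_conflicts` would mean the grammar passes the LL(1) test; any entry in the result is enough to fail it.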