Write your own compiler: basic principles of syntax parsing

In the previous chapters we completed lexical analysis. The basic task of lexical analysis is to determine whether a given string conforms to specific rules and, if so, to assign it a tag (token). Once lexical analysis is complete, the next stage is syntax parsing, whose task is to determine whether a sequence of tokens conforms to a given specification.

The rules followed by syntax parsing are written in Backus-Naur Form (BNF). Each rule consists of three parts: on the left is a non-terminal symbol, followed by an arrow, and to the right of the arrow is a sequence of terminal or non-terminal symbols. A rule means that the concept on the left can be decomposed into the sequence of concepts on the right. To give a specific example:
person -> head upper body lower body
head -> hair eyes ears nose mouth
upper body -> hands chest stomach
lower body -> butt legs

Let's look at the example above. In the first rule, the left side is the abstract concept "person", and the right side of the arrow lists the components of a person. In other words, "person" can be decomposed into three more specific concepts: head, upper body and lower body. The concept "head" can in turn be decomposed into other concepts, such as hair, eyes and so on. Note that every concept that appears on the left side of an arrow is called a "non-terminal symbol", and every concept that appears only on the right side of an arrow and never on the left is called a "terminal symbol". Non-terminal symbols can be decomposed, but terminal symbols cannot be decomposed any further. The set formed by such a series of rules is called a "grammar". Syntax parsing places special emphasis on "context-free grammars". Formally, a grammar is context-free when the left side of every rule is a single non-terminal symbol, so a rule can be applied no matter what surrounds that symbol. In practice this means the grammar only stipulates how tokens may be combined; it does not care what a combination means.
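The distinction between terminal and non-terminal symbols can be computed mechanically. The following is a minimal sketch, assuming the "person" grammar is stored as a Python dictionary; the name GRAMMAR, the function classify_symbols and the underscore spelling of upper_body and lower_body are illustrative choices, not part of the original text.

```python
# Hypothetical encoding of the "person" grammar: each key is the left side
# of a rule, each value is the ordered list of symbols on the right side.
GRAMMAR = {
    "person": ["head", "upper_body", "lower_body"],
    "head": ["hair", "eyes", "ears", "nose", "mouth"],
    "upper_body": ["hands", "chest", "stomach"],
    "lower_body": ["butt", "legs"],
}


def classify_symbols(grammar):
    """Split the symbols of a grammar into non-terminals and terminals."""
    # Every symbol that appears on the left of an arrow is a non-terminal.
    non_terminals = set(grammar)
    # Collect every symbol that appears on the right of some arrow.
    right_side = {symbol for body in grammar.values() for symbol in body}
    # Terminals appear on the right but never on the left.
    terminals = right_side - non_terminals
    return non_terminals, terminals


non_terminals, terminals = classify_symbols(GRAMMAR)
print(sorted(non_terminals))
print(sorted(terminals))
```

Here "head" ends up among the non-terminals because it has its own rule, while "hair" is a terminal because no rule decomposes it.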

For example:
Sentence -> Subject Predicate Object
The grammar above says that a Chinese sentence can be divided into three parts: subject, predicate and object. However, this decomposition cannot tell us what a specific sentence says; the grammar only cares about the logical construction of sentences, not the meaning they convey. Note also that the order of the concepts on the right side of the arrow is important. The order is an integral part of the grammar rule. For example, a well-formed "head" requires that the nose come after the eyes. If this order is reversed, the "head" is not a human head but an alien head.

The basic process of syntax parsing is as follows: given a string, first perform lexical analysis on it to obtain a sequence of tokens, then check whether that sequence can be decomposed according to the given grammar. Let's look at a specific example. The following grammar describes addition expressions built from numbers and plus signs:
STMT -> EXPR SEMI
EXPR -> FACTOR PLUS EXPR | FACTOR
FACTOR -> NUMBER

Now take the string "1+2;". Lexical analysis of this string yields NUMBER PLUS NUMBER SEMI, and we need to judge whether this sequence obeys the grammar above. Start from the first rule: the last token is SEMI, which matches the SEMI at the end of the right side of the first rule. What remains is to determine whether NUMBER PLUS NUMBER satisfies the rule for EXPR.
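The lexical step mentioned here can be sketched in a few lines. This is only an illustration, not the lexer built in the earlier chapters: the regular expressions, the SKIP pattern for whitespace and the function name tokenize are assumptions; only the token names NUMBER, PLUS and SEMI come from the grammar.

```python
import re

# Assumed token patterns for the addition grammar.
TOKEN_PATTERNS = [
    ("NUMBER", r"\d+"),
    ("PLUS", r"\+"),
    ("SEMI", r";"),
    ("SKIP", r"\s+"),  # whitespace is matched but not reported
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))


def tokenize(text):
    """Turn a source string into a list of token names."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append(match.lastgroup)
        pos = match.end()
    return tokens


print(tokenize("1+2;"))  # ['NUMBER', 'PLUS', 'NUMBER', 'SEMI']
```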

Next, look at the right side of EXPR. The first alternative, FACTOR PLUS EXPR, matches the PLUS in NUMBER PLUS NUMBER, so we must determine whether the first NUMBER satisfies FACTOR and whether the second NUMBER satisfies EXPR.

FACTOR decomposes directly into NUMBER, so the first NUMBER satisfies FACTOR. As for the second NUMBER, EXPR can be decomposed into FACTOR, and FACTOR into NUMBER, so the second NUMBER satisfies EXPR. NUMBER PLUS NUMBER can therefore be derived from EXPR, and the whole token sequence satisfies the given grammar.
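The check walked through above can be sketched as one small function per grammar rule. This is a minimal sketch under stated assumptions: the function names are hypothetical, the input is assumed to be a list of token names, and the code reads tokens from left to right rather than starting from the trailing SEMI as the prose does. It only answers yes or no and builds no tree.

```python
def parse_stmt(tokens):
    """STMT -> EXPR SEMI. Return True if the whole token list is a STMT."""
    pos = parse_expr(tokens, 0)
    # After EXPR, exactly one SEMI must remain.
    return pos is not None and tokens[pos:] == ["SEMI"]


def parse_expr(tokens, pos):
    """EXPR -> FACTOR PLUS EXPR | FACTOR. Return the position after EXPR, or None."""
    pos = parse_factor(tokens, pos)
    if pos is None:
        return None
    # If a PLUS follows the FACTOR, take the first alternative and recurse.
    if pos < len(tokens) and tokens[pos] == "PLUS":
        return parse_expr(tokens, pos + 1)
    # Otherwise the EXPR is a single FACTOR.
    return pos


def parse_factor(tokens, pos):
    """FACTOR -> NUMBER. Return the position after the NUMBER, or None."""
    if pos < len(tokens) and tokens[pos] == "NUMBER":
        return pos + 1
    return None


print(parse_stmt(["NUMBER", "PLUS", "NUMBER", "SEMI"]))  # True
print(parse_stmt(["NUMBER", "PLUS", "SEMI"]))            # False
```

Each function mirrors one rule, and the order of the calls inside it mirrors the order of the symbols on the right side of that rule.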

The derivation method above is called leftmost derivation: after replacing a non-terminal symbol with the right side of one of its rules, we always expand the leftmost remaining non-terminal symbol next, checking it against the given tokens. Note also that we start from the top rule and decompose downward, so this method is also called top-down derivation. The first parsing method we will study takes a sequence of tokens, reads it starting from the leftmost token, and applies the leftmost derivation described above, so it is called the LL parsing algorithm. The first L stands for reading the input from left to right, and the second L stands for leftmost derivation. LL(1) means the parser looks ahead one token while parsing, and LL(k) means it looks ahead k tokens. Lookahead is needed because a non-terminal symbol may have several alternatives, such as
EXPR -> FACTOR PLUS EXPR | FACTOR
which actually corresponds to two rules: EXPR -> FACTOR PLUS EXPR and EXPR -> FACTOR. Which one should be selected during parsing? The parser decides by looking ahead one or more tokens. We will also study the LR parsing algorithm later. There the L again means reading the token sequence from left to right, and the R means the parser constructs a rightmost derivation (in reverse, working bottom-up), the mirror image of the leftmost derivation described above.
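To make the leftmost derivation and the role of lookahead concrete, here is a small sketch that prints each step of the derivation for a token list. The names choose_production and leftmost_derivation are hypothetical. Because both alternatives of EXPR begin with FACTOR, this sketch peeks at the token after the upcoming NUMBER, that is, it uses two tokens of lookahead for the grammar exactly as written; it is an illustration, not the LL algorithm developed in later chapters.

```python
NON_TERMINALS = {"STMT", "EXPR", "FACTOR"}


def choose_production(symbol, tokens, pos):
    """Pick a right side for `symbol`, which must start at input position `pos`."""
    if symbol == "STMT":
        return ["EXPR", "SEMI"]
    if symbol == "FACTOR":
        return ["NUMBER"]
    # EXPR: both alternatives start with one NUMBER, so peek at the token after it.
    if pos + 1 < len(tokens) and tokens[pos + 1] == "PLUS":
        return ["FACTOR", "PLUS", "EXPR"]
    return ["FACTOR"]


def leftmost_derivation(tokens):
    """Return the list of sentential forms from STMT down to the tokens."""
    form = ["STMT"]
    steps = [" ".join(form)]
    while any(symbol in NON_TERMINALS for symbol in form):
        # Always expand the leftmost non-terminal. Everything to its left is
        # already terminal, so its index equals its position in the input.
        i = next(k for k, symbol in enumerate(form) if symbol in NON_TERMINALS)
        form[i:i + 1] = choose_production(form[i], tokens, i)
        steps.append(" ".join(form))
    if form != tokens:
        raise ValueError("token sequence does not match the grammar")
    return steps


for step in leftmost_derivation(["NUMBER", "PLUS", "NUMBER", "SEMI"]):
    print(step)
# STMT
# EXPR SEMI
# FACTOR PLUS EXPR SEMI
# NUMBER PLUS EXPR SEMI
# NUMBER PLUS FACTOR SEMI
# NUMBER PLUS NUMBER SEMI
```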

Another thing to note is that in the grammar rules given so far, the symbol on the left is always decomposed into one or more symbols on the right. In fact, the right side may also consist of zero symbols. Recall the epsilon transitions from lexical analysis: they let an automaton move to the next state without reading any input symbol. Similarly, we allow a grammar rule whose right side is empty.
[Figure: a grammar rule with an empty (epsilon) right side]
Such a rule means that parsing can be completed without consuming anything. This makes it easy to understand why a C compiler can compile a .c source file with empty content.
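As an illustration of an empty right side, consider a hypothetical pair of rules, PROGRAM -> STMT PROGRAM | epsilon and STMT -> NUMBER SEMI. These rules are not taken from the original figure; they are assumed here only to show how a parser can accept an input with no tokens at all. The function names are likewise hypothetical.

```python
def parse_program(tokens, pos=0):
    """PROGRAM -> STMT PROGRAM | epsilon. Return True if tokens[pos:] is a PROGRAM."""
    if pos == len(tokens):
        # Epsilon alternative: nothing is left, so succeed without consuming anything.
        return True
    end = parse_simple_stmt(tokens, pos)
    return end is not None and parse_program(tokens, end)


def parse_simple_stmt(tokens, pos):
    """STMT -> NUMBER SEMI (a deliberately tiny statement). Return the next position or None."""
    if tokens[pos:pos + 2] == ["NUMBER", "SEMI"]:
        return pos + 2
    return None


print(parse_program([]))                  # True: the empty file is a valid PROGRAM
print(parse_program(["NUMBER", "SEMI"]))  # True
print(parse_program(["NUMBER"]))          # False
```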

The material in this section is fairly abstract and may well leave you confused. Fortunately, we already touched on syntax parsing at the very beginning and again while working on lexical analysis, so those earlier exercises should help us understand the theory. In the following chapters we will practice some relatively simple syntax analysis, because only through practice can we properly understand and master abstract theory. For more information, please search for Coding Disney on Bilibili.

Origin blog.csdn.net/tyler_download/article/details/135072586