Compiler Principles Notes (2): Grammars and Languages

1. Intuitive concepts of grammar

Grammar overview : A grammar is a description of the sentence structure of a language. It is a tool for describing an infinite set (the sentences of the language) with a finite set (the rules).

2. Symbols and strings

The concept of the alphabet : the alphabet is a non-empty finite set of elements, and the elements in the alphabet are called symbols, so the alphabet is also called a symbol set.

The concept of strings :

  • Definition of symbol string : Any finite sequence composed of symbols in the alphabet is called a symbol string.
  • Length of the symbol string : The number of symbols contained in the symbol string is called the length of the symbol string. A symbol string that does not contain any symbols is called an empty string, denoted by ε, and its length is 0.
  • Head and tail, proper head and proper tail of symbol strings :
    • Head and tail : if z=xy is a symbol string, then x is the head of z and y is the tail of z;
    • Proper head and proper tail : if x is non-empty, then y is a proper tail of z; if y is non-empty, then x is a proper head of z.
  • Concatenation of symbol strings : Let x and y be symbol strings. Their concatenation xy is the symbol string obtained by writing the symbols of y after the symbols of x.
  • Power of a symbol string : Let x be a symbol string. The symbol string obtained by concatenating x with itself n times is called the n-th power of x, written x^n; x^0 is the empty string ε.
  • Set of symbol strings : If all elements in the set A are symbol strings on a certain alphabet, then A is said to be a set of symbol strings on the alphabet.
    • The product of symbol string sets : the product of two symbol string sets A and B is AB={xy | x∈A and y∈B}, that is, AB consists of all symbol strings xy formed by concatenating a string x from A with a string y from B.
  • Closure of the alphabet : The closure of the alphabet Σ is denoted by Σ*. It is the set of all finite-length strings on the alphabet, including the empty string ε.
    • The positive closure of the alphabet : deleting the empty string ε from the closure Σ* gives the positive closure of the alphabet, denoted by Σ+.
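
The string operations above map closely onto ordinary string and set operations. The following is a minimal Python sketch, not part of the original notes. The function names (`heads`, `tails`, `proper_heads`, `power`, `set_product`, `closure_up_to`) are illustrative assumptions, and because Σ* is infinite the closure is only enumerated up to a chosen length.

```python
from itertools import product

def heads(z):
    """All heads x of z = xy, including the empty string and z itself."""
    return [z[:i] for i in range(len(z) + 1)]

def tails(z):
    """All tails y of z = xy, including z itself and the empty string."""
    return [z[i:] for i in range(len(z) + 1)]

def proper_heads(z):
    """Heads x of z = xy whose matching tail y is non-empty."""
    return heads(z)[:-1]

def power(x, n):
    """x concatenated with itself n times; power(x, 0) is the empty string."""
    return x * n

def set_product(a, b):
    """AB = {xy | x in A and y in B}."""
    return {x + y for x in a for y in b}

def closure_up_to(sigma, max_len):
    """The strings of Σ* whose length is at most max_len."""
    return {"".join(t)
            for n in range(max_len + 1)
            for t in product(sorted(sigma), repeat=n)}

print(proper_heads("abc"))                 # ['', 'a', 'ab']
print(sorted(closure_up_to({"a", "b"}, 2)))
```

Removing the empty string from the result of `closure_up_to` gives the corresponding bounded part of the positive closure Σ+.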

3. Formal definitions of grammar and language

The concept of a production (rule) :

  • Representation of productions : A production is an ordered pair (α, β), written α→β or α::=β. The symbol → (or ::=) is read as "is defined as".
  • Left part and right part : α is called the left part of the production, and β is called the right part of the production.

The concept of grammar : Grammar G is defined as a quadruple (Vn,Vt,P,S).

  • Tuple contents : Vn is the set of non-terminal symbols, Vt is the set of terminal symbols, and P is the set of productions (Vn, Vt, and P are all required to be non-empty). S is the start symbol (also called the distinguished symbol), a non-terminal symbol from which every derivation begins. It must appear in the left part of at least one production.
  • Requirements for productions : Both the left part and the right part of a production are elements of V*, the closure of V = Vn ∪ Vt, and the left part must contain at least one non-terminal.
  • Shorthand representation of a grammar : In many cases it is not necessary to write a grammar out as a quadruple. It is enough to list the productions, with the convention that uppercase letters denote non-terminal symbols and lowercase letters denote terminal symbols.
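
To make the quadruple concrete, here is a small Python sketch of one possible representation. It is an illustration under assumptions, not a standard API: the class `Grammar`, its method `problems`, and the helper `from_shorthand` are hypothetical names, and every symbol is assumed to be a single character so that the uppercase/lowercase shorthand can be applied directly.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grammar:
    vn: frozenset   # non-terminal symbols
    vt: frozenset   # terminal symbols
    p: tuple        # productions as (left, right) pairs of strings
    s: str          # start symbol

    def problems(self):
        """Return the requirements from the notes that this grammar violates."""
        found = []
        v = self.vn | self.vt
        if not (self.vn and self.vt and self.p):
            found.append("Vn, Vt and P must all be non-empty")
        if self.s not in self.vn:
            found.append("S must be a non-terminal symbol")
        if not any(self.s in left for left, _ in self.p):
            found.append("S must appear in the left part of some production")
        for left, right in self.p:
            if not set(left + right) <= v:
                found.append(f"{left}->{right} uses a symbol outside V")
            if not any(sym in self.vn for sym in left):
                found.append(f"{left}->{right} has no non-terminal in its left part")
        return found


def from_shorthand(rules, start):
    """Build a Grammar from rules such as 'S->aSb' (uppercase = non-terminal)."""
    prods = tuple(tuple(part.strip() for part in rule.split("->")) for rule in rules)
    symbols = {sym for left, right in prods for sym in left + right}
    vn = frozenset(sym for sym in symbols if sym.isupper())
    return Grammar(vn, frozenset(symbols - vn), prods, start)


g = from_shorthand(["S->aSb", "S->ab"], "S")
print(g.problems())   # an empty list means every check passed
```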

Concepts of derivation :

  • Direct derivation and direct reduction : Suppose α→β is a production of the grammar G=(Vn,Vt,P,S), and x and y are any symbol strings in V*, where V = Vn ∪ Vt. If the symbol strings v and w satisfy v=xαy and w=xβy, then v is said to directly derive (directly produce) w, written v ⇒ w. Equivalently, w is a direct derivation of v, or w directly reduces to v.
  • Derivation and reduction : If w can be obtained from v through a series of direct derivations, then v is said to derive w, or w is said to reduce to v. The number of direct derivations used is called the length of the derivation.
  • Notation for derivation and reduction : direct derivation is written with a double arrow, v ⇒ w. A derivation of one or more steps adds a "+" above the arrow, v ⇒+ w. Adding a "*" above the arrow instead, v ⇒* w, covers both cases and also allows zero steps (v = w).
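
The definition of direct derivation (split v as xαy, then replace α by β) can be sketched in a few lines of Python. This is only an illustration: the names `direct_derivations` and `derives` are assumptions, symbols are single characters, and `derives` is a bounded search that gives up after `max_steps` direct derivations rather than a general decision procedure.

```python
def direct_derivations(v, productions):
    """All w with v ⇒ w: write v = xαy and replace α by β for some α→β."""
    results = set()
    for alpha, beta in productions:
        start = v.find(alpha)
        while start != -1:
            results.add(v[:start] + beta + v[start + len(alpha):])
            start = v.find(alpha, start + 1)
    return results


def derives(v, w, productions, max_steps):
    """True if v ⇒* w using at most max_steps direct derivations."""
    frontier = {v}
    for _ in range(max_steps + 1):
        if w in frontier:
            return True
        frontier = {u for t in frontier for u in direct_derivations(t, productions)}
    return False


# Example grammar in shorthand: S→aSb, S→ab
P = [("S", "aSb"), ("S", "ab")]
print(sorted(direct_derivations("aSb", P)))   # ['aaSbb', 'aabb']
print(derives("S", "aaabbb", P, 3))           # True: S ⇒ aSb ⇒ aaSbb ⇒ aaabbb
```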

Concepts of sentence patterns, sentences, languages, and grammar equivalence :

  • The concept of sentence pattern : Let G[S] be a grammar. If the symbol string x can be derived from the start symbol S (S ⇒* x), then x is called a sentence pattern (sentential form) of the grammar G[S].
  • The concept of a sentence : If a sentence pattern consists only of terminal symbols, it is called a sentence.
  • Definition of language : The language produced by a grammar G is the set of all sentences produced by the grammar, denoted as L(G).
  • Equivalence of grammars : Two grammars are said to be equivalent if they produce exactly the same language.
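
Since L(G) is usually infinite, a program can only list a finite part of it. The sketch below enumerates the sentences up to a length bound by breadth-first search over sentence patterns. The name `sentences_up_to` is hypothetical, symbols are single characters, and the pruning step assumes that no production shortens a string. Two grammars agreeing up to a bound is evidence of equivalence, not a proof.

```python
from collections import deque

def sentences_up_to(productions, start, max_len):
    """Sentences of L(G) with length <= max_len (assumes no shrinking productions)."""
    seen = {start}
    queue = deque([start])
    sentences = set()
    while queue:
        form = queue.popleft()
        if not any(c.isupper() for c in form):
            sentences.add(form)          # only terminals left: a sentence
            continue
        for alpha, beta in productions:
            i = form.find(alpha)
            while i != -1:
                new = form[:i] + beta + form[i + len(alpha):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(alpha, i + 1)
    return sentences


g1 = [("S", "aSb"), ("S", "ab")]                 # S→aSb | ab
g2 = [("S", "aA"), ("A", "Sb"), ("A", "b")]      # S→aA, A→Sb | b
print(sorted(sentences_up_to(g1, "S", 6), key=len))   # ['ab', 'aabb', 'aaabbb']
print(sentences_up_to(g1, "S", 6) == sentences_up_to(g2, "S", 6))
```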

4. Types of grammar

Chomsky Grammar Classification : Chomsky established a description of formal languages in 1956, and divided grammars into Type 0, Type 1, Type 2, and Type 3 grammars. The difference between these four types of grammars lies in the constraints imposed on the productions.

Four different grammars :

  • Type 0 grammar :

    • Another name for grammar : Type 0 grammar is also called phrase structure grammar.
    • Production restriction : The left side of a production must contain at least one nonterminal.
  • Type 1 grammar :

    • Another name for grammar : Type 1 grammar is also called context-sensitive grammar.
    • Production restrictions : All productions satisfy that the length of the left part is less than or equal to the length of the right part (except S→ε).
  • Type 2 grammar :

    • Another name for grammar : Type 2 grammar is also called context-free grammar.
    • Production restriction : the left part of every production is a single nonterminal.
  • Type 3 grammar :

    • Another name for grammar : Type 3 grammar is also called regular grammar.
    • Production restriction : Every production is of the form A→aB or A→a, where A and B are nonterminals and a is a terminal.

Relationship between the four grammars : The definitions of the four grammar types are progressively more restrictive. All Type 3 grammars are Type 2 grammars, all Type 2 grammars are Type 1 grammars, and all Type 1 grammars are Type 0 grammars.

Languages generated by the four grammars : The languages generated by Type 0, Type 1, Type 2, and Type 3 grammars are called Type 0 languages, context-sensitive languages, context-free languages, and regular languages, respectively.
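
The production restrictions listed above can be checked mechanically. The following Python sketch returns the most restrictive type whose constraint every production satisfies. The function name `chomsky_type` is an assumption, symbols are single characters with uppercase letters as nonterminals, and the Type 3 test covers only the A→aB / A→a form given in these notes.

```python
def chomsky_type(productions, start="S"):
    """Most restrictive type (3, 2, 1 or 0) that all productions satisfy."""
    def is_nt(sym):
        return sym.isupper()

    def type0(left, right):
        return any(is_nt(c) for c in left)

    def type1(left, right):
        # |left| <= |right|, with S→ε as the only exception
        return type0(left, right) and (
            len(left) <= len(right) or (left == start and right == ""))

    def type2(left, right):
        return len(left) == 1 and is_nt(left)

    def type3(left, right):
        if not type2(left, right) or len(right) not in (1, 2):
            return False
        return not is_nt(right[0]) and (len(right) == 1 or is_nt(right[1]))

    for level, test in ((3, type3), (2, type2), (1, type1), (0, type0)):
        if all(test(left, right) for left, right in productions):
            return level
    raise ValueError("some left part contains no non-terminal")


print(chomsky_type([("S", "aA"), ("A", "bA"), ("A", "b")]))   # 3
print(chomsky_type([("S", "aSb"), ("S", "ab")]))              # 2
```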

5. Context-free grammar and its syntax tree

The relationship between context-free grammars and programming languages : context-free grammars are powerful enough to describe the syntactic structure of today's programming languages.

The concept of syntax tree (derivation tree) :

  • The role of the syntax tree : The syntax tree is an intuitive tool for describing the derivation of the sentence pattern of the context-free grammar.
  • Conditions a syntax tree satisfies :
    • Each node has a label, which is a symbol of V; the label of the root is S.
    • If a node labeled A has a descendant other than itself, then A must be a nonterminal.
    • If a node n is labeled A and its direct descendants, from left to right, are n1, n2, ..., nk with labels A1, A2, ..., Ak, then A→A1 A2 ... Ak must be a production in P.
  • Limitation of the syntax tree : a syntax tree shows only which productions are applied to which non-terminals in a derivation; it does not record the order in which the productions are applied.
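
The conditions on a syntax tree can be checked directly on a small tree structure. In this Python sketch a node is a pair (label, children); the helper names `leaf`, `is_syntax_tree` and `frontier` are assumptions made for illustration, and symbols are single characters. Reading the leaf labels from left to right gives the sentence pattern that the tree describes.

```python
def leaf(symbol):
    """A node without descendants."""
    return (symbol, [])


def is_syntax_tree(tree, productions, start):
    """Check the root label and that every interior node matches a production."""
    def ok(node):
        label, children = node
        if not children:
            return True                     # a leaf needs no production
        right = "".join(child[0] for child in children)
        return (label, right) in productions and all(ok(c) for c in children)
    return tree[0] == start and ok(tree)


def frontier(node):
    """Leaf labels from left to right."""
    label, children = node
    return label if not children else "".join(frontier(c) for c in children)


productions = {("S", "aSb"), ("S", "ab")}
tree = ("S", [leaf("a"), ("S", [leaf("a"), leaf("b")]), leaf("b")])
print(is_syntax_tree(tree, productions, "S"), frontier(tree))   # True aabb
```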

Leftmost and rightmost derivations :

  • Leftmost derivation : If every direct derivation α ⇒ β in a derivation replaces the leftmost non-terminal symbol of α, the derivation is called a leftmost derivation.
  • Rightmost derivation : Similarly, if every direct derivation α ⇒ β replaces the rightmost non-terminal symbol of α, the derivation is called a rightmost derivation. In formal language theory, a rightmost derivation is also called a canonical derivation, and a sentence pattern it produces is called a canonical sentence pattern (right sentence pattern).
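
One syntax tree fixes which productions are used, and choosing the leftmost or rightmost non-terminal at each step then fixes the order. The Python sketch below reads both derivations off a tree given as (label, children) pairs. The function name `derivation` and the example grammar S→AB, A→a, B→b are assumptions chosen for illustration.

```python
def derivation(tree, leftmost=True):
    """Sentence patterns of the leftmost (or rightmost) derivation encoded by a tree."""
    forms = []
    pending = [tree]            # the current sentence pattern, as a list of nodes
    while True:
        forms.append("".join(node[0] for node in pending))
        # positions of nodes that still have children to expand
        open_nodes = [i for i, node in enumerate(pending) if node[1]]
        if not open_nodes:
            return forms
        i = open_nodes[0] if leftmost else open_nodes[-1]
        pending[i:i + 1] = pending[i][1]    # replace the node by its children


# Tree for the sentence "ab" under S→AB, A→a, B→b
tree = ("S", [("A", [("a", [])]), ("B", [("b", [])])])
print(" ⇒ ".join(derivation(tree, leftmost=True)))    # S ⇒ AB ⇒ aB ⇒ ab
print(" ⇒ ".join(derivation(tree, leftmost=False)))   # S ⇒ AB ⇒ Ab ⇒ ab
```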

Note : A sentence pattern does not necessarily correspond to exactly one syntax tree, and it does not necessarily have only one leftmost derivation or only one rightmost derivation.

Ambiguities in grammar and language :

  • Ambiguity of a grammar : If some sentence of a grammar has two different syntax trees (equivalently, two different leftmost derivations or two different rightmost derivations), the grammar is said to be ambiguous.
  • Inherent ambiguity of a language : A context-free language is said to be inherently ambiguous if every grammar that generates it is ambiguous. For a programming language, an unambiguous grammar is usually desired, so that the analysis of each statement is unique.
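
For a small context-free grammar, ambiguity of a particular sentence can be observed by counting its syntax trees. The sketch below is a hedged illustration: the function name `count_trees` is hypothetical, symbols are single characters, and the counting assumes no ε-productions and no cycles of the form A ⇒+ A. A count above 1 for any sentence shows that the grammar is ambiguous; a count of 1 for one sentence proves nothing about the grammar as a whole.

```python
from functools import lru_cache

def count_trees(productions, start, sentence):
    """Number of distinct syntax trees whose leaves spell `sentence`."""
    @lru_cache(maxsize=None)
    def sym(symbol, i, j):
        # trees rooted at `symbol` that cover sentence[i:j]
        if not symbol.isupper():
            return 1 if sentence[i:j] == symbol else 0
        return sum(seq(right, i, j) for left, right in productions if left == symbol)

    @lru_cache(maxsize=None)
    def seq(symbols, i, j):
        # ways for the symbol string `symbols` to cover sentence[i:j]
        if len(symbols) == 1:
            return sym(symbols, i, j)
        rest = len(symbols) - 1     # every remaining symbol needs at least one character
        return sum(sym(symbols[0], i, k) * seq(symbols[1:], k, j)
                   for k in range(i + 1, j - rest + 1))

    return sym(start, 0, len(sentence))


ambiguous = [("E", "E+E"), ("E", "E*E"), ("E", "i")]
layered = [("E", "E+T"), ("E", "T"), ("T", "T*F"), ("T", "F"), ("F", "i")]
print(count_trees(ambiguous, "E", "i+i*i"))   # 2
print(count_trees(layered, "E", "i+i*i"))     # 1
```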

6. Analysis of sentence patterns

Overview of sentence analysis :

  • Sentence pattern analysis identifies whether a symbol string is a sentence pattern of a given grammar; it amounts to constructing a derivation for the string.
  • More precisely, given a symbol string, the analysis tries to construct a derivation or syntax tree for it according to the rules of the grammar, and thereby decides whether it is a sentence pattern of that grammar.

Sentence analysis in compilation : In a compiler, sentence pattern analysis is the process of identifying whether an input symbol string is a syntactically correct program. The program that performs this analysis is called an analyzer (parser) or recognizer, and the analysis algorithm is also called a recognition algorithm. Analysis methods fall into two classes, top-down and bottom-up:

  • Top-down analysis method : Starting from the start symbol of the grammar, productions are applied repeatedly to find a derivation that matches the input symbol string.
  • Bottom-up analysis method : Starting from the input symbol string, reductions are applied step by step until the start symbol of the grammar is reached.
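
The two directions can be contrasted with a pair of deliberately naive recognizers. Both are exhaustive backtracking searches written only to mirror the definitions; they are not how practical parsers work. The names `top_down` and `bottom_up` are assumptions, symbols are single characters, the grammar is assumed context-free without ε-productions, and `top_down` additionally assumes there are no cycles such as A ⇒+ A.

```python
def top_down(productions, start, text):
    """Search leftmost derivations from the start symbol for one that yields text."""
    def expand(form):
        i = next((k for k, c in enumerate(form) if c.isupper()), None)
        if i is None:
            return form == text
        # prune: the terminal prefix must match and the form must not be too long
        if form[:i] != text[:i] or len(form) > len(text):
            return False
        return any(expand(form[:i] + right + form[i + 1:])
                   for left, right in productions if left == form[i])
    return expand(start)


def bottom_up(productions, start, text):
    """Search reductions from text back to the start symbol."""
    seen = set()

    def reduce_form(form):
        if form == start:
            return True
        if form in seen:
            return False
        seen.add(form)
        for left, right in productions:
            i = form.find(right)
            while i != -1:
                if reduce_form(form[:i] + left + form[i + len(right):]):
                    return True
                i = form.find(right, i + 1)
        return False

    return reduce_form(text)


P = [("S", "aSb"), ("S", "ab")]
print(top_down(P, "S", "aabb"), bottom_up(P, "S", "aabb"))   # True True
print(top_down(P, "S", "aab"), bottom_up(P, "S", "aab"))     # False False
```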

Phrase concept :

  • Definition of phrase : Let G be a grammar with start symbol S, and let αβδ be a sentence pattern of G. If S ⇒* αAδ and A ⇒+ β, then β is called a phrase of the sentence pattern αβδ relative to the non-terminal symbol A.
  • Definition of direct phrase (simple phrase) : If, in addition, A derives β in a single step (A ⇒ β), then β is called a direct phrase of the sentence pattern αβδ relative to the non-terminal symbol A.
  • Definition of handle : The leftmost direct phrase of a right sentence pattern is called the handle of that sentence pattern. The concept of a handle applies only to right sentence patterns.
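
On a syntax tree these definitions become easy to read off: the leaves under any interior node form a phrase, a node whose children are all leaves gives a direct phrase, and the leftmost such node gives the handle. The Python sketch below assumes trees written as (label, children) pairs; the names `phrases` and `handle` and the example grammar E→E+T | T, T→T*F | F, F→i are illustrative assumptions.

```python
def phrases(tree):
    """(position, non-terminal, phrase, is_direct) for every interior node."""
    found = []

    def walk(node, pos):
        label, children = node
        if not children:
            return label
        text = ""
        for child in children:
            text += walk(child, pos + len(text))
        direct = all(not child[1] for child in children)   # all children are leaves
        found.append((pos, label, text, direct))
        return text

    walk(tree, 0)
    return found


def handle(tree):
    """The leftmost direct phrase, or None if the tree is a single node."""
    direct = [(pos, text) for pos, _, text, is_direct in phrases(tree) if is_direct]
    return min(direct)[1] if direct else None


# Tree of the right sentence pattern i+T*F (E ⇒ E+T ⇒ E+T*F ⇒ T+T*F ⇒ F+T*F ⇒ i+T*F)
tree = ("E", [("E", [("T", [("F", [("i", [])])])]),
              ("+", []),
              ("T", [("T", []), ("*", []), ("F", [])])])
print(sorted({text for _, _, text, _ in phrases(tree)}))   # ['T*F', 'i', 'i+T*F']
print(handle(tree))                                        # i
```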

7. Some notes on the practical application of grammar

Harmful and redundant rules : In practical applications, a grammar should contain no harmful or redundant rules.

  • Harmful rules : productions whose left and right parts are identical, such as A→A. Such a production is harmful because it adds nothing to the language and only introduces ambiguity.
  • Redundant rules : productions that are never used in the derivation of any sentence of the grammar. They arise in two ways.
    • A non-terminal (other than the start symbol) that does not appear in the right part of any rule is said to be unreachable, since it cannot occur in any sentence pattern.
    • A non-terminal from which no terminal symbol string can be derived is said to be non-terminating.
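
Both kinds of redundant rule can be found with simple fixed-point computations. The following Python sketch is an illustration for context-free grammars with single-character symbols; the names `productive`, `reachable` and `remove_useless` are assumptions. It treats a non-terminal as reachable when it can occur in some sentence pattern derived from the start symbol, which is slightly stronger than only checking right parts, and it removes non-terminating symbols first so that symbols reachable only through them are dropped as well.

```python
def productive(productions):
    """Non-terminals from which some terminal string can be derived."""
    good = set()
    changed = True
    while changed:
        changed = False
        for left, right in productions:
            if left not in good and all(not c.isupper() or c in good for c in right):
                good.add(left)
                changed = True
    return good


def reachable(productions, start):
    """Non-terminals that can occur in a sentence pattern derived from start."""
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for left, right in productions:
            if left == current:
                for c in right:
                    if c.isupper() and c not in seen:
                        seen.add(c)
                        stack.append(c)
    return seen


def remove_useless(productions, start):
    """Drop harmful rules (A→A), then non-terminating, then unreachable symbols."""
    rules = [(l, r) for l, r in productions if l != r]
    good = productive(rules)
    rules = [(l, r) for l, r in rules
             if l in good and all(not c.isupper() or c in good for c in r)]
    live = reachable(rules, start)
    return [(l, r) for l, r in rules if l in live]


P = [("S", "Be"), ("B", "Ce"), ("B", "Af"), ("A", "Ae"), ("A", "e"),
     ("C", "Cf"), ("D", "f"), ("A", "A")]
print(remove_useless(P, "S"))
```

In this example C is non-terminating, D is unreachable, and A→A is a harmful rule, so only the rules for S, B→Af and the two productive rules for A remain.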

Origin blog.csdn.net/hanmo22357/article/details/130500256