Interpreter pattern - implementing a custom language

1. Introduction

1.1. Grammar rules and abstract syntax trees

The Interpreter pattern describes how to define a grammar for a simple language, how to represent sentences in that language, and how to interpret those sentences. Before analyzing the structure of the pattern, let's look at how to express the grammar rules of a language and how to build an abstract syntax tree.

In an addition/subtraction interpreter, each input expression, such as "1+2+3-4+1", is made up of three kinds of language units, which can be defined with the following grammar rules:

expression ::= value | operation
operation ::= expression '+' expression | expression '-' expression
value ::= an integer // an integer value

This grammar contains three rules. The first describes what an expression is made of: value and operation, the two language units defined by the rules that follow. The names defined by each rule, such as operation and value, are called language constructs or language units. The symbol "::=" means "is defined as": the language unit on its left is defined by what appears on its right. Language units correspond to terminal and non-terminal expressions. In this grammar, operation is a non-terminal expression, because its constituents are themselves expressions and can be decomposed further, while value is a terminal expression, because its constituents are the most basic language units and cannot be decomposed further.

Grammar rules can use a few symbols with special meanings: "|" means "or", "{" and "}" group elements together, and "*" means zero or more occurrences. The most frequently used is "|". For example, the rule "boolValue ::= 0 | 1" says that the terminal expression boolValue can take the value 0 or 1.
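The "|" and "*" symbols have close counterparts in regular expressions, which gives a quick way to check whether a string belongs to the addition/subtraction language. The following is only a sketch under some assumptions: the class and method names (GrammarCheck, isExpression) are hypothetical, integers are non-negative, and no whitespace is allowed. It recognizes sentences but does not interpret them.

```java
import java.util.regex.Pattern;

class GrammarCheck {
    // value ::= an integer                      ->  \d+
    // ('+' | '-') followed by a value, 0+ times ->  ([+-]\d+)*
    private static final Pattern EXPRESSION = Pattern.compile("\\d+([+-]\\d+)*");

    // Returns true if the whole sentence is a chain of integers joined by '+' or '-'.
    static boolean isExpression(String sentence) {
        return EXPRESSION.matcher(sentence).matches();
    }

    public static void main(String[] args) {
        System.out.println(isExpression("1+2+3-4+1")); // true
        System.out.println(isExpression("1+-2"));      // false
    }
}
```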

Besides grammar rules, the composition of a language can also be shown graphically with an abstract syntax tree (AST). Each abstract syntax tree corresponds to one sentence of the language. For example, the sentence "1+2+3-4+1" in the addition/subtraction expression language can be represented by the abstract syntax tree shown in the figure below.
[Figure: abstract syntax tree for the sentence "1+2+3-4+1"]
In the abstract syntax tree, complex sentences are composed from the terminal expression value and the non-terminal expression operation. Every sentence that follows the grammar can be represented by an abstract syntax tree similar to the one in the figure above. Instances of terminal expression classes are the leaf nodes of the tree, and instances of non-terminal expression classes are the non-leaf nodes, whose children can be terminal expression instances or further subexpressions. An abstract syntax tree describes how a complex sentence is put together, and analyzing it reveals which terminal and non-terminal classes the language needs.
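To make the tree structure concrete, here is a minimal sketch that builds the tree for "1+2+3-4+1" by hand and evaluates it. All names (AstSketch, Node, Value, Operation) are hypothetical, and the left-to-right grouping ((((1+2)+3)-4)+1) is an assumption about the shape of the tree.

```java
class AstSketch {
    // Builds the tree for "1+2+3-4+1", grouping from left to right.
    static Node buildExample() {
        Node tree = new Operation('+', new Value(1), new Value(2));
        tree = new Operation('+', tree, new Value(3));
        tree = new Operation('-', tree, new Value(4));
        tree = new Operation('+', tree, new Value(1));
        return tree;
    }

    public static void main(String[] args) {
        System.out.println(buildExample().evaluate()); // prints 3
    }
}

interface Node {
    int evaluate();
}

// Leaf node: corresponds to the terminal expression "value".
final class Value implements Node {
    private final int number;

    Value(int number) {
        this.number = number;
    }

    @Override
    public int evaluate() {
        return number;
    }
}

// Non-leaf node: corresponds to the non-terminal expression "operation".
final class Operation implements Node {
    private final char operator; // '+' or '-'
    private final Node left;
    private final Node right;

    Operation(char operator, Node left, Node right) {
        this.operator = operator;
        this.left = left;
        this.right = right;
    }

    @Override
    public int evaluate() {
        // Evaluate both children first, then combine them.
        int l = left.evaluate();
        int r = right.evaluate();
        return operator == '+' ? l + r : l - r;
    }
}
```

The leaves hold the integers and every inner node holds one operator, which mirrors the split between terminal and non-terminal expressions described above.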

1.2. Overview

Languages such as C++, Java, and C# cannot directly interpret a string like "1+2+3-4+1" (although they can evaluate the same text when it is written directly as a numeric expression in source code). To interpret such sentences, the developer must define a set of grammar rules, that is, design a custom language. In practice, these simple custom languages can be built on top of an existing programming language. If that underlying language is object-oriented, the Interpreter pattern can be used to implement the custom language.

The Interpreter pattern is used relatively rarely and is comparatively hard to learn. It describes how to build an interpreter for a simple language using an object-oriented language. In some cases, creating a new language makes a certain class of problems easier to describe. Such a language has its own expressions, structure, and grammar rules, and instances of the problem correspond to sentences in the language. The Interpreter pattern can then be used to design this new language. Studying the pattern deepens one's understanding of object-oriented design and of how the grammar rules of a programming language are interpreted.

1.3. Definition

Interpreter Pattern: Define the grammar of a language and build an interpreter that interprets sentences in that language. The "language" here means code written in a prescribed format and grammar. The Interpreter pattern is a behavioral pattern.

2. Analysis

2.1. UML class diagram

Because expressions are divided into terminal and non-terminal expressions, the structure of the Interpreter pattern resembles that of the Composite pattern, although it contains more elements. Its structure is shown in the following figure.
[Figure: UML class diagram of the Interpreter pattern]
The structure diagram of the Interpreter pattern contains the following four roles:

  1. AbstractExpression (abstract expression): Declares the abstract interpretation operation and is the common parent class of all terminal and non-terminal expressions.
  2. TerminalExpression (terminal expression): A subclass of the abstract expression that implements the interpretation operation for the terminal symbols of the grammar. Each terminal symbol in a sentence is an instance of this class. An interpreter usually has only a few terminal expression classes, whose instances are combined into more complex sentences through non-terminal expressions.
  3. NonterminalExpression (non-terminal expression): Also a subclass of the abstract expression, it implements the interpretation operation for the non-terminal symbols of the grammar. Because a non-terminal expression can contain terminal expressions as well as other non-terminal expressions, its interpretation operation is generally carried out recursively.
  4. Context (environment class): Also called the context class, it stores global information outside the interpreter and usually holds, temporarily, the sentences that need to be interpreted.

2.2. Code example

In the Interpreter pattern, every terminal and non-terminal symbol has a concrete class that corresponds to it. Precisely because a class represents each grammar rule, the system is flexible and easy to extend. All terminal and non-terminal expressions first need a common parent class, the abstract expression class. Its typical code is as follows:

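/**
 * @Description: Abstract expression class
 * @Author: yangyongbing
 * @CreateTime: 2023/08/02  12:51
 * @Version: 1.0
 */
abstract class AbstractExpression {

    // Interpretation operation implemented by every terminal and non-terminal expression
    public abstract void interpret(Context context);

}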

Both the TerminalExpression and NonterminalExpression classes are subclasses of AbstractExpression. The code for a terminal expression is very simple, since it only deals with terminal elements. Its typical code is as follows:

/**
 * @Description: Terminal expression class
 * @Author: yangyongbing
 * @CreateTime: 2023/08/02  12:56
 * @Version: 1.0
 */
public class TerminalExpression extends AbstractExpression {

    @Override
    public void interpret(Context context) {
        // Interpretation operation for the terminal symbol
    }
}

For non-terminal expressions, the code is relatively complex, because expressions can be combined into more complex structures through non-terminal symbols. For a non-terminal expression class containing two operand elements, the typical code is as follows:

/**
 * @Description: Non-terminal expression class
 * @Author: yangyongbing
 * @CreateTime: 2023/08/02  12:58
 * @Version: 1.0
 */
public class NonterminalExpression extends AbstractExpression {

    private final AbstractExpression left;
    private final AbstractExpression right;

    public NonterminalExpression(AbstractExpression left, AbstractExpression right) {
        this.left = left;
        this.right = right;
    }

    @Override
    public void interpret(Context context) {
        // Recursively call interpret() on each component.
        // The way the components are combined during the recursion
        // is what this non-terminal symbol means.
    }
}
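The skeletons above leave the interpretation logic open. As one possible way to fill it in for the addition/subtraction language, the sketch below builds the expression objects from a string and interprets them recursively. It makes several assumptions that are not in the original: interpret() returns an int instead of void, no Context is needed, the input is well formed (non-negative integers, no whitespace), and all class names are hypothetical.

```java
import java.util.StringTokenizer;

class ParseSketch {
    // Scans the sentence left to right and builds a left-leaning tree,
    // e.g. "1+2+3-4+1" becomes ((((1+2)+3)-4)+1).
    static Expression parse(String sentence) {
        // The third argument makes the tokenizer return "+" and "-" as tokens too.
        StringTokenizer tokens = new StringTokenizer(sentence, "+-", true);
        Expression tree = new NumberExpression(Integer.parseInt(tokens.nextToken()));
        while (tokens.hasMoreTokens()) {
            String operator = tokens.nextToken();
            Expression right = new NumberExpression(Integer.parseInt(tokens.nextToken()));
            tree = operator.equals("+")
                    ? new AddExpression(tree, right)
                    : new SubtractExpression(tree, right);
        }
        return tree;
    }

    public static void main(String[] args) {
        System.out.println(parse("1+2+3-4+1").interpret()); // prints 3
    }
}

abstract class Expression {
    abstract int interpret();
}

// Terminal expression: a single integer value.
final class NumberExpression extends Expression {
    private final int value;

    NumberExpression(int value) {
        this.value = value;
    }

    @Override
    int interpret() {
        return value;
    }
}

// Non-terminal expression: interprets both operands recursively, then adds them.
final class AddExpression extends Expression {
    private final Expression left;
    private final Expression right;

    AddExpression(Expression left, Expression right) {
        this.left = left;
        this.right = right;
    }

    @Override
    int interpret() {
        return left.interpret() + right.interpret();
    }
}

// Non-terminal expression: interprets both operands recursively, then subtracts.
final class SubtractExpression extends Expression {
    private final Expression left;
    private final Expression right;

    SubtractExpression(Expression left, Expression right) {
        this.left = left;
        this.right = right;
    }

    @Override
    int interpret() {
        return left.interpret() - right.interpret();
    }
}
```

Each operator gets its own class here, so the way two operands are combined lives entirely in that class's interpret() method.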

Besides the classes that represent expressions, the Interpreter pattern usually provides an environment class, Context, to store global information. A Context typically holds a collection object such as a HashMap or ArrayList (a collection class such as HashMap can also serve directly as the environment class) that stores shared information, for example the mapping between variable names and their values (key/value), so that it can be looked up during interpretation. A typical code snippet is as follows:

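import java.util.HashMap;
import java.util.Map;

/**
 * @Description: Environment class
 * @Author: yangyongbing
 * @CreateTime: 2023/08/02  12:53
 * @Version: 1.0
 */
public class Context {

    private final Map<String, String> map = new HashMap<>();

    public void assign(String key, String value) {
        // Store a value in the environment
        map.put(key, value);
    }

    public String lookup(String key) {
        // Retrieve a value stored in the environment (null if the key is absent)
        return map.get(key);
    }
}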
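To show how expressions might consult the environment, the sketch below interprets "a + b - c" against variable values held in a context. It is an illustration under assumptions: the names (VariableContext, VariableExpression, Variable, Plus, Minus) are hypothetical, values are stored as integers rather than strings, and interpret() returns an int.

```java
import java.util.HashMap;
import java.util.Map;

class ContextSketch {
    // Interprets "a + b - c" against the given variable values.
    static int evaluate(int a, int b, int c) {
        VariableContext context = new VariableContext();
        context.assign("a", a);
        context.assign("b", b);
        context.assign("c", c);
        VariableExpression expression =
                new Minus(new Plus(new Variable("a"), new Variable("b")), new Variable("c"));
        return expression.interpret(context);
    }

    public static void main(String[] args) {
        System.out.println(evaluate(5, 3, 2));  // prints 6
        System.out.println(evaluate(10, 3, 2)); // prints 11
    }
}

// Environment class: maps variable names to integer values.
class VariableContext {
    private final Map<String, Integer> values = new HashMap<>();

    void assign(String name, int value) {
        values.put(name, value);
    }

    int lookup(String name) {
        Integer value = values.get(name);
        if (value == null) {
            throw new IllegalArgumentException("Undefined variable: " + name);
        }
        return value;
    }
}

interface VariableExpression {
    int interpret(VariableContext context);
}

// Terminal expression: resolves a variable name through the context.
final class Variable implements VariableExpression {
    private final String name;

    Variable(String name) {
        this.name = name;
    }

    @Override
    public int interpret(VariableContext context) {
        return context.lookup(name);
    }
}

// Non-terminal expression: passes the same context down to both operands.
final class Plus implements VariableExpression {
    private final VariableExpression left;
    private final VariableExpression right;

    Plus(VariableExpression left, VariableExpression right) {
        this.left = left;
        this.right = right;
    }

    @Override
    public int interpret(VariableContext context) {
        return left.interpret(context) + right.interpret(context);
    }
}

final class Minus implements VariableExpression {
    private final VariableExpression left;
    private final VariableExpression right;

    Minus(VariableExpression left, VariableExpression right) {
        this.left = left;
        this.right = right;
    }

    @Override
    public int interpret(VariableContext context) {
        return left.interpret(context) - right.interpret(context);
    }
}
```

The same expression tree gives different results under different contexts, which is the reason for keeping the shared information outside the expression classes.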

When the system has no global, shared information to provide, the environment class can be omitted. Whether to include one depends on the actual situation.

3. Summary of the Interpreter pattern

The Interpreter pattern offers a solution for designing and implementing custom languages: it defines a set of grammar rules and uses them to interpret sentences in the language. Although the pattern is not used particularly often, it is still widely applied in areas such as regular expressions and XML document interpretation. In a similar spirit, many source code processing tools based on abstract syntax trees have appeared. For example, the Eclipse AST in Eclipse represents the syntactic structure of the Java language, and users can create their own grammar rules by extending it.

3.1. Main advantages

  1. The grammar is easy to change and extend. Because classes represent the grammar rules of the language, the grammar can be changed or extended through mechanisms such as inheritance.
  2. Each grammar rule can be expressed as a class, so a simple language is convenient to implement.
  3. The grammar is relatively easy to implement. The expression node classes of the abstract syntax tree are all implemented in a similar way, none of them is particularly complicated to write, and tools can generate the node class code automatically.
  4. New interpretation expressions are convenient to add. To add one, the user only needs to add a new terminal or non-terminal expression class. The existing expression classes do not need to be modified, which conforms to the open/closed principle.
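The last advantage can be illustrated with a small sketch in which a multiplication rule is added as one new class while the existing classes stay untouched. The names (Expr, Num, Add, Multiply) are hypothetical and interpret() returning an int is an assumption. Any code that turns text into expression objects would still have to learn the new operator, which this sketch leaves out.

```java
class ExtensionSketch {
    public static void main(String[] args) {
        // (1 + 2) * 4, built by hand
        Expr expression = new Multiply(new Add(new Num(1), new Num(2)), new Num(4));
        System.out.println(expression.interpret()); // prints 12
    }
}

// Existing classes: not modified when the language grows.
interface Expr {
    int interpret();
}

final class Num implements Expr {
    private final int value;

    Num(int value) {
        this.value = value;
    }

    @Override
    public int interpret() {
        return value;
    }
}

final class Add implements Expr {
    private final Expr left;
    private final Expr right;

    Add(Expr left, Expr right) {
        this.left = left;
        this.right = right;
    }

    @Override
    public int interpret() {
        return left.interpret() + right.interpret();
    }
}

// New grammar rule: added as a single new non-terminal expression class.
final class Multiply implements Expr {
    private final Expr left;
    private final Expr right;

    Multiply(Expr left, Expr right) {
        this.left = left;
        this.right = right;
    }

    @Override
    public int interpret() {
        return left.interpret() * right.interpret();
    }
}
```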

3.2. Main disadvantages

  1. Complex grammars are hard to maintain. Every rule needs at least one class, so if a language has many grammar rules, the number of classes grows sharply and the system becomes hard to manage and maintain. In that case, consider replacing the Interpreter pattern with a parser or a similar tool.
  2. Execution efficiency is low. The Interpreter pattern relies heavily on loops and recursive calls, so interpreting complex sentences is slow, and the code is also troublesome to debug.

3.3. Applicable scenarios

(1) A sentence in a language that needs to be interpreted and executed can be represented as an abstract syntax tree.
(2) Some recurring problems can be expressed in a simple language.
(3) The grammar of a language is relatively simple.
(4) Execution efficiency is not a key issue.
Note: Efficient interpreters usually do not interpret the abstract syntax tree directly but first convert it into another form, so an interpreter built with the Interpreter pattern is not very efficient.
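As one illustration of "another form", the sketch below flattens an expression tree into a postfix instruction list once and then runs that list on an operand stack, so evaluation no longer walks the tree. This is only a sketch under assumptions: the names (PostfixSketch, Tree, Leaf, Binary) are hypothetical, instructions are plain strings for readability, and no performance claim is made for a language this small.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class PostfixSketch {
    // Runs a flat postfix program on an operand stack, without recursion.
    static int run(List<String> program) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (String instruction : program) {
            if (instruction.equals("+") || instruction.equals("-")) {
                int right = stack.pop();
                int left = stack.pop();
                stack.push(instruction.equals("+") ? left + right : left - right);
            } else {
                stack.push(Integer.parseInt(instruction));
            }
        }
        return stack.pop();
    }

    // Converts a tree into its postfix instruction list.
    static List<String> compile(Tree tree) {
        List<String> program = new ArrayList<>();
        tree.emit(program);
        return program;
    }

    public static void main(String[] args) {
        // Tree for "1+2+3-4+1", grouped left to right.
        Tree tree = new Binary('+',
                new Binary('-',
                        new Binary('+', new Binary('+', new Leaf(1), new Leaf(2)), new Leaf(3)),
                        new Leaf(4)),
                new Leaf(1));
        List<String> program = compile(tree);
        System.out.println(program);      // prints [1, 2, +, 3, +, 4, -, 1, +]
        System.out.println(run(program)); // prints 3
    }
}

interface Tree {
    // Appends this node's postfix instructions to the output list.
    void emit(List<String> out);
}

final class Leaf implements Tree {
    private final int value;

    Leaf(int value) {
        this.value = value;
    }

    @Override
    public void emit(List<String> out) {
        out.add(Integer.toString(value));
    }
}

final class Binary implements Tree {
    private final char operator; // '+' or '-'
    private final Tree left;
    private final Tree right;

    Binary(char operator, Tree left, Tree right) {
        this.operator = operator;
        this.left = left;
        this.right = right;
    }

    @Override
    public void emit(List<String> out) {
        // Postfix order: both operands first, then the operator.
        left.emit(out);
        right.emit(out);
        out.add(String.valueOf(operator));
    }
}
```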


Origin blog.csdn.net/YYBDESHIJIE/article/details/132052893