Exploring the Presto SQL engine: making good use of ANTLR

1. Background

Since big data was first written into the government work report in 2014, it has been developing for eight years. The types of big data have expanded from transaction data to interaction data and sensor data, and data volumes have reached the PB level.

At this scale, acquiring, storing, managing, and analyzing data is beyond the capabilities of traditional database software. Against this background, big data tools have emerged one after another to serve different business scenarios, from Hive, Spark, Presto, Kylin, and Druid in the Hadoop ecosystem to ClickHouse and Elasticsearch outside it.

These tools have different characteristics and application scenarios, but the interfaces or query languages they provide are similar: each component supports SQL. Each one, however, implements its own SQL dialect to suit its scenarios and characteristics, which means the open source projects involved must implement SQL parsing themselves. In this context, ANTLR, a parser generator born in 1989, ushered in a golden age.

2. Introduction

ANTLR is an open source parser generator that has been around for over 30 years, a project that has stood the test of time. From source code to machine execution, a program basically goes through three stages: writing, compiling, and executing.

In the compilation stage, lexical and syntactic analysis are required. This is the problem ANTLR focuses on: from a grammar, it generates a lexer and a parser that analyze source text and build a tree structure (the parse tree). Grammars exist for almost all mainstream programming languages; as the antlr/grammars-v4 repository shows, dozens of languages such as Java, C, Python, and SQL are covered. We rarely need to extend a programming language, so in most cases this language support is used for learning and research, or in development tools (NetBeans, IntelliJ) to check syntax and format code.

For SQL, ANTLR is applied more broadly and deeply. This is because Hive, Presto, SparkSQL, and others need to customize how SQL is executed, for example to implement a distributed query engine or to support data characteristics unique to various big data scenarios.

3. Implementing the four arithmetic operations with ANTLR4

Currently we mainly use ANTLR4. The book "The Definitive ANTLR 4 Reference" introduces various interesting applications of ANTLR4, for example: implementing a calculator that supports the four arithmetic operations; parsing and extracting formatted text such as JSON; converting JSON to XML; and extracting interfaces from Java source code.

This section takes the four-operation calculator as an example to introduce basic ANTLR4 usage and pave the way for the later ANTLR4-based SQL parsing. In fact, supporting numerical operations is a basic capability that every programming language must have.

3.1 Hand-coded implementation

Without ANTLR4, how would we implement the four arithmetic operations? One approach is to use stacks. For example, leaving exception handling aside, a simple four-operation calculator looks like this:

package org.example.calc;
 
import java.util.*;
 
public class CalcByHand {
    // Define the operators by precedence; * and / have higher precedence
    public static Set<String> opSet1 = new HashSet<>();
    public static Set<String> opSet2 = new HashSet<>();
    static {
        opSet1.add("+");
        opSet1.add("-");
        opSet2.add("*");
        opSet2.add("/");
    }
 
    public static void main(String[] args) {
        String exp = "1+3*4";
        // Split the expression into tokens, keeping the operators as tokens
        String[] tokens = exp.split("((?<=[-+*/])|(?=[-+*/]))");
 
        Stack<String> opStack = new Stack<>();
        Stack<String> numStack = new Stack<>();
        int proi = 1;
        // Push each token onto the operator stack or the operand stack by type
        for (String token : tokens) {
            token = token.trim();
 
            if (opSet1.contains(token)) {
                // Evaluate pending operators first so that + and - stay left-associative (e.g. 1-2+3)
                while (!opStack.isEmpty()) {
                    calcExp(opStack, numStack);
                }
                opStack.push(token);
                proi = 1;
            } else if (opSet2.contains(token)) {
                proi = 2;
                opStack.push(token);
            } else {
                numStack.push(token);
                // If the operator before this operand has high precedence, evaluate now and push the result
                if (proi == 2) {
                    calcExp(opStack, numStack);
                }
            }
        }
 
        while (!opStack.isEmpty()) {
            calcExp(opStack, numStack);
        }
        String finalVal = numStack.pop();
        System.out.println(finalVal);
    }
 
    private static void calcExp(Stack<String> opStack, Stack<String> numStack) {
        // The right operand is on top of the stack
        double right = Double.parseDouble(numStack.pop());
        double left = Double.parseDouble(numStack.pop());
        String op = opStack.pop();
        String val;
        switch (op) {
            case "+":
                val = String.valueOf(left + right);
                break;
            case "-":
                val = String.valueOf(left - right);
                break;
            case "*":
                val = String.valueOf(left * right);
                break;
            case "/":
                val = String.valueOf(left / right);
                break;
            default:
                throw new UnsupportedOperationException("unsupported");
        }
        numStack.push(val);
    }
}

The code is short and relies on the stack data structure, but operator precedence has to be controlled by hand. It supports neither parenthesized expressions nor variable assignment. Next, let's look at the implementation using ANTLR4.

3.2 Implementation based on ANTLR4

The basic process of programming with ANTLR4 is fixed and usually consists of the following three steps:

  • Based on the requirements, write the rules of the custom grammar in ANTLR4 notation and save them in a file with the g4 suffix.

  • Use the ANTLR4 tool to process the g4 file and generate the lexer code, the parser code, and the token vocabulary (.tokens) files.

  • Write code that extends the generated Visitor class or implements the Listener interface to develop your own business logic.

Based on this process, let's go through the details with an existing example.

Step 1: Define the grammar file according to the rules of ANTLR4, with g4 as the file suffix. For example, the grammar file for the calculator is named LabeledExpr.g4. Its content is as follows:

grammar LabeledExpr; // rename to distinguish from Expr.g4
 
prog:   stat+ ;
 
stat:   expr NEWLINE                # printExpr
    |   ID '=' expr NEWLINE         # assign
    |   NEWLINE                     # blank
    ;
 
expr:   expr op=('*'|'/') expr      # MulDiv
    |   expr op=('+'|'-') expr      # AddSub
    |   INT                         # int
    |   ID                          # id
    |   '(' expr ')'                # parens
    ;
 
MUL :   '*' ; // assigns token name to '*' used above in grammar
DIV :   '/' ;
ADD :   '+' ;
SUB :   '-' ;
ID  :   [a-zA-Z]+ ;      // match identifiers
INT :   [0-9]+ ;         // match integers
NEWLINE:'\r'? '\n' ;     // return newlines to parser (is end-statement signal)
WS  :   [ \t]+ -> skip ; // toss out whitespace

(Note: This example file comes from "The Definitive ANTLR 4 Reference".)
A brief walkthrough of the LabeledExpr.g4 file: ANTLR4 rules are written in a notation similar to regular expressions. Rules are read top-down, and each semicolon-terminated statement is one rule. For example, the first line, grammar LabeledExpr;, declares that the grammar is named LabeledExpr; this name must match the file name. Java has a similar rule: a public class name must match the name of its source file.

  • The rule prog states that a prog is one or more stat.

  • The rule stat has three alternatives: an expression expr, an assignment ID '=' expr, and a blank line.

  • The rule expr has five alternatives: multiplication and division, addition and subtraction, integer, ID, and parenthesized expression. This is clearly a recursive definition. MulDiv is listed before AddSub, and ANTLR4 gives earlier alternatives higher precedence, so 1+3*4 is parsed as 1+(3*4).

Last come the basic elements (tokens) from which the compound rules are built, such as:

ID: [a-zA-Z]+; means an ID is a string of one or more uppercase or lowercase English letters.
INT: [0-9]+; means an INT is one or more digits between 0 and 9. This definition is not strict; to be stricter, the length should be limited.

With a basic understanding of regular expressions, the g4 grammar rules of ANTLR4 are relatively easy to understand.
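The recursive definition of expr can be made concrete with a small hand-written sketch. It is not the code ANTLR4 generates; the class MiniExprParser and its method names are hypothetical, and it handles only integers, the four operators, and parentheses. Each method plays the role of one precedence level of the grammar.

```java
// Hypothetical hand-written recursive-descent evaluator, for illustration only.
class MiniExprParser {
    private final String src;
    private int pos = 0;

    MiniExprParser(String src) {
        // Drop whitespace up front, much like the WS rule skips it
        this.src = src.replaceAll("\\s+", "");
    }

    static int eval(String text) {
        MiniExprParser p = new MiniExprParser(text);
        int value = p.addSub();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at position " + p.pos);
        }
        return value;
    }

    // Lowest precedence: mulDiv (('+'|'-') mulDiv)*
    private int addSub() {
        int left = mulDiv();
        while (pos < src.length() && (peek() == '+' || peek() == '-')) {
            char op = src.charAt(pos++);
            int right = mulDiv();
            left = (op == '+') ? left + right : left - right;
        }
        return left;
    }

    // Higher precedence: atom (('*'|'/') atom)*
    private int mulDiv() {
        int left = atom();
        while (pos < src.length() && (peek() == '*' || peek() == '/')) {
            char op = src.charAt(pos++);
            int right = atom();
            left = (op == '*') ? left * right : left / right;
        }
        return left;
    }

    // INT | '(' expr ')' ; the parenthesized case recurses back to the top level
    private int atom() {
        if (pos < src.length() && peek() == '(') {
            pos++; // consume '('
            int value = addSub();
            expect(')');
            return value;
        }
        int start = pos;
        while (pos < src.length() && Character.isDigit(peek())) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("expected a number at position " + pos);
        }
        return Integer.parseInt(src.substring(start, pos));
    }

    private char peek() {
        return src.charAt(pos);
    }

    private void expect(char c) {
        if (pos >= src.length() || peek() != c) {
            throw new IllegalArgumentException("expected '" + c + "' at position " + pos);
        }
        pos++;
    }

    public static void main(String[] args) {
        System.out.println(eval("1+3*4"));   // 13
        System.out.println(eval("(1+3)*4")); // 16
        System.out.println(eval("10-4-3"));  // 3
    }
}
```

Writing this by hand for a full language quickly becomes tedious, which is the work a parser generator takes over.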

When defining ANTLR4 rules, note that one string may match several lexer rules at once, as with the following two rules:

ID: [a-zA-Z]+;
FROM: 'from';

The string "from" satisfies both rules. The ANTLR4 lexer prefers the rule that matches the longest input, and when several rules match the same text it picks the one defined first. Here ID is defined before FROM, so the string from is matched as an ID. For this reason keywords are usually defined before the identifier rule.
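The effect of rule order can be imitated with a small sketch. This is not ANTLR4's lexer, only a hypothetical tokenizer built on java.util.regex that picks the longest match and, on a tie, the rule listed first; the class name RuleOrderDemo and the rule tables are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical imitation of a "longest match, then first rule wins" policy.
class RuleOrderDemo {
    // Returns the name of the rule chosen for the text at the start of the input.
    static String firstToken(Map<String, String> rules, String input) {
        String bestRule = null;
        int bestLen = 0;
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            Matcher m = Pattern.compile(rule.getValue()).matcher(input);
            // lookingAt() anchors the match at the beginning of the input
            if (m.lookingAt() && m.end() > bestLen) {
                // Only a strictly longer match replaces the current choice,
                // so on equal length the rule defined first is kept.
                bestLen = m.end();
                bestRule = rule.getKey();
            }
        }
        return bestRule;
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves insertion order, i.e. the order of definition
        Map<String, String> idFirst = new LinkedHashMap<>();
        idFirst.put("ID", "[a-zA-Z]+");
        idFirst.put("FROM", "from");

        Map<String, String> keywordFirst = new LinkedHashMap<>();
        keywordFirst.put("FROM", "from");
        keywordFirst.put("ID", "[a-zA-Z]+");

        System.out.println(firstToken(idFirst, "from"));         // ID
        System.out.println(firstToken(keywordFirst, "from"));    // FROM
        System.out.println(firstToken(keywordFirst, "fromage")); // ID, the longer match
    }
}
```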

In fact, once the g4 file is written, ANTLR4 has done about half of the work for us: it provides the whole framework and the interfaces, and the remaining development work is to write concrete implementations of those interfaces or abstract classes. There are two ways to process the generated parse tree: the Visitor mode and the Listener mode.

3.2.1 Using the Visitor mode

Step 2: Use the ANTLR4 tool to process the g4 file and generate code. That is, the ANTLR tool parses the g4 file and automatically generates the base code for us. The flow diagram is as follows:

[Figure: code generation flow of the ANTLR4 tool]
The command line is as follows:

antlr4 -package org.example.calc -no-listener -visitor .\LabeledExpr.g4

After the command is executed, the generated files are as follows:


$ tree .
.
├── LabeledExpr.g4
├── LabeledExpr.tokens
├── LabeledExprBaseVisitor.java
├── LabeledExprLexer.java
├── LabeledExprLexer.tokens
├── LabeledExprParser.java
└── LabeledExprVisitor.java

First, develop the entry class Calc.java. The Calc class is the entry point of the whole program; the core code that calls the ANTLR4-generated lexer and parser classes is as follows:

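// "is" is the InputStream to read expressions from, e.g. System.in.
// ANTLRInputStream is deprecated since ANTLR 4.7; CharStreams.fromStream(is) is the replacement.
ANTLRInputStream input = new ANTLRInputStream(is);
LabeledExprLexer lexer = new LabeledExprLexer(input);     // characters -> tokens
CommonTokenStream tokens = new CommonTokenStream(lexer);
LabeledExprParser parser = new LabeledExprParser(tokens); // tokens -> parse tree
ParseTree tree = parser.prog(); // parse, starting at rule prog
 
EvalVisitor eval = new EvalVisitor();
eval.visit(tree); // evaluate by visiting the tree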

Next, define a class EvalVisitor that extends LabeledExprBaseVisitor. The methods to override are as follows:

[Figure: methods overridden in the EvalVisitor class]
As the figure shows, the generated methods correspond one-to-one to the labeled alternatives of the grammar. For example, visitAddSub corresponds to the AddSub label and visitId corresponds to the id label, and so on. The code that implements addition and subtraction is as follows:


/** expr op=('+'|'-') expr */
@Override
public Integer visitAddSub(LabeledExprParser.AddSubContext ctx) {
    int left = visit(ctx.expr(0));  // get value of left subexpression
    int right = visit(ctx.expr(1)); // get value of right subexpression
    if ( ctx.op.getType() == LabeledExprParser.ADD ) return left + right;
    return left - right; // must be SUB
}

This is quite intuitive. Once the code is written, run the main function of Calc, enter an expression on the interactive command line, and press Ctrl+D to see the result. For example, entering 1+3*4 yields 13.
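For readers without the generated classes at hand, the visitor idea can be sketched without ANTLR4. The Node, Num, BinOp, and Visitor types below are hypothetical stand-ins for the generated context classes and the LabeledExprVisitor interface; only the traversal pattern is the point.

```java
// Hypothetical stand-ins for ANTLR4's generated context classes and visitor interface.
class VisitorSketch {
    interface Visitor<T> {
        T visitNum(Num node);
        T visitBinOp(BinOp node);
    }

    interface Node {
        <T> T accept(Visitor<T> visitor);
    }

    static final class Num implements Node {
        final int value;
        Num(int value) { this.value = value; }
        public <T> T accept(Visitor<T> visitor) { return visitor.visitNum(this); }
    }

    static final class BinOp implements Node {
        final char op;
        final Node left;
        final Node right;
        BinOp(char op, Node left, Node right) {
            this.op = op;
            this.left = left;
            this.right = right;
        }
        public <T> T accept(Visitor<T> visitor) { return visitor.visitBinOp(this); }
    }

    // The visitor decides itself when to descend into children, and each method returns a value.
    static final class EvalSketchVisitor implements Visitor<Integer> {
        public Integer visitNum(Num node) {
            return node.value;
        }

        public Integer visitBinOp(BinOp node) {
            int left = node.left.accept(this);   // explicitly visit the left child
            int right = node.right.accept(this); // explicitly visit the right child
            switch (node.op) {
                case '+': return left + right;
                case '-': return left - right;
                case '*': return left * right;
                case '/': return left / right;
                default: throw new UnsupportedOperationException("unsupported operator: " + node.op);
            }
        }
    }

    public static void main(String[] args) {
        // Tree for 1+3*4: the multiplication sits deeper in the tree, so it is evaluated first
        Node tree = new BinOp('+', new Num(1), new BinOp('*', new Num(3), new Num(4)));
        System.out.println(tree.accept(new EvalSketchVisitor())); // 13
    }
}
```

If visitBinOp did not call accept on a child, that whole subtree would simply never be evaluated; this is the control a visitor has over the traversal.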

3.2.2 Using Listener mode

Similarly, we can also use the Listener mode to implement the four arithmetic operations. The command line is as follows:

antlr4 -package org.example.calc -listener .\LabeledExpr.g4

This command also generates skeleton code for us, on top of which we develop the entry class and the interface implementation class. First, develop the entry class Calc.java. The Calc class is the entry point of the whole program; the code that calls the ANTLR4-generated lexer and parser classes is as follows:

ANTLRInputStream input = new ANTLRInputStream(is);
LabeledExprLexer lexer = new LabeledExprLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
LabeledExprParser parser = new LabeledExprParser(tokens);
ParseTree tree = parser.prog(); // parse
 
ParseTreeWalker walker = new ParseTreeWalker();
walker.walk(new EvalListener(), tree);
	

As you can see, the logic that produces the ParseTree is exactly the same. The Listener implementation is slightly more complicated: it also needs a stack, but only one operand stack, and it does not need to control operator precedence itself. Take AddSub as an example:


@Override
public void exitAddSub(LabeledExprParser.AddSubContext ctx) {
    // The right operand was pushed last, so it is popped first
    Double right = numStack.pop();
    Double left = numStack.pop();
    Double result;
    if (ctx.op.getType() == LabeledExprParser.ADD) {
        result = left + right;
    } else {
        result = left - right;
    }
    numStack.push(result);
}

Simply pop the operands from the stack and perform the operation.
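The listener style can also be sketched without ANTLR4. The small walk method below is a hypothetical stand-in for ParseTreeWalker: it visits children depth-first and fires an exit callback, while the listener keeps intermediate results on its own operand stack. All type names here are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a tree walker plus an exit listener, for illustration only.
class ListenerSketch {
    interface Node {}

    static final class Num implements Node {
        final double value;
        Num(double value) { this.value = value; }
    }

    static final class BinOp implements Node {
        final char op;
        final Node left;
        final Node right;
        BinOp(char op, Node left, Node right) {
            this.op = op;
            this.left = left;
            this.right = right;
        }
    }

    // Callbacks return nothing, so results must be kept in the listener's own state.
    interface Listener {
        void exitNum(Num node);
        void exitBinOp(BinOp node);
    }

    // The walker, not the listener, drives the depth-first traversal.
    static void walk(Listener listener, Node node) {
        if (node instanceof BinOp) {
            BinOp bin = (BinOp) node;
            walk(listener, bin.left);
            walk(listener, bin.right);
            listener.exitBinOp(bin);
        } else {
            listener.exitNum((Num) node);
        }
    }

    static final class EvalSketchListener implements Listener {
        final Deque<Double> numStack = new ArrayDeque<>();

        public void exitNum(Num node) {
            numStack.push(node.value);
        }

        public void exitBinOp(BinOp node) {
            // The right operand was pushed last, so it is popped first
            double right = numStack.pop();
            double left = numStack.pop();
            switch (node.op) {
                case '+': numStack.push(left + right); break;
                case '-': numStack.push(left - right); break;
                case '*': numStack.push(left * right); break;
                case '/': numStack.push(left / right); break;
                default: throw new UnsupportedOperationException("unsupported operator: " + node.op);
            }
        }
    }

    static double evaluate(Node tree) {
        EvalSketchListener listener = new EvalSketchListener();
        walk(listener, tree);
        return listener.numStack.pop();
    }

    public static void main(String[] args) {
        // Tree for 10-4*2
        Node tree = new BinOp('-', new Num(10), new BinOp('*', new Num(4), new Num(2)));
        System.out.println(evaluate(tree)); // 2.0
    }
}
```

The pop order matters for subtraction and division: popping the left operand first would compute 4*2-10 style results instead of the intended ones.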

3.2.3 Summary

Regarding the difference between the Listener mode and the Visitor mode, there is a clear explanation in the book "The Definitive ANTLR 4 Reference":

Listener mode:
[Figure: Listener mode]
Visitor mode:

[Figure: Visitor mode]

  • In Listener mode, the walker object drives the traversal, so the listener does not need to care about the parent-child relationships of the syntax tree. A Visitor must itself decide which child nodes to visit; if it skips a child node, the whole subtree under that node is not visited.
  • Methods in Listener mode have no return value, while methods in Visitor mode can return any type of value.
  • The Listener mode's traversal is managed explicitly by the walker, whereas the Visitor mode traverses through nested method calls on the call stack; an implementation error there may lead to a StackOverflowError.

Through this simple example, we used ANTLR4 to implement a simple calculator and learned the ANTLR4 workflow: how to define a g4 grammar file, and how the Visitor mode and Listener mode work. With ANTLR4 we generated a ParseTree, traversed it in both Visitor mode and Listener mode, and implemented the four arithmetic operations.

These examples show that without ANTLR4 we can achieve the same function with our own algorithm. With ANTLR, however, we do not need to care about how the expression string is parsed and can focus on the business logic, which saves a lot of effort.

More importantly, ANTLR4 offers a more general abstraction than a hand-written implementation. It rises to the level of methodology, because it solves a whole class of problems rather than one particular problem. Compared with hard-coding a solution, ANTLR is like calculus compared with ordinary area formulas in mathematics.

4. Developing a SQL parser with reference to the Presto source code

The four arithmetic operations were introduced above in order to understand how ANTLR4 is applied. Now we reveal our real purpose: to study how Presto uses ANTLR4 to parse SQL statements.

Supporting the complete SQL syntax is a huge project. Presto has a complete SqlBase.g4 file that defines all the SQL syntax Presto supports, covering both DDL and DML. The file is large, which makes it unsuitable for studying one specific detail.

To explore how SQL is parsed and understand the logic behind SQL execution, I chose to run my own coding experiments after a brief reading of the relevant documents. To that end, I set a small goal: implement a SQL parser that supports the select field from table syntax and queries the specified fields from a local csv data source.

4.1 Trimming the SqlBase.g4 file

Following the same process as for the four arithmetic operations, first define the SqlBase.g4 file. With the Presto source code as a reference, we do not need to write the grammar from scratch; we only need to trim the Presto g4 file. The trimmed content is as follows:


grammar SqlBase;
 
tokens {
    DELIMITER
}
 
singleStatement
    : statement EOF
    ;
 
statement
    : query                                                            #statementDefault
    ;
 
query
    :  queryNoWith
    ;
 
queryNoWith:
      queryTerm
    ;
 
queryTerm
    : queryPrimary                                                             #queryTermDefault
    ;
 
queryPrimary
    : querySpecification                   #queryPrimaryDefault
    ;
 
querySpecification
    : SELECT  selectItem (',' selectItem)*
      (FROM relation (',' relation)*)?
    ;
 
selectItem
    : expression  #selectSingle
    ;
 
relation
    :  sampledRelation                             #relationDefault
    ;
 
expression
    : booleanExpression
    ;
 
booleanExpression
    : valueExpression             #predicated
    ;
 
valueExpression
    : primaryExpression                                                                 #valueExpressionDefault
    ;
 
primaryExpression
    : identifier                                                                          #columnReference
    ;
 
sampledRelation
    : aliasedRelation
    ;
 
aliasedRelation
    : relationPrimary
    ;
 
relationPrimary
    : qualifiedName                                                   #tableName
    ;
 
qualifiedName
    : identifier ('.' identifier)*
    ;
 
identifier
    : IDENTIFIER             #unquotedIdentifier
    ;
 
SELECT: 'SELECT';
FROM: 'FROM';
 
fragment DIGIT
    : [0-9]
    ;
 
fragment LETTER
    : [A-Z]
    ;
 
IDENTIFIER
    : (LETTER | '_') (LETTER | DIGIT | '_' | '@' | ':')*
    ;
 
WS
    : [ \r\n\t]+ -> channel(HIDDEN)
    ;
 
// Catch-all for anything we can't recognize.
// We use this to be able to ignore and recover all the text
// when splitting statements with DelimiterLexer
UNRECOGNIZED
    : .
    ;

Compared with the more than 700 lines of rules in the Presto source code, we have cut the file down to about 1/10 of its size. The core rule of this file is: SELECT selectItem (',' selectItem)* (FROM relation (',' relation)*)?

Note that the lexer rules SELECT, FROM, and LETTER match only uppercase letters. Presto's SqlParser wraps the input in a CaseInsensitiveStream, which presents characters to the lexer in uppercase, so lowercase SQL is still recognized.

Understanding the g4 file also gives us a clearer picture of how a query statement is composed. For example, the most common data source of a query is a table, but in the SQL grammar the source of a query is abstracted as a relation.

A relation may come from a specific table, a subquery, a JOIN, data sampling, or the unnest of an expression. In the big data field, such extensions greatly facilitate data processing.

For example, the unnest syntax can be used to parse data of complex types. The SQL is as follows:

[Figure: example SQL statement using unnest]
Although the SQL is relatively complex, understanding the g4 file makes its structure clear. Back to the SqlBase.g4 file: we again use the antlr4 command to process the g4 file and generate code:

antlr4 -package org.example.antlr -no-listener -visitor .\SqlBase.g4

This generates the basic skeleton code. The next step is to implement the business logic ourselves.

4.2 Traversing the syntax tree to encapsulate SQL structure information

Next, define the node types of the syntax tree based on the SQL syntax, as shown in the following figure.

[Figure: class diagram of the syntax tree node types]
This class diagram shows the basic elements of the SQL syntax clearly.

Then implement our own parsing class AstBuilder based on the Visitor mode (to keep things simple, it is again trimmed from the Presto source code). Take the code that handles the querySpecification rule as an example:


@Override
public Node visitQuerySpecification(SqlBaseParser.QuerySpecificationContext context)
{
    Optional<Relation> from = Optional.empty();
    List<SelectItem> selectItems = visit(context.selectItem(), SelectItem.class);
 
    List<Relation> relations = visit(context.relation(), Relation.class);
    if (!relations.isEmpty()) {
        // Presto synthesizes implicit join nodes here; this trimmed version keeps only the first relation
        Iterator<Relation> iterator = relations.iterator();
        Relation relation = iterator.next();
 
        from = Optional.of(relation);
    }
 
    return new QuerySpecification(
            getLocation(context),
            new Select(getLocation(context.SELECT()), false, selectItems),
            from);
}

This code extracts the queried data source and the selected fields and encapsulates them in a QuerySpecification object.
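To see what such an object holds, here is a simplified, hypothetical sketch that recognizes the same core rule by hand and fills a small QuerySpec object. It is neither Presto's AstBuilder nor ANTLR4-generated code: MiniSelectParser and QuerySpec are assumed names, only one relation is accepted, and keywords are compared case-insensitively instead of using a case-insensitive character stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical hand-written recognizer for: SELECT selectItem (',' selectItem)* (FROM relation)?
class MiniSelectParser {
    // Simplified stand-in for a QuerySpecification node
    static final class QuerySpec {
        final List<String> selectItems;
        final Optional<String> from;

        QuerySpec(List<String> selectItems, Optional<String> from) {
            this.selectItems = selectItems;
            this.from = from;
        }
    }

    // Tokens: identifiers (optionally dotted) or a single comma; other characters are ignored
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z_][A-Za-z0-9_.]*|,");

    static QuerySpec parse(String sql) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(sql);
        while (m.find()) {
            tokens.add(m.group());
        }
        int pos = 0;
        if (tokens.isEmpty() || !tokens.get(pos++).equalsIgnoreCase("SELECT")) {
            throw new IllegalArgumentException("expected SELECT");
        }
        // selectItem (',' selectItem)*
        List<String> items = new ArrayList<>();
        items.add(identifier(tokens, pos++));
        while (pos < tokens.size() && tokens.get(pos).equals(",")) {
            pos++; // consume ','
            items.add(identifier(tokens, pos++));
        }
        // (FROM relation)?
        Optional<String> from = Optional.empty();
        if (pos < tokens.size() && tokens.get(pos).equalsIgnoreCase("FROM")) {
            pos++; // consume FROM
            from = Optional.of(identifier(tokens, pos++));
        }
        if (pos != tokens.size()) {
            throw new IllegalArgumentException("unexpected token: " + tokens.get(pos));
        }
        return new QuerySpec(items, from);
    }

    private static String identifier(List<String> tokens, int pos) {
        if (pos >= tokens.size() || tokens.get(pos).equals(",")) {
            throw new IllegalArgumentException("expected an identifier");
        }
        return tokens.get(pos);
    }

    public static void main(String[] args) {
        QuerySpec spec = parse("select City, State from cities");
        System.out.println(spec.selectItems); // [City, State]
        System.out.println(spec.from.get());  // cities
    }
}
```

Unlike a generated parser, this sketch does not reserve keywords or report unrecognized characters, which is part of what the grammar-driven approach provides for free.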

4.3 Using the Statement object to query data

From the earlier four-operation example, we know that ANTLR parses the user's input into a ParseTree, and developers implement the relevant interfaces to process that ParseTree. Presto parses the input SQL statement into a ParseTree, traverses it, and finally produces a Statement object. The core code is as follows:

SqlParser sqlParser = new SqlParser();
Statement statement = sqlParser.createStatement(sql);

How do we use the Statement object? Combined with the class diagram above, we can see that:

  • A Statement of type Query has a QueryBody property.

  • A QueryBody of type QuerySpecification has a select property and a from property.

This structure gives us everything needed to implement the select query:

  • Get the target table (Table) from the from property. By convention, the table name is the same as the csv file name.

  • Get the target fields (SelectItem) from the select property. By convention, the first line of the csv file is the header line.

The whole business process is now clear. After parsing the SQL statement into a Statement object, proceed as follows:

s1: Get the table and fields to query.
s2: Locate the data file by the table name and read its data.
s3: Print the formatted field names to the command line.
s4: Print the formatted field contents to the command line.

To keep the logic simple, the code handles only the main path and does not handle exceptions.


/**
 * Get the table name and field names to query
 */
Query query = (Query) statement; // assumption: the parsed Statement is a Query
QuerySpecification specification = (QuerySpecification) query.getQueryBody();
Table table = (Table) specification.getFrom().get();
List<SelectItem> selectItems = specification.getSelect().getSelectItems();
List<String> fieldNames = Lists.newArrayList();
for (SelectItem item : selectItems) {
    SingleColumn column = (SingleColumn) item;
    fieldNames.add(((Identifier) column.getExpression()).getValue());
}
 
/**
 * Determine the data source file from the table name
 */
String fileLoc = String.format("./data/%s.csv", table.getName());
 
/**
 * Read the specified fields from the csv file
 */
Reader in = new FileReader(fileLoc);
Iterable<CSVRecord> records = CSVFormat.RFC4180.withFirstRecordAsHeader().parse(in);
List<Row> rowList = Lists.newArrayList();
for (CSVRecord record : records) {
    Row row = new Row();
    for (String field : fieldNames) {
        row.addColumn(record.get(field));
    }
    rowList.add(row);
}
 
/**
 * Print the formatted result to the console
 */
int width = 30;
String format = fieldNames.stream().map(s -> "%-" + width + "s").collect(Collectors.joining("|"));
System.out.println("|" + String.format(format, fieldNames.toArray()) + "|");
 
int flagCnt = width * fieldNames.size() + fieldNames.size();
String rowDelimiter = String.join("", Collections.nCopies(flagCnt, "-"));
System.out.println(rowDelimiter);
for (Row row : rowList) {
    System.out.println("|" + String.format(format, row.getColumnList().toArray()) + "|");
}

The code is for demonstration only and ignores error cases for now, such as a queried field that does not exist or a csv header whose field names do not meet the requirements.
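The read-select-print steps can also be tried with the JDK alone. The following is a simplified sketch, not the article's code: it reads the csv from a string instead of a file, splits on commas without handling commas inside quotes, and the names CsvSelectSketch and select are hypothetical. It also shows one way to report a queried field that does not exist.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical JDK-only sketch of reading, selecting, and printing; real CSV quoting rules are not handled.
class CsvSelectSketch {
    // Returns the formatted header line, a separator line, and one line per data row.
    static List<String> select(String csv, List<String> fieldNames, int width) {
        String[] lines = csv.trim().split("\\R");
        List<String> header = splitRow(lines[0]);

        // Resolve each requested field to its column index in the header row
        int[] indexes = new int[fieldNames.size()];
        for (int i = 0; i < indexes.length; i++) {
            indexes[i] = header.indexOf(fieldNames.get(i));
            if (indexes[i] < 0) {
                throw new IllegalArgumentException("unknown field: " + fieldNames.get(i));
            }
        }

        List<String> out = new ArrayList<>();
        out.add(formatRow(fieldNames, width));
        // One dash per character of a formatted row
        int rowLength = width * fieldNames.size() + fieldNames.size() + 1;
        out.add(String.join("", Collections.nCopies(rowLength, "-")));
        for (int r = 1; r < lines.length; r++) {
            List<String> cells = splitRow(lines[r]);
            List<String> picked = new ArrayList<>();
            for (int index : indexes) {
                picked.add(cells.get(index));
            }
            out.add(formatRow(picked, width));
        }
        return out;
    }

    // Naive split: trims spaces and strips one pair of surrounding double quotes
    private static List<String> splitRow(String line) {
        List<String> cells = new ArrayList<>();
        for (String cell : line.split(",")) {
            cells.add(cell.trim().replaceAll("^\"|\"$", ""));
        }
        return cells;
    }

    // Left-aligns every cell in a fixed-width column delimited by '|'
    private static String formatRow(List<String> cells, int width) {
        StringBuilder sb = new StringBuilder("|");
        for (String cell : cells) {
            sb.append(String.format("%-" + width + "s", cell)).append("|");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String csv = "\"City\",\"State\"\n\"Youngstown\", OH\n\"Yankton\", SD\n";
        for (String line : select(csv, Arrays.asList("City", "State"), 12)) {
            System.out.println(line);
        }
    }
}
```

A production version would keep a real CSV library for parsing, since quoted fields may legitimately contain commas and line breaks.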

4.4 Results

The data directory of our project contains the following csv files:
[Figure: csv files in the data directory]
The sample data of the cities.csv file is as follows:


"LatD","LatM","LatS","NS","LonD","LonM","LonS","EW","City","State"
   41,    5,   59, "N",     80,   39,    0, "W", "Youngstown", OH
   42,   52,   48, "N",     97,   23,   23, "W", "Yankton", SD
   46,   35,   59, "N",    120,   30,   36, "W", "Yakima", WA
   42,   16,   12, "N",     71,   48,    0, "W", "Worcester", MA

Run the code and use SQL statements to query the specified fields from the csv files. The resulting SQL-like queries look as follows:

SQL sample 1: select City, City from cities

[Figure: query result of SQL sample 1]
SQL sample 2: select name, age from employee
[Figure: query result of SQL sample 2]
This section described how to trim the g4 rule file from the Presto source code and then, based on ANTLR4, use SQL statements to query data from csv files. Running coding experiments on a trimmed version of the Presto source code helps in understanding the Presto source and in studying how a SQL engine is implemented.

5. Summary

This article explained the ideas and process of applying ANTLR4 in project development through two cases: the four arithmetic operations and querying csv data with SQL. The relevant code is available on GitHub. Understanding ANTLR4 helps in understanding the definition rules and execution process of SQL and in writing efficient SQL statements in business development. It also helps in understanding compiler principles, defining your own DSL, and abstracting business logic. As the saying goes, what is learned on paper always feels shallow; to truly understand something, you must practice it yourself. Studying source code with the method described in this article is also a pleasure.

6. References

1. "The Definitive ANTLR 4 Reference"

2. Presto official documents

https://prestodb.io/docs/current/

3. "ANTLR 4 Concise Tutorial"

https://wizardforcel.gitbooks.io/antlr4-short-course/content/calculator-listener.html

4. Calc class source code

5. EvalVisitor class source code

6. Presto source code

https://github.com/prestodb/presto


github:

https://github.com/shgy/db-practice

Origin blog.csdn.net/qq_31557939/article/details/128273214