Open-source SQL analysis tool: Apache Calcite

Concept

Apache Calcite is an open-source SQL analysis tool. It can parse a variety of SQL statements into an abstract syntax tree (AST), and then, by operating on the AST, turn the relational algebra expressed in SQL into concrete code.

Calcite was formerly known as Optiq (and before that, Farrago). It is written in Java, and after roughly ten years of development it became a top-level Apache project in 2013 and continues to evolve. The project's founder is Julian Hyde, who has many years of experience in SQL engine development; he currently works at Hortonworks, where he is responsible for developing and maintaining the Calcite project.

Currently, Hive, Drill, Flink, Phoenix, and Storm all use Calcite as their SQL parsing and processing engine, and it is safe to say that more and more data-processing engines will adopt Calcite as their SQL parsing tool.

 

Features

In summary, Calcite provides the following main features:

  • SQL parsing
  • SQL validation
  • Query optimization
  • SQL generation
  • Data connectivity

Steps in Calcite's SQL parsing

In general, Calcite parses and processes SQL in the following steps:

  • Parser. In this step Calcite uses JavaCC to parse SQL into an unvalidated AST.
  • Validate. The main task of this step is to verify that the AST produced by the Parser step is legal: it checks that the schemas, tables, fields, and functions referenced by the SQL exist, that the statement itself is legal, and so on. When this step completes, a RelNode tree is generated (on RelNode trees, see below).
  • Optimize. The main task of this step is to optimize the RelNode tree and convert it into a physical execution plan. It mainly applies SQL optimization rules, both rule-based optimization (RBO) and cost-based optimization (CBO). In principle this step is optional, since the RelNode tree produced after Validate can already be converted directly into a physical plan, but modern SQL parsers essentially all include it in order to optimize the execution plan. The result of this step is a physical execution plan.
  • Execute, i.e. the execution phase. This phase converts the physical execution plan into a program that can run on a particular platform. Hive and Flink, for example, use code generation (CodeGen) at this stage to produce executable code from the physical plan. (A runnable example covering the first three phases appears in the "Using Calcite" section below.)

 

Calcite components

Calcite mainly involves the following concepts:

  • Catalog: defines the metadata and namespaces needed for SQL semantics.
  • SQL Parser: converts SQL into an AST (SqlNode).
  • SQL Validator: validates the AST against the Catalog.
  • Query Optimizer: converts the AST into a logical plan (RelNode tree) and optimizes it into a physical execution plan.
  • SQL Generator: converts a physical plan back into a SQL statement (the reverse direction).


1) Catalog

The Catalog mainly defines the namespaces that SQL accesses, including the following:

  1. Schema: defines a collection of schemas and tables. A schema is not strictly mandatory, but if, for example, there are two tables with the same name T1, a schema is needed to distinguish the two, e.g. A.T1 and B.T1 (see the sketch below).
  2. Table: corresponds to a table in a relational database and represents one class of data; in Calcite its row shape is defined by RelDataType.
  3. RelDataType: represents the data definition of a table, such as the names and types of the table's columns.
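
For instance, a minimal sketch (using Calcite's AbstractSchema helper; the schema names A and B are illustrative) of registering two sub-schemas so that same-named tables can be disambiguated:

import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.schema.impl.AbstractSchema;
import org.apache.calcite.tools.Frameworks;

public class SchemaDemo {
  public static void main(String[] args) {
    // Register two sub-schemas A and B; a table named T1 added to each
    // would then be referenced in SQL as A.T1 and B.T1.
    SchemaPlus rootSchema = Frameworks.createRootSchema(true);
    SchemaPlus a = rootSchema.add("A", new AbstractSchema());
    SchemaPlus b = rootSchema.add("B", new AbstractSchema());
  }
}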

 

Schema:

public interface Schema {

  /** Returns a table with the given name, or null if not found. */
  Table getTable(String name);

  /** Names of the tables in this schema. */
  Set<String> getTableNames();

  /** Names of the functions in this schema. */
  Set<String> getFunctionNames();

  /** Returns a sub-schema with the given name, or null. */
  Schema getSubSchema(String name);

  /** Names of this schema's child schemas. */
  Set<String> getSubSchemaNames();

  /** Expression by which this schema can be referenced in generated code. */
  Expression getExpression(SchemaPlus parentSchema, String name);

  /** Whether tables may be added or removed at run time. */
  boolean isMutable();
}

Table:
public interface Table {

  /** Returns this table's row type (column names and types). */
  RelDataType getRowType(RelDataTypeFactory typeFactory);

  /** Statistics about this table (e.g. row count), used by the CBO. */
  Statistic getStatistic();

  /** Type of table, e.g. TABLE or VIEW. */
  Schema.TableType getJdbcTableType();
}

Here RelDataType represents the type of a row of data, and Statistic provides statistics about the table, in particular the row count that the CBO uses to compute the cost of operations on the table.
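
For illustration, here is a minimal sketch of a custom Table built on Calcite's AbstractTable helper (the class and column names are hypothetical; AbstractTable supplies default getStatistic() and getJdbcTableType() implementations):

import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.type.SqlTypeName;

public class StudentTable extends AbstractTable {
  // Define the row type: the column names and types of the table.
  @Override
  public RelDataType getRowType(RelDataTypeFactory typeFactory) {
    return typeFactory.builder()
        .add("ID", SqlTypeName.BIGINT)
        .add("NAME", SqlTypeName.VARCHAR)
        .add("AGE", SqlTypeName.INTEGER)
        .build();
  }
}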

2) SQL Parser

The SQL Parser is written with JavaCC and converts SQL into an AST.

  • JavaCC stands for Java Compiler Compiler; it converts a grammar for a particular domain-specific language into Java code.
  • Calcite represents tokens (Token) as SqlNode, and a SqlNode can be converted back into SQL by its unparse method (see the round-trip sketch after the grammar fragment below).

For example, the expression

cast(id as float)

can be expressed in the JavaCC grammar as:

<CAST>
<LPAREN>
e = Expression(ExprContext.ACCEPT_SUBQUERY)
<AS>
dt = DataType() {args.add(dt);}
<RPAREN>
....
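
To see both directions in action, the following minimal sketch parses a statement and then converts the resulting SqlNode back into SQL text (the table and column names are illustrative):

import org.apache.calcite.sql.SqlDialect;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParseException;
import org.apache.calcite.sql.parser.SqlParser;

public class ParseUnparse {
  public static void main(String[] args) throws SqlParseException {
    // Parse the SQL text into an AST of SqlNodes.
    SqlParser parser = SqlParser.create("select cast(id as float) from t");
    SqlNode ast = parser.parseQuery();
    // toSqlString drives unparse, turning the SqlNode tree back into SQL.
    System.out.println(ast.toSqlString(SqlDialect.CALCITE));
  }
}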

3) Query Optimizer

First, look at the following SQL:

INSERT INTO tmp_node
SELECT s1.id1, s1.id2, s2.val1
FROM source1 as s1 INNER JOIN source2 AS s2
ON s1.id1 = s2.id1 and s1.id2 = s2.id2 where s1.val1 > 5 and s2.val2 = 3; 

Calcite converts it into:

LogicalTableModify(table=[[TMP_NODE]], operation=[INSERT], flattened=[false])
  LogicalProject(ID1=[$0], ID2=[$1], VAL1=[$7])
    LogicalFilter(condition=[AND(>($2, 5), =($8, 3))])
      LogicalJoin(condition=[AND(=($0, $5), =($1, $6))], joinType=[INNER])
        LogicalTableScan(table=[[SOURCE1]])
        LogicalTableScan(table=[[SOURCE2]])

This is the unoptimized RelNode tree. At the bottom are the TableScans, which read the raw tables; above them is a LogicalJoin with join type INNER; then a LogicalFilter is applied to the join result, corresponding to the WHERE condition in the SQL; finally, a Project performs the projection.

But observe that for an INNER JOIN, the WHERE condition can be pushed down below the join:

LogicalTableModify(table=[[TMP_NODE]], operation=[INSERT], flattened=[false])
  LogicalProject(ID1=[$0], ID2=[$1], VAL1=[$7])
    LogicalJoin(condition=[AND(=($0, $5), =($1, $6))], joinType=[inner])
      LogicalFilter(condition=[>($3, 5)])
        LogicalProject(ID1=[$0], ID2=[$1], ID3=[$2], VAL1=[$3], VAL2=[$4], VAL3=[$5])
          LogicalTableScan(table=[[SOURCE1]])
      LogicalFilter(condition=[=($4, 3)])
        LogicalProject(ID1=[$0], ID2=[$1], ID3=[$2], VAL1=[$3], VAL2=[$4], VAL3=[$5])
          LogicalTableScan(table=[[SOURCE2]])

This reduces the amount of data entering the JOIN and improves the efficiency of the SQL.

In practice, conditions in the JOIN clause itself can also be pushed down so that the JOIN processes less data:

INSERT INTO tmp_node
SELECT s1.id1, s1.id2, s2.val1
FROM source1 as s1 LEFT JOIN source2 AS s2
ON s1.id1 = s2.id1 and s1.id2 = s2.id2 and s1.id3 = 5

The condition s1.id3 = 5 can be pushed down to filter s1's data first. In certain scenarios, however, a condition cannot be pushed down, as in the following SQL:

INSERT INTO tmp_node
SELECT s1.id1, s1.id2, s2.val1
FROM source1 as s1 LEFT JOIN source2 AS s2
ON s1.id1 = s2.id1 and s1.id2 = s2.id2 and s2.id3 = 5

If s1 and s2 are stream tables (see the concept of dynamic tables in Flink's streaming model), the condition cannot be pushed down: after the filter there may be no data left to drive the join, so the desired result cannot be obtained (see Flink / Spark Streaming).

This raises a question: under what circumstances can such push-down (or pull-up) operations be performed, and according to what principles? Consider:

T1 JOIN T2 JOIN T3

Should a JOIN sequence like this be executed as (T1 JOIN T2) JOIN T3 or as T1 JOIN (T2 JOIN T3)? This depends on the method the Optimizer uses. The Optimizer's main purpose is to reduce the amount of data SQL processes, reduce resource consumption, and maximize efficiency, for example by pruning unused columns, merging projections, converting subqueries into JOINs, reordering JOINs, pushing down projections, and pushing down filters. There are two main classes of optimization method: rule-based (RBO) and cost-based (CBO).

  1. RBO (Rule-Based Optimization)

Put simply, a set of rules is defined in advance, and the execution plan is optimized according to those rules. For example:

  • ProjectFilterRule

    This rule applies when a Filter sits on top of a Project: the Filter can be pushed down below the Project. A RelNode tree such as

 

    LogicalFilter
      LogicalProject
        LogicalTableScan

Can be optimized to

 

    LogicalProject
      LogicalFilter
        LogicalTableScan
  • FilterJoinRule

    This rule applies when a Filter sits on top of a Join: the Filter can be executed first, before the Join, reducing the amount of data the Join must process.

And so on; there are many similar rules. But to some extent RBO relies on experience: there is no formula for determining which plan is better optimized. In Calcite, rule-based optimization is implemented by the HepPlanner, as sketched below.
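
A minimal sketch of driving the HepPlanner, assuming relNode is the root of a RelNode tree obtained from an earlier planning step (e.g. Planner.rel()); in Calcite's codebase the two rules described above correspond to the classes FilterProjectTransposeRule and FilterJoinRule:

import org.apache.calcite.plan.hep.HepPlanner;
import org.apache.calcite.plan.hep.HepProgram;
import org.apache.calcite.plan.hep.HepProgramBuilder;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rules.FilterJoinRule;
import org.apache.calcite.rel.rules.FilterProjectTransposeRule;

public class RboDemo {
  // Registers the two rules, then lets HepPlanner rewrite the tree.
  public static RelNode optimizeWithRules(RelNode relNode) {
    HepProgram program = new HepProgramBuilder()
        .addRuleInstance(FilterProjectTransposeRule.INSTANCE) // Filter past Project
        .addRuleInstance(FilterJoinRule.FILTER_ON_JOIN)       // Filter below/into Join
        .build();
    HepPlanner hepPlanner = new HepPlanner(program);
    hepPlanner.setRoot(relNode);
    return hepPlanner.findBestExp();
  }
}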

  2. CBO (Cost-Based Optimization)

Put simply: compute a "cost" for the possible SQL execution plans using some algorithm, and select an execution plan with a lower cost. For the three-table JOIN described above, RBO generally cannot judge which execution plan is better optimized; the only way to decide is to compute the cost of each JOIN order.

Calcite converts each operation (e.g. LogicalJoin, LogicalFilter, LogicalProject, LogicalTableScan), combined with the actual row counts of the schema, into a concrete cost, compares the costs of the different execution plans, and selects a relatively cheap plan as the final result. It is only a relatively cheap plan because exhaustively computing the cost of every possible plan may cost more than it saves in time and resources, so the planner merely chooses a relatively good execution plan. The aim of CBO is "to avoid the worst execution plans, rather than to find the best one".

Calcite currently uses CBO for its optimization; the implementation is the VolcanoPlanner, and for the details of the algorithm you can refer to the source code. A small sketch of the statistics the CBO consumes follows.
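
As a hedged illustration of the inputs the CBO reasons about, Calcite's metadata API exposes estimated row counts and cumulative costs for any RelNode (relNode is assumed to come from an earlier planning step; in newer Calcite versions the preferred entry point is relNode.getCluster().getMetadataQuery()):

import org.apache.calcite.plan.RelOptCost;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.metadata.RelMetadataQuery;

public class CostDemo {
  public static void showCost(RelNode relNode) {
    RelMetadataQuery mq = RelMetadataQuery.instance();
    Double rowCount = mq.getRowCount(relNode);       // estimated output rows
    RelOptCost cost = mq.getCumulativeCost(relNode); // estimated cost of the subtree
    System.out.println("rows=" + rowCount + ", cost=" + cost);
  }
}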


Using Calcite

Since Calcite is written in Java, you only need to add the appropriate JAR to your project or program. The following is a runnable example:

import org.apache.calcite.adapter.java.ReflectiveSchema;
import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.RelRoot;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParser;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;

public class TestOne {
    public static class TestSchema {
        public final Triple[] rdf = {new Triple("s", "p", "o")};
    }

    public static void main(String[] args) {
        SchemaPlus schemaPlus = Frameworks.createRootSchema(true);

        // Add the table to schema "T"
        schemaPlus.add("T", new ReflectiveSchema(new TestSchema()));
        Frameworks.ConfigBuilder configBuilder = Frameworks.newConfigBuilder();
        // Set the default schema
        configBuilder.defaultSchema(schemaPlus);

        // Make SQL case-insensitive
        SqlParser.ConfigBuilder parserConfig = SqlParser.configBuilder();
        parserConfig.setCaseSensitive(false);
        configBuilder.parserConfig(parserConfig.build());

        FrameworkConfig frameworkConfig = configBuilder.build();
        Planner planner = Frameworks.getPlanner(frameworkConfig);

        SqlNode sqlNode;
        RelRoot relRoot = null;
        try {
            // Parser phase
            sqlNode = planner.parse("select \"a\".\"s\", count(\"a\".\"s\") from \"T\".\"rdf\" \"a\" group by \"a\".\"s\"");
            // Validate phase
            sqlNode = planner.validate(sqlNode);
            // Get the root of the RelNode tree
            relRoot = planner.rel(sqlNode);
        } catch (Exception e) {
            e.printStackTrace();
        }

        RelNode relNode = relRoot.project();
        System.out.println(RelOptUtil.toString(relNode));
    }
}

The Triple class that backs the table:

public class Triple {
    public String s;
    public String p;
    public String o;

    public Triple(String s, String p, String o) {
        super();
        this.s = s;
        this.p = p;
        this.o = o;
    }

}

See specific code: https://github.com/yuqi1129/calcite-test

 

Calcite MySQL demo

Configure Calcite to query the student table in the local MySQL database dbtest_1.

The model JSON is as follows:

{
  version: '1.0',
  defaultSchema: 'dbtest_1',
  schemas: [
    {
      name: 'dbtest_1',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.jdbc.JdbcSchema$Factory',
      operand: {
        jdbcDriver: 'com.mysql.jdbc.Driver',
        jdbcUrl: 'jdbc:mysql://localhost:3306/dbtest_1',
        jdbcUser: 'xxx',
        jdbcPassword: 'xxx'
      }
    }
  ]
}
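
The test code below embeds this model inline in the connection properties; alternatively, the model can be loaded from a file by passing its path in the JDBC URL (the path here is illustrative):

// Load the model from a file instead of inlining it (path is hypothetical)
Connection connection =
    DriverManager.getConnection("jdbc:calcite:model=/path/to/model.json", info);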

Test code:

package com.learn.mysql;

import org.apache.calcite.jdbc.CalciteConnection;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

/**
 * Queries a MySQL table through Calcite using an inline JSON model.
 * Created at 5:38 PM, 2018/9/15.
 */
public class QueryMysql {
  public static void main(String[] args) throws SQLException, ClassNotFoundException {
    Class.forName("com.mysql.jdbc.Driver");
    Properties info = new Properties();
    info.put("model",
      "inline:"
        + "{\n"
        + "  version: '1.0',\n"
        + "  defaultSchema: 'dbtest_1',\n"
        + "  schemas: [\n"
        + "    {\n"
        + "      name: 'dbtest_1',\n"
        + "      type: 'custom',\n"
        + "      factory: 'org.apache.calcite.adapter.jdbc.JdbcSchema$Factory',\n"
        + "      operand: {\n"
        + "        jdbcDriver: 'com.mysql.jdbc.Driver',\n"
        + "        jdbcUrl:'jdbc:mysql://localhost:3306/dbtest_1',\n"
        + "        jdbcUser: 'xxx',\n"
        + "        jdbcPassword: 'xxx'\n"
        + "      }\n"
        + "    }\n"
        + "  ]\n"
        + "}");

    Connection connection =
      DriverManager.getConnection("jdbc:calcite:", info);
    Statement statement = connection.createStatement();
    // The plain JDBC connection can be unwrapped into a CalciteConnection
    // to access Calcite-specific APIs (not used further in this demo).
    CalciteConnection calciteConnection =
      connection.unwrap(CalciteConnection.class);
    ResultSet resultSet =
      statement.executeQuery("select * from \"dbtest_1\".\"student\"");

    final StringBuilder buf = new StringBuilder();
    while (resultSet.next()) {
      int n = resultSet.getMetaData().getColumnCount();
      for (int i = 1; i <= n; i++) {
        buf.append(i > 1 ? "; " : "")
          .append(resultSet.getMetaData().getColumnLabel(i))
          .append("=")
          .append(resultSet.getObject(i));
      }
      System.out.println(buf.toString());
      buf.setLength(0);
    }
    resultSet.close();
    statement.close();
    connection.close();
  }
}

Result

id=1; age=18; name=李四
id=2; age=18; name=张三
id=4; age=12; name=studentA

Dependencies

 <dependencies>
        <dependency>
            <groupId>org.apache.calcite</groupId>
            <artifactId>calcite-core</artifactId>
            <exclusions>
                <exclusion>
                    <groupId>org.apache.calcite.avatica</groupId>
                    <artifactId>avatica-core</artifactId>
                </exclusion>
            </exclusions>
            <version>1.12.0</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-core</artifactId>
            <version>2.1.0</version>
        </dependency>

        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.1.0</version>
        </dependency>

        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-annotations</artifactId>
            <version>2.1.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.calcite.avatica/avatica -->
        <dependency>
            <groupId>org.apache.calcite.avatica</groupId>
            <artifactId>avatica</artifactId>
            <version>1.9.0</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.10</version>
        </dependency>
        <dependency>
            <groupId>org.junit.jupiter</groupId>
            <artifactId>junit-jupiter-api</artifactId>
            <version>RELEASE</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.41</version>
        </dependency>
    </dependencies>