Notes on the Apache Calcite paper

While looking into big data processing technologies and open-source products recently, I noticed that many projects mention something called Apache Calcite. Seeing it once or twice would not be surprising, but when data processing products from very different eras all mention it, it deserves attention. After digging around for material, I found that the paper published at SIGMOD 2018 is probably the best entry point, and what follows are my thoughts on and summary of that paper.

What is Calcite

The paper's title explains Calcite best: "A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources". Calcite provides standard SQL support, query optimization, and the ability to connect to a variety of data sources. Functionally it covers many typical database management system features, such as SQL parsing, SQL validation, SQL query optimization, SQL generation, and data source connectivity, but it deliberately omits the core DBMS functions of data processing and data storage. On the other hand, precisely because Calcite is independent of data processing and storage, it is a perfect choice for mediating between multiple data sources and data processing engines.

Calcite was previously called Optiq. Optiq was originally built for the Apache Hive project as its cost-based optimization model, i.e. CBO (cost-based optimization). In May 2014 Optiq became independent as an Apache incubator project, and in September 2014 it was officially renamed Calcite. The project's goal is "one size fits all" (one solution that adapts to every scenario), aiming to provide a unified query engine for different computing platforms and data sources.

Calcite's main functions are SQL parsing and optimization. It first parses a SQL statement into an abstract syntax tree (AST), then applies rule-based or cost-based optimizations to the relational tree derived from the AST, and finally pushes the optimized plan down to the appropriate data processing engine for execution.
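To make the parse-then-execute idea concrete, here is a minimal, self-contained sketch of a relational operator tree and a trivial interpreter for it. The classes (`ToyPlan`, `Row`, `Operator`) are invented for this note and are not Calcite's API; they only mirror the shape of the tree Calcite builds for a query such as `SELECT name FROM employee WHERE dept_no = 1`:

```java
import java.util.List;
import java.util.function.Predicate;

// Toy operator tree: Scan -> Filter -> Project, evaluated lazily.
// Hypothetical classes for illustration only; not Calcite's API.
public class ToyPlan {
    record Row(int empId, String name, int deptNo) {}

    interface Operator { List<Row> execute(); }

    // Leaf operator: scan an in-memory "table".
    static Operator scan(List<Row> table) {
        return () -> table;
    }

    // Filter operator: keep only rows matching the predicate.
    static Operator filter(Operator input, Predicate<Row> pred) {
        return () -> input.execute().stream().filter(pred).toList();
    }

    // Projection, hard-coded to the name column for simplicity.
    static List<String> projectNames(Operator input) {
        return input.execute().stream().map(Row::name).toList();
    }

    public static void main(String[] args) {
        List<Row> employee = List.of(
                new Row(100, "joe", 1),
                new Row(200, "oliver", 2),
                new Row(300, "twist", 1));

        // SELECT name FROM employee WHERE dept_no = 1
        Operator plan = filter(scan(employee), r -> r.deptNo() == 1);
        System.out.println(projectNames(plan)); // prints [joe, twist]
    }
}
```

In Calcite the tree nodes are `RelNode` instances and the final execution is delegated to an engine, but the overall flow (build a tree from SQL, then walk it) is the same.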

Why

The next question is: why do we need such a SQL parsing and optimization library at all?

If you are building a distributed computing product, sooner or later you will have to implement SQL-like query functionality, and doing so has a real technical barrier: the designers need a deep understanding of relational algebra and related fields. The parser's output should also track the ANSI SQL standard as closely as possible, which reduces both the vendor's promotion costs and users' learning costs. Moreover, in large-scale distributed data processing, a single SQL statement can often be translated into several semantically equivalent syntax trees; depending on the underlying data structures, the order in which data is processed, and how joins and filters are implemented, the execution efficiency of these trees can differ enormously, and which plan wins also varies with the SQL statement and the underlying execution environment.

How to optimize the execution path of the syntax tree is therefore a very important problem. On both of these points, big data domains such as batch computing, stream computing, and interactive querying share many common problems: once queries are expressed in relational algebra, the query processing and optimization concerns behind them can be abstracted and packaged, making a common framework possible.
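As a back-of-the-envelope illustration of why plan choice matters, compare two equivalent plans for joining a large table against a small one and filtering the result. The row counts, selectivity, and cost formulas below are invented for this note (a naive nested-loop model), not Calcite's actual cost model:

```java
// Toy cost model: a nested-loop join costs left * right row visits,
// a filter costs one pass over its input. All numbers are made up.
public class PlanCost {
    static long joinCost(long leftRows, long rightRows) {
        return leftRows * rightRows;
    }

    static long filterCost(long inputRows) {
        return inputRows;
    }

    public static void main(String[] args) {
        long employees = 1_000_000;  // large table
        long departments = 100;      // small table
        double selectivity = 0.01;   // the WHERE clause keeps 1% of rows

        // Plan A: join first, filter the joined result afterwards.
        long planA = joinCost(employees, departments)
                + filterCost(employees * departments);

        // Plan B: push the filter below the join (classic optimization).
        long filtered = (long) (employees * selectivity);
        long planB = filterCost(employees) + joinCost(filtered, departments);

        System.out.println("join-then-filter: " + planA); // 200000000
        System.out.println("filter-then-join: " + planB); // 2000000
    }
}
```

Even with this crude model, pushing the filter down wins by two orders of magnitude, which is exactly the kind of decision a query optimizer automates.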

If you are a data consumer, you may need to integrate multiple heterogeneous data sources (traditional relational databases, search engines such as Elasticsearch, document stores such as MongoDB, distributed computing frameworks such as Spark, and so on), in which case you also face cross-platform distributed query optimization and execution.

Positioning

This is where Apache Calcite comes in. The paper positions it as a complete query processing system, but Calcite is designed to be very flexible, and real projects generally use it in one of two ways:

  1. Use Calcite as a library (lib), embedding it into your own project.

    (list of systems that use Calcite as a library)

  2. Implement an adapter (Adapter); Calcite integrates with the data source by reading it through the adapter.

    (list of systems that integrate with Calcite via an adapter)

Focused functionality

The five components of a DBMS

In general, a database management system can be divided into the five parts shown above. From the start, Calcite's design and implementation focused on only three of them (shown in blue), leaving the gray parts, data management and data storage, to external compute and storage engines. The reason is that data management and storage are shaped by the characteristics of the data itself and are therefore diverse (files, relational databases, columnar databases, distributed storage, and so on) and complex. By giving up those two parts and concentrating on the more general upper-level modules, Calcite keeps the system's complexity under control and can go deeper on the things it chooses to do.

Calcite also does not reinvent the wheel, using ready-made open-source components where they exist. For parsing, it uses the open-source JavaCC to generate the Java code that turns SQL statements into the abstract syntax tree used by the next phase. Similarly, to implement flexible metadata functionality, Calcite needs to compile Java code at runtime; the default javac is too heavyweight for this, so it uses the lightweight open-source compiler Janino instead.

This focus on core functionality, reuse of existing components, and simple design keep Calcite's implementation simple and stable.

Flexible pluggable architecture

Calcite architecture

The figure above is the Calcite architecture diagram from the paper. The optimizer uses a tree of relational operators as its internal representation, and the internal optimization engine consists of three main components: rules, metadata providers, and the planner engine. The dashed lines represent interactions with the outside of Calcite; as the figure shows, there are several ways to interact.

The top of the figure shows external JDBC client applications, which normally submit SQL statements and reach Calcite's internal JDBC Server through a JDBC client. The JDBC Server then passes the SQL statement to the SQL Parser and Validator modules for parsing and validation, while the adjacent Expressions Builder allows systems that do their own SQL parsing and validation to hand relational expressions to Calcite directly. The Operator Expressions module then handles relational expressions, Metadata Providers supply external custom metadata, Pluggable Rules define the optimization rules, and at the core the Query Optimizer focuses on query optimization.

Calcite includes a query parser and validator that converts SQL queries into trees of relational operators. Since Calcite contains no data storage layer, it provides mechanisms, such as adapters, for defining tables and views over external storage engines, which lets Calcite be used as a layer on top of those engines. Calcite can not only provide SQL language and optimization support to systems without their own query language, but also provide optimization support to systems that already have their own language parsing and interpretation.

Because the functional modules are largely independent and sensibly separated, Calcite does not have to be integrated wholesale: you can select and integrate only the parts you need. Almost every module also supports customization, which lets users implement features flexibly.

How it works

Calcite's SQL processing generally follows these steps:

  1. Parse: Calcite uses JavaCC to parse the raw SQL statement into an unvalidated AST.

  2. Validate: check that the AST is legal, for example that the schemas, fields, and functions the SQL references actually exist and that the statement itself is well-formed. After this step a RelNode tree is generated.

  3. Optimize: optimize the RelNode tree and convert it into a physical execution plan. SQL optimizers generally come in two kinds: the rule-based optimizer (RBO) and the cost-based optimizer (CBO). In principle this step is optional, since the validated RelNode tree could be converted directly into a physical plan, but essentially all modern SQL processors include it in order to improve the execution plan. The output of this step is the physical execution plan.

  4. Execute: convert the physical execution plan into a program that can run on a specific platform. At this stage, systems such as Hive and Flink generate executable code from the physical plan via code generation (CodeGen).
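The optimize step can be pictured as pattern-matching rewrite rules applied to the operator tree. The sketch below is not Calcite's rule API; it is a hypothetical mini "RelNode" hierarchy with a single rule that merges two stacked Filters into one whose predicate is the conjunction of both, which is the general shape of a rule-based rewrite (match a tree pattern, emit an equivalent tree):

```java
import java.util.function.IntPredicate;

// Hypothetical mini operator hierarchy for illustrating RBO-style
// rewrites; not Calcite's actual RelNode or RelRule classes.
public class ToyOptimizer {
    interface RelNode {}
    record Scan(String table) implements RelNode {}
    record Filter(RelNode input, IntPredicate condition) implements RelNode {}

    // Rule: Filter(Filter(x, p1), p2) -> Filter(x, p1 AND p2).
    // Applied recursively until the pattern no longer matches.
    static RelNode mergeFilters(RelNode node) {
        if (node instanceof Filter outer
                && outer.input() instanceof Filter inner) {
            IntPredicate both = inner.condition().and(outer.condition());
            return mergeFilters(new Filter(inner.input(), both));
        }
        return node;
    }

    public static void main(String[] args) {
        RelNode plan = new Filter(
                new Filter(new Scan("employee"), v -> v > 0),
                v -> v < 10);

        RelNode optimized = mergeFilters(plan);
        // After the rewrite there is one Filter directly over the Scan.
        System.out.println(optimized instanceof Filter f
                && f.input() instanceof Scan);
    }
}
```

Calcite's planner generalizes this idea: rules fire on matching subtrees, a metadata provider supplies statistics, and (in the cost-based planner) the cheapest equivalent tree is kept.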

Below is a Calcite query demo. We write a SQL query as usual, but the data is not stored in any DB; instead it lives in JVM memory. This example gives a simple, concrete feel for how Calcite is used.

Maven dependency

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.study.calcite</groupId>
    <artifactId>demo</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <!-- Calcite core -->
        <dependency>
            <groupId>org.apache.calcite</groupId>
            <artifactId>calcite-core</artifactId>
            <version>1.19.0</version>
        </dependency>

    </dependencies>
</project>

Defining the schema

A Schema defines the structure of the stored data. The example defines a schema called JavaHrSchema, which can be likened to a database (DB) instance. The schema contains two tables, Employee and Department, which can be understood as ordinary database tables; the example initializes both tables with some in-memory data.

package org.study.calcite.demo.inmemory;

/**
 * Defines the schema structure
 *
 * @author niwei
 */
public class JavaHrSchema {

    public static class Employee {
        public final int emp_id;
        public final String name;
        public final int dept_no;

        public Employee(int emp_id, String name, int dept_no) {
            this.emp_id = emp_id;
            this.name = name;
            this.dept_no = dept_no;
        }
    }

    public static class Department {
        public final String name;
        public final int dept_no;

        public Department(int dept_no, String name) {
            this.dept_no = dept_no;
            this.name = name;
        }
    }

    public final Employee[] employee = {
            new Employee(100, "joe", 1),
            new Employee(200, "oliver", 2),
            new Employee(300, "twist", 1),
            new Employee(301, "king", 3),
            new Employee(305, "kelly", 1)
    };

    public final Department[] department = {
            new Department(1, "dev"),
            new Department(2, "market"),
            new Department(3, "test")
    };
}

Java code examples

The next step is to write and execute a SQL statement. First, though, we must tell Calcite about the Schema and Table definitions we want to operate on, which means adding a data source to Calcite. From an API point of view, the code is in fact very similar to ordinary JDBC database access; anyone who has written JDBC code will find it familiar, so I won't walk through it line by line.

package org.study.calcite.demo.inmemory;

import org.apache.calcite.adapter.java.ReflectiveSchema;
import org.apache.calcite.jdbc.CalciteConnection;
import org.apache.calcite.schema.SchemaPlus;

import java.sql.*;
import java.util.Properties;

public class QueryDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.calcite.jdbc.Driver");
        Properties info = new Properties();
        info.setProperty("lex", "JAVA");
        Connection connection = DriverManager.getConnection("jdbc:calcite:", info);
        CalciteConnection calciteConnection = connection.unwrap(CalciteConnection.class);

        SchemaPlus rootSchema = calciteConnection.getRootSchema();
        /**
         * Register an object as a schema: via reflection, Calcite reads the internal
         * structure of JavaHrSchema and exposes its employee and department fields as tables
         */
        rootSchema.add("hr", new ReflectiveSchema(new JavaHrSchema()));
        Statement statement = calciteConnection.createStatement();
        ResultSet resultSet = statement.executeQuery(
                "select e.emp_id, e.name as emp_name, e.dept_no, d.name as dept_name "
                        + "from hr.employee as e "
                        + "left join hr.department as d on e.dept_no = d.dept_no");
        /**
         * Iterate over the SQL result set
         */
        while (resultSet.next()) {
            for (int i = 1; i <= resultSet.getMetaData().getColumnCount(); i++) {
                System.out.print(resultSet.getMetaData().getColumnName(i) + ":" + resultSet.getObject(i));
                System.out.print(" | ");
            }
            System.out.println();
        }

        resultSet.close();
        statement.close();
        connection.close();

    }
}


Source: juejin.im/post/5d2ed6a96fb9a07eea32a6ff