Apache Calcite streamlined entry and learning guide

1 Basic introduction to Apache Calcite

Apache Calcite is a dynamic data management framework. It includes many parts of a typical database management system but deliberately omits a few key pieces: data storage, algorithms for processing data, and a repository for storing metadata.

Based on Apache Calcite, we can develop a SQL query engine for any third-party storage engine.

  • Official website address

https://calcite.apache.org/

  • Project address

https://github.com/apache/calcite

2 Apache Calcite learning guide

If you want to understand Calcite, it is worth starting with the official documentation. It mentions many concepts you may never have encountered before, but fortunately it is short, so it at least leaves you with an impression of the key terms involved in SQL execution. That impression alone is useful: if you want to use Calcite well, or even just use it at all, these key terms eventually need to be mastered and understood.

However, the official documentation alone is not enough. Once you look back at it, you will find it is written for "high-end players": it is a highly abstract summary of Calcite, not a beginner's tutorial, so getting even a QuickStart running from the official documents alone is quite difficult. Personally, I feel it is hard to manage without some persistence or prior experience with how SQL execution works. So you cannot rely only on the official documentation; you also need to gather information through other channels. On how a beginner can get up to speed with Apache Calcite quickly, here is some of my personal experience:

  • 1. Simple to use first

    Calcite is a data management framework, so first of all you have to use it, and slowly understand what it does. In theory, through Calcite you can access any data source you want in SQL: your files (whatever their format), Java objects, third-party storage engines (Redis, Kafka, Elasticsearch), and so on. I used the word "any" to describe its capability deliberately, because that really is what it can do.

    This document will teach you how to use Calcite to access CSV files, Java internal objects, and Elasticsearch data sources in SQL.

  • 2. Production use and thinking

    Once you know that Calcite can reach any data source through SQL, I know that students with ideas will already be considering:

    • (1) If my business systems use various storage systems, can I use Calcite to build a unified data query system (a single query entry point that queries multiple different data sources at the same time)?

    Users would not need to know where the data is stored. From their point of view, it is simply a query system with one SQL entry point, shielding them from the differences between the storage systems behind it;

    • (2) If the business relies on a popular data storage system or engine that does not support SQL queries, can I use Calcite to develop a SQL engine for it?

    The answer is yes. Calcite is a component, essentially a framework, and it provides various extension points that let you build exactly this kind of functionality.

    Of course, using Calcite to develop a good SQL engine for a particular storage system still requires considerable effort. For example, VolcanoPlanner needs to be well understood; regrettably, I have not had the energy to study it so far, and my own idea of developing a SQL engine for Elasticsearch has likewise been on hold for a long time.

    What I call "using Calcite to develop a good SQL engine for a storage system" actually has a proper name in Calcite: a data source adapter.

    Calcite itself ships adapters for multiple storage engines, such as Druid, Elasticsearch, and Spark. Of course, what is open source is not necessarily what you need out of the box: the reason I keep mentioning the need to rewrite an Elasticsearch adapter is that the ES adapter Calcite provides is relatively weak, as anyone who has used it will have experienced.
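To make the "unified query entry" idea concrete, here is a hedged sketch (not production code) that mounts two of the schemas used in the demos later in this document under one RootSchema and issues a single SQL join across both sources; whether this exact query runs as-is depends on the adapter versions:

```java
// Hypothetical sketch: one SQL entry point over several storage systems.
// CsvSchema, ReflectiveSchema, HrSchema, and ResultSetUtil are the classes
// used in the demos later in this document; the join itself is illustrative.
public class UnifiedQueryDemo {

    public static void main(String[] args) throws Exception {
        Properties info = new Properties();
        info.setProperty("caseSensitive", "false");
        Connection connection = DriverManager.getConnection("jdbc:calcite:", info);
        CalciteConnection calciteConnection = connection.unwrap(CalciteConnection.class);

        // Mount two different data sources under the same RootSchema
        SchemaPlus rootSchema = calciteConnection.getRootSchema();
        rootSchema.add("csv", new CsvSchema(new File("csv"), CsvTable.Flavor.SCANNABLE));
        rootSchema.add("hr", new ReflectiveSchema(new HrSchema()));

        // One query spanning a CSV file and in-memory Java objects;
        // the user never sees where each table actually lives
        String sql = "select e.name, d.name "
                   + "from hr.emps e join csv.depts d on e.deptno = d.deptno";
        ResultSet resultSet = calciteConnection.createStatement().executeQuery(sql);
        System.out.println(ResultSetUtil.resultString(resultSet));
    }

}
```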

  • 3. Deep use and thinking

    In fact, if you just want to know how Calcite is used and what it can do, we might as well stand on the shoulders of giants and look at how open source projects in industry use it.

    A good reference is Apache Druid: its SQL engine is built on Apache Calcite. So for the more advanced features of Calcite, we might as well study the source code of Apache Druid's SQL module; I believe it will pay off handsomely.

  • 4. VolcanoPlanner

    If you have the time and energy to study its implementation in Calcite, I personally think it will be very rewarding.


As for Calcite's deeper implementation details, it is best to think through the functions of its various modules in light of real application scenarios. For example, when you want to understand how a particular feature works, look at the structure and details of its source code; I believe this is extremely helpful for your personal growth.

3 Access to different data sources through Apache Calcite

First create a Maven project, then add the Calcite dependencies:

<dependency>
  <groupId>org.apache.calcite</groupId>
  <artifactId>calcite-core</artifactId>
  <version>1.20.0</version>
</dependency>
<dependency>
  <groupId>org.apache.calcite</groupId>
  <artifactId>calcite-example-csv</artifactId>
  <version>1.20.0</version>
</dependency>
<dependency>
  <groupId>org.apache.calcite</groupId>
  <artifactId>calcite-elasticsearch</artifactId>
  <version>1.20.0</version>
</dependency>

3.1 Access to CSV data source

First prepare a CSV file:

EMPNO:long,NAME:string,DEPTNO:int,GENDER:string,CITY:string,EMPID:int,AGE:int,SLACKER:boolean,MANAGER:boolean,JOINEDAT:date
100,"Fred",10,,,30,25,true,false,"1996-08-03"
110,"Eric",20,"M","San Francisco",3,80,,false,"2001-01-01"
110,"John",40,"M","Vancouver",2,,false,true,"2002-05-03"
120,"Wilma",20,"F",,1,5,,true,"2005-09-07"
130,"Alice",40,"F","Vancouver",2,,false,true,"2007-01-01"

Calcite maps each CSV file to a SQL table. The header of the CSV file specifies each column's data type, which is mapped to the corresponding SQL type according to fixed rules; if no type is specified, the column is uniformly mapped to VARCHAR.

The file is named depts.csv, so Calcite will construct a table named depts.

Then write the following code to access the data in SQL through Calcite:

// Author: xpleaf
public class CsvDemo {

    public static void main(String[] args) throws Exception {
        // 0. Get the path of the csv files; the parent directory is enough
        String path = Objects.requireNonNull(CsvDemo.class.getClassLoader().getResource("csv").getPath());

        // 1. Build the CsvSchema object. In Calcite, each data source has its own
        //    Schema implementation, e.g. CsvSchema, DruidSchema, ElasticsearchSchema
        CsvSchema csvSchema = new CsvSchema(new File(path), CsvTable.Flavor.SCANNABLE);

        // 2. Build the Connection
        // 2.1 Set connection properties
        Properties info = new Properties();
        // Make SQL identifiers case-insensitive
        info.setProperty("caseSensitive", "false");
        // 2.2 Get a standard JDBC Connection
        Connection connection = DriverManager.getConnection("jdbc:calcite:", info);
        // 2.3 Unwrap Calcite's own Connection
        CalciteConnection calciteConnection = connection.unwrap(CalciteConnection.class);

        // 3. Get the RootSchema. In Calcite, RootSchema is the parent of all data source
        //    schemas; schemas for different sources can be mounted under the same
        //    RootSchema, which is what makes cross-source queries possible
        SchemaPlus rootSchema = calciteConnection.getRootSchema();

        // 4. Mount the data source schema onto the RootSchema; here, the CsvSchema
        rootSchema.add("csv", csvSchema);

        // 5. Execute a SQL query, accessing the csv file through SQL
        String sql = "select * from csv.depts";
        Statement statement = calciteConnection.createStatement();
        ResultSet resultSet = statement.executeQuery(sql);

        // 6. Print the query result set
        System.out.println(ResultSetUtil.resultString(resultSet));
    }

}

Executing the code produces the following output:

100, Fred, 10, , , 30, 25, true, false, 1996-08-03
110, Eric, 20, M, San Francisco, 3, 80, null, false, 2001-01-01
110, John, 40, M, Vancouver, 2, null, false, true, 2002-05-03
120, Wilma, 20, F, , 1, 5, null, true, 2005-09-07
130, Alice, 40, F, Vancouver, 2, null, false, true, 2007-01-01
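Once the CsvSchema is mounted, ordinary SQL beyond `select *` works too. Here is a hedged follow-up fragment reusing the `calciteConnection` and `csv` schema from the demo above (the column names come from the CSV header):

```java
// Assumes the calciteConnection set up in CsvDemo above.
// Filtering and aggregation are evaluated by Calcite on top of the CSV scan.
String sql = "select deptno, count(*) as cnt "
           + "from csv.depts "
           + "where age is not null "
           + "group by deptno";
try (Statement statement = calciteConnection.createStatement();
     ResultSet resultSet = statement.executeQuery(sql)) {
    System.out.println(ResultSetUtil.resultString(resultSet, true));
}
```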

Thinking:

CSV is the example used in the official documentation. If you want an overall understanding of how the Calcite source code fits together (especially how to develop an adapter), you can start from this demo, map it to the concepts and classes mentioned in the documentation, and work through the source code, for example:

  • 1. How a Schema is constructed, and its place and role in Calcite;
  • 2. How a Table is constructed, and its place and role in Calcite;
  • 3. How SQL parsing, validation, optimization, and execution happen when a query runs;

All of this can be examined starting from this demo. Of course, although I cover it here in a few sentences, actually studying the whole process may take you quite a lot of time. I suggest not rushing in; the topic is too big, so take it slowly.

In addition, from the official documentation's introduction you should already have a feel for how to develop a Calcite data source adapter. In fact, if you only implement a simple adapter (without worrying about many SQL optimization rules), the difficulty is not that great.
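As a rough illustration of how small such a "simple adapter" can be, here is a hedged sketch of a custom schema exposing one in-memory table through Calcite's `ScannableTable` interface (class and table names are invented for the example; real adapters add planner rules for filter and projection push-down):

```java
import org.apache.calcite.DataContext;
import org.apache.calcite.linq4j.Enumerable;
import org.apache.calcite.linq4j.Linq4j;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.ScannableTable;
import org.apache.calcite.schema.Table;
import org.apache.calcite.schema.impl.AbstractSchema;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.type.SqlTypeName;

import java.util.Collections;
import java.util.Map;

// Hypothetical minimal adapter: one schema exposing one hard-coded table
public class InMemorySchema extends AbstractSchema {

    @Override
    protected Map<String, Table> getTableMap() {
        return Collections.singletonMap("demo", new InMemoryTable());
    }

    static class InMemoryTable extends AbstractTable implements ScannableTable {

        // Tell Calcite the table's columns and their SQL types
        @Override
        public RelDataType getRowType(RelDataTypeFactory typeFactory) {
            return typeFactory.builder()
                    .add("id", SqlTypeName.INTEGER)
                    .add("name", SqlTypeName.VARCHAR)
                    .build();
        }

        // Full scan: hand every row to Calcite, which evaluates the rest of the query
        @Override
        public Enumerable<Object[]> scan(DataContext root) {
            return Linq4j.asEnumerable(new Object[][]{{1, "a"}, {2, "b"}});
        }
    }
}
```

Mounted like any other schema (`rootSchema.add("mem", new InMemorySchema())`), a query such as `select * from mem.demo` would then go through Calcite's parser, validator, and planner before reaching `scan`.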

Through this example and the ones that follow, my real goal is to show you how to get started with Calcite quickly (that is, I have written you a QuickStart), so that you understand its overall usage. If you want to use Calcite in more depth, I suggest:

  • 1. Read the unit tests in the Calcite source code; they provide good reference cases;
  • 2. Since method 1 can be fragmented, you can also study the source code of Apache Druid's SQL module to see how Calcite is used as a whole; many of the advanced techniques in it are well worth borrowing.

3.2 Access to Object data source

3.2.1 SparkSQL access to Object objects

Students who have used SparkSQL will know that in SparkSQL, object instances can be converted into a DataFrame programmatically and registered as a table, after which those object instances can be queried through SQL:

public class _01SparkRDD2DataFrame {
    public static void main(String[] args) {
        Logger.getLogger("org.apache.spark").setLevel(Level.OFF);
        SparkConf conf = new SparkConf()
                .setMaster("local[2]")
                .setAppName(_01SparkRDD2DataFrame.class.getSimpleName())
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .registerKryoClasses(new Class[]{Person.class});
        JavaSparkContext jsc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(jsc);
        List<Person> persons = Arrays.asList(
                new Person(1, "name1", 25, 179),
                new Person(2, "name2", 22, 176),
                new Person(3, "name3", 27, 178),
                new Person(1, "name4", 24, 175)
        );

        DataFrame df = sqlContext.createDataFrame(persons, Person.class);   // there are several factory methods; building from a personsRDD also works

        // where age > 23 and height > 176
        df.select(new Column("id"),
                  new Column("name"),
                  new Column("age"),
                  new Column("height"))
                .where(new Column("age").gt(23).and(new Column("height").lt(179)))
                .show();

        df.registerTempTable("person");

        sqlContext.sql("select * from person where age > 23 and height < 179").show();

        jsc.close();

    }
}

The code example above comes from xpleaf's article "Spark SQL Notes (2): The DataFrame Programming Model and Operation Cases".

Note that this example uses the Spark 1.x API; Spark 2.x and later versions may not recommend this usage, so refer to the official Spark documentation for details.

3.2.2 Calcite access Object object

Correspondingly, Calcite provides a similar way to access object instance data through SQL.

For the demonstration, we first define the object classes:

public class HrSchema {
    public final Employee[] emps = {
            new Employee(100, 10, "Bill", 10000, 1000),
            new Employee(200, 20, "Eric", 8000, 500),
            new Employee(150, 10, "Sebastian", 7000, null),
            new Employee(110, 10, "Theodore", 11500, 250),
    };

    @Override
    public String toString() {
        return "HrSchema";
    }

    public static class Employee {
        public int empid;
        public int deptno;
        public String name;
        public float salary;
        public Integer commission;

        public Employee(int empid, int deptno, String name, float salary,
                        Integer commission) {
            this.empid = empid;
            this.deptno = deptno;
            this.name = name;
            this.salary = salary;
            this.commission = commission;
        }

        @Override
        public String toString() {
            return "Employee [empid: " + empid + ", deptno: " + deptno
                    + ", name: " + name + "]";
        }

        @Override
        public boolean equals(Object obj) {
            return obj == this
                    || obj instanceof Employee
                    && empid == ((Employee) obj).empid;
        }
    }
}

Calcite will map the emps array of HrSchema to a table.

Write the Calcite code as follows:

public class ObjectDemo {

    public static void main(String[] args) throws Exception {
        // 1. Build the ReflectiveSchema object. In Calcite, each data source has its own
        //    Schema implementation, e.g. CsvSchema, DruidSchema, ElasticsearchSchema
        ReflectiveSchema reflectiveSchema = new ReflectiveSchema(new HrSchema());

        // 2. Build the Connection
        // 2.1 Set connection properties
        Properties info = new Properties();
        // Make SQL identifiers case-insensitive
        info.setProperty("caseSensitive", "false");
        // 2.2 Get a standard JDBC Connection
        Connection connection = DriverManager.getConnection("jdbc:calcite:", info);
        // 2.3 Unwrap Calcite's own Connection
        CalciteConnection calciteConnection = connection.unwrap(CalciteConnection.class);

        // 3. Get the RootSchema. In Calcite, RootSchema is the parent of all data source
        //    schemas; schemas for different sources can be mounted under the same
        //    RootSchema, which is what makes cross-source queries possible
        SchemaPlus rootSchema = calciteConnection.getRootSchema();

        // 4. Mount the data source schema onto the RootSchema; here, the ReflectiveSchema
        rootSchema.add("hr", reflectiveSchema);

        // 5. Execute a SQL query, accessing the object instances through SQL
        String sql = "select * from hr.emps";
        Statement statement = calciteConnection.createStatement();
        ResultSet resultSet = statement.executeQuery(sql);

        // 6. Print the query result set
        System.out.println(ResultSetUtil.resultString(resultSet));
    }

}

Executing the code produces the following output:

100, 10, Bill, 10000.0, 1000
200, 20, Eric, 8000.0, 500
150, 10, Sebastian, 7000.0, null
110, 10, Theodore, 11500.0, 250

Typically, when Calcite is used to build a unified query system, object tables like this are used to hold the metadata of the data tables (that is, which fields a table has, the types of those fields, and other metadata needed to construct a table). For details, refer to the Apache Druid SQL source code.
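For instance, the metadata itself can be exposed as plain POJOs and mounted with the same `ReflectiveSchema` mechanism as `HrSchema` above. A hedged sketch with invented names:

```java
// Hypothetical metadata schema: each ColumnMeta row describes one column of
// some underlying data table; mounting it via ReflectiveSchema makes the
// metadata itself queryable with SQL.
public class MetaSchema {

    public final ColumnMeta[] columns = {
            new ColumnMeta("teachers", "name", "VARCHAR"),
            new ColumnMeta("teachers", "age", "INTEGER"),
    };

    public static class ColumnMeta {
        public final String tableName;
        public final String columnName;
        public final String sqlType;

        public ColumnMeta(String tableName, String columnName, String sqlType) {
            this.tableName = tableName;
            this.columnName = columnName;
            this.sqlType = sqlType;
        }
    }
}
```

After `rootSchema.add("meta", new ReflectiveSchema(new MetaSchema()))`, a query system could answer a query like `select * from meta.columns where tableName = 'teachers'` from its own catalog.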

3.3 Access to Elasticsearch data source

3.3.1 Very Quick Start with Elasticsearch

No need to be nervous: if you have never touched Elasticsearch before, don't worry about the learning cost. You can simply understand it as a database for now, without overcomplicating things. Moreover, it works out of the box, with essentially no deployment cost.

Download:

https://www.elastic.co/cn/downloads/elasticsearch

Download the version that matches your operating system.

After the download completes, unzip it, enter the bin directory, and execute elasticsearch.bat or elasticsearch (depending on your operating system) to start Elasticsearch. Visit localhost:9200 in a browser; if it returns the following information:

{
  "name": "yeyonghaodeMacBook-Pro.local",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "6sMhfd0fSgSnqk7M_CTmug",
  "version": {
    "number": "7.11.1",
    "build_flavor": "default",
    "build_type": "tar",
    "build_hash": "ff17057114c2199c9c1bbecc727003a907c0db7a",
    "build_date": "2021-02-15T13:44:09.394032Z",
    "build_snapshot": false,
    "lucene_version": "8.7.0",
    "minimum_wire_compatibility_version": "6.8.0",
    "minimum_index_compatibility_version": "6.0.0-beta1"
  },
  "tagline": "You Know, for Search"
}

then the service has been deployed successfully.

Next, we use Postman to create an index (table) and write data to ES:

PUT http://localhost:9200/teachers/_doc/1
{
    "name":"xpleaf",
    "age":26,
    "rate":0.86,
    "percent":0.95,
    "join_time":1551058601000
}

After the data is written successfully, query it through Postman:

GET http://localhost:9200/teachers/_search
{
    "took": 115,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "teachers",
                "_type": "_doc",
                "_id": "1",
                "_score": 1.0,
                "_source": {
                    "name": "xpleaf",
                    "age": 26,
                    "rate": 0.86,
                    "percent": 0.95,
                    "join_time": 1551058601000
                }
            }
        ]
    }
}

3.3.2 Calcite access to Elasticsearch data source

Of course, you might point out that ES itself also provides SQL capability. But that is part of the x-pack component and commercially licensed, so use it with caution; personally, I also find the SQL capability it provides relatively weak.

To be fair, Calcite's own Elasticsearch adapter is also fairly mediocre, as mentioned earlier.

With the preparation above done, we write the following Calcite code:

public class ElasticsearchDemo {

    public static void main(String[] args) throws Exception {
        // 1. Build the ElasticsearchSchema object. In Calcite, each data source has its
        //    own Schema implementation, e.g. CsvSchema, DruidSchema, ElasticsearchSchema
        RestClient restClient = RestClient.builder(new HttpHost("localhost", 9200)).build();
        ElasticsearchSchema elasticsearchSchema = new ElasticsearchSchema(restClient, new ObjectMapper(), "teachers");

        // 2. Build the Connection
        // 2.1 Set connection properties
        Properties info = new Properties();
        // Make SQL identifiers case-insensitive
        info.setProperty("caseSensitive", "false");
        // 2.2 Get a standard JDBC Connection
        Connection connection = DriverManager.getConnection("jdbc:calcite:", info);
        // 2.3 Unwrap Calcite's own Connection
        CalciteConnection calciteConnection = connection.unwrap(CalciteConnection.class);

        // 3. Get the RootSchema. In Calcite, RootSchema is the parent of all data source
        //    schemas; schemas for different sources can be mounted under the same
        //    RootSchema, which is what makes cross-source queries possible
        SchemaPlus rootSchema = calciteConnection.getRootSchema();

        // 4. Mount the data source schema onto the RootSchema; here, the ElasticsearchSchema
        rootSchema.add("es", elasticsearchSchema);

        // 5. Execute a SQL query, accessing the Elasticsearch index through SQL
        String sql = "select * from es.teachers";
        Statement statement = calciteConnection.createStatement();
        ResultSet resultSet = statement.executeQuery(sql);

        // 6. Print the query result set
        System.out.println(ResultSetUtil.resultString(resultSet));
    }

}

Executing the code produces the following output:

{name=xpleaf, age=26, rate=0.86, percent=0.95, join_time=1551058601000}
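One hedged caveat about this adapter: in the Calcite versions I have used, each Elasticsearch document surfaces as a single `_MAP` column (which is why the output above prints as one map), so individual fields are usually extracted with item access and an explicit cast. A fragment reusing `calciteConnection` from the demo above; verify the exact syntax against your Calcite version:

```java
// Assumes the calciteConnection / "es" schema set up in ElasticsearchDemo above.
// Each document is one _MAP value; pull fields out with _MAP['field'] + cast.
String sql = "select cast(_MAP['name'] as varchar) as name, "
           + "cast(_MAP['age'] as integer) as age "
           + "from es.teachers "
           + "where cast(_MAP['age'] as integer) > 20";
try (Statement statement = calciteConnection.createStatement();
     ResultSet resultSet = statement.executeQuery(sql)) {
    System.out.println(ResultSetUtil.resultString(resultSet, true));
}
```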

4 What's next

Through the basic introduction and QuickStart above, I believe you now have a basic understanding of Apache Calcite. Of course, if you want to really use Calcite in production, to build our own unified query system with it, understanding only this much is far from enough; there is truly a long way to go. But that is fine: I will introduce more advanced uses of Calcite later when I get the chance.

In fact, I learned many of the advanced usages by studying the source code of Apache Druid's SQL module, which is why I keep emphasizing that if you have the time and energy, you should read its source code.

Appendix 1: ResultSetUtil

public class ResultSetUtil {

    public static String resultString(ResultSet resultSet) throws SQLException {
        return resultString(resultSet, false);
    }

    public static String resultString(ResultSet resultSet, boolean printHeader) throws SQLException {
        List<List<Object>> resultList = resultList(resultSet, printHeader);
        return resultString(resultList);
    }

    public static List<List<Object>> resultList(ResultSet resultSet) throws SQLException {
        return resultList(resultSet, false);
    }

    public static String resultString(List<List<Object>> resultList) throws SQLException {
        StringBuilder builder = new StringBuilder();
        resultList.forEach(row -> {
            // Joining handles empty rows safely, unlike a trailing-substring trim
            String rowStr = row.stream()
                    .map(String::valueOf)
                    .collect(Collectors.joining(", "));
            builder.append(rowStr).append("\n");
        });
        return builder.toString();
    }

    public static List<List<Object>> resultList(ResultSet resultSet, boolean printHeader) throws SQLException {
        ArrayList<List<Object>> results = new ArrayList<>();
        final ResultSetMetaData metaData = resultSet.getMetaData();
        final int columnCount = metaData.getColumnCount();
        if (printHeader) {
            ArrayList<Object> header = new ArrayList<>();
            for (int i = 1; i <= columnCount; i++) {
                header.add(metaData.getColumnName(i));
            }
            results.add(header);
        }
        while (resultSet.next()) {
            ArrayList<Object> row = new ArrayList<>();
            for (int i = 1; i <= columnCount; i++) {
                row.add(resultSet.getObject(i));
            }
            results.add(row);
        }
        return results;
    }

}

Appendix 2: Demo source code address

https://github.com/xpleaf/calcite-tutorial


Origin blog.51cto.com/xpleaf/2639844