Calcite Tutorial

(Based on the official Apache Calcite tutorial.)

This tutorial demonstrates a simple adapter that maps a directory of CSV files into a schema containing tables, and then provides a SQL query interface on top of it. It covers several important concepts:

  • Building a user-defined schema with the SchemaFactory and Schema interfaces
  • Declaring schemas in a model JSON file
  • Declaring views in a model JSON file
  • Building a user-defined table with the Table interface
  • Implementing a simple Table with the ScannableTable interface, which directly enumerates all rows
  • A more advanced Table implementation, FilterableTable

    Using FilterableTable, rows can be filtered according to simple predicates

  • The most advanced Table implementation, TranslatableTable

TranslatableTable translates the table into a RelNode relational expression via planner rules.
It is an extension to Table that specifies how the table is to be translated to an org.apache.calcite.rel.RelNode relational expression.
If a Table does not implement this interface, it will be converted to an EnumerableTableScan.
Generally a Table implements this interface to create a particular subclass of RelNode, and also to register rules that act on that particular subclass of RelNode.

Download and build

$ git clone https://github.com/apache/calcite.git
$ cd calcite
$ mvn install -DskipTests -Dcheckstyle.skip=true
$ cd example/csv

First queries

Use the sqlline script included with the project to run queries (if you are running Windows, the command is sqlline.bat):

$ ./sqlline
sqlline> !connect jdbc:calcite:model=target/test-classes/model.json admin admin

Execute a metadata query:

sqlline> !tables

> (JDBC experts, note: sqlline's !tables command is just executing DatabaseMetaData.getTables() behind the scenes. It has other commands to query JDBC metadata, such as !columns and !describe.)

+------------+--------------+-------------+---------------+----------+------+
| TABLE_CAT  | TABLE_SCHEM  | TABLE_NAME  |  TABLE_TYPE   | REMARKS  | TYPE |
+------------+--------------+-------------+---------------+----------+------+
| null       | SALES        | DEPTS       | TABLE         | null     | null |
| null       | SALES        | EMPS        | TABLE         | null     | null |
| null       | SALES        | HOBBIES     | TABLE         | null     | null |
| null       | metadata     | COLUMNS     | SYSTEM_TABLE  | null     | null |
| null       | metadata     | TABLES      | SYSTEM_TABLE  | null     | null |
+------------+--------------+-------------+---------------+----------+------+

There are 5 tables:

  • Tables EMPS, DEPTS and HOBBIES in the SALES schema
  • COLUMNS and TABLES in the metadata schema

System tables are provided by Calcite itself; the other tables are provided by the specific schema implementation. In this example, EMPS and DEPTS come from the files EMPS.csv and DEPTS.csv in the target/test-classes directory (the adapter maps the directory of CSV files into a schema containing tables).

Calcite supports full SQL; for example, a table scan:

sqlline> SELECT * FROM emps;
+--------+--------+---------+---------+----------------+--------+-------+---+
| EMPNO  |  NAME  | DEPTNO  | GENDER  |      CITY      | EMPID  |  AGE  | S |
+--------+--------+---------+---------+----------------+--------+-------+---+
| 100    | Fred   | 10      |         |                | 30     | 25    | t |
| 110    | Eric   | 20      | M       | San Francisco  | 3      | 80    | n |
| 110    | John   | 40      | M       | Vancouver      | 2      | null  | f |
| 120    | Wilma  | 20      | F       |                | 1      | 5     | n |
| 130    | Alice  | 40      | F       | Vancouver      | 2      | null  | f |
+--------+--------+---------+---------+----------------+--------+-------+---+

JOIN and GROUP BY:

sqlline> SELECT d.name, COUNT(*)
. . . .> FROM emps AS e JOIN depts AS d ON e.deptno = d.deptno
. . . .> GROUP BY d.name;
+------------+---------+
|    NAME    | EXPR$1  |
+------------+---------+
| Sales      | 1       |
| Marketing  | 2       |
+------------+---------+

The VALUES operator generates a single row, which is a convenient way to test expressions and built-in SQL functions:

sqlline> VALUES CHAR_LENGTH('Hello, ' || 'world!');
+---------+
| EXPR$0  |
+---------+
| 13      |
+---------+
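For reference, the result 13 is just the length of the concatenated string: SQL's || concatenates first, then CHAR_LENGTH counts the characters. The same computation in plain Java (a sketch for illustration, not part of the tutorial code):

```java
public class CharLengthDemo {
  public static void main(String[] args) {
    // SQL's || is string concatenation; CHAR_LENGTH counts characters.
    String s = "Hello, " + "world!";
    System.out.println(s.length()); // prints 13
  }
}
```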

Schema

How does Calcite know about these tables without knowing anything about CSV files? (As a "database without a storage layer", Calcite doesn't know about any file formats.)
The main steps are:

(1) A schema is defined by the schema factory class named in the model file (in JSON format).
(2) The schema creates the tables, and each table knows how to scan its CSV file for data.
(3) The path to the model file is given in the JDBC connect string (e.g. jdbc:calcite:model=target/test-classes/model.json), and a query is submitted.
(4) After Calcite has parsed the query and generated a plan, it calls into the tables to read the data while the query is executing.

An example model file:

{
  version: '1.0',
  defaultSchema: 'SALES',
  schemas: [
    {
      name: 'SALES',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.csv.CsvSchemaFactory',
      operand: {
        directory: 'target/test-classes/sales'
      }
    }
  ]
}

This model defines a single schema named 'SALES', powered by the factory class org.apache.calcite.adapter.csv.CsvSchemaFactory. CsvSchemaFactory implements the Calcite interface SchemaFactory; its create() method takes the operands from the model file and instantiates a schema (here a CsvSchema, which implements the Schema interface):

public Schema create(SchemaPlus parentSchema, String name,
    Map<String, Object> operand) {
  String directory = (String) operand.get("directory"); // operand from the model file
  String flavorName = (String) operand.get("flavor"); // operand from the model file
  CsvTable.Flavor flavor;
  if (flavorName == null) {
    flavor = CsvTable.Flavor.SCANNABLE;
  } else {
    flavor = CsvTable.Flavor.valueOf(flavorName.toUpperCase());
  }
  return new CsvSchema(
      new File(directory),
      flavor);
}

A schema's main job is:

  • to produce a list of tables (each implementing the Table interface)
  • it can also create sub-schemas and table-functions, but these are advanced features not implemented in calcite-example-csv

In this example, CsvSchema creates tables that are instances of CsvTable and its subclasses. CsvSchema implements the getTableMap() method of AbstractSchema:

protected Map<String, Table> getTableMap() {
  // Look for files in the directory ending in ".csv", ".csv.gz", ".json",
  // ".json.gz".
  File[] files = directoryFile.listFiles(
      new FilenameFilter() {
        public boolean accept(File dir, String name) {
          final String nameSansGz = trim(name, ".gz");
          return nameSansGz.endsWith(".csv")
              || nameSansGz.endsWith(".json");
        }
      });
  if (files == null) {
    System.out.println("directory " + directoryFile + " not found");
    files = new File[0];
  }
  // Build a map from table name to table; each file becomes a table.
  final ImmutableMap.Builder<String, Table> builder = ImmutableMap.builder();
  for (File file : files) {
    String tableName = trim(file.getName(), ".gz");
    final String tableNameSansJson = trimOrNull(tableName, ".json");
    if (tableNameSansJson != null) {
      JsonTable table = new JsonTable(file);
      builder.put(tableNameSansJson, table);
      continue;
    }
    tableName = trim(tableName, ".csv");
    final Table table = createTable(file);
    builder.put(tableName, table);
  }
  return builder.build();
}

/** Creates different sub-type of table based on the "flavor" attribute. */
private Table createTable(File file) {
  switch (flavor) {
  case TRANSLATABLE:
    return new CsvTranslatableTable(file, null);
  case SCANNABLE:
    return new CsvScannableTable(file, null);
  case FILTERABLE:
    return new CsvFilterableTable(file, null);
  default:
    throw new AssertionError("Unknown flavor " + flavor);
  }
}

CsvSchema scans the directory and creates a table for every file ending in ".csv". In this example, directory is target/test-classes/sales, which contains the files EMPS.csv and DEPTS.csv, so the schema exposes the tables EMPS and DEPTS.
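The trim and trimOrNull helpers used in getTableMap() are simple suffix-strippers. Here is a self-contained sketch of how a file name becomes a table name; the helper bodies below are reconstructions for illustration, not the exact Calcite source:

```java
public class TableNameDemo {
  /** Removes the suffix if present, else returns the name unchanged. */
  static String trim(String s, String suffix) {
    String trimmed = trimOrNull(s, suffix);
    return trimmed != null ? trimmed : s;
  }

  /** Removes the suffix if present, else returns null. */
  static String trimOrNull(String s, String suffix) {
    return s.endsWith(suffix)
        ? s.substring(0, s.length() - suffix.length())
        : null;
  }

  /** Mirrors the naming logic in getTableMap(). */
  static String tableName(String fileName) {
    String name = trim(fileName, ".gz");         // strip optional ".gz"
    String sansJson = trimOrNull(name, ".json"); // JSON file -> JsonTable
    if (sansJson != null) {
      return sansJson;
    }
    return trim(name, ".csv");                   // CSV file -> CsvTable
  }

  public static void main(String[] args) {
    System.out.println(tableName("EMPS.csv.gz"));     // EMPS
    System.out.println(tableName("DEPTS.csv"));       // DEPTS
    System.out.println(tableName("HOBBIES.json.gz")); // HOBBIES
  }
}
```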

Summary

We do not need to define any tables in the model; the schema generates them automatically. The flow is:

model file -> names -> XXXSchemaFactory (implements SchemaFactory) -> creates -> XXXSchema (extends AbstractSchema) -> creates -> XXXTable (implements Table) -> scans the underlying file

Tables and views in schema

Besides the tables generated automatically by a schema, extra tables can be defined using the schema's tables property. Here is how to define a view in a schema:

{
  version: '1.0',
  defaultSchema: 'SALES',
  schemas: [
    {
      name: 'SALES',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.csv.CsvSchemaFactory',
      operand: {
        directory: 'target/test-classes/sales'
      },
      tables: [
        {
          name: 'FEMALE_EMPS',
          type: 'view',
          sql: 'SELECT * FROM emps WHERE gender = \'F\''
        }
      ]
    }
  ]
}

Here:

  • type: 'view' marks FEMALE_EMPS as a view

A view can be queried just like an ordinary table:
sqlline> SELECT e.name, d.name FROM female_emps AS e JOIN depts AS d on e.deptno = d.deptno;
+--------+------------+
|  NAME  |    NAME    |
+--------+------------+
| Wilma  | Marketing  |
+--------+------------+

Custom tables

A custom table is a table whose implementation is driven by user code directly, rather than being generated by a schema.

Here is an example from model-with-custom-table.json:

{
  version: '1.0',
  defaultSchema: 'CUSTOM_TABLE',
  schemas: [
    {
      name: 'CUSTOM_TABLE',
      tables: [
        {
          name: 'EMPS',
          type: 'custom',
          factory: 'org.apache.calcite.adapter.csv.CsvTableFactory',
          operand: {
            file: 'target/test-classes/sales/EMPS.csv.gz',
            flavor: "scannable"
          }
        }
      ]
    }
  ]
}

Query it in the same way as before:

sqlline> !connect jdbc:calcite:model=target/test-classes/model-with-custom-table.json admin admin
sqlline> SELECT empno, name FROM custom_table.emps;
+--------+--------+
| EMPNO  |  NAME  |
+--------+--------+
| 100    | Fred   |
| 110    | Eric   |
| 110    | John   |
| 120    | Wilma  |
| 130    | Alice  |
+--------+--------+

The schema is a regular one, and contains a custom table powered by org.apache.calcite.adapter.csv.CsvTableFactory, which implements the Calcite interface TableFactory. Its create method instantiates a CsvScannableTable, passing in the file argument from the model file:

public CsvTable create(SchemaPlus schema, String name,
    Map<String, Object> map, RelDataType rowType) {
  String fileName = (String) map.get("file"); // the file path comes straight from the model; no directory scan is needed
  final File file = new File(fileName);
  final RelProtoDataType protoRowType =
      rowType != null ? RelDataTypeImpl.proto(rowType) : null;
  return new CsvScannableTable(file, protoRowType);
}

Summary and comparison

  • A custom table is a complement to a custom schema: both approaches ultimately produce a Table implementation, but a custom table does not need to perform metadata discovery (CsvTableFactory creates a CsvScannableTable just as CsvSchema does, except that it does not scan a directory for .csv files; the file path is given explicitly)
  • Custom tables require the author to name the file for each table explicitly, but in return give the author more control (for example, different parameters for each table)
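For example, the tables array of the same schema could name two custom tables, each with its own file and flavor (the DEPTS entry below is hypothetical, assuming a DEPTS.csv exists in the same directory):

```json
tables: [
  {
    name: 'EMPS',
    type: 'custom',
    factory: 'org.apache.calcite.adapter.csv.CsvTableFactory',
    operand: {
      file: 'target/test-classes/sales/EMPS.csv.gz',
      flavor: "scannable"
    }
  },
  {
    name: 'DEPTS',
    type: 'custom',
    factory: 'org.apache.calcite.adapter.csv.CsvTableFactory',
    operand: {
      file: 'target/test-classes/sales/DEPTS.csv',
      flavor: "filterable"
    }
  }
]
```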

Comments in models

Models can include comments using /* … */ and // syntax:

{
  version: '1.0',
  /* Multi-line
     comment. */
  defaultSchema: 'CUSTOM_TABLE',
  // Single-line comment.
  schemas: [
    ..
  ]
}

(Comments are not standard JSON, but are a harmless extension.)

Optimizing queries using planner rules

  • Calcite optimizes queries by applying planner rules. A planner rule matches a pattern of nodes in the query parse tree and replaces the matched nodes with new nodes, thereby optimizing the query.
  • Planner rules, like schemas and tables, are extensible. So to query a data source with SQL, you first define a custom table or schema, and then define rules to optimize queries over it.

Here is a simple example:

sqlline> !connect jdbc:calcite:model=target/test-classes/model.json admin admin
sqlline> explain plan for select name from emps;
+-----------------------------------------------------+
| PLAN                                                |
+-----------------------------------------------------+
| EnumerableCalcRel(expr#0..9=[{inputs}], NAME=[$t1]) |
|   EnumerableTableScan(table=[[SALES, EMPS]])        |
+-----------------------------------------------------+
sqlline> !connect jdbc:calcite:model=target/test-classes/smart.json admin admin
sqlline> explain plan for select name from emps;
+-----------------------------------------------------+
| PLAN                                                |
+-----------------------------------------------------+
| EnumerableCalcRel(expr#0..9=[{inputs}], NAME=[$t1]) |
|   CsvTableScan(table=[[SALES, EMPS]])               |
+-----------------------------------------------------+

The difference between the plans comes from the model file: smart.json contains one extra line:

flavor: "translatable"

With flavor = TRANSLATABLE, CsvSchema's createTable method creates a CsvTranslatableTable rather than a CsvScannableTable.
CsvTranslatableTable implements TranslatableTable.toRel(), which creates a CsvTableScan as a node in the query operator tree; this is the hook through which the rules fire during planning.

The rule in this example is registered by CsvTableScan's register method:

  @Override public void register(RelOptPlanner planner) {
    planner.addRule(CsvProjectTableScanRule.INSTANCE);
  }

A rule has two key parts:
* The constructor declares the pattern of relational expressions that will cause the rule to fire.
* The onMatch method generates a new relational expression and calls RelOptRuleCall.transformTo() to indicate that the rule has fired.

public class CsvProjectTableScanRule extends RelOptRule {
  public static final CsvProjectTableScanRule INSTANCE =
      new CsvProjectTableScanRule();

  private CsvProjectTableScanRule() {
    super(
        operand(Project.class,
            operand(CsvTableScan.class, none())),
        "CsvProjectTableScanRule");
  }

  @Override
  public void onMatch(RelOptRuleCall call) {
    final Project project = call.rel(0);
    final CsvTableScan scan = call.rel(1);
    int[] fields = getProjectFields(project.getProjects());
    if (fields == null) {
      // Project contains expressions more complex than just field references.
      return;
    }
    call.transformTo(
        new CsvTableScan(
            scan.getCluster(),
            scan.getTable(),
            scan.csvTable,
            fields));
  }

  private int[] getProjectFields(List<RexNode> exps) {
    final int[] fields = new int[exps.size()];
    for (int i = 0; i < exps.size(); i++) {
      final RexNode exp = exps.get(i);
      if (exp instanceof RexInputRef) {
        fields[i] = ((RexInputRef) exp).getIndex();
      } else {
        return null; // not a simple projection
      }
    }
    return fields;
  }
}
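The effect of getProjectFields can be illustrated without Calcite. In the sketch below, Expr is a hypothetical stand-in for Calcite's RexNode/RexInputRef hierarchy (not a real Calcite class): an index >= 0 models a plain field reference, anything else models a complex expression:

```java
import java.util.Arrays;
import java.util.List;

public class ProjectFieldsDemo {
  /** Stand-in for a RexNode: index >= 0 is a plain input reference
   * (like RexInputRef); index < 0 models a complex expression. */
  static class Expr {
    final int index;
    Expr(int index) { this.index = index; }
  }

  /** Mirrors getProjectFields: collect the referenced column indices,
   * or return null if any projection is not a simple field reference. */
  static int[] getProjectFields(List<Expr> exps) {
    final int[] fields = new int[exps.size()];
    for (int i = 0; i < exps.size(); i++) {
      if (exps.get(i).index >= 0) {
        fields[i] = exps.get(i).index;
      } else {
        return null; // not a simple projection; the rule does not fire
      }
    }
    return fields;
  }

  public static void main(String[] args) {
    // SELECT name FROM emps: NAME is column 1, so only field 1 is scanned.
    System.out.println(Arrays.toString(getProjectFields(List.of(new Expr(1))))); // prints [1]
    // A computed expression anywhere in the list aborts the rule.
    System.out.println(getProjectFields(List.of(new Expr(1), new Expr(-1)))); // prints null
  }
}
```

So for `select name from emps` against smart.json, the rule rewrites the scan to read only the NAME column, which is why the optimized plan shows a CsvTableScan.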

The query optimization process

1) The optimizer does not apply the rules in a prescribed order; it explores many branches of the query tree.
2) Calcite uses cost to choose among plans.
3) Calcite does not have to choose between rules up front; it can apply both rule A and rule B and compare the results, unlike optimizers with a linear optimization scheme.

> Many optimizers have a linear optimization scheme. Faced with a choice between rule A and rule B, as above, such an optimizer needs to choose immediately. It might have a policy such as "apply rule A to the whole tree, then apply rule B to the whole tree", or apply a cost-based policy, applying the rule that produces the cheaper result.

4) The cost model prevents the search from getting stuck in a local minimum, and is also used to prune the search.

JDBC adapter

The JDBC adapter maps a schema in a JDBC data source into a Calcite schema.
For example, the following model reads the schema of a MySQL "foodmart" database:

{
  version: '1.0',
  defaultSchema: 'FOODMART',
  schemas: [
    {
      name: 'FOODMART',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.jdbc.JdbcSchema$Factory',
      operand: {
        jdbcDriver: 'com.mysql.jdbc.Driver',
        jdbcUrl: 'jdbc:mysql://localhost/foodmart',
        jdbcUser: 'foodmart',
        jdbcPassword: 'foodmart'
      }
    }
  ]
}

Current limitations:

  • The JDBC adapter currently only pushes down table scan operations; all other processing (filtering, joins, aggregations and so forth) occurs within Calcite.
  • Our goal is to push down as much processing as possible to the source system, translating syntax, data types and built-in functions as we go. If a Calcite query is based on tables from a single JDBC database, in principle the whole query should go to that database. If tables are from multiple JDBC sources, or a mixture of JDBC and non-JDBC, Calcite will use the most efficient distributed query approach that it can.

The cloning JDBC adapter

The cloning JDBC adapter creates a hybrid database. The data is sourced from a JDBC database but is read into in-memory tables the first time each table is accessed. Calcite evaluates queries based on those in-memory tables, effectively a cache of the database.

For example, the following model reads tables from a MySQL “foodmart” database:

{
  version: '1.0',
  defaultSchema: 'FOODMART_CLONE',
  schemas: [
    {
      name: 'FOODMART_CLONE',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.clone.CloneSchema$Factory',
      operand: {
        jdbcDriver: 'com.mysql.jdbc.Driver',
        jdbcUrl: 'jdbc:mysql://localhost/foodmart',
        jdbcUser: 'foodmart',
        jdbcPassword: 'foodmart'
      }
    }
  ]
}

Another technique is to build a clone schema on top of an existing schema. You use the source property to reference a schema defined earlier in the model, like this:

{
  version: '1.0',
  defaultSchema: 'FOODMART_CLONE',
  schemas: [
    {
      name: 'FOODMART',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.jdbc.JdbcSchema$Factory',
      operand: {
        jdbcDriver: 'com.mysql.jdbc.Driver',
        jdbcUrl: 'jdbc:mysql://localhost/foodmart',
        jdbcUser: 'foodmart',
        jdbcPassword: 'foodmart'
      }
    },
    {
      name: 'FOODMART_CLONE',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.clone.CloneSchema$Factory',
      operand: {
        source: 'FOODMART'
      }
    }
  ]
}

You can use this approach to create a clone schema on any type of schema, not just JDBC.

The cloning adapter isn’t the be-all and end-all. We plan to develop more sophisticated caching strategies, and a more complete and efficient implementation of in-memory tables, but for now the cloning JDBC adapter shows what is possible and allows us to try out our initial implementations.

Reposted from blog.csdn.net/hjw199089/article/details/79844002