Tutorial
This tutorial uses a simple adapter to map a directory of CSV files to a schema containing tables, and then provides a SQL query interface over them.
It covers the following key concepts:
- Building a user-defined schema with the SchemaFactory and Schema interfaces
- Declaring schemas in a model JSON file
- Declaring views in a model JSON file
- Building a user-defined table with the Table interface
- Implementing a simple Table with the ScannableTable interface, which enumerates all rows directly
Advanced Table implementation: FilterableTable
Implementing a slightly more advanced Table with the FilterableTable interface, which can filter out rows according to simple predicates
Advanced Table implementation: TranslatableTable
Translating the table into a RelNode relational expression by means of planner rules.
TranslatableTable is an extension to Table that specifies how the table is to be translated to a RelNode relational expression.
If a Table does not implement this interface, it will be converted to an EnumerableTableScan.
Generally a Table will implement this interface to create a particular subclass of RelNode, and also register rules that act on that particular subclass of RelNode.
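The difference between the scannable and filterable flavors above can be illustrated with a plain-Java sketch. This is not the Calcite API, only the access pattern: a scannable table hands every row to the engine, which filters afterwards, while a filterable table applies simple pushed-down predicates itself, so fewer rows cross the boundary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class ScanStyles {
  // "Scannable" style: enumerate every row; the engine filters later.
  static List<Object[]> scanAll(List<Object[]> storage) {
    return new ArrayList<>(storage);
  }

  // "Filterable" style: the table applies the pushed-down predicate
  // itself and only emits matching rows.
  static List<Object[]> scanFiltered(List<Object[]> storage,
                                     Predicate<Object[]> predicate) {
    List<Object[]> result = new ArrayList<>();
    for (Object[] row : storage) {
      if (predicate.test(row)) {
        result.add(row);
      }
    }
    return result;
  }
}
```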
Download and build
$ git clone https://github.com/apache/calcite.git
$ cd calcite
$ mvn install -DskipTests -Dcheckstyle.skip=true
$ cd example/csv
First queries
Use the project's built-in sqlline to run queries (if you are running Windows, the command is sqlline.bat):
$ ./sqlline
sqlline> !connect jdbc:calcite:model=target/test-classes/model.json admin admin
Execute a metadata query:
sqlline> !tables
(JDBC experts, note: sqlline’s !tables command is just executing DatabaseMetaData.getTables() behind the scenes. It has other commands to query JDBC metadata, such as !columns and !describe.)
+------------+--------------+-------------+---------------+----------+------+
| TABLE_CAT | TABLE_SCHEM | TABLE_NAME | TABLE_TYPE | REMARKS | TYPE |
+------------+--------------+-------------+---------------+----------+------+
| null | SALES | DEPTS | TABLE | null | null |
| null | SALES | EMPS | TABLE | null | null |
| null | SALES | HOBBIES | TABLE | null | null |
| null | metadata | COLUMNS | SYSTEM_TABLE | null | null |
| null | metadata | TABLES | SYSTEM_TABLE | null | null |
+------------+--------------+-------------+---------------+----------+------+
This shows five tables:
- SALES schema: EMPS, DEPTS, HOBBIES
- metadata schema: COLUMNS, TABLES
System tables are generally implemented inside Calcite itself; the other tables are provided by a specific schema. In the example above, EMPS and DEPTS come from the EMPS.csv and DEPTS.csv files under the target/test-classes directory (a simple adapter maps the directory of CSV files to a schema containing tables).
Calcite supports full SQL queries, such as a table scan:
sqlline> SELECT * FROM emps;
+--------+--------+---------+---------+----------------+--------+-------+---+
| EMPNO | NAME | DEPTNO | GENDER | CITY | EMPID | AGE | S |
+--------+--------+---------+---------+----------------+--------+-------+---+
| 100 | Fred | 10 | | | 30 | 25 | t |
| 110 | Eric | 20 | M | San Francisco | 3 | 80 | n |
| 110 | John | 40 | M | Vancouver | 2 | null | f |
| 120 | Wilma | 20 | F | | 1 | 5 | n |
| 130 | Alice | 40 | F | Vancouver | 2 | null | f |
+--------+--------+---------+---------+----------------+--------+-------+---+
JOIN and GROUP BY:
sqlline> SELECT d.name, COUNT(*)
. . . .> FROM emps AS e JOIN depts AS d ON e.deptno = d.deptno
. . . .> GROUP BY d.name;
+------------+---------+
| NAME | EXPR$1 |
+------------+---------+
| Sales | 1 |
| Marketing | 2 |
+------------+---------+
The VALUES operator generates a single row, which is a convenient way to test expressions and built-in SQL functions:
sqlline> VALUES CHAR_LENGTH('Hello, ' || 'world!');
+---------+
| EXPR$0 |
+---------+
| 13 |
+---------+
Schema
As a "database without a storage layer", Calcite doesn't know about any file formats, so how does it become aware of the CSV tables?
The main steps are as follows:
(1) Define the schema via the schema factory class named in the model file (JSON format).
(2) The schema creates the tables, each of which knows how to scan a CSV file for its data.
(3) Pass the path of the model file in the JDBC connect string (e.g. jdbc:calcite:model=target/test-classes/model.json) and submit the query.
(4) Calcite parses the query and generates a plan; when the query executes, Calcite calls the tables to read the data.
An example model file:
{
  version: '1.0',
  defaultSchema: 'SALES',
  schemas: [
    {
      name: 'SALES',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.csv.CsvSchemaFactory',
      operand: {
        directory: 'target/test-classes/sales'
      }
    }
  ]
}
This model defines a schema named 'SALES', backed by the factory class org.apache.calcite.adapter.csv.CsvSchemaFactory. CsvSchemaFactory implements the SchemaFactory interface; its create() method takes the parameters given in the model file and builds a schema instance (here a CsvSchema, which implements the Schema interface):
public Schema create(SchemaPlus parentSchema, String name,
    Map<String, Object> operand) {
  String directory = (String) operand.get("directory"); // property from the model file
  String flavorName = (String) operand.get("flavor");   // property from the model file
  CsvTable.Flavor flavor;
  if (flavorName == null) {
    flavor = CsvTable.Flavor.SCANNABLE;
  } else {
    flavor = CsvTable.Flavor.valueOf(flavorName.toUpperCase());
  }
  return new CsvSchema(
      new File(directory),
      flavor);
}
A schema's main jobs:
- Build a set of tables (each implementing the Table interface)
- Optionally provide sub-schemas, table functions and so on; these are advanced features that calcite-example-csv does not implement
In this example, CsvSchema builds tables that are instances of CsvTable and its subclasses. CsvSchema extends AbstractSchema and overrides its getTableMap() method:
protected Map<String, Table> getTableMap() {
  // Look for files in the directory ending in ".csv", ".csv.gz", ".json",
  // ".json.gz".
  File[] files = directoryFile.listFiles(
      new FilenameFilter() {
        public boolean accept(File dir, String name) {
          final String nameSansGz = trim(name, ".gz");
          return nameSansGz.endsWith(".csv")
              || nameSansGz.endsWith(".json");
        }
      });
  if (files == null) {
    System.out.println("directory " + directoryFile + " not found");
    files = new File[0];
  }
  // Build a map from table name to table; each file becomes a table.
  final ImmutableMap.Builder<String, Table> builder = ImmutableMap.builder();
  for (File file : files) {
    String tableName = trim(file.getName(), ".gz");
    final String tableNameSansJson = trimOrNull(tableName, ".json");
    if (tableNameSansJson != null) {
      JsonTable table = new JsonTable(file);
      builder.put(tableNameSansJson, table);
      continue;
    }
    tableName = trim(tableName, ".csv");
    final Table table = createTable(file);
    builder.put(tableName, table);
  }
  return builder.build();
}

/** Creates different sub-type of table based on the "flavor" attribute. */
private Table createTable(File file) {
  switch (flavor) {
  case TRANSLATABLE:
    return new CsvTranslatableTable(file, null);
  case SCANNABLE:
    return new CsvScannableTable(file, null);
  case FILTERABLE:
    return new CsvFilterableTable(file, null);
  default:
    throw new AssertionError("Unknown flavor " + flavor);
  }
}
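The getTableMap() code above calls two string helpers, trim and trimOrNull, whose bodies are not shown. A plausible sketch of their behavior, inferred from the call sites (the real CsvSchema has private helpers along these lines):

```java
public class SuffixTrim {
  /** Returns the string with the given suffix removed if it is present,
   * otherwise the string unchanged. */
  static String trim(String s, String suffix) {
    String trimmed = trimOrNull(s, suffix);
    return trimmed != null ? trimmed : s;
  }

  /** Returns the string with the given suffix removed, or null if the
   * suffix is absent. */
  static String trimOrNull(String s, String suffix) {
    return s.endsWith(suffix)
        ? s.substring(0, s.length() - suffix.length())
        : null;
  }
}
```

With these, "EMPS.csv.gz" becomes the table name EMPS, and a "DEPTS.json.gz" file would map to a JsonTable named DEPTS.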
CsvSchema scans the directory, finds every ".csv" file, and creates a table for each file. In this example directory = target/test-classes/sales, which contains EMPS.csv and DEPTS.csv, yielding the two tables EMPS and DEPTS.
Summary
We do not need to define any tables in the model; the schema generates them automatically. The main flow is:
model file
—— specifies ——> XXXSchemaFactory (implements SchemaFactory)
—— create() builds ——> XXXSchema (extends AbstractSchema)
—— getTableMap() builds ——> XXXTable (implements the Table interface) ——> scans the file
Tables and views in schema
Besides the tables the schema generates automatically, you can define extra tables with the schema's tables property. Here is how to define a view in a schema:
{
  version: '1.0',
  defaultSchema: 'SALES',
  schemas: [
    {
      name: 'SALES',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.csv.CsvSchemaFactory',
      operand: {
        directory: 'target/test-classes/sales'
      },
      tables: [
        {
          name: 'FEMALE_EMPS',
          type: 'view',
          sql: 'SELECT * FROM emps WHERE gender = \'F\''
        }
      ]
    }
  ]
}
Here:
- type 'view' marks FEMALE_EMPS as a view
A view can be queried just like an ordinary table:
sqlline> SELECT e.name, d.name FROM female_emps AS e JOIN depts AS d on e.deptno = d.deptno;
+--------+------------+
| NAME | NAME |
+--------+------------+
| Wilma | Marketing |
+--------+------------+
Custom tables
A custom table is a table implemented directly by user code, rather than created by a custom schema.
An example from model-with-custom-table.json:
{
  version: '1.0',
  defaultSchema: 'CUSTOM_TABLE',
  schemas: [
    {
      name: 'CUSTOM_TABLE',
      tables: [
        {
          name: 'EMPS',
          type: 'custom',
          factory: 'org.apache.calcite.adapter.csv.CsvTableFactory',
          operand: {
            file: 'target/test-classes/sales/EMPS.csv.gz',
            flavor: "scannable"
          }
        }
      ]
    }
  ]
}
Query it as follows:
sqlline> !connect jdbc:calcite:model=target/test-classes/model-with-custom-table.json admin admin
sqlline> SELECT empno, name FROM custom_table.emps;
+--------+--------+
| EMPNO | NAME |
+--------+--------+
| 100 | Fred |
| 110 | Eric |
| 110 | John |
| 120 | Wilma |
| 130 | Alice |
+--------+--------+
The schema is a regular one, and contains a custom table powered by org.apache.calcite.adapter.csv.CsvTableFactory, which implements the Calcite TableFactory interface. Its create() method instantiates a CsvScannableTable, passing in the file argument from the model file:
public CsvTable create(SchemaPlus schema, String name,
    Map<String, Object> map, RelDataType rowType) {
  // The "file" operand from the model gives the path directly;
  // no directory scan is needed.
  String fileName = (String) map.get("file");
  final File file = new File(fileName);
  final RelProtoDataType protoRowType =
      rowType != null ? RelDataTypeImpl.proto(rowType) : null;
  return new CsvScannableTable(file, protoRowType);
}
Summary and comparison
- Implementing a custom table is a complement to implementing a custom schema; both ultimately produce an implementation of the Table interface. The difference is that a custom table needs no metadata discovery: CsvTableFactory creates a CsvScannableTable just as CsvSchema does, but it does not scan a directory for .csv files, because the file path is specified explicitly.
- Custom tables require the author to name a file explicitly for each table, but in return give the author more control, for example different parameters for each table.
Comments in models
Models can include comments using /* … */ and // syntax:
{
version: '1.0',
/* Multi-line
comment. */
defaultSchema: 'CUSTOM_TABLE',
// Single-line comment.
schemas: [
..
]
}
(Comments are not standard JSON, but are a harmless extension.)
Optimizing queries using planner rules
- Calcite supports query optimization by adding planner rules. A planner rule matches nodes in the query parse tree and replaces the matched nodes with new ones that implement the optimization.
- Planner rules are extensible, just like schemas and tables. So to query a data source with SQL, you first define a custom table or schema, then define rules to make the queries efficient.
Here is a simple example:
sqlline> !connect jdbc:calcite:model=target/test-classes/model.json admin admin
sqlline> explain plan for select name from emps;
+-----------------------------------------------------+
| PLAN |
+-----------------------------------------------------+
| EnumerableCalcRel(expr#0..9=[{inputs}], NAME=[$t1]) |
| EnumerableTableScan(table=[[SALES, EMPS]]) |
+-----------------------------------------------------+
sqlline> !connect jdbc:calcite:model=target/test-classes/smart.json admin admin
sqlline> explain plan for select name from emps;
+-----------------------------------------------------+
| PLAN |
+-----------------------------------------------------+
| EnumerableCalcRel(expr#0..9=[{inputs}], NAME=[$t1]) |
| CsvTableScan(table=[[SALES, EMPS]]) |
+-----------------------------------------------------+
The difference between the plans comes from the model files: smart.json has one extra line:
flavor: "translatable"
With flavor = TRANSLATABLE, the createTable method of CsvSchema builds a CsvTranslatableTable instance instead of a CsvScannableTable.
CsvTranslatableTable implements TranslatableTable.toRel(), which creates a CsvTableScan as a node in the query operator tree; that is the node on which the rules fire during planning.
The rule used in this example is registered by CsvTableScan:
CsvTableScan::register
@Override public void register(RelOptPlanner planner) {
  planner.addRule(CsvProjectTableScanRule.INSTANCE);
}
A rule:
* declares in its constructor the pattern of relational expressions that will cause the rule to fire
* in its onMatch method, generates a new relational expression and calls RelOptRuleCall.transformTo() to indicate that the rule has fired
public class CsvProjectTableScanRule extends RelOptRule {
  public static final CsvProjectTableScanRule INSTANCE =
      new CsvProjectTableScanRule();

  private CsvProjectTableScanRule() {
    super(
        operand(Project.class,
            operand(CsvTableScan.class, none())),
        "CsvProjectTableScanRule");
  }

  @Override
  public void onMatch(RelOptRuleCall call) {
    final Project project = call.rel(0);
    final CsvTableScan scan = call.rel(1);
    int[] fields = getProjectFields(project.getProjects());
    if (fields == null) {
      // Project contains expressions more complex than just field references.
      return;
    }
    call.transformTo(
        new CsvTableScan(
            scan.getCluster(),
            scan.getTable(),
            scan.csvTable,
            fields));
  }

  private int[] getProjectFields(List<RexNode> exps) {
    final int[] fields = new int[exps.size()];
    for (int i = 0; i < exps.size(); i++) {
      final RexNode exp = exps.get(i);
      if (exp instanceof RexInputRef) {
        fields[i] = ((RexInputRef) exp).getIndex();
      } else {
        return null; // not a simple projection
      }
    }
    return fields;
  }
}
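What the rule's fields array buys can be shown with a self-contained sketch (plain Java, no Calcite types): instead of materializing whole rows and discarding columns afterwards, the scan emits only the requested columns.

```java
import java.util.ArrayList;
import java.util.List;

public class ProjectionPushdown {
  // Emit, for every input row, only the columns named by the field
  // indexes — the same contract the fields array gives CsvTableScan.
  static List<Object[]> project(List<Object[]> rows, int[] fields) {
    List<Object[]> result = new ArrayList<>();
    for (Object[] row : rows) {
      Object[] projected = new Object[fields.length];
      for (int i = 0; i < fields.length; i++) {
        projected[i] = row[fields[i]];
      }
      result.add(projected);
    }
    return result;
  }
}
```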
The query optimization process
1) The optimization process does not simply apply the rules in a prescribed order; the planner explores many branches of the query tree.
2) Calcite uses cost to choose between plans.
3) Calcite does not commit to a fixed order between rules. Faced with a choice between rule A and rule B, it can pursue both, produce the alternative results, and compare them, unlike many other optimizers:
> Many optimizers have a linear optimization scheme. Faced with a choice between rule A and rule B, as above, such an optimizer needs to choose immediately. It might have a policy such as "apply rule A to the whole tree, then apply rule B to the whole tree", or apply a cost-based policy, applying the rule that produces the cheaper result.
4) The cost model prevents the search from getting stuck in a local minimum and is used to prune the search.
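The contrast with a linear scheme can be made concrete with a toy sketch (an illustration under stated assumptions, not Calcite code): a cost-based optimizer keeps every generated alternative and picks the cheapest, rather than committing to the first rule order it tries.

```java
import java.util.Collections;
import java.util.Map;

public class CostBasedChoice {
  // Given the estimated cost of each alternative plan, return the
  // cheapest one. A linear optimizer would never see most of these
  // alternatives, because it commits to one rule order up front.
  static String cheapest(Map<String, Double> planCosts) {
    return Collections.min(planCosts.entrySet(),
        Map.Entry.comparingByValue()).getKey();
  }
}
```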
JDBC adapter
The JDBC adapter maps a schema in a JDBC data source into a Calcite schema.
For example, the following model reads the schema from a MySQL "foodmart" database:
{
  version: '1.0',
  defaultSchema: 'FOODMART',
  schemas: [
    {
      name: 'FOODMART',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.jdbc.JdbcSchema$Factory',
      operand: {
        jdbcDriver: 'com.mysql.jdbc.Driver',
        jdbcUrl: 'jdbc:mysql://localhost/foodmart',
        jdbcUser: 'foodmart',
        jdbcPassword: 'foodmart'
      }
    }
  ]
}
Current limitations:
- The JDBC adapter currently only pushes down table scan operations; all other processing (filtering, joins, aggregations and so forth) occurs within Calcite.
- Our goal is to push down as much processing as possible to the source system, translating syntax, data types and built-in functions as we go. If a Calcite query is based on tables from a single JDBC database, in principle the whole query should go to that database. If tables are from multiple JDBC sources, or a mixture of JDBC and non-JDBC, Calcite will use the most efficient distributed query approach that it can.
The cloning JDBC adapter
The cloning JDBC adapter creates a hybrid database. The data is sourced from a JDBC database but is read into in-memory tables the first time each table is accessed. Calcite evaluates queries based on those in-memory tables, effectively a cache of the database.
For example, the following model reads tables from a MySQL “foodmart” database:
{
  version: '1.0',
  defaultSchema: 'FOODMART_CLONE',
  schemas: [
    {
      name: 'FOODMART_CLONE',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.clone.CloneSchema$Factory',
      operand: {
        jdbcDriver: 'com.mysql.jdbc.Driver',
        jdbcUrl: 'jdbc:mysql://localhost/foodmart',
        jdbcUser: 'foodmart',
        jdbcPassword: 'foodmart'
      }
    }
  ]
}
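The core idea of the cloning adapter — read each table from the backing JDBC source once, then answer later queries from memory — can be sketched as a lazy cache (a simplification, not CloneSchema's actual code; the loader function stands in for a JDBC read):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class TableCache {
  private final Map<String, List<Object[]>> cache = new HashMap<>();
  private final Function<String, List<Object[]>> loader; // e.g. a JDBC read

  TableCache(Function<String, List<Object[]>> loader) {
    this.loader = loader;
  }

  // The first access to a table invokes the loader; every later access
  // is served from the in-memory copy.
  List<Object[]> rows(String table) {
    return cache.computeIfAbsent(table, loader);
  }
}
```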
Another technique is to build a clone schema on top of an existing schema. You use the source property to reference a schema defined earlier in the model, like this:
{
  version: '1.0',
  defaultSchema: 'FOODMART_CLONE',
  schemas: [
    {
      name: 'FOODMART',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.jdbc.JdbcSchema$Factory',
      operand: {
        jdbcDriver: 'com.mysql.jdbc.Driver',
        jdbcUrl: 'jdbc:mysql://localhost/foodmart',
        jdbcUser: 'foodmart',
        jdbcPassword: 'foodmart'
      }
    },
    {
      name: 'FOODMART_CLONE',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.clone.CloneSchema$Factory',
      operand: {
        source: 'FOODMART'
      }
    }
  ]
}
You can use this approach to create a clone schema on any type of schema, not just JDBC.
The cloning adapter isn’t the be-all and end-all. We plan to develop more sophisticated caching strategies, and a more complete and efficient implementation of in-memory tables, but for now the cloning JDBC adapter shows what is possible and allows us to try out our initial implementations.