Tutorial
This tutorial uses a simple adapter to map a directory of CSV files to a schema containing tables, and then provides a SQL query interface over them.
It covers the following key concepts:
- Building a user-defined schema with the SchemaFactory and Schema interfaces
- Declaring schemas in a model JSON file
- Declaring views in a model JSON file
- Building a user-defined table with the Table interface
- Implementing a simple Table with the ScannableTable interface, which enumerates all rows directly
Advanced Table implementation: FilterableTable
Implementing a slightly more advanced Table with the FilterableTable interface, which can filter out rows according to simple predicates
Advanced Table implementation: TranslatableTable
Translating the table into a RelNode relational expression by means of planner rules.
TranslatableTable is an extension to Table that specifies how the table is to be translated to a RelNode relational expression.
If a Table does not implement this interface, it will be converted to an EnumerableTableScan.
Generally a Table will implement this interface to create a particular subclass of RelNode, and also register rules that act on that particular subclass of RelNode.
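The difference between the scannable and filterable flavors above can be illustrated with a plain-Java sketch. This is not the Calcite API, only the access pattern: a scannable table hands every row to the engine, which filters afterwards, while a filterable table applies simple pushed-down predicates itself, so fewer rows cross the boundary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class ScanStyles {
  // "Scannable" style: enumerate every row; the engine filters later.
  static List<Object[]> scanAll(List<Object[]> storage) {
    return new ArrayList<>(storage);
  }

  // "Filterable" style: the table applies the pushed-down predicate
  // itself and only emits matching rows.
  static List<Object[]> scanFiltered(List<Object[]> storage,
                                     Predicate<Object[]> predicate) {
    List<Object[]> result = new ArrayList<>();
    for (Object[] row : storage) {
      if (predicate.test(row)) {
        result.add(row);
      }
    }
    return result;
  }
}
```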
Download and build
$ git clone https://github.com/apache/calcite.git
$ cd calcite
$ mvn install -DskipTests -Dcheckstyle.skip=true
$ cd example/csv
First queries
Use the project's built-in sqlline to run queries (if you are running Windows, the command is sqlline.bat):
$ ./sqlline
sqlline> !connect jdbc:calcite:model=target/test-classes/model.json admin admin
Execute a metadata query:
sqlline> !tables
(JDBC experts, note: sqlline’s !tables command is just executing DatabaseMetaData.getTables() behind the scenes. It has other commands to query JDBC metadata, such as !columns and !describe.)
+------------+--------------+-------------+---------------+----------+------+
| TABLE_CAT | TABLE_SCHEM | TABLE_NAME | TABLE_TYPE | REMARKS | TYPE |
+------------+--------------+-------------+---------------+----------+------+
| null | SALES | DEPTS | TABLE | null | null |
| null | SALES | EMPS | TABLE | null | null |
| null | SALES | HOBBIES | TABLE | null | null |
| null | metadata | COLUMNS | SYSTEM_TABLE | null | null |
| null | metadata | TABLES | SYSTEM_TABLE | null | null |
+------------+--------------+-------------+---------------+----------+------+
This shows five tables:
- SALES schema: EMPS, DEPTS, HOBBIES
- metadata schema: COLUMNS, TABLES
System tables are generally implemented inside Calcite itself; the other tables are provided by a specific schema. In the example above, EMPS and DEPTS come from the EMPS.csv and DEPTS.csv files under the target/test-classes directory (a simple adapter maps the directory of CSV files to a schema containing tables).
Calcite supports full SQL queries, such as a table scan:
sqlline> SELECT * FROM emps;
+--------+--------+---------+---------+----------------+--------+-------+---+
| EMPNO | NAME | DEPTNO | GENDER | CITY | EMPID | AGE | S |
+--------+--------+---------+---------+----------------+--------+-------+---+
| 100 | Fred | 10 | | | 30 | 25 | t |
| 110 | Eric | 20 | M | San Francisco | 3 | 80 | n |
| 110 | John | 40 | M | Vancouver | 2 | null | f |
| 120 | Wilma | 20 | F | | 1 | 5 | n |
| 130 | Alice | 40 | F | Vancouver | 2 | null | f |
+--------+--------+---------+---------+----------------+--------+-------+---+
JOIN and GROUP BY:
sqlline> SELECT d.name, COUNT(*)
. . . .> FROM emps AS e JOIN depts AS d ON e.deptno = d.deptno
. . . .> GROUP BY d.name;
+------------+---------+
| NAME | EXPR$1 |
+------------+---------+
| Sales | 1 |
| Marketing | 2 |
+------------+---------+
The VALUES operator generates a single row, which is a convenient way to test expressions and built-in SQL functions:
sqlline> VALUES CHAR_LENGTH('Hello, ' || 'world!');
+---------+
| EXPR$0 |
+---------+
| 13 |
+---------+
Schema
As a "database without a storage layer", Calcite doesn't know about any file formats, so how does it become aware of the CSV tables?
The main steps are as follows:
(1) Define the schema via the schema factory class named in the model file (JSON format).
(2) The schema creates the tables, each of which knows how to scan a CSV file for its data.
(3) Pass the path of the model file in the JDBC connect string (e.g. jdbc:calcite:model=target/test-classes/model.json) and submit the query.
(4) Calcite parses the query and generates a plan; when the query executes, Calcite calls the tables to read the data.
An example model file:
{
  version: '1.0',
  defaultSchema: 'SALES',
  schemas: [
    {
      name: 'SALES',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.csv.CsvSchemaFactory',
      operand: {
        directory: 'target/test-classes/sales'
      }
    }
  ]
}
This model defines a schema named 'SALES', backed by the factory class org.apache.calcite.adapter.csv.CsvSchemaFactory. CsvSchemaFactory implements the SchemaFactory interface; its create() method takes the parameters given in the model file and builds a schema instance (here a CsvSchema, which implements the Schema interface):
public Schema create(SchemaPlus parentSchema, String name,
    Map<String, Object> operand) {
  String directory = (String) operand.get("directory"); // property from the model file
  String flavorName = (String) operand.get("flavor");   // property from the model file
  CsvTable.Flavor flavor;
  if (flavorName == null) {
    flavor = CsvTable.Flavor.SCANNABLE;
  } else {
    flavor = CsvTable.Flavor.valueOf(flavorName.toUpperCase());
  }
  return new CsvSchema(
      new File(directory),
      flavor);
}
A schema's main jobs:
- Build a set of tables (each implementing the Table interface)
- Optionally provide sub-schemas, table functions and so on; these are advanced features that calcite-example-csv does not implement
In this example, CsvSchema builds tables that are instances of CsvTable and its subclasses. CsvSchema extends AbstractSchema and overrides its getTableMap() method:
protected Map<String, Table> getTableMap() {
  // Look for files in the directory ending in ".csv", ".csv.gz", ".json",
  // ".json.gz".
  File[] files = directoryFile.listFiles(
      new FilenameFilter() {
        public boolean accept(File dir, String name) {
          final String nameSansGz = trim(name, ".gz");
          return nameSansGz.endsWith(".csv")
              || nameSansGz.endsWith(".json");
        }
      });
  if (files == null) {
    System.out.println("directory " + directoryFile + " not found");
    files = new File[0];
  }
  // Build a map from table name to table; each file becomes a table.
  final ImmutableMap.Builder<String, Table> builder = ImmutableMap.builder();
  for (File file : files) {
    String tableName = trim(file.getName(), ".gz");
    final String tableNameSansJson = trimOrNull(tableName, ".json");
    if (tableNameSansJson != null) {
      JsonTable table = new JsonTable(file);
      builder.put(tableNameSansJson, table);
      continue;
    }
    tableName = trim(tableName, ".csv");
    final Table table = createTable(file);
    builder.put(tableName, table);
  }
  return builder.build();
}

/** Creates different sub-type of table based on the "flavor" attribute. */
private Table createTable(File file) {
  switch (flavor) {
  case TRANSLATABLE:
    return new CsvTranslatableTable(file, null);
  case SCANNABLE:
    return new CsvScannableTable(file, null);
  case FILTERABLE:
    return new CsvFilterableTable(file, null);
  default:
    throw new AssertionError("Unknown flavor " + flavor);
  }
}
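The getTableMap() code above calls two string helpers, trim and trimOrNull, whose bodies are not shown. A plausible sketch of their behavior, inferred from the call sites (the real CsvSchema has private helpers along these lines):

```java
public class SuffixTrim {
  /** Returns the string with the given suffix removed if it is present,
   * otherwise the string unchanged. */
  static String trim(String s, String suffix) {
    String trimmed = trimOrNull(s, suffix);
    return trimmed != null ? trimmed : s;
  }

  /** Returns the string with the given suffix removed, or null if the
   * suffix is absent. */
  static String trimOrNull(String s, String suffix) {
    return s.endsWith(suffix)
        ? s.substring(0, s.length() - suffix.length())
        : null;
  }
}
```

With these, "EMPS.csv.gz" becomes the table name EMPS, and a "DEPTS.json.gz" file would map to a JsonTable named DEPTS.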
CsvSchema scans the directory, finds every ".csv" file, and creates a table for each file. In this example directory = target/test-classes/sales, which contains EMPS.csv and DEPTS.csv, yielding the two tables EMPS and DEPTS.
Summary
We do not need to define any tables in the model; the schema generates them automatically. The main flow is:
model file
—— specifies ——> XXXSchemaFactory (implements SchemaFactory)
—— create() builds ——> XXXSchema (extends AbstractSchema)
—— getTableMap() builds ——> XXXTable (implements the Table interface) ——> scans the file
Tables and views in schema
Besides the tables the schema generates automatically, you can define extra tables with the schema's tables property. Here is how to define a view in a schema:
{
  version: '1.0',
  defaultSchema: 'SALES',
  schemas: [
    {
      name: 'SALES',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.csv.CsvSchemaFactory',
      operand: {
        directory: 'target/test-classes/sales'
      },
      tables: [
        {
          name: 'FEMALE_EMPS',
          type: 'view',
          sql: 'SELECT * FROM emps WHERE gender = \'F\''
        }
      ]
    }
  ]
}
Here:
- type 'view' marks FEMALE_EMPS as a view
A view can be queried just like an ordinary table:
sqlline> SELECT e.name, d.name FROM female_emps AS e JOIN depts AS d on e.deptno = d.deptno;
+--------+------------+
| NAME | NAME |
+--------+------------+
| Wilma | Marketing |
+--------+------------+
Custom tables
A custom table is a table implemented directly by user code, rather than created by a custom schema.
An example from model-with-custom-table.json:
{
  version: '1.0',
  defaultSchema: 'CUSTOM_TABLE',
  schemas: [
    {
      name: 'CUSTOM_TABLE',
      tables: [
        {
          name: 'EMPS',
          type: 'custom',
          factory: 'org.apache.calcite.adapter.csv.CsvTableFactory',
          operand: {
            file: 'target/test-classes/sales/EMPS.csv.gz',
            flavor: "scannable"
          }
        }
      ]
    }
  ]
}
Query it as follows:
sqlline> !connect jdbc:calcite:model=target/test-classes/model-with-custom-table.json admin admin
sqlline> SELECT empno, name FROM custom_table.emps;
+--------+--------+
| EMPNO | NAME |
+--------+--------+
| 100 | Fred |
| 110 | Eric |
| 110 | John |
| 120 | Wilma |
| 130 | Alice |
+--------+--------+
The schema is a regular one, and contains a custom table powered by org.apache.calcite.adapter.csv.CsvTableFactory, which implements the Calcite TableFactory interface. Its create() method instantiates a CsvScannableTable, passing in the file argument from the model file:
public CsvTable create(SchemaPlus schema, String name,
    Map<String, Object> map, RelDataType rowType) {
  // The "file" operand from the model gives the path directly;
  // no directory scan is needed.
  String fileName = (String) map.get("file");
  final File file = new File(fileName);
  final RelProtoDataType protoRowType =
      rowType != null ? RelDataTypeImpl.proto(rowType) : null;
  return new CsvScannableTable(file, protoRowType);
}
Summary and comparison
- Implementing a custom table is a complement to implementing a custom schema; both ultimately produce an implementation of the Table interface. The difference is that a custom table needs no metadata discovery: CsvTableFactory creates a CsvScannableTable just as CsvSchema does, but it does not scan a directory for .csv files, because the file path is specified explicitly.
- Custom tables require the author to name a file explicitly for each table, but in return give the author more control, for example different parameters for each table.
Comments in models
Models can include comments using /* … */ and // syntax:
{
version: '1.0',
/* Multi-line
comment. */
defaultSchema: 'CUSTOM_TABLE',
// Single-line comment.
schemas: [
..
]
}
(Comments are not standard JSON, but are a harmless extension.)
Optimizing queries using planner rules
- Calcite supports query optimization by adding planner rules. A planner rule matches nodes in the query parse tree and replaces the matched nodes with new ones that implement the optimization.
- Planner rules are extensible, just like schemas and tables. So to query a data source with SQL, you first define a custom table or schema, then define rules to make the queries efficient.
Here is a simple example:
sqlline> !connect jdbc:calcite:model=target/test-classes/model.json admin admin
sqlline> explain plan for select name from emps;
+-----------------------------------------------------+
| PLAN |
+-----------------------------------------------------+
| EnumerableCalcRel(expr#0..9=[{inputs}], NAME=[$t1]) |
| EnumerableTableScan(table=[[SALES, EMPS]]) |
+-----------------------------------------------------+
sqlline> !connect jdbc:calcite:model=target/test-classes/smart.json admin admin
sqlline> explain plan for select name from emps;
+-----------------------------------------------------+
| PLAN |
+-----------------------------------------------------+
| EnumerableCalcRel(expr#0..9=[{inputs}], NAME=[$t1]) |
| CsvTableScan(table=[[SALES, EMPS]]) |
+-----------------------------------------------------+
The difference between the plans comes from the model files: smart.json has one extra line:
flavor: "translatable"
With flavor = TRANSLATABLE, the createTable method of CsvSchema builds a CsvTranslatableTable instance instead of a CsvScannableTable.
CsvTranslatableTable implements TranslatableTable.toRel(), which creates a CsvTableScan as a node in the query operator tree; that is the node on which the rules fire during planning.
The rule used in this example is registered by CsvTableScan:
CsvTableScan::register
@Override public void register(RelOptPlanner planner) {
  planner.addRule(CsvProjectTableScanRule.INSTANCE);
}
A rule:
* declares in its constructor the pattern of relational expressions that will cause the rule to fire
* in its onMatch method, generates a new relational expression and calls RelOptRuleCall.transformTo() to indicate that the rule has fired
public class CsvProjectTableScanRule extends RelOptRule {
  public static final CsvProjectTableScanRule INSTANCE =
      new CsvProjectTableScanRule();

  private CsvProjectTableScanRule() {
    super(
        operand(Project.class,
            operand(CsvTableScan.class, none())),
        "CsvProjectTableScanRule");
  }

  @Override
  public void onMatch(RelOptRuleCall call) {
    final Project project = call.rel(0);
    final CsvTableScan scan = call.rel(1);
    int[] fields = getProjectFields(project.getProjects());
    if (fields == null) {
      // Project contains expressions more complex than just field references.
      return;
    }
    call.transformTo(
        new CsvTableScan(
            scan.getCluster(),
            scan.getTable(),
            scan.csvTable,
            fields));
  }

  private int[] getProjectFields(List<RexNode> exps) {
    final int[] fields = new int[exps.size()];
    for (int i = 0; i < exps.size(); i++) {
      final RexNode exp = exps.get(i);
      if (exp instanceof RexInputRef) {
        fields[i] = ((RexInputRef) exp).getIndex();
      } else {
        return null; // not a simple projection
      }
    }
    return fields;
  }
}
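What the rule's fields array buys can be shown with a self-contained sketch (plain Java, no Calcite types): instead of materializing whole rows and discarding columns afterwards, the scan emits only the requested columns.

```java
import java.util.ArrayList;
import java.util.List;

public class ProjectionPushdown {
  // Emit, for every input row, only the columns named by the field
  // indexes — the same contract the fields array gives CsvTableScan.
  static List<Object[]> project(List<Object[]> rows, int[] fields) {
    List<Object[]> result = new ArrayList<>();
    for (Object[] row : rows) {
      Object[] projected = new Object[fields.length];
      for (int i = 0; i < fields.length; i++) {
        projected[i] = row[fields[i]];
      }
      result.add(projected);
    }
    return result;
  }
}
```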
The query optimization process
1) The optimization process does not simply apply the rules in a prescribed order; the planner explores many branches of the query tree.
2) Calcite uses cost to choose between plans.
3) Calcite does not commit to a fixed order between rules. Faced with a choice between rule A and rule B, it can pursue both, produce the alternative results, and compare them, unlike many other optimizers:
> Many optimizers have a linear optimization scheme. Faced with a choice between rule A and rule B, as above, such an optimizer needs to choose immediately. It might have a policy such as "apply rule A to the whole tree, then apply rule B to the whole tree", or apply a cost-based policy, applying the rule that produces the cheaper result.
4) The cost model prevents the search from getting stuck in a local minimum and is used to prune the search.
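The contrast with a linear scheme can be made concrete with a toy sketch (an illustration under stated assumptions, not Calcite code): a cost-based optimizer keeps every generated alternative and picks the cheapest, rather than committing to the first rule order it tries.

```java
import java.util.Collections;
import java.util.Map;

public class CostBasedChoice {
  // Given the estimated cost of each alternative plan, return the
  // cheapest one. A linear optimizer would never see most of these
  // alternatives, because it commits to one rule order up front.
  static String cheapest(Map<String, Double> planCosts) {
    return Collections.min(planCosts.entrySet(),
        Map.Entry.comparingByValue()).getKey();
  }
}
```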
JDBC adapter
The JDBC adapter maps a schema in a JDBC data source into a Calcite schema.
For example, the following model reads the schema from a MySQL "foodmart" database:
{
  version: '1.0',
  defaultSchema: 'FOODMART',
  schemas: [
    {
      name: 'FOODMART',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.jdbc.JdbcSchema$Factory',
      operand: {
        jdbcDriver: 'com.mysql.jdbc.Driver',
        jdbcUrl: 'jdbc:mysql://localhost/foodmart',
        jdbcUser: 'foodmart',
        jdbcPassword: 'foodmart'
      }
    }
  ]
}
Current limitations:
- The JDBC adapter currently only pushes down table scan operations; all other processing (filtering, joins, aggregations and so forth) occurs within Calcite.
- Our goal is to push down as much processing as possible to the source system, translating syntax, data types and built-in functions as we go. If a Calcite query is based on tables from a single JDBC database, in principle the whole query should go to that database. If tables are from multiple JDBC sources, or a mixture of JDBC and non-JDBC, Calcite will use the most efficient distributed query approach that it can.
The cloning JDBC adapter
The cloning JDBC adapter creates a hybrid database. The data is sourced from a JDBC database but is read into in-memory tables the first time each table is accessed. Calcite evaluates queries based on those in-memory tables, effectively a cache of the database.
For example, the following model reads tables from a MySQL “foodmart” database:
{
  version: '1.0',
  defaultSchema: 'FOODMART_CLONE',
  schemas: [
    {
      name: 'FOODMART_CLONE',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.clone.CloneSchema$Factory',
      operand: {
        jdbcDriver: 'com.mysql.jdbc.Driver',
        jdbcUrl: 'jdbc:mysql://localhost/foodmart',
        jdbcUser: 'foodmart',
        jdbcPassword: 'foodmart'
      }
    }
  ]
}
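The core idea of the cloning adapter — read each table from the backing JDBC source once, then answer later queries from memory — can be sketched as a lazy cache (a simplification, not CloneSchema's actual code; the loader function stands in for a JDBC read):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class TableCache {
  private final Map<String, List<Object[]>> cache = new HashMap<>();
  private final Function<String, List<Object[]>> loader; // e.g. a JDBC read

  TableCache(Function<String, List<Object[]>> loader) {
    this.loader = loader;
  }

  // The first access to a table invokes the loader; every later access
  // is served from the in-memory copy.
  List<Object[]> rows(String table) {
    return cache.computeIfAbsent(table, loader);
  }
}
```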
Another technique is to build a clone schema on top of an existing schema. You use the source property to reference a schema defined earlier in the model, like this:
{
  version: '1.0',
  defaultSchema: 'FOODMART_CLONE',
  schemas: [
    {
      name: 'FOODMART',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.jdbc.JdbcSchema$Factory',
      operand: {
        jdbcDriver: 'com.mysql.jdbc.Driver',
        jdbcUrl: 'jdbc:mysql://localhost/foodmart',
        jdbcUser: 'foodmart',
        jdbcPassword: 'foodmart'
      }
    },
    {
      name: 'FOODMART_CLONE',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.clone.CloneSchema$Factory',
      operand: {
        source: 'FOODMART'
      }
    }
  ]
}
You can use this approach to create a clone schema on any type of schema, not just JDBC.
The cloning adapter isn’t the be-all and end-all. We plan to develop more sophisticated caching strategies, and a more complete and efficient implementation of in-memory tables, but for now the cloning JDBC adapter shows what is possible and allows us to try out our initial implementations.