Flink Table API & SQL concepts and general API

Official website link: https://ci.apache.org/projects/flink/flink-docs-release-1.9/zh/dev/table/common.html#register-a-datastream-or-dataset-as-table

Table API & SQL concepts and general API

Apache Flink features two relational APIs for unified stream and batch processing: the Table API and SQL. The Table API is a language-integrated query API for Scala and Java that allows the composition of queries from relational operators (such as selection, filter, and join) in a very intuitive way. Flink's SQL support is based on Apache Calcite, which implements the SQL standard. Queries specified in either interface have the same semantics and produce the same result, regardless of whether the input is a batch input (DataSet) or a stream input (DataStream).

The Table API and the SQL interface are tightly integrated with each other as well as with Flink's DataStream and DataSet APIs. You can easily switch between all APIs and the libraries built on top of them. For example, you can extract patterns from a DataStream with the CEP library and later use the Table API to analyze the patterns, or you can scan, filter, and aggregate a batch table with a SQL query before running a Gelly graph algorithm on the preprocessed data.

Note that the Table API and SQL are not yet feature complete and are under active development. Not all operations are supported by every combination of [Table API, SQL] and [stream, batch] input.

Table API & SQL Program Structure

All Table API and SQL programs for batch and streaming follow the same pattern. The following sample code shows the common structure of a Table API or SQL program.

// step 1: create a TableEnvironment for the specific planner, batch or streaming
val tableEnv = ... // see "Create a TableEnvironment" section

// step 2: register input tables
tableEnv.registerTable("table1", ...)           // or
tableEnv.registerTableSource("table2", ...)     // or
tableEnv.registerExternalCatalog("extCat", ...)
// step 3: register an output table
tableEnv.registerTableSink("outputTable", ...);

// step 4: create a Table from a Table API query
val tapiResult = tableEnv.scan("table1").select(...)
// create a Table from a SQL query
val sqlResult  = tableEnv.sqlQuery("SELECT ... FROM table2 ...")

// step 5: emit the Table API result Table to a TableSink; a SQL result is emitted the same way
tapiResult.insertInto("outputTable")

// step 6: execute the program
tableEnv.execute("scala_job")

Note: Table API and SQL queries can be easily integrated with and embedded into DataStream or DataSet programs. See Integration with DataStream and DataSet API below to learn how DataStreams and DataSets can be converted into Tables and vice versa.

Create a TableEnvironment

The TableEnvironment is a central concept of the Table API and SQL integration. It is responsible for:

  • Registering a Table in the internal catalog
  • Registering an external catalog
  • Executing SQL queries
  • Registering a user-defined (scalar, table, or aggregation) function
  • Converting a DataStream or DataSet into a Table
  • Holding a reference to an ExecutionEnvironment or StreamExecutionEnvironment

A Table is always bound to a specific TableEnvironment. It is not possible to combine tables of different TableEnvironments in the same query, e.g., to join or union them.

A TableEnvironment is created by calling the static BatchTableEnvironment.create() or StreamTableEnvironment.create() method with a StreamExecutionEnvironment or an ExecutionEnvironment and an optional TableConfig. The TableConfig can be used to configure the TableEnvironment or to customize the query optimization and translation process (see Query Optimization).

Make sure to choose the BatchTableEnvironment/StreamTableEnvironment of the specific planner that matches your programming language. If both planner jars are on the classpath (the default behavior), you should explicitly set the planner to use in the current program.

// **********************
// FLINK STREAMING QUERY
// **********************
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.EnvironmentSettings
import org.apache.flink.table.api.scala.StreamTableEnvironment

val fsSettings = EnvironmentSettings.newInstance().useOldPlanner().inStreamingMode().build()
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val fsTableEnv = StreamTableEnvironment.create(fsEnv, fsSettings)
// or val fsTableEnv = TableEnvironment.create(fsSettings)

// ******************
// FLINK BATCH QUERY
// ******************
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.table.api.scala.BatchTableEnvironment

val fbEnv = ExecutionEnvironment.getExecutionEnvironment
val fbTableEnv = BatchTableEnvironment.create(fbEnv)

// **********************
// BLINK STREAMING QUERY
// **********************
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.EnvironmentSettings
import org.apache.flink.table.api.scala.StreamTableEnvironment

val bsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val bsSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()
val bsTableEnv = StreamTableEnvironment.create(bsEnv, bsSettings)
// or val bsTableEnv = TableEnvironment.create(bsSettings)

// ******************
// BLINK BATCH QUERY
// ******************
import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

val bbSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build()
val bbTableEnv = TableEnvironment.create(bbSettings)

Note: If only one planner jar is present in the /lib directory, you can use useAnyPlanner to create the EnvironmentSettings.
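
A minimal sketch of this shortcut, reusing TableEnvironment.create from the examples above (streaming mode is chosen here only for illustration):

// with a single planner jar in /lib, the planner does not have to be named explicitly
val anySettings = EnvironmentSettings.newInstance().useAnyPlanner().inStreamingMode().build()
val anyTableEnv = TableEnvironment.create(anySettings)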

Register Tables in the Catalog

Catalog: all metadata about databases and tables is stored in Flink's internal catalog structure, i.e., Flink internally keeps all Table-related metadata, including table schema information and data source information.

A TableEnvironment maintains a catalog of tables which are registered by name. There are two types of tables: input tables and output tables. Input tables can be referenced in Table API and SQL queries and provide input data. Output tables can be used to emit the result of a Table API or SQL query to an external system.

An input table can be registered from various sources:

  • an existing Table object, usually the result of a Table API or SQL query.
  • a TableSource, which accesses external data such as a file, database, or messaging system.
  • a DataStream (only for streaming jobs) or DataSet (only for batch jobs translated by the old planner) from a DataStream or DataSet program.

An output table can be registered using a TableSink.

Register a Table

A Table is registered in a TableEnvironment as follows:

// get a TableEnvironment
val tableEnv = ... // see "Create a TableEnvironment" section

// table is the result of a simple projection query 
val projTable: Table = tableEnv.scan("X").select(...)

// register the Table projTable as table "projectedTable"
tableEnv.registerTable("projectedTable", projTable)

Note: A registered Table is treated similarly to a VIEW as known from relational database systems, i.e., the query that defines the Table is not optimized but is inlined when another query references the registered Table. If multiple queries reference the same registered Table, it is inlined for each referencing query and executed multiple times, i.e., the result of the registered Table is not shared.

Register a TableSource

A TableSource provides access to external data stored in a storage system such as a database (MySQL, HBase, ...), a file with a specific encoding (CSV, Apache [Parquet, Avro, ORC], ...), or a messaging system (Apache Kafka, RabbitMQ, ...).

Flink aims to provide TableSources for common data formats and storage systems. See the "Table Sources and Sinks" page for a list of supported TableSources and instructions on how to build a custom TableSource.

A TableSource is registered in a TableEnvironment as follows:

// get a TableEnvironment
val tableEnv = ... // see "Create a TableEnvironment" section

// create a TableSource
val csvSource: TableSource = new CsvTableSource("/path/to/file", ...)

// register the TableSource as table "CsvTable"
tableEnv.registerTableSource("CsvTable", csvSource)

Note: A TableEnvironment used with the Blink planner only accepts StreamTableSource, LookupableTableSource, and InputFormatTableSource, and a StreamTableSource used with the Blink batch planner must be bounded.

Register a TableSink

A registered TableSink can be used to emit the result of a Table API or SQL query to an external storage system such as a database, key-value store, message queue, or file system (in various encodings, e.g., CSV, Apache [Parquet, Avro, ORC], ...).

Flink aims to provide TableSinks for common data formats and storage systems. See the "Table Sources and Sinks" page for details about the available sinks and instructions on how to implement a custom TableSink.

A TableSink is registered in a TableEnvironment as follows:

// get a TableEnvironment
val tableEnv = ... // see "Create a TableEnvironment" section

// create a TableSink
val csvSink: TableSink = new CsvTableSink("/path/to/file", ...)

// define the field names and types
val fieldNames: Array[String] = Array("a", "b", "c")
val fieldTypes: Array[TypeInformation[_]] = Array(Types.INT, Types.STRING, Types.LONG)

// register the TableSink as table "CsvSinkTable"
tableEnv.registerTableSink("CsvSinkTable", fieldNames, fieldTypes, csvSink)

Register an External Catalog

An external catalog can provide information about external databases and tables, such as their name, schema, and statistics, as well as information about how to access data stored in an external database, table, or file.

An external catalog is created by implementing the ExternalCatalog interface and is registered in a TableEnvironment as follows:

// get a TableEnvironment
val tableEnv = ... // see "Create a TableEnvironment" section

// create an external catalog
val catalog: ExternalCatalog = new InMemoryExternalCatalog("InMemCatalog")

// register the ExternalCatalog catalog
tableEnv.registerExternalCatalog("InMemCatalog", catalog)

Once registered in a TableEnvironment, all tables defined in an ExternalCatalog can be accessed from Table API or SQL queries by specifying their full path, such as catalog.database.table.

Currently, Flink provides an InMemoryExternalCatalog for demonstration and testing purposes. However, the ExternalCatalog interface can also be used to connect catalogs such as HCatalog or a Hive Metastore to the Table API.

Note: The Blink planner does not support external catalogs.

Query a Table

Table API

The Table API is a language-integrated query API for Scala and Java. In contrast to SQL, queries are not specified as strings but are composed step by step in the host language.

The API is based on the Table class, which represents a table (streaming or batch) and offers methods to apply relational operations. These methods return a new Table object, which represents the result of applying the relational operation to the input Table. Some relational operations are composed of multiple method calls, such as table.groupBy(...).select(...), where groupBy(...) specifies a grouping of the table and select(...) the projection on the grouping.

The Table API document describes all Table API operations supported on streaming and batch tables.

The following example shows a simple Table API aggregation query:

// get a TableEnvironment
val tableEnv = ... // see "Create a TableEnvironment" section

// register Orders table

// scan registered Orders table
val orders = tableEnv.scan("Orders")
// compute revenue for all customers from France
val revenue = orders
  .filter('cCountry === "FRANCE")
  .groupBy('cID, 'cName)
  .select('cID, 'cName, 'revenue.sum as 'revSum)

// emit or convert Table
// execute query

Note: The Scala Table API uses Scala Symbols, which start with a single tick (') to reference the attributes of a Table. The Table API uses Scala implicits. Make sure to import org.apache.flink.api.scala._ and org.apache.flink.table.api.scala._ in order to use the Scala implicit conversions.

SQL API

Flink's SQL integration is based on Apache Calcite, which implements the SQL standard. SQL queries are specified as regular strings. The SQL document describes Flink's SQL support for streaming and batch tables.

The following example shows how to specify a query and return the result as a Table:

// get a TableEnvironment
val tableEnv = ... // see "Create a TableEnvironment" section

// register Orders table

// compute revenue for all customers from France
val revenue = tableEnv.sqlQuery("""
  |SELECT cID, cName, SUM(revenue) AS revSum
  |FROM Orders
  |WHERE cCountry = 'FRANCE'
  |GROUP BY cID, cName
  """.stripMargin)

// emit or convert Table
// execute query

The following example shows how to specify an update query that inserts its result into a registered table:

// get a TableEnvironment
val tableEnv = ... // see "Create a TableEnvironment" section

// register "Orders" table
// register "RevenueFrance" output table

// compute revenue for all customers from France and emit to "RevenueFrance"
tableEnv.sqlUpdate("""
  |INSERT INTO RevenueFrance
  |SELECT cID, cName, SUM(revenue) AS revSum
  |FROM Orders
  |WHERE cCountry = 'FRANCE'
  |GROUP BY cID, cName
  """.stripMargin)

// execute query

Mixing Table API and SQL

Table API and SQL queries can be easily mixed because both return Table objects:

  • A Table API query can be defined on the Table object returned by a SQL query.
  • A SQL query can be defined on the result of a Table API query by registering the resulting Table in the TableEnvironment and referencing it in the FROM clause of the SQL query (see the sketch after this list).
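
The following is a minimal sketch of both directions, assuming an "Orders" table is already registered; the column names are illustrative.

// get a TableEnvironment
val tableEnv = ... // see "Create a TableEnvironment" section

// 1) apply a Table API query to the Table returned by a SQL query
val sqlResult: Table = tableEnv.sqlQuery("SELECT cID, cName, revenue FROM Orders WHERE cCountry = 'FRANCE'")
val apiOnSql: Table = sqlResult
  .groupBy('cID, 'cName)
  .select('cID, 'cName, 'revenue.sum as 'revSum)

// 2) reference the result of a Table API query from SQL by registering it under a name
val apiResult: Table = tableEnv.scan("Orders").filter('cCountry === "FRANCE")
tableEnv.registerTable("FrenchOrders", apiResult)
val sqlOnApi: Table = tableEnv.sqlQuery("SELECT cID, SUM(revenue) AS revSum FROM FrenchOrders GROUP BY cID")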

Emit a Table

A Table is emitted by writing it to a TableSink. A TableSink is a generic interface that supports a variety of file formats (e.g., CSV, Apache Parquet, Apache Avro), storage systems (e.g., JDBC, Apache HBase, Apache Cassandra, Elasticsearch), and messaging systems (e.g., Apache Kafka, RabbitMQ).

A batch Table can only be written to a BatchTableSink, while a streaming Table requires an AppendStreamTableSink, a RetractStreamTableSink, or an UpsertStreamTableSink.

Please refer to the Table Sources & Sinks document for details about the available sinks and how to implement a custom TableSink.

The Table.insertInto(String tableName) method emits a Table to a registered TableSink. The method looks up the TableSink in the catalog by name and validates that the schema of the Table is identical to the schema of the TableSink.

The following example shows how to emit a Table:

// get a TableEnvironment
val tableEnv = ... // see "Create a TableEnvironment" section

// create a TableSink
val sink: TableSink = new CsvTableSink("/path/to/file", fieldDelim = "|")

// register the TableSink with a specific schema
val fieldNames: Array[String] = Array("a", "b", "c")
val fieldTypes: Array[TypeInformation[_]] = Array(Types.INT, Types.STRING, Types.LONG)
tableEnv.registerTableSink("CsvSinkTable", fieldNames, fieldTypes, sink)

// compute a result Table using Table API operators and/or SQL queries
val result: Table = ...

// emit the result Table to the registered TableSink
result.insertInto("CsvSinkTable")

// execute the program

Translate and Execute a Query

The behavior of translating and executing a query differs between the two planners.

Table API and SQL queries are translated into DataStream or DataSet programs, depending on whether their input is a streaming or a batch input. A query is internally represented as a logical query plan and is translated in two phases:

  1. Optimization of the logical plan
  2. Translation into a DataStream or DataSet program

A Table API or SQL query is translated in the following cases:

  • A Table is emitted to a TableSink, i.e., when Table.insertInto() is called.
  • A SQL update query is specified, i.e., when TableEnvironment.sqlUpdate() is called.
  • A Table is converted into a DataStream or DataSet (see Integration with DataStream and DataSet API).

Once translated, a Table API or SQL query is handled like a regular DataStream or DataSet program and is executed when StreamExecutionEnvironment.execute() or ExecutionEnvironment.execute() is called.

Integration with DataStream and DataSet API

Both planners can integrate with the DataStream API when running on streams. Only the old planner can integrate with the DataSet API; the Blink planner in batch mode cannot be combined with either.

Note: The DataSet API discussed below is only relevant for the old planner in batch mode.

Table API and SQL queries can be easily integrated with and embedded into DataStream and DataSet programs. For example, you can query an external table (e.g., from an RDBMS), do some preprocessing such as filtering, projecting, aggregating, or joining with metadata, and then further process the data with the DataStream or DataSet API (and any library built on top of these APIs, such as CEP or Gelly). Conversely, a Table API or SQL query can also be applied to the result of a DataStream or DataSet program.

This interaction is achieved by converting a DataStream or DataSet into a Table and vice versa.

Implicit Conversion for Scala

The Scala Table API features implicit conversions for the DataSet, DataStream, and Table classes. These conversions are enabled by importing the package org.apache.flink.table.api.scala._ in addition to org.apache.flink.api.scala._ for the Scala DataStream API.
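
A minimal sketch of the required imports and the shorthand they enable (the tuple values are illustrative):

// implicits for the Scala DataStream/DataSet API (type information)
import org.apache.flink.api.scala._
// implicits for the Table API (toTable, toAppendStream, toRetractStream, ...)
import org.apache.flink.table.api.scala._
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.scala.StreamTableEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tableEnv = StreamTableEnvironment.create(env)

// with the implicits in scope, a DataStream can be turned into a Table directly
val table = env.fromElements((1L, "hello")).toTable(tableEnv, 'myLong, 'myString)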

Register a DataStream or DataSet as a Table

A DataStream or DataSet can be registered as a Table in a TableEnvironment. The schema of the resulting table depends on the data type of the registered DataStream or DataSet.

// get TableEnvironment 
// registration of a DataSet is equivalent
val tableEnv: StreamTableEnvironment = ... // see "Create a TableEnvironment" section

val stream: DataStream[(Long, String)] = ...

// register the DataStream as Table "myTable" with fields "f0", "f1"
tableEnv.registerDataStream("myTable", stream)

// register the DataStream as table "myTable2" with fields "myLong", "myString"
tableEnv.registerDataStream("myTable2", stream, 'myLong, 'myString)

Note: The name of a DataStream Table must not match the ^_DataStreamTable_[0-9]+ pattern and the name of a DataSet Table must not match the ^_DataSetTable_[0-9]+ pattern. These patterns are reserved for internal use only.

Convert a DataStream or DataSet into a Table

Instead of registering a DataStream or DataSet in the TableEnvironment, it can also be directly converted into a Table. This is convenient if you want to use the Table in a Table API query.

// get TableEnvironment
// registration of a DataSet is equivalent
val tableEnv = ... // see "Create a TableEnvironment" section

val stream: DataStream[(Long, String)] = ...

// convert the DataStream into a Table with default fields '_1, '_2
val table1: Table = tableEnv.fromDataStream(stream)

// convert the DataStream into a Table with fields 'myLong, 'myString
val table2: Table = tableEnv.fromDataStream(stream, 'myLong, 'myString)

Convert a Table into a DataStream or DataSet

When converting a Table into a DataStream or DataSet, you need to specify the data type of the resulting DataStream or DataSet, i.e., the data type into which the rows of the Table are converted. Often the most convenient conversion type is Row. The following list gives an overview of the features of the different options:

  • Row: fields are mapped by position, arbitrary number of fields, support for null values, no type-safe access.
  • POJO: fields are mapped by name (POJO fields must be named like Table fields), arbitrary number of fields, support for null values, type-safe access.
  • Case Class: fields are mapped by position, no support for null values, type-safe access.
  • Tuple: fields are mapped by position, limited to 22 (Scala) or 25 (Java) fields, no support for null values, type-safe access.
  • Atomic Type: the Table must have a single field, no support for null values, type-safe access.

Convert a Table into a DataStream

A Table that is the result of a streaming query is updated dynamically, i.e., it changes as new records arrive on the query's input streams. Hence, the DataStream into which such a dynamic query is converted needs to encode the updates of the table.

There are two modes to convert a Table into a DataStream:

1. Append Mode: this mode can only be used if the dynamic Table is only modified by INSERT changes, i.e., it is append-only and previously emitted results are never updated.

2. Retract Mode: this mode can always be used. It encodes INSERT and DELETE changes with a boolean flag.

// get TableEnvironment. 
// registration of a DataSet is equivalent
val tableEnv: StreamTableEnvironment = ... // see "Create a TableEnvironment" section

// Table with two fields (String name, Integer age)
val table: Table = ...

// convert the Table into an append DataStream of Row
val dsRow: DataStream[Row] = tableEnv.toAppendStream[Row](table)

// convert the Table into an append DataStream of Tuple2[String, Int]
val dsTuple: DataStream[(String, Int)] =
  tableEnv.toAppendStream[(String, Int)](table)

// convert the Table into a retract DataStream of Row.
//   A retract stream of type X is a DataStream[(Boolean, X)]. 
//   The boolean field indicates the type of the change. 
//   True is INSERT, false is DELETE.
val retractStream: DataStream[(Boolean, Row)] = tableEnv.toRetractStream[Row](table)

Note: For a detailed discussion of dynamic tables and their properties, see the "Dynamic Tables" document.

Convert a Table into a DataSet

A Table is converted into a DataSet as follows:

// get TableEnvironment 
// registration of a DataSet is equivalent
val tableEnv = BatchTableEnvironment.create(env)

// Table with two fields (String name, Integer age)
val table: Table = ...

// convert the Table into a DataSet of Row
val dsRow: DataSet[Row] = tableEnv.toDataSet[Row](table)

// convert the Table into a DataSet of Tuple2[String, Int]
val dsTuple: DataSet[(String, Int)] = tableEnv.toDataSet[(String, Int)](table)

Mapping of Data Types to Table Schema

Flink's DataStream and DataSet APIs support very diverse types. Composite types such as Tuples (built-in Scala and Flink Java tuples), POJOs, Scala case classes, and Flink's Row type allow nested data structures with multiple fields that can be accessed in table expressions. Other types are treated as atomic types. In the following, we describe how the Table API converts these types into an internal row representation and show examples of converting a DataStream into a Table.

The mapping of a data type to a table schema can happen in two ways: based on field positions or based on field names.

Position-based Mapping

Position-based mapping can be used to give fields more meaningful names while keeping the field order. This mapping is available for composite data types with a defined field order as well as for atomic types. Composite data types such as tuples, rows, and case classes have such a field order. Fields of a POJO, however, must be mapped based on the field names. Fields can be projected out but cannot be renamed with an alias (as).

When defining a position-based mapping, the specified names must not exist in the input data type, otherwise the API assumes that the mapping should be based on the field names. If no field names are specified, the default field names and field order of the composite type are used, or f0 for atomic types.

// get a TableEnvironment
val tableEnv: StreamTableEnvironment = ... // see "Create a TableEnvironment" section

val stream: DataStream[(Long, Int)] = ...

// convert DataStream into Table with default field names "_1" and "_2"
val table: Table = tableEnv.fromDataStream(stream)

// convert DataStream into Table with field "myLong" only
val table: Table = tableEnv.fromDataStream(stream, 'myLong)

// convert DataStream into Table with field names "myLong" and "myInt"
val table: Table = tableEnv.fromDataStream(stream, 'myLong, 'myInt)

Name-based Mapping

Name-based mapping can be used for any data type, including POJOs. It is the most flexible way of defining a table schema mapping. All fields in the mapping are referenced by name and can be renamed using an alias (as). Fields can be reordered and projected out.

If no field names are specified, the default field names and field order of the composite type are used, or f0 for atomic types.

// get a TableEnvironment
val tableEnv: StreamTableEnvironment = ... // see "Create a TableEnvironment" section

val stream: DataStream[(Long, Int)] = ...

// convert DataStream into Table with default field names "_1" and "_2"
val table: Table = tableEnv.fromDataStream(stream)

// convert DataStream into Table with field "_2" only
val table: Table = tableEnv.fromDataStream(stream, '_2)

// convert DataStream into Table with swapped fields
val table: Table = tableEnv.fromDataStream(stream, '_2, '_1)

// convert DataStream into Table with swapped fields and field names "myInt" and "myLong"
val table: Table = tableEnv.fromDataStream(stream, '_2 as 'myInt, '_1 as 'myLong)

Atomic Types

Flink treats primitives (Integer, Double, String) and generic types (types that cannot be analyzed and decomposed) as atomic types. A DataStream or DataSet of an atomic type is converted into a Table with a single attribute. The type of the attribute is inferred from the atomic type, and the name of the attribute can be specified.

// get a TableEnvironment
val tableEnv: StreamTableEnvironment = ... // see "Create a TableEnvironment" section

val stream: DataStream[Long] = ...

// convert DataStream into Table with default field name "f0"
val table: Table = tableEnv.fromDataStream(stream)

// convert DataStream into Table with field name "myLong"
val table: Table = tableEnv.fromDataStream(stream, 'myLong)

Tuples (Scala and Java) and Case Classes (Scala only)

Flink supports Scala's built-in tuples and provides its own tuple classes for Java. DataStreams and DataSets of both kinds of tuples can be converted into tables. Fields can be renamed by providing names for all fields (position-based mapping). If no field names are specified, the default field names are used. If the original field names are referenced (f0, f1, ... for Flink tuples and _1, _2, ... for Scala tuples), the API assumes that the mapping is name-based instead of position-based. Name-based mapping allows for reordering fields and projections with aliases (as).

// get a TableEnvironment
val tableEnv: StreamTableEnvironment = ... // see "Create a TableEnvironment" section

val stream: DataStream[(Long, String)] = ...

// convert DataStream into Table with default field names '_1, '_2
val table: Table = tableEnv.fromDataStream(stream)

// convert DataStream into Table with field names "myLong", "myString" (position-based)
val table: Table = tableEnv.fromDataStream(stream, 'myLong, 'myString)

// convert DataStream into Table with reordered fields "_2", "_1" (name-based)
val table: Table = tableEnv.fromDataStream(stream, '_2, '_1)

// convert DataStream into Table with projected field "_2" (name-based)
val table: Table = tableEnv.fromDataStream(stream, '_2)

// convert DataStream into Table with reordered and aliased fields "myString", "myLong" (name-based)
val table: Table = tableEnv.fromDataStream(stream, '_2 as 'myString, '_1 as 'myLong)

// define case class
case class Person(name: String, age: Int)
val streamCC: DataStream[Person] = ...

// convert DataStream into Table with default field names 'name, 'age
val table = tableEnv.fromDataStream(streamCC)

// convert DataStream into Table with field names 'myName, 'myAge (position-based)
val table = tableEnv.fromDataStream(streamCC, 'myName, 'myAge)

// convert DataStream into Table with reordered and aliased fields "myAge", "myName" (name-based)
val table: Table = tableEnv.fromDataStream(streamCC, 'age as 'myAge, 'name as 'myName)

POJO (Java and Scala)

Flink supports POJOs as composite types. The rules that determine what qualifies as a POJO are documented here.

When converting a POJO DataStream or DataSet into a Table without specifying field names, the names of the original POJO fields are used. The name mapping requires the original names and cannot be done by position. Fields can be renamed using an alias (with the as keyword), reordered, and projected out.

// get a TableEnvironment
val tableEnv: StreamTableEnvironment = ... // see "Create a TableEnvironment" section

// Person is a POJO with field names "name" and "age"
val stream: DataStream[Person] = ...

// convert DataStream into Table with default field names "age", "name" (fields are ordered by name!)
val table: Table = tableEnv.fromDataStream(stream)

// convert DataStream into Table with renamed fields "myAge", "myName" (name-based)
val table: Table = tableEnv.fromDataStream(stream, 'age as 'myAge, 'name as 'myName)

// convert DataStream into Table with projected field "name" (name-based)
val table: Table = tableEnv.fromDataStream(stream, 'name)

// convert DataStream into Table with projected and renamed field "myName" (name-based)
val table: Table = tableEnv.fromDataStream(stream, 'name as 'myName)

Row

The Row data type supports an arbitrary number of fields, and fields may have null values. Field names can be specified via a RowTypeInfo or when converting a Row DataStream or DataSet into a Table. The Row type supports mapping fields by position as well as by name. Fields can be renamed by providing names for all fields (position-based mapping), or fields can be selected individually for projection/reordering/renaming (name-based mapping).

// get a TableEnvironment
val tableEnv: StreamTableEnvironment = ... // see "Create a TableEnvironment" section

// DataStream of Row with two fields "name" and "age" specified in `RowTypeInfo`
val stream: DataStream[Row] = ...

// convert DataStream into Table with default field names "name", "age"
val table: Table = tableEnv.fromDataStream(stream)

// convert DataStream into Table with renamed field names "myName", "myAge" (position-based)
val table: Table = tableEnv.fromDataStream(stream, 'myName, 'myAge)

// convert DataStream into Table with renamed fields "myName", "myAge" (name-based)
val table: Table = tableEnv.fromDataStream(stream, 'name as 'myName, 'age as 'myAge)

// convert DataStream into Table with projected field "name" (name-based)
val table: Table = tableEnv.fromDataStream(stream, 'name)

// convert DataStream into Table with projected and renamed field "myName" (name-based)
val table: Table = tableEnv.fromDataStream(stream, 'name as 'myName)
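
The DataStream[Row] above is elided; the following is a minimal sketch of how such a stream could be built with a RowTypeInfo that carries the field names "name" and "age" (the element values are illustrative):

import org.apache.flink.api.common.typeinfo.{TypeInformation, Types}
import org.apache.flink.api.java.typeutils.RowTypeInfo
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.types.Row

val env = StreamExecutionEnvironment.getExecutionEnvironment

// the RowTypeInfo carries the field names together with their types
val rowTypeInfo = new RowTypeInfo(
  Array[TypeInformation[_]](Types.STRING, Types.INT),
  Array("name", "age"))

// pass the explicit type information instead of relying on implicit derivation
val stream: DataStream[Row] = env.fromCollection(
  Seq(Row.of("Alice", Int.box(30)), Row.of("Bob", Int.box(25))))(rowTypeInfo)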

Query Optimization

Apache Flink uses Apache Calcite to optimize and translate queries. The optimization currently includes projection and filter push-down, subquery decorrelation, and other kinds of query rewriting. The old planner does not optimize the order of joins; they are executed in the order defined in the query (the order of the tables in the FROM clause and/or the order of the join predicates in the WHERE clause).

The rule sets applied in the different optimization phases can be adjusted by providing a CalciteConfig object. It is created via a builder by calling CalciteConfig.createBuilder() and is handed to the TableEnvironment by calling tableEnv.getConfig.setPlannerConfig(calciteConfig).
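
A minimal sketch of wiring an (empty) CalciteConfig into the TableEnvironment, assuming the old planner's org.apache.flink.table.calcite package:

import org.apache.flink.table.calcite.CalciteConfig

// build a CalciteConfig; rule sets for individual optimization phases could be
// replaced or extended via the builder before build() is called
val calciteConfig: CalciteConfig = CalciteConfig.createBuilder().build()

// hand the planner configuration to the TableEnvironment
tableEnv.getConfig.setPlannerConfig(calciteConfig)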

Explaining a Table

The Table API provides a mechanism to explain the logical and optimized query plans used to compute a Table. This is done through the TableEnvironment.explain(table) method or the TableEnvironment.explain() method. explain(table) returns the plan of a given Table. explain() returns the plan of a multi-sink program and is mainly used for the Blink planner. It returns a string that describes three plans:

  1. the abstract syntax tree of the relational query, i.e., the unoptimized logical query plan,
  2. the optimized logical query plan, and
  3. the physical execution plan.

The following code shows an example and the corresponding output of explain(table) for a given Table:

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = StreamTableEnvironment.create(env)

val table1 = env.fromElements((1, "hello")).toTable(tEnv, 'count, 'word)
val table2 = env.fromElements((1, "hello")).toTable(tEnv, 'count, 'word)
val table = table1
  .where('word.like("F%"))
  .unionAll(table2)

val explanation: String = tEnv.explain(table)
println(explanation)
== Abstract Syntax Tree ==
LogicalUnion(all=[true])
  LogicalFilter(condition=[LIKE($1, _UTF-16LE'F%')])
    FlinkLogicalDataStreamScan(id=[1], fields=[count, word])
  FlinkLogicalDataStreamScan(id=[2], fields=[count, word])

== Optimized Logical Plan ==
DataStreamUnion(all=[true], union all=[count, word])
  DataStreamCalc(select=[count, word], where=[LIKE(word, _UTF-16LE'F%')])
    DataStreamScan(id=[1], fields=[count, word])
  DataStreamScan(id=[2], fields=[count, word])

== Physical Execution Plan ==
Stage 1 : Data Source
	content : collect elements with CollectionInputFormat

Stage 2 : Data Source
	content : collect elements with CollectionInputFormat

	Stage 3 : Operator
		content : from: (count, word)
		ship_strategy : REBALANCE

		Stage 4 : Operator
			content : where: (LIKE(word, _UTF-16LE'F%')), select: (count, word)
			ship_strategy : FORWARD

			Stage 5 : Operator
				content : from: (count, word)
				ship_strategy : REBALANCE

The following code shows an example and the corresponding output of a multi-sink plan using explain():

val settings = EnvironmentSettings.newInstance.useBlinkPlanner.inStreamingMode.build
val tEnv = TableEnvironment.create(settings)

val fieldNames = Array("count", "word")
val fieldTypes = Array[TypeInformation[_]](Types.INT, Types.STRING)
tEnv.registerTableSource("MySource1", new CsvTableSource("/source/path1", fieldNames, fieldTypes))
tEnv.registerTableSource("MySource2", new CsvTableSource("/source/path2",fieldNames, fieldTypes))
tEnv.registerTableSink("MySink1", new CsvTableSink("/sink/path1").configure(fieldNames, fieldTypes))
tEnv.registerTableSink("MySink2", new CsvTableSink("/sink/path2").configure(fieldNames, fieldTypes))

val table1 = tEnv.scan("MySource1").where("LIKE(word, 'F%')")
table1.insertInto("MySink1")

val table2 = table1.unionAll(tEnv.scan("MySource2"))
table2.insertInto("MySink2")

val explanation = tEnv.explain(false)
println(explanation)

The result of the multi-sink plan is:

== Abstract Syntax Tree ==
LogicalSink(name=[MySink1], fields=[count, word])
+- LogicalFilter(condition=[LIKE($1, _UTF-16LE'F%')])
   +- LogicalTableScan(table=[[default_catalog, default_database, MySource1, source: [CsvTableSource(read fields: count, word)]]])

LogicalSink(name=[MySink2], fields=[count, word])
+- LogicalUnion(all=[true])
   :- LogicalFilter(condition=[LIKE($1, _UTF-16LE'F%')])
   :  +- LogicalTableScan(table=[[default_catalog, default_database, MySource1, source: [CsvTableSource(read fields: count, word)]]])
   +- LogicalTableScan(table=[[default_catalog, default_database, MySource2, source: [CsvTableSource(read fields: count, word)]]])

== Optimized Logical Plan ==
Calc(select=[count, word], where=[LIKE(word, _UTF-16LE'F%')], reuse_id=[1])
+- TableSourceScan(table=[[default_catalog, default_database, MySource1, source: [CsvTableSource(read fields: count, word)]]], fields=[count, word])

Sink(name=[MySink1], fields=[count, word])
+- Reused(reference_id=[1])

Sink(name=[MySink2], fields=[count, word])
+- Union(all=[true], union=[count, word])
   :- Reused(reference_id=[1])
   +- TableSourceScan(table=[[default_catalog, default_database, MySource2, source: [CsvTableSource(read fields: count, word)]]], fields=[count, word])

== Physical Execution Plan ==
Stage 1 : Data Source
	content : collect elements with CollectionInputFormat

	Stage 2 : Operator
		content : CsvTableSource(read fields: count, word)
		ship_strategy : REBALANCE

		Stage 3 : Operator
			content : SourceConversion(table:Buffer(default_catalog, default_database, MySource1, source: [CsvTableSource(read fields: count, word)]), fields:(count, word))
			ship_strategy : FORWARD

			Stage 4 : Operator
				content : Calc(where: (word LIKE _UTF-16LE'F%'), select: (count, word))
				ship_strategy : FORWARD

				Stage 5 : Operator
					content : SinkConversionToRow
					ship_strategy : FORWARD

					Stage 6 : Operator
						content : Map
						ship_strategy : FORWARD

Stage 8 : Data Source
	content : collect elements with CollectionInputFormat

	Stage 9 : Operator
		content : CsvTableSource(read fields: count, word)
		ship_strategy : REBALANCE

		Stage 10 : Operator
			content : SourceConversion(table:Buffer(default_catalog, default_database, MySource2, source: [CsvTableSource(read fields: count, word)]), fields:(count, word))
			ship_strategy : FORWARD

			Stage 12 : Operator
				content : SinkConversionToRow
				ship_strategy : FORWARD

				Stage 13 : Operator
					content : Map
					ship_strategy : FORWARD

					Stage 7 : Data Sink
						content : Sink: CsvTableSink(count, word)
						ship_strategy : FORWARD

						Stage 14 : Data Sink
							content : Sink: CsvTableSink(count, word)
							ship_strategy : FORWARD