22. Flink's Table API and SQL: CREATE TABLE DDL

Flink series of articles

1. Links to a comprehensive series of articles covering Flink deployment, core concepts, source/transformation/sink usage examples, and introductions and examples of the four cornerstones

13. Basic concepts of Flink's Table API and SQL, introduction to the general API, and getting-started examples

14. Data types of Flink's Table API and SQL: built-in data types and their attributes

A detailed introduction to dynamic tables, time attribute configuration (how update results are handled), temporal tables, joins on streams, determinism on streams, and query configuration

22. Flink's Table API and SQL: CREATE TABLE DDL

30. The SQL client of Flink SQL (introducing the use of configuration files, tables, views, etc. through Kafka and filesystem examples)



This article introduces the DDL statements of Flink's Table API and SQL, with examples.
It is a relatively simple article that covers only Flink's DDL.

1. Overview of DDL

The CREATE statement is used to register a table, view, or function in the current or specified Catalog. Registered tables, views and functions can be used in SQL queries.
Currently Flink SQL supports the following CREATE statements:

  • CREATE TABLE
  • CREATE CATALOG
  • CREATE DATABASE
  • CREATE VIEW
  • CREATE FUNCTION

2. Execute the CREATE statement

CREATE statements can be executed using the executeSql() method in TableEnvironment. If the CREATE operation is executed successfully, the executeSql() method returns 'OK', otherwise an exception will be thrown.

1、Java

The following example shows how to execute a CREATE statement in a TableEnvironment.

EnvironmentSettings settings = EnvironmentSettings.newInstance()...
TableEnvironment tableEnv = TableEnvironment.create(settings);

// run a SQL query on a registered table
// register a table named "Orders"
tableEnv.executeSql("CREATE TABLE Orders (`user` BIGINT, product STRING, amount INT) WITH (...)");
// run a SQL query on the table and retrieve the result as a new Table
Table result = tableEnv.sqlQuery(
  "SELECT product, amount FROM Orders WHERE product LIKE '%Rubber%'");

// run an INSERT statement on a registered table
// register a TableSink
tableEnv.executeSql("CREATE TABLE RubberOrders(product STRING, amount INT) WITH (...)");
// run an INSERT statement on the table and emit the result to the registered TableSink
tableEnv.executeSql(
  "INSERT INTO RubberOrders SELECT product, amount FROM Orders WHERE product LIKE '%Rubber%'");

2、SQL CLI

Flink SQL> CREATE TABLE Orders (`user` BIGINT, product STRING, amount INT) WITH (...);
[INFO] Table has been created.

Flink SQL> CREATE TABLE RubberOrders (product STRING, amount INT) WITH (...);
[INFO] Table has been created.

Flink SQL> INSERT INTO RubberOrders SELECT product, amount FROM Orders WHERE product LIKE '%Rubber%';
[INFO] Submitting SQL update statement to the cluster...

3. CREATE TABLE syntax

CREATE TABLE [IF NOT EXISTS] [catalog_name.][db_name.]table_name
  (
    { <physical_column_definition> | <metadata_column_definition> | <computed_column_definition> }[ , ...n]
    [ <watermark_definition> ]
    [ <table_constraint> ][ , ...n]
  )
  [COMMENT table_comment]
  [PARTITIONED BY (partition_column_name1, partition_column_name2, ...)]
  WITH (key1=val1, key2=val2, ...)
  [ LIKE source_table [( <like_options> )] | AS select_query ]
   
<physical_column_definition>:
  column_name column_type [ <column_constraint> ] [COMMENT column_comment]
  
<column_constraint>:
  [CONSTRAINT constraint_name] PRIMARY KEY NOT ENFORCED

<table_constraint>:
  [CONSTRAINT constraint_name] PRIMARY KEY (column_name, ...) NOT ENFORCED

<metadata_column_definition>:
  column_name column_type METADATA [ FROM metadata_key ] [ VIRTUAL ]

<computed_column_definition>:
  column_name AS computed_column_expression [COMMENT column_comment]

<watermark_definition>:
  WATERMARK FOR rowtime_column_name AS watermark_strategy_expression

<source_table>:
  [catalog_name.][db_name.]table_name

<like_options>:
{
   { INCLUDING | EXCLUDING } { ALL | CONSTRAINTS | PARTITIONS }
 | { INCLUDING | EXCLUDING | OVERWRITING } { GENERATED | OPTIONS | WATERMARKS } 
}[, ...]

Creates a table with the specified name. If a table with the same name already exists in the catalog, the table cannot be registered.

1、Columns

1、Physical / Regular Columns

Physical columns are the regular columns known from databases. They define the name, type, and order of the fields in the physical data, so they represent the payload that is read from and written to the external system. Connectors and formats use these columns (in the order defined) to configure themselves. Other kinds of columns can be declared between physical columns without affecting the final physical schema.

The following statement creates a table containing only regular columns:

CREATE TABLE MyTable (
  `user_id` BIGINT,
  `name` STRING
) WITH (
  ...
);

2、Metadata Columns

Metadata columns are an extension of the SQL standard that allow access to connector- and/or format-specific fields for every row of a table. A metadata column is indicated by the METADATA keyword. For example, a metadata column can be used to read and write the timestamp of a Kafka record for time-based operations. The connector and format documentation lists the available metadata fields for each component. Declaring metadata columns in a table's schema is optional.

The following statement creates a table with an additional metadata column that references the metadata field timestamp:

CREATE TABLE MyTable (
  `user_id` BIGINT,
  `name` STRING,
  `record_time` TIMESTAMP_LTZ(3) METADATA FROM 'timestamp'    -- reads and writes a Kafka record's timestamp
) WITH (
  'connector' = 'kafka'
  ...
);

Each metadata field is identified by a string-based key and has the record's data type. For example, the Kafka connector exposes a metadata field with a key timestamp and data type TIMESTAMP_LTZ(3) that can be used to read and write records.

In the example above, the metadata column record_time becomes part of the table's schema and can be transformed and stored like a regular column:

INSERT INTO MyTable SELECT user_id, name, record_time + INTERVAL '1' SECOND FROM MyTable;

For convenience, the FROM clause can be omitted if the column name should be used as the metadata key:

CREATE TABLE MyTable (
  `user_id` BIGINT,
  `name` STRING,
  `timestamp` TIMESTAMP_LTZ(3) METADATA    -- use column name as metadata key
) WITH (
  'connector' = 'kafka'
  ...
);

As a convenience, the runtime will perform an explicit cast if the data type of the column differs from the data type of the metadata field. Of course, this requires that the two data types are compatible.

CREATE TABLE MyTable (
  `user_id` BIGINT,
  `name` STRING,
  `timestamp` BIGINT METADATA    -- cast the timestamp as BIGINT
) WITH (
  'connector' = 'kafka'
  ...
);

By default, the planner assumes that metadata columns are available for reading and writing. However, in many cases, external systems provide more read-only metadata fields than writable fields. Therefore, metadata columns can be excluded from persistence using the VIRTUAL keyword.

CREATE TABLE MyTable (
  `timestamp` BIGINT METADATA,       -- part of the query-to-sink schema
  `offset` BIGINT METADATA VIRTUAL,  -- not part of the query-to-sink schema
  `user_id` BIGINT,
  `name` STRING
) WITH (
  'connector' = 'kafka'
  ...
);

In the example above, offset is a read-only metadata column, excluded in the query-to-sink schema. Therefore, the source-to-query schema (for SELECT) and query-to-sink (for INSERT INTO) schemas are different:

  • source-to-query schema:
MyTable(`timestamp` BIGINT, `offset` BIGINT, `user_id` BIGINT, `name` STRING)
  • query-to-sink schema:
MyTable(`timestamp` BIGINT, `user_id` BIGINT, `name` STRING)

3、Computed Columns

A computed column is a virtual column generated with the syntax column_name AS computed_column_expression.

Computed columns evaluate expressions that can reference other columns declared in the same table, including physical columns and metadata columns. The columns themselves are not physically stored in the table. The data type of a computed column is derived automatically from the given expression and does not have to be declared manually.

The planner will convert computed columns to regular projections after the source. For optimization or watermarking policy pushdown, evaluation may be distributed among operators, executed multiple times, or skipped when not needed for a given query.

For example, a computed column can be defined as:

CREATE TABLE MyTable (
  `user_id` BIGINT,
  `price` DOUBLE,
  `quantity` DOUBLE,
  `cost` AS price * quantity  -- evaluate expression and supply the result to queries
) WITH (
  'connector' = 'kafka'
  ...
);

Expressions can contain any combination of columns, constants, or functions. The expression cannot contain subqueries.

Computed columns are commonly used in Flink to define temporal attributes in CREATE TABLE statements.

  • Processing-time attributes can be defined easily via proc AS PROCTIME(), using the system's PROCTIME() function.
  • Event-time attribute timestamps can be preprocessed before the WATERMARK declaration. For example, a computed column can be used if the original field is not of type TIMESTAMP(3) or is nested within a JSON string.
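For example, a minimal sketch combining both cases might look like the following (the table name, field names, and connector are illustrative and not taken from this article):

CREATE TABLE UserActions (
  `user_id` BIGINT,
  `ts_string` STRING,
  `proc_time` AS PROCTIME(),                    -- processing-time attribute from the system function
  `action_time` AS TO_TIMESTAMP(`ts_string`),   -- convert the raw STRING field to TIMESTAMP(3)
  WATERMARK FOR `action_time` AS `action_time` - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka'
  ...
);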

Similar to virtual metadata columns, computed columns are excluded from persistence, so they cannot be the target of an INSERT INTO statement. As a result, the source-to-query schema (for SELECT) and the query-to-sink schema (for INSERT INTO) differ:

  • source-to-query schema:
MyTable(`user_id` BIGINT, `price` DOUBLE, `quantity` DOUBLE, `cost` DOUBLE)
  • query-to-sink schema:
MyTable(`user_id` BIGINT, `price` DOUBLE, `quantity` DOUBLE)

2、WATERMARK

WATERMARK defines the event time attributes of the table in the form WATERMARK FOR rowtime_column_name AS watermark_strategy_expression.

rowtime_column_name defines an existing column as the event time attribute of the table. The column must be of type TIMESTAMP(3) and must be a top-level column in the schema. It may also be a computed column.

watermark_strategy_expression defines the watermark generation strategy. It allows any non-query expression, including computed columns, to be used to calculate the watermark; the return type of the expression must be TIMESTAMP(3), which represents the time elapsed since the Epoch. The framework evaluates the watermark expression for every record and periodically emits the largest watermark generated so far. The returned watermark is emitted only if it is non-null and its value is greater than the previously emitted local watermark (this guarantees that watermarks are increasing); if the current watermark is identical to the previous one, is null, or is smaller than the last emitted watermark, no new watermark is emitted. Watermarks are emitted at the interval configured by pipeline.auto-watermark-interval. If that interval is 0 ms, a watermark is generated for every record and emitted whenever it is non-null and greater than the last emitted watermark.

When using event time semantics, the table must contain event time attributes and watermark policies.

Flink provides three commonly used watermark strategies.

  • Strictly ascending timestamps: WATERMARK FOR rowtime_column AS rowtime_column.
    Emits a watermark of the maximum timestamp observed so far. Rows with a timestamp larger than the maximum timestamp are not late.

  • Ascending timestamps: WATERMARK FOR rowtime_column AS rowtime_column - INTERVAL '0.001' SECOND.
    Emits a watermark of the maximum timestamp observed so far minus 1 millisecond. Rows with a timestamp larger than or equal to the maximum timestamp are not late.

  • Bounded out-of-order timestamps: WATERMARK FOR rowtime_column AS rowtime_column - INTERVAL 'string' timeUnit.
    Emits a watermark of the maximum timestamp observed so far minus the specified delay, e.g., WATERMARK FOR rowtime_column AS rowtime_column - INTERVAL '5' SECOND is a watermark strategy with a 5-second delay.

CREATE TABLE Orders (
    `user` BIGINT,
    product STRING,
    order_time TIMESTAMP(3),
    WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) WITH ( ... );

3、PRIMARY KEY

The primary key serves as a hint for Flink's optimizer. A primary key constraint states that a column or a set of columns of a table or view is unique and does not contain null values. The columns declared in a primary key are all non-nullable, so the primary key can be used as a row-level unique identifier for the table.

The primary key can be declared together with a column definition, or declared separately as a table constraint. In either case, the primary key must not be declared more than once, otherwise Flink reports an error.

The SQL standard specifies two modes for a primary key constraint: ENFORCED or NOT ENFORCED. The mode declares whether the constraint is checked against the input/output data (i.e., whether uniqueness is validated). Flink does not own the data, so it only supports NOT ENFORCED mode, meaning no check is performed and users must ensure uniqueness themselves.

Flink assumes that the columns declaring the primary key contain no null values; the connector must ensure these semantics when processing data.

In a CREATE TABLE statement, declaring a primary key modifies the nullability of the columns: all columns declared in the primary key become non-nullable.
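As a sketch, a primary key can be declared inline on a column or, as below, as a separate table constraint (the table name and connector are illustrative):

CREATE TABLE Users (
  `user_id` BIGINT,
  `name` STRING,
  PRIMARY KEY (`user_id`) NOT ENFORCED   -- user_id becomes non-nullable; uniqueness is not checked by Flink
) WITH (
  'connector' = 'upsert-kafka'
  ...
);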

4、PARTITIONED BY

Partitions the created table by the specified columns. If the table is used with a filesystem sink, a directory is created for each partition.
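For example, a minimal sketch of a partitioned filesystem table (the column set and format are illustrative; the path is left as a placeholder):

CREATE TABLE PartitionedOrders (
  `user` BIGINT,
  product STRING,
  amount INT,
  dt STRING
)
PARTITIONED BY (dt)          -- one directory per value of dt when written by the filesystem sink
WITH (
  'connector' = 'filesystem',
  'path' = '...',
  'format' = 'csv'
);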

5、WITH Options

Table properties are used to create the table source/sink; they are generally used to find and create the underlying connector.

The keys and values of the key1=val1 expressions must be string literals. For connectors and formats, please refer to Flink (16) Flink's Table API and SQL: connecting to external systems (connectors and formats for reading and writing external systems).
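For example, a typical Kafka-backed table might be declared as in the following sketch (the topic, broker address, and format are placeholder values, not taken from this article):

CREATE TABLE KafkaOrders (
  `user` BIGINT,
  product STRING,
  amount INT
) WITH (
  'connector' = 'kafka',                              -- selects the underlying connector
  'topic' = 'orders',                                 -- connector-specific option
  'properties.bootstrap.servers' = 'localhost:9092',  -- connector-specific option
  'format' = 'json'                                   -- how the payload is (de)serialized
);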

The table name can take one of three formats:

  1. catalog_name.db_name.table_name: the table is registered in the metastore under the catalog named "catalog_name" and the database named "db_name".
  2. db_name.table_name: the table is registered in the current catalog of the executing table environment, under the database named "db_name".
  3. table_name: the table is registered in the catalog and database of the currently running environment.

Tables registered with a CREATE TABLE statement can be used as both a table source and a table sink; whether a table is actually used as a source or a sink cannot be determined until it is referenced in a DML statement.

6、LIKE

The LIKE clause is derived from a variant/combination of two SQL features (Feature T171, "LIKE syntax in table definitions", and Feature T173, "LIKE syntax extensions in table definitions"). It creates a new table based on the definition of an existing table and can extend or exclude certain parts of the original table. In contrast to the SQL standard, the LIKE clause must be defined at the top level of the CREATE statement, because it can be used to define other parts of the table, not just the schema.

With this clause, specified connector configuration properties can be reused (or overridden) or watermark definitions can be added to external tables, for example to tables defined in Apache Hive.

Examples are as follows:

CREATE TABLE Orders (
    `user` BIGINT,
    product STRING,
    order_time TIMESTAMP(3)
) WITH ( 
    'connector' = 'kafka',
    'scan.startup.mode' = 'earliest-offset'
);

CREATE TABLE Orders_with_watermark (
    -- add a watermark definition
    WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND 
) WITH (
    -- override the startup-mode property
    'scan.startup.mode' = 'latest-offset'
)
LIKE Orders;

The resulting table Orders_with_watermark is equivalent to the table created using the following statement:

CREATE TABLE Orders_with_watermark (
    `user` BIGINT,
    product STRING,
    order_time TIMESTAMP(3),
    WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND 
) WITH (
    'connector' = 'kafka',
    'scan.startup.mode' = 'latest-offset'
);

The merging logic of the table features can be controlled with LIKE options.

Merging can be controlled for the following table features:

  • CONSTRAINTS - primary key and unique key constraints
  • GENERATED - computed columns
  • OPTIONS - connector options, format options, and other configuration items
  • PARTITIONS - table partition information
  • WATERMARKS - watermark definitions

And there are three different table attribute merging strategies:

  • INCLUDING - the new table contains all of the given feature from the source table; the statement fails if duplicates exist, for example an option with the same key in both tables
  • EXCLUDING - the new table does not contain the given feature from the source table
  • OVERWRITING - the new table contains the given feature from the source table, but duplicates from the source table are overwritten by the new table's definition; for example, if an option with the same key exists in both tables, the value defined in the current statement is used

The INCLUDING/EXCLUDING ALL clause specifies the merge strategy to use for all features. For example, EXCLUDING ALL INCLUDING WATERMARKS means that only the watermark definitions of the source table are included in the new table.

Examples are as follows:

-- source table stored in the filesystem
CREATE TABLE Orders_in_file (
    `user` BIGINT,
    product STRING,
    order_time_string STRING,
    order_time AS to_timestamp(order_time_string)
)
PARTITIONED BY (`user`) 
WITH ( 
    'connector' = 'filesystem',
    'path' = '...'
);

-- the corresponding source table stored in Kafka
CREATE TABLE Orders_in_kafka (
    -- add a watermark definition
    WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND 
) WITH (
    'connector' = 'kafka',
    ...
)
LIKE Orders_in_file (
    -- exclude everything except the computed column needed to generate the watermark,
    -- dropping all partition and filesystem-related properties that do not apply to Kafka.
    EXCLUDING ALL
    INCLUDING GENERATED
);

If no like options are provided, the merge strategy of INCLUDING ALL OVERWRITING OPTIONS will be used by default.

The merge strategy for physical columns cannot be chosen; they are always merged as if the INCLUDING strategy were used.

The source table source_table can be a compound identifier, so a table in a different catalog or database can be used as the source table: for example, my_catalog.my_db.MyTable specifies that the source table MyTable comes from the catalog my_catalog and the database my_db, while my_db.MyTable specifies that the source table MyTable comes from the current catalog and the database my_db.

7、AS select_statement

Tables can also be created and populated with data from query results in a CTAS statement, which is a simple and fast way to create tables and insert data.

CTAS has two parts. The SELECT part can be any SELECT query supported by Flink SQL. The CREATE part takes the column information from the SELECT query and creates the target table. Like CREATE TABLE, CTAS requires that the mandatory table options of the target table be specified in its WITH clause.

The table creation operation of CTAS needs to rely on the target Catalog. For example, Hive Catalog automatically creates physical tables in Hive. However, the memory-based Catalog will only register the meta-information of the table in the memory of the Client executing SQL.

Examples are as follows:

CREATE TABLE my_ctas_table
WITH (
    'connector' = 'kafka',
    ...
)
AS SELECT id, name, age FROM source_table WHERE mod(id, 10) = 0;

The resulting table my_ctas_table is equivalent to using the following statements to create the table and write data:

CREATE TABLE my_ctas_table (
    id BIGINT,
    name STRING,
    age INT
) WITH (
    'connector' = 'kafka',
    ...
);
 
INSERT INTO my_ctas_table SELECT id, name, age FROM source_table WHERE mod(id, 10) = 0;

Note that CTAS has the following constraints:

  • Creating temporary tables is not supported yet.
  • Specifying column information is not supported yet.
  • Specifying Watermark is not supported yet.
  • Creating partition tables is not supported yet.
  • Primary key constraints are not supported yet.

Currently, the target table created by CTAS is non-atomic, and if an error occurs while inserting data into the table, the table will not be automatically deleted.

4. CREATE CATALOG

CREATE CATALOG catalog_name
  WITH (key1=val1, key2=val2, ...)

Creates a catalog with the given catalog properties. If a catalog with the same name already exists, an exception is thrown.

The catalog properties are used to store additional information related to the catalog. Both the key and the value in the key1=val1 expressions must be string literals.
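As a minimal sketch, an in-memory catalog can be created and switched to as follows (the catalog name is illustrative, and the generic in-memory catalog type is assumed here; other catalog types such as Hive require their own properties):

CREATE CATALOG my_catalog WITH (
  'type' = 'generic_in_memory'   -- the 'type' property selects the catalog implementation
);

USE CATALOG my_catalog;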

For Catalogs, please refer to Flink (24) Flink’s table api and sql Catalogs

5. CREATE DATABASE

CREATE DATABASE [IF NOT EXISTS] [catalog_name.]db_name
  [COMMENT database_comment]
  WITH (key1=val1, key2=val2, ...)

Creates a database with the given database properties. If a database with the same name already exists in the catalog, an exception is thrown.

  • 1、IF NOT EXISTS

If the database already exists, no operation will be performed.

  • 2、WITH OPTIONS

Database properties are generally used to store additional information about the database. Both the key and the value in the expression key1=val1 need to be string literal constants.
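A minimal sketch (the database name, comment, and property key/value are illustrative):

CREATE DATABASE IF NOT EXISTS my_db
  COMMENT 'database for order data'
  WITH ('owner' = 'data-team');   -- arbitrary key/value metadata stored with the database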

6. CREATE VIEW

CREATE [TEMPORARY] VIEW [IF NOT EXISTS] [catalog_name.][db_name.]view_name
  [{columnName [, columnName ]* }] [COMMENT view_comment]
  AS query_expression

Create a view based on the given query statement. If a view with the same name already exists in the database, an exception will be thrown.

  • 1、TEMPORARY

Creates a temporary view with a catalog and database namespace, overriding any existing view with the same name.

  • 2、IF NOT EXISTS

If the view already exists, no operation will be performed.
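As a sketch, reusing the Orders table from the examples above (the view name is illustrative):

CREATE TEMPORARY VIEW IF NOT EXISTS rubber_orders AS
SELECT product, amount
FROM Orders
WHERE product LIKE '%Rubber%';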

7. CREATE FUNCTION

CREATE [TEMPORARY|TEMPORARY SYSTEM] FUNCTION
  [IF NOT EXISTS] [[catalog_name.]db_name.]function_name
  AS identifier [LANGUAGE JAVA|SCALA|PYTHON]
  [USING JAR '<path_to_filename>.jar' [, JAR '<path_to_filename>.jar']* ]

Creates a catalog function with a catalog and database namespace. An identifier must be specified, and a language tag may optionally be specified. If a function with the same name is already registered in the catalog, it cannot be registered.

If the language tag is JAVA or SCALA, the identifier is the fully qualified name of the UDF implementation class. For the implementation of JAVA/SCALA UDF, please refer to Flink (25) Flink’s table api and sql functions .

  • TEMPORARY

Creates a temporary catalog function with a catalog and database namespace, overriding any existing catalog function with the same name.

  • TEMPORARY SYSTEM

Creates a temporary system function without a database namespace, overriding the system's built-in function with the same name.

  • IF NOT EXISTS

If the function already exists, no operation will be performed.

  • LANGUAGE JAVA|SCALA|PYTHON

The language tag specifies how the Flink runtime should execute the function. Currently only JAVA, SCALA, and PYTHON are supported; the default language is JAVA.

  • USING

Specify a list of jar resources containing the implementation of the function and its dependencies. The jar should be located in a local or remote file system currently supported by Flink, such as hdfs/s3/oss.

Note that currently only the JAVA and SCALA languages support the USING clause.
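A minimal sketch following the syntax above (the function name, implementation class, and jar path are hypothetical placeholders):

CREATE FUNCTION IF NOT EXISTS my_db.parse_order
  AS 'com.example.udf.ParseOrderFunction'       -- fully qualified name of the UDF class (placeholder)
  LANGUAGE JAVA
  USING JAR 'hdfs:///udfs/parse-order-udf.jar';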

Above, we introduced the DDL operations and examples in Flink’s table api and sql.
