Flink from entry to real fragrance (21: Table conversion to DataStream, and windows)

Flink provides two forms of API: the Table (and SQL) form and the DataStream form. You can choose whichever fits your actual situation, but during real development you often need to convert between the two. This post shows how.

A Table can be converted into a DataStream or DataSet, so that a custom stream or batch program can continue processing the result of a Table API or SQL query. When converting a table into a DataStream or DataSet, you need to specify the target data type, i.e. the type each row of the table is converted to. A table produced by a streaming query is updated dynamically, so there are two conversion modes: Append mode and Retract mode.
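
A rough sketch of both modes (assuming a Table named resultTable and a Scala StreamTableEnvironment named tableEnv, with org.apache.flink.table.api.scala._ and org.apache.flink.types.Row imported):

// Append mode: only valid if the table is changed by insert operations only
val appendStream: DataStream[Row] = tableEnv.toAppendStream[Row](resultTable)

// Retract mode: also works for updating queries; the Boolean flag distinguishes
// insert messages (true) from retract messages (false)
val retractStream: DataStream[(Boolean, Row)] = tableEnv.toRetractStream[Row](resultTable)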

View execution plan

The Table API provides a mechanism to explain the logic used to compute a table and to optimize the query plan.

To view the execution plan, call TableEnvironment.explain(table) or TableEnvironment.explain(); both return a string describing three plans:

Unoptimized logical query plan (abstract syntax tree)
Optimized logical query plan
Actual execution plan

val explanation: String = tableEnv.explain(resultTable)
println(explanation)

The difference between stream processing and relational algebra

(figure: comparison of relational algebra / SQL with stream processing)

Dynamic Tables

Dynamic tables are the core concept of Flink's Table API and SQL support for streaming data. Unlike the static tables that represent batch data, dynamic tables change over time.

Continuous Query (Continuous Query)

A dynamic table can be queried just like a static batch table. Querying a dynamic table produces a continuous query. A continuous query never terminates and produces another dynamic table as its result. The query continuously updates its dynamic result table to reflect the changes on its dynamic input table.

Dynamic table and continuous query conversion process
(figure: stream → dynamic table → continuous query → result dynamic table → stream)

1) The input stream is first converted into a dynamic table; records are only ever appended to this dynamic table.

2) A continuous query is evaluated on the dynamic table, producing a new dynamic table. The result of the previous query is kept as state, so the query does not have to start over from scratch for each new record, which improves efficiency.

3) The resulting dynamic table is converted back into a stream and emitted.

Convert stream to dynamic table

To process a stream with a relational query, it must first be converted into a table. Conceptually, each record of the stream is interpreted as an insert modification on the result table.

The first step is to read the access log; every record that arrives is inserted into the table.

(figure: each incoming click event is appended as a new row to the table)

An example of a continuous query: count how many times each user has clicked.

The continuous query performs its computation on the dynamic table and produces a new dynamic table as its result.

(figure: the result table of the user-click count query being updated as new clicks arrive)
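
Sketched in SQL (assuming the click stream has been registered as a table named clicks with a user field), the continuous query would look roughly like this:

val userClickCounts = tableEnv.sqlQuery(
  "SELECT `user`, COUNT(*) AS cnt FROM clicks GROUP BY `user`")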

The last step is to convert the dynamic table into a DataStream

Like regular database tables, dynamic tables can be changed by insert, update, and delete modifications. These changes need to be encoded when the dynamic table is converted into a stream or written to an external system:

1. Append-only stream
  • A dynamic table that is modified only by insert changes can be converted directly into an append-only stream.
2. Retract stream
  • A retract stream contains two kinds of messages: add messages and retract messages.
3. Upsert stream
  • An upsert stream also contains two kinds of messages: upsert messages and delete messages.

Convert dynamic table to DataStream

In retract mode, every new result is emitted as an insert (+) operation, and every withdrawn result as a delete (-) operation.
When Mary's first click arrives, an insert is emitted. When Mary's second click arrives, two operations are triggered, an insert and a delete: (Mary, 1) is deleted and (Mary, 2) is inserted.
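
A sketch of what that looks like in code (assuming the user-click count table above is held in a Table variable, here called clickCounts):

val retractStream = tableEnv.toRetractStream[Row](clickCounts)
retractStream.print()
// conceptually, two clicks by Mary produce:
//   (true,  Mary, 1)   insert  (Mary, 1)
//   (false, Mary, 1)   retract (Mary, 1)
//   (true,  Mary, 2)   insert  (Mary, 2)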


Time Attributes

For time-based operations (such as windows in the Table API and SQL), the relevant time semantics and the source of event timestamps need to be defined.
A Table can provide a logical time field that indicates time and gives access to the corresponding timestamp in table programs.
A time attribute can be part of the table schema. Once defined, it can be referenced as a field and used in time-based operations.
It behaves like a regular timestamp: it can be accessed and used in calculations.

Define processing time (Processing Time)

With processing-time semantics, the table program produces results based on the local time of the machine. It is the simplest notion of time: it requires neither extracting timestamps nor generating watermarks.

There are several ways to define it:

1. Specify it when converting a DataStream into a table (the simplest way)

During schema definition, append .proctime to a field name to define a processing-time field.
The proctime attribute can only extend the physical schema with an additional logical field, so it can only be defined at the end of the schema:

val sensorTables = tableEnv.fromDataStream(dataStream, 'id, 'temperature, 'timestamp, 'pt.proctime)

Add the following dependency to pom.xml:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-planner-blink_2.12</artifactId>
    <version>1.10.1</version>
</dependency>

Example:

package com.mafei.apitest.tabletest

import com.mafei.sinktest.SensorReadingTest5
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api._
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row

object TimeAndWindowTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // set parallelism to 1

    // use processing time as the time characteristic of the stream
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)

    val inputStream = env.readTextFile("/opt/java2020_study/maven/flink1/src/main/resources/sensor.txt")
    // first map each line to the case class type
    val dataStream = inputStream
      .map(data => {
        val arr = data.split(",") // split on "," to get the individual fields
        SensorReadingTest5(arr(0), arr(1).toLong, arr(2).toDouble) // build a sensor reading; toLong/toDouble because split yields strings
      })

    // environment settings (optional)
    val settings = EnvironmentSettings.newInstance()
      .useBlinkPlanner() // Flink 1.10 still defaults to the old planner; 1.11 made the Blink planner the default
      .inStreamingMode()
      .build()

    // create the Flink Table execution environment
    val tableEnv = StreamTableEnvironment.create(env, settings)

    // convert the stream into a table, appending a processing-time field pt
    val sensorTables = tableEnv.fromDataStream(dataStream, 'id, 'timestamp, 'temperature, 'pt.proctime)

    sensorTables.printSchema()

    sensorTables.toAppendStream[Row].print()

    env.execute()

  }
}

Code structure and operation effect:

The second way is to define the processing time when declaring the Table Schema of a connected external system:

val filePath = "/opt/java2020_study/maven/flink1/src/main/resources/sensor.txt"
tableEnv.connect(new FileSystem().path(filePath))
  .withFormat(new Csv()) // the txt file is comma-separated, just like csv
  .withSchema(new Schema() // the schema has to match the content of the txt file
    .field("id", DataTypes.STRING())
    .field("timestamp", DataTypes.BIGINT())
    .field("tem", DataTypes.DOUBLE())
    .field("pt", DataTypes.TIMESTAMP(3))
    .proctime() // note: this only works if the connected source/sink implements the time-attribute interfaces (DefinedProctimeAttribute / DefinedRowtimeAttributes), otherwise it throws an error; it does not work for files, but it does work for Kafka
  ).createTemporaryTable("inputTable")
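
A short usage sketch (names follow the snippet above): once registered, the table can be queried and the pt field used like any other column:

val procTable = tableEnv.sqlQuery("select id, tem, pt from inputTable")
procTable.toAppendStream[Row].print()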

The third way is to define the processing time in the DDL that creates the table; this requires the Blink planner:

val sinkDDL: String =
  """
    |create table dataTable(
    | id varchar(20) not null,
    | ts bigint,
    | temperature double,
    | pt AS PROCTIME()
    |) with (
    | 'connector.type' = 'filesystem',
    | 'connector.path' = '/sensor.txt',
    | 'format.type' = 'csv'
    |)
    |""".stripMargin
tableEnv.sqlUpdate(sinkDDL)

Define event time (Event Time)

With event-time semantics, the table program produces results based on the time contained in each record, rather than the local time of the machine doing the processing. This way, correct results can be obtained even when events arrive out of order or late.
To handle out-of-order events and distinguish on-time from late events in the stream, Flink needs to extract a timestamp from the event data and use it to advance event time.
There are three ways to define the event time.

The first is to specify it when converting a DataStream into a Table, using .rowtime to define the event-time attribute:


    // first map each line to the case class type
val dataStream = inputStream
  .map(data => {
    val arr = data.split(",") // split on "," to get the individual fields
    SensorReadingTest5(arr(0), arr(1).toLong, arr(2).toDouble) // build a sensor reading; toLong/toDouble because split yields strings
  }).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[SensorReadingTest5](Time.seconds(1000L)) {
  override def extractTimestamp(t: SensorReadingTest5): Long = t.timestamp * 1000L
}) // assign timestamps and watermarks

    // convert the stream into a table with a processing-time attribute (the approach shown above)
//    val sensorTables = tableEnv.fromDataStream(dataStream, 'id, 'timestamp, 'temperature, 'pt.proctime)

    // convert the DataStream into a Table and mark the existing timestamp field as the event-time attribute
    val sensorTables = tableEnv.fromDataStream(dataStream, 'id, 'timestamp.rowtime, 'temperature)
    // alternatively, append a new event-time field rt at the end
    // val sensorTables = tableEnv.fromDataStream(dataStream, 'id, 'temperature, 'timestamp, 'rt.rowtime)

The second way is to specify it when defining the Table Schema:

val filePath = "/opt/java2020_study/maven/flink1/src/main/resources/sensor.txt"
tableEnv.connect(new FileSystem().path(filePath))
  .withFormat(new Csv()) // the txt file is comma-separated, just like csv
  .withSchema(new Schema() // the schema has to match the content of the txt file
    .field("id", DataTypes.STRING())
    .field("timestamp", DataTypes.BIGINT())
    .rowtime(
      new Rowtime()
        .timestampsFromField("timestamp") // extract the timestamp from this data field
        .watermarksPeriodicBounded(2000) // watermarks with a 2-second delay
    )
    .field("tem", DataTypes.DOUBLE())
  ).createTemporaryTable("inputTable")

The third way is to define it in the DDL that creates the table:

// define the event time in the DDL that creates the table
  val sinkDDL: String =
    """
      |create table dataTable(
      | id varchar(20) not null,
      | ts bigint,
      | temperature double,
      | rt AS TO_TIMESTAMP( FROM_UNIXTIME(ts) ),
      | watermark for rt as rt - interval '1' second  -- generate the watermark as rt minus 1 second, i.e. allow 1 second of lateness
      |) with (
      | 'connector.type' = 'filesystem',
      | 'connector.path' = '/sensor.txt',
      | 'format.type' = 'csv'
      |)
      |""".stripMargin
  tableEnv.sqlUpdate(sinkDDL)

Flink windows

Time semantics only become truly useful in combination with window operations.
In the Table API and SQL there are two main kinds of windows:

Group Windows

First define what the window looks like, then group by the key, and finally apply the aggregation function.

Rows are aggregated into finite groups based on a time or row-count interval, and an aggregation function is applied to the data of each group.

Group Windows are defined with the window(w: GroupWindow) clause and must be given an alias with the as clause. To group the table by the window, the window alias must be referenced in the groupBy clause, just like a regular grouping field:

val table = input
  .window([w: GroupWindow] as 'w) // define the window and alias it as w
  .groupBy('w, 'a)                // group by field a and the window w
  .select('a, 'b.sum)             // aggregation

The Table API provides a set of predefined window classes with specific semantics; these are translated into the underlying DataStream or DataSet window operations.

Tumbling windows

Tumbling windows are defined with the Tumble class:

// Tumbling Event-time window
.window(Tumble over 10.minutes on 'rowtime as 'w)

// Tumbling Processing-time window
.window(Tumble over 10.minutes on 'proctime as 'w)

// Tumbling Row-count window
.window(Tumble over 10.rows on 'proctime as 'w)
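
A concrete sketch (assuming the sensor table from the examples above, held in sensorTables with an event-time attribute named rt): count the readings of each sensor in 10-second tumbling event-time windows.

val windowedTable = sensorTables
  .window(Tumble over 10.seconds on 'rt as 'tw) // 10-second tumbling window on the event-time attribute
  .groupBy('id, 'tw)                            // group by sensor id and window
  .select('id, 'id.count, 'tw.end)              // readings per window, plus the window end time

Since an event-time window aggregation only emits final results per window, the result table is append-only and can be converted with toAppendStream.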

Sliding windows

A sliding window of 10 minutes that slides every 5 minutes:

// Sliding Event-time window
.window(Slide over 10.minutes every 5.minutes on 'rowtime as 'w)

// Sliding Processing-time window
.window(Slide over 10.minutes every 5.minutes on 'proctime as 'w)

// Sliding Row-count window
.window(Slide over 10.rows every 5.rows on 'proctime as 'w)

Session windows

Session windows are defined with the Session class:

// Session Event-time window
.window(Session withGap 10.minutes on 'rowtime as 'w)

// Session Processing-time window
.window(Session withGap 10.minutes on 'proctime as 'w)

Group Windows in SQL

Group Windows are defined in the GROUP BY clause of a SQL query:

TUMBLE(time_attr, interval) defines a tumbling window; the first parameter is the time field and the second is the window length.

HOP(time_attr, interval, interval) defines a sliding window; the first parameter is the time field, the second is the slide step, and the third is the window length.

SESSION(time_attr, interval) defines a session window; the first parameter is the time field and the second is the session gap.
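
A sketch of the same tumbling-window aggregation in SQL (assuming a table registered as sensor with an event-time attribute rt):

val resultSqlTable = tableEnv.sqlQuery(
  """
    |SELECT id, COUNT(id) AS cnt, TUMBLE_END(rt, INTERVAL '10' SECOND) AS w_end
    |FROM sensor
    |GROUP BY id, TUMBLE(rt, INTERVAL '10' SECOND)
    |""".stripMargin)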

Over Windows

For each input row, an aggregate is computed over a range of adjacent rows.
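
A minimal sketch (again assuming sensorTables with an event-time attribute rt): for every row, average the temperature over the two preceding rows of the same sensor.

val overResult = sensorTables
  .window(Over partitionBy 'id orderBy 'rt preceding 2.rows as 'ow) // over window: current row plus the 2 preceding rows per sensor
  .select('id, 'rt, 'temperature.avg over 'ow)                      // average temperature over that range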


Original post: blog.51cto.com/mapengfei/2571928